A team of researchers from Google recently identified a common cause of AI model failures, pointing to underspecification as one of the primary reasons that machine learning models often perform quite differently in the real world than they do during testing and development.
Machine learning models often fail when tackling tasks in a real-world setting, even if they perform well in the lab. There are many reasons for this mismatch between training and development performance and real-world performance. One of the most common is a phenomenon known as data shift. Data shift refers to a fundamental difference between the type of data used to develop a machine learning model and the data fed into the model during application. As an example, computer vision models trained on high-quality image data will struggle when fed images captured by the low-quality cameras found in the model’s day-to-day environment.
According to MIT Technology Review, a team of 40 researchers at Google has identified another reason that the performance of a machine learning model can vary so drastically. The problem is “underspecification”, a statistical concept that describes situations where observed phenomena have many possible causes, not all of which are accounted for by the model. According to Alex D’Amour, who led the study, the problem shows up in many machine learning models; he said the phenomenon “happens all over the place”.
The typical method of training a machine learning model involves feeding the model a large amount of data that it can analyze and extract relevant patterns from. Afterwards, the model is fed examples it hasn't seen and asked to predict the nature of those examples based on the features that it has learned. Once the model has achieved a certain level of accuracy, the training is usually considered complete.
According to the Google research team, more needs to be done to ensure the models can truly generalize to non-training data. The classic method of training machine learning models will produce various models that may all pass their tests, yet these models will differ in small ways that seem insignificant but aren't. Different nodes in the models will have different random values assigned to them, or the training data could be selected or represented in different ways. These variations are small and often arbitrary, and if they don’t have a huge impact on how the models perform during training, they are easy to overlook. However, when the impact of all these small changes accumulates, they can lead to major variations in real-world performance.
This underspecification is problematic because it means that, even if the training process is capable of producing good models, it can also produce a poor one, and the difference wouldn’t be discovered until the model left development and entered real-world use.
To assess the impact of underspecification, the research team examined a number of different models. Every model was trained using the same training process, and the models were then subjected to a series of tests designed to highlight differences in performance. In one instance, 50 different versions of an image recognition system were trained on the ImageNet dataset. The models were identical except for the random values assigned to the neural network at the start of training. The stress tests used to expose differences between the models were conducted using ImageNet-C, a variation on the original dataset consisting of images altered through changes such as contrast or brightness adjustment. The models were also tested on ObjectNet, a collection of images featuring everyday objects in unusual orientations and contexts. Even though all 50 models had approximately the same performance on the training dataset, performance fluctuated widely when the models were run through the stress tests.
The research team found similar results occurred when they trained and stress-tested two different NLP systems, as well as when they tested various other computer vision models. In each case, the models diverged wildly from each other even though the training process for all of the models was the same.
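The effect described above can be reproduced at toy scale. The sketch below is an illustration only, not the Google team's experimental setup: it assumes a tiny two-feature linear classifier trained with a perceptron rule, and every name in it (make_data, train, run_experiment) is made up for this example. In the training data the second feature is an exact copy of the first, so the data cannot tell the model which feature to rely on, and the random starting weights end up deciding. In the stress data the second feature is replaced by unrelated noise.

```python
import random

def make_data(n, rng, shifted=False):
    """Return ((x1, x2), label) pairs; the label is 1 when x1 is positive."""
    data = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        # In-distribution: x2 duplicates x1. Shifted: x2 is unrelated noise.
        x2 = rng.uniform(-1.0, 1.0) if shifted else x1
        data.append(((x1, x2), 1 if x1 > 0 else 0))
    return data

def predict(w, x):
    """Linear classifier without a bias term."""
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0

def train(train_set, seed, epochs=5, lr=0.1):
    """Perceptron training; the seed only affects the initial weights."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    for _ in range(epochs):
        for x, y in train_set:
            err = y - predict(w, x)
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
    return w

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

def run_experiment(n_models=50):
    """Train n_models from different seeds on the same data.

    Returns a list of (standard_test_accuracy, stress_test_accuracy).
    """
    data_rng = random.Random(0)
    train_set = make_data(200, data_rng)
    test_set = make_data(1000, data_rng)                  # same distribution
    stress_set = make_data(1000, data_rng, shifted=True)  # x2 decoupled
    results = []
    for seed in range(n_models):
        w = train(train_set, seed)
        results.append((accuracy(w, test_set), accuracy(w, stress_set)))
    return results

if __name__ == "__main__":
    results = run_experiment()
    standard = [r[0] for r in results]
    stress = [r[1] for r in results]
    print(f"standard test: min={min(standard):.2f} max={max(standard):.2f}")
    print(f"stress test:   min={min(stress):.2f} max={max(stress):.2f}")
```

Because both features receive identical updates during training, the gap between the two weights never moves from its random starting value. Every model therefore looks equally good on the standard test, while the stress test separates the models that happened to lean on the first feature from those that leaned on the copy.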
According to D’Amour, machine learning researchers and engineers need to be doing a lot more stress testing before releasing models into the wild. This can be hard to do, given that stress tests need to be tailored to specific tasks using data from the real world, data which can be hard to come by for certain tasks and contexts. One potential solution to the problem of underspecification is to produce many models at one time and then test the models on a series of real-world tasks, picking the model that consistently shows the best results. Developing models this way takes a lot of time and resources, but the trade-off could be worth it, especially for AI models used in medical contexts or other areas where safety is a prime concern. As D’Amour explained via MIT Technology Review:
“We need to get better at specifying exactly what our requirements are for our models. Because often what ends up happening is that we discover these requirements only after the model has failed out in the world.”
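The selection strategy mentioned earlier (train many candidate models, then keep the one that holds up across a battery of stress tests) might be sketched as follows. This is one possible reading of "consistently shows the best results", assuming the rule is to prefer the model whose worst stress-test score is highest; averaging the scores would be another reasonable choice. The function name, model names, test names, and scores are all made-up placeholders, not results from the study.

```python
def pick_most_robust(scores_by_model):
    """Return the name of the model whose worst stress-test score is highest.

    scores_by_model maps a model name to a dict of {stress_test_name: score}.
    """
    return max(
        scores_by_model,
        key=lambda name: min(scores_by_model[name].values()),
    )

# Placeholder scores for illustration only; these are not measured results.
candidates = {
    "seed_0": {"blur": 0.71, "low_light": 0.64, "odd_angle": 0.58},
    "seed_1": {"blur": 0.69, "low_light": 0.70, "odd_angle": 0.66},
    "seed_2": {"blur": 0.75, "low_light": 0.52, "odd_angle": 0.61},
}

best = pick_most_robust(candidates)
print(best)  # seed_1: its weakest score (0.66) beats the others' weakest
```

Ranking by the worst score rather than the best one reflects the concern raised in the article: a model that shines on one stress test but collapses on another is exactly the kind of failure that standard testing misses.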