By Ingo Mierswa, Founder, President & Chief Data Scientist at RapidMiner.
Data science has made great progress in the last couple of years, and many organizations are using advanced analytics or machine learning models to gain deeper insights into their processes and, in some cases, even to predict likely future outcomes. As with other “sciences,” it is often not clear at the outset whether a project will be successful, and there have been reports that as many as 87% of data science projects never make it into production. While a 100% success rate cannot be expected, some patterns in data science projects lead to lower success rates than should be deemed acceptable in the field. These problematic patterns seem to exist independently of any particular industry or use case, which suggests that there is a universal problem in data science that must be addressed.
Measuring the success of machine learning
Data scientists who create machine learning (ML) models rely on well-defined mathematical criteria to measure how well such models perform. Which of those criteria is applied mainly depends on the type of model. Let’s assume a model should predict classes or categories for new situations — for example, if a customer is going to churn or not. In situations like these, data scientists would use measurements such as accuracy (how often the model is correct) or precision (how often customers are actually churning if we predict churn).
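Both measurements can be computed directly from predicted and actual labels. The sketch below uses made-up churn data (1 = churn, 0 = no churn) purely for illustration:

```python
# Hypothetical churn example: compare predicted vs. actual labels
# (1 = churn, 0 = no churn). All data here is invented for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Accuracy: the fraction of all predictions that are correct
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Precision: of the customers we predicted to churn,
# how many actually churned?
predicted_churn = [a for a, p in zip(actual, predicted) if p == 1]
precision = sum(predicted_churn) / len(predicted_churn)

print(f"accuracy:  {accuracy:.2f}")   # 8 of 10 predictions correct -> 0.80
print(f"precision: {precision:.2f}")  # 3 of 4 churn predictions correct -> 0.75
```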
Data scientists need objective criteria like this because part of their job is to optimize those evaluative criteria to produce the best model. In fact, next to preparing the data to be ready for modeling, the building and tuning of those models is where data scientists spend most of their time.
The downside of this is that data scientists do not focus much on actually putting those models into production, which is a problem for more than one reason. First and foremost, a model that never makes it into production cannot generate business impact for the organization deploying it. Secondly, because these organizations have spent time and money developing, training, and operationalizing models that failed to produce results when run against “real world” data, they are more likely than not to deem ML and other data science tools useless to their organization and decline to move forward with future data science initiatives.
The truth is that data scientists simply enjoy tweaking models and spend a lot of time doing so. But without business impact, this time is not spent wisely, which is particularly painful given how scarce a resource data scientists are in today’s world.
The Netflix prize and production failure
We’ve seen this phenomenon of overinvesting in model building and underinvesting in the operationalization of models play out in recent years. The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for movies. If you give a movie a high rating, you likely enjoyed it – so by predicting ratings, Netflix can recommend titles you are likely to enjoy, and if you enjoy the recommended content, you will likely stay a Netflix customer longer. The grand prize of 1M USD went to the team that was able to improve Netflix’s own algorithm by at least 10%.
The challenge started in 2006, and over the following three years, the contributions of over 40,000 data science teams around the globe led to an impressive improvement of more than 10% in recommendation quality. However, the winning team’s models were never operationalized. Netflix said that “the increase in accuracy did not seem to justify the effort needed to bring those models into production.”
Why optimal is not always optimal
Model accuracy and other data science criteria have long been used as the metric for measuring a model’s success before putting the model in question into production. As we have seen, many models never even make it to this stage – which is a waste of resources, both in terms of energy as well as time spent.
But there are more problems with this culture of overinvestment in model tweaking. The first is an inadvertent overfitting to the test data, which results in models that look good to the data scientist who built them but underperform once in production – sometimes even causing harm. This happens for two reasons:
- There is a well-known discrepancy between the error measured on a test set and the error you will see in production
- Business impact and data science performance criteria are often correlated, but “optimal” models do not always deliver the biggest impact
The first point above is also called “overfitting to the test set.” It’s a well-known phenomenon, especially among participants in data science contests like those on Kaggle. In these competitions, you can see a stronger version of this phenomenon in the gap between the public and the private leaderboards. In fact, by submitting enough guesses and following the leaderboard feedback, a participant could top the public leaderboard of a Kaggle competition without ever reading the data. Similarly, the winner of the private leaderboard and the overall competition may not have produced a model that can maintain its performance on any dataset other than the one it was evaluated on.
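A minimal simulation makes the point. If the public leaderboard is scored on a fixed subset of the hidden labels, then submitting many random guesses and keeping the one with the best public score produces a “winner” whose private score is no better than chance. All numbers here are invented for illustration:

```python
import random

random.seed(42)

# Hypothetical binary-classification contest: 200 hidden test labels,
# split into a "public" half (scored on every submission) and a
# "private" half (revealed only at the end).
labels = [random.randint(0, 1) for _ in range(200)]
public, private = labels[:100], labels[100:]

def score(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# Submit 1,000 completely random "models" and keep the one with the
# best PUBLIC score -- without ever looking at the data.
best_public, best_preds = 0.0, None
for _ in range(1000):
    preds = [random.randint(0, 1) for _ in range(200)]
    s = score(preds[:100], public)
    if s > best_public:
        best_public, best_preds = s, preds

private_score = score(best_preds[100:], private)
print(f"public score of the 'winner':  {best_public:.2f}")   # well above 0.50
print(f"private score of the 'winner': {private_score:.2f}") # typically near 0.50
```

The inflated public score is pure selection bias: with enough submissions, one of them will fit the public split by luck alone.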
Accuracy does not equal business impact
For too long we have accepted this practice, which leads to the slow adaptation of models to test data sets. As a result, what looks like the best model turns out to be mediocre at best:
- Measurements like predictive accuracy often do not equal business impact
- A 1% improvement in accuracy does not translate into a 1% better business outcome
- There are cases in which a lower-performing model outperforms others with regard to business impact
- Other factors such as maintenance, scoring speed, or robustness against changes over time (called “resilience”) must be taken into account, too.
This last point is particularly important. The best models will not just win competitions or look good in the data science lab but will hold up in production and perform well on a variety of test sets. These models are what we refer to as resilient models.
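A toy cost-benefit calculation, with all figures invented, shows how a less accurate model can deliver far more value. Assume a retention offer costs $50 and retaining a customer who would otherwise churn is worth $500:

```python
# Confusion-matrix counts for two hypothetical churn models scored on
# 1,000 customers, 100 of whom actually churn. All figures are made up.
# Order: (tp, fp, fn, tn)
model_a = (15, 5, 85, 895)    # conservative: rarely predicts churn
model_b = (80, 120, 20, 780)  # aggressive: catches most churners

OFFER_COST = 50    # cost of sending one retention offer
SAVED_VALUE = 500  # value of retaining one true churner

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def profit(tp, fp, fn, tn):
    # Offers go to everyone predicted to churn (tp + fp); value accrues
    # only for customers who really would have churned (tp).
    return tp * SAVED_VALUE - (tp + fp) * OFFER_COST

print("A:", accuracy(*model_a), profit(*model_a))  # 0.91 accuracy, $6,500
print("B:", accuracy(*model_b), profit(*model_b))  # 0.86 accuracy, $30,000
```

Model A “wins” on accuracy, while model B generates several times the business impact, because accuracy weighs all errors equally while the business does not.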
Drift and the importance of resilience
All models deteriorate over time. The only question is how fast this happens and how well the model still performs under the changed circumstances. The reason for this deterioration is that the world is not static, so the data to which the model is applied also changes over time. If these changes happen slowly, we call this “concept drift.” If the changes happen abruptly, we call this “concept shift.” For example, customers may change their consumption behavior slowly over time, influenced by trends and/or marketing, until at some point propensity models no longer work. These changes can be drastically accelerated in certain situations. COVID-19, for example, has driven the sale of products like toilet paper and disinfectants — an unexpected sharp increase in demand which can throw such a model completely off course.
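One simple way to watch for such changes in production is to compare newly scored data against a training-time baseline. This sketch, with invented purchase figures, reports how many baseline standard deviations the recent mean has moved:

```python
import statistics

def drift_score(baseline, recent):
    """Standardized shift of the recent mean relative to the baseline."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) / sigma

# Baseline: weekly purchases per customer before any change (made up).
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9, 1.0, 1.0]
# Slow drift vs. an abrupt shift in newly scored data (also made up).
slow_change  = [1.1, 1.2, 1.0, 1.3, 1.1, 1.2, 1.0, 1.1, 1.2, 1.1]
sharp_change = [3.0, 4.1, 3.5, 5.0, 4.2, 3.8, 4.6, 3.9, 4.4, 4.0]

print(f"slow drift:  {drift_score(baseline, slow_change):.1f}")
print(f"sharp shift: {drift_score(baseline, sharp_change):.1f}")  # far above any alert threshold
```

A real monitoring setup would track many features and use proper statistical tests, but even this crude check separates gradual drift from a COVID-style shock.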
A resilient model may not be the best model based on measures like accuracy or precision but will perform well on a wider range of data sets. For this reason, it will also perform better over a longer period of time and is therefore better able to deliver sustained business impact.
Linear models and other simple model types are often more resilient because it is harder to overfit them to a specific test set or moment in time. More powerful models can and should be used as “challengers” to such a simpler champion model, allowing data scientists to see whether the additional complexity holds up over time. But this should happen at the end, not the beginning, of the modeling journey.
While a formal KPI for measuring resilience has not yet been introduced into the field of data science, there are several ways in which data scientists can evaluate how resilient their models are:
- Smaller standard deviations in a cross-validation run mean that the model performance depended less on the specifics of the different test sets
- Even if data scientists are not performing full cross-validations, they may use two different data sets for testing and validation. A smaller discrepancy between the error rates on the test and validation data sets indicates higher resilience
- If the model is properly monitored in production, error rates can be seen over time. The consistency of error rates over time is a good sign for model resilience.
- If the model monitoring solution of choice accounts for drift, data scientists should also pay attention to how strongly the model’s performance is affected by that input drift.
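The first two checks above can be sketched in a few lines. Given per-fold accuracies from a cross-validation run (numbers invented here), a smaller standard deviation around a similar mean points to the more resilient model:

```python
import statistics

# Hypothetical accuracies from a 10-fold cross-validation of two models.
# Model A has the higher mean; model B has the much smaller spread.
folds_a = [0.91, 0.84, 0.95, 0.78, 0.93, 0.81, 0.96, 0.79, 0.92, 0.83]
folds_b = [0.86, 0.85, 0.87, 0.84, 0.86, 0.85, 0.87, 0.84, 0.86, 0.85]

for name, folds in [("A", folds_a), ("B", folds_b)]:
    mean = statistics.mean(folds)
    spread = statistics.stdev(folds)
    print(f"model {name}: mean={mean:.3f} stdev={spread:.3f}")
```

Model A would win a leaderboard, but its performance swings heavily with the choice of test set; model B is the safer bet for sustained performance in production.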
Changing the culture of data science
Even after a model has been deployed, there are still threats to its accuracy. The last two points above regarding model resilience already require proper monitoring of models in production. As a starting point for a change of culture in data science, companies are well-advised to invest in proper model monitoring and to hold data scientists accountable for how models perform after they are put into production. This will immediately change the culture from a model-building culture to a value-creating-and-sustaining culture for the field of data science.
As recent world events have shown us, the world changes quickly. Now, more than ever, we need to build resilient models — not just accurate ones — to capture meaningful business impact over time. Kaggle, for example, is hosting a challenge to galvanize data scientists around the world to help build model solutions to use in the global fight against COVID-19. I anticipate that the most successful models produced as a result of this challenge will be the most resilient, not the most accurate, as we’ve seen how rapidly COVID-19 data can change in a single day.
Data science should be about finding the truth, not producing the “best” model. By holding ourselves to a higher standard of resilience over accuracy, data scientists will be able to deliver more business impact for our organizations and help to positively shape the future.