By: Dattaraj Rao, Chief Data Scientist, Persistent Systems
As with any system that depends on data inputs, Machine Learning (ML) is subject to the axiom of “garbage in, garbage out.” Clean and accurately labeled data is the foundation for building any ML model. An ML training algorithm learns patterns from the ground-truth data and, from there, generalizes to unseen data. If the quality of your training data is low, it will be very difficult for the algorithm to learn those patterns and extrapolate from them.
Think about it in terms of training a pet dog. If you fail to train the dog in fundamental behavioral commands (the inputs), or you train it incorrectly or inaccurately, you cannot expect the dog to build on them through observation and develop more complex positive behaviors, because the underlying inputs were absent or flawed to begin with. Proper training is time-intensive, and costly if you bring in an expert, but the payoff is great if you do it right from the start.
When training an ML model, creating quality data requires a domain expert to spend time annotating the data. This may include selecting a window with the desired object in an image or assigning a label to a text entry or a database record. Particularly for unstructured data like images, videos, and text, annotation quality plays a major role in determining model quality. Usually, unlabeled data like raw images and text is abundant – but labeling is where effort needs to be optimized. This is the human-in-the-loop part of the ML lifecycle and usually is the most expensive and labor-intensive part of any ML project.
Tools for data annotation and preparation such as Prodigy, Amazon SageMaker Ground Truth, NVIDIA RAPIDS, and DataRobot human-in-the-loop are constantly improving in quality and providing intuitive interfaces for domain experts. However, minimizing the time domain experts need to annotate data is still a significant challenge for enterprises today, especially in an environment where data science talent is limited yet in high demand. This is where two newer approaches to data preparation come into play: active learning and weak supervision.
Active learning is a method where an ML model actively queries a domain expert for specific annotations. Here, the focus is not on annotating all of the unlabeled data, but on getting the right data points annotated so that the model can learn better. Take, for example, a diagnostic company in healthcare and life sciences that specializes in early cancer detection to help clinicians make informed, data-driven decisions about patient care. As part of its diagnostic process, it needs CT scan images annotated with the tumors highlighted.
With active learning, after the ML model learns from a few images with tumor blocks marked, it asks users to annotate only those images where it is unsure whether a tumor is present. These are the boundary points which, once annotated, increase the confidence of the model. Where the model's confidence is above a particular threshold, it annotates the image itself rather than asking the user. This is how active learning helps build accurate models while reducing the time and effort required to annotate data. Frameworks like modAL can help increase classification performance by intelligently querying domain experts to label the most informative instances.
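The triage step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not modAL's API: the function name triage_by_confidence, the scan identifiers, the probability scores, and the 0.9 threshold are all hypothetical, and the scores are assumed to come from a binary classifier that has already been trained on a few annotated images.

```python
def triage_by_confidence(probabilities, threshold=0.9):
    """Split unlabeled items into self-annotated and expert-query groups.

    probabilities: dict mapping an item id to the model's estimated
        probability that a tumor is present (assumed binary classifier).
    threshold: hypothetical confidence cutoff for self-annotation.
    """
    auto_labels = {}
    ask_expert = []
    for item_id, p in probabilities.items():
        # Confidence in whichever class the model considers more likely.
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            auto_labels[item_id] = int(p >= 0.5)  # model annotates on its own
        else:
            ask_expert.append(item_id)  # boundary point: query the expert
    # Put the most uncertain items (probability closest to 0.5) first.
    ask_expert.sort(key=lambda item_id: abs(probabilities[item_id] - 0.5))
    return auto_labels, ask_expert


# Made-up scores for four unlabeled scans.
scores = {"scan_a": 0.98, "scan_b": 0.55, "scan_c": 0.03, "scan_d": 0.70}
auto_labels, ask_expert = triage_by_confidence(scores, threshold=0.9)
print("Self-annotated:", auto_labels)
print("Send to expert:", ask_expert)
```

In a full loop, the expert's answers would be added to the training set, the model retrained, and the triage repeated. The threshold is the lever that trades expert time against the risk of the model confirming its own mistakes.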
Weak supervision is an approach where noisy, imprecise sources or high-level rules are used to provide indicative labels for a large amount of unlabeled data. This approach usually relies on several weak labelers and combines them, ensemble-style, to build quality annotated data. The aim is to incorporate domain knowledge into an automated labeling activity.
For example, if an Internet Service Provider (ISP) needed a system to flag emails as spam or not spam, we could write weak rules such as checking for phrases like “offer”, “congratulations”, and “free”, which are mostly associated with spam emails. Other rules could target emails whose source addresses follow specific patterns that can be matched with regular expressions. These weak labeling functions could then be combined by a weak supervision framework like Snorkel or skweak to build higher-quality training data.
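As a rough sketch of the spam example, the weak rules can be written as small labeling functions and combined by a simple majority vote. The vote is a deliberate simplification standing in for the more sophisticated combination a framework such as Snorkel or skweak performs, and none of the code below uses those libraries' APIs. The sender pattern, the example.com domain, and the sample emails are invented for illustration.

```python
import re
from collections import Counter

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Hypothetical pattern: letters followed by four or more digits before the "@".
SUSPICIOUS_SENDER = re.compile(r"^[a-z]+\d{4,}@")


def lf_spam_phrases(email):
    """Weak rule: phrases that are mostly associated with spam."""
    text = email["body"].lower()
    phrases = ("offer", "congratulations", "free")
    return SPAM if any(p in text for p in phrases) else ABSTAIN


def lf_sender_pattern(email):
    """Weak rule: sender address matches a suspicious pattern."""
    return SPAM if SUSPICIOUS_SENDER.search(email["sender"].lower()) else ABSTAIN


def lf_known_domain(email):
    """Weak rule: mail from a trusted (hypothetical) domain is likely fine."""
    return NOT_SPAM if email["sender"].lower().endswith("@example.com") else ABSTAIN


LABELING_FUNCTIONS = [lf_spam_phrases, lf_sender_pattern, lf_known_domain]


def weak_label(email):
    """Combine the weak rules by majority vote; abstain on ties or no votes."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly, so leave it unlabeled
    return ranked[0][0]


emails = [
    {"sender": "promo12345@mail.test", "body": "Congratulations! Claim your FREE offer now"},
    {"sender": "alice@example.com", "body": "Agenda for Monday's meeting"},
    {"sender": "bob@example.com", "body": "Free lunch on Friday"},
    {"sender": "carol@mail.test", "body": "See you tomorrow"},
]
labels = [weak_label(e) for e in emails]
print(labels)
```

Emails that end up as ABSTAIN are exactly the ones where the rules conflict or say nothing. They are natural candidates to route to a domain expert, which is where weak supervision and active learning complement each other.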
ML at its core is about helping companies scale processes in ways that are impossible to achieve manually. However, ML is not magic. It still relies on humans to a) set up and train the models properly from the start and b) intervene when needed to ensure the model does not drift so far that its results are no longer useful, or even become counterproductive.
The goal is to streamline and automate parts of the human involvement to shorten time-to-market and improve results while staying within the guardrails of acceptable accuracy. It is widely accepted that getting quality annotated data is the most expensive, yet extremely important, part of an ML project. This is an evolving space, and a lot of effort is underway to reduce the time spent by domain experts and improve the quality of data annotations. Exploring and leveraging active learning and weak supervision is a solid strategy for achieving this across multiple industries and use cases.