Alex Ratner is the CEO & Co-Founder of Snorkel AI, a company born out of the Stanford AI lab.
Snorkel AI makes AI development fast and practical by transforming manual AI development processes into programmatic solutions. Snorkel AI enables enterprises to develop AI that works for their unique workloads using their proprietary data and knowledge 10-100x faster.
What initially attracted you to computer science?
There are two very exciting aspects of computer science when you’re young. One, you get to learn as fast as you want from tinkering and building, given the instant feedback, rather than having to wait for a teacher. Two, you get to build a lot without having to ask anyone for permission!
I got into programming when I was a young kid for these reasons. I also loved the precision it required. I enjoyed the process of abstracting complex processes and routines, and then encoding them in a modular way.
Later, as an adult, I made my way back into computer science professionally via a job in consulting where I was tasked with writing scripts to do some basic analyses of the patent corpus. I was fascinated by how much human knowledge—anything anyone had ever deemed patentable—was readily available, yet so inaccessible because it was so hard to do even the simplest analysis over complex technical text and multi-modal data.
This is what led me back down the rabbit hole, and eventually back to grad school at Stanford, focusing on NLP, which is the area of using ML/AI on natural language.
You first started and led the Snorkel open-source project while at Stanford. Could you walk us through the journey of those early days?
Back then we were, like many in the industry, focused on developing new algorithms, i.e., all the “fancy” machine learning stuff that people in the community did research and published papers on.
However, we were always very committed to grounding this in real-world problems—mostly with doctors and scientists at Stanford. But every time we pitched a new model or algorithm, the response became “sure, we'd try that, but we'd need all this labeled training data we don't have time to create!”
We were seeing that the big unspoken problem was around the process of labeling and curating that training data—so we shifted all of our focus to that, which is how the Snorkel project and the idea of “data-centric AI” started.
Snorkel takes a data-centric AI approach. Could you define what this means and how it differs from model-centric AI development?
Data-centric AI means focusing on building better data to build better models.
This stands in contrast to—but works hand-in-hand with—model-centric AI. In model-centric AI, data scientists or researchers assume the data is static and pour their energy into adjusting model architectures and parameters to achieve better results.
Researchers still do great work in model-centric AI, but off-the-shelf models and AutoML techniques have improved so much that model choice has become commoditized at production time. When that’s the case, the best way to improve these models is to supply them with more and better data.
What are the core principles of a data-centric AI approach?
The core principle of data-centric AI is simple: better data builds better models.
In our academic work, we’ve called this “data programming.” The idea is that if you feed a robust enough model enough examples of inputs and expected outputs, the model learns how to duplicate those patterns.
This presents a bigger challenge than you might expect. The vast majority of data has no labels—or, at least, no useful labels for your application. Labeling that data by hand requires tedium, time, and human effort.
Having a labeled data set also does not guarantee quality. Human error creeps in everywhere. Each incorrect example in your ground truth will degrade the performance of the final model. No amount of parameter tuning can paper over that reality. Researchers have even found incorrectly-labeled records in foundational open source data sets.
Could you elaborate on what it means for Data-Centric AI to be programmatic?
Manually labeling data presents serious challenges. Doing so requires a lot of human hours, and sometimes those human hours can be expensive. Medical documents, for example, can only be labeled by doctors.
In addition, manual labeling sprints often amount to single-use projects. Labelers annotate the data according to a rigid schema. If a business’s needs shift and call for a different set of labels, labelers must start again from scratch.
Programmatic approaches to data-centric AI minimize both of these problems. Snorkel AI’s programmatic labeling system incorporates diverse signals—from legacy models to existing labels to external knowledge bases—to develop probabilistic labels at scale. Our primary source of signal comes from subject matter experts who collaborate with data scientists to build labeling functions. These encode their expert judgment into scalable rules, allowing the effort invested into one decision to impact dozens or hundreds of data points.
This framework is also flexible. Instead of starting from scratch when business needs change, users add, remove, and adjust labeling functions to apply new labels in hours instead of days.
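To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual API: the function names, the keyword rules, and the spam/ham task are all hypothetical, and the simple vote share is only a crude stand-in for the probabilistic labels described above.

```python
from collections import Counter

# Label values are assumptions for this sketch.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    """Hypothetical rule: messages containing a link lean spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_prize(text):
    """Hypothetical rule: prize or winner language leans spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN

def lf_short_reply(text):
    """Hypothetical rule: very short messages lean ham."""
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def label_distribution(text, lfs=LABELING_FUNCTIONS):
    """Combine non-abstaining votes into a vote-share distribution.

    Returns None when every labeling function abstains.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return None
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}

docs = [
    "You are a winner! Claim your prize at http://example.com",
    "Sounds good, thanks",
    "Please review the attached quarterly report before Friday's meeting",
]
for doc in docs:
    print(label_distribution(doc), "<-", doc)
```

Adapting to a new requirement in this sketch means editing the list passed as `lfs` (adding, removing, or adjusting a rule) and re-running over the same documents, rather than re-annotating each one by hand.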
How does this data-centric approach enable rapid scaling of unlabeled data?
Our programmatic approach to data-centric AI enables rapid scaling of unlabeled data by amplifying the impact of each choice. Once subject matter experts establish an initial, small set of ground truth, they begin collaborating with data scientists for rapid iteration. They define a few labeling functions, train a quick model, analyze the impact of their labeling functions, and then add, remove, or tweak labeling functions as needed.
Each cycle improves model performance until it meets or exceeds the project’s goals. This can reduce months of data labeling work to just hours. On one Snorkel research project, two of our researchers labeled 20,000 documents in a single day—a volume that could have taken manual labelers ten weeks or longer.
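One step of the iteration cycle described above, analyzing labeling functions against a small hand-labeled set, can be sketched as follows. This is an illustrative assumption rather than Snorkel's tooling: the rules, the tiny ground-truth set, the `analyze` and `prune` helpers, and the 0.75 accuracy threshold are all hypothetical, and the "train a quick model" step is left out for brevity.

```python
# Label values are assumptions for this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_refund(text):
    return NEGATIVE if "refund" in text.lower() else ABSTAIN

def lf_exclamation(text):
    # Deliberately noisy rule, included so the analysis has something to flag.
    return POSITIVE if text.endswith("!") else ABSTAIN

# Small hand-labeled ground truth (hypothetical examples).
dev_set = [
    ("Great product, works well", POSITIVE),
    ("I want a refund", NEGATIVE),
    ("Broke after one day!", NEGATIVE),
    ("Great value!", POSITIVE),
]

def analyze(lfs, examples):
    """Report coverage and accuracy of each labeling function."""
    report = {}
    for lf in lfs:
        fired = [(lf(text), gold) for text, gold in examples
                 if lf(text) != ABSTAIN]
        coverage = len(fired) / len(examples)
        accuracy = (sum(vote == gold for vote, gold in fired) / len(fired)
                    if fired else None)
        report[lf.__name__] = {"coverage": coverage, "accuracy": accuracy}
    return report

def prune(lfs, examples, min_accuracy=0.75):
    """Keep only labeling functions that fire and meet the accuracy bar."""
    report = analyze(lfs, examples)
    return [lf for lf in lfs
            if report[lf.__name__]["accuracy"] is not None
            and report[lf.__name__]["accuracy"] >= min_accuracy]

lfs = [lf_great, lf_refund, lf_exclamation]
for name, stats in analyze(lfs, dev_set).items():
    print(name, stats)
print("kept:", [lf.__name__ for lf in prune(lfs, dev_set)])
```

In practice a noisy rule would more likely be tightened than dropped outright; the pruning step here only illustrates the "add, remove, or tweak" decision that each cycle ends with.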
Snorkel offers multiple AI solutions including Snorkel Flow, Snorkel GenFlow and Snorkel Foundry. What are the differences between these offerings?
The Snorkel AI suite enables users to create labeling functions (e.g., looking for keywords or patterns in documents) to programmatically label millions of data points in minutes, rather than manually tagging one data point at a time.
It compresses the time required for companies to translate proprietary data into production-ready models and begin extracting value from them. Snorkel AI allows enterprises to scale human-in-the-loop approaches by efficiently incorporating human judgment and subject-matter expert knowledge.
This leads to more transparent and explainable AI, equipping enterprises to manage bias and deliver responsible outcomes.
Getting down to the nuts and bolts, Snorkel AI enables Fortune 500 enterprises to:
- Develop high-quality labeled data to train models or enhance RAG;
- Customize LLMs with fine-tuning;
- Distill LLMs into specialized models that are much smaller and cheaper to operate;
- Build domain- and task-specific LLMs with pre-training.
You’ve written some groundbreaking papers. In your opinion, which is your most important paper?
The key papers were the original one on data programming (labeling training data programmatically) and the one on Snorkel.
What is your vision for the future of Snorkel?
I see Snorkel becoming a trusted partner for all large enterprises that are serious about AI.
Snorkel Flow should become a ubiquitous tool for data science teams at large enterprises—whether they’re fine-tuning custom large language models for their organizations, building image classification models, or building simple, deployable logistic regression models.
Regardless of what kind of models a business needs, it will need high-quality labeled data to train them.
Thank you for the great interview. Readers who wish to learn more should visit Snorkel AI.