Engy Ziedan, PhD, Chief Scientific Officer and Co-Founder of Protege – Interview Series

Engy Ziedan, PhD, Chief Scientific Officer and Co-Founder of Protege, is an applied microeconomist whose work sits at the intersection of learning science, behavioral economics, and large-scale data analytics, bringing academic rigor to the rapidly evolving AI data layer. Her background spans roles as an Assistant Professor at Indiana University and previously at Tulane University, and her research has focused on health policy, incentives, and real-world outcomes using complex datasets. At Protege, she applies causal inference and econometric methods to ensure that training data systems are measurable, reproducible, and scientifically validated. She also leads DataLab, the company’s research arm, where she oversees interdisciplinary teams of economists, machine learning researchers, and domain experts working to improve how AI datasets are designed, evaluated, and deployed, treating data as a core driver of model performance and reliability.
Protege is an AI data platform focused on unlocking high-quality, real-world datasets at scale to address one of the biggest bottlenecks in modern AI development: data quality. Through its DataLab initiative, the company is building a research-driven framework for dataset creation, evaluation, and benchmarking, helping AI systems perform more reliably in real-world environments. The platform works across industries such as healthcare, media, and scientific research, producing structured datasets and benchmarks that reflect real-world complexity rather than synthetic approximations. By combining scientific methodology with commercial applications, Protege aims to elevate data to the same level of importance as models and compute, positioning itself as critical infrastructure for the next generation of AI systems.
Your academic work spans health economics, causal inference, and large real-world datasets, and you have now helped build a company focused on the data layer powering AI. What experiences in your research and career led you to help create Protege, and how did those insights shape the company’s vision and its ability to secure early funding?
My academic training as an economist was the foundation for everything that followed. What I am trained in and what I teach are core econometric techniques. The core of what economists are trained to do is understand bias, classical and non-classical measurement error, and the downstream consequences of both, which turned out to be exactly what the AI data space was missing. That grounding is not specific to healthcare or even to data science in the traditional sense. It is about understanding what happens to a model when the inputs feeding it are systematically off. What the AI research field now calls algorithmic bias is, at its core, the same problem economists have been wrestling with for decades: a biased regression. When you bring someone into data curation who has been trained to think that way, the data they produce carries that rigor by default.
As for the company vision, I want to be authentic here about how it actually started. When you are three people starting out, there is no document with a grand vision. There is just doing the thing. The real signal was that what we were producing was resonating. So we just did more of that.
Protege recently introduced DataLab as a new research institution focused on advancing the science of AI data. What specific challenges in today’s AI ecosystem convinced you that datasets and evaluation needed a dedicated research effort?
The problem DataLab was built to solve is one economists have a name for: the market for lemons. Economist George Akerlof’s “Market for Lemons” problem describes a used car market where buyers cannot tell the good cars from the bad “lemon” cars before purchase, so they end up paying the average price. When that happens, sellers of genuinely good cars have no incentive to participate because the market does not reward them appropriately, and quality spirals downward over time. That is precisely what has been happening in the data market across certain sectors of AI, where it was hard to tell good training data from bad data.
Data quality is extraordinarily difficult to assess before you actually activate it. You have to have deep domain knowledge, significant time, and even then, you can be tricked. So, for model builders, that asymmetric information problem slows down the entire pipeline. It makes procurement painful, it undervalues the people producing genuinely good data, and it erodes trust in the market overall. Benchmarks often fail to capture the complexity of real use cases, where static responses do not reflect longitudinal, multimodal decision making.
DataLab was created to be the mechanism that restores market trust in the true value of the data before someone acquires it, by understanding its domain, its context, and its flaws, and by closing that loop in a rigorous, repeatable way. That is not a procurement function. It is a scientific challenge at the core, grounded in quality, representation, contamination control, and safety. That’s why we believe data needs to have its own dedicated research effort.
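The unraveling described in the "Market for Lemons" answer above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a model Protege or DataLab uses. It assumes buyers pay the average quality of whatever is still on offer and that a seller withdraws as soon as that price falls below the quality of their own dataset. The quality scores and the function name `simulate_lemons_market` are hypothetical.

```python
def simulate_lemons_market(qualities, rounds=10):
    """Iterate a simplified Akerlof-style unraveling.

    Assumption: buyers cannot observe quality, so they pay the average
    quality of what is still for sale; any seller whose quality exceeds
    that price leaves the market.
    """
    remaining = sorted(qualities)
    history = []
    for _ in range(rounds):
        if not remaining:
            break
        price = sum(remaining) / len(remaining)  # buyers pay the average
        history.append((len(remaining), price))
        stayers = [q for q in remaining if q <= price]
        if len(stayers) == len(remaining):
            break  # no seller leaves, so the market has settled
        remaining = stayers
    return history


# Hypothetical quality scores (0 to 100) for eleven datasets.
qualities = list(range(0, 101, 10))
history = simulate_lemons_market(qualities)
for sellers, price in history:
    print(f"sellers={sellers:2d}  price={price:5.1f}")
```

In this toy setup each round drives out the best remaining sellers until only the lowest-quality dataset is left. A credible pre-purchase quality signal breaks the loop because buyers no longer have to price at the average.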
For years, the industry conversation has centered on models and compute. Why do you believe the next phase of progress in AI will depend more on the quality, structure, and evaluation of data?
You can think of compute as a function of model size multiplied by data. Data is a core component. So scaling compute on terrible data is not progress; it is a waste.
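The relationship described above is often written as a rule of thumb in the scaling literature. The form below is a common approximation for training dense transformer models, not a formula given in the interview. The symbols $C$ (training compute in FLOPs), $N$ (parameter count), and $D$ (number of training tokens) are introduced here for illustration.

```latex
C \approx 6 \, N \, D
```

Under this approximation, doubling $D$ doubles the compute spent whether or not the additional tokens carry useful signal, which is the sense in which scaling compute on poor data is wasted effort.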
There is an ongoing debate in the field about whether model size or data quality contributes more to gains in intelligence. In any market segment, the first datasets to be collected and used are always the easiest data to find. That is just how markets work. The datasets that would move the needle further are harder to surface, harder to structure, and harder to evaluate. Not including these datasets has been a limiting factor.
Healthcare is a clear example. The models we have today perform at roughly the level of a medical resident, and that is impressive. But they are not performing at the level of a chief attending yet. That’s because what a senior clinician knows comes from years of accumulated experience that is extraordinarily difficult to capture in the low-hanging fruit data that has been easy to find and label. That gap is not a model architecture problem – it is a data problem.
DataLab is already collaborating with several frontier AI companies. From your discussions with these labs, what are the most common weaknesses you see in how training and evaluation datasets are currently designed?
The most honest answer is that it is very time-consuming to evaluate the data. I’m pretty sure that if you’re a researcher who trains a model on a dataset and you have not actually sat down and read the data, the same way you would read a newspaper, you are probably making a serious mistake. And to be fair, most researchers do make that effort. The problem is that doing that well, at scale, is genuinely hard.
Consider what a thorough evaluation actually requires. You need to assess whether the data is unbiased, whether it has been censored in ways that are not obvious, and whether there are toxic or otherwise problematic elements embedded in it. To do any of that credibly, you need real domain knowledge. You need to understand where the data comes from, what it looks like in the real world, how it was collected, and by whom. By the time you have assembled all of those components and worked through them carefully, three to four weeks have passed. And then you have to do it again for the next dataset.
That friction compounds across an organization. It slows down training pipelines, it creates pressure to cut corners on evaluation, and it means that the weaknesses in a dataset often only become visible after a model has already been built on top of them. The challenge is not that people do not care about data quality. It is that the infrastructure and tooling to evaluate it rigorously, quickly, and repeatedly simply have not existed.
You often describe the need to treat data as a scientific discipline. What changes when organizations begin approaching dataset design and evaluation with the same rigor applied to other scientific fields?
When organizations start treating data with the same rigor applied to other scientific fields, the first thing that changes is the culture. The clearest model for what that looks like comes from economics in the 1980s, with a turning point known as the credibility revolution. Social science at the time would publish almost anything — a hypothesis, a handful of supporting examples, and a conclusion drawn from a time series trend. Researchers began saying, “Don’t show me a time-series trend, show me quasi-experimentation.” That led to more counterfactuals and treated versus untreated comparisons that could actually isolate cause and effect.
The core lesson is that it is very easy to trick yourself into thinking you have good data when you do not. The antidote is a culture of falsification and robustness checks – actively trying to break your own findings, running the tests that might make your results look bad, not just the ones that confirm what you hoped to see. If you skip that step, you are not doing science. You are telling a story you already wanted to tell.
That is the difference rigor actually makes, and it applies directly to dataset design and evaluation. The question is not whether your dataset looks good on the surface. The question is whether you ran the checks that could have shown you it was not, and whether you reported those results honestly. Two teams can work with the same raw material, and the one that builds in falsification from the start will produce something fundamentally more reliable. Scientific integrity means being willing to find out where you might be wrong.
Benchmarking plays a major role in how the industry measures progress in AI systems. Where do current evaluation frameworks fall short, and what new approaches might produce more reliable assessments of model performance?
The benchmarking market is expanding rapidly, and that is genuinely encouraging. The work being done spans a wide spectrum, from internal validity, where the goal is to design evaluations rigorous enough that you actually believe the result, to external validity, where models are tested in live deployment conditions and assessed on how useful they have been. There is important work happening across the entire range, and the simplest answer is that we just need more of it.
But the deeper problem is not the quantity of benchmarks; it is that nearly everyone builds them in a different way. There is no standard for how they are built, so the outcome measures vary quite a lot, and it’s hard to evaluate them credibly. A professor of mine in Public Economics used to say, “You never know what happened in the back room.” That phrase captures the benchmarking problem precisely. A lab might test a model against seventy outcomes, publish only the top thirty, and say that the model is excellent at these thirty things. Right now, it’s up to the model providers to convey what happened in the back room.
An umpire for rigor is needed. Publication bias in scientific research has demonstrated repeatedly that selective reporting shapes the perception of what works. That same dynamic is playing out in AI evaluation. The solution is not to ask model providers to be more transparent because they have every incentive to present their results favorably. What the field needs is a set standard for evaluation design and reporting, developed and enforced outside the organizations whose models are being assessed. Without that, benchmarking will continue to measure what labs want to show rather than what models actually do.
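The "back room" effect can be made concrete with a small simulation. The sketch below is illustrative only and does not describe how any lab reports results. It assumes a model with a single underlying skill level, measured with random noise on seventy outcomes, and compares the average over all seventy with the average over the best thirty. The skill level, noise level, seed, and the function name `selective_report` are hypothetical.

```python
import random
import statistics


def selective_report(true_skill, n_outcomes=70, n_reported=30, noise=0.1, seed=0):
    """Return (mean over all outcomes, mean over the top n_reported).

    Assumption: every outcome measures the same true skill plus
    independent Gaussian noise, so differences between outcomes are
    pure chance.
    """
    rng = random.Random(seed)
    scores = [true_skill + rng.gauss(0, noise) for _ in range(n_outcomes)]
    top = sorted(scores, reverse=True)[:n_reported]
    return statistics.mean(scores), statistics.mean(top)


full_mean, reported_mean = selective_report(true_skill=0.6)
print(f"mean over all 70 outcomes: {full_mean:.3f}")
print(f"mean over reported top 30: {reported_mean:.3f}")
```

Even though nothing distinguishes the outcomes except noise, the reported subset looks better than the full set. A reporting standard that requires disclosing all seventy outcomes removes that gap.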
DataLab focuses on partnerships with researchers, the development of new datasets and data products, and academic research. How do these areas work together to create measurable improvements in AI systems?
DataLab’s partnership with AI researchers, our development of data products, and our own academic research are all parts of a system working towards creating symmetric information in the data market. Right now, the data market has the same problem as any market with asymmetric information: the people acquiring data cannot reliably assess its quality before they use it, and the people producing good data are not adequately rewarded for it.
Our work with AI researchers at model providers puts DataLab directly inside the data layer of model development. That proximity matters because the people building the models are the ones who know exactly where the data is failing them – which capabilities are not developing as expected, which evaluations keep producing results that do not hold up in deployment. Working alongside them means the feedback is immediate and specific rather than secondhand and generalized.
We conduct academic research and work with domain experts to bring in an independent layer of scrutiny, asking questions about a dataset that someone with a stake in the outcome would not think to ask. The data products are where that thinking gets stress-tested in the market.
The measurable improvement comes from closing that loop repeatedly. We build something, run the falsification checks, find out where it breaks, and then feed that back into the research. A dataset that has gone through that cycle is fundamentally different from one that has not — not because the raw material was better to begin with, but because the process was designed to find the problems rather than overlook them.
Your research background includes working with complex real-world datasets such as electronic health records, claims data, and imaging data. How has that experience influenced your perspective on building trustworthy datasets for AI?
Working with electronic health records, claims data, and imaging data makes one thing immediately clear: none of it was created for the purpose you are using it for. Clinical notes were written for billing. Claims data was generated for reimbursement. Imaging was captured for diagnosis. Every one of these datasets is a proxy — a record of what a system needed to document, not a precise measurement of what you actually want to know. That gap between what the data is and what you need it to be is where most of the hard work lives.
That experience shaped a very specific instinct: before you do anything else with a dataset, you have to understand its original purpose. Who collected the data, under what incentives, with what gaps, and for what purpose? A claims dataset that looks comprehensive may systematically underrepresent populations who interact with the healthcare system less frequently. An imaging dataset that looks clean may have been preprocessed in ways that removed exactly the signal that matters most for the question you are trying to answer.
The practical implication for building trustworthy datasets is that scale is not a substitute for design. A large dataset built without attention to provenance only becomes more confidently wrong as it grows. What actually builds trust is repeated auditing, honest documentation of limitations, and domain expertise that can tell you what the data cannot see, not just what it can.
Protege’s broader vision involves linking diverse datasets across domains such as clinical notes, genomics, imaging, and claims data. What new possibilities does multimodal data create for AI, and what safeguards are needed to manage the associated risks?
The world is multimodal. You would never receive a clinical diagnosis based on text alone. Other attributes matter, such as imaging results, lab values, claims history, and genomic markers. Even all of those combined are not a perfect representation of what is happening in a person’s body. I once worked with a researcher who put it well: healthcare data is never a perfect proxy; it’s just a proxy for health. The implication is that the more modalities you can thoughtfully link together, the closer you get to the underlying reality you are actually trying to model.
When AI systems are trained on multimodal data, they are able to reason across the same layered, longitudinal picture that clinicians work from.
The safeguards question is where the stakes become very concrete. The probability that any dataset becomes visible on the internet at some point is not negligible — recent security breaches have made that clear. And anyone who has spent serious time reading medical records understands just how sensitive that information is. What people share with their physicians can break careers, damage relationships, and cause real harm if it ever becomes public.
At Protege, one principle that follows from this is that we do not self-certify our own data. We use a third-party certifier at arm’s length, even though we are legally permitted to do it ourselves. The reasoning is straightforward: the optimization function is not simply to maximize data utility. It is to maximize data utility subject to a privacy constraint.
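The constrained objective described above can be written compactly. The notation below is an illustrative assumption, not Protege's formal definition: $d$ is a candidate data release (for example, a particular de-identification choice), $\mathcal{D}$ is the set of feasible releases, $U(d)$ is its utility for model training, $R(d)$ is its privacy risk, and $\bar{r}$ is the maximum tolerated risk.

```latex
\max_{d \in \mathcal{D}} \; U(d)
\quad \text{subject to} \quad
R(d) \le \bar{r}
```

Read this way, an arm's-length certifier checks whether $R(d) \le \bar{r}$ holds, so the party that benefits from a higher $U(d)$ is not the one verifying the constraint.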
As AI systems become more integrated into high-stakes industries, what standards should emerge around dataset design, evaluation, and transparency to ensure that future AI systems are both reliable and safe?
The conversation in AI on standards tends to focus on technical failure modes, such as a prompt that produces an inaccurate answer or a model that behaves unexpectedly in deployment. Those matter, and the field has made real progress in thinking through data documentation, evaluation rigor, and privacy constraints. But there is a broader standard that the industry has not yet found an honest way to discuss, and it is the one with the most consequences for the most people.
AI is reshaping work. The word “work” carries many meanings: it’s a way to earn a living, but it’s also people’s purpose in life. The optimist’s version of this story points to the caveman who learned to build a knife, then watched manufacturing make that skill obsolete, and went on to develop entirely new expertise across generations. The arc of human labor has always bent toward adaptation. But that framing gets harder to apply when the person being displaced does not have decades of runway or the educational foundation to pivot into expertise that does not yet exist. The honest version of this conversation acknowledges both things at once.
What the industry needs is not just technical standards for datasets and benchmarks. It needs a willingness to ask what tasks are being replaced, at what pace, and what the downstream effects are on the people and communities involved. That is a standard, too.
Do these labor productivity standards belong alongside documentation requirements and evaluation frameworks? We are not in a position at DataLab to have that conversation alone. We sell data at Protege, which means we are not a neutral party. But we are also part of this economy, and so are our families. The least we can do is be honest about the complexity, name the tradeoff clearly, and push for the kind of cross-sector dialogue that a question this consequential actually requires.
Thank you for the great interview. Readers who wish to learn more should visit Protege, its DataLab initiative, or Engy Ziedan’s personal website.