Interviews
Bobby Samuels, Co-Founder and CEO of Protege – Interview Series

Bobby Samuels leads Protege’s strategy and execution across product, go-to-market, and capital formation. He co-founded Protege in 2024 and has served as CEO since inception. Under his leadership, Protege has raised $35M in funding and scaled to $30M in GMV in its first full year of business. Previously, Bobby was General Manager of Privacy Hub at Datavant, where he helped drive the company’s growth leading up to its $7.0B merger with Ciox Health to create the largest neutral health data ecosystem in the U.S. Earlier, he led partnerships at LiveRamp, where he developed expertise in building neutral data networks. Bobby holds an M.B.A. from the Stanford Graduate School of Business and an A.B. from Harvard College, where he was President of The Harvard Crimson. He brings deep expertise in regulated data exchange and translating complex infrastructure into trusted AI enablement for enterprise partners.
Protege is a data-infrastructure company that connects owners of high-value, proprietary datasets with developers building AI models, offering a governed and privacy-first way to license and access training data at scale. Founded in 2024, the platform focuses on unlocking multimodal data—such as medical records, imaging, video, and audio—that is traditionally difficult for AI teams to source, while giving data providers full control over privacy, compliance, and monetization. For AI builders, Protege streamlines discovery and acquisition through a curated catalog and tools for filtering and combining datasets, helping accelerate development across healthcare, media, and other sectors. In essence, the company aims to become the trusted data layer for AI, reducing one of the biggest bottlenecks in modern model development.
What inspired you to found Protege, and how did your experiences leading data, privacy, and organizational transformation initiatives at Datavant — as well as earlier roles at LiveRamp — shape your vision for building it?
My experience at Datavant showed me both the power and the complexity of connecting data responsibly at scale. Datavant built a platform that helped link sensitive health information while maintaining patient privacy, and it became clear to me that well-governed data can drive massive societal progress. But when data is not governed well, it can do real harm.
As AI accelerated, I saw the same pattern repeating: a focus on compute and AI architectures, but far less attention on the data driving the models themselves. Our hypothesis is that the next massive bottleneck is access to the right data. I wanted to build a data infrastructure layer that makes data sharing safe, transparent, and mutually beneficial for data holders and AI builders, while also providing AI data-specific expertise to support research-driven AI advancements. That is what led to Protege.
Protege describes itself as building the “backbone of the AI data economy.” How do you define that layer, and what does true data infrastructure for AI look like in practice?
Protege is the connective tissue that lets data owners and AI developers collaborate safely and efficiently. True data infrastructure for AI does more than store or move data; it verifies provenance, manages permissions, and ensures that every dataset is used ethically and with consent. In practice, it’s a single platform where content holders can license data confidently and be properly compensated, and AI builders can access the crucial datasets across industries, domains, modalities, and formats that they need to train and evaluate models responsibly.
One of your core missions is ensuring models are trained on licensed, representative, and consent-based datasets. How does Protege operationalize ethical sourcing at scale?
We operationalize ethics through systems, not slogans. With every data and content source that we aggregate and deliver, we ensure that the rights holders maintain ownership with clear licensing terms and privacy protections.
Our platform combines our human, research-oriented expertise with data pipelines and systems that scale to deliver rights-protected data. We also work with our data-buying customers to ensure that the data is representative of real-world populations and reflective of real-world use cases. By addressing both data suppliers and data purchasers with clarity and consistency, we’re able to maintain compliance, fairness, and trust.
The AI industry has long been driven by a “scrape first, ask later” mentality. How do you see transparent data licensing reshaping relationships between data providers and AI developers?
Transparency turns extraction into collaboration. Instead of scraping, AI companies have the option to ethically license data from vetted data providers, which creates better incentives for both sides. Data providers gain revenue and control, and AI developers get cleaner, higher-quality datasets without the legal and IP risks.
This shift builds trust, which in turn unlocks speed in AI development. When organizations see that AI can be built responsibly, with clear consent and compensation for data rights holders, more use cases and data needs open up. That creates more demand for high-quality datasets, starting a natural flywheel: the best data sources attract buyers, and the buyers attract more high-fidelity data sources. Everyone benefits.
Synthetic data is often seen as a solution to privacy and bias challenges. Where do you think the right balance lies between synthetic and real-world datasets, especially in highly regulated sectors like healthcare?
Synthetic data is useful for testing and augmentation, but it cannot entirely replace the full nuance and complexity of the real-world activities that generate training and evaluation data. This is especially true in healthcare, where long-term patient care history and outcomes, understood in the context of the care approach, matter.
We fundamentally believe that AI that hasn’t been trained on the full complexity of the real world cannot suddenly produce synthetic data that is representative of the real world. The right balance is likely a hybrid approach: unlocking the many useful, high-quality data sources that are currently siloed, and then combining them with AI-generated synthetic data for specific use cases.
How does Protege enable organizations to share valuable real-world data securely, without exposing proprietary information, patient data, or intellectual property?
Security and privacy are built into every step of the journey. Whether it’s through our internal systems or our de-identification and privacy partners that verify our data transfers, we ensure that our data stays within the intended boundaries.
In healthcare, that means adherence to privacy and compliance frameworks for all our data transfers. In media, it means ensuring content is licensed only for intended uses, on pre-agreed licensing terms and term lengths.
As foundation models continue to evolve, what will define the next generation of high-quality training data pipelines?
Three principles will lead: provenance, precision, and purpose.
Provenance means full traceability to source and terms. Precision means curation for specific modalities or use cases rather than generic corpora, or data that isn’t fully reflective of real-world situations. Purpose means aligning data selection with concrete, real-world outcomes, not just vanity benchmarks.
Together, these create a path toward using high-quality data to drive better models.
How do emerging regulations like the EU AI Act and upcoming U.S. frameworks influence Protege’s approach to compliance and cross-border data collaboration?
These regulations validate the approach we built the company on. They emphasize transparency, provenance, and risk management, all of which are embedded in our products and platform by default.
We believe that future AI opportunities must protect rights holders and maintain strict privacy controls. By treating these as non-negotiables, we help data partners and clients move forward with confidence and trust in the ever-changing AI landscape. Our goal is to make responsible AI development not just the right thing to do, but the easier thing to do.
What role do you see data transparency and provenance playing in rebuilding public trust in AI systems?
Trust begins with traceability. When people understand where data came from and how it is being used, they are more likely to trust AI outcomes.
Transparency and provenance create accountability from the data owner to the model developer to the end user. They turn AI from a black box into something more understandable and explainable.
After 20x growth and a $25M Series A, how are you balancing rapid scaling with maintaining Protege’s ethical and security commitments — and what’s next as you continue shaping how organizations train AI models responsibly?
Ethics and security are the foundation that allows us to scale. Every new process, partnership, and product is measured against a simple standard: operate as if others were watching. If everyone saw how we operate and the decisions we make, I would want them to be proud.
As we look forward to 2026, we are expanding our reach into new domain areas beyond healthcare and media, as well as creating new data products such as evaluation data for benchmarking as AI organizations strive to better measure AI performance for real-world use cases. Our aim is to be the single trusted platform for real-world AI data and expertise, built to power AI progress for the long run.
Thank you for the great interview. Readers who wish to learn more should visit Protege.