Rebecca Qian, Co-Founder and CTO of Patronus AI – Interview Series

Rebecca Qian is the Co-Founder and CTO of Patronus AI, with nearly a decade of experience building production machine learning systems at the intersection of NLP, embodied AI, and infrastructure. At Facebook AI, she worked across research and deployment, training FairBERTa, a large language model designed with fairness objectives, developing a demographic-perturbation model to rewrite Wikipedia content, and leading semantic parsing for robotic assistants. She also built human-in-the-loop pipelines for embodied agents and created infrastructure tooling such as Continuous Contrast Set Mining, which was adopted across Facebook’s infrastructure teams and presented at ICSE. She has contributed to open-source projects including FacebookResearch/fairo and the Droidlet semantic parsing notebooks. As a founder, she now focuses on scalable oversight, reinforcement learning, and deploying safe, environment-aware AI agents.

Patronus AI is a San Francisco-based company that provides a research-driven platform for evaluating, monitoring, and optimizing large language models (LLMs) and AI agents to help developers ship reliable generative AI products with confidence. The platform offers automated evaluation tools, benchmarking, analytics, custom datasets, and agent-specific environments that identify performance issues such as hallucinations, security risks, or logic failures, enabling teams to continuously improve and troubleshoot AI systems across real-world use cases. Patronus serves enterprise customers and technology partners by empowering them to score model behavior, detect errors at scale, and enhance trustworthiness and performance in production AI applications.

You have a deep background building ML systems at Facebook AI, including work on FairBERTa and human-in-the-loop pipelines. How did that experience shape your perspective on real-world AI deployment and safety?

Working at Meta AI made me focus on what it takes to make models reliable in practice, especially around responsible NLP. I worked on fairness-focused language modeling, including training LLMs with fairness objectives, and I saw firsthand how difficult it is to evaluate and interpret model outputs. That shaped how I think about safety: if you can't measure and understand model behavior, it's hard to deploy AI confidently in the real world.

What motivated you to transition from research engineering into entrepreneurship, co-founding Patronus AI, and what problem felt most urgent to solve at the time?

At the time, evaluation had become a blocker for AI. I left Meta AI in April to start Patronus with Anand because I'd seen firsthand how hard it is to evaluate and interpret AI output. Once generative AI started moving into enterprise workflows, it was obvious this was no longer just a lab problem.

We kept hearing the same thing from enterprises. They wanted to adopt LLMs, but they couldn’t reliably test them, monitor them, or understand failure modes like hallucinations, especially in regulated industries where there’s very little tolerance for errors. 

So the urgent problem, at the start, was building a way to automate and scale model evaluation: scoring models in real-world scenarios, generating adversarial test cases, and benchmarking, so teams could deploy with confidence instead of guesswork.

Patronus recently introduced generative simulators as adaptive environments for AI agents. What limitations in existing evaluation or training approaches led you to this direction?

We kept seeing a growing mismatch between how AI agents are evaluated and how they’re expected to perform in the real world. Traditional benchmarks measure isolated capabilities at a fixed point in time, but real work is dynamic. Tasks get interrupted, requirements change mid-execution, and decisions compound over long horizons. Agents can look strong on static tests and still fail badly once deployed. As agents improve, they also saturate fixed benchmarks, which causes learning to plateau. Generative simulators emerged as a way to replace static tests with living environments that adapt as the agent learns.

How do you see generative simulators changing the way AI agents are trained and evaluated compared to static benchmarks or fixed datasets?

The shift is that benchmarks stop being tests and start becoming environments. Instead of presenting a fixed set of questions, the simulator generates the assignment, the surrounding conditions, and the evaluation logic on the fly. As the agent behaves and improves, the environment adapts. That collapses the traditional boundary between training and evaluation. You’re no longer asking whether an agent passes a benchmark, but whether it can operate reliably over time in a dynamic system.

From a technical standpoint, what are the core architectural ideas behind generative simulators, particularly around task generation, environment dynamics, and reward structures?

At a high level, generative simulators combine reinforcement learning with adaptive environment generation. The simulator can create new tasks, update the rules of the world dynamically, and evaluate an agent’s actions in real time. A key component is what we call a curriculum adjuster, which analyzes agent behavior and modifies the difficulty and structure of scenarios to keep learning productive. Reward structures are designed to be verifiable and domain-specific, so agents are guided toward correct behavior rather than superficial shortcuts.
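
To make the shape of that loop concrete, here is a minimal sketch of how a curriculum-adjusting simulator could be wired together. It is purely illustrative: the names (Task, CurriculumAdjuster, generate_task, verifiable_reward) and the thresholds are hypothetical and are not Patronus AI's actual architecture or API, and the "agent" is a toy stand-in.

```python
# Illustrative sketch only: all names and thresholds are hypothetical,
# not Patronus AI's actual system.
import random
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    difficulty: int
    expected: int  # a verifiable target, e.g. the correct sum


def generate_task(difficulty: int) -> Task:
    """Generate a fresh task whose structure scales with difficulty."""
    numbers = [random.randint(1, 10 * difficulty) for _ in range(difficulty + 1)]
    return Task(prompt=f"Sum {numbers}", difficulty=difficulty, expected=sum(numbers))


def verifiable_reward(task: Task, answer: int) -> float:
    """Domain-specific, checkable reward: the answer is either correct or not."""
    return 1.0 if answer == task.expected else 0.0


class CurriculumAdjuster:
    """Raise or lower difficulty based on the agent's recent success rate."""

    def __init__(self, window: int = 10):
        self.window = window
        self.history: list[float] = []
        self.difficulty = 1

    def update(self, reward: float) -> int:
        self.history.append(reward)
        recent = self.history[-self.window:]
        rate = sum(recent) / len(recent)
        if rate > 0.8:  # agent is saturating the tasks: make the environment harder
            self.difficulty += 1
        elif rate < 0.3 and self.difficulty > 1:  # agent is stuck: ease off
            self.difficulty -= 1
        return self.difficulty


def toy_agent(task: Task) -> int:
    """Stand-in agent that answers correctly about 70% of the time."""
    return task.expected if random.random() < 0.7 else task.expected + 1


if __name__ == "__main__":
    adjuster = CurriculumAdjuster()
    for step in range(50):
        task = generate_task(adjuster.difficulty)
        answer = toy_agent(task)
        reward = verifiable_reward(task, answer)
        difficulty = adjuster.update(reward)
        if step % 10 == 0:
            print(f"step={step} difficulty={difficulty} reward={reward}")
```

The point of the sketch is the feedback loop: tasks are generated rather than drawn from a fixed set, rewards are checkable rather than judged superficially, and the curriculum component reshapes the environment in response to the agent's behavior.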

As the AI evaluation and agent tooling space becomes more crowded, what most clearly differentiates Patronus’s approach?

Our focus is on ecological validity. We design environments that mirror real human workflows, including interruptions, context switches, tool use, and multi-step reasoning. Rather than optimizing agents to look good on predefined tests, we’re focused on exposing the kinds of failures that matter in production. The simulator evaluates behavior over time, not just outputs in isolation.

What types of tasks or failure modes benefit most from simulator-based evaluation compared to conventional testing?

Long-horizon, multi-step tasks benefit the most. Even small per-step error rates can compound into major failure rates on complex tasks, which static benchmarks fail to capture. Simulator-based evaluation makes it possible to surface failures related to staying on track over time, handling interruptions, coordinating tool use, and adapting when conditions change mid-task.
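
As a rough illustration of that compounding, assuming independent errors and a hypothetical 1% per-step error rate (the figures are not measurements, just arithmetic):

```python
# Toy illustration of how small per-step error rates compound over long horizons.
# The 1% per-step error rate is a hypothetical value, not a measured figure.
per_step_error = 0.01
for steps in (10, 50, 100, 500):
    task_failure = 1 - (1 - per_step_error) ** steps
    print(f"{steps:>3} steps -> ~{task_failure:.0%} chance of at least one error")
# Roughly: 10 steps -> ~10%, 100 steps -> ~63%, 500 steps -> ~99%
```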

How does environment-based learning change the way you think about AI safety, and do generative simulators introduce new risks such as reward hacking or emergent failure modes?

Environment-based learning actually makes many safety issues easier to detect. Reward hacking tends to thrive in static environments where agents can exploit fixed loopholes. In generative simulators, the environment itself is a moving target, which makes those shortcuts harder to sustain. That said, careful design is still required around rewards and oversight. The advantage of environments is that they give you much more control and visibility into agent behavior than static benchmarks ever could.

Looking five years ahead, where do you see Patronus AI in terms of both technical ambition and industry impact?

We believe environments are becoming foundational infrastructure for AI. As agents move from answering questions to doing real work, the environments where they learn will shape how capable and reliable they become. Our long-term ambition is to turn real-world workflows into structured environments that agents can learn from continuously. The traditional separation between evaluation and training is collapsing, and we think that shift will define the next wave of AI systems.

Thank you for the great interview. Readers who wish to learn more should visit Patronus AI.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

As a futurist, he is dedicated to exploring how these innovations will shape our world. In addition, he is the founder of Securities.io, a platform focused on investing in cutting-edge technologies that are redefining the future and reshaping entire sectors.