āļĢāļĩāđ€āļšāļ„āļāđ‰āļē āđ€āļ‰āļĩāļĒāļ™ āļœāļđāđ‰āļĢāđˆāļ§āļĄāļāđˆāļ­āļ•āļąāđ‰āļ‡āđāļĨāļ° CTO āļ‚āļ­āļ‡ Patronus AI – āļšāļ—āļŠāļąāļĄāļ āļēāļĐāļ“āđŒāļžāļīāđ€āļĻāļĐ

Rebecca Qian is the Co-Founder and CTO of Patronus AI, with nearly a decade of experience building production machine learning systems at the intersection of NLP, embodied AI, and infrastructure. At Facebook AI, she worked across research and deployment, training FairBERTa, a large language model designed with fairness objectives, developing a demographic-perturbation model to rewrite Wikipedia content, and leading semantic parsing for robotic assistants. She also built human-in-the-loop pipelines for embodied agents and created infrastructure tooling such as Continuous Contrast Set Mining, which was adopted across Facebook’s infrastructure teams and presented at ICSE. She has contributed to open-source projects including FacebookResearch/fairo and the Droidlet semantic parsing notebooks. As a founder, she now focuses on scalable oversight, reinforcement learning, and deploying safe, environment-aware AI agents.

Patronus AI is a San Francisco-based company that provides a research-driven platform for evaluating, monitoring, and optimizing large language models (LLMs) and AI agents to help developers ship reliable generative AI products with confidence. The platform offers automated evaluation tools, benchmarking, analytics, custom datasets, and agent-specific environments that identify performance issues such as hallucinations, security risks, or logic failures, enabling teams to continuously improve and troubleshoot AI systems across real-world use cases. Patronus serves enterprise customers and technology partners by empowering them to score model behavior, detect errors at scale, and enhance trustworthiness and performance in production AI applications.

You have a deep background building ML systems at Facebook AI, including work on FairBERTa and human-in-the-loop pipelines. How did that experience shape your perspective on real-world AI deployment and safety?

Working at Meta AI made me focus on what it takes to make models reliable in practice—especially around responsible NLP. I worked on fairness-focused language modeling like training LLMs with fairness objectives, and I saw firsthand how difficult it is to evaluate and interpret model outputs. That’s shaped how I think about safety. If you can’t measure and understand model behavior, it’s hard to deploy AI confidently in the real world.

What motivated you to transition from research engineering into entrepreneurship, co-founding Patronus AI, and what problem felt most urgent to solve at the time?

Evaluation became a blocker in AI at the time. I left Meta AI in April to start Patronus with Anand because I’d seen firsthand how hard it is to evaluate and interpret AI output. And once generative AI started moving into enterprise workflows, it was obvious this was no longer just a lab problem. 

We kept hearing the same thing from enterprises. They wanted to adopt LLMs, but they couldn’t reliably test them, monitor them, or understand failure modes like hallucinations, especially in regulated industries where there’s very little tolerance for errors. 

So the urgent problem, at the start, was building a way to automate and scale model evaluation—scoring models in real-world scenarios, generating adversarial test cases, and benchmarking—so teams could deploy with confidence instead of guesswork.

Patronus recently introduced generative simulators as adaptive environments for AI agents. What limitations in existing evaluation or training approaches led you to this direction?

We kept seeing a growing mismatch between how AI agents are evaluated and how they’re expected to perform in the real world. Traditional benchmarks measure isolated capabilities at a fixed point in time, but real work is dynamic. Tasks get interrupted, requirements change mid-execution, and decisions compound over long horizons. Agents can look strong on static tests and still fail badly once deployed. As agents improve, they also saturate fixed benchmarks, which causes learning to plateau. Generative simulators emerged as a way to replace static tests with living environments that adapt as the agent learns.

How do you see generative simulators changing the way AI agents are trained and evaluated compared to static benchmarks or fixed datasets?

The shift is that benchmarks stop being tests and start becoming environments. Instead of presenting a fixed set of questions, the simulator generates the assignment, the surrounding conditions, and the evaluation logic on the fly. As the agent behaves and improves, the environment adapts. That collapses the traditional boundary between training and evaluation. You’re no longer asking whether an agent passes a benchmark, but whether it can operate reliably over time in a dynamic system.
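To make that shift concrete, here is a minimal Python sketch of a benchmark exposed as an adaptive environment rather than a fixed question set. The names (Task, AdaptiveEnvironment, task_generator) and the behavior are illustrative assumptions, not Patronus's actual API: each reset() generates a fresh assignment, its surrounding conditions, and the rubric used to score the agent, and conditions can shift mid-episode.

```python
# Hypothetical sketch of a "benchmark as environment" (illustrative names only).
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    instruction: str                       # the assignment generated for this episode
    conditions: Dict[str, object]          # world state the agent must work within
    rubric: List[Callable] = field(default_factory=list)  # checks that score behavior


class AdaptiveEnvironment:
    def __init__(self, task_generator: Callable[[float], Task], difficulty: float = 1.0):
        self.task_generator = task_generator  # produces tasks rather than fixed items
        self.difficulty = difficulty
        self.task = None

    def reset(self) -> Task:
        # Generate the assignment and its evaluation logic on the fly.
        self.task = self.task_generator(self.difficulty)
        return self.task

    def step(self, agent_action):
        # Score the action against the generated rubric ...
        reward = sum(1 for check in self.task.rubric if check(agent_action, self.task))
        # ... and occasionally change requirements mid-task, the way real work does.
        if random.random() < 0.1:
            self.task.conditions["requirement_changed"] = True
        done = False
        return self.task.conditions, reward, done
```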

From a technical standpoint, what are the core architectural ideas behind generative simulators, particularly around task generation, environment dynamics, and reward structures?

At a high level, generative simulators combine reinforcement learning with adaptive environment generation. The simulator can create new tasks, update the rules of the world dynamically, and evaluate an agent’s actions in real time. A key component is what we call a curriculum adjuster, which analyzes agent behavior and modifies the difficulty and structure of scenarios to keep learning productive. Reward structures are designed to be verifiable and domain-specific, so agents are guided toward correct behavior rather than superficial shortcuts.
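A hypothetical sketch of that loop, reusing the AdaptiveEnvironment sketch above: a curriculum adjuster watches recent success and retunes difficulty, while the reward comes from the task's own rubric, i.e. it is verifiable and domain-specific. All names, thresholds, and the agent interface are assumptions for illustration, not the production design.

```python
# Hypothetical curriculum-adjustment loop (illustrative; assumes AdaptiveEnvironment above).
class CurriculumAdjuster:
    def __init__(self, target_success: float = 0.6, step: float = 0.1, window: int = 20):
        self.target_success = target_success  # keep tasks neither trivial nor impossible
        self.step = step
        self.window = window
        self.history = []

    def update(self, env, episode_reward: float, max_reward: float):
        self.history.append(episode_reward / max(max_reward, 1e-9))
        recent = self.history[-self.window:]
        success_rate = sum(recent) / len(recent)
        if success_rate > self.target_success:
            env.difficulty += self.step                              # agent is cruising: harder scenarios
        elif success_rate < self.target_success - 0.2:
            env.difficulty = max(0.1, env.difficulty - self.step)    # agent is stuck: ease off


def train(agent, env, adjuster, episodes: int = 1000):
    for _ in range(episodes):
        task = env.reset()
        action = agent.act(task)                   # agent policy, assumed to exist
        _, reward, _ = env.step(action)
        agent.learn(task, action, reward)          # RL update on the verifiable, rubric-based reward
        adjuster.update(env, reward, max_reward=len(task.rubric) or 1)
```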

As the AI evaluation and agent tooling space becomes more crowded, what most clearly differentiates Patronus’s approach?

Our focus is on ecological validity. We design environments that mirror real human workflows, including interruptions, context switches, tool use, and multi-step reasoning. Rather than optimizing agents to look good on predefined tests, we’re focused on exposing the kinds of failures that matter in production. The simulator evaluates behavior over time, not just outputs in isolation.

What types of tasks or failure modes benefit most from simulator-based evaluation compared to conventional testing?

Long-horizon, multi-step tasks benefit the most. Even small per-step error rates can compound into major failure rates on complex tasks, which static benchmarks fail to capture. Simulator-based evaluation makes it possible to surface failures related to staying on track over time, handling interruptions, coordinating tool use, and adapting when conditions change mid-task.
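A quick back-of-the-envelope calculation shows how fast this compounding bites, assuming roughly independent steps with a fixed per-step error rate:

```python
# Compounding of small per-step error rates over long-horizon tasks.
per_step_error = 0.02  # a seemingly small 2% chance of failing any single step
for steps in (10, 50, 100):
    success = (1 - per_step_error) ** steps
    print(f"{steps:3d} steps -> {success:.0%} end-to-end success")
# Roughly: 10 steps -> 82%, 50 steps -> 36%, 100 steps -> 13%
```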

How does environment-based learning change the way you think about AI safety, and do generative simulators introduce new risks such as reward hacking or emergent failure modes?

Environment-based learning actually makes many safety issues easier to detect. Reward hacking tends to thrive in static environments where agents can exploit fixed loopholes. In generative simulators, the environment itself is a moving target, which makes those shortcuts harder to sustain. That said, careful design is still required around rewards and oversight. The advantage of environments is that they give you much more control and visibility into agent behavior than static benchmarks ever could.

Looking five years ahead, where do you see Patronus AI in terms of both technical ambition and industry impact?

We believe environments are becoming foundational infrastructure for AI. As agents move from answering questions to doing real work, the environments where they learn will shape how capable and reliable they become. Our long-term ambition is to turn real-world workflows into structured environments that agents can learn from continuously. The traditional separation between evaluation and training is collapsing, and we think that shift will define the next wave of AI systems.

Thank you for the great interview, readers who wish to learn more should visit Patronus AI.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

āđƒāļ™āļāļēāļ™āļ°āļ—āļĩāđˆāđ€āļ›āđ‡āļ™ āļœāļđāđ‰āđ€āļ›āđ‡āļ™āđ€āļˆāđ‰āļēāļĒāļąāļ‡āļĄāļēāđ„āļĄāđˆāļ–āļķāļ‡āđ€āļ‚āļēāļ­āļļāļ—āļīāļĻāļ•āļ™āđ€āļžāļ·āđˆāļ­āļŠāļģāļĢāļ§āļˆāļ§āđˆāļēāļ™āļ§āļąāļ•āļāļĢāļĢāļĄāđ€āļŦāļĨāđˆāļēāļ™āļĩāđ‰āļˆāļ°āļāļģāļŦāļ™āļ”āđ‚āļĨāļāļ‚āļ­āļ‡āđ€āļĢāļēāļ­āļĒāđˆāļēāļ‡āđ„āļĢ āļ™āļ­āļāļˆāļēāļāļ™āļĩāđ‰ āđ€āļ‚āļēāļĒāļąāļ‡āđ€āļ›āđ‡āļ™āļœāļđāđ‰āļāđˆāļ­āļ•āļąāđ‰āļ‡ āļŦāļĨāļąāļāļ—āļĢāļąāļžāļĒāđŒ.ioāđāļžāļĨāļ•āļŸāļ­āļĢāđŒāļĄāļ—āļĩāđˆāļĄāļļāđˆāļ‡āđ€āļ™āđ‰āļ™āļāļēāļĢāļĨāļ‡āļ—āļļāļ™āđƒāļ™āđ€āļ—āļ„āđ‚āļ™āđ‚āļĨāļĒāļĩāļĨāđ‰āļģāļŠāļĄāļąāļĒāļ—āļĩāđˆāļāļģāļĨāļąāļ‡āļāļģāļŦāļ™āļ”āļ­āļ™āļēāļ„āļ•āđƒāļŦāļĄāđˆāđāļĨāļ°āļ›āļĢāļąāļšāđ€āļ›āļĨāļĩāđˆāļĒāļ™āļĢāļđāļ›āđāļšāļšāļ āļēāļ„āļŠāđˆāļ§āļ™āļ•āđˆāļēāļ‡āđ† āļ—āļąāđ‰āļ‡āļŦāļĄāļ”