AI agents are moving beyond simple answers and into multi-step work. That shift creates a harder question for model providers and startups: how do they know an agent can complete complex jobs reliably before it acts on behalf of users?
Patronus AI is building one answer. The San Francisco-based startup creates simulated digital environments where agents can be tested after training, including replicas of websites and internal systems.
A funding round built around agent reliability
Patronus AI announced a $50 million Series B round led by Greenfield Partners. Notable Capital, Lightspeed, Datadog, and Samsung also participated in the round.
The financing brings Patronus AI’s total funding to $70 million. The company was founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian.
Investor interest follows rapid growth. Patronus’ revenue has grown 15-fold over the past year, and Glenn Solomon, a managing director at Notable Capital, said virtually every frontier AI lab and many emerging startups are now customers.
That demand reflects a growing gap in AI development. Benchmarks can show that a model performs well on a defined test, including tests aimed at agent behavior. But a high benchmark score does not prove that an AI agent can handle a wide range of real-world jobs correctly.
Why benchmarks are not enough
AI labs often use benchmarks to demonstrate model performance. For agents, the challenge is different from answering a question because the system may need to navigate tools, make choices, and complete several steps in sequence.
Before agents are trusted to book trips or conduct financial analysis for users, builders want stronger evidence that they behave reliably. Patronus AI’s approach is to place agents inside controlled digital worlds where their actions can be observed and evaluated.
The company calls these systems “digital world models.” They are designed to reproduce the kinds of websites and internal systems an agent might need to operate inside.
After training, agents are stress-tested in those environments through reinforcement learning, a process that rewards successful task completion and penalizes errors. That gives model makers a way to test performance across different scenarios.
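To make the reward-and-penalty idea concrete, here is a minimal Python sketch of how runs in a simulated environment might be scored. It is not Patronus AI's implementation: the `Episode` record, the `reward` function, the scenario names, and the weights are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One simulated run of an agent on a task (hypothetical record)."""
    scenario: str     # name of the simulated scenario
    completed: bool   # whether the agent finished the task correctly
    errors: int       # number of mistakes observed during the run


def reward(episode, success_bonus=1.0, error_penalty=0.25):
    """Give a bonus for completion and subtract a penalty per error.

    The weights are illustrative assumptions, not values from the article.
    """
    base = success_bonus if episode.completed else 0.0
    return base - error_penalty * episode.errors


def mean_reward_by_scenario(episodes):
    """Average the reward per scenario to compare behavior across scenarios."""
    grouped = {}
    for ep in episodes:
        grouped.setdefault(ep.scenario, []).append(reward(ep))
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


runs = [
    Episode("checkout", completed=True, errors=0),
    Episode("checkout", completed=False, errors=2),
    Episode("ledger", completed=True, errors=1),
]
print(mean_reward_by_scenario(runs))
```

Grouping by scenario is one simple way to see where an agent is reliable and where it is not, rather than collapsing everything into a single score.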
What digital world models test
Patronus AI currently provides simulated digital worlds for software engineering and finance. Those areas are a starting point because some tasks in those fields can be checked and verified.
Kannappan said the company is focused today on problems that are verifiable, meaning problems that can be immediately checked. He also said there are many areas that are “very non-verifiable or very hard to verify.”
The distinction matters. In a verifiable task, an evaluator can determine whether the agent reached the right result. In harder-to-verify work, the challenge is not only whether the agent finished, but whether its process and output can be trusted.
The environments are meant to expose failures that a benchmark might miss. Solomon said AI agents can take shortcuts that keep them from completing a task correctly. He said Patronus is strong at identifying those shortcuts and holding models accountable.
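One way to picture a verifiable check that also catches shortcuts is a verifier that ignores the agent's own claim of success and inspects the environment instead. The Python sketch below is an assumption about how such a check could look, not a description of Patronus AI's tooling. The function `verify_run`, its parameters, and the file names are hypothetical.

```python
def verify_run(claimed_done, final_state, expected_state, touched, protected):
    """Check an agent run against the environment, not the agent's self-report.

    claimed_done:   whether the agent reported that it finished
    final_state:    state of the simulated system after the run
    expected_state: state a correct run should produce
    touched:        resources the agent modified
    protected:      resources the agent must not modify (for example, tests)
    Returns (passed, reasons), where reasons lists every failed check.
    """
    reasons = []
    if not claimed_done:
        reasons.append("agent did not report completion")
    if final_state != expected_state:
        reasons.append("final state does not match expected result")
    tampered = sorted(set(touched) & set(protected))
    if tampered:
        reasons.append("modified protected resources: " + ", ".join(tampered))
    return (not reasons, reasons)


# An honest run: correct result, nothing protected was touched.
print(verify_run(True, {"balance": 100}, {"balance": 100},
                 ["ledger.csv"], ["tests/test_ledger.py"]))

# A shortcut: the result looks right, but the agent edited the test file.
print(verify_run(True, {"balance": 100}, {"balance": 100},
                 ["tests/test_ledger.py"], ["tests/test_ledger.py"]))
```

The second call fails even though the final state matches, which is the kind of shortcut a score based only on the end result would miss.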
The company compares its method to how Waymo trained autonomous cars by first creating synthetic worlds. In those worlds, vehicles could face rare hazards, including severe weather or a child running after a ball.
For AI agents, the hazards are not physical roads. They are the unpredictable paths an agent may take while trying to finish a digital task.
Longer tasks raise the difficulty
Patronus AI is not only testing short agent actions. Kannappan said the company wants to create environments where an agent can operate for “10 hours or 10 days or 10 weeks.”
That goal points to a core issue in agent development. As tasks stretch over longer periods, there are more chances for errors, shortcuts, or incomplete work. A system that appears capable in a brief test may behave differently when a job requires extended execution.
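A rough way to see why longer tasks are harder is to assume, purely for illustration, that every step succeeds independently with the same probability. Under that simplifying assumption, which the article does not make and which real agents are unlikely to satisfy exactly, the chance of finishing a whole task without an error shrinks quickly as the number of steps grows. The function name and the numbers below are hypothetical.

```python
def task_success_probability(per_step_success, steps):
    """Chance that all steps succeed, assuming independent, identical steps.

    This is a simplifying assumption for illustration only.
    """
    return per_step_success ** steps


# The same per-step reliability looks very different at different task lengths.
for steps in (10, 100, 1000):
    p = task_success_probability(0.99, steps)
    print(f"{steps} steps: {p:.4f}")
```

Even a step that succeeds 99 percent of the time leaves little margin once a job runs to hundreds or thousands of steps, which is why a brief test can overstate how an agent will fare on extended work.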
For model providers and agent startups, simulated environments offer a place to find those problems before users depend on the system. They can test agents against controlled but varied conditions, then use the results to improve performance.
The approach also creates a clearer separation between model capability and model reliability. An agent may be sophisticated enough to attempt a task, but the more important question is whether it can complete that task correctly across many situations.
Competition comes from inside AI labs
Patronus AI sees its main competition as the internal teams that AI labs have already built to evaluate agent behavior. Those teams exist because model makers have a strong need to understand how agents act after training.
The company’s model also differs from human-data firms such as Mercor and Surge, which help model makers with reinforcement learning. Patronus AI, by contrast, evaluates agent behavior without involving humans in the evaluation.
That distinction is central to its pitch. Instead of relying on humans to judge every step, Patronus builds environments where agent behavior can be tested directly.
As AI agents become more capable, the pressure to prove reliability is rising. Patronus AI’s new funding shows that investors and customers see simulated digital worlds as one way to answer that pressure before agents take on more complex work for users.