TechCrunch AI September 21, 2025 TERMINATOR

Inside the RL environment race to train AI agents

AI labs are looking beyond static datasets as they try to build more capable AI agents. RL environments simulate software work so agents can practice multistep tasks, but cost, complexity and reward hacking remain major questions.

WTF Index TERMINATOR

◄ Terminator 2 Idiocracy 0 ►

The story mildly leans Terminator because it focuses on training AI agents to become more capable and autonomous, though it emphasizes current limits and technical challenges rather than imminent danger.

Inside the RL environment race to train AI agents

AI agents are still far from the autonomous software workers that Big Tech CEOs have described for years. Products such as OpenAI's ChatGPT Agent and Perplexity's Comet show what is possible, but they also make the limits of today's agent technology easy to see.

That gap is pushing AI labs, startups and investors toward a training approach known as reinforcement learning environments, or RL environments. Instead of only learning from static data, agents can practice tasks inside simulated software workspaces and receive feedback when they succeed or fail.

Why AI labs want training grounds for agents

RL environments are designed to imitate the kinds of places where AI agents are supposed to work. A simulated browser, a software tool or an enterprise application can become a training ground where an agent is asked to complete a multistep task.

The basic idea is simple. An agent gets a goal, acts inside the environment, and receives a reward signal when it completes the task. In one example from the source, an environment could simulate a Chrome browser and ask an agent to buy a pair of socks on Amazon.

That example sounds small, but it shows why these environments matter. An agent might choose the wrong menu, misunderstand a product page or buy too many socks. Because developers cannot predict every possible mistake, the environment has to be flexible enough to handle unexpected behavior while still producing useful feedback.

This is what makes RL environments different from static datasets. A dataset can show examples. An environment can test whether an agent can act, recover and finish a task when the path is not perfectly scripted.

A new market forms around RL environments

Jennifer Li, general partner at Andreessen Horowitz, told TechCrunch that all the big AI labs are building RL environments internally. She also said labs are looking to outside vendors because creating high-quality environments and evaluations is complex.

That demand has created an opening for startups and data-labeling companies. Mechanize and Prime Intellect are among the newer companies trying to build around RL environments. Larger data-labeling businesses such as Mercor and Surge are also shifting more attention from static datasets toward interactive simulations.

The stakes are large enough that leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year, according to The Information. Investors and founders are watching for a company that could become the “Scale AI for environments,” a reference to the $29 billion data-labeling business that helped support the chatbot era.

Surge CEO Edwin Chen told TechCrunch he has seen a “significant increase” in demand for RL environments within AI labs. Surge, which reportedly generated $1.2 billion in revenue last year from work with AI labs including OpenAI, Google, Anthropic and Meta, has created an internal organization focused on RL environments.

Mercor, valued at $10 billion, is also pitching investors on RL environments for domain-specific tasks such as coding, healthcare and law. CEO Brendan Foody told TechCrunch that “few understand how large the opportunity around RL environments truly is.”

Different bets from Scale AI, Mechanize and Prime Intellect

The companies entering this market are not all taking the same approach. Scale AI, Surge and Mercor bring existing relationships with AI labs and experience in the broader data business. That gives them resources and distribution, but it also means they are adapting from earlier waves of AI training work.

Scale AI has had to adjust after Meta invested $14 billion and hired away its CEO. Google and OpenAI dropped Scale AI as a data provider, and the company faces competition for data-labeling work inside Meta. Even so, Scale AI is still trying to build for agents and environments.

Chetan Rane, Scale AI's head of product for agents and RL environments, framed the move as another adaptation. He said Scale AI adapted in autonomous vehicles, then adapted again after ChatGPT, and is now adapting to agents and environments.

Mechanize is taking a more focused route. The startup was founded roughly six months ago with the goal of “automating all jobs,” but co-founder Matthew Barnett told TechCrunch it is beginning with RL environments for AI coding agents. The company wants to provide a small number of robust environments rather than a broad set of simple ones.

Prime Intellect is aiming beyond the largest labs. Backed by AI researcher Andrej Karpathy, Founders Fund and Menlo Ventures, it has launched an RL environments hub that it hopes can serve as a “Hugging Face for RL environments.” The goal is to give open source developers access to resources similar to those used by major AI labs while selling access to compute.

The open question: can RL environments scale?

The biggest uncertainty is whether RL environments can become a reliable engine of AI progress. Reinforcement learning has already played a role in important recent advances, including OpenAI's o1 and Anthropic's Claude Opus 4. Those advances matter because earlier methods for improving AI models are showing diminishing returns.

RL environments fit into a broader bet that more reinforcement learning, more data and more computational resources can keep pushing AI forward. They may be especially relevant for agents because they let models act with tools and computers, not just produce text responses.

But that also makes the work expensive and difficult. Prime Intellect researcher Will Brown said training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques. That creates opportunity not only for environment builders, but also for GPU providers that can power the process.

Skeptics see serious risks. Ross Taylor, a former AI research lead with Meta and co-founder of General Reasoning, told TechCrunch that RL environments are vulnerable to reward hacking, where AI models find ways to get rewarded without truly completing the intended task. He also said people are underestimating how difficult it is to scale environments.

OpenAI's Head of Engineering for its API business, Sherwin Wu, said in a recent podcast that he was “short” on RL environment startups. He pointed to the competitiveness of the field and the difficulty of serving AI labs while research changes quickly.

Karpathy has also split his view. He has called RL environments a potential breakthrough and invested in Prime Intellect, but he has voiced caution about reinforcement learning more broadly. In a post on X, he said, “I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically.”

For now, RL environments look like one of the clearest attempts to make AI agents more dependable. They give agents a place to practice the messy work of using software. Whether that becomes the next great AI training market, or a difficult niche that resists easy scaling, remains the question every lab, startup and investor in the space is trying to answer.