AI agents now pass 16.1 percent of freelance tasks

The Remote Labor Index says Fable 5 has reached a 16.1 percent automation rate on real freelance projects, far above the 2.5 percent top score when the benchmark launched. The results show rapid progress, but most work still fails to meet professional quality, and AI judges remain too generous.

AI agents are getting better at completing real freelance work, but the latest results also show why professional judgment still matters. The Remote Labor Index now reports a top automation rate of 16.1 percent, meaning the best system delivers work judged at least as good as a paid human professional's on that share of tested projects.

That is a major shift from the benchmark's launch, when the leading AI agent reached only 2.5 percent. The new numbers suggest fast movement in remote work automation, while also making clear that most commercially valuable freelance jobs remain outside the reach of current systems.

What the Remote Labor Index Measures

The Remote Labor Index, or RLI, is designed around practical freelance work rather than abstract tests. It tracks whether AI agents can finish real, commercially valuable projects at a standard that a paying client would accept.

The benchmark includes work across 3D and CAD, architecture, graphic design, video and animation, audio, data analysis, and web apps. In total, it covers 240 projects worth a combined $144,000, sourced from 358 verified freelancers.

Each AI result is compared with a gold standard created by a paid professional. Human evaluators at the Center for AI Safety (CAIS) score the outputs. The RLI was developed together with Scale Labs.

The key measure is the automation rate. In plain terms, that is the share of projects where the AI result is rated at least as good as the human-made reference work.
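
As a rough illustration of that definition, the Python sketch below computes an automation rate from per-project verdicts. The Verdict record and the sample data are hypothetical and are not taken from the RLI. Only the counting rule follows the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One evaluated project (hypothetical record, not the RLI's own schema)."""
    project_id: str
    at_least_as_good: bool  # True if the AI result matched or beat the human reference


def automation_rate(verdicts: list[Verdict]) -> float:
    """Share of evaluated projects where the AI result was rated at least as good."""
    if not verdicts:
        raise ValueError("no evaluated projects")
    passed = sum(v.at_least_as_good for v in verdicts)
    return passed / len(verdicts)


# Made-up sample: 2 of 8 projects pass.
sample = [Verdict(f"p{i}", i in (0, 5)) for i in range(8)]
print(f"{automation_rate(sample):.1%}")  # prints 25.0%
```

A project counts only as pass or fail here. Partial credit for a nearly finished deliverable is deliberately not modelled, since the measure as described is a share of projects that clear the bar.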

Fable 5 Sets The New High Mark

The latest RLI results put Fable 5 at 16.1 percent, the highest automation rate recorded so far. That is roughly double the 8.3 percent reached by Opus 4.8, while GPT-5.5 comes in at 6.3 percent.

All three systems outperform every earlier model tested on the benchmark. The previous best result was Opus 4.6 running on the Claude Cowork framework, which reached 4.17 percent.

The authors say the frontier has more than quadrupled in under eight months. That does not mean AI agents are ready to replace most freelance professionals, but it does show that the best systems are moving quickly on tasks that require real tools, files, and client-style deliverables.

There is one important note on Fable 5's score. Only 218 of 240 projects could be evaluated before the U.S. government restricted access to the model. Even under the harsh assumption that Fable 5 failed every unevaluated project, its automation rate would still be 14.6 percent, which remains higher than any other model in the results.
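
The worst-case figure can be checked with a few lines of Python. The three constants come from the article. The pass count of 35 is an inference (the whole number closest to 16.1 percent of 218), not a figure the authors report.

```python
TOTAL_PROJECTS = 240    # full benchmark size, from the article
EVALUATED = 218         # projects scored before access was restricted, from the article
REPORTED_RATE = 0.161   # Fable 5's reported automation rate, from the article

# Assumption: the pass count is the integer closest to rate * evaluated.
passed = round(REPORTED_RATE * EVALUATED)

rate_on_evaluated = passed / EVALUATED
# Count every unevaluated project as a failure.
worst_case_rate = passed / TOTAL_PROJECTS

print(passed, f"{rate_on_evaluated:.1%}", f"{worst_case_rate:.1%}")
# prints: 35 16.1% 14.6%
```

Under that assumption, both published percentages are reproduced: 35 of 218 rounds to 16.1 percent, and 35 of 240 rounds to 14.6 percent.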

Newer Models Do Not Always Rank Higher

The RLI results also complicate a simple story about model release dates. Progress is visible at the top of the leaderboard, but newer does not automatically mean stronger on this benchmark.

On the full Scale Labs leaderboard, Gemini 3 Pro scores just 1.25 percent, placing it near the bottom and behind much older systems. That matters because the RLI measures end-to-end work, not only reasoning or text generation.

The examples described in the study show the gap between a better AI output and a professional final product. On a ring design task, Fable 5 improved on earlier AI systems but still looked unprofessional when inspected closely. On an architecture task, GPT-5.5 produced an attractive render with an image generator even though the underlying 3D model remained flawed.

Those cases highlight a central issue for freelance automation: presentation can look convincing before the actual deliverable holds up. For clients, the file itself matters. A polished preview is not the same as usable work.

Why Human Evaluation Still Matters

The team also tested whether AI judges could replace costly human evaluation. The result was not encouraging. AI evaluators gave the new models scores that were far too high.

For GPT-5.5, the AI judge's score was almost three times too high. For Opus 4.8, it was about two and a half times too high. The automated judge did identify the correct ranking order, but the specific automation rates were badly inflated.
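
How a judge can rank models correctly while still overstating every score can be shown with a small Python sketch. The model names and all numbers below are invented for illustration and are not the study's figures. The inflation factor is taken here to be the judge's rate divided by the human-assessed rate, which is an assumption about how "times too high" is computed.

```python
def inflation_factor(judge_rate: float, human_rate: float) -> float:
    """How many times higher the AI judge's rate is than the human-assessed rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return judge_rate / human_rate


def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical automation rates in percent (NOT the study's data).
human = {"model_a": 10.0, "model_b": 5.0}
judge = {"model_a": 25.0, "model_b": 15.0}

for name in human:
    print(name, inflation_factor(judge[name], human[name]))
print("same ranking:", ranking(human) == ranking(judge))
```

In this toy case the judge puts both models in the right order, yet its scores are 2.5 and 3 times the human figures, which is the pattern the article describes.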

According to CAIS, judging this kind of work requires more than looking at an output on the surface. A reviewer has to open files in the appropriate professional software, use that software properly, and assess the result the way a paying client would.

That requirement exposes the same weakness that limits AI workers. Current AI agents still struggle with hands-on software operation. An AI judge can therefore miss problems that become obvious only when the real file is inspected. The GPT-5.5 architecture example shows the point: spotting the issue requires opening the 3D model and checking the actual geometry.

How The Agents Were Tested

To give the models a fair chance, the benchmark does not restrict them to a simple chat interface. The team runs them in developer tools such as Claude Code and Codex CLI, extended so the agents can operate graphical programs directly.

The work happens inside a virtual Linux machine with more than 30 professional apps installed, including Blender, GIMP, and Audacity. Each project receives up to 24 hours of compute time.

The setup also includes a critic loop. A second AI agent reviews the work as critically as a demanding client, and the first agent then revises the output. This structure is meant to let the systems improve their deliverables before final evaluation.
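
A critic loop of this kind can be sketched in Python as follows. This is a minimal outline under stated assumptions, not the benchmark's actual harness: the Worker and Critic signatures, the max_rounds limit, and the toy stand-in functions are all hypothetical. The article specifies a compute budget per project, not a number of revision rounds.

```python
from typing import Callable, Optional

# Hypothetical signatures: the real agents operate GUI software, which is not modelled here.
Worker = Callable[[str, Optional[str]], str]   # (brief, feedback) -> deliverable
Critic = Callable[[str, str], Optional[str]]   # (brief, deliverable) -> feedback, or None if satisfied


def run_with_critic(brief: str, worker: Worker, critic: Critic, max_rounds: int = 3) -> str:
    """Produce a deliverable, then revise until the critic has no objections or rounds run out."""
    deliverable = worker(brief, None)
    for _ in range(max_rounds):
        feedback = critic(brief, deliverable)
        if feedback is None:
            break  # the critic accepts the work
        deliverable = worker(brief, feedback)
    return deliverable


# Toy stand-ins so the sketch runs end to end.
def toy_worker(brief: str, feedback: Optional[str]) -> str:
    return f"{brief} (draft)" if feedback is None else f"{brief} (revised: {feedback})"


def toy_critic(brief: str, deliverable: str) -> Optional[str]:
    return None if "revised" in deliverable else "fix proportions"


print(run_with_critic("ring design", toy_worker, toy_critic))
# prints: ring design (revised: fix proportions)
```

The loop stops as soon as the critic has nothing to add, so a deliverable that is acceptable on the first pass is never revised. The round limit keeps a never-satisfied critic from looping indefinitely.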

Even with that support, the benchmark's message is mixed. AI agents are advancing quickly, and the jump from 2.5 percent to 16.1 percent is substantial. At the same time, AI still fails to reach professional quality on most projects, and none of the three Fable 5 results shown in the blog post would qualify as finished work.

For remote work, the implication is clear: automation is no longer theoretical, but it is uneven. The strongest agents can now complete a meaningful minority of freelance tasks at professional quality, while the harder work still demands human skill, careful inspection, and client-level standards.