Cursor’s browser experiment shows both the promise and the limits of large-scale AI software development. The company assigned hundreds of autonomous AI agents to one of software’s hardest tasks: building a working web browser with its own rendering engine.
The outcome was not a polished commercial browser. It did, however, render web pages in a recognizably correct way, with visible glitches that suggested it was not leaning on an existing engine. For a project that is widely treated as extremely complex, that result was enough to force a reassessment from people watching the field closely.
A browser became the stress test
The source article describes browser building as one of the most complex software projects imaginable. That is why Cursor’s choice of target matters. A browser is not a narrow coding exercise. It requires many parts of a system to work together, from rendering to JavaScript behavior to CSS handling.
Cursor’s project took several weeks to build and spans roughly one million lines of code across more than 1,000 files. The article says it is available on GitHub. Cursor also says that, despite the codebase size, new agents can still understand it and make meaningful progress.
Simon Willison, the British programmer and co-creator of the Django web framework, was surprised by the result. Earlier in January, he had predicted that an AI-assisted web browser would not be realistic until 2029 at the earliest. After seeing Cursor’s result, he wrote: "I may have been off by three years."
Willison’s reaction matters because his assessments of AI-assisted software development are closely followed in the industry. He also coined the term "prompt injection" in 2022 for a critical class of security vulnerability in LLMs; Jonathan Cefalu had previously reported the problem to OpenAI under the name "command injection".
Why the first agent design broke down
Cursor’s early structure did not work. The first setup relied on agents of equal status coordinating through a shared file. When an agent wanted a task, it had to lock that task so another agent would not duplicate the effort.
That simple coordination pattern became a bottleneck. Agents held locks for too long or failed to release them. The article quotes the result this way: "Twenty agents would slow down to the effective throughput of two or three, with most time spent waiting."
The flat structure also changed agent behavior in an unhelpful direction. Without a clear hierarchy, agents became cautious. They moved toward minor changes and away from difficult work. The source article describes long periods of churn without meaningful forward movement.
That failure is important because it shows that scaling autonomous coding agents is not just a matter of running more of them. If coordination is weak, adding agents can mean more waiting, more duplicated effort and more avoidance of hard tasks.
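The lock bottleneck described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Cursor’s implementation or measurement: the function names and the `lock_hold` and `work` durations are invented for illustration, and "holding a lock too long" is modeled simply as the lock taking up a large share of each task.

```python
import heapq

def total_time(num_agents, num_tasks, lock_hold, work):
    """Time to finish num_tasks when each claim holds one shared lock for lock_hold."""
    ready = [0.0] * num_agents   # when each agent is next free to claim a task
    heapq.heapify(ready)
    lock_free_at = 0.0           # when the shared-file lock is next available
    finish = 0.0
    for _ in range(num_tasks):
        agent_ready = heapq.heappop(ready)       # earliest available agent
        start = max(agent_ready, lock_free_at)   # wait if another agent holds the lock
        lock_free_at = start + lock_hold         # lock is held while claiming
        done = lock_free_at + work               # remaining work runs without the lock
        finish = max(finish, done)
        heapq.heappush(ready, done)
    return finish

def effective_agents(num_agents, num_tasks=1000, lock_hold=4.0, work=6.0):
    """Speedup over a single agent, i.e. how many agents' worth of work gets done."""
    single = total_time(1, num_tasks, lock_hold, work)
    return single / total_time(num_agents, num_tasks, lock_hold, work)

# Hypothetical durations: the lock is held for 4 of every 10 time units per task.
for n in (1, 2, 5, 20):
    print(n, round(effective_agents(n), 2))
```

With these made-up numbers the effective agent count levels off at about 2.5 whether 5 or 20 agents run, because the ceiling is set by (lock_hold + work) / lock_hold rather than by the number of agents.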
The system worked after roles became clearer
Cursor improved the process by separating responsibilities. Planners explored the codebase and created tasks. They could also spawn sub-planners for focused areas, such as CSS rendering or the JavaScript engine. That made planning parallel and recursive.
Workers had a narrower responsibility. They picked up a task, completed it and pushed their changes. They did not try to manage the whole project.
A Judge Agent reviewed each cycle and decided whether the project was complete or needed another iteration. This gave the system a way to keep moving while still checking progress against the larger goal.
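The division of labor between planners, workers and the judge can be sketched in a few lines. Everything here is hypothetical: the `Task` type, the function names, the nested backlog and the fixed number of workers per cycle are illustrative assumptions, and real agents would call models rather than flip a flag.

```python
from dataclasses import dataclass

@dataclass
class Task:
    area: str
    description: str
    done: bool = False

def plan(area, backlog):
    """Planner: walk one area and hand nested areas to a sub-planner (recursive call)."""
    tasks = []
    for name, item in backlog.items():
        if isinstance(item, dict):
            tasks.extend(plan(f"{area}/{name}", item))  # sub-planner for a focused area
        else:
            tasks.extend(Task(f"{area}/{name}", text) for text in item)
    return tasks

def work(task):
    """Worker: complete exactly one task and make no project-wide decisions."""
    task.done = True

def judge(tasks):
    """Judge: after each cycle, decide whether the project is complete."""
    return all(t.done for t in tasks)

def run(backlog, workers_per_cycle, max_cycles=100):
    tasks = plan("browser", backlog)
    for cycle in range(1, max_cycles + 1):
        pending = [t for t in tasks if not t.done]
        for task in pending[:workers_per_cycle]:
            work(task)
        if judge(tasks):
            return cycle, tasks
    return max_cycles, tasks

# Hypothetical backlog; the area names echo the article's CSS and JavaScript examples.
BACKLOG = {
    "css": {"layout": ["block flow", "flexbox"], "selectors": ["specificity"]},
    "javascript": ["parser", "interpreter"],
}

cycles, tasks = run(BACKLOG, workers_per_cycle=2)
print(cycles, [t.area for t in tasks])
```

The point of the sketch is the separation: only `plan` sees the whole backlog, `work` touches one task, and `judge` alone decides whether another cycle runs.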
The lesson is not that every AI coding project needs the same structure. The source article shows something more specific: in Cursor’s browser project, agents needed explicit roles. Equal status and shared-file coordination were not enough for work at this scale.
Prompts and models shaped the result
Cursor’s experience also pushed back against the idea that an agent harness alone determines performance. Wilson Lin from Cursor wrote: "Many of our improvements came from removing complexity rather than adding it."
One example was a dedicated integrator role for quality control and conflict resolution. Cursor found that it created more bottlenecks than it solved. Workers were able to handle conflicts themselves.
Model choice also mattered. The article says GPT-5.2 was found to be significantly better at "following instructions, keeping focus, avoiding drift, and implementing things precisely and completely." Opus 4.5, by contrast, "tends to stop earlier and take shortcuts when convenient," returning control quickly rather than fully completing a task.
Cursor also found that different models fit different roles. GPT-5.2 proved "a better planner than GPT-5.1-Codex, even though the latter is trained specifically for coding." Cursor now uses models based on role fit rather than treating one model as best for every job.
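Choosing models by role can be expressed as a simple lookup. This is a minimal sketch, assuming a hypothetical `model_for` helper; the only assignment grounded in the article is GPT-5.2 for planning, the strings are display names rather than verified API identifiers, and the fallback value is a placeholder because the article does not name a model for every role.

```python
# Display names from the article; not guaranteed to be valid API identifiers.
ROLE_MODELS = {
    "planner": "GPT-5.2",  # reported as a better planner than GPT-5.1-Codex
}

# Hypothetical placeholder: the article does not say which model fills other roles.
DEFAULT_MODEL = "unspecified-model"

def model_for(role):
    """Return the model assigned to a role, falling back to the placeholder."""
    return ROLE_MODELS.get(role, DEFAULT_MODEL)

print(model_for("planner"))
print(model_for("worker"))
```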
The clearest takeaway may be Cursor’s comment on prompting: "A surprising amount of the system's behavior comes down to how we prompt the agents. The harness and models matter, but the prompts matter more."
Beyond the browser experiment
The browser was not the only large project described. Cursor also used agents on a Solid-to-React migration in its own codebase. The overhaul took more than three weeks, adding 266,000 lines of code and removing 193,000 (+266,000/-193,000). The result passes CI tests, but still needs thorough human review.
Another agent improved video rendering with an efficient Rust implementation that is expected to ship soon. Other projects remain in progress, including a Java Language Server Protocol (LSP) implementation with 7,400 commits and 550,000 lines of code, a Windows 7 emulator with 14,600 commits and 1.2 million lines, and an Excel clone with 12,000 commits and 1.6 million lines.
These examples frame Cursor’s browser not as a one-off demo, but as part of a wider attempt to apply agent swarms to major software work. The results still involve glitches, unfinished review and projects in progress. But they also show that autonomous agents can contribute to large codebases when the system gives them structure, suitable models and carefully designed prompts.