The Decoder, October 22, 2024

Agent S moves AI assistants closer to hands-on computer work

Agent S is an AI system designed to operate computers the way a person does, using mouse, keyboard, and screen controls. Early tests show promise on Linux and Windows, but the system remains slow and reaches a success rate of only around 20 percent in one benchmark.

Agent S is an AI system built to handle routine computer tasks by using a machine in a way that resembles human interaction. Instead of relying only on text output, it can click buttons, type text, move through menus, and work with folders through a dedicated computer interface.

The system is described in the paper "Agent S: An Open Agentic Framework That Uses Computers Like a Human." Its goal is straightforward: reduce the time people spend on repetitive office work such as data entry, scheduling, and document creation, while pointing toward more capable digital assistants.

How Agent S Uses a Computer

Agent S combines modern language models with an interface that can take control of mouse, keyboard, and screen. That matters because many useful tasks still happen inside ordinary software, where the right action may be a click, a typed entry, or a menu choice rather than a generated paragraph.

The researchers modeled this interaction on human computer use. Agent S can navigate menus and folders, click controls, and enter text. This approach could apply across different software rather than depending on one narrow application.

That flexibility is the core promise. For individual users and businesses, the same basic idea could support automation in many routine workflows. The source also notes that the technology could create new opportunities for people with disabilities.

Learning Is the Main Difference

Other projects have aimed at similar kinds of computer control. Microsoft, for example, demonstrated the experimental UFO framework earlier this year. Agent S is presented as notable because of how it learns and adapts.

The system can draw on information from the internet, including instructions for specific computer programs. That helps it respond to applications that change over time instead of depending only on fixed instructions.

Agent S also stores experience from earlier tasks in a memory-like knowledge base. When it receives a new task, it searches for related examples, then divides the job into smaller subtasks.

During execution, Agent S monitors its own progress and adjusts its approach. After finishing, the experience is added back to the knowledge store, so completed work becomes part of what the system can use later.
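
The retrieve, plan, and store-back loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the Agent S implementation: the names `Experience`, `ExperienceMemory`, and `run_task` are hypothetical, the word-overlap score merely stands in for whatever retrieval the real system uses, and the `plan` argument is a placeholder for the model-driven planner.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experience:
    task: str
    subtasks: List[str]


@dataclass
class ExperienceMemory:
    """Hypothetical store of past tasks; not the Agent S data structure."""
    items: List[Experience] = field(default_factory=list)

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard overlap of lowercase words, standing in for real retrieval.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        union = wa | wb
        return len(wa & wb) / len(union) if union else 0.0

    def retrieve(self, task: str, k: int = 1) -> List[Experience]:
        # Return the k stored experiences most similar to the new task.
        ranked = sorted(
            self.items,
            key=lambda e: self._similarity(task, e.task),
            reverse=True,
        )
        return ranked[:k]

    def add(self, experience: Experience) -> None:
        self.items.append(experience)


def run_task(
    task: str,
    memory: ExperienceMemory,
    plan: Callable[[str, List[Experience]], List[str]],
) -> List[str]:
    """Look up related experience, plan subtasks, then store the result."""
    related = memory.retrieve(task)
    subtasks = plan(task, related)
    # Execution and self-monitoring would happen here; this sketch skips them.
    memory.add(Experience(task, subtasks))
    return subtasks
```

With a planner that simply reuses the subtasks of the closest stored example, a request such as "turn off autosave in VS Code" would pick up the steps recorded for an earlier "disable autosave in VS Code" task, and the new run would then be stored as one more example.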

The Interface Tries to Reduce Fragile Clicking

A key part of the framework is the agent-computer interface, which acts as the bridge between the AI system and the machine. It translates between the model’s decisions and the actual actions needed on screen.

The interface evaluates visual information so it can detect screen changes. It also builds a digital twin of visible controls and their layout.

That design is meant to avoid one common weakness in computer automation: relying on fixed mouse coordinates. Instead of moving to a specific screen position, Agent S can use an instruction such as "Click on button No. 42".

According to the paper, this makes control more robust and reduces the chance of errors. In practical terms, the system is trying to identify interface elements as objects to act on, not just as points on a display.
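
The difference between clicking a coordinate and clicking an element can be made concrete with a small sketch. Everything here is an assumption for illustration: `UIElement` and `ScreenModel` are hypothetical names, not the framework's classes, and the sketch assumes a control keeps the same number when the layout is rebuilt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class UIElement:
    element_id: int
    role: str                          # e.g. "button", "textbox"
    label: str
    bounds: Tuple[int, int, int, int]  # x, y, width, height in pixels

    def center(self) -> Tuple[int, int]:
        x, y, w, h = self.bounds
        return (x + w // 2, y + h // 2)


class ScreenModel:
    """Hypothetical snapshot of the visible controls and their layout."""

    def __init__(self, elements: Iterable[UIElement]):
        self._by_id: Dict[int, UIElement] = {e.element_id: e for e in elements}

    def resolve_click(self, element_id: int) -> Tuple[int, int]:
        # Translate "click on button No. N" into coordinates at execution time.
        if element_id not in self._by_id:
            raise KeyError(f"element {element_id} is not on screen")
        return self._by_id[element_id].center()


# The same instruction resolves to different coordinates when the layout moves.
before = ScreenModel([UIElement(42, "button", "Save", (100, 200, 80, 30))])
after = ScreenModel([UIElement(42, "button", "Save", (300, 400, 80, 30))])
print(before.resolve_click(42))  # (140, 215)
print(after.resolve_click(42))   # (340, 415)
```

A script that had hard-coded the first position would miss the button after the move, whereas the instruction that names the element still lands on it. A missing element raises an error instead of producing a click on whatever now occupies that spot.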

Early Results Show Promise and Limits

In the developers' own tests, Agent S was compared with a pure language model on typical computer tasks. In a benchmark of Linux tasks, the framework raised the success rate by almost 90 percent in relative terms, but the absolute success rate still only reaches around 20 percent.
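
The two figures measure different things: the almost 90 percent is a gain relative to the baseline, while the roughly 20 percent is the share of tasks actually completed. A quick back-of-the-envelope calculation shows what baseline they imply. It uses only the rounded numbers reported here, so the result is an approximation and not a figure taken from the paper.

```python
def implied_baseline(final_rate: float, relative_gain: float) -> float:
    """Return the baseline b such that b * (1 + relative_gain) == final_rate."""
    return final_rate / (1.0 + relative_gain)


# Rounded inputs from the article: about 20 percent success, about 90 percent gain.
baseline = implied_baseline(final_rate=20.0, relative_gain=0.90)
print(f"implied baseline: about {baseline:.1f} percent")
# prints "implied baseline: about 10.5 percent"
```

In other words, a pure language model would have solved roughly one task in ten, and the framework lifts that to roughly one in five.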

Speed is another major limitation. In demo videos, Agent S takes about six minutes to remove an account in the Thunderbird email client. It takes a good three minutes to deactivate the autosave function in VS Code.

The system can be connected to different language models through an API. Depending on the task area, Claude 3.5 or GPT-4o came out ahead when combined with the framework, but the overall difference was marginal at 0.1 percentage points.
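
One common way to keep a framework independent of any single model is to treat the model as a plain function from prompt to reply. The sketch below illustrates that idea only. The `Agent` class, the prompt format, and the two stand-in backends are hypothetical, and no real model API is called.

```python
from typing import Callable

# A backend is any function that maps a prompt to a text reply.
Backend = Callable[[str], str]


class Agent:
    """Hypothetical agent core that does not care which model answers."""

    def __init__(self, backend: Backend):
        self.backend = backend

    def next_action(self, observation: str) -> str:
        prompt = f"Screen: {observation}\nNext action:"
        return self.backend(prompt).strip()


def fake_model_a(prompt: str) -> str:
    # Stand-in for a real API call; returns a canned reply.
    return " click(42) "


def fake_model_b(prompt: str) -> str:
    # A second stand-in, to show that backends are interchangeable.
    return "type('hello')"


for backend in (fake_model_a, fake_model_b):
    print(Agent(backend).next_action("Settings window with a Save button"))
```

Swapping models then means passing a different function, which is what makes a side-by-side comparison of backends straightforward.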

The source notes that a language model optimized for this type of use case could deliver better performance. Agent S also achieved good results in a Windows test environment without special adaptation, suggesting that the core principles may transfer across operating systems.

What Still Needs Work

The researchers identify clear room for improvement. A detailed error analysis attributed about 40 percent of observed problems to weaknesses in task planning or in assigning control commands to screen elements.

Processing speed is another target for improvement. If an assistant takes several minutes to complete a small software setting change, it may not yet save time in everyday use.

That is the broader state of this field. Scientists are exploring several ways to operate user interfaces through natural language input. Rabbit made a similar promise with the Large Action Model Playground, but the source says that promise has not yet been fulfilled.

Agent S shows why this area remains important and difficult. The system can learn from past attempts, read instructions, observe the screen, and act through ordinary controls. But the early results also show that reliability and speed still need to improve before this kind of AI assistant consistently saves more time than it costs.

The Python code for Agent S is freely available on GitHub.