Why AI browsing may need websites built for agents

Researchers at TU Darmstadt introduced VOIX, a framework that adds structured action and context declarations to websites for AI agents. Early tests showed faster task completion than standard AI browser agents, but real adoption would require developers to rethink how they expose site functions.

AI browsers promise a web where people can ask for an outcome and let an agent handle the steps. The hard part is that most websites were built for human eyes, not for software that must infer what buttons, fields, menus, and visual states actually mean.

VOIX, a framework introduced by researchers at TU Darmstadt, proposes a different path. Instead of asking AI agents to visually decode complex interfaces, websites would declare what actions are available and what state the application is in.

What VOIX Adds to a Website

The framework centers on two new HTML elements: <tool> and <context>. The first describes actions an agent can use, including their names, parameters, and descriptions. The second provides current information about the application state.

In a to-do app, for example, a page could include a <tool name="add_task"> element. That tool might define parameters such as "title" and "priority" and connect to the app's JavaScript logic. When an agent needs to create a task, it can call the declared function instead of trying to locate the right input box and button on screen.
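To make the idea concrete, here is a minimal TypeScript sketch of what a declared tool could look like once it is wired to app logic. It is a model of the concept rather than VOIX's actual API: the ToolDeclaration shape, the callTool helper, and the "normal" default priority are all assumptions made for illustration. Only the tool name add_task and the parameters "title" and "priority" come from the example above.

```typescript
// Hypothetical model of a declared tool. These names are illustrative
// assumptions and are not taken from the VOIX codebase.
type ParamType = "string" | "number";

interface ToolParam {
  name: string;
  type: ParamType;
  required: boolean;
}

interface ToolDeclaration {
  name: string;
  description: string;
  params: ToolParam[];
  // The app's own logic, run when an agent calls the tool.
  handler: (args: Record<string, string | number>) => void;
}

interface Task {
  title: string;
  priority: string;
}

const tasks: Task[] = [];

const addTask: ToolDeclaration = {
  name: "add_task",
  description: "Create a new to-do item",
  params: [
    { name: "title", type: "string", required: true },
    { name: "priority", type: "string", required: false },
  ],
  handler: (args) => {
    const priority = args["priority"];
    tasks.push({
      title: String(args["title"]),
      // Assumed default when the agent omits the optional parameter.
      priority: priority === undefined ? "normal" : String(priority),
    });
  },
};

// Check an agent's arguments against the declaration, then run the handler.
function callTool(
  tool: ToolDeclaration,
  args: Record<string, string | number>,
): void {
  for (const p of tool.params) {
    const value = args[p.name];
    if (value === undefined) {
      if (p.required) throw new Error(`Missing parameter: ${p.name}`);
      continue;
    }
    if (typeof value !== p.type) {
      throw new Error(`Parameter ${p.name} must be a ${p.type}`);
    }
  }
  tool.handler(args);
}

// The agent creates a task by name and arguments, with no clicking involved.
callTool(addTask, { title: "Write report", priority: "high" });
```

The point of the sketch is the division of labor: the declaration tells the agent what exists and what arguments are valid, while the handler stays ordinary application code.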

This is the central idea behind VOIX: make the website's capabilities explicit. The user interface can still exist for people, but the agent receives a structured layer that is easier to interpret and less dependent on visual guessing.

Why Visual Browsing Is Fragile

Current AI browser agents often work by treating websites more like images than systems. They inspect screenshots, infer which elements are interactive, decide what to click, and then check whether the expected change occurred. That process can be slow, unreliable, and exposed to attacks.

The researchers describe the problem directly: "Agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions."

VOIX separates the work into clearer roles. The website declares its functions. A browser agent sits between the site and the AI. The inference provider decides what to do using the structured information the site has made available.

The architecture also has privacy implications. According to the source, user conversations go from the browser agent to the LLM provider, while the website stays out of that exchange. Agents also receive only the data explicitly released to them, rather than the full page. Because VOIX runs on the client side, site owners do not have to cover LLM inference costs.
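The division of roles and the resulting data flow can be sketched in a few lines of TypeScript. This is a toy model under stated assumptions, not the VOIX implementation: SiteDeclarations, browserAgent, siteExecute, and the keyword-matching toyProvider are hypothetical names, and a real provider would be an LLM rather than a string check.

```typescript
// Hypothetical types for the three roles; not taken from the VOIX codebase.
interface DeclaredTool {
  name: string;
  description: string;
}

interface SiteDeclarations {
  tools: DeclaredTool[];
  context: string[]; // only the state the site chose to release
}

interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

// Stand-in for an LLM provider. It sees the user's message and the
// declarations, never the full page.
type InferenceProvider = (
  userMessage: string,
  site: SiteDeclarations,
) => ToolCall | null;

// Everything the website observes: tool calls, not the conversation.
const siteLog: ToolCall[] = [];
function siteExecute(call: ToolCall): void {
  siteLog.push(call);
}

// The browser agent sits in the middle and forwards only declared tools.
function browserAgent(
  userMessage: string,
  site: SiteDeclarations,
  provider: InferenceProvider,
): boolean {
  const call = provider(userMessage, site);
  if (call === null) return false;
  if (!site.tools.some((t) => t.name === call.tool)) return false;
  siteExecute(call);
  return true;
}

// Toy provider: a keyword match standing in for model inference.
const toyProvider: InferenceProvider = (msg, site) => {
  const tool = site.tools.find((t) => t.name === "add_task");
  if (!tool || !/^add\s+/i.test(msg)) return null;
  return { tool: tool.name, args: { title: msg.replace(/^add\s+/i, "") } };
};

const site: SiteDeclarations = {
  tools: [{ name: "add_task", description: "Create a to-do item" }],
  context: ["3 open tasks"],
};

const handled = browserAgent("Add milk", site, toyProvider);
```

In this model the site receives a structured call with a title argument, while the raw user message passes only between the agent and the provider.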

Early Developer Tests and Demos

To evaluate the approach, the team ran a three-day hackathon with 16 developers. Six teams built apps with the framework, and most participants had no prior experience with it. The framework scored 72.34 on the System Usability Scale, above the industry average of 68, and developers also gave strong ratings for system understanding and performance.

The hackathon projects showed that the same pattern could apply across very different interfaces. Examples included:

  • A basic graphic design demo where users could click objects and issue voice commands such as "rotate this by 45 degrees."
  • A fitness app that generated full workout plans from prompts such as "create a full week high-intensity training plan for my back and shoulders."
  • A soundscape creator that changed audio environments in response to commands such as "make it sound like a rainforest."
  • A Kanban tool that generated tasks from prompts.

These examples matter because they show VOIX as more than a shortcut for form filling. The framework can expose actions in creative tools, planning apps, audio systems, and productivity boards, as long as developers decide what functions an agent should be allowed to call.

Speed Gains Over Standard AI Agents

The strongest practical argument for VOIX is latency. Benchmarks in the source show VOIX completing tasks in 0.91 to 14.38 seconds. Standard AI browser agents took from 4.25 seconds to over 21 minutes.

One benchmark involved rotating a green triangle 90 degrees. VOIX completed the action in a single second, while Perplexity Comet needed ninety seconds for the same task. The difference comes from how the systems operate: a vision-based agent must interpret the interface, choose the action, and verify the result, while VOIX can use a declared tool.

The source also notes that some complex tasks failed altogether with traditional tools. That points to a broader issue for AI browsing: speed is only one part of usability. If an agent cannot reliably identify what an app can do, it may not matter how capable the underlying language model is.

The Standardization Question

VOIX is not presented as a finished answer for the whole web. The researchers point to real deployment challenges, especially in large or legacy codebases. Tool declarations can drift out of sync with the visible interface, which would create a new maintenance burden.
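One way a team might catch this kind of drift is a simple consistency check run in tests or CI. The sketch below is an assumption about how such a check could work, not something described by the researchers: it presumes the team keeps a list of action names the visible interface offers, and findDrift is a hypothetical helper that compares that list with the declared tool names.

```typescript
// Hypothetical drift check; assumes the team maintains both name lists.
interface DriftReport {
  declaredButNotInUi: string[]; // tools an agent can call with no UI equivalent
  inUiButNotDeclared: string[]; // UI actions an agent cannot reach
}

function findDrift(declaredTools: string[], uiActions: string[]): DriftReport {
  const declared = new Set(declaredTools);
  const ui = new Set(uiActions);
  return {
    declaredButNotInUi: declaredTools.filter((name) => !ui.has(name)),
    inUiButNotDeclared: uiActions.filter((name) => !declared.has(name)),
  };
}

const report = findDrift(
  ["add_task", "delete_task"],
  ["add_task", "archive_task"],
);
// report.declaredButNotInUi is ["delete_task"]
// report.inUiButNotDeclared is ["archive_task"]
```

A check like this only compares names. It would not notice a tool whose behavior has diverged from the button it mirrors, so it reduces the maintenance burden rather than removing it.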

Developers would also need to think differently. They would have to define agent-facing actions, decide which tools should be available, and choose between simple low-level functions and broader intent-based commands. Finding the right level of granularity remains a challenge.

As a reference implementation, the researchers built a Chrome extension with chat and voice support that works with any OpenAI-compatible API. The framework supports both cloud-based and local LLMs and was tested with Qwen3-235B-A22B.
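For readers wondering what "OpenAI-compatible" implies in practice, the sketch below converts a declared tool into the function-tool format used by OpenAI-style chat completion requests. The output shape (type "function" with a JSON Schema under parameters) follows that public request format. Everything else is an assumption for illustration: the ToolDeclaration type, the toFunctionTool helper, and the "example-model" placeholder are hypothetical, and the article does not say how the extension performs this mapping internally.

```typescript
// Hypothetical declaration shape; not taken from the VOIX codebase.
interface ToolParam {
  name: string;
  type: "string" | "number";
  description: string;
  required: boolean;
}

interface ToolDeclaration {
  name: string;
  description: string;
  params: ToolParam[];
}

// Function-tool entry as used in OpenAI-style chat completion requests.
interface FunctionTool {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: {
      type: "object";
      properties: Record<string, { type: string; description: string }>;
      required: string[];
    };
  };
}

function toFunctionTool(tool: ToolDeclaration): FunctionTool {
  const properties: Record<string, { type: string; description: string }> = {};
  for (const p of tool.params) {
    properties[p.name] = { type: p.type, description: p.description };
  }
  return {
    type: "function",
    function: {
      name: tool.name,
      description: tool.description,
      parameters: {
        type: "object",
        properties,
        required: tool.params.filter((p) => p.required).map((p) => p.name),
      },
    },
  };
}

const addTask: ToolDeclaration = {
  name: "add_task",
  description: "Create a new to-do item",
  params: [
    { name: "title", type: "string", description: "Task title", required: true },
    { name: "priority", type: "string", description: "Task priority", required: false },
  ],
};

// Request body only; no network call is made in this sketch.
const requestBody = {
  model: "example-model", // placeholder, not a real model name
  messages: [{ role: "user", content: "Add a high-priority task called Write report" }],
  tools: [toFunctionTool(addTask)],
};
```

Because the request shape is the same for any compatible endpoint, the same body could in principle be sent to a cloud provider or a locally hosted model.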

The larger backdrop is the push toward AI browsers and chatbot-led web use. Companies like OpenAI and Perplexity imagine assistants such as Atlas and Comet handling tasks from booking travel to online shopping without custom APIs. But today's language models still struggle with modern websites, and prompt injection remains a persistent threat.

VOIX fits into a wider move toward making websites more legible to AI systems. The source points to initiatives like llms.txt and the rise of MCP servers as signs that the industry is already looking beyond purely visual browsing. If agent-driven web use becomes common, developers may need to expose not just what a page looks like, but what it can safely do.