Apple’s Ferret-UI 2 points to a practical direction for AI assistants: systems that can understand app screens well enough to act on them. The project is built around reading user interfaces across iPhones, iPads, Android devices, web browsers, and Apple TV, then choosing the right interface elements for a task.
The important shift is from simple screen recognition toward intent-aware control. Instead of depending on exact click coordinates, Ferret-UI 2 is designed to understand what a command means and connect that request to the correct part of an app or interface.
What Ferret-UI 2 Is Built To Do
Ferret-UI 2 is an AI system from Apple that can read and control apps across multiple device categories. Reported support covers iPhones, iPads, Android devices, web browsers, and Apple TV, which makes the system broader than a tool built for a single screen size or operating environment.
At the core, the system recognizes UI elements. That includes basic components such as text and buttons, along with more complex operations that require the model to interpret how parts of a screen relate to one another.
In UI element recognition tests, Ferret-UI 2 scored 89.73, while GPT-4o scored 77.73 on the same tests. The system also improved significantly over its predecessor in basic recognition work and in more advanced tasks.
Those results matter because interface control depends on precision. If an assistant cannot reliably identify the button, text field, menu, or other screen element involved in a request, it cannot safely complete a task inside an app.
Why User Intent Matters More Than Coordinates
A major idea behind Ferret-UI 2 is that an AI system should not need a rigid map of screen coordinates to operate an app. A command may be phrased in ordinary language, while the screen may contain multiple buttons, labels, or input areas. The model has to decide which element fits the user’s goal.
One example command is “Please confirm your input.” In that case, Ferret-UI 2 can identify the appropriate button without needing precise location data. That is a different kind of interface understanding from simply matching a click to a fixed point on a screen.
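To make the idea concrete, here is a deliberately simple sketch of choosing a control by intent rather than by position. It is not how Ferret-UI 2 works internally (that is a trained multimodal model); the `UIElement` class, the `INTENT_SYNONYMS` table, and `pick_element` are all hypothetical names invented for this illustration, and the keyword matching stands in for real language understanding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UIElement:
    role: str   # e.g. "button" or "text_field"
    label: str  # visible text on the element


# Hypothetical table linking an intent word to labels that could express it.
INTENT_SYNONYMS = {
    "confirm": {"confirm", "ok", "submit", "done"},
    "cancel": {"cancel", "back", "dismiss"},
}


def pick_element(command: str, elements: List[UIElement]) -> Optional[UIElement]:
    """Return the button whose label best matches intent words in the command."""
    words = {w.strip(".,!?").lower() for w in command.split()}
    wanted = set()
    for intent, synonyms in INTENT_SYNONYMS.items():
        if intent in words:
            wanted |= synonyms

    best, best_score = None, 0
    for el in elements:
        if el.role != "button":
            continue  # only buttons can be "pressed" in this toy example
        score = len(wanted & set(el.label.lower().split()))
        if score > best_score:
            best, best_score = el, score
    return best  # None when nothing on screen matches the request


# No coordinates are involved: the choice depends only on role and label.
screen = [
    UIElement("text_field", "Email"),
    UIElement("button", "Cancel"),
    UIElement("button", "Submit"),
]
chosen = pick_element("Please confirm your input.", screen)
```

Here the command never mentions the word "Submit," yet the function still lands on that button, which is the behavior the example in the text describes. A real system would replace the synonym table with learned understanding of both the command and the screenshot.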
Apple’s research team used GPT-4o’s visual capabilities to create high-quality training data. That training data helped Ferret-UI 2 learn more about spatial relationships between UI elements, meaning how interface parts sit near, above, below, or alongside one another.
This is important because modern app screens are not just lists of isolated objects. A button may only make sense in relation to nearby text, a field, a dialog, or a previous action. A system that understands those relationships can respond more flexibly when layouts change.
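As a rough illustration of what "spatial relationships" can mean in practice, the sketch below derives a relation label from two bounding boxes. The `Box` class, the `relation` function, and the pixel values are assumptions made up for this example; they are not taken from Apple's training data or code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    left: float
    top: float
    right: float
    bottom: float


def relation(a: Box, b: Box) -> str:
    """Describe where `a` sits relative to `b` (y grows downward, as on screens)."""
    if a.bottom <= b.top:
        return "above"
    if a.top >= b.bottom:
        return "below"
    if a.right <= b.left:
        return "left_of"
    if a.left >= b.right:
        return "right_of"
    return "overlapping"


# Hypothetical login form: a label above a field, with a button to its right.
label = Box("Password label", 20, 100, 120, 120)
field = Box("Password field", 20, 130, 300, 170)
submit = Box("Submit button", 310, 130, 380, 170)

label_vs_field = relation(label, field)
submit_vs_field = relation(submit, field)
```

Relations like these stay the same when a layout is rescaled or shifted, which is one reason they are more robust than absolute coordinates.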
How It Handles Different Screens
Ferret-UI 2 uses an adaptive architecture for recognizing UI elements across platforms. The system also includes an algorithm that automatically balances image resolution and processing requirements for each platform.
According to the researchers, this design is “both information-preserving and efficient for local encoding.” In plain terms, the system is trying to keep enough screen detail to understand the interface while managing the processing demands of different device environments.
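One way to picture that trade-off is as a tile-budget problem: split the screenshot into a grid of sub-images, cap the number of tiles to bound processing cost, and pick the grid whose shape distorts the screen least. The sketch below is an assumption-laden illustration of that idea, not Apple's published algorithm; `choose_grid`, `max_tiles`, and the scoring rule are hypothetical.

```python
import math
from typing import Tuple


def choose_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Pick (cols, rows) with cols * rows <= max_tiles that best fits the screen shape."""
    target = math.log(width / height)  # log aspect ratio of the screenshot
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # How far the grid's shape is from the screenshot's shape.
            distortion = abs(math.log(cols / rows) - target)
            # Prefer low distortion first, then more tiles (more retained detail).
            key = (round(distortion, 6), -(cols * rows))
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


# Example screen sizes in pixels (illustrative values only).
phone_grid = choose_grid(1170, 2532, max_tiles=4)   # tall portrait screen
tv_grid = choose_grid(1920, 1080, max_tiles=4)      # wide landscape screen
square_grid = choose_grid(1000, 1000, max_tiles=4)
```

With a budget of four tiles, the portrait phone gets a one-column, two-row grid and the TV gets a two-column, one-row grid, so each platform keeps detail along its longer side without exceeding the same processing cap.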
The cross-platform results were strongest when moving between related mobile interfaces. Models trained on iPhone data reached 68 percent accuracy on iPads and 71 percent accuracy on Android devices.
The transition was harder between mobile devices and TV or web interfaces. The researchers attributed that difficulty to differences in screen layouts. That limitation makes sense: a phone app, a browser page, and a TV interface can organize controls in very different ways.
- Supported environments: iPhones, iPads, Android devices, web browsers, and Apple TV.
- Reported UI recognition score: 89.73 for Ferret-UI 2, compared with 77.73 for GPT-4o.
- Cross-platform results: 68 percent accuracy on iPads and 71 percent accuracy on Android devices when trained on iPhone data.
- Main challenge: moving between mobile layouts and TV or web interfaces.
Where Ferret-UI 2 Fits In The AI Assistant Race
Apple is not the only company working on AI systems that can understand and interact with user interfaces. Anthropic recently released an updated Claude 3.5 Sonnet with UI interaction capabilities. Microsoft also released OmniParser, an open-source tool that converts screen content into structured data for the same purpose.
These efforts point toward the same broad goal: giving AI systems a usable model of what appears on a screen. Once screen content is structured or understood, an assistant can move beyond answering questions and begin helping with actions inside apps and websites.
Llama- and Gemma-based Ferret-UI models are also available from Hugging Face, along with a demo. That makes the work visible beyond Apple’s internal research description and places it within a wider ecosystem of UI-focused AI models.
What This Could Mean For Siri
Ferret-UI 2 may become more meaningful when combined with other agent systems. Apple also recently unveiled CAMPHOR, a framework that uses specialized AI agents coordinated by a master reasoning agent to handle complex tasks.
Together, CAMPHOR and Ferret-UI 2 could help voice assistants such as Siri analyze and perform tasks that require moving through apps or the web. One example is finding and booking a specific restaurant using only voice commands.
That scenario depends on several abilities working together. The assistant must understand the user’s request, inspect changing screens, identify the correct controls, and continue through a task that may involve several steps. Ferret-UI 2 addresses the screen understanding and UI control part of that chain.
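The chain described above can be sketched as a small loop that re-reads the screen before every action. This is a toy model under heavy assumptions: the `APP` dictionary, the screen names, and `run_task` are invented for illustration, the plan is a fixed list rather than the output of a reasoning agent, and neither CAMPHOR nor Ferret-UI 2 is actually called.

```python
from typing import Dict, List

# Hypothetical app: each screen maps a visible control label to the screen it opens.
APP: Dict[str, Dict[str, str]] = {
    "home": {"Search": "results", "Settings": "settings"},
    "results": {"Reserve": "confirm", "Back": "home"},
    "confirm": {"Confirm": "done", "Cancel": "results"},
    "settings": {"Back": "home"},
    "done": {},
}


def run_task(steps: List[str], start: str = "home") -> List[str]:
    """Follow a planned list of control labels, inspecting the screen before each tap."""
    screen, visited = start, [start]
    for wanted in steps:
        controls = APP[screen]        # inspect the current screen
        if wanted not in controls:    # stop rather than tap a guess
            raise LookupError(f"{wanted!r} not found on screen {screen!r}")
        screen = controls[wanted]     # "tap" the control and move on
        visited.append(screen)
    return visited


# A booking-style plan that crosses several screens.
path = run_task(["Search", "Reserve", "Confirm"])
```

The design point is the check inside the loop: because the screen changes after every step, the assistant has to confirm that the control it wants is actually present before acting, and it should stop when it is not.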
The larger point is straightforward: if AI assistants are going to do useful work inside real apps, they need to understand interfaces as people use them. Ferret-UI 2 is Apple’s attempt to make that understanding more accurate, more adaptable, and less dependent on fixed coordinates.