WIRED AI July 11, 2024

How Gemini is pushing robots beyond simple commands

Google DeepMind has shown a wheeled office robot that uses Gemini to interpret language, video and visual cues while navigating a real workplace. The project points to a broader robotics shift, as labs and startups test whether vision language models can make robots easier to direct and more useful in everyday spaces.

Robots that use Gemini to interpret language and visual cues, and then act in physical workplaces, modestly increase AI autonomy in the real world.

A robot moving through an office is not new. What is changing is the way people can ask that robot for help. Google DeepMind has revealed a wheeled office robot in Mountain View, California, that uses Gemini to understand more natural requests and connect them to action in the physical world.

The system is notable because it does not rely only on rigid commands or a simple map. It combines the latest version of Google’s Gemini large language model with a separate algorithm that turns instructions and visual information into specific robot actions, such as turning. The result is a machine that can respond to requests that require context, perception and a measure of commonsense reasoning.

What Google DeepMind demonstrated

The robot is tall, slender and wheeled. It has been operating in a cluttered open-plan office in Mountain View, California, where it acts as a tour guide and informal helper. Its key upgrade is the use of Gemini to process both commands and the environment around it.

One example shows how different this is from older robot interaction. When a person says, “Find me somewhere to write,” the robot can guide that person to a clean whiteboard somewhere in the building. That request does not name a destination in a mechanical way. It asks the robot to infer what kind of place would satisfy the need.

Gemini helps because it can handle video and text. The system also uses previously recorded video tours of the office, giving the robot information it can draw on when navigating. Instead of treating the workplace as a set of narrow waypoints, the robot can connect language, visual context and remembered surroundings.

The same applies to visual instructions. According to the report, Gemini lets Google’s robot parse visual instructions as well as spoken ones, including following a sketch on a whiteboard that shows a route to a new destination. That makes the interaction feel less like programming a machine and more like giving directions to a colleague.

Why language models matter for robots

Large language models have mostly been experienced through a browser or app. People type or speak, and the model responds with text, images, audio or other digital outputs. The Google DeepMind robot shows another possibility: the model can become part of a control system for a machine that moves through a real environment.

That matters because offices, homes and other human spaces are full of ambiguity. People often do not state exactly what they want in machine-friendly terms. They say things like “Find me somewhere to write” or ask “Where did I leave my coaster?” A useful robot needs to connect those requests to objects, locations and likely human intentions.

Older navigation systems depended on carefully prepared maps and carefully chosen commands. Just a few years ago, a robot needed that kind of setup to navigate successfully. Vision language models change the equation because they are trained on images and video as well as text, and can answer questions that require perception.

Google DeepMind’s system still uses more than Gemini alone. The language model helps interpret commands and environmental information, while another algorithm generates concrete actions for the robot. That division is important: understanding a request is not the same as safely and effectively moving through a room.
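To make that division concrete, here is a minimal Python sketch of a two-stage pipeline. It is an illustration under loud assumptions, not DeepMind's implementation: the tour captions, the office graph, the keyword table and the function names `pick_goal` and `plan_route` are all hypothetical. Simple keyword matching stands in for the language model, and a breadth-first search stands in for the separate action-generating algorithm.

```python
from collections import deque
from typing import List, Optional

# Hypothetical data: short captions standing in for frames of a recorded tour.
TOUR_FRAMES = {
    "lobby": "entrance with reception desk and sofa",
    "hallway": "corridor with printer and recycling bins",
    "kitchen": "kitchen with fridge and drinks",
    "meeting_room": "meeting room with clean whiteboard and markers",
}

# Hypothetical graph: which tour locations connect directly to each other.
GRAPH = {
    "lobby": ["hallway"],
    "hallway": ["lobby", "kitchen", "meeting_room"],
    "kitchen": ["hallway"],
    "meeting_room": ["hallway"],
}

# Stand-in for the model's commonsense step: a need mapped to useful objects.
NEED_TO_OBJECTS = {
    "write": {"whiteboard", "markers"},
    "drink": {"fridge", "drinks"},
}


def pick_goal(request: str) -> Optional[str]:
    """High-level step: pick the tour location that best matches the request."""
    words = set(request.lower().replace("?", " ").replace(".", " ").split())
    wanted = set()
    for need, objects in NEED_TO_OBJECTS.items():
        if need in words:
            wanted |= objects
    scores = {
        name: len(wanted & set(caption.split()))
        for name, caption in TOUR_FRAMES.items()
    }
    best = max(scores, key=scores.get)
    # No overlap at all means the request was not understood.
    return best if scores[best] > 0 else None


def plan_route(start: str, goal: str) -> List[str]:
    """Low-level step: breadth-first search for the route with the fewest hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # goal not reachable from start


goal = pick_goal("Find me somewhere to write")
route = plan_route("lobby", goal) if goal else []
print(goal, route)
```

The two functions share only a location name, so either stage could in principle be replaced without touching the other. That mirrors the separation described above between understanding a request and moving through a room.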

Reliability and usability are the core test

The researchers behind the project say in a new paper that the robot was up to 90 percent reliable at navigating, including when it received difficult commands such as “Where did I leave my coaster?” That number is central because robotics progress is not only about impressive demonstrations. A robot must work often enough, and predictably enough, to be useful.

The team writes that DeepMind’s system “has significantly improved the naturalness of human-robot interaction, and greatly increased the robot usability.” In plain terms, the project is trying to reduce the gap between how humans naturally ask for help and how robots traditionally need to be instructed.

Several capabilities come together in that goal:

Understanding spoken requests that do not name an exact destination.
Using video and text to interpret an office environment.
Drawing on previously recorded video tours of the building.
Following visual directions, including a route sketched on a whiteboard.
Turning interpreted intent into physical movement through an action-generating algorithm.

Each piece matters because real-world robotics is unforgiving. A chatbot can answer again if it misunderstands a prompt. A robot has to move, turn and navigate among people, furniture and changing surroundings. That makes natural language useful, but only when it is connected to reliable perception and control.

A wider race in AI robotics

The Google DeepMind demo sits inside a broader push across academic and industry research labs, where researchers are exploring how large language models and vision language models can improve robot abilities. The May program for the International Conference on Robotics and Automation listed almost two dozen papers involving the use of vision language models.

Investment is moving in the same direction. Several researchers involved with the Google project later left the company to found Physical Intelligence, which received an initial $70 million in funding. The company is working to combine large language models with real-world training so robots can develop more general problem-solving abilities.

Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal. This month it announced $300 million in funding. Those figures show that the race is not limited to research papers or controlled demonstrations. Startups are trying to turn advances in AI into more capable machines.

Demis Hassabis, CEO of Google DeepMind, had already pointed toward this direction when Gemini was introduced in December. He told WIRED that Gemini’s multimodal capabilities would likely unlock new robot abilities, and said researchers at the company were testing the model’s robotic potential. In May, Hassabis also showed an upgraded version of Gemini that could make sense of an office layout through a smartphone camera.

What comes next

The researchers say they plan to test the system on different kinds of robots. That step matters because a wheeled office helper is only one form factor. If the approach works more broadly, the same kind of model-based understanding could support other machines with different bodies and tasks.

They also suggest Gemini should be able to handle more complex questions. One example from the paper is “Do they have my favorite drink today?” from a user whose desk has many empty Coke cans. The point is not just recognizing a drink. It is connecting a personal pattern, a current request and the surrounding context.

That is why this project is larger than a single office robot. It shows how AI systems that understand language, video and images may give robots a more flexible interface with the world. The challenge now is to make that understanding dependable enough for everyday use beyond a polished demo.