Robots often fail not because a task is impossible, but because human instructions leave too much unsaid. MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) is exploring a way to close that gap with large language models, so a robot can learn from a mix of physical demonstration and natural language without requiring a person to spell out every detail.
Why vague instructions are hard for robots
A person might tell a robot to place coffee on a desk while a Zoom call is happening. The instruction sounds simple, but the real task contains hidden preferences: the robot should avoid getting too close to the person and the laptop, because either could interrupt the meeting.
That is the kind of detail humans often understand from context but robots must be trained to recognize. Computer scientists have tried to teach manipulation tasks either by collecting many physical demonstrations or by writing detailed instructions. The source article explains that when a robot does not have both kinds of information, it can misunderstand what it is supposed to do.
MIT CSAIL’s answer is called “Masked Inverse Reinforcement Learning,” or Masked IRL. The approach automates more of the teaching process, clarifies ambiguous instructions, and uses nearly five times less demonstration data than the baselines it was compared against.
How Masked IRL uses LLMs
Masked IRL begins with data from a user’s demonstration. The system uses the robot’s sensors to capture the surrounding environment and record the movements in a kinesthetic demonstration. In this training method, a person physically moves the robot through the action so it can learn how to grab, move, or place objects.
MIT’s system then uses an LLM to compare that motion sequence, called a trajectory, with the shortest possible path. The same model also expands unclear prompts into more specific ones. A request such as “stay close” can become “stay close to the surface of the table.”
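To make the comparison concrete, here is a minimal Python sketch of one way a demonstrated trajectory could be measured against the shortest path. It is an illustration only, not MIT's implementation: in Masked IRL an LLM performs the comparison, and the article does not say how. The sketch assumes waypoints are 3D coordinates and that the shortest path is the straight segment from the first waypoint to the last. The function name deviation_from_shortest_path and the sample data are hypothetical.

```python
import math


def deviation_from_shortest_path(trajectory):
    """Return each waypoint's distance to the straight start-to-goal segment.

    Assumption for illustration: the "shortest possible path" is the straight
    line between the first and last waypoints of the demonstration.
    """
    start, goal = trajectory[0], trajectory[-1]
    segment = [g - s for s, g in zip(start, goal)]
    segment_len_sq = sum(c * c for c in segment)

    deviations = []
    for point in trajectory:
        rel = [p - s for s, p in zip(start, point)]
        if segment_len_sq == 0:
            t = 0.0  # start and goal coincide
        else:
            # Project the waypoint onto the segment and clamp to its ends.
            t = sum(r * c for r, c in zip(rel, segment)) / segment_len_sq
            t = max(0.0, min(1.0, t))
        closest = [s + t * c for s, c in zip(start, segment)]
        dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, closest)))
        deviations.append(dist)
    return deviations


# Hypothetical demonstration: the arm swings sideways at the midpoint.
demo = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
deviations = deviation_from_shortest_path(demo)
```

Waypoints with a large deviation, such as the middle one in this sample, mark where the demonstrator left the direct route. Those detours are the places where an unstated preference, like steering around a laptop, might be hiding.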
That clarification matters because the robot is not just copying movement. It is trying to infer why the demonstrated motion matters for the task. The LLM helps connect the human’s words, the observed trajectory, and the practical constraints that may not have been fully stated.
A second LLM handles another part of the problem: deciding what in the environment should influence the robot’s motion plan. It evaluates details such as obstacles and the shape of the target object, then masks out information that appears irrelevant.
In the system, each detail is scored as either a “1” for important or a “0” for not so important. If a user happened to lean on a table during a demonstration, that could be treated as a “0.” Only the details scored “1” are passed on to the algorithm that creates the final action plan.
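The 1-or-0 scoring can be pictured with a short Python sketch. It is a simplified illustration under stated assumptions, not the researchers' code: the feature names, the weights, and the weighted-sum score are all hypothetical, and in Masked IRL the mask values come from an LLM rather than being written by hand.

```python
def apply_mask(features, mask):
    """Keep features marked 1 and zero out everything else.

    Features missing from the mask are treated as irrelevant (0).
    """
    return {
        name: (value if mask.get(name, 0) == 1 else 0.0)
        for name, value in features.items()
    }


def masked_score(features, mask, weights):
    """Hypothetical stand-in for a learned reward: a weighted sum
    computed over the masked features only."""
    kept = apply_mask(features, mask)
    return sum(weights.get(name, 0.0) * value for name, value in kept.items())


# Hypothetical scene description for the coffee-delivery example.
features = {
    "distance_to_laptop": 0.4,
    "distance_to_human": 0.9,
    "human_leaning_on_table": 1.0,
}
# 1 = important, 0 = not so important (hand-written here for illustration).
mask = {
    "distance_to_laptop": 1,
    "distance_to_human": 1,
    "human_leaning_on_table": 0,
}
# Hypothetical weights; a real system would learn these from demonstrations.
weights = {
    "distance_to_laptop": 2.0,
    "distance_to_human": 1.0,
    "human_leaning_on_table": 5.0,
}

masked = apply_mask(features, mask)
score = masked_score(features, mask, weights)
```

Because the leaning detail is masked out, changing its value leaves the score unchanged, which is the point of the mask: incidental details from a demonstration cannot sway the plan.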
What the tests showed
The masks gave Masked IRL an advantage in 3D and real-world demonstrations because the robot learned which information deserved attention. Virtual and real robots used the approach to move objects around obstacles, including moving a coffee mug around a laptop to different positions on a table.
In those tasks, Masked IRL identified user preferences that were not explicitly stated in prompts up to 15 percent more often than comparable baselines. The researchers also found in simulation experiments that the system needed fewer demonstrations to learn how to move the mug than its baselines.
The article also notes that robots performed better when an LLM clarified instructions, compared with having the machine attempt to follow a vague request directly. That points to a practical role for LLMs in robotics: not as the whole control system, but as a layer that helps translate human intent into useful task constraints.
The results with a real robotic arm are especially important because the system handled prompts it had not seen during training. After being trained on 50 kinesthetic demonstrations, the robot moved a cup toward a human while avoiding a user’s computer. The system learned to treat the computer as an obstacle from a general request to “stay away.”
It also wiped a table while “staying close” to it, and handed a user a bag of chips while “staying away” from both a human and a table.
What this could mean for everyday robots
The source article frames the work around chores in homes, offices, and factories. In each setting, a robot may need to act safely around details that a person does not think to mention. A snack-fetching robot may need to avoid a laptop. A factory robot placing items into boxes may need to navigate around shelves.
That makes the core challenge less about understanding isolated words and more about identifying relevant context. Masked IRL is designed to sense and explain what users leave unsaid, then focus the robot’s motion planning on the parts of the scene that matter.
MIT PhD student and CSAIL researcher Minyoung Hwang, a lead author on a paper presenting the project, described the goal this way: “Our approach could come in handy when a human interacts with a robot but doesn’t want to spell out all the details of a task,” adding, “We’re minimizing human effort by enabling machines to get to the bottom of what users really want.”
What comes next
The researchers plan to make Masked IRL more dynamic by adding cameras, so a robot can use images of its surroundings. With that capability, the system could highlight and focus on nearby elements, while ignoring objects that are not relevant to the task. The article gives the example of a robot asked to pick up a toy while ignoring bananas nearby.
Hwang wrote the paper with three CSAIL colleagues: PhD student Alexandra Forsey-Smerek ’20, SM ’22; postdoc Nathaniel Dennler; and MIT Assistant Professor Andreea Bobu, who is a member of the Department of Aeronautics and Astronautics and CSAIL. Their work was supported, in part, by the Tata Group via the MIT Generative AI Impact Consortium Award, and the Department of Defense.
The team will present the project at the 2026 IEEE International Conference on Robotics and Automation in June.