Microsoft Tests a Large Action Model That Can Work in Word

Microsoft researchers have built a Large Action Model that can operate Windows programs and complete some Word tasks on its own. In a Word test environment, the model was faster than GPT-4o and scored higher than GPT-4o without visual information, though GPT-4o performed better when given visual input.

Microsoft researchers have developed a new kind of AI system called a "Large Action Model" that is designed to do more than generate text. The model can operate Windows programs, including some tasks inside Word, turning a user request into a sequence of actions.

The work points to a broader change in AI development: systems that do not only explain what should happen, but can take steps toward completing the task. The research also shows why that shift is difficult. Accuracy, safety, regulation, and scaling remain open challenges.

What makes a Large Action Model different

Traditional language models such as GPT-4o mainly process and produce text. A Large Action Model, or LAM, is built to convert a request into action. According to the source article, that can mean operating software or controlling robots.

The key difference is practical execution. A language model may describe how to complete a workflow. A LAM is trained to create a plan and then carry out steps in an environment where the state can change as the task unfolds.

According to the source, LAMs can interpret different kinds of input, including text, voice, or images. From there, they turn intent into detailed step-by-step plans. They can also adjust their approach based on what is happening in real time.

That matters because desktop software tasks are rarely just one command. A user may ask for a document change, a formatting adjustment, or a structured setup inside an app. The system needs to understand the goal, choose the next action, check whether it worked, and continue.
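
The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not Microsoft's implementation: ToyDocument, choose_next_action, apply_action, and run_task are hypothetical names, and the "application" is a small in-memory object rather than Word.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Stand-in for an application's state; not a real Word interface."""
    text: str = ""
    bold: bool = False
    log: list = field(default_factory=list)


def choose_next_action(goal, doc):
    """Pick the next action by comparing the goal with the current state."""
    if doc.text != goal["text"]:
        return ("type_text", goal["text"])
    if doc.bold != goal["bold"]:
        return ("set_bold", goal["bold"])
    return None  # state already matches the goal


def apply_action(doc, action):
    """Carry out one action and record its name in the log."""
    name, value = action
    if name == "type_text":
        doc.text = value
    elif name == "set_bold":
        doc.bold = value
    doc.log.append(name)


def run_task(goal, doc, max_steps=10):
    """Check the state, act, and re-check until the goal is met or steps run out."""
    for _ in range(max_steps):
        action = choose_next_action(goal, doc)
        if action is None:
            return True
        apply_action(doc, action)
    return choose_next_action(goal, doc) is None


doc = ToyDocument()
done = run_task({"text": "Quarterly report", "bold": True}, doc)
print(done, doc.log)
```

The point of the sketch is the shape of the loop: the next action is chosen from the current state rather than from a fixed script, so a step that fails or is already done changes what happens next.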

How Microsoft trained the Word-focused model

The researchers describe a four-step process for building a LAM. First, the model learns how to divide tasks into logical steps. That foundation helps it move from a broad request toward a practical plan.

Second, it learns from more advanced AI, including GPT-4o, to translate plans into actions. Third, it explores new solutions on its own, including problems that other AI systems could not solve. Finally, the system is refined through reward-based training.

For the test case, Microsoft researchers built a LAM based on Mistral-7B and used it in a Word test environment. The goal was not only to see whether the model could reason about tasks, but whether it could complete them in the software setting.

The training data started with 29,000 task-plan pairs. Those examples came from Microsoft documentation, wikiHow articles, and Bing searches. The team then used GPT-4o to make simple tasks more complex.

One example in the source shows how this worked. A basic task, "Create a drop-down list", became "Create a dependent drop-down list where the first selection filters the options in the second list." That process helped create more demanding examples for the model to learn from.
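
A rough sketch of that idea follows, under stated assumptions. TaskPlanPair, evolve, and toy_rewrite are hypothetical names, the plan steps are invented placeholders, and toy_rewrite stands in for the GPT-4o call, whose actual prompt and output format the source does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskPlanPair:
    task: str
    plan: tuple  # ordered plan steps


def evolve(pair, rewrite):
    """Return a harder variant of `pair` using a caller-supplied rewriter."""
    harder_task, harder_plan = rewrite(pair.task, pair.plan)
    return TaskPlanPair(harder_task, tuple(harder_plan))


def toy_rewrite(task, plan):
    """Assumed stand-in for a model call; hard-codes the article's one example."""
    if task == "Create a drop-down list":
        return (
            "Create a dependent drop-down list where the first selection "
            "filters the options in the second list",
            list(plan) + ["Link the second list's options to the first selection"],
        )
    return task, list(plan)  # unknown tasks pass through unchanged


# Placeholder plan steps, not taken from the source.
seed = TaskPlanPair(
    "Create a drop-down list",
    ("Insert a drop-down control", "Add the list options"),
)
evolved = evolve(seed, toy_rewrite)
dataset = [seed, evolved]  # the evolved pair is added, not substituted
print(evolved.task)
print(len(dataset))
```

Keeping the seed alongside its evolved variant is one plausible way a collection could grow during such a process; whether the researchers retained the original pairs is not stated in the source.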

The benchmark results show speed and limits

In the Word test environment, the LAM completed tasks successfully 71% of the time. GPT-4o, without visual information, reached a 63% success rate. On that comparison, the LAM performed better.

The speed difference was also notable. The LAM needed only 30 seconds per task, while GPT-4o needed 86 seconds. For any AI system expected to operate software, time per task is part of the user experience, not a side detail.

But the result was not one-sided. When GPT-4o received visual information, it reached a 75.5% success rate. That means the LAM was faster and more accurate than GPT-4o without visual input, while GPT-4o was more accurate when visual information was available.
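
The comparison can be checked with a few lines of arithmetic. The figures below are the ones reported above; the variable names are illustrative, and the article does not say which GPT-4o configuration the 86-second figure refers to, so the timing is kept separate from the success rates.

```python
# Success rates in percent, as reported in the article.
success_pct = {
    "LAM": 71.0,
    "GPT-4o, no visual input": 63.0,
    "GPT-4o, visual input": 75.5,
}

# Average seconds per task, as reported in the article.
seconds_per_task = {"LAM": 30, "GPT-4o": 86}

# Gaps in percentage points between the LAM and each GPT-4o configuration.
lead_over_text_only = success_pct["LAM"] - success_pct["GPT-4o, no visual input"]
gap_to_visual = success_pct["GPT-4o, visual input"] - success_pct["LAM"]

# How many times longer GPT-4o took per task than the LAM.
speed_ratio = round(seconds_per_task["GPT-4o"] / seconds_per_task["LAM"], 2)

print(lead_over_text_only, gap_to_visual, speed_ratio)
```

On these numbers the LAM leads text-only GPT-4o by 8 percentage points, trails the visual configuration by 4.5 points, and is a little under three times faster per task.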

The dataset also changed substantially during development. The "data evolving" strategy expanded the collection from 29,000 task-plan pairs to 76,000 pairs. The source describes this as a 150% increase, although the rounded figures quoted here work out to roughly 162%. From those examples, about 2,000 successful action sequences became part of the final training set.

Why this matters for AI assistants

The practical promise is easy to understand. If an AI assistant can operate software directly, it could help with real tasks instead of only giving instructions. In the source article, that shift is framed as moving from systems that understand and generate text toward systems that actively help complete real-world tasks.

Word is a useful test case because it is a familiar productivity environment with many possible actions. A system that can complete tasks there must connect language, planning, and execution. It must also verify that the action taken actually matches the user’s goal.

The same idea could apply beyond a single program, but the source is careful about the barriers. The system still faces several hurdles:

  • AI actions can go wrong.
  • Regulatory questions remain unresolved.
  • Technical limitations make it hard to scale up.
  • Adapting the approach to different applications is difficult.

Those concerns are central to the future of Large Action Models. Acting inside software raises different risks than answering a question in text. A mistaken answer is one kind of problem; a mistaken action inside an application can create a more direct consequence.

A step toward more active AI systems

The researchers believe LAMs are an important shift in AI development. They also say these "Large Action Models" mark a significant step toward artificial general intelligence (AGI).

The strongest takeaway is not that the Word-focused model solves the problem. It does not. The results show useful progress, especially in speed and in performance against GPT-4o without visual information, but they also show that visual context can still change the outcome.

For now, Microsoft’s Large Action Model is best understood as an early example of where AI assistants may be heading. The focus is moving from conversation alone toward systems that can interpret a request, make a plan, and perform actions inside the tools people already use.