Why 78 examples may reshape autonomous AI agent training

A study on LIMI argues that 78 carefully selected training examples can produce strong autonomous AI agent behavior. Its AgencyBench results suggest that high-quality, full-process examples may matter more than very large training sets.

A new study is challenging one of the most common assumptions in AI development: that better autonomous agents require ever-larger training datasets. The LIMI approach, short for "Less Is More for Intelligent Agency", reports strong results from only 78 carefully chosen training examples.

The claim matters because autonomous agents are not just chatbots that answer prompts. In the study, agency means the ability of AI systems to act independently by discovering problems, forming hypotheses, and solving tasks through interaction with environments and tools.

A smaller dataset with a bigger training signal

LIMI takes a different route from large-scale training pipelines. Instead of relying on massive volumes of examples, it uses 78 handpicked samples drawn from real software development and research tasks.

Each example is designed to show the full process of human-AI collaboration. That includes the original request, tool use, problem-solving steps, and final successful completion. The aim is not simply to teach a model to produce an answer, but to teach it how to behave across a task from start to finish.
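As a rough illustration of what a "full process" example contains, the sketch below models one trajectory as a request, a sequence of steps, and a completion flag. The class and field names (Step, Trajectory, tool_name, is_full_process) are hypothetical and chosen for this illustration; they are not LIMI's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One turn in a trajectory. Field names are illustrative assumptions."""
    role: str                          # "user", "assistant", or "tool"
    content: str                       # message text, reasoning, or tool output
    tool_name: Optional[str] = None    # set only when the step calls a tool


@dataclass
class Trajectory:
    """A full-process example: request, intermediate steps, final outcome."""
    request: str
    steps: List[Step] = field(default_factory=list)
    completed: bool = False

    def tool_calls(self) -> int:
        # Count the steps that invoke a tool.
        return sum(1 for s in self.steps if s.tool_name is not None)

    def is_full_process(self) -> bool:
        # Hypothetical check: a request, at least one tool call,
        # and a successful completion.
        return bool(self.request) and self.tool_calls() > 0 and self.completed


example = Trajectory(
    request="Build a to-do list app",
    steps=[
        Step("assistant", "Plan: scaffold the project, then add storage."),
        Step("assistant", "Create project files", tool_name="shell"),
        Step("tool", "created 3 files"),
        Step("assistant", "All requirements are implemented."),
    ],
    completed=True,
)
print(example.tool_calls(), example.is_full_process())
```

An answer-only sample would fail this check because it has no tool calls, which is the gap the full-process examples are meant to close.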

This distinction is important. Autonomous agents often need to decide what to do next, use tools, adjust after intermediate results, and continue until the job is done. A short answer can be correct in a normal benchmark while still revealing little about whether the model can manage a longer workflow.

Some LIMI trajectories stretched to 152,000 tokens. That detail suggests the training examples were not shallow demonstrations. They captured long, complex behaviors in which the model had to follow a task over many steps.

What AgencyBench measured

The study tested LIMI on AgencyBench, a benchmark built around real-world scenarios. The tasks include building C++ chat apps, Java to-do lists, AI-powered games, microservice pipelines, and research work such as LLM comparisons, data analysis, and business or sports analytics.

On AgencyBench, LIMI scored 73.5 percent while using only 78 training samples. The comparison with other models is central to the study's argument. DeepSeek-V3.1 scored 11.9 percent, Kimi-K2-Instruct reached 24.1 percent, Qwen3-235B-A22B-Instruct reached 27.5 percent, and GLM-4.5 scored 45.1 percent.
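To make the size of those gaps concrete, the short sketch below computes how many percentage points separate LIMI from each baseline. It uses only the scores quoted above; the dictionary and the helper name gaps_to are illustrative, not part of the study's code.

```python
# AgencyBench scores as reported in the article (percent).
agencybench_scores = {
    "LIMI": 73.5,
    "GLM-4.5": 45.1,
    "Qwen3-235B-A22B-Instruct": 27.5,
    "Kimi-K2-Instruct": 24.1,
    "DeepSeek-V3.1": 11.9,
}


def gaps_to(leader, scores):
    """Return the percentage-point lead of `leader` over every other model."""
    top = scores[leader]
    return {
        name: round(top - score, 1)
        for name, score in scores.items()
        if name != leader
    }


gaps = gaps_to("LIMI", agencybench_scores)
for name, gap in gaps.items():
    print(f"{name}: LIMI leads by {gap} points")
```

By this arithmetic, the narrowest lead is 28.4 points over GLM-4.5 and the widest is 61.6 points over DeepSeek-V3.1.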

LIMI also completed 71.7 percent of requirements on the first try. Its overall success rate was 74.6 percent, compared with GLM-4.5's 47.4 percent. On standard coding and scientific computing benchmarks, LIMI reached a 57.2 percent average and led all baselines reported in the source.

The study also compared LIMI with models trained on larger alternative datasets. GLM-4.5-Code, trained on 10,000 samples, reached only 47.8 percent on AgencyBench. That contrast is the clearest expression of the paper's core message: more examples do not automatically mean stronger agent behavior.
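The contrast can be put in numbers with a quick calculation. The sketch below uses only the sample counts and scores quoted above; the variable names are illustrative.

```python
# Figures quoted in the article.
limi_samples, limi_score = 78, 73.5        # LIMI
code_samples, code_score = 10_000, 47.8    # model trained on 10,000 samples

# How many times more training samples the larger dataset contains.
sample_ratio = round(code_samples / limi_samples, 1)

# LIMI's lead on AgencyBench, in percentage points.
score_gap = round(limi_score - code_score, 1)

print(f"{sample_ratio}x more samples, {score_gap} points lower")
# prints "128.2x more samples, 25.7 points lower"
```

In other words, the 10,000-sample run used roughly 128 times as much data and still finished about 26 points behind.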

Why curated examples could matter

The source article frames LIMI as a direct challenge to brute-force scaling. The argument is not that data no longer matters. It is that carefully selected data may carry more useful information than a much larger collection of weaker examples.

For autonomous agents, the valuable training signal may live in the structure of the work. A model needs to see how a task begins, how tools are selected, how partial progress is evaluated, and how a final result is reached. If an example captures that complete arc, it may teach more than many isolated snippets.

The reported results also suggest that training data quality and task realism are closely linked. LIMI's examples come from real software development and research tasks, which means they are aligned with the kinds of open-ended workflows autonomous systems are expected to handle.

  • Task continuity: the examples show work from request to completion.
  • Tool interaction: the model sees how tools fit into problem-solving.
  • Complex reasoning: long trajectories expose the model to extended decision paths.
  • Outcome focus: examples end in final success, giving the model a complete pattern to learn.

Those qualities help explain why a small dataset could still have a large effect. The study's position is that a compact set of rich examples can teach agentic behavior more efficiently than a broad dataset that lacks the same structure.

Model size and the next question

The LIMI approach worked across different model sizes in the reported results. LIMI-Air, with 106 billion parameters, improved from 17.0 percent to 34.3 percent. The larger LIMI, with 355 billion parameters, improved from 45.1 percent to 73.5 percent.
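The two improvements look different depending on whether they are read as absolute or relative gains. The sketch below computes both from the before and after scores quoted above; the gains helper is an illustrative name, not something from the study.

```python
# (before, after) AgencyBench scores in percent, as quoted in the article.
results = {
    "LIMI-Air (106B)": (17.0, 34.3),
    "LIMI (355B)": (45.1, 73.5),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round((after - before) / before * 100, 1)
    return absolute, relative


for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
# prints "LIMI-Air (106B): +17.3 points (101.8% relative)"
# prints "LIMI (355B): +28.4 points (63.0% relative)"
```

The smaller model roughly doubles its score, while the larger model gains more points in absolute terms and ends higher.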

That does not mean scale is irrelevant. The larger model still reached the higher final score. But the gains across both sizes support the broader claim that training method and data selection can substantially affect agent performance.

The source article also notes a related argument from Nvidia researchers: most AI agents use language models that are far too large, and models under 10 billion parameters may be sufficient for agentic tasks. LIMI's results are presented as support for the idea that careful data curation can outperform brute-force scaling.

Still, the source is clear that LIMI is not yet a settled replacement for current methods. The approach looks promising, but it needs more research and real-world testing before it can become a new standard for autonomous AI systems.

The code, models, and datasets are public, which gives other researchers a path to examine the work more closely. For now, the study's most important contribution is its challenge to a familiar assumption: when training autonomous AI agents, the best next step may not always be more data. It may be better examples.