MIT Tech Review AI November 13, 2025

Why OpenAI built a weaker LLM to see inside AI

OpenAI has built an experimental large language model called a weight-sparse transformer to make AI behavior easier to inspect. It is far smaller and less capable than GPT-5, Claude, or Gemini, but it may help researchers understand hallucinations, failures, and trust in LLMs.

OpenAI has built an experimental large language model that is not meant to win a race against today’s leading AI systems. Its purpose is different: to make the inner workings of an LLM easier to examine.

The model, called a weight-sparse transformer, is far smaller and far less capable than products such as GPT-5, Anthropic’s Claude, and Google DeepMind’s Gemini. But OpenAI hopes that by studying a more transparent system, researchers can learn more about why large language models hallucinate and behave unpredictably, and about how much trust they deserve in critical tasks.

What OpenAI Built

The work comes from OpenAI, the maker of ChatGPT, and sits inside a research area known as mechanistic interpretability. That field tries to map the internal mechanisms models use when they complete tasks.

Leo Gao, a research scientist at OpenAI, described the safety motivation clearly: "It’s very important to make sure they’re safe." The point is not only to make a smaller model easier to inspect, but to use it as a window into the larger systems that are increasingly being considered for important domains.

By OpenAI’s own description, the new model is not competitive with the best systems on the market. Gao says that at most it is as capable as GPT-1, which OpenAI developed back in 2018, though he and his colleagues have not done a direct comparison.

That limitation is part of the design tradeoff. The model gives up performance in exchange for interpretability. Instead of focusing on the strongest possible output, OpenAI is focusing on whether researchers can follow the model’s internal steps with more confidence.

Why LLMs Are So Hard To Understand

Today’s LLMs are often described as black boxes because nobody fully understands how they do what they do. They are built from neural networks made of nodes, called neurons, arranged in layers.

In most networks, each neuron connects to every neuron in the adjacent layers. This is known as a dense network. Dense networks are relatively efficient to train and run, but that efficiency comes with a major interpretability problem.

What the model learns can be spread across a large tangle of connections. A simple concept or function may not live in one obvious place. It can be split across neurons in different areas of the model.

At the same time, individual neurons can represent multiple different features. The source article identifies this as superposition, a term borrowed from quantum physics. The practical result is that researchers cannot easily point to a specific part of a model and say that it corresponds cleanly to one concept or one function.
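Superposition can be illustrated with a toy example. The sketch below is not taken from OpenAI's work: the three feature names, the two-neuron layer, and the evenly spaced directions are all assumptions chosen for illustration. It shows that when three features share two neurons, every feature moves the same neuron, and reading one feature back picks up interference from the others.

```python
import math

# Hypothetical toy setup: three features must share two neurons.
# Each feature gets a direction in the 2-neuron activation space,
# spaced 120 degrees apart, so the directions cannot all be orthogonal.
FEATURES = ["quote", "digit", "capital"]
DIRECTIONS = {
    name: (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i, name in enumerate(FEATURES)
}


def encode(active):
    """Sum the directions of the active features into two neuron activations."""
    n0 = sum(DIRECTIONS[f][0] for f in active)
    n1 = sum(DIRECTIONS[f][1] for f in active)
    return (n0, n1)


def readout(activation):
    """Project the two activations back onto each feature direction."""
    return {
        f: activation[0] * d[0] + activation[1] * d[1]
        for f, d in DIRECTIONS.items()
    }


# Neuron 0 takes a nonzero value for every one of the three features.
for feature in FEATURES:
    print(feature, [round(x, 2) for x in encode([feature])])

# Only "quote" is active, yet the other two readouts are not zero.
print({f: round(v, 2) for f, v in readout(encode(["quote"])).items()})
```

In this toy, no single neuron can be labeled with one feature, which is the difficulty the paragraph above describes.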

That is why even apparently simple model behavior can be difficult to explain. A language model may answer a straightforward prompt quickly, but tracing how it produced that answer can mean untangling many neurons and connections.

What Sparsity Changes

OpenAI’s experiment changes the architecture. Instead of using a dense network, the company started with a weight-sparse transformer, where each neuron connects to only a few other neurons.

This forces the model to represent features in localized clusters instead of spreading them broadly through the network. That makes it easier to connect neurons, or groups of neurons, to specific concepts and functions.
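The difference between dense and weight-sparse connectivity can be sketched in a few lines. The article does not describe how OpenAI enforces sparsity, so the approach below (keeping only the k largest-magnitude incoming weights per neuron) is an assumption, as are the layer size and the helper names dense_layer, sparsify, and count_connections.

```python
import random


def dense_layer(n_in, n_out, seed=0):
    """Build a dense weight matrix: every output neuron sees every input."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def sparsify(weights, k):
    """Keep the k largest-magnitude incoming weights per neuron; zero the rest.

    Assumed rule for illustration only; not OpenAI's documented method.
    """
    sparse = []
    for row in weights:
        ranked = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        keep = set(ranked[:k])
        sparse.append([w if i in keep else 0.0 for i, w in enumerate(row)])
    return sparse


def count_connections(weights):
    """Count nonzero weights, i.e. connections that actually carry signal."""
    return sum(1 for row in weights for w in row if w != 0.0)


dense = dense_layer(n_in=64, n_out=64)
sparse = sparsify(dense, k=4)
print("dense connections:", count_connections(dense))
print("sparse connections:", count_connections(sparse))
```

With only four incoming connections per neuron, tracing which inputs influence a given neuron becomes a short lookup rather than a search through thousands of weights.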

The cost is speed and capability. According to the source article, the model is far slower than any LLM on the market and far less capable than top-tier mass-market models.

Still, the benefit is significant for researchers. Gao said, "There’s a really drastic difference in how interpretable the model is," and the experiments described in the source show why that matters.

OpenAI tested the model on very simple tasks. In one case, researchers asked it to complete a block of text that begins with quotation marks by adding matching marks at the end. That is an easy request for an LLM, but the research value is in tracing how the model performs it.

With the new model, OpenAI says researchers were able to follow the exact steps the model took. Gao said the team found a circuit that matched the algorithm a person might implement by hand, even though the model had learned it itself.
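For a sense of what a hand-written version of that algorithm could look like, here is a minimal sketch. It is not the circuit OpenAI found: the function name closing_quote and the two supported marks are assumptions, and the logic ignores complications such as apostrophes inside words.

```python
QUOTE_MARKS = ('"', "'")


def closing_quote(text):
    """Return the quote mark needed to close `text`, or None if none is open."""
    open_mark = None
    for ch in text:
        if open_mark is None and ch in QUOTE_MARKS:
            open_mark = ch      # remember which mark opened the quotation
        elif ch == open_mark:
            open_mark = None    # the matching mark appeared, so it is closed
    return open_mark


print(closing_quote('"hello world'))
print(closing_quote("'it works"))
print(closing_quote('"done"'))
```

The two steps, remembering which mark opened the quotation and reproducing it at the end, are the kind of behavior that is simple to state but hard to locate inside a dense network.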

How Outside Researchers View The Work

The source article includes positive reactions from researchers outside the project. Elisenda Grigsby, a mathematician at Boston College who studies how LLMs work and was not involved, said, "I’m sure the methods it introduces will have a significant impact."

Lee Sharkey, a research scientist at AI startup Goodfire, also responded favorably to the direction of the work. The broader point is that the research is aimed at a central problem in AI: models are becoming more powerful while remaining difficult to inspect.

Mechanistic interpretability tries to narrow that gap. If researchers can understand the mechanisms behind even small model behaviors, they may gain better tools for studying larger and more complex systems.

But the source also makes the limits clear. Grigsby is not convinced that the technique would scale up to larger models that must handle a variety of more difficult tasks. Gao and Dan Mossing, who leads the mechanistic interpretability team at OpenAI, acknowledge that this is a major limitation.

What Could Come Next

OpenAI does not expect this approach to produce models that match cutting-edge products such as GPT-5. The research is still early, and the model built so far is smaller, slower, and less capable than leading systems.

Even so, OpenAI thinks the technique might improve enough to build a transparent model on a par with GPT-3, the company’s breakthrough 2020 LLM. Gao suggested that within a few years, researchers might have a fully interpretable GPT-3-scale system.

If that happens, the value would not be only in one model. It would be in the knowledge gained from being able to inspect how a language model performs its tasks from the inside.

For now, the weight-sparse transformer is best understood as a research instrument. It is weaker by design, but that weakness may make it more useful for answering a question that matters across AI: how do these systems actually work?