Generative AI is built on a powerful idea that is now showing strain. Transformers have driven many of the best-known AI systems, but their appetite for computation has become a central technical challenge. That is why test-time training (TTT) models are attracting attention as researchers search for architectures that can handle more data with less compute.
Why transformers are under pressure
Transformers sit at the center of major AI systems, including OpenAI’s video-generating model Sora and text-generating models such as Anthropic’s Claude, Google’s Gemini and OpenAI’s GPT-4o. Their success has made them the default reference point for modern generative AI.
But the same architecture that made transformers effective also creates a scaling problem. They are not especially efficient at processing and analyzing vast amounts of data on off-the-shelf hardware. As companies build and expand infrastructure around transformer requirements, power demand has risen sharply and may become unsustainable.
The issue is not that transformers have stopped working. It is that the cost of making them work at larger scales is becoming harder to ignore. The next step in generative AI may depend less on simply making systems bigger and more on changing how they process information.
How the hidden state creates a bottleneck
A key part of a transformer is its hidden state. This is a long list of data that grows as the model processes an input. When a transformer works through a book, for example, the hidden state stores representations of words or parts of words so the model can use that context later.
This mechanism helps explain why transformers can perform in-context learning. The hidden state acts like a lookup table that lets the model refer back to what it has already processed.
The weakness is that this memory structure becomes expensive to use. To produce even a single word about a book it has just read, a transformer may need to scan through the entire lookup table. In computational terms, that can resemble rereading the whole book.
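The cost described above can be illustrated with a toy sketch. It is not how any production transformer is implemented: the function names are hypothetical, and each token vector is reused as both the stored key and the stored value, whereas real models use learned projections. The only point is that the table gains one entry per token and every step reads all of it.

```python
import math


def attend(query, cache):
    """Return a softmax-weighted average of every cached value."""
    # One dot product per cached entry: the whole table is scanned.
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in cache]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, cache)) / total
        for i in range(dim)
    ]


def run_with_growing_cache(tokens):
    """Process tokens one by one, counting how many entries are read."""
    cache = []          # the "lookup table": one entry per token seen
    entries_read = 0    # total entries scanned across all steps
    outputs = []
    for vec in tokens:
        # Toy assumption: the token vector serves as both key and value.
        cache.append((vec, vec))
        entries_read += len(cache)
        outputs.append(attend(vec, cache))
    return outputs, len(cache), entries_read
```

With n tokens, the table ends with n entries and the loop reads n(n+1)/2 entries in total, so the work grows roughly with the square of the input length.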
That design becomes more challenging as inputs become longer and more complex. A book is one example. Long video, audio recordings and large sets of images make the problem even more demanding.
What TTT models change
TTT models take a different approach. Instead of relying on a hidden state that grows as more data is processed, the architecture replaces that lookup table with a machine learning model inside the larger model.
The important change is how information is stored. The internal model encodes what it processes into weights, rather than adding more and more entries to a lookup table. Because of that, the size of the internal model does not grow every time the system handles additional data.
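A minimal sketch of that idea follows, under loudly stated assumptions. The article does not specify the internal model or how it is updated, so this example assumes a small linear model trained with one plain gradient-descent step per token on a reconstruction loss. All names, the learning rate and the loss are illustrative choices, not the researchers' method.

```python
def predict(weights, x):
    """Apply the internal linear model: returns W @ x."""
    dim = len(x)
    return [sum(weights[i][j] * x[j] for j in range(dim)) for i in range(dim)]


def reconstruction_loss(weights, x):
    """Assumed objective: 0.5 * ||W x - x||^2."""
    pred = predict(weights, x)
    return 0.5 * sum((p - xi) ** 2 for p, xi in zip(pred, x))


def update_weights(weights, x, lr=0.1):
    """Take one gradient step on the reconstruction loss for token x."""
    dim = len(x)
    pred = predict(weights, x)
    err = [pred[i] - x[i] for i in range(dim)]
    # The gradient of the loss with respect to W[i][j] is err[i] * x[j].
    return [
        [weights[i][j] - lr * err[i] * x[j] for j in range(dim)]
        for i in range(dim)
    ]


def run_with_fixed_weights(tokens, dim, lr=0.1):
    """Fold every token into a dim x dim weight matrix of constant size."""
    weights = [[0.0] * dim for _ in range(dim)]
    for x in tokens:
        weights = update_weights(weights, x, lr)
    return weights
```

However many tokens are passed in, the returned matrix stays dim by dim. The information is absorbed into the weights rather than appended to a table, which is the contrast the paragraph above draws.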
Researchers at Stanford, UC San Diego, UC Berkeley and Meta developed the TTT approach over a year and a half. They claim that TTT models can process far more data than transformers while using far less compute.
Yu Sun, a post-doc at Stanford and a co-contributor on the TTT research, believes future TTT models could efficiently process billions of pieces of data. The inputs could include words, images, audio recordings and videos.
That is the central promise: a model that can keep working with very large inputs without its internal memory structure becoming larger and more expensive at every step. If that holds up, TTT models could be especially relevant for long-form video and other data-heavy AI tasks.
Why the evidence is still early
The case for TTT models is not settled. The architecture is promising, but it is not yet a simple replacement for transformers. The researchers have developed two small models for study, which makes direct comparison with much larger transformer systems difficult.
That matters because performance at small scale does not automatically prove performance at large scale. Generative AI systems are judged not only by architectural elegance, but also by whether they can be trained, deployed and scaled in real environments.
Mike Cook, a senior lecturer in King’s College London’s department of informatics who was not involved with the TTT research, described the work as interesting but said the data would need to support the claimed efficiency gains. He also noted that adding a neural network inside a neural network resembles a familiar computer science instinct: solving a problem by adding another layer of abstraction.
That skepticism does not dismiss the idea. It places TTT models where they currently belong: as a serious research direction, not a proven successor to transformers.
The wider search for transformer alternatives
TTT models are part of a broader shift. Researchers and AI companies are actively exploring alternatives because the limits of transformers are becoming more visible.
One example is state space models, or SSMs. Like TTT models, SSMs appear to offer better computational efficiency than transformers and the ability to scale to larger amounts of data.
Mistral has released Codestral Mamba, a model based on SSMs. AI21 Labs is also exploring SSMs, as is Cartesia, which pioneered some of the first SSMs along with Mamba and Mamba-2, the models that gave Codestral Mamba its name.
The shared goal is clear: make generative AI systems more efficient without giving up the capabilities that made transformers dominant. If these approaches succeed, generative AI could become more accessible and widespread. That outcome would bring new opportunities, and also new concerns, because easier deployment can amplify both useful and harmful uses.
For now, TTT models are best understood as a sign of where AI research is heading. The transformer era is not over, but the pressure to find a more efficient foundation is real.