Sakana AI has presented Transformer² as a way to make language models more flexible after their initial training. The core idea is simple to state: instead of forcing a model to relearn broad parts of itself for each new job, the system gives it smaller, task-focused controls that can be selected or combined when needed.
That matters because current AI systems are usually trained once and then expected to handle many kinds of work, including writing, answering questions, math, programming, and reasoning. They can be powerful across familiar tasks, but unexpected problems can expose the limits of that fixed training.
What Transformer² changes
Transformer² is built around a two-stage learning process. First, it trains a set of expert vectors through Singular Value Fine-Tuning, or SVF. Then, when a task arrives, it selects or combines those vectors to adapt the model. Each expert vector is meant to help the model handle a particular kind of task, such as math, programming, or logical reasoning.
Traditional approaches to teaching a model new tasks often involve changing the weights across the network. That can be expensive for large models, and it can also cause a model to lose performance on things it previously knew how to do. In other words, improving one capability can come at the cost of weakening another.
LoRA is one alternative that tries to reduce that burden by attaching smaller additions to an existing network. SVF takes a different route. Rather than directly rewriting the network weights, it learns vectors that control how strongly each component of a weight matrix contributes. The name comes from the fact that these vectors scale the singular values of the weight matrices.
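To make the idea concrete, here is a minimal NumPy sketch of scaling a matrix's singular values with a learned vector. It is an illustration under assumptions, not Sakana AI's implementation: the function name `svf_adapt`, the toy matrix, and the example vector `z_expert` are all hypothetical.

```python
import numpy as np

def svf_adapt(W, z):
    """Return a version of W whose singular values are scaled elementwise by z."""
    # Decompose the frozen weight matrix: W = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Rebuild it with rescaled singular values: U @ diag(s * z) @ Vt
    return (U * (s * z)) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                 # stand-in for one frozen weight matrix
z_identity = np.ones(4)                     # scaling by 1 leaves W unchanged
z_expert = np.array([1.5, 1.0, 0.5, 0.0])   # hypothetical expert vector

W_same = svf_adapt(W, z_identity)
W_new = svf_adapt(W, z_expert)
```

In this sketch the original matrix is never modified. The only thing that would be trained is the short vector `z`, one number per singular value.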
Why expert vectors matter
The parameter difference is central to Sakana AI's case for the approach. The source comparison says LoRA requires 6.82 million parameters for adaptation, while SVF uses just 160,000. That smaller footprint is presented as a way to save memory and processing power while still giving the model room to specialize.
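A rough count shows where a gap of this kind comes from. In the standard LoRA formulation, a rank-r adapter on an m-by-n weight matrix trains r × (m + n) values, while one scale per singular value needs only min(m, n). The dimensions and rank below are illustrative assumptions, not the configuration behind the 6.82 million and 160,000 figures.

```python
def lora_params(m, n, rank):
    """Trainable values in a rank-`rank` LoRA adapter for an m x n matrix."""
    return rank * (m + n)   # one m x rank matrix plus one rank x n matrix

def svf_params(m, n):
    """Trainable values when learning one scale per singular value."""
    return min(m, n)

m, n, rank = 4096, 4096, 16      # hypothetical layer size and LoRA rank
print(lora_params(m, n, rank))   # 131072
print(svf_params(m, n))          # 4096
```

For this hypothetical layer the adapter is 32 times larger than the scaling vector. The totals for a full model depend on how many layers are adapted and on the rank chosen.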
The expert vectors are also designed to reduce the risk that a model becomes too narrowly tuned to one task. Because the vectors can be kept separate and combined, the model can adapt without being pushed as hard toward a single specialization. That is important for a general language model, where usefulness depends on carrying many abilities at once.
Transformer² finds these expert vectors through reinforcement learning. The model tries training tasks, receives feedback on its results, and uses that feedback to improve the vectors. Over repeated steps, the system adjusts the expert vectors until they work better for the tasks they are meant to support.
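The try, score, and update loop can be sketched on a toy problem. The article does not specify the reinforcement learning algorithm, so the example below uses a plain REINFORCE-style policy gradient as a stand-in. The single-layer "model", the synthetic task, the +1/-1 reward, and every name here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen toy "model": one linear layer mapping 8 features to 3 answer choices.
W = rng.normal(size=(3, 8))
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # U: (3, 3), s: (3,), Vt: (3, 8)

# Hypothetical training task: labels come from a hidden rescaling of the layer.
X = rng.normal(size=(64, 8))
z_target = np.array([2.0, 0.2, 1.0])
y = np.argmax(((X @ Vt.T) * (s * z_target)) @ U.T, axis=1)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

z = np.ones(3)   # expert vector, initialised to "no change"
lr = 0.05

for step in range(200):
    grad = np.zeros_like(z)
    for x, target in zip(X, y):
        h = Vt @ x                          # input projected onto right singular vectors
        probs = softmax(U @ (s * z * h))    # answer distribution of the adapted layer
        action = rng.choice(3, p=probs)     # the model tries an answer
        reward = 1.0 if action == target else -1.0   # feedback on the attempt
        # Gradient of log pi(action) with respect to z
        dlogits = -probs
        dlogits[action] += 1.0
        grad += reward * (U.T @ dlogits) * s * h
    z += lr * grad / len(X)                 # gradient ascent on expected reward
```

Only `z` changes during training; `U`, `s`, and `Vt` stay fixed, which mirrors the idea of adjusting a small vector rather than the network weights.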
How the system adapts to tasks
The researchers developed three ways for Transformer² to use its learned expertise. One method relies on adaptation prompts, which help the model identify the kind of task it is facing and choose a suitable expert vector. Another method uses a classifier that examines a few examples and selects the expert that appears most appropriate.
The third method is few-shot adaptation. In this version, Transformer² does not simply pick one expert. It builds a custom vector by combining all of its learned expert vectors.
Few-shot adaptation begins with a small set of examples from the new task. The system then tries different mixtures of expert vectors and searches for the one that performs best on those examples. According to the source, more examples give it more room to refine that mixture.
This is where the approach becomes more than a routing system. It can draw from multiple areas of learned expertise at the same time. In tests involving complex math problems, Transformer² did not rely only on math expertise; it also used programming and logical thinking capabilities.
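The mixture search described above can be sketched as follows. The article does not name the search procedure, so this example uses simple random search over mixing weights as a stand-in. The expert vectors, the `score` function (a placeholder for evaluating the adapted model on the few-shot examples), and `task_target` are all hypothetical.

```python
import numpy as np

# Hypothetical learned expert vectors, one row each for math, code, reasoning.
experts = np.array([
    [1.4, 1.0, 0.6, 1.0],
    [0.8, 1.3, 1.0, 0.9],
    [1.0, 0.7, 1.2, 1.1],
])

# Stand-in for "what the new task needs": here, an even blend of math and reasoning.
task_target = 0.5 * experts[0] + 0.5 * experts[2]

def score(z, target):
    """Placeholder for few-shot accuracy: higher is better, 0 is a perfect match."""
    return -np.sum((z - target) ** 2)

def search_mixture(experts, target, rng, n_trials=500):
    """Random search for mixing weights that maximise the score."""
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(experts)))   # nonnegative weights summing to 1
        candidate = w @ experts                     # blended expert vector
        s = score(candidate, target)
        if s > best_score:
            best_w, best_score = w, s
    return best_w, best_score

rng = np.random.default_rng(0)
best_w, best_score = search_mixture(experts, task_target, rng)
z_custom = best_w @ experts
```

A weight vector such as `[1, 0, 0]` reduces this to picking a single expert, so selecting one expert can be read as a special case of the blend.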
How it performed against LoRA
Transformer² was tested on benchmarks covering math, programming, knowledge, and comprehension questions. Against LoRA, it performed up to 16 percent better on math tasks while using far fewer parameters. On completely new tasks, it reached 4 percent higher accuracy than the original model.
The comparison also included a weaker result for LoRA in one setting: when handling completely new tasks, LoRA made the base Llama model perform worse. That contrast is part of why Transformer² is presented as a promising alternative to heavier retraining.
The tests also showed that expert vectors could be transferred between different models. That means smaller models could potentially benefit from expertise learned by larger ones. The source frames this as a possible path for sharing knowledge between models more efficiently.
The limits are still significant
Transformer² is not the same as a model that can freely learn anything new over time. The expert vectors trained with SVF can only work with abilities that already exist inside the pre-trained model. They can adjust and combine existing capabilities, but they cannot add completely new skills.
That is why the source describes the method as progress toward continuous learning, not the arrival of it. A system that continuously learns new skills on its own remains a more distant goal.
There is also an unresolved scaling question. It is still not clear how well the method works for models with more than 70 billion parameters. Until that is better understood, Transformer² should be seen as a focused advance in language model adaptation, not a complete answer to the broader problem of lifelong machine learning.