Can nested learning help AI models stop forgetting?

Google Research has introduced "nested learning," a model design approach meant to reduce or even avoid "catastrophic forgetting." Its HOPE architecture uses memory systems that update at different speeds, with tests showing gains over Transformer++, RetNet, and DeltaNet.

Google Research is proposing a different way to think about memory in AI models. The idea, called "nested learning," is aimed at a central weakness in large language models: after training, they do not build durable new long-term memories in the way users might expect.

The work appears in a NeurIPS 2025 paper and focuses on reducing or potentially avoiding "catastrophic forgetting," the tendency of a neural network to lose previously learned knowledge when it is trained on new data. Instead of treating a model as mostly fixed after pretraining, nested learning treats more of the learning system itself as memory.

Why forgetting is such a hard problem

Current large language models mainly rely on two sources of knowledge after training: what they absorbed during pretraining and what fits inside the current context window. That creates a practical limit. A model can respond to information in front of it, but that does not mean it has converted the information into lasting memory.

Making the context window larger can help the model look at more material at once. Retraining can also refresh what the model knows. But the source describes both approaches as ways of postponing the deeper issue rather than solving it.

The problem is especially important because modern models are generally static once pretraining is complete. They can use learned capabilities, but they do not easily acquire new ones outside the context available to them. Simply updating the model more often does not fix this, because each round of new training can overwrite what was learned before. That is why the research focuses on model designs that can support continuous learning.
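A deliberately tiny example can make the forgetting problem concrete. The sketch below is not from the paper: it assumes a one-weight linear model and two made-up tasks (task_a and task_b are hypothetical names), trained one after the other with plain gradient descent.

```python
# Toy illustration of catastrophic forgetting (not the paper's setup).
# Assumption: a one-parameter model y = w * x and two conflicting tasks.

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def sgd(w, data, lr=0.05, epochs=200):
    """Plain per-sample gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]   # task A wants w = 2
task_b = [(x, -1.0 * x) for x in xs]  # task B wants w = -1

w = sgd(0.0, task_a)
loss_a_before = mse(w, task_a)        # near zero: task A is learned

w = sgd(w, task_b)                    # keep training, but only on task B
loss_a_after = mse(w, task_a)         # task A performance has collapsed

print(f"task A loss before: {loss_a_before:.6f}, after: {loss_a_after:.3f}")
```

With a single shared weight, learning task B necessarily erases task A, and the loss on task A rises from near zero to roughly 5.6. Real networks have far more capacity, but the same pressure applies whenever new updates reuse the parameters that hold old knowledge.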

How nested learning reframes memory

Nested learning draws inspiration from neuroscience. The source describes the brain as operating across different time scales: faster circuits respond to what is happening now, while slower systems consolidate important patterns into longer-term memory.

That distinction matters because not every experience should become permanent. Most information fades, while a smaller set of patterns is preserved. The source connects this to neuroplasticity, the brain's ability to change while still retaining essential information.

Google Research applies that general idea to AI architecture. In nested learning, memory is not limited to a single component. The model, the optimizer, and the training algorithm are all treated as parts of a broader memory system.

That includes backpropagation, which stores relationships between data and errors. It also includes optimizer state, such as momentum, which can function as a form of memory. The broader goal is to give the system more temporal depth, so different parts can update at different speeds instead of forcing all learning into one flat process.
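The claim that optimizer state acts as memory can be illustrated with a short sketch. It assumes the exponential-moving-average form of momentum with a hypothetical decay rate beta = 0.9; it is an illustration of the general idea, not the formulation used in the paper.

```python
# Momentum as a simple memory of past gradients.
# Assumption: EMA-style momentum, m = beta * m + (1 - beta) * g.

def momentum_trace(grads, beta=0.9):
    """Return the momentum value after each gradient in the sequence."""
    m = 0.0
    trace = []
    for g in grads:
        m = beta * m + (1 - beta) * g
        trace.append(m)
    return trace

# Five steps with a gradient of 1.0, then five steps with no gradient at all.
grads = [1.0] * 5 + [0.0] * 5
trace = momentum_trace(grads)

for step, (g, m) in enumerate(zip(grads, trace), start=1):
    print(f"step {step:2d}  gradient={g:.1f}  momentum={m:.4f}")
```

After the gradient drops to zero, the momentum term does not. It decays gradually from about 0.41 to about 0.24, so the optimizer is still carrying information about inputs it is no longer seeing.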

What the Continuum Memory System adds

The Continuum Memory System, or CMS, is the mechanism described in the source for dividing memory into modules with different update rates. Fast components can handle immediate input. Slower components can preserve signals that appear important enough to store for longer.

This layered design is meant to move beyond the usual "pretrain and freeze" pattern. In that older pattern, a model receives its core knowledge during training and then mostly depends on context at use time. Nested learning instead tries to make the learning process itself more adaptive.

The practical implication is straightforward: a model with multiple memory speeds can process the present while also deciding what deserves longer-term retention. That does not mean every new input becomes permanent. It means the architecture has a built-in way to separate short-lived information from patterns that may be worth keeping.
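One way to picture memory modules with different update rates is the sketch below. It is a loose illustration under stated assumptions, not the CMS described in the paper: the MemoryLevel class, the update periods (1, 4, and 16 steps), and the averaging rule are all hypothetical.

```python
# Multi-timescale memory sketch (hypothetical, not the paper's CMS).
# Each level updates only every `period` steps, using the average of
# everything it has seen since its last update.

class MemoryLevel:
    def __init__(self, period, lr=0.5):
        self.period = period   # steps between updates
        self.lr = lr           # how far each update moves the stored value
        self.value = 0.0
        self.buffer = []
        self.updates = 0

    def observe(self, step, signal):
        self.buffer.append(signal)
        if step % self.period == 0:
            avg = sum(self.buffer) / len(self.buffer)
            self.value += self.lr * (avg - self.value)
            self.buffer.clear()
            self.updates += 1

levels = [MemoryLevel(period=1), MemoryLevel(period=4), MemoryLevel(period=16)]

# A long-running pattern (1.0) followed by a brief change (0.0).
stream = [1.0] * 32 + [0.0] * 4
for step, signal in enumerate(stream, start=1):
    for level in levels:
        level.observe(step, signal)

for level in levels:
    print(f"period={level.period:2d}  updates={level.updates:2d}  value={level.value:.3f}")
```

In this run the fastest level has almost entirely switched to the new signal (about 0.06), the middle level sits near 0.50, and the slowest level still holds 0.75 from the long-running pattern. The brief change reaches the fast memory immediately but has not yet touched the slow one.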

HOPE puts the idea into an architecture

Google's HOPE architecture is the concrete implementation described in the source. HOPE builds on Titans, long-term memory modules that prioritize what to store based on how surprising the information is to the model.
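Storing information according to surprise can be sketched in a few lines. This is a simplified stand-in, not the Titans mechanism: it assumes surprise is the absolute prediction error on a scalar value, and the SurpriseMemory class and its threshold are hypothetical.

```python
# Surprise-gated memory sketch (hypothetical simplification).
# Assumption: surprise = |observed value - value the memory would predict|.

class SurpriseMemory:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.store = {}

    def predict(self, key):
        """Return the stored value, or 0.0 for a key never written."""
        return self.store.get(key, 0.0)

    def observe(self, key, value):
        """Write the value only if it is surprising enough. Returns True on write."""
        surprise = abs(value - self.predict(key))
        if surprise > self.threshold:
            self.store[key] = value
            return True
        return False

memory = SurpriseMemory(threshold=0.5)
events = [("a", 1.0), ("a", 1.1), ("b", 0.2), ("a", 3.0)]
written = [memory.observe(key, value) for key, value in events]

print(written)        # which events were stored
print(memory.store)   # what the memory holds afterwards
```

Only the first and last events are written. The second is too close to what the memory already predicts, and the third is too small to clear the threshold, so neither costs any storage.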

HOPE also layers different kinds of memory and uses CMS blocks to support larger context windows. The source describes fast layers that process live input and slower layers that distill what matters for long-term storage. The system can also adapt its update rules as it learns.

In plain terms, HOPE is designed to make memory more selective and more structured. It does not simply extend the amount of text a model can inspect. It changes how information can move through short-term and long-term processes inside the model.

What the tests showed

The Google team tested HOPE on language modeling and reasoning. With models at 1.3 billion parameters trained on 100 billion tokens, HOPE outperformed Transformer++ and newer models including RetNet and DeltaNet.

The source also reports better performance in long-context and needle-in-a-haystack tests. In those evaluations, the model must locate specific information inside a large body of text.
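A needle-in-a-haystack test is simple to construct. The sketch below shows the general shape of such a test, not the benchmark used in the paper: build_haystack, score, and the filler text are hypothetical, and toy_model is a regular-expression stand-in for what would be a real model call.

```python
# Needle-in-a-haystack sketch (illustrative, not the paper's benchmark).
import re

def build_haystack(needle, filler, n_sentences, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)

def score(answer, expected):
    """Count the answer as correct if it contains the expected string."""
    return expected.lower() in answer.lower()

def toy_model(prompt):
    """Stand-in for a real model: a regex search over the prompt."""
    match = re.search(r"secret code is (\d+)", prompt)
    return match.group(1) if match else ""

needle = "The secret code is 7421."
context = build_haystack(needle, "The sky was grey that day.", 1000, depth=0.5)
prompt = context + " What is the secret code?"

print(score(toy_model(prompt), "7421"))
```

A real evaluation would replace toy_model with the model under test and repeat the run across many context lengths and needle depths, since retrieval often degrades as the context grows or as the needle moves away from the edges.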

The tested models ranged from 340 million to 1.3 billion parameters. According to the source, HOPE's gains were consistent across those sizes, and the authors say it can outperform both transformers and modern recurrent networks.

An independent reproduction is also available on GitHub. That detail matters because reproduction is one way for other researchers to check whether the reported behavior holds outside the original implementation.

The broader point is not just that HOPE performed well in the described tests. It is that nested learning offers a different framework for building models that can keep learning without collapsing older knowledge. If the approach continues to hold up, it could become part of how future AI systems manage memory after training.