The Decoder, October 23, 2024

Why DIFF Transformer could make LLMs focus better

Microsoft Research has created the Differential Transformer, or DIFF Transformer, to help language models concentrate on relevant context and reduce interference. In tests, it matched conventional transformers with about 65 percent of the model size or training data, improved long-context retrieval, and showed fewer hallucinations.

Microsoft Research is testing a new AI architecture called the Differential Transformer, or DIFF Transformer, with a simple goal: make large language models pay less attention to the wrong things.

The architecture changes the attention mechanism at the heart of transformer models. According to the researchers, that change helps the model filter distracting context, retrieve important information more reliably, and reduce hallucinations in summarization and question-answering tasks.

What changes inside DIFF Transformer

The central idea is called differential attention. Instead of relying on one attention map, the DIFF Transformer calculates two separate softmax attention maps and subtracts one from the other.
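The subtraction of two softmax attention maps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the function names, the list-of-lists layout, and the `lam` weight on the second map are all hypothetical choices made for the example. With `lam=1.0` it performs the plain subtraction described above.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights: one softmax row per query."""
    d = len(keys[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

def diff_attention(q1, k1, q2, k2, values, lam=1.0):
    """Subtract a second attention map from the first, then mix the values.

    lam is an illustrative weight on the second map (an assumption of this
    sketch); lam=1.0 is the plain subtraction described in the article.
    """
    map1 = attention_map(q1, k1)
    map2 = attention_map(q2, k2)
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(map1, map2)]
    width = len(values[0])
    out = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in diff
    ]
    return diff, out

# Toy inputs: 2 query positions, 3 key/value positions, dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
flat_q = [[0.0, 0.0], [0.0, 0.0]]  # second query set that attends uniformly
diff, out = diff_attention(q, k, flat_q, k, v)
```

Because each softmax row sums to one, the rows of `diff` sum to zero when `lam=1.0`, so the result contains both positive and negative weights. Whatever the two maps have in common cancels out.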

The researchers compare the effect to noise-canceling headphones. If both attention maps contain the same distracting signal, the subtraction can reduce that shared noise and leave the model with a cleaner signal for the information that matters.
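A toy calculation makes the headphone analogy concrete. The attention weights below are invented for illustration and do not come from the study; `lam=0.6` is likewise an arbitrary weight on the second map, chosen only to keep the arithmetic easy to follow. Position 1 stands for the relevant token, and the other three positions stand for distracting context that both maps attend to.

```python
def subtract_maps(map_a, map_b, lam=1.0):
    """Element-wise map_a - lam * map_b (lam is a hypothetical weight)."""
    return [a - lam * b for a, b in zip(map_a, map_b)]

def target_share(weights, target):
    """Fraction of the total absolute weight that sits on the target position."""
    total = sum(abs(w) for w in weights)
    return abs(weights[target]) / total

# Invented weights over four tokens; index 1 is the relevant token.
map_a = [0.2, 0.4, 0.2, 0.2]   # attends to the answer, plus shared noise
map_b = [0.3, 0.1, 0.3, 0.3]   # attends mostly to the same noise

cleaned = subtract_maps(map_a, map_b, lam=0.6)
share_before = target_share(map_a, 1)
share_after = target_share(cleaned, 1)
```

In this made-up case the relevant token's share of the attention rises from about 0.40 to about 0.85, because most of the weight on the three distracting positions appears in both maps and is subtracted away.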

That matters because, as the research team puts it, "Transformer tends to overallocate attention to irrelevant context."

In plain language, conventional transformer models may spend too much of their attention budget on text that does not help answer the task. The DIFF Transformer is designed to push attention away from that irrelevant context and toward the parts of the input that matter most.

Why long context is a key test

The DIFF Transformer showed particular strength on longer contexts of up to 64,000 tokens. That is important because long inputs make it harder for a model to identify the small piece of information that actually answers the question.

In tests that extract key information from long texts, often described as "needle in a haystack" tasks, the DIFF Transformer performed significantly better than conventional models. When important information appeared in the first half of a 64,000-token context, the new model reached up to 76 percent higher accuracy, according to the researchers.
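For readers unfamiliar with the setup, a needle-in-a-haystack trial can be sketched as follows. This is a hypothetical harness, not the one used in the study: the helper names, the filler sentence, the needle text, and the substring-based scoring rule are all assumptions made for illustration, and the sketch counts sentences rather than tokens.

```python
def build_haystack(filler, needle, depth, total_sentences):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * total_sentences
    position = round(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences), position

def is_retrieved(model_answer, expected):
    """Score one trial: the expected string must appear in the model's answer."""
    return expected.lower() in model_answer.lower()

# A needle placed a quarter of the way into the context (the "first half").
context, position = build_haystack(
    filler="The sky was clear that day.",
    needle="The magic number is 7421.",
    depth=0.25,
    total_sentences=100,
)
```

A real evaluation would pass `context` and a question to the model, then call `is_retrieved` on its answer across many depths and context lengths to obtain an accuracy figure.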

This result fits the purpose of the architecture. If a model can reduce noise across a large amount of text, it should be better positioned to recover the relevant detail instead of being pulled toward unrelated passages.

Efficiency gains with less data or model size

The DIFF Transformer was also tested for efficiency. In those tests, it achieved comparable performance to conventional transformers while using about 65 percent of the model size or training data.

For a 3-billion-parameter model trained on one trillion tokens, the DIFF Transformer outperformed variants based on the established transformer architecture, according to the study. That does not mean every model would automatically become smaller or cheaper. It does show that the architecture produced stronger results under the conditions described by the researchers.

The efficiency claim is important because large language models are often judged not only by their raw output quality, but also by the resources needed to train and run them. A design that can preserve or improve performance with less model size or training data would be a meaningful direction for future model development.

Hallucinations, learning order, and compression

The researchers also report lower hallucination rates. In summarization tests using datasets such as XSum, CNN/DM, and MultiNews, the DIFF Transformer showed 9 to 19 percentage points higher accuracy than a comparable standard transformer. Similar gains appeared in question-answering tasks.

Hallucinations are a central problem for large language models because they make outputs harder to trust. The source does not claim the DIFF Transformer eliminates the issue, but it does report measurable improvements in the tested settings.

The architecture also proved more robust when the order of examples changed during in-context learning. That matters because conventional models can be sensitive to example order, which can make behavior less predictable across prompts that contain the same information arranged differently.

Another reported benefit appears in quantization, a technique used to reduce model size and increase inference speed by converting continuous model parameter values into a smaller set of discrete values. Outlier activations can make this kind of compression harder, and the DIFF Transformer produces fewer of them.

At extreme quantization to 4 bits, the DIFF Transformer achieved about 25 percentage points higher accuracy than a standard transformer. That result connects the architecture not only to model quality, but also to the practical work of making models smaller and faster.
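A small sketch shows why outliers are a problem at low bit widths. It uses simple symmetric uniform quantization, which is an assumption chosen for clarity and not necessarily the scheme used in the study, and the activation values are invented. A single large value stretches the quantization grid so far that the ordinary values all collapse to zero.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric uniform quantization: round to integer levels, then map back."""
    levels = 2 ** (bits - 1) - 1                  # 7 positive levels for 4 bits
    scale = max(abs(v) for v in values) / levels  # grid step set by the largest value
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

typical = [0.12, -0.40, 0.33, -0.05, 0.27, -0.18]  # invented activations
with_outlier = typical + [12.0]                     # same values plus one outlier

err_typical = mean_abs_error(typical, quantize_dequantize(typical))
restored = quantize_dequantize(with_outlier)
# Error on the ordinary values only, once the outlier has set the grid step.
err_outlier = mean_abs_error(typical, restored[:len(typical)])
```

Here `err_outlier` is far larger than `err_typical`, which is the intuition behind the reported result: fewer outlier activations leave more of the 4-bit range available for the values that carry most of the information.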

The tradeoff is modest throughput loss

The DIFF Transformer is not presented as a free improvement. According to the study, its throughput is about 5 to 12 percent lower than that of a comparable conventional transformer.

That tradeoff is relatively small compared with the reported gains across long-context retrieval, hallucination reduction, in-context learning robustness, and quantization. The researchers describe the architecture as a promising foundation for future large language models.

The larger point is not just that DIFF Transformer changes how attention is calculated. It changes the model’s relationship with context. By subtracting shared noise from two attention maps, Microsoft Research is testing whether future LLMs can become more selective, more efficient, and less prone to misleading outputs.