The Decoder, June 8, 2025

Why Gemini 2.5 Pro leads on long-context AI tests

Gemini 2.5 Pro currently leads the Fiction.Live benchmark for handling complex, lengthy text. OpenAI's o3 keeps pace up to 128,000 tokens, but its performance drops sharply at 192,000 tokens, while Gemini 2.5 Pro's June preview remains stable there.

Long-context performance is becoming one of the clearest ways to separate models that can merely accept a lot of text from models that can still use that text well. On the Fiction.Live benchmark, which is built around long, complex stories, Google's Gemini 2.5 Pro currently leads.

The result matters because the benchmark is not just asking a model to find one hidden fact. It tests whether a language model can understand and accurately reproduce intricate stories and contexts, which is a harder and more realistic challenge for many long-document workflows.

What Fiction.Live Tests

Fiction.Live focuses on how well language models handle long, complicated material. That puts pressure on more than raw context size. A model must keep track of structure, relationships, and details across a large amount of text.

That makes the benchmark different from simpler search-style evaluations such as the popular "Needle in a Haystack" test. A model can sometimes retrieve an isolated item from a long input while still struggling to reason across the broader document.

For users, this distinction is important. Many real tasks involve lengthy PDFs, documents, stories, reports, or other inputs where the answer depends on context spread across many pages. The question is not only whether the model can accept the file, but whether it can keep the useful parts connected.

Where Gemini 2.5 Pro Pulls Ahead

According to Fiction.Live, OpenAI's o3 model performs similarly to Gemini 2.5 Pro up to a context window of 128,000 tokens, which the source describes as about 96,000 words. At that size, the two models are close enough that the benchmark does not show a major separation.

The gap appears at 192,000 tokens, roughly 144,000 words. At that point, o3's performance drops off sharply. Gemini 2.5 Pro's June preview, identified as preview-06-05, remains stable at that same length.

That does not mean Gemini 2.5 Pro will stay equally accurate at every larger size. The tested windows are still far below the one million tokens Google advertises as Gemini 2.5 Pro's maximum. The source also notes that Gemini's accuracy is likely to decrease as the window grows.

For comparison, OpenAI's o3 model currently tops out at a 200,000-token context window. The Fiction.Live result therefore highlights a practical point: advertised capacity and reliable long-context performance are related, but they are not the same thing.
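
The word counts quoted above for 128,000 and 192,000 tokens both work out to about 0.75 words per token. The short Python sketch below reproduces that arithmetic. The constant and the function names are illustrative assumptions, not part of the benchmark, and real ratios vary with the tokenizer and the text.

```python
# Rule of thumb implied by the figures above: about 0.75 English words per token.
# This is an approximation, not a tokenizer.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Estimate a word count from a token count."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate a token count from a word count."""
    return round(words / WORDS_PER_TOKEN)


for tokens in (128_000, 192_000):
    print(f"{tokens:,} tokens ~ {tokens_to_words(tokens):,} words")
```

The same helper can be used in reverse to guess whether a document of a known word count is likely to fit inside a given window before it is uploaded.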

Bigger Windows Are Not Automatically Better

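Model providers often promote very large context windows, but the source shows why those claims need careful reading. Meta, for example, promotes context windows of up to ten million tokens for its Llama 4 models. The source attaches that figure to Llama 4 Maverick, although Meta lists ten million tokens for Llama 4 Scout and one million for Maverick. In practice, according to the source, Maverick struggles with complex long-context tasks and ignores too much information to be useful.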

The core issue is quality of attention, not just quantity of input. A model may be able to receive a huge amount of text, but that does not guarantee it will weigh the right parts properly or preserve the relationships needed for a correct answer.

Nikolay Savinov of Google DeepMind recently described this as a basic "shit in, shit out" problem. His point, as presented in the source, is that adding more tokens can create a distribution issue: more attention on one token means less attention available for others.

That observation explains why context bloat can hurt results. If a prompt contains large amounts of irrelevant material, the model has to process more noise along with the useful signal. Even a strong long-context model can become less reliable when the input is cluttered.
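
The distribution issue can be illustrated with a toy softmax, which is the normalization step that makes attention weights sum to one. The sketch below is a simplified assumption for illustration only: it uses a single relevant token with a fixed score and a growing number of equally scored distractors, whereas real models use many attention heads, many layers, and learned scores.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_weight(n_irrelevant, relevant_score=5.0, irrelevant_score=0.0):
    """Weight given to one relevant token among n_irrelevant distractors.

    The scores are hypothetical values chosen for illustration.
    """
    scores = [relevant_score] + [irrelevant_score] * n_irrelevant
    return softmax(scores)[0]


for n in (10, 1_000, 100_000):
    weight = relevant_weight(n)
    print(f"{n:>7,} distractors -> weight on relevant token: {weight:.4f}")
```

Because the weights must always sum to one, every added distractor takes a small share away from the relevant token, even though each distractor scores far lower on its own.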

What Users Should Do Now

The practical lesson is straightforward: be selective. Savinov recommends avoiding irrelevant information in the context whenever possible, while researchers continue working on new models to address the problem.

That advice applies even when a model can technically handle a large document. Recent studies also show that AI models still have trouble reasoning over long contexts. A large context window can help, but it does not remove the need to prepare the input carefully.

For long PDFs and other large files, users should remove pages that are not relevant to the task. The source gives introductory sections as an example of material that may be unnecessary for a specific query.

A useful workflow is therefore built around context discipline:

Include the document sections that matter to the question.
Remove pages that do not contribute to the task.
Treat maximum context size as capacity, not as a promise of perfect reasoning.
Compare models on complex long-context tasks, not only on simple retrieval tests.
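
The first two steps of that workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: keyword matching stands in for a real judgment of relevance, the token estimate uses the rough 0.75 words-per-token ratio, and the function names and sample pages are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def select_pages(pages, keywords, token_budget):
    """Keep pages that mention any keyword, in order, within a token budget."""
    lowered = [k.lower() for k in keywords]
    kept, used = [], 0
    for number, text in enumerate(pages, start=1):
        if not any(k in text.lower() for k in lowered):
            continue  # skip pages that do not touch the task
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            break  # stop before exceeding the budget
        kept.append((number, text))
        used += cost
    return kept, used


# Hypothetical sample document, one string per page.
pages = [
    "Introduction and acknowledgements for the annual report.",
    "Revenue grew in the third quarter, driven by subscriptions.",
    "Office locations and contact details.",
    "Revenue outlook: management expects slower growth next year.",
]
kept, used = select_pages(pages, keywords=["revenue"], token_budget=1_000)
print([number for number, _ in kept])  # prints [2, 4]
```

Only the pages that mention the topic reach the model, so the introductory and contact pages never compete for attention with the material that answers the question.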

The Bottom Line

Gemini 2.5 Pro's lead on Fiction.Live shows that long-context performance can be measured in ways that go beyond advertised window size. The current result favors Google's model at 192,000 tokens, where OpenAI's o3 falls off sharply and Gemini 2.5 Pro's June preview remains stable.

But the broader message is not that more tokens solve everything. The benchmark, the examples of o3 and Llama 4 Maverick, and Savinov's warning all point in the same direction: useful long-context performance depends on how well a model handles relevant information, not just how much text fits into the window.