A faster AI bet arrives as Mercury 2 uses diffusion for reasoning

Inception Labs has launched Mercury 2, a diffusion-based reasoning AI model built to generate text differently from conventional language models. The company says the model is faster and cheaper than named speed-focused rivals, while supporting a 128K context window, tool usage, JSON output and an OpenAI-compatible API.

WTF Index NEUTRAL
◄ Terminator 1 Idiocracy 0 ►

This is mainly a routine model launch focused on speed, cost and developer features, with only mild capability expansion risk.

A faster AI bet arrives as Mercury 2 uses diffusion for reasoning

Inception Labs is bringing a different kind of language model into production with Mercury 2, a reasoning model the company presents as the first diffusion-based system of its kind. The pitch is direct: generate useful text faster, reduce latency, and lower the cost of running AI products that need quick answers.

The launch matters because Mercury 2 does not follow the same step-by-step text generation pattern associated with conventional language models. Inception says its model works by refining multiple text blocks at the same time, a process the startup compares to editing a full draft in one pass rather than checking one word after another.

How Mercury 2 Changes The Generation Process

Most users experience an AI model through its final answer, but the mechanics behind that answer shape speed, cost and product design. Mercury 2 is built around diffusion-based language reasoning, which means its text process is described as broader and more simultaneous than the word-by-word approach of conventional models.

That distinction is central to Inception's positioning. Instead of emphasizing only benchmark quality, the company is highlighting the practical impact of a different architecture: less waiting, lower token prices, and support for the tooling developers expect from production AI systems.

According to the source, Mercury 2 supports a 128K context window, tool usage and JSON output. Those features are important for business adoption because they point to workflows that go beyond simple chat, including structured responses and integrations with external tools.

Speed And Price Are The Main Claims

Inception's clearest performance claim is latency. The company says Mercury 2 reaches 1,009 tokens per second on Nvidia Blackwell GPUs, with end-to-end latency of just 1.7 seconds.

The source compares that figure with 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5 with reasoning enabled. Inception also claims the output quality is comparable to leading speed-optimized models.

Pricing is another major part of the launch. Mercury 2 is listed at $0.25 per million input tokens and $0.75 per million output tokens. The source says this undercuts Gemini 3 Flash, priced at $0.50/$3.00, by half on input and four times on output. It also describes Mercury 2 as roughly four times cheaper than Claude Haiku 4.5, priced at $1.00/$5.00, on input and more than two and a half times on output.

For companies, those numbers are not just marketing details. Latency-sensitive products often become difficult to scale when every interaction depends on a slow or expensive model call. If Inception's claims hold in real deployments, Mercury 2 could be relevant for teams trying to keep AI interactions responsive without pushing token costs too high.

Where Inception Wants The Model Used

Inception is aiming Mercury 2 at companies building applications where delay is highly visible to users. The source names voice assistants, coding tools and search systems as target areas.

Those categories share a common requirement: the model has to respond quickly enough to feel useful inside an active workflow. A voice assistant cannot pause too long between turns. A coding tool loses value if suggestions arrive after the developer has moved on. A search system needs to return useful results without making the user wait through a long reasoning cycle.

Mercury 2 is available now through an OpenAI-compatible API. Companies can apply for early access, and the model can also be tested directly in the chat.

The OpenAI-compatible API detail is especially relevant for adoption because it suggests developers may be able to evaluate Mercury 2 without rebuilding their entire software stack around a new interface. The source does not describe the full integration process, but compatibility is clearly part of the production pitch.

The Broader Search Beyond Transformers

Mercury 2 arrives during a wider industry search for alternatives to the dominant Transformer architecture. The source notes that a growing number of startups are exploring other approaches, with diffusion-based language models now part of that conversation.

Inception has been moving toward this point for some time. Last November, the startup raised $50 million from investors including Microsoft, Nvidia, and Snowflake. It showed its first prototype in early 2025. With Mercury 2, it is now presenting a production-ready reasoning model rather than an early demonstration.

Inception is not the only company exploring diffusion for language. Google Deepmind is also working on diffusion-based language models. The source says Gemini Diffusion performed on par with the then-current Gemini 2.0 Flash Lite model in benchmarks, but Google has not said anything about the experiment since it was first presented in May 2025.

That leaves the central question open. Diffusion-based language models may offer a different path toward faster and cheaper reasoning systems, but long-term durability still has to be proven. Mercury 2 gives the market a concrete production model to test, compare and pressure in real use.

What To Watch Next

The near-term story is simple: Mercury 2 is trying to win on speed, price and production readiness. The larger story is whether diffusion-based reasoning can become more than an interesting alternative architecture.

For developers and companies, the first tests will likely focus on familiar questions:

  • Does the model's low latency hold up in real applications?
  • Is output quality strong enough for customer-facing use?
  • Can the OpenAI-compatible API make evaluation straightforward?
  • Do the token prices materially change the economics of latency-sensitive AI products?

Those are practical questions rather than theoretical ones. Mercury 2's launch puts diffusion-based language reasoning into a form companies can evaluate now, and that makes it a notable step in the search for what might come after today's dominant model designs.