Ars Technica AI February 27, 2025 NEUTRAL

Why text diffusion models could make AI coding faster

Inception Labs has released Mercury Coder, an AI language model that uses diffusion techniques to generate text in parallel rather than one token at a time. Its reported speed and coding benchmark results suggest text diffusion models may become a serious alternative for smaller AI language models, though reliability and scaling questions remain.

AI language models usually write by moving forward step by step. A new group of text diffusion models is trying a different route: start with obscured content, then refine the whole answer into readable text.

Inception Labs released Mercury Coder on Thursday, bringing that approach into sharper focus for coding tasks. The company reports speed figures far above those of conventional models of similar size, while researchers and developers are still watching closely for trade-offs in quality and reliability.

How text diffusion changes the generation process

Traditional large language models use autoregression. In plain terms, they build output from left to right, one token at a time. Each new piece of text depends on the pieces already produced.

Conventional models, including the kind that powers ChatGPT, work this way. The method is effective, but it means the model must wait on its own previous output before producing the next part of the response.
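The left-to-right loop described above can be sketched in a few lines. This is a toy illustration only: the `NEXT_TOKEN` lookup table and the `generate_autoregressive` function are hypothetical stand-ins for a real model, which would condition on the entire prefix and sample from a probability distribution.

```python
# Toy stand-in for a language model: maps the previous token to the next one.
# Hypothetical data; a real model scores every vocabulary entry given the full prefix.
NEXT_TOKEN = {
    "<start>": "def",
    "def": "add",
    "add": "(a, b):",
    "(a, b):": "return",
    "return": "a + b",
    "a + b": "<end>",
}


def generate_autoregressive(max_tokens=10):
    """Build the output one token at a time; each step waits on the previous one."""
    tokens = ["<start>"]
    steps = 0
    while len(tokens) - 1 < max_tokens:
        next_token = NEXT_TOKEN[tokens[-1]]  # one sequential "model call"
        steps += 1
        if next_token == "<end>":
            break
        tokens.append(next_token)
    # Drop the start marker before returning the generated text.
    return tokens[1:], steps


output, steps = generate_autoregressive()
print(" ".join(output))
print("sequential model calls:", steps)
```

The point of the sketch is the dependency chain: no call can start until the previous one has returned, so the number of sequential calls grows with the length of the output.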

Text diffusion models take inspiration from image-generation systems such as Stable Diffusion, DALL-E, and Midjourney. Instead of adding noise to pixels, these models work with text by masking tokens, because text tokens are discrete chunks rather than continuous pixel values.

LLaDA, developed by researchers from Renmin University and Ant Group, uses masking probability to represent the amount of noise. A high masking level means high noise, while a low masking level means low noise. The model moves from more hidden content toward less hidden content until a coherent answer emerges.

Mercury uses noise terminology, while LLaDA describes the process through masking. The underlying idea is similar: begin from obscured content and progressively reveal the final response.
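As a rough illustration of moving from a high masking level to a low one, the sketch below starts from a fully masked sequence and reveals the most confident positions at each step. It is not LLaDA's or Mercury's actual algorithm: `TARGET`, `CONFIDENCE`, `predict_all`, and `denoise` are hypothetical names, and the fixed confidence scores stand in for what a real model would compute from the partially masked sequence.

```python
MASK = "[MASK]"

# Hypothetical "model" output: a proposed token and a confidence for each position.
TARGET = ["def", "add", "(a,", "b):", "return", "a", "+", "b"]
CONFIDENCE = [0.9, 0.6, 0.8, 0.7, 0.95, 0.5, 0.4, 0.3]


def predict_all(sequence):
    """One parallel pass: propose (position, token, confidence) for every masked slot."""
    return [
        (i, TARGET[i], CONFIDENCE[i])
        for i, tok in enumerate(sequence)
        if tok == MASK
    ]


def denoise(length, num_steps):
    """Start fully masked (maximum noise) and unmask a share of tokens per step."""
    sequence = [MASK] * length
    mask_ratio_history = []
    for step in range(num_steps):
        proposals = predict_all(sequence)
        if not proposals:
            break
        steps_left = num_steps - step
        # Ceiling division: reveal enough tokens to finish by the final step.
        reveal = -(-len(proposals) // steps_left)
        # Commit the most confident proposals first.
        proposals.sort(key=lambda p: p[2], reverse=True)
        for i, tok, _ in proposals[:reveal]:
            sequence[i] = tok
        mask_ratio_history.append(sequence.count(MASK) / length)
    return sequence, mask_ratio_history


final, mask_ratio = denoise(len(TARGET), num_steps=4)
print(" ".join(final))
print(mask_ratio)
```

In this toy run the masked fraction falls step by step until nothing is hidden, and the eight tokens are filled in over four passes rather than eight.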

Why parallel output matters

The key difference is that diffusion-based language models can work across the whole response at once. Rather than committing to one token and then moving to the next, the model can refine many parts of the answer in parallel.

According to Inception Labs, this lets Mercury refine its output and address mistakes as it goes, because it is not limited to considering only the text that has already been generated. The company ties the higher throughput to that parallel refinement.

Mercury’s documentation says its models run “at over 1,000 tokens/sec on Nvidia H100s, a speed previously possible only using custom chips” from specialized hardware providers like Groq, Cerebras, and SambaNova.

That speed claim is central to the interest around Mercury Coder. In applications where a model must respond quickly, a faster generation process could matter even if the model still needs multiple internal refinement passes.

What the benchmark claims show

Speed alone would not be enough if quality fell sharply. The source article reports that diffusion models can run faster than similarly sized conventional models while keeping benchmark performance comparable.

LLaDA’s researchers report that their 8 billion parameter model performs similarly to LLaMA3 8B across benchmarks including MMLU, ARC, and GSM8K. That matters because it suggests the diffusion method is not only a speed experiment, but also a possible model architecture path.

For coding, Mercury Coder Mini is reported to score 88.0 percent on HumanEval and 77.1 percent on MBPP, comparable to GPT-4o Mini. The speed comparison is more dramatic: Mercury Coder Mini reportedly runs at 1,109 tokens per second, compared with GPT-4o Mini’s 59 tokens per second.

The source describes that as roughly a 19x speed advantage over GPT-4o Mini while maintaining similar performance on coding benchmarks. It also reports that Mercury Coder Mini is about 5.5x faster than Gemini 2.0 Flash-Lite, listed at 201 tokens/second, and 18x faster than Claude 3.5 Haiku, listed at 61 tokens/second.

Reported throughput in that comparison:

Mercury Coder Mini: 1,109 tokens per second
GPT-4o Mini: 59 tokens per second
Gemini 2.0 Flash-Lite: 201 tokens per second
Claude 3.5 Haiku: 61 tokens per second
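The multipliers quoted above follow directly from the reported throughput figures. The short calculation below reproduces them; the function names and the 590-token example are illustrative assumptions, and throughput alone says nothing about other factors such as time to first token.

```python
# Reported tokens-per-second figures from the comparison in the article.
REPORTED_TOKENS_PER_SECOND = {
    "Mercury Coder Mini": 1109,
    "GPT-4o Mini": 59,
    "Gemini 2.0 Flash-Lite": 201,
    "Claude 3.5 Haiku": 61,
}


def speedup_over(baseline, fast="Mercury Coder Mini"):
    """Ratio of the fast model's throughput to the baseline's."""
    table = REPORTED_TOKENS_PER_SECOND
    return table[fast] / table[baseline]


def seconds_for(model, num_tokens):
    """Time to emit num_tokens at the reported rate, ignoring startup latency."""
    return num_tokens / REPORTED_TOKENS_PER_SECOND[model]


for name in ("GPT-4o Mini", "Gemini 2.0 Flash-Lite", "Claude 3.5 Haiku"):
    print(f"{name}: {speedup_over(name):.1f}x")

# Hypothetical 590-token completion at each reported rate.
print(f"{seconds_for('GPT-4o Mini', 590):.1f} s vs "
      f"{seconds_for('Mercury Coder Mini', 590):.2f} s")
```

The ratios come out to about 18.8, 5.5, and 18.2, which match the rounded 19x, 5.5x, and 18x figures in the text.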

Where faster text generation could matter

Inception Labs sees several possible uses for the speed advantage. The source article names code completion tools, conversational AI applications, resource-limited environments like mobile applications, and AI agents that need fast responses.

Code completion is an especially clear fit. If a developer is waiting for the next suggestion, even a short delay can change how useful the tool feels. A model that produces usable code faster may suit that workflow well, provided the output remains dependable.

Conversational AI is another area the source names. A system that responds faster may feel more interactive, but speed does not remove the need for accuracy. The article notes that these systems still confabulate frequently on many topics.

Resource-limited environments such as mobile applications are also part of the argument. The source does not claim that diffusion models solve every constraint, but it presents speed as a potential advantage when fast response matters.

The open questions are still important

Text diffusion models are not free of trade-offs. They typically require multiple forward passes through the network to produce a complete response, while traditional models need one pass per token. The advantage comes from processing tokens in parallel, which can still produce higher throughput despite that overhead.
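A back-of-the-envelope count makes the trade-off concrete. The sketch assumes a diffusion model that uses a fixed number of refinement steps regardless of output length; the function names and the figures of 256 tokens and 32 steps are hypothetical and not taken from Mercury or LLaDA. It also ignores per-pass cost, which can differ between the two approaches.

```python
def autoregressive_passes(num_tokens):
    """One forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens, refinement_steps):
    """Simplifying assumption: each pass updates every position at once,
    so the pass count depends on the step budget, not the output length."""
    return refinement_steps


def tokens_per_pass(num_tokens, passes):
    """Average number of output tokens settled per sequential pass."""
    return num_tokens / passes


num_tokens = 256        # hypothetical response length
refinement_steps = 32   # hypothetical diffusion step budget

ar = autoregressive_passes(num_tokens)
diff = diffusion_passes(num_tokens, refinement_steps)
print("autoregressive passes:", ar, "| tokens per pass:", tokens_per_pass(num_tokens, ar))
print("diffusion passes:", diff, "| tokens per pass:", tokens_per_pass(num_tokens, diff))
```

Under these assumptions the diffusion model needs multiple passes, as the text says, but far fewer sequential ones than the token count, which is where the throughput advantage would come from.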

Researchers are also watching whether the approach can scale. The source raises questions about whether larger diffusion models can match models like GPT-4o and Claude 3.7 Sonnet, whether they can produce reliable results without many confabulations, and whether they can handle increasingly complex simulated reasoning tasks.

The response from AI researchers has been open but cautious. Independent AI researcher Simon Willison told Ars Technica, “I love that people are experimenting with alternative architectures to transformers, it’s yet another illustration of how much of the space of LLMs we haven’t even started to explore yet.”

Former OpenAI researcher Andrej Karpathy wrote on X about Inception, “This model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!”

For now, the most grounded conclusion is narrow but meaningful. Mercury Coder and LLaDA show that diffusion-based language models may offer another route for smaller AI language models, especially where speed is a priority and comparable capability can be maintained.