Inception says diffusion can make AI text models faster

Inception, a Palo Alto-based company founded by Stanford professor Stefano Ermon, says it has built a diffusion-based large language model, or DLM. The company claims its DLMs can run up to 10x faster than traditional large language models (LLMs) while costing 10x less.

Inception is entering the AI market with a technical bet: the diffusion methods known for generating images, video, and audio can also reshape how text models work. The Palo Alto-based company, started by Stanford computer science professor Stefano Ermon, calls its approach a diffusion-based large language model, or DLM.

The claim is straightforward but ambitious. Inception says its model can handle familiar large language model tasks, including code generation and question-answering, while delivering much faster performance and lower computing costs.

What Inception says it has built

Most generative AI models receiving attention today fall broadly into two groups. Large language models, or LLMs, are used for text generation. Diffusion models are mainly associated with systems that create images, video, and audio, including Midjourney and OpenAI's Sora.

Inception is trying to combine those worlds. Its DLM is presented as a language model that uses diffusion technology rather than the sequential generation pattern common to traditional LLMs.

That distinction matters because of how text is produced. Ermon told TechCrunch that he has long studied, in his Stanford lab, how diffusion models could be applied to text. His starting point was that traditional LLMs are relatively slow compared with diffusion technology.

With traditional LLMs, Ermon said, "you cannot generate the second word until you’ve generated the first one, and you cannot generate the third one until you generate the first two."
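To make Ermon's point concrete, here is a minimal Python sketch of autoregressive (left-to-right) decoding. The `next_token` function and the toy model are hypothetical stand-ins, not Inception's or any vendor's code; the sketch only shows that producing N new tokens this way takes N model calls that must run one after another.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM: given every token so far, return the next one.
NextTokenFn = Callable[[List[str]], str]


def autoregressive_generate(
    next_token: NextTokenFn, prompt: List[str], n_new: int
) -> Tuple[List[str], int]:
    """Generate n_new tokens strictly left to right.

    Returns the new tokens and the number of sequential model calls.
    """
    tokens = list(prompt)
    calls = 0
    for _ in range(n_new):
        # Each call needs all earlier tokens, so call i+1 must wait for call i.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens[len(prompt):], calls


# Toy "model" that just labels each word by its position (illustration only).
def toy_model(tokens: List[str]) -> str:
    return f"w{len(tokens)}"


generated, calls = autoregressive_generate(toy_model, ["<s>"], 5)
print(generated, calls)  # prints: ['w1', 'w2', 'w3', 'w4', 'w5'] 5
```

However fast each individual call is, the loop itself cannot be parallelized, because every step consumes the output of the step before it.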

Diffusion models work differently. Instead of building output one word at a time, they begin with a rough version of the data they are generating and then refine it. In image generation, that means bringing a picture into focus. Inception's thesis is that a related process can be made useful for text.

Why parallel text generation is the core idea

The key concept behind Inception's model is parallelism. Ermon hypothesized that diffusion models could generate and modify large blocks of text in parallel, rather than producing language strictly word by word.
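One published family of text diffusion methods works by masking: generation starts from a fully masked sequence, and each refinement step fills in several positions at once. The Python sketch below illustrates that general idea under stated assumptions; it is not Inception's method, and the `denoise` interface, the confidence scores, and the toy denoiser are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical denoiser interface: it sees the whole, partially masked
# sequence and proposes a (token, confidence) pair for every masked slot at once.
Denoiser = Callable[[List[Optional[str]]], Dict[int, Tuple[str, float]]]


def parallel_unmask_generate(
    denoise: Denoiser, length: int, per_step: int
) -> Tuple[List[str], int]:
    """Fill a fully masked sequence by repeated parallel refinement.

    Returns the finished tokens and the number of refinement steps (model calls).
    """
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    seq: List[Optional[str]] = [None] * length  # None marks a masked position
    steps = 0
    while any(tok is None for tok in seq):
        proposals = denoise(seq)  # one call scores every masked position together
        if not proposals:
            raise RuntimeError("denoiser proposed nothing for the masked positions")
        # Keep the per_step most confident guesses; the rest stay masked
        # and get another chance on the next step.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:per_step]:
            seq[pos] = tok
        steps += 1
    return [tok for tok in seq if tok is not None], steps


# Toy denoiser that already "knows" the answer and is, arbitrarily,
# more confident about longer words. Illustration only.
TARGET = "diffusion models refine whole blocks of text".split()


def toy_denoiser(seq: List[Optional[str]]) -> Dict[int, Tuple[str, float]]:
    return {i: (TARGET[i], float(len(TARGET[i]))) for i, tok in enumerate(seq) if tok is None}


tokens, steps = parallel_unmask_generate(toy_denoiser, len(TARGET), per_step=3)
print(" ".join(tokens), steps)  # prints: diffusion models refine whole blocks of text 3
```

The seven-word sequence is finished in three model calls instead of the seven a left-to-right loop would need. Fewer steps is only part of the speed story, though: each step processes the whole sequence, and real systems may also revise tokens they have already placed, which this toy does not do.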

After years of trying, Ermon and a student achieved a major breakthrough, which they detailed in a research paper published last year.

The logic is important for anyone watching the AI infrastructure market. If a language model can make better use of parallel processing, it may reduce the time users wait for answers and the amount of compute required to deliver those answers. Inception says that is exactly what its DLMs are designed to do.

The company frames this as more than a small speed improvement. It says its models can use GPUs more efficiently. GPUs are the computer chips commonly used to run models in production, so efficiency at that layer can affect cost, latency, and deployment options.

"What we found is that our models can leverage the GPUs much more efficiently," Ermon said. "I think this is a big deal. This is going to change the way people build language models."

From Stanford research to a new AI company

Inception grew out of that research path. Ermon founded the company last summer after recognizing the potential of the breakthrough. He brought in two former students to co-lead the company: UCLA professor Aditya Grover and Cornell professor Volodymyr Kuleshov.

The company has not publicly detailed all of its financing. Ermon declined to discuss Inception's funding, though TechCrunch understands that the Mayfield Fund has invested.

Inception has also begun working with customers. According to Ermon, the company has secured several customers, including unnamed Fortune 100 companies. The stated appeal is reduced AI latency and increased speed, two issues that matter when AI systems are used in production rather than just in demonstrations.

That early customer interest points to the practical problem Inception is trying to solve. Many AI products are judged not only by whether they produce useful answers, but also by how quickly and economically they can do so at scale.

How Inception plans to offer DLMs

Inception is not presenting the model as a lab-only project. The company offers an API, on-premises deployment, and edge device deployment options. It also supports model fine-tuning and provides a suite of out-of-the-box DLMs for different use cases.

Those options suggest Inception is aiming at organizations with different requirements for where models run and how much they can be adapted. An API can fit teams that want hosted access. On-premises and edge deployment may matter for companies that need models closer to their own systems or devices.

The performance claims are the headline. Inception says its DLMs can run up to 10x faster than traditional LLMs while costing 10x less. The company also says its small coding model compares favorably with OpenAI's GPT-4o mini on quality while being much faster.

"Our ‘small’ coding model is as good as [OpenAI’s] GPT-4o mini while more than 10 times as fast," a company spokesperson told TechCrunch. "Our ‘mini’ model outperforms small open-source models like [Meta’s] Llama 3.1 8B and achieves more than 1,000 tokens per second."

Tokens are the small units of text, such as words or word fragments, that language models read and produce. A speed of 1,000 tokens per second would be impressive, assuming Inception's claims hold up.
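Some back-of-the-envelope arithmetic shows what these throughput figures mean for a user waiting on an answer. The 500-token response length is a hypothetical choice, and the 100 tokens-per-second baseline is simply "10x slower" than the claimed figure, not a measured number for any real model. The sketch also ignores start-up latency before the first token appears.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce num_tokens at a steady throughput (ignores start-up latency)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


claimed_tps = 1_000.0            # Inception says its 'mini' model exceeds this
baseline_tps = claimed_tps / 10  # hypothetical model that is 10x slower
answer_tokens = 500              # hypothetical response length

print(generation_seconds(answer_tokens, claimed_tps))   # prints: 0.5
print(generation_seconds(answer_tokens, baseline_tps))  # prints: 5.0
```

Under these assumptions, the same answer arrives in half a second instead of five seconds, which is the kind of latency gap that matters in production use.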

The larger question for language models

Inception's debut highlights a broader question in AI development: whether the dominant architecture for text generation is the only path forward. LLMs have become the standard way many people think about text AI, but Inception is arguing that diffusion can compete in the same territory.

The company still has to prove its claims in the market. Its pitch depends on whether DLMs can deliver useful language capabilities, not just speed. For customers, the relevant test will be whether faster response times and lower computing costs arrive without sacrificing the quality needed for code generation, question-answering, and other production tasks.

For now, Inception's message is clear. It wants to bring diffusion beyond media generation and into language modeling, with a focus on latency, speed, GPU efficiency, and cost.