Why AI quantization may not keep cutting inference costs

Quantization helps AI models run with fewer bits, making inference less computationally demanding. But a study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon suggests the technique can damage quality, especially when large models have been trained for a long time on large datasets.

Quantization has become one of the AI industry’s preferred ways to make models cheaper and easier to run. The idea is simple in broad terms: represent parts of a model with fewer bits, accept a little less precision, and reduce the mathematical work needed during inference.

But the trade-off may be sharper than many teams hoped. Research described by TechCrunch suggests that quantization has limits, and that those limits matter most for the large, heavily trained models that major AI labs are trying to deploy at lower cost.

What quantization changes inside an AI model

In AI, quantization means lowering the number of bits used to represent information. Bits are the smallest units a computer can process, so reducing them can make a model less demanding to run.

The technique often applies to model parameters, the internal variables a model uses to make predictions or decisions. Because models perform millions of calculations when they run, representing those parameters with fewer bits can lower the computational burden.
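
As a rough illustration of what "fewer bits" means for parameters, the sketch below applies symmetric uniform quantization to a handful of made-up weights. The function names, the example weights, and the choice of this particular scheme are assumptions for illustration only; the source does not say which quantization methods the study examined.

```python
def quantize(weights, bits):
    """Map floats to signed integer codes of the given bit width (symmetric, uniform)."""
    qmax = 2 ** (bits - 1) - 1  # largest code: 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest step and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical weights, not taken from any real model.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.6]

for bits in (8, 4):
    print(f"{bits}-bit mean absolute error: {mean_abs_error(weights, bits):.4f}")
```

On these example weights the 4-bit round trip loses noticeably more detail than the 8-bit one, which is the basic trade-off the rest of this article is about.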

That makes quantization especially attractive for inference, which is the process of running a model after training. When ChatGPT answers a question, for example, that is inference. Quantization is distinct from distilling, which the source article describes as a more involved and selective pruning of parameters.

The practical appeal is clear: if a model can deliver useful answers while requiring fewer resources, companies can serve more users at lower cost. The difficult question is how far bit precision can be reduced before model quality begins to suffer in ways that matter.

The study points to a hard limit

According to a study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models can perform worse when the original, unquantized model was trained for a long time on large amounts of data. In that situation, the better path may sometimes be to train a smaller model instead of compressing a larger one.

That finding cuts against a common deployment strategy. AI companies often train extremely large models because size and training scale are known to improve answer quality, then quantize those models to make them less expensive to serve.

The problem has already surfaced with Meta’s Llama 3. A few months ago, developers and academics reported that quantizing Llama 3 tended to be “more harmful” than quantizing other models, potentially because of how it was trained.

“In my opinion, the number one cost for everyone in AI is and will continue to be inference, and our work shows one important way to reduce it will not work forever,” Tanishq Kumar, a Harvard mathematics student and the first author on the paper, told TechCrunch.

That matters because inference can be more expensive in aggregate than training. The source gives one example: Google spent an estimated $191 million to train one of its flagship Gemini models. But if the company used a model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion a year.
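
The two estimates above can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the yearly inference figure is spent evenly across 365 days, which the source does not state, and the variable names are illustrative.

```python
TRAINING_COST_USD = 191_000_000             # estimated one-time training cost
INFERENCE_COST_USD_PER_YEAR = 6_000_000_000  # estimated yearly cost of the search scenario

# How many times larger one year of inference is than the training run.
ratio = INFERENCE_COST_USD_PER_YEAR / TRAINING_COST_USD

# Assumption: spending is spread evenly over the year.
days_to_match_training = 365 / ratio

print(f"One year of inference is about {ratio:.0f}x the training cost")
print(f"Inference spend matches the training cost after about {days_to_match_training:.0f} days")
```

Under those assumptions, a year of serving costs roughly 31 times the training run, and the serving bill passes the training bill in under two weeks.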

Scaling up makes the economics harder

Major AI labs have leaned into the idea that scaling up training with more data and compute will produce more capable models. Meta’s Llama 3 was trained on a set of 15 trillion tokens, while Llama 2 was trained on “only” 2 trillion tokens. The source explains that tokens represent bits of raw data, and that 1 million tokens is equal to about 750,000 words.
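
For a sense of scale, the token counts can be converted to approximate word counts with the source's rule of thumb of about 750,000 words per 1 million tokens. The helper name below is hypothetical, and the results are only as precise as that rule of thumb.

```python
WORDS_PER_MILLION_TOKENS = 750_000  # rule of thumb quoted in the source


def tokens_to_words(tokens):
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_MILLION_TOKENS // 1_000_000


llama3_tokens = 15_000_000_000_000  # 15 trillion
llama2_tokens = 2_000_000_000_000   # 2 trillion

print(f"Llama 3: about {tokens_to_words(llama3_tokens) / 1e12:.2f} trillion words")
print(f"Llama 2: about {tokens_to_words(llama2_tokens) / 1e12:.2f} trillion words")
print(f"Growth in training data: {llama3_tokens / llama2_tokens:.1f}x")
```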

In early December, Meta released Llama 3.3 70B, which the company says “improves core performance at a significantly lower cost.” That kind of claim reflects the central tension in current AI development: labs want stronger models, but they also need those models to be practical to run.

There are signs that simply scaling up may not keep delivering the same gains. Anthropic and Google reportedly trained enormous models that fell short of internal benchmark expectations. Even so, the source notes that there is little sign the industry is ready to meaningfully move away from established scaling approaches.

Quantization has helped bridge that gap by making large models easier to serve. The new research complicates that picture. If quantization becomes less reliable as models become more heavily trained, then cost reduction cannot depend only on lowering precision after training.

Why bit precision matters

Precision refers to how many digits a numerical data type can accurately represent. The data type FP8, for example, uses only 8 bits to represent a floating-point number.

Most models today are trained at 16-bit or “half precision” and then “post-train quantized” to 8-bit precision. In plain language, the model is trained with more numerical detail and later converted so certain components use a lower-precision format. That can preserve much of the benefit while cutting resource needs, but it can also cost accuracy.

Hardware vendors are pushing this further. Nvidia’s new Blackwell chip supports 4-bit precision, specifically a data type called FP4. Nvidia has pitched that capability as useful for data centers constrained by memory and power.
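
A simple calculation shows why lower-precision formats appeal to memory-constrained data centers. The sketch below assumes a model with 70 billion parameters (reading the "70B" in Llama 3.3 70B as a parameter count) and counts only the space needed to store the weights. It ignores activations, caches, and any per-block scaling data that real quantized formats carry, so actual footprints will differ.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70_000_000_000  # assumption: 70 billion parameters

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Halving the bit width halves the weight storage, from about 140 GB at 16-bit to about 35 GB at 4-bit in this example. The question the study raises is how much model quality is given up to get there.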

But the study’s message is that lower precision is not automatically better. Kumar says that unless the original model is incredibly large in terms of parameter count, precisions lower than 7- or 8-bit may produce a noticeable drop in quality.

What this means for AI efficiency

The lesson is not that quantization is useless. It remains a powerful method for reducing the cost of AI inference. The lesson is that it is not a limitless shortcut.

Kumar and his co-authors found that training models in “low precision” may make them more robust. That suggests model design and training strategy could matter as much as post-training compression when teams are trying to control inference costs.

“The key point of our work is that there are limitations you cannot naïvely get around,” Kumar concluded. “We hope our work adds nuance to the discussion that often seeks increasingly low precision defaults for training and inference.”

Kumar also acknowledged that the study was conducted at a relatively small scale and that the team plans to test more models in the future. Still, he argued that one point is likely to hold: reducing inference costs has real trade-offs.

“Bit precision matters, and it’s not free,” he said. “You cannot reduce it forever without models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, in my opinion much more effort will be put into meticulous data curation and filtering, so that only the highest quality data is put into smaller models. I am optimistic that new architectures that deliberately aim to make low precision training stable will be important in the future.”

For AI companies, that points toward a more careful efficiency strategy. Smaller models, better data curation, low precision training, and new architectures may all become more important if post-train quantization cannot keep absorbing the cost of ever-larger systems.