WIRED AI October 30, 2024

Why Meta is betting Llama 4 on more than 100,000 H100s

Meta says Llama 4 is being trained on a GPU cluster bigger than 100,000 H100s, with smaller versions expected first. The push shows how central compute scale, energy access, and Meta's open model strategy have become in the race against OpenAI and Google.

Meta's massive compute escalation points toward more powerful AI systems, though the story is mostly an infrastructure update without direct harm or autonomy claims.

Meta is preparing the next major step for its Llama AI models with a training effort built around an unusually large GPU cluster. On an earnings call Wednesday, CEO Mark Zuckerberg said Llama 4 development is well underway and that an initial launch is expected early next year.

The headline detail is the size of the compute system behind it. Zuckerberg said Meta is training the Llama 4 models on a cluster bigger than 100,000 H100s, the Nvidia chips widely used for AI training, and described it as larger than anything else he had seen reported.

A bigger training run for Llama 4

Zuckerberg gave investors a direct signal about Meta's AI ambitions: Llama 4 is not just a routine update. The company is putting major infrastructure behind the model family, and the models may not all arrive at once.

“We're training the Llama 4 models on a cluster that is bigger than 100,000 H100s, or bigger than anything that I've seen reported for what others are doing,” Zuckerberg said. “I expect that the smaller Llama 4 models will be ready first.”

He did not provide a detailed feature list for Llama 4. Instead, he said in broad terms that the models would bring “new modalities” and “stronger reasoning” and would be “much faster.” Those clues suggest Meta is trying to improve both what the models can handle and how efficiently they respond, while leaving the specifics for a later announcement.

The scale matters because many AI companies believe larger training runs, more data, and more computing power are essential to producing more capable models. Meta previously worked with Nvidia on clusters of about 25,000 H100s used to develop Llama 3. The move beyond 100,000 H100s marks a sharp escalation from that earlier generation.

The compute race is widening

Meta is not the only company pushing toward very large AI training systems. In July, Elon Musk said his xAI venture had worked with X and Nvidia to set up 100,000 H100s. At the time, Musk wrote on X, “It’s the most powerful AI training cluster in the world!”

OpenAI is also training GPT-5, the successor to the model that powers ChatGPT. OpenAI has said GPT-5 will be larger than its predecessor and will include other innovations, including a recently developed approach to reasoning. But it has not disclosed details about the computer cluster being used for training.

CEO Sam Altman has said GPT-5 will be “a significant leap forward” compared with its predecessor. Last week, after a news report said OpenAI's next frontier model would be released by December, Altman responded on X, “fake news out of control.”

Google is also moving ahead. On Tuesday, CEO Sundar Pichai said the company’s newest version of the Gemini family of generative AI models is in development.

Together, these updates show a race with two intertwined tracks: building larger and more capable AI models, and securing enough chips, data centers, and power to train them.

Open models make Meta's strategy different

Meta's Llama strategy stands apart because the models can be downloaded in full for free. That differs from systems offered by OpenAI, Google, and most other major companies, which are typically accessed through an API.

That download model has made Llama popular with startups and researchers that want direct control over their models, data, and compute costs. It also gives Meta a different role in the AI market: the company is not only building AI for its own products, but also supplying models that others can run and adapt.

Meta calls Llama “open source,” but the source article notes two important limits. The Llama license includes some restrictions on commercial use, and Meta does not share full details about how the models are trained. That makes it harder for outsiders to examine how the systems work.

The release history is still short. Meta released the first version of Llama in July of 2023 and made Llama 3.2 available this September. Llama 4 is now positioned as the next major test of whether Meta can pair open availability with frontier-scale training.

Energy and spending are part of the story

A cluster this large is not just a software project. Managing such a large number of chips can create major engineering challenges, and the power requirement is substantial.

According to one estimate cited in the source article, a cluster of 100,000 H100 chips would require 150 megawatts of power. By comparison, El Capitan, the largest national lab supercomputer in the United States, requires 30 megawatts.
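To put those figures side by side, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted above (100,000 chips, 150 megawatts, 30 megawatts). The function names are illustrative, and the per-chip figure is simply the cluster estimate divided evenly across chips, so it is an all-in average and not a measurement of a single H100's draw.

```python
# Figures as quoted in the article; the 150 MW value is an estimate, not a measurement.
CLUSTER_CHIPS = 100_000
CLUSTER_POWER_MW = 150
EL_CAPITAN_POWER_MW = 30


def watts_per_chip(power_mw: float, chips: int) -> float:
    """Average power per chip in watts, spreading the total evenly across chips."""
    return power_mw * 1_000_000 / chips


def power_ratio(a_mw: float, b_mw: float) -> float:
    """How many times larger power draw a is than power draw b."""
    return a_mw / b_mw


per_chip = watts_per_chip(CLUSTER_POWER_MW, CLUSTER_CHIPS)
ratio = power_ratio(CLUSTER_POWER_MW, EL_CAPITAN_POWER_MW)

print(f"Average power per chip: {per_chip:,.0f} W")   # 1,500 W
print(f"Cluster vs. El Capitan: {ratio:.1f}x")        # 5.0x
```

On these numbers, the estimated cluster would draw about five times the power cited for El Capitan.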

Meta executives did not directly answer an analyst question on Wednesday about energy access constraints in parts of the US that have made it harder for companies to build more powerful AI infrastructure. Even without that detail, the spending picture is clear. Meta expects to spend as much as $40 billion in capital expenditures this year on data centers and other infrastructure, more than 42 percent above 2023. The company also expects that spending to grow even faster next year.
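The spending figures imply a rough 2023 baseline. The short Python sketch below back-solves it from the two numbers given above. Treating $40 billion as the full-year total and 42 percent as the exact increase are simplifying assumptions, since the article presents the first as an upper bound and the second as a lower bound. The result is an implied figure, not a reported one.

```python
def implied_baseline(current_billion: float, growth_fraction: float) -> float:
    """Back out the prior-year figure from this year's figure and the growth rate."""
    return current_billion / (1 + growth_fraction)


# Assumptions: $40B is the full-year 2024 total, and growth is exactly 42 percent.
baseline_2023 = implied_baseline(40.0, 0.42)

print(f"Implied 2023 capital spending: about ${baseline_2023:.1f} billion")  # about $28.2 billion
```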

Those costs are rising while Meta's business remains strong. The company's total operating costs have grown about 9 percent this year, while overall sales, largely from ads, have surged more than 22 percent. That gives Meta larger profits and wider margins even as it commits billions of dollars to Llama.

The open AI tradeoff

Meta's open approach is also controversial. Some AI experts worry that freely available, more powerful AI models could help criminals launch cyberattacks or automate the design of chemical or biological weapons. Llama models are fine-tuned before release to limit misuse, but the source article notes that removing those restrictions is relatively trivial.

Zuckerberg remains committed to the strategy. “It seems pretty clear to me that open source will be the most cost effective, customizable, trustworthy, performant, and easiest to use option that is available to developers,” he said on Wednesday. “And I am proud that Llama is leading the way on this.”

Meta also has a direct product reason to keep investing. Zuckerberg said Llama 4's capabilities should support a wider range of features across Meta services. Today, the main Llama-based product is Meta AI, a ChatGPT-like chatbot available in Facebook, Instagram, WhatsApp, and other apps.

More than 500 million people use Meta AI each month, according to Zuckerberg. Meta expects revenue opportunities to develop over time through ads in the feature. CFO Susan Li said, “There will be a broadening set of queries that people use it for, and the monetization opportunities will exist over time as we get there.”

If that model works, Meta may be able to use its advertising business to fund the massive infrastructure behind Llama while still making the models broadly available to developers, startups, and researchers.