TechCrunch AI, February 16, 2025

Europe’s open source LLM plan tests digital sovereignty

OpenEuroLLM aims to create a series of open source LLMs covering all European Union languages and future EU market languages. The project has funding, compute partners and prior research to build on, but it also faces questions about coordination, openness, data access and overlap with EuroLLM.


Europe’s push for digital sovereignty now has a large language model project at its center. OpenEuroLLM brings together roughly 20 organizations with the goal of developing “truly” open source LLMs for every current European Union language and for languages tied to countries negotiating entry to the EU market, such as Albania.

The effort is ambitious by design. It is meant to support transparent AI in Europe, preserve linguistic and cultural diversity, and give companies in Europe foundation models they can build on. But the project also arrives with clear pressure points: a complex consortium, a limited model-building budget compared with corporate AI giants, unresolved questions about training data, and a near-namesake project already operating in Europe.

A sovereignty project built around language

OpenEuroLLM fits into a broader European push to keep mission-critical digital infrastructure closer to home. The source article points to cloud companies investing in local infrastructure, OpenAI offering customers the ability to process and store data in Europe, and the EU signing an $11 billion deal for a sovereign satellite constellation to rival Elon Musk’s Starlink.

In that context, open source LLMs are not only a technical project. They are part of a larger strategy: reduce dependence on external platforms, support local control over data and infrastructure, and make sure European languages are served by models developed with Europe’s priorities in mind.

The project is co-led by Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI, which AMD acquired last year for $665 million. Its participating organizations span academia, research and companies, including groups from Czechia, the Netherlands, Germany, Sweden, Finland and Norway, along with EuroHPC supercomputer centers in Spain, Italy, Finland and the Netherlands.

From the corporate side, Silo AI is joined by Aleph Alpha, Ellamind, Prompsit Language Engineering and LightOn. One notable absence is Mistral, the French AI unicorn that has positioned itself as an open source alternative to incumbents such as OpenAI. Hajič said he tried to approach Mistral, but that it did not lead to a focused discussion about participation.

What OpenEuroLLM is trying to build

The project’s top-line aim is described as “A series of foundation models for transparent AI in Europe.” In practical terms, the exact deliverables are still being worked out, but the plan is likely to include a core multilingual LLM for general-purpose tasks where accuracy matters, as well as smaller quantized versions for uses where efficiency and speed are more important.

Coverage is a major part of the challenge. OpenEuroLLM is intended to support the current 24 official EU languages and additional languages linked to countries negotiating access to the EU market. The goal is not simply to add many languages as a feature list, but to support the linguistic and cultural diversity behind those languages.

Hajič acknowledged that equal performance across languages will be difficult, especially where digital resources are scarce. The project therefore wants benchmarks that better reflect those languages and cultures, instead of relying on tests that may not represent them well.

The expected timeline is also clear. Hajič said he expects the first version(s) to be released by mid-2026, with final iteration(s) by the project’s conclusion in 2028. The project formally started on Saturday [February 1], but Hajič said preparation had been underway for a year, with the tender process opening in February 2024.

The head start behind the new project

OpenEuroLLM can be read in two ways: as a new project with little public output yet, or as a continuation of existing European work on language technology. The latter view matters because Hajič has also coordinated the High Performance Language Technologies (HPLT) project since 2022.

HPLT has focused on free and reusable datasets, models and workflows using high-performance computing. It is scheduled to end in late 2025, and Hajič described it as a kind of predecessor to OpenEuroLLM because most HPLT partners, except the U.K. partners, are also participating here.

The data foundation is substantial, going by the figures in the source article. Version 2.0 of the HPLT dataset was released four months ago. It was built from 4.5 petabytes of web crawls and more than 20 billion documents, and Hajič said additional Common Crawl data will be added.

That inherited work is central to the project’s answer to critics who see OpenEuroLLM as too new or too diffuse. Hajič said the group is not starting from zero in data, expertise, tools or compute experience, even if the public-facing project itself is still early.

Funding, compute and coordination questions

The stated budget for building the models is €37.4 million, with roughly €20 million coming from the EU’s Digital Europe Programme. Compared with the largest corporate AI investments, that figure is modest. But the broader financial picture is more complicated because compute is a major cost, and the project is connected to EuroHPC supercomputer centers.

The broader EuroHPC project has a budget of around €7 billion. Sarlin argued that OpenEuroLLM has meaningful resources because its direct funding mostly covers people, while compute should largely come through EuroHPC partnerships.

He also drew a line between building models and building products. OpenEuroLLM is not meant to create a chatbot or AI assistant. Its stated role is to provide an open source foundation model that companies in Europe can build upon.

Still, the number of participants has drawn criticism. Anastasia Stasenko, co-founder of LLM company Pleias, questioned whether a “sprawling consortia of 20+ organizations” could match the focus of smaller private AI companies. She pointed to Europe’s successes in AI through teams like Mistral AI and LightOn.

Hajič acknowledged the project’s many moving parts, but argued that the mix of academic expertise and company focus could bring something new. That is the core bet: a consortium may be slower and more complex than a single company, but it may also pool skills, data and infrastructure that no one participant could supply alone.

The hard meaning of open source AI

OpenEuroLLM also faces a familiar but unresolved question: what does open source mean for AI? In software, open source can be judged against established license definitions. In AI, openness can extend to datasets, pretrained models, weights and other components.

The Open Source Initiative has published a definition of open source AI, but the source notes that not everyone agrees with it. One reason is training data. Some AI systems are built with data that cannot be freely redistributed, even if the model is meant to be open.

Hajič said the project’s goal is to have everything open, while also noting limitations. Some data may not be redistributable, though it may be stored for future inspection. The source connects this to the need for high-risk AI systems to make information available to auditors under the terms of the EU AI Act.

There is also a coordination issue beyond OpenEuroLLM itself. EuroLLM, a separate EU co-funded project with nine partners, launched its first model in September and a follow-up in December. It has similar goals: building an open source European large language model that supports 24 official European languages and other strategically important languages.

Andre Martins, head of research at Unbabel, highlighted the similarity and expressed hope that the communities would collaborate instead of reinventing work each time a project gets funded. Hajič called the situation “unfortunate” and said he hoped cooperation might be possible, while noting that OpenEuroLLM’s funding restricts collaboration with non-EU entities, including U.K. universities.

The result is a project that is both technical and political. OpenEuroLLM does not need to beat every corporate model to matter. By the project leaders’ own framing, a good model with its components based in Europe could still be a positive outcome for digital sovereignty.