Bad AI translations are putting small-language Wikipedia at risk

Machine translation has made it easy to create Wikipedia pages in languages the authors do not speak. For vulnerable languages, those flawed pages can become training data for future AI systems, creating a damaging feedback loop.

Small-language Wikipedia editions are facing a problem that looks helpful at first: more articles, faster. But when those pages are made with weak machine translation and left uncorrected, they can damage the very languages they appear to support.

The risk is not only that readers encounter bad information. The deeper danger is that AI systems may later train on the same flawed pages, learn from those errors, and produce still more unreliable translations.

How Greenlandic Wikipedia exposed the problem

When Kenneth Wehr began managing the Greenlandic-language version of Wikipedia four years ago, he decided that most of it had to be removed. Wehr, who is 26 and grew up in Germany, had become deeply interested in Greenland after visiting as a teenager. He later moved to Copenhagen to study Greenlandic, a language spoken by some 57,000 mostly Indigenous Inuit people across Arctic villages.

The Greenlandic edition had been added to Wikipedia around 2003. By the time Wehr took over nearly 20 years later, it contained some 1,500 articles built by hundreds of contributors. On the surface, that looked like a success story for open, multilingual knowledge.

But Wehr found that the project was not what it appeared to be. He believed virtually every article had been written by people who did not actually speak Greenlandic. He suspected that perhaps only one or two Greenlanders had ever contributed.

The most serious issue was the spread of machine-translated text. Pages included basic grammar mistakes, nonsense words, and factual errors, including an entry that said Canada had only 41 inhabitants. Some articles contained strings of letters produced when translation tools could not find suitable Greenlandic words.

“It might have looked Greenlandic to [the authors], but they had no way of knowing,” complains Wehr.

Why AI makes bad pages easier to produce

Automation has long existed on Wikipedia. Bots fix broken links, repair formatting, correct spelling, and generate short formulaic pages about subjects such as rivers, cities, or animals. Those uses can support the platform because they handle repetitive tasks in narrow ways.

AI translation is different because it lets almost anyone create long, plausible-looking articles in a language they may not understand. Trond Trosterud, a computational linguist at the University of Tromsø in Norway, says AI has empowered users he calls “Wikipedia hijackers.” These users may be naive, enthusiastic, or well meaning, but their work can overwhelm smaller communities.

The problem is worse for vulnerable languages because machine translation is often less reliable for them. One reason is that there may be relatively little source text available online. Another is that some languages are difficult for translation systems to identify or process, especially when they resemble other languages or have structures that do not fit the way many systems work.

Greenlandic is one example. Wehr notes that many Greenlandic words are agglutinative, meaning they are built by attaching prefixes and suffixes to stems. That can make words highly context specific, sometimes carrying ideas that other languages would express with a full sentence.

The feedback loop threatening vulnerable languages

AI systems such as Google Translate and ChatGPT learn from large amounts of online text. For languages with few speakers, Wikipedia can be one of the biggest available sources of digital language data. That makes quality control especially important.

If Wikipedia pages are full of errors, those errors become part of the material AI systems learn from. The principle is a simple one: “Garbage in, garbage out.” Bad text trains bad models, and bad models then help create more bad text.
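The loop described above can be made concrete with a toy calculation. The sketch below is purely illustrative and does not model any real translation system or Wikipedia edition. It assumes that a model's error rate equals a fixed base error plus the flawed share of the corpus it learned from, and that nobody corrects the new pages. The function name, parameters, and page counts are all hypothetical.

```python
def simulate_feedback_loop(clean_pages, flawed_pages, new_pages_per_round,
                           rounds, base_error=0.1):
    """Return the flawed share of a corpus after each round of uncorrected
    machine translation. Toy model: every number here is an assumption."""
    clean, flawed = float(clean_pages), float(flawed_pages)
    shares = []
    for _ in range(rounds):
        corpus_share = flawed / (clean + flawed)
        # Assumption: the model inherits the corpus's flaws and adds its own
        # base error; the combined rate is capped at 100%.
        model_error = min(1.0, base_error + corpus_share)
        # New machine-translated pages join the corpus with no human review.
        flawed += new_pages_per_round * model_error
        clean += new_pages_per_round * (1.0 - model_error)
        shares.append(flawed / (clean + flawed))
    return shares


if __name__ == "__main__":
    # Hypothetical starting point: half of a 1,000-page corpus is flawed.
    for round_number, share in enumerate(
            simulate_feedback_loop(500, 500, 500, rounds=5), start=1):
        print(f"round {round_number}: {share:.1%} of pages flawed")
```

Under these assumptions the flawed share rises every round, because each generation of pages is produced by a model slightly worse than the corpus it learned from. A term for human correction is deliberately left out, to reflect editions that have few or no fluent editors.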

Kevin Scannell, a former professor of computer science at Saint Louis University who now builds software for endangered languages, explains that these models rely on raw data. He says there are no grammar books or dictionaries guiding them in that process, only the text they receive.

The available evidence suggests the risk is not isolated. Volunteers working on four African languages estimated to MIT Technology Review that between 40% and 60% of articles in their Wikipedia editions were uncorrected machine translations. MIT Technology Review also estimates that more than two-thirds of longer Inuktitut pages contain machine-translated portions.

Earlier research found similar dependence on Wikipedia as a data source. In 2020, Wikipedia was estimated to make up more than half the training data for AI models translating some languages spoken by millions across Africa, including Malagasy, Yoruba, and Shona. In 2022, a German research team found that Wikipedia was the only easily accessible online linguistic data source for 27 under-resourced languages.

Why small communities cannot simply fix everything

Large Wikipedia editions can absorb mistakes because many readers and editors are available to notice and repair them. Smaller editions often do not have that safety net. There may be few readers, few editors, or sometimes not a single regular editor.

That gap is easy to overlook from the outside, a blind spot that Yuet Man Lee, a Canadian teacher in his 20s, calls a “bigger-Wikipedia arrogance.” Lee used Google Translate and ChatGPT to translate several English Wikipedia articles into Inuktitut, thinking that someone might later improve them. Nobody has touched one of the articles since he created it.

Lee now says he may have made a mistake because he did not consider the possibility of contributing to a recursive loop. His example shows how good intentions can still produce harmful results when the target-language community is too small to review the work.

The stakes are not only academic. Abdulkadir Abdulkadir, a 26-year-old agricultural planner in northern Nigeria, spends three hours every day working on Fulfulde entries. He sees Fulfulde Wikipedia as a potential resource for farmers in remote villages, including people looking for information about seeds or crops in a language they understand.

But poor translation can make that information dangerous. Abdulkadir said a machine-translated article could “easily harm them” if it gives incorrect information. He recently had to correct an article about cowpeas because much of it was illegible.

Responsible translation needs real speakers

Wikipedia’s own Content Translate tool can automatically translate articles while preserving references and formatting. The tool depends on external machine translation systems, so it carries many of the same weaknesses. Each Wikipedia community can decide whether to allow it. English-language Wikipedia has largely banned its use, saying that some 95% of articles created with Content Translate failed to meet an acceptable standard without major additional work.

Amir Aharoni, a member of the volunteer Language Committee, says machine translation can be useful when used responsibly. But the central requirement is still human judgment from people who know the language.

For vulnerable languages, more content is not automatically progress. A small Wikipedia edition built on inaccurate machine translation can mislead readers, burden the few fluent editors who remain, and feed bad data back into AI systems. The clearest lesson from the Greenlandic, Inuktitut, and Fulfulde examples is that language preservation online depends on quality, not just quantity.