The European Union has 24 official languages and dozens more unofficial ones spoken across the continent. If you add in the European countries outside the union, that brings at least a dozen more into the mix. Add dialects, endangered languages, and languages brought by migrants to Europe, and you end up with hundreds of languages.
One thing many of us in technology can agree on is that the US dominates, and that extends to online languages. There are many reasons for this, mostly due to American institutions, standards bodies, and companies defining how computers, their operating systems, and the software they run worked in their nascent days. That is changing, but for the short term at least, it remains the norm. This has also led to the majority of the web being in English. An astounding 50% of websites are in English, despite it being the native tongue of only about 6% of the world’s population. Spanish, German, and Japanese come next, but a long way behind, each accounting for only 5-6% of the web.
As we delve deeper into the new wave of AI-powered applications and services, many are driven by data in large language models (LLMs). As much of the data in these LLMs is scraped (controversially in many cases) from the web, LLMs predominantly understand and respond in English. As we find ourselves at the start of, or in the midst of, a shift in technological paradigm caused by the rapid growth of AI tools, this is a problem, and we’re bringing that problem into a new age.
Europe already boasts several high-profile AI companies and projects, such as Mistral and Hugging Face. Google DeepMind also originated as a European company. The continent has research projects that develop language models to improve how AI tools comprehend less commonly spoken languages.
This article explores some of these initiatives, questions their effectiveness, and asks whether their efforts are worthwhile or if many users default to using English versions of tools. As Europe seeks to build its independence in AI and ML, does the continent have the companies and skills necessary to achieve its goals?
Terminology and technology primer
To make sense of what follows, you don’t need to know how models are created, trained, or function. But it’s helpful to understand a couple of basics about models and their human language support.
Unless a model’s documentation explicitly mentions that it’s multilingual or cross-lingual, prompting it or requesting a response in an unsupported language may cause it to translate back and forth or respond in a language it does understand. Both strategies can produce unreliable and inconsistent results, especially in low-resource languages.
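One rough way to notice when a model has replied in the wrong language is to count common function words in its answer. The snippet below is a minimal sketch of that idea, not a production language detector: the stopword lists are small, hand-picked assumptions, and the function names `guess_language` and `replied_in_requested_language` are made up for illustration.

```python
import re

# Tiny hand-picked stopword lists. These are illustrative assumptions,
# not a complete or authoritative resource for any language.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "with", "for", "a", "you"},
    "fr": {"le", "la", "les", "et", "de", "des", "une", "un", "pour", "avec", "est"},
    "de": {"der", "die", "das", "und", "ist", "mit", "für", "ein", "eine", "nicht"},
}


def guess_language(text):
    """Return the code whose stopwords appear most often, or None if none match."""
    # Split into lowercase runs of letters (Unicode-aware, so "für" stays intact).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def replied_in_requested_language(requested, reply):
    """True when the heuristic guess for the reply matches the requested code."""
    return guess_language(reply) == requested
```

A heuristic like this only works for languages you have listed and for replies long enough to contain a few function words, so treat a mismatch as a prompt to look closer rather than as proof.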
High-resource languages, such as English, benefit from abundant training data, whereas low-resource languages, such as Gaelic or Galician, have far less, which often leads to inferior performance.
The harder concept to explain regarding models is “open,” which is unusual, as software in general has had a fairly clear definition of “open source” for a while. I don’t want to delve too deeply into this topic, as the exact definition is still in flux and controversial. The summary is that even when a model calls itself “open” and is referenced as “open,” the meaning of “open” isn’t always the same.
Here are two other useful terms to know:
Training teaches a model to make predictions or decisions based on input data.
Parameters are variables learned during model training that define how the model maps inputs to outputs. In other words, they determine how it understands and responds to your questions. The larger the number of parameters, the more complex the model is.
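To make the two terms concrete, here is a deliberately tiny sketch, assuming a made-up "model" with just two parameters (`w` and `b`) rather than the billions found in an LLM. Training adjusts those parameters by gradient descent until the model's predictions match the toy data; the data, learning rate, and step count are all illustrative choices.

```python
# Toy data generated from y = 2x + 1. The model is never told the 2 or the 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(xs, ys, steps=2000, learning_rate=0.05):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # the model's two parameters before training
    n = len(xs)
    for _ in range(steps):
        # How far off is each prediction with the current parameters?
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Nudge each parameter in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys)
prediction = w * 10 + b  # apply the learned parameters to an unseen input
```

After training, `w` and `b` end up very close to 2 and 1. A language model follows the same basic loop at a vastly larger scale, which is why the amount of training data available in each language matters so much.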
With that brief explanation done, how are European AI companies and projects working to enhance these processes to improve European language support?
Hugging Face
When someone wants to share code, they typically provide a link to their GitHub repository. When someone wants to share a model, they typically provide a Hugging Face link. Founded in 2016 by French entrepreneurs in New York City, the company is an active participant in creating communities and a strong proponent of open models. In 2024, it started an AI accelerator for European startups and partnered with Meta to develop translation tools based on Meta’s “No Language Left Behind” model. It is also one of the driving forces behind the BLOOM model, a groundbreaking multilingual model that set new standards for international collaboration, openness, and training methodologies.
Hugging Face is a useful tool for getting a rough idea of the language support in models. At the time of writing, Hugging Face lists 1,743,136 models and 298,927 datasets. Look at its leaderboard for monolingual models and datasets, and you see the following ranking for models and datasets that developers tag (add metadata) as supporting European languages:
| Language | Language code | Datasets | Models |
|---|---|---|---|
| English English | en | 27,702 | 205,459 |
| English | eng | 1,370 | 1,070 |
| French | fra | 1,933 | 850 |
| Spanish Español | es | 1,745 | 10,028 |
| German Deutsch | de | 1,442 | 9,714 |
You can already see some issues here. These tags aren’t set in stone, and the community can add values freely. While most of them follow the standard language codes, there is some duplication, such as English appearing under both “en” and “eng.”
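Anyone comparing these numbers would first need to fold the duplicate tags together. The following is a minimal sketch of that clean-up, using the counts from the table above and a small hand-written mapping from three-letter ISO 639-3 codes to their two-letter ISO 639-1 equivalents. The function name `merge_duplicate_tags` is hypothetical, and this is not how Hugging Face itself processes tags.

```python
# Counts copied from the table above: language code -> (datasets, models).
TAG_COUNTS = {
    "en": (27702, 205459),
    "eng": (1370, 1070),
    "fra": (1933, 850),
    "es": (1745, 10028),
    "de": (1442, 9714),
}

# Three-letter ISO 639-3 codes mapped to two-letter ISO 639-1 codes.
# Only the languages in the table are covered; extend as needed.
ISO3_TO_ISO1 = {"eng": "en", "fra": "fr", "spa": "es", "deu": "de"}


def merge_duplicate_tags(tag_counts):
    """Sum dataset and model counts for tags that refer to the same language."""
    merged = {}
    for code, (datasets, models) in tag_counts.items():
        canonical = ISO3_TO_ISO1.get(code, code)  # leave unknown codes as they are
        prev_datasets, prev_models = merged.get(canonical, (0, 0))
        merged[canonical] = (prev_datasets + datasets, prev_models + models)
    return merged


merged = merge_duplicate_tags(TAG_COUNTS)
```

Even after merging, the totals are only as reliable as the tags developers chose to add, so they remain a rough indicator of language support.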
As you can see, the models are dominated by English. A similar issue applies to the datasets on Hugging Face, which lack non-English data.
What does this mean?
Lucie-Aimée Kaffee, EU Policy Lead at Hugging Face, said that the tags indicate that a model has been trained to understand and process this language or that the dataset contains materials in that language. She added that the confusion over language support often arises during training. “When training a large model, it’s common for other languages to accidentally get caught in training because there were some artefacts of it in that dataset,” she said. “The language a model is tagged with is usually what the developers intended the model to understand.”
As one of the main and busiest destinations for model developers and researchers, Hugging Face not only hosts much of their work, but also lets them create outward-facing communities to tell people how to use it.

Mistral AI
Perhaps the best-known Europe-based AI company is France’s Mistral AI, which unfortunately declined an interview. Its multilingual challenges partly inspired this article. At the FOSDEM developer conference in February 2024, linguistics researcher Julie Hunter asked one of Mistral’s models for a recipe in French, but it responded in English. However, 16 months is an eternity in AI development, and neither the company’s “Le Chat” chat interface nor running its 7B model locally reproduced the same error in recent tests. Interestingly, though, 7B did produce a spelling error in the opening line, “boueef,” and more might follow.
While Mistral sells several commercial models, tools, and services, its free-to-use models are popular, and I personally tend to use Mistral 7B for running tasks through local models.
Until recently, the company wasn’t explicit about its models having multilingual support, but its announcement of the Magistral model at London Tech Week in June 2025 confirmed support for several European languages.
EuroLLM
EuroLLM was created as a partnership between Portuguese AI platform Unbabel and several European universities to understand and generate text in all official European Union languages. The model also includes non-European languages widely spoken by immigrant communities and major trading partners, such as Hindi, Chinese, and Turkish.
Like some of the other open model projects in this article, its work was partly funded by the EU’s High Performance Computing Joint Undertaking program (EuroHPC JU). Many of them share similar names and aims, making it confusing to tell them apart. EuroLLM was one of the first, and as Ricardo Rei, Senior Research Scientist at Unbabel, told me, the team has learned a lot from the projects that have come since.
As Unbabel’s primary business is language translation, and translation is a key task for many multilingual models, the work on EuroLLM made sense to the Portuguese platform. Before EuroLLM, Unbabel had already been refining existing models to make its own and found them all too English-centric.
One of the team’s biggest challenges was finding sufficient training data for low-resource languages. Ultimately, the availability of training material reflects the number of people who speak the language. One of the common data sources used to train European language models is Europarl, which contains transcripts of the European Parliament’s activities translated into all official EU languages. It’s also available as a Hugging Face dataset, thanks to ETH Zürich.
Currently, the project has a 1.7B parameter model and a 9B parameter model, and is working on a 22B parameter model. In all cases, the models can translate, but are also general-purpose, meaning you can chat with them in a similar way to ChatGPT, mixing and matching languages as you do.
OpenLLM Europe
OpenLLM Europe isn’t building anything directly, but it’s fostering a Europe-wide community of LLM projects, especially for medium- and low-resource languages. Don’t let the one-page GitHub repository fool you: the Discord server is lively and active.
OpenEuroLLM, Lumi, and Silo
A joint project between several European universities and companies, OpenEuroLLM is one of the newer and larger entrants to the list of projects funded by EuroHPC. That means it has no public models as of yet, but it involves many of the institutions and individuals behind the Lumi family of models that focus on Scandinavian and Nordic languages. It aims to create a multilingual model, provide more datasets for other models, and conform to the EU AI Act.
I spoke with Peter Sarlin of AMD Silo, one of the companies involved in the project and a key figure in Finnish and European AI development, about the plans. He explained that Finland, in particular, has several institutes with significant AI research programs, including Lumi, one of the supercomputers that are part of EuroHPC. Silo, through its SiloGen product, offers open source models to customers, with a strong focus on supporting European languages. Sarlin pointed out that while sovereignty is an important motivation for him and Silo in creating and maintaining models that support European languages, the greater reason is expanding the business and helping companies build solutions for small markets such as Estonia.
“Open models are great building blocks, but they aren’t as performant as closed ones, and many businesses in the Nordics and Scandinavia don’t have the resources to build tools based on open models,” he said. “So Silo and our models can step in to fill the gaps.”


The Lumi models use a “cross-lingual training” technique in which the model shares its parameters between high-resource and low-resource languages.
All this prior work led to the OpenEuroLLM project, which Sarlin describes as “Europe’s largest open source AI initiative ever, including pretty much all AI developers in Europe apart from Mistral.”
While many efforts are underway and performing well, the training data issue for low-resource languages remains the biggest challenge, especially amid the move towards more nuanced reasoning models. Translations and cross-lingual training are options, but can create responses that sound unnatural to native speakers. As Sarlin said, “We don’t want a model that sounds like an American speaking Finnish.”
OpenLLM France
France is one of the more active countries in AI development, with Mistral and Hugging Face leading the way. From a community perspective, the country also has OpenLLM France. The project (unsurprisingly) focuses on French language models, with several models of different parameter counts and datasets, which help other projects train and improve their models that support French. The datasets include a mix of political discourse, meeting recordings, theatre shows, and casual conversations. The project also maintains a leaderboard of French models on Hugging Face, one of the few (active) European language model benchmark pages.
Do Europeans care about multilingual AI?
Europe is full of people and projects working on multilingual language models. But do consumers care? Unfortunately, getting language usage rates for proprietary tools such as ChatGPT or Mistral is almost impossible. I created a poll on LinkedIn asking if people use AI tools in their native language, English, or a combination of both. The results were a 50/50 split between English and a combination of languages. This could indicate that the number of people using AI tools in a non-English language is higher than you think.
Generally, people use AI tools in English for work and in their own language for personal tasks.
Kaffee, a German and English speaker, said: “I use them mostly in English because I speak English at work and with my partner at home. But then, for personal tasks…, I use German.”
Kaffee mentioned that Hugging Face was working on a soon-to-be-published research project that fully analysed the usage of multilingual models on the platform. She also noted anecdotally that their usage is on the rise.
“Users have a conception that models are now more multilingual. And with the accessibility through large models like Llama, for example, being multilingual, I think that made a big impact on the research world regarding multilingual models and the number of people wanting to now use them in their own language.”
The internet was always supposed to be global and for everyone, but the damning statistic that 50% of sites are in English shows it never really worked out that way. We’re entering a new phase in how we access information and who controls it. Maybe this time, the (AI) revolution will be international.