Meta's quest to translate underserved languages is marking its first victory with the open source release of a language model able to decipher 202 languages.
Named after Meta's No Language Left Behind initiative and dubbed NLLB-200, the model is the first able to translate so many languages, according to its makers, all with the goal of improving translation for languages overlooked by similar projects.
"The vast majority of improvements made in machine translation in the last decades have been for high-resource languages," Meta researchers wrote in a paper [PDF]. "While machine translation continues to grow, the fruits it bears are unevenly distributed," they said.
According to the announcement of NLLB-200, the model can translate 55 African languages "with high-quality results." Prior to NLLB-200's creation, Meta said fewer than 25 African languages were covered by widely used translation tools. When tested against the BLEU standard, Meta said NLLB-200 showed an average improvement of 44 percent over other state-of-the-art translation models. For some African and Indian languages, the improvement reportedly went as high as 70 percent.
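For a sense of what the 44 percent figure means, here is a minimal sketch that assumes it is a relative gain in BLEU score over a baseline model. The scores below are invented for illustration and do not come from Meta's paper, and `relative_improvement` is a hypothetical helper, not part of any NLLB tooling.

```python
def relative_improvement(baseline_bleu: float, new_bleu: float) -> float:
    """Return the percentage gain of new_bleu over baseline_bleu."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (new_bleu - baseline_bleu) / baseline_bleu * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline = 25.0   # assumed BLEU of an earlier model
candidate = 36.0  # assumed BLEU of a newer model

gain = relative_improvement(baseline, candidate)
print(f"Relative BLEU improvement: {gain:.0f} percent")
```

Under that assumption, a model that moves from 25 to 36 BLEU has improved by 44 percent, even though the absolute gain is 11 points.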
Along with its release on GitHub as an open-source model, Meta said it's also providing $200,000 in grants to nonprofits willing to research real-world applications for NLLB-200.
Lofty goals aside, Meta is already putting NLLB-200 to work. The model and other results from the NLLB program "will support more than 25 billion translations served every day on Facebook News Feed, Instagram, and our other platforms," the company said.
In addition, Meta has been working with the Wikimedia Foundation to use NLLB-200 as the back end of Wikipedia's Content Translation Tool. By including NLLB-200, the CTT added 10 languages that were unsupported by any other translation tool.
There are still hurdles. Meta said doubling NLLB's capabilities took quite a bit of work, which it managed through "regularization and curriculum learning, self-supervised learning and diversifying back-translation." Meta also made extensive use of model distillation, which uses the outputs of previously trained models as training signal for newer, smaller ones.
As part of its open sourcing of NLLB-200, Meta is also releasing the new Flores-200 evaluation dataset it built for the project, seed training data, its 200-language toxicity list, its new LASER3 sentence encoder, the stopes data mining library, dense transformer models with 3.3 billion and 1.3 billion parameters, 1.3 billion and 600 million parameter models distilled from NLLB-200, and NLLB-200 itself, which contains 54.5 billion parameters.
Not all communities may welcome the inclusion of their language in NLLB, or other programs for that matter. New Zealand's Māori community faced off against translation companies last year, arguing the entities didn't have a right to buy language data and sell the Māori language back to its speakers. ®