from Hacker News

No Language Left Behind

by pesenti on 7/6/22, 7:52 PM with 159 comments

by Etheryte on 7/6/22, 9:17 PM
I'll believe it when I actually see it. I'm a native of a reasonably small language spoken by about a million people and never have I ever seen a good automatic translation for it. The only translations that are good are the ones that have been manually entered, and those that match the structure of the manually entered ones. I think the sentiment is laudable and wish godspeed to the people working on this, but for the time being I don't see it becoming a reality yet. When Google Translate regularly struggles even with big pairs such as German-English-German, I have reservations about someone making it work for languages where datasets are orders of magnitude smaller.
by pesenti on 7/6/22, 8:35 PM
Blog post: https://ai.facebook.com/blog/nllb-200-high-quality-machine-t...
Paper: https://research.facebook.com/publications/no-language-left-...
Github: https://github.com/facebookresearch/fairseq/tree/nllb/
by jkw on 7/6/22, 10:21 PM
Hey all, I work on this project. Full list of languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...
As well as in the research paper: https://research.facebook.com/publications/no-language-left-...
by mikewarot on 7/6/22, 9:00 PM
The analogy I like the most is that they've found the "shape" of languages in high dimensions, and if you rotate the shape for English the right way, you get an unreasonably good fit for the shape of Spanish, and again for all the other languages.
We're at a point where it's now possible to determine the shape of every language, provided there are enough speakers of the language left who are both able and willing to help.
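To make the "rotate the shape" analogy concrete, here is a toy 2-D sketch. Everything in it is an assumption for illustration: the word vectors are invented, the "Spanish" space is constructed as an exact rotation of the "English" one, and fitting a least-squares rotation from a small seed dictionary is a classic embedding-alignment trick, not a description of how NLLB itself works.

```python
import math

def rotate(point, angle):
    """Rotate a 2-D point counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def best_rotation_angle(pairs):
    """Least-squares rotation angle that maps each source point onto its target."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in pairs)
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in pairs)
    return math.atan2(cross, dot)

# Invented toy "embeddings" (hypothetical values, not taken from any real model).
english = {
    "dog": (1.0, 0.2),
    "cat": (0.8, 0.6),
    "house": (-0.5, 1.0),
    "water": (-0.9, -0.4),
}
dictionary = {"dog": "perro", "cat": "gato", "house": "casa", "water": "agua"}

# Assumption: the Spanish space is the English space rotated by 40 degrees.
TRUE_ANGLE = math.radians(40)
spanish = {es: rotate(english[en], TRUE_ANGLE) for en, es in dictionary.items()}

# Fit the rotation from three known word pairs only.
seed = ["dog", "cat", "house"]
angle = best_rotation_angle([(english[w], spanish[dictionary[w]]) for w in seed])

def translate(word):
    """Rotate the English vector, then pick the nearest Spanish vector."""
    mapped = rotate(english[word], angle)
    return min(spanish, key=lambda es: math.dist(mapped, spanish[es]))

print(translate("water"))
```

"water" is not in the seed list, yet the fitted rotation lands it on "agua" because the two toy spaces share a shape by construction. Real embedding spaces are only approximately related this way, so the fit would be noisy rather than exact.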
<Snark> Once done, Facebook can then commodify their dissent, and sell it back to them in their native language. </Snark>
by Groxx on 7/7/22, 5:38 AM
>REAL-WORLD APPLICATION
>Translating Wikipedia for everyone
Hmmm.
While there is very definitely utility in doing things like this, I do kinda fear "poisoning the well"-like effects of feeding (even partially-) AI-generated-data into extremely common AI-data-sources.
There's some info on it in a blog post[1] and the MediaWiki "Content translation" page[2], but does anyone know of any studies on the quality of the translations produced? I can absolutely see it being a huge time-saver for people who are essentially fluent in both (there's a lot of semi-mechanical drudgery in translating stuff like this that could be mostly eliminated)... but people are pretty darn good at choosing the easy option of trusting whatever they're given rather than being as careful as they should be. It kinda feels like it runs the risk of passively encouraging people to trust the machine's choice over their own, as long as it isn't obviously nonsense, and the cumulative effect could be rather large after a while.
[1]: https://diff.wikimedia.org/2021/11/16/content-translation-to...
[2]: https://www.mediawiki.org/wiki/Content_translation
by jw4ng on 7/6/22, 9:06 PM
Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Language Left Behind, seeing the comments flowing through. If you want to ask us anything, go for it!
by kgeist on 7/7/22, 5:23 AM
I wonder how it differs from what Yandex.Translate did back in 2016: [0]
>The affinity of languages allows one common model to be trained for their translation. That is, “under the hood” of the translator, the same neural network translates into Russian from Yakut, Tatar, Chuvash and other Turkic languages. This approach is called many-to-one, that is, "from many languages into one." This is a more versatile tool than the classic bilingual neural network. And most importantly, it is the many-to-one approach that makes it possible to use knowledge about the structure and vocabulary of the Turkic languages, learned on the rich material of Turkish or Tatar, to translate languages like Chuvash or Yakut, which are less “resource-rich”, but no less important for the cultural diversity of the planet.
>In order to create a unified model for translating Turkic languages, Yandex developed a synthetic common script. Any Turkic language is translated into it, so that, for example, the Tatar “dүrt” (“four”) written in Cyrillic becomes similar to the Turkish dört (“four”), not only from the point of view of a person, but also at the level of similarity of lines for a computer.
This way they added support for Turkic and Uralic languages which are very underrepresented on the Internet. But I don't know what the quality of their translation is: even though I live in a region where Mari is spoken (indigenous Uralic language) and my wife is Mari, neither of us, sadly, speaks the language.
[0] https://techno-yandex-ru.translate.goog/machine-translation/...
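A rough sketch of the common-script idea from the quote: map each writing system into one shared alphabet so that cognates look alike as strings. The mapping table below is hypothetical and covers only the letters needed for the "four" example; Yandex's actual synthetic script is not described in the quote, so nothing here should be read as their design.

```python
from difflib import SequenceMatcher

# Hypothetical Cyrillic-to-common-script table (illustration only; a real
# system would need a full, carefully designed mapping per language).
CYRILLIC_TO_COMMON = {
    "д": "d",
    "ү": "ü",
    "р": "r",
    "т": "t",
}

def to_common(text):
    """Lowercase the text and replace every mapped character; keep the rest."""
    return "".join(CYRILLIC_TO_COMMON.get(ch, ch) for ch in text.lower())

def similarity(a, b):
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()

tatar_four = "дүрт"    # Tatar "four", written in Cyrillic
turkish_four = "dört"  # Turkish "four", written in Latin script

before = similarity(tatar_four, turkish_four)
after = similarity(to_common(tatar_four), to_common(turkish_four))

print(to_common(tatar_four), to_common(turkish_four))
print(f"similarity before: {before:.2f}, after: {after:.2f}")
```

In their original scripts the two words share no characters at all, so a model that sees raw text has no surface hint that they are related. After mapping, they differ in a single vowel, which is the kind of overlap a shared model can exploit.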
by microtherion on 7/6/22, 10:08 PM
As a native Swiss German speaker, my native language is not only low resource in general, but has the additional difficulty of not having a standardized orthography (many native speakers will exclusively write in Standard German, and use Swiss German only for spoken communication).
So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written interface, and an ambivalent attitude of many speakers towards the very idea of writing the language.
by otreblatercero on 7/7/22, 12:09 AM
Not a single mesoamerican language is present. Maya, Náhuatl, Otomí, Zapoteco, etc. And these languages are big, they are spoken by millions and even have literature. Náhuatl and Maya are spoken in Central America.
by albertzeyer on 7/6/22, 10:13 PM
Note that very recently Google has done something very similar: "Building Machine Translation Systems for the Next Thousand Languages": https://arxiv.org/abs/2205.03983 https://ai.googleblog.com/2022/05/24-new-languages-google-tr...
The Facebook paper has some direct comparison to that work.
by yellowapple on 7/7/22, 12:13 AM
Hopefully the Scots language model wasn't trained on Wikipedia.
by btheshoe on 7/6/22, 8:07 PM
I'm not entirely sure why low resource languages are seen as such a high priority for AI research. It seems that by definition there's little payoff to solving translation for them.
by labrador on 7/6/22, 10:40 PM
I'll know AI translators are any good when the United Nations starts using them
"Skills required: United Nations translators are required to have a perfect command of their main language and an excellent knowledge of, in most cases, two other official languages"
https://www.un.org/dgacm/en/content/translation
by thamer on 7/7/22, 8:42 AM
Does this mean that Facebook's advertising system will finally start rejecting ads calling for genocide in Myanmar, and that they will finally flag comments expressing the same intent? As recently as March of this year there were reports that Facebook accepted ads that said "The current killing of the Kalar is not enough, we need to kill more!" or "They are very dirty. The Bengali/Rohingya women have a very low standard of living and poor hygiene. They are not attractive".
Full story: https://abcnews.go.com/Business/wireStory/kill-facebook-fail...
These were submitted to test Facebook's systems, because there's a good reason not to trust their promises on this front. Facebook was used extensively to propagate hate speech in Myanmar during the crisis of 2017, with their moderation tools and hate speech detection system letting through a ton of hateful content with real-world consequences, in the course of an actual ethnic cleansing campaign.
Other references: "Facebook Admits It Was Used to Incite Violence in Myanmar" https://www.nytimes.com/2018/11/06/technology/myanmar-facebo... (2018)
"Violent hate speech continues to thrive on Facebook in Myanmar, AP report finds" https://www.cbsnews.com/news/myanmar-facebook-violent-hate-s... (9 months ago)
by vjerancrnjak on 7/6/22, 9:27 PM
What are hardware requirements to run this?
I see the mixture model is ~ 300 GB and was trained on 256 GPUs.
I assume distilled versions can easily be run on one GPU.
by kwhitefoot on 7/6/22, 8:39 PM
What is a "low resource language"?
by Tabular-Iceberg on 7/6/22, 9:19 PM
My concern with this is that in low resource languages the unavoidable biases of the ML models might overpower their own organic development.
We shrug off all the little quirks of machine translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contains those quirks, I fear those will quickly become the standard way of writing and even speaking in those languages.
by TaupeRanger on 7/6/22, 8:53 PM
So they have a system that can translate to languages for which there isn't as much data as English, Spanish, etc. Waiting for a Twitter thread from a native speaker of one of these "low resource languages" to let us know how good the actual translations are. Cynically, I'd venture that they hired some native speakers to cherry pick their best translations for the story books. But mostly this just seems like a nice bit of PR (calling it a "breakthrough", etc.). I can't imagine this is going to help anyone who actually speaks a random, e.g., Nilo-Saharan language.
by account42 on 7/7/22, 8:44 AM
> Essential cookies
> These cookies are required to use Meta Products. They’re necessary for these sites to work as intended.
What cookies does Facebook "need" to serve a simple article?
by LtWorf on 7/6/22, 10:05 PM
Facebook translations are horrifying for the mainstream languages already. They go from completely wrong to kinda understandable but still wrong.
by NoInkling on 7/7/22, 2:15 AM
I know DeepL doesn't do low-resource languages, but it would be interesting to see a translation quality comparison between the two.
by enos_feedler on 7/6/22, 10:13 PM
I was two sentences in before I realized the headline wasn’t “No Luggage Left Behind”
by schoen on 7/7/22, 12:43 AM
I wonder if spy agencies have already developed, but not published, high-quality SMT methods for lots of minority and little-known languages. :-(
(Edit: and speech-to-text models.)
by _nalply on 7/7/22, 11:43 AM
"No Language Left Behind" - really?
Did the people at Meta think about the Signed Languages of the Deaf?
I didn't find a mention. Even Ctrl-F deaf didn't yield anything.
by langsoul-com on 7/7/22, 12:37 PM
So so many words but not a hint of any demo. It's just magic according to Facebook. Plz couldn't they at least have a crappy demo to break?
by pdonis on 7/6/22, 11:26 PM
tl/dr: Now your words can be misconstrued by far more people than before, because AIs will translate the misunderstandings into as many languages as possible.
by zzzeek on 7/6/22, 11:12 PM
So glad it's Facebook doing this and not some other weird company, when translating and delivering information to every culture on the planet it's good to have a trustworthy, ethical company without any past (or heck, even any current, ongoing) issues in spreading misinformation around the globe and contributing to the rise of fascism across the world while profiting massively off of it and denying any culpability, making sure it all goes smoothly.
by bvanderveen on 7/6/22, 9:40 PM
Great! Facebook no longer have to provide content moderation in all the various corners of the world where they could accidentally enable the dissemination of misinformation and hate speech in minority languages. They can simply transform it into English and run it back through the existing moderation tooling!
Understanding foreign culture is about reading automated translations of online comments into your native language. It has nothing to do with putting the effort into learning a language and understanding the nuances and current events and issues of the culture it embeds.
The ESL (English as a single language) speakers over at Facebook don't even need to understand foreign cultures, because they already know everyone in the world needs to spend their lives staring into the Metaverse. So grateful that they are working on the world's fattest pipeline for exporting Anglophone culture to every corner of the planet!