by shriphani on 2/4/20, 1:01 AM with 5 comments
by pattusk on 2/4/20, 3:46 AM
Instead, it looks like this just performs language detection. Is there a significant advantage to that method over reusing one of the many existing open source solutions based on simpler models, such as [1], and retraining it with a corpus that includes the language(s) that weren't supported? (A sketch of what that retraining might look like is below.) You offer a comparative table for FastText & GCP; how do you explain FastText's abysmal precision on English? The value just seems way too low not to be a bug of some sort.
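For reference, a minimal sketch of the kind of retraining the comment alludes to, using fastText's supervised mode. The training-file name, hyperparameters, and example sentences here are illustrative assumptions, not anything from the post; the assumed input format is fastText's standard one, with each line prefixed by a __label__xx language tag:

    import fasttext  # pip install fasttext

    # Hypothetical training file: one sentence per line, prefixed with a
    # language label, e.g. "__label__en This is an English sentence."
    # A previously unsupported language is added simply by including
    # labelled examples for it in the corpus.
    model = fasttext.train_supervised(
        input="langid_train.txt",  # illustrative path
        minn=2, maxn=4,            # character n-grams, useful for language ID
        dim=16,                    # small vectors keep the model compact
        loss="hs",                 # hierarchical softmax scales to many labels
    )

    # Predict the most likely language of a sentence.
    labels, probs = model.predict("Bonjour tout le monde")
    print(labels[0], probs[0])     # e.g. ('__label__fr', 0.98)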
by nl on 2/4/20, 3:17 AM
The authors knew this, because they compare it in the paper, but they don't call it out in the post!
Edit: just realised the link on popular "open source" goes to the FastText post I linked below. Still, I think it would have been good to note this explicitly!