from Hacker News

Ask HN: Kind of data needed to build a language model?

by goy on 8/9/22, 1:46 PM with 3 comments

On DeepL, Google Translate, etc., the quality of translation for less popular languages is really bad, especially for oral-only dialects. I think this is sometimes due to the lack of written documents or dictionaries in these languages. What kind of data do you think I should gather from my local community if I want to build translation software for our dialect?

by PaulHoule on 8/9/22, 2:02 PM
See here for some machine translation training data sets:
https://metatext.io/datasets-list/translation-task
They might not have what you're looking for, but you'll probably need something similar to one of the data sets they list.
Most systems now are trained on parallel corpora, i.e. the same sentences written in both languages and aligned pair by pair. For instance, there is a collection of 30,000 sentences in English and Japanese listed on that site. If you've got enough training examples you don't need a dictionary, a specification of the grammar or anything else. You need a lot of text though.
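To make "parallel corpus" concrete, here is a minimal sketch of one way the collected data could be stored and cleaned. Everything in it is an assumption for illustration, not the format of any data set on that site: the two-column TSV layout, the `source`/`target` column names, the placeholder sentences, and the helper names `load_pairs` and `split_pairs` are all made up.

```python
import csv
import io
import random

# Hypothetical sample file: one sentence pair per line, tab-separated.
# "dialect sentence N" is a placeholder for a real transcription.
RAW_TSV = (
    "source\ttarget\n"
    "dialect sentence 1\tWhere is the market?\n"
    "dialect sentence 2\tI am going home.\n"
    "dialect sentence 2\tI am going home.\n"  # exact duplicate
    "dialect sentence 3\t\n"                  # missing translation
    "dialect sentence 4\tThe rain came early this year.\n"
    "dialect sentence 5\tMy sister is a teacher.\n"
)


def load_pairs(tsv_text):
    """Read (source, target) pairs, dropping incomplete rows and exact duplicates."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    seen = set()
    pairs = []
    for row in reader:
        src = (row["source"] or "").strip()
        tgt = (row["target"] or "").strip()
        if not src or not tgt:
            continue  # a pair is only useful if both sides are present
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        pairs.append((src, tgt))
    return pairs


def split_pairs(pairs, valid_fraction=0.25, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


pairs = load_pairs(RAW_TSV)
train, valid = split_pairs(pairs)
print(len(pairs), len(train), len(valid))
```

The point of the sketch is only the shape of the data: each record is one sentence in the dialect next to its translation, and a held-out slice lets you check whether a model trained on the rest generalizes.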
by mikewarot on 8/9/22, 2:35 PM
I shudder at the complexity of the task, but I almost think you'd need to manually tweak the vector space of a model that already works, and I have no idea how to do that in practice.
A machine has to grind through a huge amount of text millions of times to tease out the shape of a language, and that doesn't sound like something you have. If you have the time of native speakers, it might be possible to build tools that let them interactively correct the most "off" parts of the model.
by he11ow on 8/9/22, 8:29 PM
I'd recommend looking at the fast.ai NLP course[1]. The relevant lessons are videos 8, 9 and, crucially, 10. (The first few minutes are about an ethics conference, then the lecture starts in earnest.) Basically, you can then do similar things using Hugging Face, as indeed many have (you can explore the models in their hub)[2].
[1] https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQju...
[2] https://github.com/huggingface/notebooks/blob/main/examples/...