by rayshan on 4/10/23, 5:12 PM with 149 comments
by lukeschlather on 4/10/23, 7:04 PM
For diacritics in French or Spanish, the accented letters are logically single characters. I can't think of an example where it's actually useful to split such a letter into a separate token, but I could see it happening without being harmful. I do think it's possible French is just unusual and simply needs more tokens. When I think about how I process French, I probably do treat a pathological example like "Je l'ai aimé" as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as recognizing a complexity difference between French and English writing.
But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad and seems certain to hurt performance on non-Roman-script languages. There's no point in having tokens that split characters.
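As a rough illustration of the byte-level splitting being described (not something from the thread), here is a minimal Python sketch, assuming the tiktoken package is available; it uses the GPT-2 encoding and a few arbitrary sample characters:

```python
# Minimal sketch (assumes `tiktoken` is installed) showing how a byte-level
# BPE such as the GPT-2 encoding can split a single accented or non-Latin
# character into several tokens: the character is first turned into UTF-8
# bytes, and tokens are merged from those bytes.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for ch in ["e", "é", "ї", "愛"]:
    token_ids = enc.encode(ch)
    n_bytes = len(ch.encode("utf-8"))
    print(f"{ch!r}: {n_bytes} UTF-8 byte(s) -> {len(token_ids)} token(s) {token_ids}")
```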
by kouteiheika on 4/10/23, 6:08 PM
> One of the models listed above called NLLB (No Language Left Behind) has been open sourced by Facebook allowing for translation for 200 languages.
It was not. The model's weights are under CC-BY-NC, which certainly motivates commercial entities to not leave those languages behind. /s
by FredPret on 4/10/23, 6:26 PM
I sometimes wonder what it takes to unseat a lingua franca, but it looks like we won't see that soon. English is set to dominate for a long time.
by galaxytachyon on 4/10/23, 5:54 PM
I think even humans have to spend extra energy to speak a language they weren't born with, no matter how fluent they are in it. I don't know about people who grew up multilingual.
by wolfium3 on 4/10/23, 6:04 PM
by karmoka on 4/10/23, 6:35 PM
by bob1029 on 4/10/23, 6:45 PM
For example, what would the real-world performance of ChatGPT be if we had trained it predominantly on German or Korean text?
Is English actually the best language/structure for this system?
by wordpad25 on 4/10/23, 6:48 PM
by rubywilde on 4/12/23, 11:29 AM
The author compares different encoders: Facebook's NLLB and GPT-2. Where did the title come from?
Another point is that OpenAI changed the encoder for the chat models. Link: https://github.com/openai/openai-cookbook/blob/main/examples...
Now English is less heavily optimized for token usage and other languages are much better balanced. E.g. Ukrainian now takes only about twice as many tokens, whereas before it took 6 times more.
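To reproduce that kind of comparison, here is a small sketch (assuming tiktoken is installed; the sample sentences are illustrative, not taken from the article) that counts tokens under the older GPT-2 encoding and the newer cl100k_base encoding used by the chat models:

```python
# Sketch: compare token counts per language under the GPT-2 encoding and the
# cl100k_base encoding used by the chat models. Assumes `tiktoken` is
# installed; the sample sentences are illustrative.
import tiktoken

samples = {
    "English": "I would like a pizza, please.",
    "Ukrainian": "Я хотів би піцу, будь ласка.",
}

for encoding_name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    counts = {lang: len(enc.encode(text)) for lang, text in samples.items()}
    print(encoding_name, counts)
```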
by FrostKiwi on 4/11/23, 1:05 AM
It's not just broken grammar; there's a surprising lack of creativity that the English output doesn't suffer from. Writing with ChatGPT in English, then running the result through DeepL and fixing the auto-translation, gives vastly better results than prompting ChatGPT to respond in an Asian language directly.
by mgaunard on 4/10/23, 7:12 PM
Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.
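A toy version of that idea (my own illustration, not something from the article) just splits on words and punctuation; the trade-off is noted in the code comments:

```python
# Toy word-level tokenizer (illustrative only). It yields one token per word
# in any language, but the vocabulary would then need an entry for every
# surface form ("voudrais", "voudrait", ...), which is why production models
# use subword or byte-level BPE instead.
import re

def word_tokenize(text: str) -> list[str]:
    # Words are runs of word characters; punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("I want a pizza"))         # ['I', 'want', 'a', 'pizza']
print(word_tokenize("Je voudrais une pizza"))  # ['Je', 'voudrais', 'une', 'pizza']
```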
by Imnimo on 4/10/23, 6:37 PM
by startupsfail on 4/10/23, 7:52 PM
Yes, there is overhead from localization. So what? That overhead has always been there for software.
by jinushaun on 4/11/23, 1:01 AM
- “I want a pizza” = 4 tokens
- “Je voudrais une pizza” = 7 tokens
Why is “want” only 1 token in English, but “voudrais” 4 tokens? Following the French example, would “wants” and “wanted” map to one or two tokens?
by seba_dos1 on 4/10/23, 5:56 PM
by 29athrowaway on 4/10/23, 6:05 PM
Take "lampara" or "pantalones" in Spanish for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.
Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.
"La mesa" refers to a female table, although tables are not lifeforms and have no sex.
To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.