from Hacker News

Ask HN: Can anybody clarify why OpenAI reasoning now shows non-English thoughts?

by johnnyApplePRNG on 6/12/25, 11:33 PM with 41 comments

People have noticed for a while now that Google's Bard/Gemini often inserts random Hindi/Bengali words. [0]

I just caught this in an o3-pro thought process: "and customizing for low difficulty. কাজ করছে!"

That last set of chars is apparently Bengali for "working!".

I just find it curious that similar "errors" are appearing across multiple different models... what is it about the training method or reasoning process that lets these other languages creep in? Does anyone know?

[0] https://www.reddit.com/r/Bard/comments/18zk2tb/bard_speaking_random_languages/

  • by mindcrime on 6/13/25, 2:32 AM

    LLMs aren't humans and there's no reason to expect their "thinking"[1] to behave exactly - or even much - like human thinking. In particular, they don't need to "think" in one language. More concretely, in the DeepSeek R1 paper[2] they observed this "thought language mixing" and did some experiments on suppressing it... and the model results got worse. So I wouldn't personally think of it as an "error", but rather as just an artifact of how these things work.

    [1]: By this I mean "whatever it is they do that can be thought of as sorta kinda roughly analogous to what we generally call thinking." I'm not interested in getting into a debate (here) about the exact nature of thinking and whether or not it's "correct" to refer to LLMs as "thinking". It's a colloquialism that I find useful in this context, nothing more.

    [2]: https://arxiv.org/pdf/2501.12948

  • by puttycat on 6/12/25, 11:45 PM

    Multilingual LLMs don't have a clear boundary between languages. They only appear to have one because they maximize likelihoods: asking something in English will most likely produce an English continuation, and so on.

    In other circumstances they might take a different path (in terms of output probability decoding) through other character sets, if the probabilities justify this.
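
    A toy sketch of that decoding idea, with an invented mixed-script vocabulary and made-up probabilities (purely illustrative, not any real model's tokenizer or decoder):

      import random

      # Invented next-token distribution after a prefix like
      # "and customizing for low difficulty. " -- a real model scores its
      # entire vocabulary, but the principle is the same.
      next_token_probs = {
          "working": 0.30,   # English continuation
          "done":    0.25,
          "কাজ":     0.20,   # Bengali "work" -- nothing forbids it from competing
          "完成":    0.15,   # Chinese "complete"
          "<eos>":   0.10,
      }

      def sample(probs):
          # Plain sampling: the script of the winning token is irrelevant,
          # only its probability mass matters.
          tokens, weights = zip(*probs.items())
          return random.choices(tokens, weights=weights, k=1)[0]

      print(sample(next_token_probs))  # usually English, occasionally "কাজ" or "完成"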

  • by yen223 on 6/12/25, 11:41 PM

    I have no idea what's going on with ChatGPT, but I can say it's pretty common for multilingual people to think about things in a language different from the one they are currently speaking.
  • by Bjorkbat on 6/13/25, 1:58 AM

    I don't actually think this is the case, but nonetheless I think it would be kind of funny if LLMs somehow "discovered" linguistic relativity (https://en.wikipedia.org/wiki/Linguistic_relativity).
  • by diwank on 6/13/25, 2:53 AM

    This isn’t entirely surprising. Language-model “reasoning” is basically the model internally exploring possibilities in token-space. These models are trained on enormous multilingual datasets and optimized purely for next-token prediction, not language purity. When reasoning traces or scratchpads are revealed directly (as OpenAI occasionally does with o-series models or DeepSeek-R1-zero), it’s common to see models slip into code-switching or even random language fragments, simply because it’s more token-efficient in their latent space.

    For example, the DeepSeek team explicitly reported this behavior in their R1-zero paper, noting that purely unsupervised reasoning emerges naturally but brings some “language mixing” along. Interestingly, they found a small supervised fine-tuning (SFT) step with language-consistency rewards slightly improved readability, though it came with trade-offs (DeepSeek blog post).

    My guess is OpenAI has typically used a smaller summarizer model to sanitize reasoning outputs before display (they mentioned summarization/filtering briefly at Dev Day), but perhaps lately they’ve started relaxing that step, causing more multilingual slips to leak through. It’d be great to get clarity from them directly on whether this is intentional experimentation or just a side-effect.

    [1] DeepSeek-R1 paper that talks about poor readability and language mixing in R1-zero’s raw reasoning https://arxiv.org/abs/2501.12948

    [2] OpenAI “Detecting misbehavior in frontier reasoning models” — explains use of a separate CoT “summarizer or sanitizer” before showing traces to end-users https://openai.com/index/chain-of-thought-monitoring/
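
    A minimal sketch of the sanitize-then-display flow described above (the summarize_cot helper, its prompt, and the two-model split are assumptions for illustration, not OpenAI's actual pipeline):

      # Hypothetical flow: a reasoning model emits a raw, possibly code-switched
      # trace, and a smaller summarizer model rewrites it before display.

      RAW_TRACE = "and customizing for low difficulty. কাজ করছে!"  # raw chain of thought

      SUMMARIZER_PROMPT = (
          "Summarize the following reasoning trace for the user. "
          "Write entirely in English and drop incomplete fragments:\n\n{trace}"
      )

      def summarize_cot(trace: str, summarizer) -> str:
          # 'summarizer' stands in for a call to a smaller model.
          return summarizer(SUMMARIZER_PROMPT.format(trace=trace))

      def show_thoughts(trace: str, summarizer=None) -> str:
          # If the summarization step is skipped or relaxed, the raw
          # multilingual trace leaks straight through to the UI -- which
          # would match what the original poster observed.
          return summarize_cot(trace, summarizer) if summarizer else trace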

  • by ipsum2 on 6/13/25, 12:35 AM

    Models like o3 are rewarded for the final output, not the intermediate thinking steps. So whatever the model generates as "thoughts" that leads to a better answer gets a higher score.

    The DeepSeek-R1 paper has a section on this, where they 'punish' the model if it thinks in a different language, in order to keep the thinking tokens readable. Anthropic probably does this too.
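
    Roughly the kind of reward shaping being described, sketched in code (the weight and the toy language check are made up; DeepSeek-R1's actual formulation is in the paper):

      def reasoning_reward(answer_correct: bool, thinking_tokens: list,
                           lang_weight: float = 0.1) -> float:
          # Base reward: only the final answer is graded, not the thoughts.
          reward = 1.0 if answer_correct else 0.0

          # Optional language-consistency term (as in DeepSeek-R1): the share of
          # thinking tokens in the target language. It makes traces more readable,
          # but spends part of the reward signal on style rather than reasoning.
          in_english = sum(1 for tok in thinking_tokens if tok.isascii())  # toy check
          consistency = in_english / max(len(thinking_tokens), 1)
          return reward + lang_weight * consistency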

  • by drivingmenuts on 6/13/25, 11:38 AM

    I see this as a problem. You can't make an LLM "unlearn" something; once it's in there, it's in there. If I have a huge database, I can easily delete swathes of useless data, but I cannot do the same with an LLM. It's not a living, thinking being - it's a program running on a computer; a device that we, in other circumstances, can add information to or remove it from. We can suppress certain things, but that information is still in there, taking up space, and it can still possibly be accessed.

    We are intentionally undoing one of the things that makes computers useful.

  • by janalsncm on 6/13/25, 1:24 AM

    Others have mentioned that DeepSeek R1 also noticed this “problem”. I believe there are two things going on here.

    One, the model is no longer being trained to output likely tokens or tokens likely to satisfy pairwise preferences. So the model doesn't care which language it thinks in. You have to explicitly punish the model for language switching, which dilutes the reasoning reward.

    Two, I believe there has been some research showing that models represent similar ideas from multiple languages in similar regions of their internal representations. Sparse autoencoders have shown this. So if the translated text makes sense, I think this is why. If not, I have no idea.

  • by NoahZuniga on 6/15/25, 1:29 AM

    I feel like most other comments are missing something important: the o3-pro thought process you see in the ChatGPT UI is a summary. So although the model might think in different languages, the summary (presumably done by a different model) will translate it into your UI language. It seems like this summarization AI messed up and gave you some text in a different language.
  • by neilv on 6/13/25, 1:39 AM

    If the reasoning didn't need to be exposed to a user, are there any ways in which you get better performance or effect by using the same LLM methods, but using a language better suited to that? (Existing language or bespoke.)

    (Inspired by movies and TV shows, when characters switch from English to a different language, such as French or Mandarin, to better express something. Maybe there's a compound word in German for that.)

  • by jmward01 on 6/13/25, 12:47 AM

    It would be interesting to study when this type of behavior emerges to see what the patterns are. It could give insights into language or culture specific reasoning patterns and subjects that are easier to convey in one language or another. Is it easier to understand math word problems in XXX or YYY? What about relationships?
  • by dpiers on 6/13/25, 5:30 AM

    Languages are thought encodings.

    Most people can only encode/decode a single language but an LLM can move between them fluidly.

  • by atlex2 on 6/13/25, 1:01 AM

    Definitely curious what circuits light up from a Neuralese perspective. We want reasoning traces that are both faithful to the thought process and also interpretable. If the other-language segments are lighting up meanings much different from their translations, that would raise questions for me.
  • by tough on 6/13/25, 1:16 AM

    I've also seen Russian and Chinese, which I certainly have never spoken to it in, nor understand.
  • by muzani on 6/13/25, 6:05 AM

    I do some AI training as a side gig, and there have been a few updates on code-switching (i.e. mixing two languages in the same utterance) in the last few months. It's possible that these changes caused this behavior to appear recently.
  • by NooneAtAll3 on 6/13/25, 1:35 AM

    I remember watching a video mentioning it (https://www.youtube.com/shorts/Vv5Ia6C5vYk)

    The main suspicion is that it's more compact?

  • by rerdavies on 6/13/25, 3:00 AM

    Reminds me of the son of a friend of mine, who was raised bilingually (English and French). When he was 3, he would sometimes ask "is this English, or the other language?"
  • by throwpoaster on 6/13/25, 1:21 PM

    Multilingual humans do this too. Sometimes a concept is easier to shorthand in one language versus another. It’s somehow “closer”.
  • by CMCDragonkai on 6/13/25, 3:40 AM

    Multilingual humans do this too, so not surprising that AI does this.