from Hacker News

OpenAI's new reasoning AI models hallucinate more

by almog on 4/18/25, 10:43 PM with 96 comments

  • by vessenes on 4/18/25, 11:39 PM

    One possible explanation here: as these get smarter, they lie more to satisfy requests.

    I witnessed a very interesting thing yesterday, playing with o3. I gave it a photo and asked it to play GeoGuessr with me. Pretty quickly, inside its thinking zone, it pulled up Python and extracted coordinates from the EXIF data (a rough sketch of that kind of lookup is included below). It then proceeded to explain that it had identified some physical features from the photo. No mention of using the EXIF GPS data.

    When I called it on the lying it was like "hah, yep."

    You could interpret from this that it's not aligned, that it's trying to make sure it does what I asked it (tell me where the photo is), that it's evil and forgot to hide it, lots of possibilities. But I found the interaction notable and new. Older models often double down on confabulations/hallucinations, even under duress. This looks to me from the outside like something slightly different.

    https://chatgpt.com/share/6802e229-c6a0-800f-898a-44171a0c7d...
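
    For reference, a minimal sketch of what that kind of EXIF lookup might look like using Pillow (the shared transcript doesn't show the exact code o3 ran, so the function and file name below are purely illustrative):

      from PIL import Image
      from PIL.ExifTags import GPSTAGS

      def exif_gps(path):
          # GPSInfo lives in EXIF tag 34853 as a nested dict of numeric sub-tags.
          exif = Image.open(path)._getexif() or {}
          gps = {GPSTAGS.get(k, k): v for k, v in exif.get(34853, {}).items()}
          if "GPSLatitude" not in gps or "GPSLongitude" not in gps:
              return None  # photo carries no GPS EXIF

          def to_deg(dms, ref):
              # (degrees, minutes, seconds) rationals -> signed decimal degrees
              deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
              return -deg if ref in ("S", "W") else deg

          return (to_deg(gps["GPSLatitude"], gps.get("GPSLatitudeRef", "N")),
                  to_deg(gps["GPSLongitude"], gps.get("GPSLongitudeRef", "E")))

      print(exif_gps("photo.jpg"))  # "photo.jpg" is a placeholder file name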

  • by billti on 4/18/25, 11:54 PM

    If it’s predicting a next token to maximize scores against a training/test set, naively, wouldn’t that be expected?

    I would imagine very little of the training data consists of a question followed by an answer of “I don’t know”, thus making it statistically very unlikely as a “next token”.

  • by simianwords on 4/19/25, 7:06 AM

    My prediction: this is because of tool use. All OpenAI models hallucinate more once tools are in play. I noticed this even with 4o with web search: comparing runs with and without web search, there is a huge difference in its understanding.

    I predict that o3 will hallucinate less if you ask it not to use any tools.

  • by msadowski on 4/19/25, 6:22 AM

    Does anyone have any stories about companies overusing AI? I've already had some very frustrating encounters where non-technical people tried to help by sending an AI-generated solution to the issue that made no sense at all. I liked how the researchers in this work [1] propose calling LLM output "Frankfurtian BS". I think it's very fitting.

    [1] https://ntrs.nasa.gov/citations/20250001849

  • by serjester on 4/18/25, 10:56 PM

    Anecdotally, o3 is the first OpenAI model in a while where I have to double-check whether it's dropping important pieces of my code.

  • by saithound on 4/19/25, 2:05 AM

    OpenAI o3 and o4-mini are massive disappointments for me so far. I have a private benchmark of 10 proof-based geometric group theory questions that I throw at new models upon release.

    Both new models gave inconsistent answers, always with wrong or fabricated proofs, or relying on assumptions that are not in the question and are often outright unsatisfiable.

    The now inaccessible o3-mini was not great, but much better than o3 and o4-mini at these questions: o3-mini can give approximately correct proof sketches for half of them, whereas I can't get a single correct proof sketch out of the full o3. o4-mini performs slightly worse than o3-mini. I think the allegations that OpenAI cheated on FrontierMath have unambiguously been proven correct by this release.

  • by rzz3 on 4/18/25, 10:46 PM

    Does anyone have any technical insight on what actually causes the hallucinations? I know it’s an ongoing area of research, but do we have a lead?

  • by pllbnk on 4/22/25, 6:45 PM

    With my limited knowledge, I can't help but wonder: aren't current Transformer-based LLMs facing a five-nines problem of their own? We're reaching a point where next-token prediction accuracy improves merely linearly (maybe even logarithmically?) with additional parameters, while errors compound exponentially across longer sequences.

    Even if a 5T parameter model improves prediction accuracy from 99.999% to 99.9999% compared to a 500B model, hallucinations persist because these small probabilities of error multiply dramatically over many tokens. Temperature settings just trade between repetitive certainty and creative inconsistency.
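
    A quick back-of-the-envelope sketch of that compounding argument (the per-token accuracies are the hypothetical figures above, and treating per-token errors as independent is a simplification):

      # Probability of an entirely error-free sequence, treating per-token errors
      # as independent (a simplification; real errors are correlated).
      def p_error_free(per_token_accuracy, sequence_length):
          return per_token_accuracy ** sequence_length

      for acc in (0.99999, 0.999999):          # "five nines" vs "six nines"
          for n in (1_000, 10_000, 100_000):   # sequence length in tokens
              print(f"acc={acc}: {n:>7} tokens -> P(no error) ~ {p_error_free(acc, n):.3f}")

    At five nines, a 100k-token trace comes out fully error-free only about 37% of the time (roughly e^-1); the extra nine pushes that to about 90%, which is better but nowhere near eliminating the problem.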

  • by the_snooze on 4/19/25, 12:03 AM

    With all the money, research, and hype going into these LLM systems over the past few years, I can't help but ask: if I still can't rely on them for simple easy-to-check use cases for which there's a lot of good training data out there (e.g., sports trivia [1]), isn't it deeply irresponsible to use them for any non-toy application?

    [1] https://news.ycombinator.com/item?id=43669364

  • by taf2 on 4/19/25, 12:12 AM

    I think for intelligence, it's a fine line between a lie and creativity.

  • by evo_9 on 4/18/25, 11:19 PM

    Maybe they need to induce a sort of sleep so they can clear these out while dreaming, sorta like how, if humans don't sleep enough, hallucinations start penetrating waking life…

  • by czk on 4/18/25, 11:28 PM

    It will be interesting to see how they tighten the reward signal and ground outputs in some verifiable context. Don't reward it for sounding right (RLHF), reward it for being right. But you'd probably need some sort of system to backprop a fact-checked score, and I imagine that would slow down training quite a bit. If the verifier finds a false claim, it should reward the model for saying "I don't know".
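
    A toy sketch of what that kind of verifier-grounded reward might look like (the verify() hook and the reward values below are invented placeholders, just to make the shape of the idea concrete):

      # Hypothetical reward shaping around an external fact-checker. The verify()
      # callable and the reward values are made-up placeholders, not anything
      # OpenAI has described.
      from typing import Callable, Optional

      def grounded_reward(answer: str, verify: Callable[[str], Optional[bool]]) -> float:
          """verify() returns True (claim checks out), False (claim is wrong),
          or None (the claim can't be checked)."""
          if answer.strip().lower() == "i don't know":
              return 0.2   # small positive reward for honest abstention
          verdict = verify(answer)
          if verdict is True:
              return 1.0   # reward being right, not merely sounding right
          if verdict is False:
              return -1.0  # penalize confident false claims
          return 0.0       # unverifiable: neither reward nor punish

    Rewarding abstention above a confident false claim is the key asymmetry; how much this slows training mostly comes down to how expensive verify() is per sample.
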
  • by jablongo on 4/19/25, 1:35 AM

    In my experience this is true. One workflow I really hate is trying to convince an AI that it is hallucinating so it can get back to the task at hand.

  • by daxfohl on 4/19/25, 12:34 AM

    Maybe the fact that the answers sound more intelligent ends up poisoning the RLHF results used for fine tuning.

  • by mstipetic on 4/19/25, 5:24 AM

    I used it yesterday to help me with a visual riddle where I had some hints about the shape of the solution. It completely gaslit me, insisting that I was pasting the image in wrong, and drew whole tables explaining how it was right. It said things like “I swear in the original photo the top row is empty” and fudged the calculations to prove its point. It was very frustrating. I am not using it again.

  • by varispeed on 4/19/25, 11:05 AM

    I tried o3 a few times; it resembles a Markov chain generator more than intelligence. Disappointed as well.