by qianli_cs on 2/7/25, 5:00 AM with 8 comments
by vessenes on 2/7/25, 7:21 AM
The meaning of the title is simply that models from most providers can do some self-reflection when prompted to do so, without any R1-Zero type fine tuning.
This is put out as surprising, and I do not think that it is surprising at all. We’ve known about Chain of Thought type prompting for some time, and this is a variant of it.
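To make the prompting claim concrete, here is a minimal sketch of eliciting self-reflection from an off-the-shelf chat model purely via the prompt, with no R1-Zero-style fine-tuning. The prompt wording and the model name are placeholders of my own choosing, not the paper's setup.

```python
# Minimal sketch: prompted self-reflection from an off-the-shelf chat model,
# no fine-tuning involved. Prompt wording and model name are illustrative
# placeholders, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFLECT_PROMPT = (
    "Solve the problem step by step. After your first attempt, pause, "
    "re-check each step for mistakes, and revise your answer if needed.\n\n"
    "Problem: {problem}"
)

def ask_with_reflection(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat/instruct model; placeholder choice
        messages=[{"role": "user", "content": REFLECT_PROMPT.format(problem=problem)}],
        temperature=0.6,
    )
    return response.choices[0].message.content

print(ask_with_reflection("If 3x + 7 = 22, what is x?"))
```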
They then mention that 1.5B-ish parameter models don’t seem to be very good at self-reflection out of the box. This does not surprise me in the slightest. Unless heavily distilled and tuned for a very specific job, 1.5B-parameter models aren’t good at much.
They then note that something about the reward functions in R1-Zero’s setup creates a typical pattern: the self-reflection gets shorter and shorter until some sort of inflection point, after which the reflection gets longer and correct answers become more common.
This seems pretty interesting! The so-called “Aha” moment is when a model during training hits this inflection point and starts productively using and extending the self-reflection.
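For context on "the reward functions in R1-Zero's setup": the reported recipe uses simple rule-based rewards, roughly an accuracy reward for a verifiably correct final answer plus a format reward for keeping the reasoning inside think/answer tags. A rough sketch under that assumption follows; the tag names match the DeepSeek-R1 report, but the weights and matching logic are my own illustration.

```python
# Rough sketch of rule-based reward shaping for R1-Zero-style RL:
# an accuracy reward for a verifiably correct final answer plus a format
# reward for wrapping reasoning in <think>...</think> / <answer>...</answer>.
# Exact weights and answer matching here are assumptions, not the paper's values.
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected think/answer template."""
    return 1.0 if THINK_RE.search(completion) and ANSWER_RE.search(completion) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly."""
    match = ANSWER_RE.search(completion)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def total_reward(completion: str, gold: str, w_acc: float = 1.0, w_fmt: float = 0.1) -> float:
    return w_acc * accuracy_reward(completion, gold) + w_fmt * format_reward(completion)

sample = "<think>3x + 7 = 22, so 3x = 15, x = 5.</think><answer>5</answer>"
print(total_reward(sample, "5"))  # 1.1 under these illustrative weights
```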
I think my reaction overall is that the research is worth doing, as it’s trying to get at what exactly works about R1-Zero training, and why it works, and that’s great. It’s just a small start though.
by Vetch on 2/7/25, 8:21 AM
Thoughts:
- An analogy (don't zoom in too closely): going from CoT to reasoning traces is like going from purely ballistic trajectories to trajectories with navigation and thrusters. RL is for learning how to use the thrusters for adjustments, drawing on the model's internal encodings of rare samples† where some author fully spelled out their thought process.
- This might also explain why SFT on reasoner traces seems to be surprisingly effective. If it were purely an RL mediated phenomenon, SFT for reasoning would not work nearly as well.
- DeepSeek struggled to get RL to work on smaller models. If this is replicated, it may be that larger models encode self-correction patterns more robustly and assign them higher probability.
- For smaller models, imitating traces is easier than pure RL for bringing such patterns to the fore. However, we still want models to learn how to dynamically adjust their thrusters, and SFT does not provide ample opportunity for this. Further training with RL, or alternatively replacing SFT with methods like [Critique Fine-Tuning](https://arxiv.org/abs/2501.17703), is needed.
- The article incidentally reinforces that a low temperature buys consistency, not correctness (see the sketch after the footnote below). Outside of high-confidence scenarios, the greedily decoded, highest-probability answer is generally not among the best ones the model can give.
†Question: where do such samples come from? My first thought is blogs by people who discuss what didn't work. But I wonder how much of reasoning-model patterns and ability is shaped by Detective Conan transcripts?
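On the last bullet, a small self-contained sketch of what temperature actually does to the next-token distribution: near-zero temperature collapses sampling onto the single highest-probability continuation (consistent, not necessarily correct), while higher temperature keeps close runners-up in play. The logits below are made-up numbers, not from any real model.

```python
# Temperature rescales next-token logits before the softmax. Near-zero
# temperature makes sampling effectively greedy; higher temperature keeps
# lower-ranked but possibly better continuations alive.
import math
import random

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature):
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]

tokens = ["answer_A", "answer_B", "answer_C"]
logits = [2.1, 2.0, 0.5]  # A and B are nearly tied; greedy always picks A

for t in (0.01, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs)))

print("sampled at T=0.7:", sample(tokens, logits, 0.7))
```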
by Jean-Papoulos on 2/7/25, 7:10 AM
I must be missing something here. No one was arguing that the AI's answers are correct to begin with, just that self-reflection leads to more correct answers compared to not using the process?
by littlestymaar on 2/7/25, 7:19 AM
Base models exhibit what the authors call "Superficial Self-Reflection": it looks like they're reasoning, but it doesn't lead to an actual improvement in answer quality. Then, with RL, the models learn to use this reflection effectively to improve answer quality.
The whole read is interesting but I don't think the title is really an accurate description of it…
by trash_cat on 2/7/25, 12:58 PM
What is the difference?
by jamiequint on 2/7/25, 6:59 AM
by benob on 2/7/25, 7:34 AM
Also, has anyone tried non-instruct-tuned base models?