by qianli_cs on 2/7/25, 5:00 AM with 8 comments
by vessenes on 2/7/25, 7:21 AM
The meaning of the title is simply that models from most providers can do some self-reflection when prompted to do so, without any R1-Zero type fine tuning.
This is put out as surprising, and I do not think that it is surprising at all. We’ve known about Chain of Thought type prompting for some time, and this is a variant of it.
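To make the prompting claim concrete, here is a minimal sketch of eliciting self-reflection from an off-the-shelf chat model purely via the prompt, with no R1-Zero-style fine-tuning. The prompt wording and the model name are placeholders of my own choosing, not the paper's setup.

```python
# Minimal sketch: prompted self-reflection from an off-the-shelf chat model,
# no fine-tuning involved. Prompt wording and model name are illustrative
# placeholders, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFLECT_PROMPT = (
    "Solve the problem step by step. After your first attempt, pause, "
    "re-check each step for mistakes, and revise your answer if needed.\n\n"
    "Problem: {problem}"
)

def ask_with_reflection(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat/instruct model; placeholder choice
        messages=[{"role": "user", "content": REFLECT_PROMPT.format(problem=problem)}],
        temperature=0.6,
    )
    return response.choices[0].message.content

print(ask_with_reflection("If 3x + 7 = 22, what is x?"))
```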
They then mention that 1.5B-ish parameter models don’t seem to be very good at self-reflection out of the box. This does not surprise me in the slightest. Unless heavily distilled and tuned for a very specific job, 1.5B-parameter models aren’t good at much.
They then note that something about the reward functions in R1-Zero’s setup creates a typical pattern: the self-reflection gets shorter and shorter until some sort of inflection point, after which the reflection gets longer and correct answers become more common.
This seems pretty interesting! The so-called “Aha” moment is when a model during training hits this inflection point and starts productively using and extending the self-reflection.
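For context on "the reward functions in R1-Zero's setup": the reported recipe uses simple rule-based rewards, roughly an accuracy reward for a verifiably correct final answer plus a format reward for keeping the reasoning inside think/answer tags. A rough sketch under that assumption follows; the tag names match the DeepSeek-R1 report, but the weights and matching logic are my own illustration.

```python
# Rough sketch of rule-based reward shaping for R1-Zero-style RL:
# an accuracy reward for a verifiably correct final answer plus a format
# reward for wrapping reasoning in <think>...</think> / <answer>...</answer>.
# Exact weights and answer matching here are assumptions, not the paper's values.
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected think/answer template."""
    return 1.0 if THINK_RE.search(completion) and ANSWER_RE.search(completion) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly."""
    match = ANSWER_RE.search(completion)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def total_reward(completion: str, gold: str, w_acc: float = 1.0, w_fmt: float = 0.1) -> float:
    return w_acc * accuracy_reward(completion, gold) + w_fmt * format_reward(completion)

sample = "<think>3x + 7 = 22, so 3x = 15, x = 5.</think><answer>5</answer>"
print(total_reward(sample, "5"))  # 1.1 under these illustrative weights
```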
I think my reaction overall is that the research is worth doing, as it’s trying to get at what exactly works about R1-Zero training, and why it works, and that’s great. It’s just a small start though.
by Vetch on 2/7/25, 8:21 AM
Thoughts:
- An analogy (don't zoom in too closely): going from CoT to reasoning traces is like going from purely ballistic trajectories to trajectories with navigation and thrusters. RL is for learning how to use the thrusters for adjustments, drawing on the model's internal encodings of rare samples† where some author fully spelled out their thought process.
- This might also explain why SFT on reasoner traces seems to be surprisingly effective. If it were purely an RL mediated phenomenon, SFT for reasoning would not work nearly as well.
- DeepSeek struggled to get RL to work on smaller models. If this is replicated, it may be that larger models encode self-correction patterns more robustly and assign them higher probability.
- For smaller models, imitating traces is easier than pure RL for bringing such patterns to the fore. However, we still want models to learn how to dynamically adjust their thrusters, and SFT does not provide ample opportunity for this. Further training with RL, or alternatively replacing SFT with methods like [Critique Fine-Tuning](https://arxiv.org/abs/2501.17703), is needed.
- The article incidentally reinforces that a low temperature buys consistency, not correctness (see the sketch after the footnote below). Outside of high-confidence scenarios, the greedily decoded, highest-probability answer is generally not among the best ones the model can give.
†Question: where do such samples come from? My first thought is blogs by people who discuss what didn't work. But I wonder how much of reasoning-model patterns and ability is shaped by Detective Conan transcripts?
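On the last bullet, a small self-contained sketch of what temperature actually does to the next-token distribution: near-zero temperature collapses sampling onto the single highest-probability continuation (consistent, not necessarily correct), while higher temperature keeps close runners-up in play. The logits below are made-up numbers, not from any real model.

```python
# Temperature rescales next-token logits before the softmax. Near-zero
# temperature makes sampling effectively greedy; higher temperature keeps
# lower-ranked but possibly better continuations alive.
import math
import random

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature):
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]

tokens = ["answer_A", "answer_B", "answer_C"]
logits = [2.1, 2.0, 0.5]  # A and B are nearly tied; greedy always picks A

for t in (0.01, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs)))

print("sampled at T=0.7:", sample(tokens, logits, 0.7))
```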
by Jean-Papoulos on 2/7/25, 7:10 AM
I must be missing something here. No one was arguing that the AI's answers are correct to begin with, just that self-reflection leads to more correct answers compared to not using the process?
by littlestymaar on 2/7/25, 7:19 AM
Base models exhibit what the authors call "Superficial Self-Reflection": it looks like they're reasoning, but it doesn't lead to an actual improvement in answer quality. Then, with RL, the models learn to use this reflection effectively to improve answer quality.
The whole read is interesting but I don't think the title is really an accurate description of it…
by trash_cat on 2/7/25, 12:58 PM
What is the difference?
by jamiequint on 2/7/25, 6:59 AM
by benob on 2/7/25, 7:34 AM
Also, has anyone tried non-instruct-tuned base models?