by t55 on 6/2/25, 9:27 AM with 28 comments
by phh on 6/2/25, 10:28 AM
I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so this is more a feeling). So I've been wanting an "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term, with many external contributions adding new targets!
by jimmySixDOF on 6/2/25, 11:23 AM
>Spurious Rewards: Rethinking Training Signals in RLVR [1]
>
>*TL;DR*: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.
>All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR-training Qwen2.5-Math-7B:
>
>- RLVR + format reward (reward responses with `\boxed{}`): *+16.4%*
>- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
>- RLVR + random reward: *+21.4%*
>- (as a reference) RLVR + ground-truth reward: +28.8%
>How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
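For concreteness, here is a minimal Python sketch of what the spurious reward functions from that list might look like. The function names and the `extract_boxed` helper are hypothetical illustrations under my assumptions, not the paper's actual code:

```python
import random
import re

def extract_boxed(response: str):
    """Pull the contents of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{(.+?)\}", response)
    return matches[-1] if matches else None

def format_reward(response: str) -> float:
    """Reward 1.0 if the response contains a \\boxed{} answer at all,
    regardless of whether the answer is correct."""
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0

def incorrect_reward(response: str, gold: str) -> float:
    """Reward only answers that do NOT match the ground truth."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer != gold else 0.0

def random_reward(response: str) -> float:
    """Ignore the response entirely; reward a coin flip."""
    return float(random.random() < 0.5)

def ground_truth_reward(response: str, gold: str) -> float:
    """The standard RLVR signal: reward exact-match correctness."""
    return 1.0 if extract_boxed(response) == gold else 0.0
```

The surprising claim is that swapping `ground_truth_reward` for any of the other three still yields large MATH-500 gains on Qwen2.5-Math.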
>Learning to Reason without External Rewards
>
>Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
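A rough sketch of how a self-certainty reward could plug into GRPO's group-relative normalization. I'm assuming self-certainty is measured as the mean KL divergence of the model's token distributions from uniform (one common formalization of "confidence"; the paper's exact definition may differ):

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward for one sampled response.

    logits: (seq_len, vocab_size) logits at the generated token positions.
    Returns a scalar: mean KL(p || uniform) over positions, which equals
    log(V) minus the mean entropy. Higher = more confident.
    """
    log_probs = F.log_softmax(logits, dim=-1)      # (T, V)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # (T,)
    return math.log(logits.size(-1)) - entropy.mean()

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards across the group of
    responses sampled for the same prompt, as in GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

On this reading, the only change from standard GRPO is the reward source: each sampled response is scored by `self_certainty` instead of a verifier, and the rest of the policy-gradient machinery is untouched.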
[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590