by t55 on 6/2/25, 9:27 AM with 28 comments
by phh on 6/2/25, 10:28 AM
I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so this is more a feeling). So I've been wanting an "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term, with many external contributions adding new targets!
by jimmySixDOF on 6/2/25, 11:23 AM
>Spurious Rewards: Rethinking Training Signals in RLVR [1]
>
>*TL;DR*: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.
>All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR-training Qwen2.5-Math-7B:
>
>- RLVR + format reward (reward responses with `\boxed{}`): *+16.4%*
>- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
>- RLVR + random reward: *+21.4%*
>- (as a reference) RLVR + ground-truth reward: +28.8%
>How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
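For concreteness, here is a minimal Python sketch of what the spurious reward functions from that list might look like. The function names and the `extract_boxed` helper are hypothetical illustrations under my assumptions, not the paper's actual code:

```python
import random
import re

def extract_boxed(response: str):
    """Pull the contents of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{(.+?)\}", response)
    return matches[-1] if matches else None

def format_reward(response: str) -> float:
    """Reward 1.0 if the response contains a \\boxed{} answer at all,
    regardless of whether the answer is correct."""
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0

def incorrect_reward(response: str, gold: str) -> float:
    """Reward only answers that do NOT match the ground truth."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer != gold else 0.0

def random_reward(response: str) -> float:
    """Ignore the response entirely; reward a coin flip."""
    return float(random.random() < 0.5)

def ground_truth_reward(response: str, gold: str) -> float:
    """The standard RLVR signal: reward exact-match correctness."""
    return 1.0 if extract_boxed(response) == gold else 0.0
```

The surprising claim is that swapping `ground_truth_reward` for any of the other three still yields large MATH-500 gains on Qwen2.5-Math.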
>Learning to Reason without External Rewards
>
>Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
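A rough sketch of how a self-certainty reward could plug into GRPO's group-relative normalization. I'm assuming self-certainty is measured as the mean KL divergence of the model's token distributions from uniform (one common formalization of "confidence"; the paper's exact definition may differ):

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward for one sampled response.

    logits: (seq_len, vocab_size) logits at the generated token positions.
    Returns a scalar: mean KL(p || uniform) over positions, which equals
    log(V) minus the mean entropy. Higher = more confident.
    """
    log_probs = F.log_softmax(logits, dim=-1)      # (T, V)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # (T,)
    return math.log(logits.size(-1)) - entropy.mean()

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards across the group of
    responses sampled for the same prompt, as in GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

On this reading, the only change from standard GRPO is the reward source: each sampled response is scored by `self_certainty` instead of a verifier, and the rest of the policy-gradient machinery is untouched.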
[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590