by delifue on 3/22/25, 7:59 AM with 1 comment
by delifue on 3/22/25, 8:01 AM
DeepSeek-V3-Base already exhibits an "Aha moment" before RL-tuning
The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO
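A rough sketch of how such a bias could arise, assuming it refers to the per-response 1/|o_i| length normalization in the GRPO loss (the comment itself does not spell this out). The function names, rewards, and lengths below are made up for illustration, and the std scaling and ratio clipping of the full objective are left out. Under that assumption, a wrong answer's penalty is spread over its tokens, so a longer wrong answer is penalized less per token:

```python
def group_advantages(rewards):
    """Group-relative advantage: reward minus the group mean.
    (Simplification: the std scaling used in GRPO is omitted here.)"""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def per_token_weights(rewards, lengths, length_normalize=True):
    """Weight each token of response i receives in the policy-gradient loss.
    With length_normalize=True the advantage is divided by the response length,
    which is the assumed source of the bias."""
    advs = group_advantages(rewards)
    return [a / n if length_normalize else a for a, n in zip(advs, lengths)]


# Hypothetical group of 4 sampled responses: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [100, 100, 1000, 100]  # response 2 is wrong AND ten times longer

normalized = per_token_weights(rewards, lengths, length_normalize=True)
unnormalized = per_token_weights(rewards, lengths, length_normalize=False)

for i, (w_norm, w_raw) in enumerate(zip(normalized, unnormalized)):
    print(f"response {i}: len={lengths[i]:5d}  normalized={w_norm:+.4f}  raw={w_raw:+.1f}")
```

With normalization, the short wrong answer (response 1) gets ten times the per-token penalty of the long wrong answer (response 2); without it, both get the same per-token weight. If this reading is right, the optimizer has an incentive to make wrong answers longer, which would show up as growing output length during RL-tuning.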