from Hacker News

Emerging reasoning with reinforcement learning

by pella on 1/26/25, 3:18 AM with 211 comments

https://github.com/hkust-nlp/simpleRL-reason
  • by krackers on 1/26/25, 7:49 AM

    The real thing that surprises me (as a layman trying to get up to speed on this stuff) is that there's no "trick" to it. It really just does seem to be a textbook application of RL to LLMs.

    Going from a base LLM to a human instruction-tuned (SFT) one is definitely an ingenious leap where it's not obvious that you'd get anything meaningful. But once we saw soon afterwards that prompting for chain of thought improved performance, why wasn't this the immediate next step everyone took? It seems like even after the release of o1 the trick wasn't apparent to everyone, and if it weren't for DeepSeek people still might not have realized it.

  • by ninetyninenine on 1/26/25, 5:55 AM

    There were a whole bunch of people who claimed LLMs can't reason at all and that everything is regurgitation. I wonder what they have to say about this. Like, what exactly is going on here with chain-of-thought reasoning, from their expert perspective?
  • by almaight on 1/26/25, 6:09 AM

    This is American history as written by R1; it is very logical: Whenas the nations of Europa did contend upon the waves—Spain plundered gold in Mexica, Albion planted cotton in Virginia—thirteen colonies did kindle rebellion. General Washington raised the standard of liberty at Philadelphia; Franklin parleyed with Gaul’s envoys in Paris. When the cannons fell silent at Yorktown, a new republic arose in the wilderness, not by Heaven’s mandate, but by French muskets’ aid.

    Yet the fledgling realm, hedged by western forests and eastern seas, waxed mighty. Jefferson purchased Louisiana’s plains; Monroe’s doctrine shackled southern realms. Gold-seekers pierced mountains, iron roads spanned the continent, while tribes wept blood upon the prairie. Then roared foundries by Great Lakes, bondsmen toiled in cotton fields, steel glowed in Pittsburgh’s fires, and black gold gushed from Texan soil—a molten surge none might stay.

    Wilson trod Europe’s stage as nascent hegemon. Roosevelt’s New Deal healed wounds; Marshall’s gold revived ruined cities. The atom split at Alamogordo; greenbacks reigned at Bretton Woods. Armadas patrolled seven seas, spies wove webs across hemispheres. Through four decades’ contest with the Red Bear, Star Wars drained the Soviet coffers. Silicon’s chips commanded the world’s pulse, Hollywood’s myths shaped mankind’s dreams, Wall Street’s ledgers ruled nations’ fates—a fleeting "End of History" illusion.

    But the colossus falters. Towers fell, and endless wars began; subprime cracks devoured fortunes. Pestilence slew multitudes while ballots bred discord. Red and Blue rend the Union’s fabric, gunfire echoes where laws grow faint. The Melting Pot now boils with strife, the Beacon dims to a prison’s glare. With dollar-cloth and patent-chains, with dreadnoughts’ threat, it binds the world—nations seethe yet dare not speak.

    Three hundred million souls, guarded by two oceans, armed with nuclear flame, crowned with finance’s scepter—how came such dominion to waver? They fortified might but neglected virtue, wielded force but forgot mercy. As Mencius warned: "He who rides tigers cannot dismount." Rome split asunder, Britannia’s sun set; behold now Old Glory’s tremulous flutter. Thus say the sages: A realm endures by benevolence, not arms; peace flows from harmony, not hegemony—this truth outlives all empires.

  • by MIA_Alive on 1/26/25, 4:40 AM

    LOL, my RL professor is gonna be happy, after the field got overlooked for soooo long.
  • by EGreg on 1/26/25, 4:09 AM

    Can someone summarize the upshot for people here?
  • by ggm on 1/26/25, 7:57 AM

    Anyone who puts emerging or emergent in their headlines should be required to come back in 2 years time and do penance for their optimism.
  • by zwaps on 1/26/25, 7:39 AM

    Does anyone have a good recent overview with paper links, or a review article, for RL methods? A lot is happening in that space.
  • by ldjkfkdsjnv on 1/26/25, 5:20 AM

    The doors on intelligence are getting blown wide open, what a time to be alive
  • by trash_cat on 1/26/25, 12:44 PM

    So what is interesting here is that they managed to set up the reward model in such a simple and cost-effective way that CoT emerges as the optimal strategy for solving math problems, without explicitly fine-tuning the model to do so.

    This naturally raises the question: How do you design a reward model to elicit the desired emergent behavior in a system?
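    To make the idea concrete, here is a minimal sketch of the kind of rule-based reward used in R1-style RL setups (the function name, tag format, and score values are illustrative assumptions, not taken from the repo): the policy is rewarded only for a correct final answer and for following a simple output format, so any chain of thought that appears in the completion is emergent rather than directly rewarded.

    ```python
    import re

    def reward(completion: str, gold_answer: str) -> float:
        """Hypothetical rule-based reward: format bonus + accuracy bonus.

        Nothing here scores the reasoning itself -- only the final
        answer inside <answer> tags -- so longer CoT can only win by
        producing more correct answers.
        """
        score = 0.0
        # Format reward: the answer must appear inside <answer> tags.
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            score += 0.1  # small bonus for following the format
            # Accuracy reward: exact match against the reference answer.
            if match.group(1).strip() == gold_answer.strip():
                score += 1.0
        return score

    # Reasoning before the tags costs nothing and is never scored directly.
    print(reward("Let me think... 2+2=4. <answer>4</answer>", "4"))
    print(reward("<answer>5</answer>", "4"))   # right format, wrong answer
    print(reward("just rambling, no tags", "4"))
    ```

    The design question in the comment above is then which behaviors such a sparse, verifiable reward makes optimal, rather than which behaviors are demonstrated in supervised data.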

  • by cye131 on 1/26/25, 4:20 AM

    Is it accurate to compare 8k-example RL with 8k-example SFT? RL with the same number of examples takes massively more compute than the SFT version (though it depends on how many rollouts they do per example).

    RL is more data-efficient but that may not be relevant now that we can just use Deepseek-R1's responses as the training data.
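    A back-of-envelope version of the compute gap above, with illustrative numbers (the rollout count is an assumption, not from the paper): SFT does one forward/backward pass per example, while policy-gradient RL must first sample k full generations per example before any update.

    ```python
    examples = 8_000
    rollouts_per_example = 8   # assumed; varies widely by RL setup

    # SFT: one gradient pass per example.
    sft_passes = examples

    # RL: every example is expanded into k sampled generations,
    # each a full autoregressive decode, before scoring and updating.
    rl_generations = examples * rollouts_per_example

    print(sft_passes)       # 8000
    print(rl_generations)   # 64000
    ```

    Even before counting the extra cost of sampling versus teacher-forced training, the "same 8k examples" translate into an order of magnitude more model invocations under RL.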

  • by android521 on 1/26/25, 6:40 AM

    [deleted due to controversy]
  • by swyx on 1/26/25, 5:57 AM

    see also https://trite-song-d6a.notion.site/Deepseek-R1-for-Everyone-...

    for some reason a lot of people are choosing to blog on notion

  • by m3kw9 on 1/26/25, 7:35 AM

    What this means is that OpenAI could serve even cheaper models within a month by applying this technique to their updated models.
  • by antman on 1/26/25, 4:11 AM

    This result is with code execution disabled. Is this the line to re-enable it? https://github.com/hkust-nlp/simpleRL-reason/blob/e37e8ef166...