from Hacker News

Reinforcement Pre-Training

by frozenseven on 6/10/25, 5:30 AM with 18 comments

  • by NotAnOtter on 6/10/25, 5:26 PM

    I'm interested in how an innovation like this affects business prospects.

    Let's assume this is a paradigm shift on the scale of Transformers / `Attention is all you need`. Companies build out new models and pump another $100 billion into them. Then a year from now, another innovation comes out. Same circus. And again.

    No one wants to be left behind but trying to keep up will sink smaller companies.

  • by ntonozzi on 6/10/25, 8:47 PM

    Is there any work on using some kind of soft tokens for reasoning? It seems so inefficient to compress so much information into a single token for the next pass of the model, when you could output a large vector on each forward pass and get a drastically larger working memory/scratchpad, with much higher bandwidth for passing information forward to the next token call. If a single token carries 17 bits of information, a vector of 1024 32-bit floats could carry up to 32,768 bits.
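
A rough sketch of the arithmetic in the comment above. The vocabulary size of 2^17 and the 32-bit float width are assumptions chosen to match the figures quoted, and the vector number is only a raw storage upper bound, not a measure of usable information.

```python
import math


def bits_per_token(vocab_size: int) -> float:
    """Maximum information in one discrete token: log2 of the vocabulary size."""
    return math.log2(vocab_size)


def raw_bits_per_vector(dim: int, bits_per_float: int = 32) -> int:
    """Raw storage of a dense vector (an upper bound on what it can carry)."""
    return dim * bits_per_float


# Assumed values: a vocabulary of 2**17 entries gives the 17 bits mentioned above.
vocab_size = 2 ** 17
token_bits = bits_per_token(vocab_size)
vector_bits = raw_bits_per_vector(1024)
ratio = vector_bits / token_bits

print(f"token: {token_bits:.0f} bits, vector: {vector_bits} bits, ratio: {ratio:.0f}x")
```

With 16-bit floats the raw figure halves to 16,384 bits, so the gap stays around three orders of magnitude either way.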
  • by Imnimo on 6/10/25, 7:07 PM

    This is an interesting way of squeezing extra feedback from raw text, but I'm a little skeptical that it's the best way to spend training FLOPs. It feels like most "next tokens" are pretty low information (even after filtering for entropy like they do). Does it make sense to spend a bunch of compute on a reasoning trace for them? Maybe if you're severely data limited, but not compute limited?
  • by dgshsg on 6/10/25, 1:06 PM

    I notice that you can do this recursively to arbitrary depth. The cost is terrible though.
  • by hzia on 6/10/25, 10:54 AM

    This is very exciting! Existing data will become a lot more valuable, and this brings training one step closer to how we learn as humans!

    The downside is that this is going to be extremely expensive, so the dataset used for RL will need to be curated.

  • by rafaelero on 6/10/25, 7:25 PM

    This should be used for high entropy tokens during pre-training.
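
One way to picture selecting high-entropy tokens, as a minimal sketch: compute the Shannon entropy of a model's next-token distribution at each position and keep only positions above a threshold. The function names, the toy distributions, and the 0.9-bit threshold are all hypothetical and are not taken from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def high_entropy_positions(distributions, threshold_bits):
    """Indices of positions whose predictive entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(distributions)
        if entropy_bits(probs) > threshold_bits
    ]


# Toy next-token distributions over a 4-word vocabulary (made up for illustration).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # nearly certain: low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two-way split: 1 bit
]
selected = high_entropy_positions(dists, threshold_bits=0.9)
print(selected)
```

Only the selected positions would then receive the expensive reasoning rollouts; the rest could fall back to ordinary next-token training.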
  • by babelfish on 6/10/25, 4:21 PM

    So marginally better (and occasionally worse) performance for an order of magnitude larger training costs…?
  • by watsonmusic on 6/10/25, 4:34 PM

    A new scaling paradigm finally comes out!
  • by beauzero on 6/10/25, 5:21 PM

    Interesting