by chriskanan on 3/26/25, 7:38 PM with 2 comments
by rahimnathwani on 3/26/25, 11:33 PM
Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.
I'm curious:
1. How do they determine 'closely aligned'?
2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
by aktsvigun on 4/1/25, 8:16 AM