from Hacker News

Thinking LLMs: General Instruction Following with Thought Generation

by ed on 10/20/24, 8:36 PM with 1 comment

  • by ed on 10/20/24, 8:36 PM

    This paper comes from Meta and introduces Thought Preference Optimization (TPO), a post-training process that trains small models to think before responding, similar to o1.
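
    Roughly, the idea is: prompt the model to write internal thoughts before its response, sample several candidates per prompt, have a judge model score only the response part, and turn the best/worst pair into preference data for DPO. The sketch below is just my reading of that loop; the prompt template, the "Final response:" delimiter, and the generate/judge_score helpers are illustrative stand-ins, not the paper's actual code.

        # Sketch of TPO-style preference-pair construction (illustrative only).
        # `generate` and `judge_score` are caller-supplied stand-ins for the
        # policy model and the judge model; the prompt/delimiter are assumptions.

        THOUGHT_PROMPT = (
            "Respond to the user's query. First write out your internal "
            "thoughts, then give your answer after 'Final response:'."
        )

        def split_thought_and_response(output: str) -> tuple[str, str]:
            # Assumed delimiter; the paper uses its own thought/response separator.
            thought, _, response = output.partition("Final response:")
            return thought.strip(), response.strip()

        def build_preference_pairs(prompts, generate, judge_score, k=8):
            """Sample k thought+response outputs per prompt, score only the
            response with a judge, and keep the best/worst pair for DPO."""
            pairs = []
            for prompt in prompts:
                outputs = [generate(THOUGHT_PROMPT + "\n\n" + prompt) for _ in range(k)]
                scored = []
                for out in outputs:
                    _thought, response = split_thought_and_response(out)
                    # The judge only ever sees the response, never the thought.
                    scored.append((judge_score(prompt, response), out))
                scored.sort(key=lambda item: item[0])
                worst, best = scored[0][1], scored[-1][1]
                # The full output (thought + response) is the chosen/rejected text,
                # so thoughts get optimized indirectly through response quality.
                pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
            return pairs

    The resulting pairs would then feed an iterative DPO-style training step, repeated over a few rounds.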

    The results are impressive: Llama 3 8B performs almost on par with GPT-4o across a wide range of tasks, not just logic and math.

    Interestingly, the post-training process significantly improves model performance even without “thoughts” (the “direct baseline” case in the paper).