from Hacker News

Direct Preference Optimization vs. RLHF

by summarity on 5/25/25, 4:50 PM with 1 comment

by Genego on 5/28/25, 4:48 AM
I was building a multi-agent system connected to Telegram. One agent synthesises a response from the output of 5+ other agents. Initially I was tweaking the system through my IDE, making small adjustments to prompts to ensure that patterns and workflows were followed better. But I also started to interact with it while on the road, or just from bed. I got very frustrated seeing some multi-step / multi-agent interactions go completely wrong, so I built in an additional architect agent, which can adjust the agents' prompts (in terms of the execution logic of tool calls) on the fly.
So if I saw something go wrong, I would say: "Next time don't do that, please do this instead" - the architect agent then reviews the entire tool and agent call chain, and makes a new adaptation to each agent (if necessary).
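To make that loop concrete, here is a minimal sketch of what one feedback step could look like. Everything in it is an assumption for illustration: `Agent`, `CallRecord`, `apply_feedback` and the `revise` callback are hypothetical names, and `append_rule` is a stand-in for where a real setup would presumably ask an LLM to rewrite the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Agent:
    name: str
    prompt: str


@dataclass
class CallRecord:
    """One step in the tool/agent call chain (hypothetical shape)."""
    agent: str
    tool: str
    output: str


# Hypothetical reviser signature: (agent, full call chain, user feedback)
# -> revised prompt, or None when no change is needed for this agent.
Reviser = Callable[[Agent, list, str], Optional[str]]


def apply_feedback(agents: dict, trace: list, feedback: str,
                   revise: Reviser) -> list:
    """Offer every agent's prompt to the reviser; return names of changed agents."""
    changed = []
    for agent in agents.values():
        new_prompt = revise(agent, trace, feedback)
        if new_prompt is not None and new_prompt != agent.prompt:
            agent.prompt = new_prompt
            changed.append(agent.name)
    return changed


def append_rule(agent: Agent, trace: list, feedback: str) -> Optional[str]:
    """Stand-in reviser: only touches agents that appear in the call chain."""
    if not any(step.agent == agent.name for step in trace):
        return None
    return f"{agent.prompt}\nRule: {feedback}"
```

Passing the reviser in as a callback keeps the "review the chain, adapt if necessary" step separate from whatever model or heuristic actually rewrites the prompts.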
I was calling this "Poor man's RLHF" - it has been quite fun to interact with. I ended up making it so that this is saved to a JSON file that I could later (potentially) use for finetuning. But I was always wondering if there is a name for this? Is it similar to DPO? I called it "behavioral adaptation". For a small system it was quite effective. But I also didn't bother to research it.
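On the DPO comparison: DPO finetunes on preference pairs, i.e. a prompt plus a preferred and a dispreferred response, whereas the loop described here edits prompts and leaves the model weights alone. A log of corrections could still be shaped so that it converts into such pairs later. The sketch below assumes a JSON file holding a list of records; the function names and the `prompt` / `rejected` / `chosen` / `feedback` fields are illustrative, not taken from the actual system, and it assumes the "chosen" response is captured from a later, corrected run.

```python
import json
from pathlib import Path


def log_correction(path, prompt, rejected, chosen=None, feedback=""):
    """Append one correction to a JSON file and return all records so far."""
    path = Path(path)
    records = json.loads(path.read_text()) if path.exists() else []
    records.append({
        "prompt": prompt,        # what the user originally asked
        "rejected": rejected,    # the response the user objected to
        "chosen": chosen,        # a better response, if one was captured
        "feedback": feedback,    # the verbal "do this instead" instruction
    })
    path.write_text(json.dumps(records, indent=2))
    return records


def to_preference_pairs(records):
    """Keep only records that have both sides of a preference pair."""
    return [
        {"prompt": r["prompt"], "chosen": r["chosen"], "rejected": r["rejected"]}
        for r in records
        if r.get("chosen")
    ]
```

Records without a `chosen` response are kept in the log but skipped when building pairs, since a pairwise method needs both sides.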