from Hacker News

Show HN: A registry of agent benchmarks (including many OSS agent trajectories)

by lbeurerkellner on 12/23/24, 8:57 AM with 1 comment

If you're interested in exploring what LLM-based agent systems actually do these days to solve benchmarks such as SWE-bench or WebArena, our team created a small leaderboard that lets you view many public and OSS agent results, including all the runtime traces (the step-by-step reasoning behind the scenes).

Looking at traces is actually quite interesting, as they reveal a lot about the inner workings and shortcomings of current agent systems. For an example trace, see https://explorer.invariantlabs.ai/u/invariant/webarena--SteP...

by lbeurerkellner on 12/23/24, 9:41 AM
Let us know if you can think of any benchmark you'd like to see added.