from Hacker News

Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models

by scoresmoke on 10/11/23, 8:08 PM with 2 comments

In September 2023, I noticed a tweet [1] on the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted the nice LLMonitor Benchmarks dataset [2], which has a small set of prompts and a large set of model completions. I decided to try my own evaluation without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I finished my analysis before the Mistral AI model was released and only published it afterward. I'd be happy to add it to the comparison if I had its completions.

by maxrmk on 10/11/23, 9:15 PM
This is really cool, nice work. Did you try any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist. Human graders tend to rate these highly since they don't actually run the code.
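
A minimal sketch of the kind of automated pre-check that could flag such answers before they reach human graders, assuming the completions are Python source. The function name `missing_imports` is hypothetical and not part of the LLMFAO pipeline. It parses a completion and reports imported top-level modules that cannot be found in the local environment:

```python
import ast
import importlib.util


def _module_exists(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for oddly registered modules; treat as missing.
        return False


def missing_imports(source: str) -> list[str]:
    """List top-level modules imported by `source` that are not installed locally."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> check the top-level package `a`
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # `from a.b import c` -> check `a`; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return sorted(name for name in modules if not _module_exists(name))


completion = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
print(missing_imports(completion))
```

This only catches missing modules in whatever environment runs the check. A real library that simply isn't installed would be flagged too, and nonexistent methods on real libraries would slip through, so actually executing the code against tests would still be needed for those.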