by jafitc on 12/15/23, 11:24 PM with 1 comments
by jafitc on 12/15/23, 11:24 PM
BYOT - bring your own tests style.
It gives a better picture of real-world performance and is more robust against contamination.
They collected over 6000 and 1500 votes for Mixtral-8x7B and Gemini Pro, respectively.
While Elo ratings are widely used to rank players in chess or teams in sports, here's a disclaimer by the makers of the leaderboard:
---
> Please note Arena is a "live eval" and pretty much a sampling process to estimate models capability.
> That's why we show the confidence intervals through bootstrapping. Statistically, these models (e.g., GPT-3.5, Mixtral, Gemini Pro) are very close and only looking at their ranking can be misleading.
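
To make the point about bootstrapped confidence intervals concrete, here is a rough sketch of how Elo ratings and their intervals could be estimated from pairwise votes. This is not the leaderboard's actual code: the function names, the K-factor of 4, the starting rating of 1000, and the synthetic votes for the placeholder models `model_x` and `model_y` are all assumptions made for illustration.

```python
import random


def elo_ratings(battles, k=4.0, scale=400.0, init=1000.0):
    """Sequential Elo update over (model_a, model_b, winner) tuples.

    winner is the name of the winning model, or None for a tie.
    k, scale and init are assumed values, not the leaderboard's settings.
    """
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.setdefault(a, init)
        rb = ratings.setdefault(b, init)
        # Expected score of model a under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / scale))
        score_a = 1.0 if winner == a else 0.0 if winner == b else 0.5
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta  # zero-sum: b loses what a gains
    return ratings


def bootstrap_intervals(battles, rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample votes with replacement, refit Elo."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))
        for model, rating in elo_ratings(resampled).items():
            samples.setdefault(model, []).append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        low = values[int((alpha / 2) * (len(values) - 1))]
        high = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (low, high)
    return intervals


# Synthetic votes (made up): model_x wins slightly more often than model_y.
vote_rng = random.Random(42)
votes = [
    ("model_x", "model_y", "model_x" if vote_rng.random() < 0.52 else "model_y")
    for _ in range(300)
]

point_estimates = elo_ratings(votes)
intervals = bootstrap_intervals(votes)
for model in sorted(point_estimates):
    low, high = intervals[model]
    print(f"{model}: {point_estimates[model]:.1f} (95% CI {low:.1f} to {high:.1f})")
```

When the intervals of two models overlap, as tends to happen when the win rate is close to 50% and the vote count is modest, the ordering between them is not well supported by the data, which is what the disclaimer is warning about.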