by RobinHirst11 on 11/23/24, 12:37 PM with 49 comments
by fnordpiglet on 11/29/24, 6:56 PM
by Unlisted6446 on 11/30/24, 2:36 AM
For one, they could consider using equivalence testing for comparing models instead of significance testing. I'd be surprised if their significance tests were not significant given 10,000 eval questions, and I don't see why they couldn't ask the competing models the same 10,000 eval questions.
My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
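For concreteness, here's a rough sketch of what an equivalence test could look like, using two one-sided tests (TOST) on per-question 0/1 scores; the arrays and the ±2-point margin are placeholders I made up, not anything from the post:

    # Rough TOST sketch: test whether two models' accuracies differ by less
    # than a chosen margin. The score arrays and the +/-2-point margin are
    # placeholders, not values from the post.
    import numpy as np
    from statsmodels.stats.weightstats import ttost_ind

    rng = np.random.default_rng(0)
    scores_a = rng.binomial(1, 0.91, size=10_000)  # per-question 0/1 scores, model A
    scores_b = rng.binomial(1, 0.90, size=10_000)  # per-question 0/1 scores, model B

    margin = 0.02  # differences within +/-2 points count as practically equivalent
    p_value, lower, upper = ttost_ind(scores_a, scores_b, -margin, margin)
    print(f"TOST p-value: {p_value:.4f}")  # small p-value => evidence of equivalence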
by ipunchghosts on 11/29/24, 11:12 PM
"Random seed xxx is all you need" was another demonstration of this need.
You actually want a Wilcoxon rank-sum test, since many metrics are not Gaussian, especially as they approach their limits, e.g. accuracy around 99% or 100%! At that point the distribution becomes highly sub-Gaussian.
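A quick sketch of what that could look like with scipy, comparing per-seed accuracies from two models (the numbers are made up for illustration):

    # Rough sketch: compare per-seed accuracies with a Wilcoxon rank-sum test
    # instead of a t-test; the accuracy values are placeholders.
    from scipy.stats import ranksums

    acc_a = [0.991, 0.993, 0.990, 0.994, 0.992]  # model A accuracy per random seed
    acc_b = [0.989, 0.990, 0.988, 0.991, 0.990]  # model B accuracy per random seed

    stat, p_value = ranksums(acc_a, acc_b)
    print(f"rank-sum statistic={stat:.3f}, p-value={p_value:.4f}")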
by intended on 11/30/24, 5:43 AM