by amarble on 3/3/25, 6:09 PM with 12 comments
by intellectronica on 3/3/25, 8:09 PM
There are generic evals (like MMLU), for which the proper term is really "benchmark".
But task-related evals, where you evaluate how a specific model/implementation performs on a task in your project, are, if not _all_ you need, at the very least the most important component by a wide margin. They do not _guarantee_ software performance the way unit tests do for traditional software. But they are the main mechanism we have for evolving a system towards being good enough to use in production. I am not aware of any workable alternative.
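As a rough illustration, here is a minimal sketch of such a task-related eval; the task, cases, and exact-match grader are all hypothetical, and in practice the model call and grading would be task-specific:

    # Minimal sketch of a task-related eval harness (hypothetical task;
    # swap in your real model call and a task-appropriate grader).
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Case:
        prompt: str
        expected: str  # reference answer for this case

    def run_eval(model: Callable[[str], str], cases: list[Case]) -> float:
        """Run every case through the model and return the pass rate."""
        passed = 0
        for case in cases:
            output = model(case.prompt)
            # Naive grader: normalized exact match. Real task evals often
            # need task-specific checks or an LLM-as-judge instead.
            if output.strip().lower() == case.expected.strip().lower():
                passed += 1
        return passed / len(cases)

    cases = [
        Case("Extract the currency code from: 'Invoice total: 42.00 EUR'", "EUR"),
        Case("Extract the currency code from: 'Paid 13.50 USD on arrival'", "USD"),
    ]
    fake_model = lambda prompt: "EUR"  # stand-in for a real LLM call
    print(f"pass rate: {run_eval(fake_model, cases):.0%}")

The grader itself isn't the point; the point is that the suite tracks the specific task your system has to perform, so a change in the score means something concrete.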
by shahules on 3/3/25, 8:37 PM
1. Evals are used throughout the article in the sense of LLM benchmarking, but that's not the point. One could effectively evaluate any AI system by building custom evals.
2. The purpose of evals is to help devs systematically improve their AI systems (at least that's how we look at it), not any of the purposes listed in your article. It's not a one-time thing; it's a practice, like the scientific method.
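Mechanically, that practice can be as simple as this sketch (file name and regression threshold are made up): persist each run's per-task scores and diff new runs against the previous one.

    # Sketch of evals as a repeated practice (hypothetical file/threshold):
    # persist per-task scores on every run and flag regressions.
    import json
    from pathlib import Path

    HISTORY = Path("eval_history.json")

    def load_history() -> list[dict]:
        return json.loads(HISTORY.read_text()) if HISTORY.exists() else []

    def regressions(scores: dict[str, float], threshold: float = 0.05) -> list[str]:
        """Tasks that dropped by more than `threshold` vs the previous run."""
        history = load_history()
        if not history:
            return []
        prev = history[-1]["scores"]
        return [t for t, s in scores.items() if t in prev and prev[t] - s > threshold]

    def record_run(version: str, scores: dict[str, float]) -> None:
        """Append this run's scores so improvement is trackable over time."""
        history = load_history()
        history.append({"version": version, "scores": scores})
        HISTORY.write_text(json.dumps(history, indent=2))

    scores = {"extraction": 0.92, "summarization": 0.78}
    print("regressed:", regressions(scores))
    record_run("prompt-v7", scores)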
by tcdent on 3/3/25, 8:01 PM
There may never be a silver bullet for this, or anything that gives us the determinism we get from unit testing, but we can still try.
Would love to see as much effort put into this on the open-source framework side as is being put into agentic workflows.
by phillipcarter on 3/3/25, 8:10 PM
This is totally true! And I've talked with people who were convinced that "the LLM is the problem", only to find that upstream calls to services producing the data fed into the LLM were actually the ones causing problems.
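One (hypothetical) way to catch that class of bug: validate what the upstream service hands you before it ever reaches the prompt, so data problems surface as pipeline errors rather than as "bad model output". The field names below are invented for illustration.

    # Hypothetical sketch: check upstream data before prompting, so
    # pipeline bugs aren't misattributed to the model.
    def validate_context(record: dict) -> list[str]:
        """Return problems with the record we're about to prompt with."""
        problems = []
        for field in ("customer_name", "order_history"):
            if not record.get(field):
                problems.append(f"missing or empty field: {field!r}")
        return problems

    # Simulated upstream payload with a silent data bug (empty history).
    record = {"customer_name": "Ada", "order_history": []}
    issues = validate_context(record)
    if issues:
        print("fix upstream, not the prompt:", issues)
    else:
        pass  # only now is it safe to build the prompt and call the model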
by iLoveOncall on 3/3/25, 7:15 PM
I build an LLM-based platform at work, with a lot of agents and data sources, and yet we still don't fit into any of those "ifs".
by groodt on 3/3/25, 9:57 PM
I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. They're a "best practice", just like all the forms of testing in the "test pyramid" (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard-won, ideally automated, integrated quality and verification checks built deep into the system and the SDLC.
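For instance (threshold and names invented), the automated piece can be as plain as a CI gate that fails the build when the eval pass rate drops below an agreed floor:

    # Sketch of an eval gate in CI (hypothetical floor): exit non-zero
    # when the pass rate falls below the agreed minimum, failing the build.
    import sys

    PASS_RATE_FLOOR = 0.90  # agreed per-task minimum

    def gate(pass_rate: float) -> None:
        if pass_rate < PASS_RATE_FLOOR:
            print(f"eval gate FAILED: {pass_rate:.0%} < {PASS_RATE_FLOOR:.0%}")
            sys.exit(1)
        print(f"eval gate passed: {pass_rate:.0%}")

    gate(0.93)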