by amarble on 3/3/25, 6:09 PM with 12 comments
by intellectronica on 3/3/25, 8:09 PM
There are generic evals (like MMLU), for which the proper term is really "benchmark".
But task-related evals, where you evaluate how a specific model/implementation performs on a task in your project, are, if not _all_ you need, at the very least the most important component by a wide margin. They do not _guarantee_ software performance the way unit tests do for traditional software. But they are the main mechanism we have for evolving a system towards being good enough to use in production. I am not aware of any workable alternative.
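As a rough illustration, here is a minimal sketch of such a task-related eval; the task, cases, and exact-match grader are all hypothetical, and in practice the model call and grading would be task-specific:

    # Minimal sketch of a task-related eval harness (hypothetical task;
    # swap in your real model call and a task-appropriate grader).
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Case:
        prompt: str
        expected: str  # reference answer for this case

    def run_eval(model: Callable[[str], str], cases: list[Case]) -> float:
        """Run every case through the model and return the pass rate."""
        passed = 0
        for case in cases:
            output = model(case.prompt)
            # Naive grader: normalized exact match. Real task evals often
            # need task-specific checks or an LLM-as-judge instead.
            if output.strip().lower() == case.expected.strip().lower():
                passed += 1
        return passed / len(cases)

    cases = [
        Case("Extract the currency code from: 'Invoice total: 42.00 EUR'", "EUR"),
        Case("Extract the currency code from: 'Paid 13.50 USD on arrival'", "USD"),
    ]
    fake_model = lambda prompt: "EUR"  # stand-in for a real LLM call
    print(f"pass rate: {run_eval(fake_model, cases):.0%}")

The grader itself isn't the point; the point is that the suite tracks the specific task your system has to perform, so a change in the score means something concrete.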
by shahules on 3/3/25, 8:37 PM
1. Evals are used throughout the article in the sense of LLM benchmarking, but that's not the point. One could effectively evaluate any AI system by building custom evals.
2. The purpose of evals is to help devs systematically improve their AI systems (at least that's how we look at it), not any of the purposes listed in your article. It's not a one-time thing; it's a practice, like the scientific method.
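Mechanically, that practice can be as simple as this sketch (file name and regression threshold are made up): persist each run's per-task scores and diff new runs against the previous one.

    # Sketch of evals as a repeated practice (hypothetical file/threshold):
    # persist per-task scores on every run and flag regressions.
    import json
    from pathlib import Path

    HISTORY = Path("eval_history.json")

    def load_history() -> list[dict]:
        return json.loads(HISTORY.read_text()) if HISTORY.exists() else []

    def regressions(scores: dict[str, float], threshold: float = 0.05) -> list[str]:
        """Tasks that dropped by more than `threshold` vs the previous run."""
        history = load_history()
        if not history:
            return []
        prev = history[-1]["scores"]
        return [t for t, s in scores.items() if t in prev and prev[t] - s > threshold]

    def record_run(version: str, scores: dict[str, float]) -> None:
        """Append this run's scores so improvement is trackable over time."""
        history = load_history()
        history.append({"version": version, "scores": scores})
        HISTORY.write_text(json.dumps(history, indent=2))

    scores = {"extraction": 0.92, "summarization": 0.78}
    print("regressed:", regressions(scores))
    record_run("prompt-v7", scores)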
by tcdent on 3/3/25, 8:01 PM
There may never be a silver bullet for this, or anything that gives us the determinism we get from unit testing, but we can still try.
Would love to see as much effort put into this on the open-source framework side as is being put into agentic workflows.
by phillipcarter on 3/3/25, 8:10 PM
This is totally true! And I've talked with people who were convinced that "the LLM is the problem", only to find that upstream calls to services producing the data fed into the LLM were actually the ones causing problems.
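One (hypothetical) way to catch that class of bug: validate what the upstream service hands you before it ever reaches the prompt, so data problems surface as pipeline errors rather than as "bad model output". The field names below are invented for illustration.

    # Hypothetical sketch: check upstream data before prompting, so
    # pipeline bugs aren't misattributed to the model.
    def validate_context(record: dict) -> list[str]:
        """Return problems with the record we're about to prompt with."""
        problems = []
        for field in ("customer_name", "order_history"):
            if not record.get(field):
                problems.append(f"missing or empty field: {field!r}")
        return problems

    # Simulated upstream payload with a silent data bug (empty history).
    record = {"customer_name": "Ada", "order_history": []}
    issues = validate_context(record)
    if issues:
        print("fix upstream, not the prompt:", issues)
    else:
        pass  # only now is it safe to build the prompt and call the model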
by iLoveOncall on 3/3/25, 7:15 PM
I build an LLM-based platform at work, with a lot of agents and data sources, and yet we still don't fit into any of those "ifs".
by groodt on 3/3/25, 9:57 PM
I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. They're a "best practice", just like all the forms of testing in the "test pyramid" (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard-won, ideally automated, integrated quality and verification checks built deep into the system and the SDLC.
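For instance (threshold and names invented), the automated piece can be as plain as a CI gate that fails the build when the eval pass rate drops below an agreed floor:

    # Sketch of an eval gate in CI (hypothetical floor): exit non-zero
    # when the pass rate falls below the agreed minimum, failing the build.
    import sys

    PASS_RATE_FLOOR = 0.90  # agreed per-task minimum

    def gate(pass_rate: float) -> None:
        if pass_rate < PASS_RATE_FLOOR:
            print(f"eval gate FAILED: {pass_rate:.0%} < {PASS_RATE_FLOOR:.0%}")
            sys.exit(1)
        print(f"eval gate passed: {pass_rate:.0%}")

    gate(0.93)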