by mfiguiere on 5/30/25, 8:03 PM with 217 comments
by miki123211 on 5/31/25, 10:46 AM
Most people think of agents like they think of human employees. They set up a limited number of agents to run in parallel (often just one), with each agent running in a loop and doing one task at a time. They're still in a world where you have a fixed (on the timescale of hours or days) number of employees, each employee can only do one thing at a time, and transferring tasks between employees is slow and costly.
LLMs don't really work like that. You effectively have an infinite number of agents that you can conjure out of thin air at any time. There's no cost advantage to performing LLM requests in series rather than in parallel.
If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious. This is exactly what the authors have done.
I think a better way to think of agents is as "tasks" or "jobs", like those you might find in Celery or Sidekiq, and to apply the lessons from those systems.
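The fan-out pattern described above can be sketched with plain asyncio. This is a minimal illustration, not any particular agent framework's API; `run_agent` is a hypothetical stand-in for an LLM call:

```python
import asyncio

async def run_agent(task: str) -> str:
    """Hypothetical stand-in for one LLM-backed agent call."""
    await asyncio.sleep(0)  # this is where the real API request would go
    return f"result for {task}"

async def fan_out(task: str, subtasks: list[str]) -> list[str]:
    # Fork one sub-agent per subtask. The requests run concurrently,
    # since parallel LLM calls cost the same as serial ones.
    return await asyncio.gather(*(run_agent(s) for s in subtasks))

results = asyncio.run(fan_out("optimize kernel", ["plan", "write", "test"]))
```

Each sub-agent here is just an awaitable, so "conjuring" a hundred of them is as cheap as conjuring one; the job-queue framing maps directly onto this.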
by ekelsen on 5/30/25, 10:18 PM
People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels that have received real optimization effort and are actually used.
by thorum on 5/30/25, 9:52 PM
[1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...
[2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
by ekelsen on 5/30/25, 10:42 PM
that's a huge tolerance and allows them to use fp16 operations to replace the "fp32" kernel.
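A quick way to see why the tolerance matters: rounding the inputs to fp16 before a matmul introduces on the order of 1e-3 relative error, which a loose correctness check can accept as a valid "fp32" result. A minimal numpy sketch (the sizes and error metric here are my own choices for illustration, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b  # fp32 reference result
# A "fp32 kernel" that secretly rounds its inputs to fp16 first:
approx = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Normalized max error: far larger than genuine fp32 accuracy,
# yet a sufficiently loose tolerance can still let it pass.
err = float(np.max(np.abs(approx - ref)) / np.max(np.abs(ref)))
```

Whether this counts as "correct" depends entirely on the tolerance the test harness uses, which is the commenter's point.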
by userbinator on 5/31/25, 10:39 AM
by vessenes on 5/31/25, 12:42 AM
by FL33TW00D on 5/31/25, 8:39 AM
EDIT: looks like they've since generated another one that is numerically stable! great work
by Workaccount2 on 5/30/25, 9:24 PM
Who knows if this is the actual fabled path of "self improvement", but results like this are what we expect to find on such a path.
by yahoozoo on 5/30/25, 9:18 PM
by bgwalter on 5/31/25, 2:04 PM
The PyTorch code base is NOT written by performance experts in any way. This is the wrong baseline. Nothing about that code base is clean or hand-optimized.
The "AI" generation methodology seems to involve many instructions, even descending into instruction trees, manually discarding results, and so on. So, as usual, it requires extreme guidance.
by poltomo on 5/31/25, 4:13 PM
ML-guided heuristic search over compute schedules is as old as 2013 (Halide, for image processing).
by constantcrying on 5/30/25, 10:29 PM
This is fundamentally different from how any human would approach this problem, and also different from how some recent advances in this area were made, where AI actually came up with superior and correct algorithms.
This approach also seems quite unfortunate and makes many of these results somewhat doubtful.
by brrrrrm on 5/30/25, 9:57 PM
by klingenm on 5/31/25, 7:34 AM
(Edit, typo)
by adityamwagh on 5/30/25, 10:22 PM
by david-gpu on 5/31/25, 1:01 AM
That said, after quickly skimming the example AI-generated kernel I am not seeing anything novel there. While working at nVidia I did see a handful of techniques that, frankly, blew my mind.
Thus, I wonder what makes this AI-generated kernel faster than the standard PyTorch kernel, which I presume simply delegates all the heavy lifting to cuDNN. My guess, and it's just a guess, is that they are comparing the fastest AI-generated kernel they produced for a very particular set of parameters against whatever kernel cuDNN picks for that same scenario, and perhaps the subsystem inside cuDNN that selects a kernel from the very large database it manages chose a suboptimal candidate. Researchers tend to ignore this issue entirely and assume that cuDNN always picks the very best kernel in every possible scenario, which is just not realistic.
Maybe there is something else going on, but these sorts of "we have beaten this heavily optimized proprietary library" claims always seem to miss this very important point.
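The selection effect described above is easy to simulate: reporting the best of N generated kernels against a single baseline draw tends to show a "speedup" even when both come from the same performance distribution. The numbers below are purely illustrative:

```python
import random

random.seed(42)

def kernel_runtime() -> float:
    # Hypothetical runtime (ms) drawn from one shared distribution:
    # generated kernels and the baseline are equally good on average.
    return random.gauss(1.0, 0.1)

baseline = kernel_runtime()                       # the library's single pick
candidates = [kernel_runtime() for _ in range(100)]
best = min(candidates)                            # fastest of 100 attempts

# Usually > 1 purely from selection bias, with no real algorithmic win.
speedup = baseline / best
```

A fairer comparison would pit best-of-N generated kernels against best-of-N library configurations (e.g. after autotuning), rather than against whichever kernel the library's heuristic happened to select.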
Kind regards to any NVidia insiders who may read this. You guys are the brightest people I've ever met.
by MangoToupe on 5/31/25, 6:00 AM
At the very least they could have used consumer hardware. I don't even know how to parse that model, it's so consumer-alien.
by JSR_FDED on 5/30/25, 11:52 PM
by reliabilityguy on 5/30/25, 9:20 PM
If so, why is it surprising that generic implementations in PyTorch are worse?
by Mathnerd314 on 6/1/25, 2:17 AM
I was thinking this was about leaking the kernels or something, but no, they are "publishing" them in the sense of putting out the blog post - they just mean they are skipping the peer review process and not doing a formal paper.
by t-vi on 6/1/25, 4:19 PM