from Hacker News

Surprisingly fast AI-generated kernels we didn't mean to publish yet

by mfiguiere on 5/30/25, 8:03 PM with 217 comments

  • by miki123211 on 5/31/25, 10:46 AM

    I think how the authors of this post think about "AI agents" is really interesting.

    Most people think of agents like they think of human employees. They set up a limited number of agents to run in parallel (often just one), with each agent running in a loop and doing one task at a time. They're still in a world where you have a fixed (on the timescale of hours or days) number of employees, each employee can only do one thing at a time, and transferring tasks between employees is slow and costly.

    LLMs don't really work like that. You effectively have an infinite number of agents that you can conjure out of thin air at any time. There's no cost advantage to performing LLM requests in series rather than in parallel.

    If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious. This is exactly what the authors have done.

    I think a better way to think of agents is as "tasks" or "jobs", like those you might find in Celery or Sidekiq, and apply the learnings from those.
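
    A minimal sketch of that fan-out pattern, using asyncio and a hypothetical call_llm coroutine standing in for a real LLM API (the queueing details would look different in Celery or Sidekiq, but the shape is the same):

      import asyncio

      async def call_llm(prompt: str) -> str:
          # Placeholder for a real LLM API call (hypothetical).
          await asyncio.sleep(0.1)
          return f"answer to: {prompt[:40]}"

      async def fan_out(task: str, subtasks: list[str]) -> list[str]:
          # Spawn one "agent" per subtask and run them all concurrently;
          # nothing is gained by issuing these requests in series.
          return await asyncio.gather(
              *(call_llm(f"{task}\nSubtask: {s}") for s in subtasks)
          )

      results = asyncio.run(fan_out("optimize kernel", ["plan", "write", "profile"]))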

  • by ekelsen on 5/30/25, 10:18 PM

    "FP32 is less common in modern ML workloads and often less optimized on recent hardware compared to FP16 or BF16, which may partly explain why it’s easier to achieve performance gains over PyTorch with FP32 kernels."

    People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels that have actually received developer effort and that are actually used.

  • by thorum on 5/30/25, 9:52 PM

    My takeaway - from this article, from Google’s AlphaEvolve [1], and the recent announcement about o3 finding a zero day in the Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular have reached a new level of capability where these ideas that were tried unsuccessfully with other models, suddenly just work.

    [1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...

    [2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

  • by ekelsen on 5/30/25, 10:42 PM

    "the reference code is in the default FP32, and given a tolerance threshold (1e-02)"

    that's a huge tolerance and allows them to use fp16 operations to replace the "fp32" kernel.
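
    For scale (a sketch; the exact harness and shapes from the post aren't reproduced here, this just shows the size of fp16 rounding error relative to that threshold): round-tripping fp32 values through fp16 costs on the order of 1e-3 of precision, an order of magnitude below a 1e-02 tolerance, so a half-precision computation can easily pass as an "fp32" kernel.

      import torch

      x = torch.randn(1024)            # fp32 values, roughly in [-4, 4]
      y = x.half().float()             # lose precision by passing through fp16

      print((x - y).abs().max())       # on the order of 1e-3 at worst
      print(torch.allclose(x, y, atol=1e-2))  # True: well inside the tolerance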

  • by userbinator on 5/31/25, 10:39 AM

    Am I the only one who was enticed into this article by thinking they had AI generate an OS kernel?

  • by vessenes on 5/31/25, 12:42 AM

    By far the most interesting part (after the 400% speedup in some cases) is the methodology: rather than hill-climbing on operations, they forced a language reasoning step between iterations to encourage diversity of search. This seems to have worked. Very, very interesting.
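
    The post doesn't publish the orchestration code, so this is only a stand-in sketch of that structure (every helper below is a hypothetical placeholder, not the authors' code): a natural-language "idea" step between code-generation steps, fanning out over ideas instead of hill-climbing on a single candidate.

      import random

      def propose_ideas(kernel, n=4):
          # Stand-in for the language reasoning step (an LLM proposing directions).
          return [f"idea {i} for {kernel}" for i in range(n)]

      def write_kernel(kernel, idea):
          # Stand-in for the code-generation step.
          return f"{kernel}+[{idea}]"

      def is_correct(kernel):
          return random.random() > 0.3      # stand-in for the numerical check

      def benchmark(kernel):
          return random.uniform(0.5, 2.0)   # stand-in runtime, lower is better

      best, best_time = "baseline", 1.0
      for _ in range(5):
          candidates = [write_kernel(best, i) for i in propose_ideas(best)]
          for k in filter(is_correct, candidates):
              t = benchmark(k)
              if t < best_time:
                  best, best_time = k, t
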
  • by FL33TW00D on 5/31/25, 8:39 AM

    Tried a replication here. The LayerNorm kernel is not numerically stable, so it cannot be counted as valid. They only test with zero mean and unit std, so the catastrophic cancellation doesn't show up in their tests (see the sketch below).

    EDIT: looks like they've since generated another one that is numerically stable! great work
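
    A sketch of why zero-mean, unit-std inputs hide the problem (the replicated kernel itself isn't reproduced here): compute the variance the numerically unstable way, E[x^2] - E[x]^2, and compare against torch.var. With a large mean the two big terms cancel catastrophically; with standard-normal inputs they don't.

      import torch

      def unstable_var(x):
          # E[x^2] - E[x]^2: two large, nearly equal terms when the mean is big.
          return (x * x).mean() - x.mean() ** 2

      ok = torch.randn(4096)          # mean ~0, std ~1: cancellation is benign
      bad = torch.randn(4096) + 1e4   # large mean: the leading digits cancel

      print(unstable_var(ok), torch.var(ok, unbiased=False))    # agree closely
      print(unstable_var(bad), torch.var(bad, unbiased=False))  # wildly off, can even go negative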

  • by Workaccount2 on 5/30/25, 9:24 PM

    Very fascinating result, and it seems they wrote this blog post out of pure excitement to share their findings, and maybe to have someone throw cold water on it before publishing, ha.

    Who knows if this is the actual fabled path of "self improvement", but results like this are what we expect to find on such a path.

  • by yahoozoo on 5/30/25, 9:18 PM

    Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they don’t mention which one produced the better kernels.

  • by bgwalter on 5/31/25, 2:04 PM

    > They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.

    The PyTorch code base is NOT written by performance experts in any way. This is the wrong baseline. Nothing about that code base is clean or hand-optimized.

    The "AI" generation methodology seems to give many instructions and even descends into instruction trees, manually throwing away results etc. So it requires, as usual, extreme guidance.

  • by poltomo on 5/31/25, 4:13 PM

    Beating PyTorch and TensorFlow kernels has been easy to do with ML compilers since ~2018. You typically train and evaluate your model in one of these frameworks, then hand off the computation graph to a compiler like Apache TVM or your hardware vendor’s proprietary one. They should test their kernels against those compiler-generated kernels.

    ML-guided heuristic search over compute schedules is as old as 2013 (Halide for image processing).

  • by constantcrying on 5/30/25, 10:29 PM

    >and test for correctness by checking the numerical equality of the two outputs over many random inputs.

    This is fundamentally different from how any human would approach this problem, and also different from how some recent advances in this area were made, where AI actually came up with superior and correct algorithms.

    This approach also seems quite unfortunate and makes many of these results somewhat doubtful.

  • by brrrrrm on 5/30/25, 9:57 PM

    What's going to be interesting is to see the large space of fused kernels being tackled by AI-generated code. That might include gemm + relu + gemm + a norm of some kind, which would be annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite as a human.
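
    For concreteness, the kind of chain being described, written as plain PyTorch ops (a sketch of the pattern only; a fused kernel would compute this in one pass instead of launching four kernels and materializing every intermediate):

      import torch
      import torch.nn.functional as F

      def mlp_block(x, w1, w2):
          h = x @ w1                             # gemm
          h = F.relu(h)                          # relu
          h = h @ w2                             # gemm
          return F.layer_norm(h, h.shape[-1:])   # a norm of some kind

      x  = torch.randn(64, 1024)
      w1 = torch.randn(1024, 4096)
      w2 = torch.randn(4096, 1024)
      out = mlp_block(x, w1, w2)
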
  • by klingenm on 5/31/25, 7:34 AM

    This sounds more like using AI (an LLM) as one small step, where the randomness in the output is used to implement a genetic algorithm, than being "AI-generated" (admittedly technically correct).

    (Edit, typo)

  • by adityamwagh on 5/30/25, 10:22 PM

    Sometimes I think of LLMs as kind of a hive mind. It’s trained on the thought processes of so many humans. I think that’s why it’s able to do these kinds of things, given how much information and context it has compressed into its weights.

  • by david-gpu on 5/31/25, 1:01 AM

    Disclaimer: This used to be my bread and butter, but I'm really rusty after five years of not working on this sort of stuff.

    That said, after quickly skimming the example AI-generated kernel I am not seeing anything novel there. While working at nVidia I did see a handful of techniques that, frankly, blew my mind.

    Thus, I wonder what makes this AI-generated kernel faster than the standard PyTorch kernel, which I presume is simply delegating all the heavy lifting onto cuDNN. My guess, and it's just a guess, is that they are comparing the fastest AI-generated kernel they produced for a very particular set of parameters against whatever kernel cuDNN is picking for that same scenario, and perhaps the subsystem inside cuDNN that picks which kernel to execute out of the very large database it manages chose a suboptimal candidate. Researchers tend to completely ignore this issue and assume that cuDNN is always able to choose the very best kernel in every possible scenario, something that is just not realistic.

    Maybe there is something else going on, but these sorts of "we have beaten this heavily optimized proprietary library" claims always seem to miss this very important point.
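
    One concrete knob that touches this point (not something the post uses, just an illustration, and it needs a CUDA device): PyTorch exposes cuDNN's own autotuner, which benchmarks the candidate kernels for the exact shapes you run instead of trusting the heuristic pick.

      import torch

      # Ask cuDNN to benchmark its candidate algorithms for the exact
      # convolution shapes used, rather than relying on its heuristic choice.
      torch.backends.cudnn.benchmark = True

      conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
      x = torch.randn(8, 64, 128, 128, device="cuda")
      y = conv(x)   # the first call for this shape triggers the autotuning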

    Kind regards to any NVidia insiders who may read this. You guys are the brightest people I've ever met.

  • by MangoToupe on 5/31/25, 6:00 AM

    > Our results are benchmarked on an Nvidia L40S

    At the very least they could have used consumer hardware. I don't even know how to parse that model name, it's so consumer-alien.

  • by JSR_FDED on 5/30/25, 11:52 PM

    Could this be used to create kernels for OpenCL, ROCm, etc?

  • by reliabilityguy on 5/30/25, 9:20 PM

    Is my understanding correct that they assumed a fixed size of the input?

    If so, why is it surprising that generic implementations in PyTorch are worse?

  • by Mathnerd314 on 6/1/25, 2:17 AM

    > we didn't mean to publish yet

    I was thinking this was about leaking the kernels or something, but no, they are "publishing" them in the sense of putting out the blog post - they just mean they are skipping the peer review process and not doing a formal paper.

  • by t-vi on 6/1/25, 4:19 PM

    Note that PyTorch's kernels are somewhat generic in shape. It has always been relatively easy to get speedups by specializing the shape, e.g. Apache TVM had that (back before it was "Apache" even).
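
    As a rough illustration of what shape specialization means in current PyTorch terms (not the kernels from the post): torch.compile can be asked to specialize on the concrete shapes it traces rather than emit shape-generic code.

      import torch
      import torch.nn.functional as F

      def op(x):
          return F.layer_norm(x, x.shape[-1:])

      # dynamic=False asks the compiler to specialize for the exact shapes it
      # sees, in the spirit of the hand-specialized kernels discussed above.
      specialized = torch.compile(op, dynamic=False)

      x = torch.randn(32, 4096)
      out = specialized(x)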