from Hacker News

The FFT Strikes Back: An Efficient Alternative to Self-Attention

by iNic on 2/26/25, 9:57 AM with 168 comments

  • by xeonmc on 2/26/25, 11:22 AM

    Basically leverages the convolution theorem[0]: expensive convolutions in direct space become simple multiplications in reciprocal space, and vice versa.

    Wherever you have a convolution operation on your data, transform it to the conjugate domain to turn the convolution into a multiplication.

    In other words, work in the domain that is natural to your data.

    [0] https://en.wikipedia.org/wiki/Convolution_theorem
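
    A quick numerical check of the theorem (a toy sketch with hypothetical random signals, using NumPy):

      import numpy as np

      # Hypothetical toy signals.
      n = 8
      x = np.random.randn(n)
      k = np.random.randn(n)

      # Circular convolution computed directly in "direct space": O(n^2).
      direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)])

      # Same result via the conjugate (frequency) domain: O(n log n).
      via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

      assert np.allclose(direct, via_fft)  # identical up to floating-point error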

  • by yagizdegirmenci on 2/26/25, 11:24 AM

    Google introduced this idea in 2021 with "FNet: Mixing Tokens with Fourier Transforms" [0].

    Later they found that, on their TPUs, matrix multiplication was faster than the FFT in most scenarios.

    [0]: https://arxiv.org/abs/2105.03824
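
    For context, the FNet mixing step itself is tiny; a rough NumPy sketch of it, as described in the abstract of [0] (shapes are hypothetical):

      import numpy as np

      def fnet_mixing(x):
          # FNet token mixing: 2D FFT over the sequence and hidden dimensions,
          # keeping only the real part. The mixing step has no learned parameters.
          return np.fft.fft2(x).real

      x = np.random.randn(16, 64)   # hypothetical (seq_len, hidden_dim) activations
      mixed = fnet_mixing(x)        # same shape; every token now "sees" every other token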

  • by markisus on 2/26/25, 1:38 PM

    The Fourier transform is taken along the “token” dimension. However, in many applications, this dimension is not meaningful. That’s why transformers are a great option for consuming data which is permutation invariant.

    I would like to see additional experiments using the lesser known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.

    I also wonder, if this becomes the next big thing for LLMs, how easy will it be for inference engines (e.g. vLLM, llama.cpp) to integrate it?

    [1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...

  • by pointlessone on 2/26/25, 11:09 AM

    OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to the attention mechanism? What frequencies does it talk about? How does it encode positional relations between tokens?
  • by yorwba on 2/26/25, 11:24 AM

    I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.

    If the results were close to state-of-the-art, probably the author would've mentioned it?

  • by wafngar on 2/26/25, 2:54 PM

    Should be a relevant reference: https://arxiv.org/abs/2111.13587

    Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers, by John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

  • by johntb86 on 2/26/25, 4:45 PM

    Does anyone have an intuition about why looking at things in the frequency domain is helpful here? The DC term I can see, but I wouldn't expect the input data is periodic enough that other frequencies would be meaningful.
  • by semiinfinitely on 2/26/25, 8:20 PM

    I see no mention of the prior Hyena operator work, which already demonstrated O(n log n) full-context mixing several years ago.

    https://arxiv.org/abs/2302.10866

  • by tonetegeatinst on 2/26/25, 3:12 PM

    I sort of grasp big O notation... but this is mostly over my head, like most stuff that has to do with computer or electrical engineering.

    As someone who is absolutely terrible at math, I envy the people who grasp or at least can learn this type of stuff and get an engineering degree and license.

    All I really know about the FFT is that it changes a signal, it's somehow used in processing signals of some kind, and from what I've heard it was apparently the key to detecting nuclear detonations back in the day.

  • by hinkley on 2/26/25, 10:20 PM

    I think in the age of telemetry we are also missing a substantial trick by not applying FFTs to cloud telemetry to smoke out epicycles and metastable systems before rather than after they trigger drama. This is unfortunately within my level of notice but not within my level of skill, and my dance card is already full.

    "SLA's are most likely to be violated 23-25 minutes after a service deployment. Hmm, I wonder why that is... Oh no."

  • by bjenik on 2/26/25, 3:53 PM

    Working in the Fourier domain has been a staple of scientific and engineering applications. Learning those interactions rather than just hardcoding them has been fairly widely explored as well - the term to look for is Fourier Neural Operators [1][2]. It turns out you can prove universality even if the nonlinearity remains in the real domain [3].
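
    A minimal sketch of the 1D spectral-convolution layer at the heart of FNOs, with untrained stand-in weights (a simplification of the construction in [1], not a faithful implementation):

      import numpy as np

      def spectral_layer(u, weights, n_modes):
          # FFT, scale a few low-frequency modes by (nominally learned) complex
          # weights, drop the rest, inverse FFT. The input resolution can change
          # without changing the weights, which is the resolution-invariance point.
          u_hat = np.fft.rfft(u)
          out_hat = np.zeros_like(u_hat)
          out_hat[:n_modes] = u_hat[:n_modes] * weights
          return np.fft.irfft(out_hat, n=len(u))

      n, n_modes = 256, 16
      weights = np.random.randn(n_modes) + 1j * np.random.randn(n_modes)  # stand-in for trained params
      u = np.sin(np.linspace(0, 4 * np.pi, n))   # hypothetical sampled input function
      v = spectral_layer(u, weights, n_modes)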

    The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there’s even a mainstage TED talk about its applications [5].

    A nice property of doing things this way is that your model ends up being resolution-invariant which is particularly interesting for engineering domains. Scaling these methods has sparked the "let’s do a fully deep-learning-based weather model"-race [6][7].

    As for using this on text data: my intuition is that it is not going to work as well, because of a fairly unique property of text. For image, video and scientific data, each individual element is of approximately equal importance, whereas in text you can have discrete tokens like a "not" somewhere in there that change the meaning of everything around them fairly significantly, and you'd want the all-to-all interaction to capture that. Any kind of mixing that smooths things out is going to be at an inherent disadvantage - probably true to some degree for most of these efficiency-saving methods, and why we're seeing more limited adoption on text.

    [1] https://arxiv.org/abs/2010.08895

    [2] https://www.nature.com/articles/s42254-024-00712-5

    [3] https://jmlr.org/papers/v22/21-0806.html

    [4] https://www.youtube.com/watch?v=jhDiaUL_RaM&t=2472s

    [5] https://www.ted.com/talks/anima_anandkumar_ai_that_connects_...

    [6] https://arxiv.org/abs/2202.11214 (Feb 2022)

    [7] https://www.wired.com/story/ai-hurricane-predictions-are-sto...

  • by antoineMoPa on 2/26/25, 2:30 PM

    What I don't get about attention is why it would be necessary when a fully connected layer can also "attend" to all of the input. With very small datasets (think 0 - 500 tokens), I found that attention makes training longer and results worse. I guess the benefits show up with much larger datasets. Note that I'm an AI noob just doing some personal AI projects, so I'm not exactly a reference.
  • by gotoeleven on 2/27/25, 3:05 AM

    At a high level, it's very unintuitive to me that transforming to the frequency domain for analyzing long sequences of tokens--when there is in general no expectation that the sequences have a periodic structure--can improve efficiency.

    Stated another way, how can it be more efficient to translate the sequence into a series of N variables, where the nth variable is the sum of every nth term of the sequence, if it is unlikely that any relationship between these variables holds for any fixed period? If I combine the 1st, 4th, 7th, 10th, ... elements of the sequence, how do we expect the addition of anything beyond the first two elements to add anything but noise?

    Stated yet another way, if I'm going to approximate a function as a sum of sine waves, this is most efficient when the function is periodic, and it requires more and more sine waves in the sum to approximate the function on a larger and larger domain when the function is not periodic.
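
    For reference (not from the thread), the DFT doesn't assume periodicity and loses nothing: each coefficient is a phase-weighted sum over all positions, and the inverse transform recovers the sequence exactly:

      X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N},
      \qquad
      x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{2\pi i k n / N}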

  • by DrNosferatu on 2/26/25, 1:10 PM

    Well done!

    Besides immediate speed gains, I guess this opens the door to ultra-long contexts. Larger than, say, 16M tokens.

  • by avereveard on 2/26/25, 10:56 AM

    Isn't flash attention already n log n?
  • by pk-protect-ai on 2/26/25, 2:34 PM

    Am I wrong that this will be incompatible with the OrthoGrad¹ optimizer?

    [¹] https://arxiv.org/abs/2501.04697

  • by nialv7 on 2/27/25, 8:24 AM

    How does this compare with Mamba? Intuitively it feels like this should have similar performance to Mamba, but the math is a lot simpler and easier to understand.
  • by soulofmischief on 2/26/25, 2:35 PM

    Amazing. I had a similar insight and did exactly this some time ago, transforming the input into the frequency domain in order to preserve input energy and encode information in the complex space, plus filtering based on global context (different filter implementation than discussed in this paper) but only trained the model locally and didn't bother seeing how it scaled. Congrats on the hard work!
  • by quantadev on 2/26/25, 8:11 PM

    Since I believe consciousness itself is made of EMF waves generated by neural activity (rather than of the synaptic firings themselves, which I view merely as signal carriers, like the I/O to/from brains), I'm glad any time FFTs are used in any way in NNs or AI research.

    I started to develop my own custom type of MLP (multilayer perceptron) that was going to use frequencies and phase angles (FFT) as the "model weights", but then I decided it would probably only outperform the standard MLP if the training data itself were periodic in nature, rather than language tokens or even image data. Not sure if that's correct or not, since Fourier analysis shows us that ANY arbitrary function can be represented as a superposition of waves.

    I still believe that even if we do achieve something amazing (i.e. competitive with SOTA AI models) with a wave-based NN, it won't create any 'true' qualia, because simulating EMF waves in a computer is not the same as real EMF waves existing. I think even a 100% perfect simulation of a brain in a computer, for example, will always be a 'zombie' (no qualia). This is obvious if consciousness is indeed made of waves; but it's astounding how many NN researchers are so illiterate in the field of neuroscience that they don't realize how much evidence there is that consciousness is a wave phenomenon.

  • by mkw5053 on 2/26/25, 3:52 PM

    I'm interested in how this would work for generative models. It's not obvious how you'd implement causal masking in the frequency domain. And the modReLU activation seems critical but adds implementation complexity. Would love to see how this scales on truly massive context lengths where the theoretical advantages should really shine.
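
    For reference, the textbook modReLU (Arjovsky et al.) is only a few lines; whether the paper's variant matches this exactly is an assumption on my part:

      import numpy as np

      def modrelu(z, b):
          # Shrink or zero the magnitude of each complex value while keeping its
          # phase: ReLU(|z| + b) * z / |z|, with b a learned (usually negative) bias.
          mag = np.abs(z)
          return np.maximum(mag + b, 0.0) * z / (mag + 1e-8)

      z = np.fft.fft(np.random.randn(8))   # hypothetical frequency-domain activations
      out = modrelu(z, b=-0.1)             # -0.1 matches the default bias mentioned elsewhere in the thread
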
  • by leecarraher on 2/26/25, 2:51 PM

    This seems to follow the similar FNO work by NVIDIA, and switching to the frequency domain is in most computer scientists' toolbox at this point. However, I'm curious whether this translates to real gains on real architectures. The FFT uses complex numbers to encode the harmonics of the signal, and these are generally not amenable to GPU architectures. Would the fast Walsh-Hadamard transform suffice (see the tiny sketch below)? Sometimes the "signal mixing" is more important than the harmonics of a composition of sines. Or do we go further down the rabbit hole of trained transformations and try out wavelets? I am an avid FFT fan (I love the fast Johnson-Lindenstrauss transform, using the embedded uncertainty principle for RIP), but real hardware and good theory don't always align (e.g. there are sub ternary matrix multiplies, but they are rarely used in DL).
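
    For what it's worth, the fast Walsh-Hadamard transform really is just additions and subtractions; a tiny sketch (not from the paper):

      import numpy as np

      def fwht(a):
          # In-place fast Walsh-Hadamard transform: O(n log n) with no complex
          # arithmetic. Length must be a power of two; result is unnormalized.
          a = np.array(a, dtype=float)
          h = 1
          while h < len(a):
              for i in range(0, len(a), 2 * h):
                  for j in range(i, i + h):
                      x, y = a[j], a[j + h]
                      a[j], a[j + h] = x + y, x - y
              h *= 2
          return a

      print(fwht([1.0, 0.0, 1.0, 0.0]))   # -> [2. 2. 0. 0.]
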
  • by a-dub on 2/27/25, 7:48 AM

    hm. nothing that happens in the fourier domain can touch the phases, which seems like a constraint that would change the behavior of other layers.

    the default bias of -0.1 with relus and what i would expect to be a flattish spectrum also seems like it would make for a sparse representation in the fourier domain.

    i assume this is learning the text embeddings at training time, if so, i'd be curious how the constraints of going through the fft and filtering magnitudes would/could change how the embeddings look.

  • by bizarrevr on 2/26/25, 1:01 PM

    I am very out of my depth here haha with the math, appreciate those below taking the time to baby walk me along with the understanding! Great links too!
  • by farhanhubble on 2/26/25, 12:48 PM

    Makes me think: what if NNs are treated as black-box signal processing units? What other techniques can we borrow from signal processing?
  • by ipunchghosts on 2/26/25, 1:15 PM

    This guy has had a lot of bangers lately.
  • by EGreg on 2/26/25, 2:51 PM

    But does it allow unlimited context windows, like RoPE does, which LLaMa 2 now features?
  • by alexkranias on 3/3/25, 1:40 PM

    Is this not just a global convolution?
  • by DrNosferatu on 2/26/25, 12:03 PM

    Can someone confirm the big O time complexities of

    1. traditional Self-Attention;

    2. Flash-Attention?

    3. Any novel others?

  • by A7C3D5 on 2/26/25, 11:59 AM

    I'll never not read FFT as Final Fantasy Tactics.
  • by whoisthemachine on 2/27/25, 2:22 PM

    My EE education keeps coming back.
  • by fithisux on 2/27/25, 7:05 AM

    Very interesting idea.
  • by 29athrowaway on 2/26/25, 10:47 PM

    Is this why biological neural networks are spiking?
  • by gunian on 2/26/25, 10:27 PM

    if simping is bad doesn't that mean polynomial time complexity would be the best?
  • by cs702 on 2/26/25, 11:26 AM

    TL;DR:

    1. Take FNet (https://arxiv.org/abs/2105.03824).

    2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.

    3. Apply non-linear functions to both real and imaginary components, before mapping the convolved data back to the time domain.
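
    A rough, non-authoritative sketch of that recipe; the shapes and the way the filter is conditioned on context are my guesses, not the paper's:

      import numpy as np

      def spectral_token_mix(x, w_filter):
          # FFT along the token axis, apply a data-dependent filter (here crudely
          # conditioned on a mean-pooled context vector), apply a nonlinearity to
          # the real and imaginary parts, then inverse-FFT back to the time domain.
          x_hat = np.fft.rfft(x, axis=0)                 # (n_freq, d) complex spectrum
          ctx = x.mean(axis=0)                           # global context summary (a guess)
          gate = 1.0 + np.tanh(ctx @ w_filter)           # (n_freq,) dynamic per-frequency filter
          x_hat = x_hat * gate[:, None]
          x_hat = np.maximum(x_hat.real, 0) + 1j * np.maximum(x_hat.imag, 0)  # ReLU on both parts
          return np.fft.irfft(x_hat, n=x.shape[0], axis=0)

      n, d = 128, 64                                     # hypothetical sequence length, model dim
      x = np.random.randn(n, d)
      w_filter = 0.01 * np.random.randn(d, n // 2 + 1)   # stand-in for learned parameters
      y = spectral_token_mix(x, w_filter)                # (n, d); every token mixed with every other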

  • by larodi on 2/26/25, 10:53 AM

    Man, this all really gets increasingly complex, with increasingly complex math…
  • by TheDudeMan on 2/26/25, 1:06 PM

    Machines going to be thinking in the frequency domain. We cooked.