by iNic on 2/26/25, 9:57 AM with 168 comments
by xeonmc on 2/26/25, 11:22 AM
Wherever you have a convolution operation on your data, transform it to the conjugate domain to turn the convolution into multiplication.
In other words, work in the domain that is natural to your data.
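(A minimal numpy sketch of that equivalence, my own illustration rather than anything from the article: convolving two sequences directly gives the same result as multiplying their zero-padded FFTs and transforming back.)

    import numpy as np

    x = np.random.randn(256)
    k = np.random.randn(256)

    direct = np.convolve(x, k, mode="full")   # O(n^2) convolution in the original domain

    n = len(x) + len(k) - 1                   # zero-pad so circular convolution matches linear convolution
    via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)  # O(n log n)

    assert np.allclose(direct, via_fft)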
by yagizdegirmenci on 2/26/25, 11:24 AM
Later they found out that matrix multiplication on their TPUs was faster than the FFT in most scenarios.
by markisus on 2/26/25, 1:38 PM
I would like to see additional experiments using the lesser known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.
I also wonder: if this becomes the next big thing for LLMs, how easy will it be for inference engines (e.g. vLLM, llama.cpp) to integrate it?
[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...
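(A hedged aside of my own, not from [1]: for a finite abelian group such as Z_n the group Fourier transform reduces to the ordinary DFT, with the group characters as the basis; the non-abelian case the comment alludes to, e.g. permutation groups, replaces characters with matrix-valued representations.)

    import numpy as np

    n = 8
    f = np.random.randn(n)  # a function on the cyclic group Z_n

    # Characters of Z_n used here: chi_k(x) = exp(-2*pi*i*k*x/n)
    k, x = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    chars = np.exp(-2j * np.pi * k * x / n)

    group_ft = chars @ f                         # f_hat(k) = sum_x f(x) * chi_k(x)
    assert np.allclose(group_ft, np.fft.fft(f))  # coincides with the usual DFT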
by yorwba on 2/26/25, 11:24 AM
If the results were close to state-of-the-art, probably the author would've mentioned it?
by wafngar on 2/26/25, 2:54 PM
Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers, by John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro.
by tonetegeatinst on 2/26/25, 3:12 PM
As someone who is absolutely terrible at math, I envy the people who grasp or at least can learn this type of stuff and get an engineering degree and license.
All I really know about the FFT is that it changes a signal, it's somehow used in processing signals of some kind, and apparently, from what I heard, it was the key to detecting nuclear detonations back in the day.
by hinkley on 2/26/25, 10:20 PM
"SLA's are most likely to be violated 23-25 minutes after a service deployment. Hmm, I wonder why that is... Oh no."
by bjenik on 2/26/25, 3:53 PM
The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there’s even a mainstage TED talk about its applications [5].
A nice property of doing things this way is that your model ends up being resolution-invariant, which is particularly interesting for engineering domains. Scaling these methods has sparked the "let’s do a fully deep-learning-based weather model" race [6][7].
As for using this on text data: my intuition is that it's not going to work as well because of a fairly unique property of text: for images, video and scientific data, each individual element is of approximately equal importance, whereas in text you can have discrete tokens like a "not" somewhere in there that change the meaning of everything around them fairly significantly, and you’d want that all-to-all interaction to capture that. Any kind of mixing that smooths things out is going to inherently be at a disadvantage - probably true to some degree for most of those efficiency-saving methods, and why we’re seeing more limited adoption on text.
[1] https://arxiv.org/abs/2010.08895
[2] https://www.nature.com/articles/s42254-024-00712-5
[3] https://jmlr.org/papers/v22/21-0806.html
[4] https://www.youtube.com/watch?v=jhDiaUL_RaM&t=2472s
[5] https://www.ted.com/talks/anima_anandkumar_ai_that_connects_...
[6] https://arxiv.org/abs/2202.11214 (Feb 2022)
[7] https://www.wired.com/story/ai-hurricane-predictions-are-sto...
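(A toy illustration of that last concern, my own and assuming the mixing acts like a low-pass spectral filter: a single decisive token is a localized spike along the sequence, and discarding high frequencies both attenuates it and smears it across its neighbors.)

    import numpy as np

    seq = np.zeros(64)
    seq[31] = 1.0                     # one decisive token, e.g. a lone "not"

    spec = np.fft.rfft(seq)
    spec[8:] = 0.0                    # keep only the lowest 8 frequency bins
    smoothed = np.fft.irfft(spec, 64)

    # The spike is attenuated and leaks into surrounding positions.
    print(smoothed[31], smoothed[29], smoothed[25])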
by gotoeleven on 2/27/25, 3:05 AM
Stated another way, how can it be possible that it is more efficient to translate the sequence into a series of N variables, where the nth variable is the sum of every nth term of the sequence, if it is unlikely that any relationship between these variables holds for any fixed period? If I combine the 1st, 4th, 7th, 10th, ... elements of the sequence, how can we expect the addition of anything beyond the first two elements to add anything but noise?
Stated yet another way, if I'm going to approximate a function as a sum of sine waves, this is most efficient when the function is periodic, and it requires more and more sine waves in the sum to approximate the function on a larger and larger domain when the function is not periodic.
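(A quick numerical check of that intuition, as my own sketch: a sinusoid that completes an integer number of cycles over the window is captured by a single Fourier coefficient, while the same sinusoid with a non-integer number of cycles, i.e. not periodic on that window, spreads its energy over several.)

    import numpy as np

    n = 256
    t = np.arange(n)

    def coeffs_for_95pct_energy(x):
        """Number of largest-magnitude rfft coefficients needed to capture 95% of the energy."""
        e = np.sort(np.abs(np.fft.rfft(x)))[::-1] ** 2
        return int(np.searchsorted(np.cumsum(e) / e.sum(), 0.95)) + 1

    print(coeffs_for_95pct_energy(np.sin(2 * np.pi * 4.0 * t / n)))  # periodic on the window: 1
    print(coeffs_for_95pct_energy(np.sin(2 * np.pi * 4.5 * t / n)))  # not periodic on the window: several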
by DrNosferatu on 2/26/25, 1:10 PM
Besides immediate speed gains, I guess this opens the door to ultra-long contexts. Larger than, say, 16M tokens.
by quantadev on 2/26/25, 8:11 PM
I started to develop my own custom type of MLP (multilayer perceptron) that was going to use frequencies and phase angles (FFT) as the "model weights", but then I decided it would probably only outperform the standard MLP if the training data itself was periodic in nature, rather than language tokens or even image data. Not sure if that's correct or not, since Fourier series show us that ANY arbitrary function can be approximated via a superposition of waves.
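(For what it's worth, a minimal PyTorch sketch of how I read that idea; this is my guess, not the commenter's actual code: each hidden unit stores a frequency, a phase angle and an amplitude, and outputs a superposition of sinusoids of its input.)

    import torch
    import torch.nn as nn

    class WaveLayer(nn.Module):
        """Hypothetical layer whose 'weights' are frequencies, phases and amplitudes."""

        def __init__(self, in_dim: int, n_waves: int):
            super().__init__()
            self.freq = nn.Parameter(torch.randn(n_waves, in_dim))  # frequencies
            self.phase = nn.Parameter(torch.zeros(n_waves))         # phase angles
            self.amp = nn.Parameter(torch.ones(n_waves))            # amplitudes

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, in_dim) -> (batch, n_waves): a sum-of-waves feature map of the input
            return self.amp * torch.sin(x @ self.freq.t() + self.phase)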
I still believe that if we do achieve something amazing (i.e. competitive with SOTA AI models) with a wave-based NN, it won't create any 'true' qualia, because simulating EMF waves in a computer is not the same as real EMF waves existing. I think even a 100% perfect simulation of a brain in a computer, for example, will always be a 'zombie' (no qualia). This is obvious if consciousness is indeed made of waves; but it's astounding how many NN researchers are so illiterate in the field of neuroscience that they don't realize how much evidence there is that consciousness is a wave phenomenon.
by a-dub on 2/27/25, 7:48 AM
The default bias of -0.1 with ReLUs, and what I would expect to be a flattish spectrum, also seems like it would make for a sparse representation in the Fourier domain.
I assume this is learning the text embeddings at training time; if so, I'd be curious how the constraints of going through the FFT and filtering magnitudes would/could change how the embeddings look.
by DrNosferatu on 2/26/25, 12:03 PM
1. traditional Self-Attention;
2. Flash-Attention?
3. Any novel others?
by cs702 on 2/26/25, 11:26 AM
1. Take FNet (https://arxiv.org/abs/2105.03824).
2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.
3. Apply non-linear functions to both real and imaginary components before mapping the convolved data back to the time domain (a minimal sketch of this recipe follows below).
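A minimal PyTorch sketch of that recipe, under my own reading of steps 2 and 3 (the layer name, shapes and the GELU choice are assumptions, not anything from the post):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicSpectralMixer(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            # Step 2: the frequency-domain filter is computed from the data instead of being fixed.
            self.filter_proj = nn.Linear(dim, 2 * dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, dim) real-valued token embeddings
            x_f = torch.fft.fft(x, dim=1)                      # step 1: FNet-style mixing along the sequence
            f_re, f_im = self.filter_proj(x).chunk(2, dim=-1)  # step 2: data-dependent complex filter
            y = x_f * torch.complex(f_re, f_im)                # convolution = multiplication in frequency space
            y = torch.complex(F.gelu(y.real), F.gelu(y.imag))  # step 3: non-linearity on real and imaginary parts
            return torch.fft.ifft(y, dim=1).real               # map back to the time (token) domain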