by tipsytoad on 3/22/24, 6:13 PM with 33 comments
by p1esk on 3/22/24, 9:10 PM
by valine on 3/22/24, 11:16 PM
by tbalsam on 3/22/24, 10:52 PM
However, maybe this is not the case. I have a bit of a history of messing with residuals in neural networks, so seeing more work on them is good. Fast-training networks are, of course, a mild obsession of mine as well, and very useful to the field. Here's hoping it pans out as a motif; curious to see where it goes.
by sp332 on 3/22/24, 8:41 PM
by ml_basics on 3/22/24, 9:37 PM
by danieldk on 3/23/24, 9:35 AM
I only glanced at the paper, but they don't seem to softmax the ⍺_i for normalization?
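For context, a minimal sketch of the contrast being asked about: raw learned per-block weights ⍺_i used directly versus weights passed through a softmax so they sum to one. The function and argument names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def depth_weighted_average(block_outputs: list[torch.Tensor],
                           alphas: torch.Tensor,
                           normalize: bool = False) -> torch.Tensor:
    """Combine outputs of all previous blocks with learned scalar weights.

    block_outputs: list of i tensors, each of shape (batch, seq, dim)
    alphas:        tensor of shape (i,), one learned weight per block
    normalize:     if True, softmax the weights so they sum to 1;
                   if False, use the raw learned values as-is.
    """
    w = F.softmax(alphas, dim=0) if normalize else alphas
    stacked = torch.stack(block_outputs, dim=0)      # (i, batch, seq, dim)
    return torch.einsum('i,ibsd->bsd', w, stacked)   # weighted sum over depth
```

Without the softmax the weights are unconstrained, which lets the model scale or even negate a block's contribution, but provides no built-in normalization.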
by zwaps on 3/23/24, 4:06 AM
2. The difference seems to diminish with scale. Real-life transformers are obviously much larger and train on many more tokens.
3. A very significant part of training transformer models is throughput and memory optimization. I wonder how their model would work with fused kernels or specialized paged KV-cache schemes, or with activation checkpointing, if run locally.
4. Indeed, they claim no memory impact, but their code shows that their experiments were conducted with a specially optimized version which requires all activations to reside in a single tensor at all times (sketched below). Not sure this would work with 3D parallelism on multiple nodes, etc.
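To make point 4 concrete, here is a hedged sketch of the two activation layouts being contrasted; these are assumed, illustrative functions, not the repository's actual code. A single contiguous buffer makes a depth-weighted combination cheap, but the whole buffer must stay resident on one device, which is awkward to reconcile with activation checkpointing or sharding activations across nodes.

```python
import torch

def forward_single_buffer(blocks, x):
    """All activations live in one contiguous, preallocated tensor, so a
    depth-weighted average over them is a single matmul. The buffer must
    fit on one device and stay alive for the whole forward pass."""
    n = len(blocks)
    acts = x.new_empty((n + 1, *x.shape))   # (depth + 1, batch, seq, dim)
    acts[0] = x
    for i, block in enumerate(blocks):
        x = block(x)
        acts[i + 1] = x
    return x, acts

def forward_list(blocks, x):
    """Standard layout: one tensor per block. Individual activations can be
    freed, offloaded, or recomputed (activation checkpointing) independently."""
    acts = [x]
    for block in blocks:
        x = block(x)
        acts.append(x)
    return x, acts
```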
by matteopagli on 3/23/24, 1:55 PM
by efrank3 on 3/23/24, 12:49 AM
by aoeusnth1 on 3/22/24, 10:23 PM
> This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.
I found this particularly charming.