by lykahb on 6/9/24, 12:00 AM with 30 comments
by cpldcpu on 6/9/24, 3:41 PM
https://arxiv.org/abs/2402.17764
The main addition of the new paper seems to be the implementation of optimized, fused kernels in Triton, as seen here:
https://github.com/ridgerchu/matmulfreellm/blob/master/mmfre...
This is quite useful, as it should make training this type of LLM much more efficient.
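Not the repo's actual kernel, but roughly what one of these Triton kernels looks like (the kernel name, the absmean threshold, and the row-major fp32 layout are just assumptions for the sketch):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def ternarize_row_kernel(w_ptr, out_ptr, N, BLOCK: tl.constexpr):
        # one program per row: load the row, threshold at its mean |w|,
        # and write back values in {-1, 0, +1} -- all fused into one pass
        row = tl.program_id(0)
        offs = tl.arange(0, BLOCK)
        mask = offs < N
        w = tl.load(w_ptr + row * N + offs, mask=mask, other=0.0)
        thresh = tl.sum(tl.abs(w), axis=0) / N
        q = tl.where(w > thresh, 1.0, tl.where(w < -thresh, -1.0, 0.0))
        tl.store(out_ptr + row * N + offs, q, mask=mask)

    def ternarize(w):
        # w: contiguous (rows, cols) fp32 CUDA tensor
        out = torch.empty_like(w)
        rows, cols = w.shape
        ternarize_row_kernel[(rows,)](w, out, cols, BLOCK=triton.next_power_of_2(cols))
        return out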
So this is a ternary-weight LLM using quantization-aware training (QAT). The activations are quantized to 8 bits. The matmul is still there, but it now multiplies the 8-bit activations by ternary weights.
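To illustrate that point with a toy sketch (mine, not code from the paper): with weights restricted to {-1, 0, +1}, each output element is just a sum of some activations minus a sum of others, so no actual multiplications are needed.

    import torch

    def ternary_matvec(W, x):
        # W: (out_features, in_features) with entries in {-1, 0, +1}
        # x: integer activations (e.g. the 8-bit quantized ones)
        # each output is a sum of some activations minus a sum of others
        return torch.stack([x[row == 1].sum() - x[row == -1].sum() for row in W])

    W = torch.randint(-1, 2, (4, 8))
    x = torch.randint(-128, 128, (8,))
    # matches the "multiply" reference implementation
    assert torch.equal(ternary_matvec(W, x), (W * x).sum(dim=1))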
Quantization-aware training with low-bit weights seems to reduce overfitting through an intrinsic tendency to regularize. However, model capacity should also be reduced compared to a model with the same number of weights and more bits per weight. It's quite possible that this only becomes apparent after the models have been trained on a significant number of tokens, as LLMs seem to be quite sparse.
Edit: In addition to the QAT, they also changed the model architecture to use a linear transformer to reduce reliance on multiplications in the attention mechanism. Thanks to logicchains for pointing this out.
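For anyone unfamiliar with the linear-transformer idea, a generic sketch (illustrative only; the paper's actual token mixer may not be this exact formulation): instead of materializing the seq_len x seq_len softmax(QK^T) matrix, a small running state is updated per token, so the cost grows linearly with sequence length.

    import torch

    def linear_attention(q, k, v):
        # q, k, v: (seq_len, dim); elu(.)+1 is a common positive feature map
        q = torch.nn.functional.elu(q) + 1
        k = torch.nn.functional.elu(k) + 1
        S = torch.zeros(q.shape[1], v.shape[1])   # running sum of k_t v_t^T
        z = torch.zeros(q.shape[1])               # running sum of k_t
        out = []
        for q_t, k_t, v_t in zip(q, k, v):
            S = S + torch.outer(k_t, v_t)
            z = z + k_t
            out.append((q_t @ S) / (q_t @ z + 1e-6))
        return torch.stack(out)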
by buildbot on 6/9/24, 4:39 AM
They get real (61%!?) memory savings during training, and inference too.
On top of all that, they then go build an FPGA core which is programmed with a custom assembler. And their code is posted and works seamlessly with huggingface transformers?! Absolutely going to test this out.
by jph00 on 6/9/24, 4:01 AM
by naasking on 6/9/24, 6:05 PM
by WithinReason on 6/9/24, 11:14 AM
by throwaway71271 on 6/9/24, 11:00 AM
It is super easy to try out: the 2.7B, 1.3B, and 0.37B models are on Hugging Face, and the generate.py example just works if you have Triton 2.2 installed.
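Roughly what loading one of them looks like (the Hub model id and the mmfreelm import are from memory of the repo README, so double-check there):

    import mmfreelm  # registers the custom model classes with transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "ridger/MMfreeLM-2.7B"  # check the repo README for the exact Hub ids
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.cuda()
    outputs = model.generate(input_ids, max_length=32, do_sample=True, top_p=0.9)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])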
by amluto on 6/9/24, 8:44 AM
So what’s the extra trick to make the model stay quantized? Does one evaluate the gradients on a whole bunch of training inputs, add them up, apply some randomness, and then re-quantize the model? Or is it something else?
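As far as I understand, the usual trick (and the BitNet-style recipe this seems to build on) is a straight-through estimator: keep full-precision "shadow" weights, quantize them on every forward pass, and treat the quantizer as the identity in the backward pass, so the optimizer keeps nudging the shadow weights in small continuous steps. A rough sketch (mine, not the paper's code):

    import torch

    def ternary_ste(w):
        # absmean scaling, then round/clamp to {-1, 0, +1}, BitNet-1.58 style
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        # forward pass sees w_q; gradients flow to w as if quantization were identity
        return w + (w_q - w).detach()

    class TernaryLinear(torch.nn.Linear):
        def forward(self, x):
            return torch.nn.functional.linear(x, ternary_ste(self.weight), self.bias)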
by sva_ on 6/9/24, 12:42 PM
by PaulHoule on 6/9/24, 11:50 AM
by hisoka44 on 6/9/24, 1:33 PM
by nuz on 6/9/24, 11:35 AM
by gabesullice on 6/9/24, 4:58 AM