by raymond_goo on 7/30/24, 3:42 AM with 18 comments
by danieldk on 7/30/24, 6:22 AM
> It uses asymmetric quantization and does so layer by layer such that each layer is processed independently before continuing to the next
GPTQ also supports symmetric quantization, and almost everyone uses it. The problem with GPTQ asymmetric quantization is that all popular implementations have a bug [1] where every zero point/bias of 0 is reset to 1 during packing (out of the 16 possible zero points in 4-bit quantization), leading to quite a large loss in quality. Interestingly, people initially observed that symmetric quantization worked better than asymmetric quantization (which is very counter-intuitive, but it made GPTQ symmetric quantization far more popular) and only discovered later that this was due to the bug. A rough sketch of the bug's effect is below the footnote.
[1] https://notes.danieldk.eu/ML/Formats/GPTQ#Packing+integers
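A rough numpy sketch of what that reset costs (just an illustration, not the actual GPTQ packing code: it quantizes per tensor rather than per group, and the helper names are made up):

    import numpy as np

    def asym_quantize(w, bits=4):
        # Asymmetric quantization: real value ~ scale * (q - zero_point)
        qmax = 2**bits - 1                                   # 15 for 4-bit
        scale = (w.max() - w.min()) / qmax
        zero = int(np.clip(np.round(-w.min() / scale), 0, qmax))
        q = np.clip(np.round(w / scale) + zero, 0, qmax)
        return q, scale, zero

    def dequantize(q, scale, zero):
        return scale * (q - zero)

    rng = np.random.default_rng(0)
    w = np.abs(rng.normal(size=4096)).astype(np.float32)    # all-positive weights -> zero point of 0

    q, scale, zero = asym_quantize(w)
    err_correct = np.abs(w - dequantize(q, scale, zero)).mean()
    err_buggy = np.abs(w - dequantize(q, scale, max(zero, 1))).mean()  # zero point 0 silently becomes 1

    print(f"zero point: {zero}")
    print(f"mean abs error, correct zero point: {err_correct:.5f}")
    print(f"mean abs error, buggy zero point:   {err_buggy:.5f}")

With an all-positive weight tensor the correct zero point is 0, so forcing it to 1 shifts every dequantized weight by a full quantization step.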
by jillesvangurp on 7/30/24, 8:23 AM
Intuitively, I like the idea of asymmetric scales as well. Treating all values as equally likely seems wasteful in terms of memory. It would be interesting to see where typical values fall statistically in an LLM; I bet it's nowhere near a uniform distribution.
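A quick way to eyeball it (assumes the Hugging Face transformers package is installed; gpt2 is just a conveniently small checkpoint):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    weights = torch.cat([p.detach().flatten() for p in model.parameters()])
    sample = weights[::64]  # subsample; torch.quantile dislikes very large inputs

    print(f"mean={weights.mean():.4f}  std={weights.std():.4f}")
    for q in (0.001, 0.01, 0.5, 0.99, 0.999):
        print(f"quantile {q:>6}: {sample.quantile(q).item():+.4f}")

For transformer checkpoints like this, the weights cluster tightly around zero with thin tails, nothing like a uniform spread over the representable range.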
by hazrmard on 7/30/24, 5:36 PM
by woodson on 7/30/24, 2:26 PM
by torginus on 7/30/24, 10:16 AM
by llm_trw on 7/30/24, 3:28 PM
Floats are not distributed evenly across the number line: there are as many floats between 1 and 2 as between 2 and 4, then between 4 and 8, and so on. Quantising well to integers means taking this uneven precision into account, since the spacing between integers is always the same.
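Both points are easy to check directly (math.ulp needs Python 3.9+; float16 is used so every bit pattern can be enumerated):

    import math
    import numpy as np

    # Spacing between adjacent doubles grows with magnitude...
    for x in (0.001, 1.0, 16.0, 1024.0):
        print(f"ulp({x:>7}) = {math.ulp(x):.3e}")

    # ...while each power-of-two range holds the same number of float16 values.
    f16 = np.arange(2**16, dtype=np.uint32).astype(np.uint16).view(np.float16)
    f16 = f16[np.isfinite(f16)]
    for lo, hi in [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]:
        n = int(np.count_nonzero((f16 >= lo) & (f16 < hi)))
        print(f"float16 values in [{lo}, {hi}): {n}")

Compare that with 4-bit integers, where the 16 levels are spaced exactly one step apart everywhere.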
by dleeftink on 7/30/24, 2:58 PM
by cheptsov on 7/30/24, 8:42 PM