My attempt to explain groupsize and act order in GPTQ
Take this with a grain of salt: it is heavily researched, but only by me, and I am no expert. I'll link the relevant studies at the bottom.
For background, for anyone who doesn't know: GPTQ is a quantization method used to bring fp16 models down to 4 bits so they can run on consumer hardware. Quantizing is a long and arduous process, largely because it is inherently lossy. To account for this, we adjust the remaining unquantized weights after every quantization step in order to minimize the losses. This involves updating both an inverse Hessian matrix and all the relevant weights, but notably only the not-yet-quantized weights: anything that's already been quantized is locked in place.
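To make the quantize-then-compensate loop above concrete, here's a minimal numpy sketch of its shape. This is an illustration, not GPTQ's actual implementation (real GPTQ uses a Cholesky form of the inverse Hessian, lazy batched updates, and packed 4-bit storage); the function names and the simple scale-based quantizer are my own inventions for the example.

```python
import numpy as np

def quantize_value(x, scale):
    """Toy stand-in for GPTQ's quantizer: round to the nearest
    representable signed 4-bit level, then map back to float."""
    q = np.clip(np.round(x / scale), -8, 7)   # signed 4-bit integer range
    return q * scale

def gptq_column_sweep(W, H_inv, scale):
    """Sketch of the GPTQ loop: quantize one column at a time and push
    the resulting error onto the not-yet-quantized columns, weighted by
    the inverse Hessian. Quantized columns are never touched again."""
    W = W.copy()
    n_cols = W.shape[1]
    for j in range(n_cols):
        q = quantize_value(W[:, j], scale)
        err = (W[:, j] - q) / H_inv[j, j]
        W[:, j] = q                            # column j is now locked in
        # compensate only the remaining unquantized columns
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return W
```

Note how the update at the end of the loop only touches columns to the right of `j`: that's the "locked in" property described above.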
Okay, so now group size. Basically, group size indicates how many columns are grouped together during quantization, and this grouping has two major benefits that grow as the group size goes up. The first is that you save space. When quantizing a column, you first normalize the values across their range and store a quantization factor, which tells your software how much to multiply the stored values by to recover their "real" values. Because of this normalization, if you group weights together and share a single quantization factor across the group, you can save a surprising amount of space with larger groups.
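Here's a small sketch of that idea: one shared scale per group, so fewer scales stored as the group gets bigger. The function name and the symmetric max-abs scaling are assumptions for illustration; real GPTQ also stores zero-points and packs the 4-bit integers tightly.

```python
import numpy as np

def quantize_grouped(w, group_size=128):
    """Toy grouped 4-bit quantization: every `group_size` weights share
    one scale, so larger groups mean fewer scales stored overall."""
    scales, q_ints = [], []
    for start in range(0, len(w), group_size):
        group = w[start:start + group_size]
        scale = np.abs(group).max() / 7        # map the group onto [-7, 7]
        scales.append(scale)
        q_ints.append(np.round(group / scale).astype(np.int8))
    return np.array(scales), np.concatenate(q_ints)

# Storage overhead: one fp16 scale per group on top of 4 bits per weight.
#   group_size=32  -> 16/32  = 0.5   extra bits per weight
#   group_size=128 -> 16/128 = 0.125 extra bits per weight
```

The back-of-the-envelope numbers in the comment show why bigger groups shrink the file: the per-group metadata gets amortized over more weights.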
The second advantage is parallelism during the quantization process itself. If you group columns and quantize them in a vacuum, you only need a single update to the inverse Hessian and the weights per group of, say, 128 columns instead of per column, which dramatically reduces the sequential calculations required, with only a small loss in quality.
Obviously there is still some loss in quality, since several columns are now quantized together without any error compensation between them.
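To show the structural difference, here's the same toy column sweep from before, rewritten the way the two paragraphs above describe it: quantize a whole group of columns "in a vacuum", then apply one batched compensation update per group. This is a sketch of the idea as stated in this post, not of any particular library's implementation.

```python
import numpy as np

def sweep_with_groups(W, H_inv, scale, group_size):
    """Group-batched variant of the quantize-and-compensate sweep:
    columns inside a group don't see each other's corrections, and a
    single combined update is applied after each group instead of one
    per column -- fewer sequential updates, slightly more error."""
    W = W.copy()
    n = W.shape[1]
    for g in range(0, n, group_size):
        end = min(g + group_size, n)
        errs = np.zeros((W.shape[0], end - g))
        for j in range(g, end):
            q = np.clip(np.round(W[:, j] / scale), -8, 7) * scale
            errs[:, j - g] = (W[:, j] - q) / H_inv[j, j]
            W[:, j] = q                        # locked in, as before
        # one batched update to everything past the group
        W[:, end:] -= errs @ H_inv[g:end, end:]
    return W
```

Compared to the per-column loop, the expensive weight update now runs once per group instead of once per column, which is where the speedup comes from.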
That basically sums up group size: bigger groups = a smaller model and faster quantization, at the cost of a small decrease in quality (i.e., an increase in perplexity).
Next up is actorder.
Activation Order is a method of choosing which columns make the most sense to quantize first in order to preserve important information. We start by observing which columns have the largest activation magnitude, that is, the columns that contribute most to the final output of the model because they most strongly activate neurons.
After gathering that information, we start our quantization with those columns, because that means they will most closely reflect their original values after the full quantization is done. Remember: once a column is quantized, it's locked in. If we left some of our important columns until the end, not only might they have been adjusted several times during the process, but, more importantly, very few columns would remain that we could adjust to make up for the quantization loss. So starting with these columns, i.e. act-order or desc_act (the terms are used interchangeably), should result in a minor increase in quality.
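The ordering step itself is simple to sketch. Assuming we have a matrix of calibration activations `X` (rows = samples, columns = the weight columns' inputs), ranking by average squared activation looks like this; the function name is mine, and real implementations typically derive this ranking from the diagonal of the Hessian H = Xᵀ X.

```python
import numpy as np

def actorder_permutation(X):
    """Sketch of act-order column selection: rank columns by average
    squared activation magnitude, descending, so the most important
    columns are quantized first."""
    importance = (X ** 2).mean(axis=0)   # per-column activation energy
    return np.argsort(-importance)       # largest first

# Usage: quantize W's columns in this order, then store the permutation
# so inference can map quantized columns back to their original positions.
```

Quantizing in this permuted order is exactly the "important columns first" strategy described above.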
Side note: I'm not positive at this time why act-order results in an increase in model size. My best guess is that it involves re-arranging the columns in memory in ways that are no longer optimal and can't be mapped into VRAM without wasting space, but that's a pure guess and I would love it if someone chimed in with more info.
And that's it! To sum up: group size means quantizing columns in groups rather than individually, resulting in smaller models that quantize faster, and act order means quantizing in order of activation magnitude to preserve as much of the important information as possible.
If you stuck through that wall of text, thanks! I hope it was insightful (and accurate)
Sources:
https://arxiv.org/abs/2210.17323 (group size explanation)
https://arxiv.org/abs/2306.02272 (act order explanation)