QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Recent years have witnessed rapid development of large language models (LLMs). Despite their strong abilities in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators, which increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.
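For readers who want the group-wise trick in code form: below is a minimal sketch of the idea, assuming a linear layer with INT4 group-wise quantization along the input dimension. The names and shapes (GROUP_SIZE, RANK, dequantize, qa_lora_forward) are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of a QA-LoRA-style forward pass (assumed shapes, not the official code).
import torch

GROUP_SIZE = 32   # input channels per quantization group (assumption)
RANK = 16         # LoRA rank (assumption)

def dequantize(q_weight, scale, zero):
    # q_weight: (out_dim, in_dim) INT4 codes stored as int8
    # scale, zero: (out_dim, in_dim // GROUP_SIZE) group-wise parameters
    s = scale.repeat_interleave(GROUP_SIZE, dim=1)
    z = zero.repeat_interleave(GROUP_SIZE, dim=1)
    return s * (q_weight.float() - z)

def qa_lora_forward(x, q_weight, scale, zero, A, B):
    # x: (batch, in_dim); base path uses the quantized weight.
    base = x @ dequantize(q_weight, scale, zero).t()
    # Group-wise pooling: average the input within each quantization group,
    # so the LoRA "A" matrix only sees L = in_dim // GROUP_SIZE features
    # instead of the full in_dim. This is the reduced degree of freedom
    # on the adaptation side.
    batch, in_dim = x.shape
    pooled = x.view(batch, in_dim // GROUP_SIZE, GROUP_SIZE).mean(dim=2)  # (batch, L)
    # A: (RANK, L), B: (out_dim, RANK)
    return base + pooled @ A.t() @ B.t()
```

Because the low-rank path only sees the per-group mean of the input, its contribution for each group is proportional to the same per-group sum of inputs that the zero point already multiplies in the base path, which is what allows the adapter to be folded back into the quantized weights after training.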
I don't know how other people feel about this, but I would like it if you gave a bit more context on links like this, just so I can decide whether I want to read the paper (or click the link) or not. I'm not really an expert, so this paragraph doesn't help me contextualize anything.
For example, in this case I remember skimming through "QLORA: Efficient Finetuning of Quantized LLMs" from May. But it needs to be dumbed down a bit so I can figure out what the new achievement is. Or link a news report that contextualizes it instead of just throwing the paper at me.
Feel free to ignore my comment if everyone else is an expert. I'm just saying this because it's kind of time-consuming to click the link, read the abstract and conclusion, and look up a few things just to understand what we're talking about. Once we're discussing several papers a week and I'm just a hobbyist, I'd like the link to come along with a summary and some context.
Thank you very much for your explanation. I can understand that one. This is exactly the important difference. In my words it'd be: they figured out a way to improve on the maths, making the calculations faster (by reducing the dimensionality of an important matrix multiplication).
But there is another important aspect to it: they keep the quantized property after the fine-tuning, which QLoRA doesn't. That makes it a bit more precise than doing another (lossy) quantization after the fact.
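To make that concrete: if the adapter only acts on the per-group mean of the input (as in the sketch above), its update can be absorbed into the group-wise zero points after training, so the merged model stays INT4 with no second quantization pass. A rough sketch, continuing the assumed shapes from the earlier snippet:

```python
def merge_adapter_into_zeros(scale, zero, A, B, group_size=32):
    # scale, zero: (out_dim, L) group-wise quantization parameters
    # A: (rank, L), B: (out_dim, rank), so (B @ A) has shape (out_dim, L).
    # The adapter adds (B @ A)[i, g] * (sum of x over group g) / group_size,
    # while the base path already contains -scale * zero * (sum of x over group g),
    # so the update can be folded in by shifting the zero points.
    return zero - (B @ A) / (group_size * scale)
```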
Your explanation got me on track to figure it out. Thanks. I wrote another longer reply to noneabove1182. I'm not going to repeat everything, but I think I'm satisfied now.
Sure, I can try to add a couple lines on top of the abstract just to give a super brief synopsis
In this case it would be something like:
This paper discusses a new technique in which we can create a LoRA for an already quantized model. This is different from QLoRA, which quantizes the full model on the fly to create a quantized LoRA. With this approach you can take your small model and work with it as is, saving a ton of resources and speeding up the process massively.
I'm sorry today's not my day... I still don't get it. Did you write that summary or is the paragraph/synopsis AI generated?