# QLoRA and Quantization: Running LLMs on Consumer Hardware
Quantization reduces model memory by representing weights with fewer bits, enabling deployment on consumer hardware. This guide covers INT8, INT4, GPTQ, AWQ, and practical BitsAndBytes integration for accessible LLM fine-tuning.
- NF4 Quantization — Optimal for normally distributed neural network weights
- GPTQ & AWQ — Post-training quantization with minimal quality loss
- Consumer GPU Training — Fine-tune 7B models on a single GPU with QLoRA
Democratizing AI means making it run on the hardware people already have.
**Definition (Quantization):** Quantization is the process of mapping continuous or high-precision values to a discrete set of lower-precision values. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats.
## Quantization Formats
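### FP16 (Half Precision)
<MathFormula
title="FP16 Range"
tex={`\\text{FP16} \\in [-65504,\\ 65504]`}
variables={[
{ symbol: "\\text{FP16}", description: "16-bit floating point (1 sign + 5 exponent + 10 mantissa bits)" }
]}
/>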
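### BF16 (Brain Float 16)
<MathFormula
title="BF16 Range"
tex={`\\text{BF16} \\in [-3.39 \\times 10^{38},\\ 3.39 \\times 10^{38}] \\quad \\text{(approx.)}`}
variables={[
{ symbol: "\\text{BF16}", description: "16-bit brain float (1 sign + 8 exponent + 7 mantissa bits)" }
]}
/>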
BF16 has the same dynamic range as FP32 but lower precision. It is preferred for training because it reduces overflow/underflow. FP16 has higher precision but narrower range, requiring loss scaling.
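The range-versus-precision trade-off can be checked with the standard library alone. The sketch below round-trips values through FP16 using the half-precision format of `struct` and simulates BF16 by truncating a float32 to its top 16 bits. The helper names `to_fp16` and `to_bf16` are illustrative, and real BF16 hardware typically rounds to nearest rather than truncating.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (raises OverflowError if out of range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision: FP16 keeps 10 mantissa bits, BF16 only 7.
print(to_fp16(1.001), to_bf16(1.001))

# Range: 70000 exceeds the FP16 maximum of 65504 but is representable (coarsely) in BF16.
try:
    to_fp16(70000.0)
except OverflowError:
    print("70000.0 overflows FP16")
print(to_bf16(70000.0))
```

The overflow case is the reason FP16 training usually needs loss scaling, while BF16 usually does not.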
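### INT8 Quantization
<MathFormula
title="INT8 Quantization"
tex={`x_q = \\text{round}\\left(\\frac{127}{\\alpha} \\cdot x\\right), \\quad \\alpha = \\max_i |x_i|`}
variables={[
{ symbol: "x", description: "Original FP16/BF16 value" },
{ symbol: "x_q", description: "Quantized INT8 value" },
{ symbol: "\\alpha", description: "Absolute maximum value for scaling" }
]}
/>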
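A minimal pure-Python sketch of absmax INT8 quantization follows, assuming the symmetric scheme in which the largest magnitude maps to 127. The function names are illustrative, not a library API.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127], using the largest magnitude as the scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = 127.0 / absmax
    return [round(v * scale) for v in values], absmax


def dequantize_absmax_int8(q_values, absmax):
    """Recover approximate floats from the INT8 codes and the stored absmax."""
    return [q * absmax / 127.0 for q in q_values]


weights = [1.0, -0.4, 0.25, 0.0]
q, absmax = quantize_absmax_int8(weights)
restored = dequantize_absmax_int8(q, absmax)
print(q)         # integer codes
print(restored)  # values after the round trip
```

Only the integer codes and one `absmax` per tensor (or per block) need to be stored. The round-trip error per value is at most half a quantization step, `absmax / 254`.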
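### INT4 Quantization
<MathFormula
title="INT4 Quantization"
tex={`x_q = \\text{round}\\left(\\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}} \\times 15\\right)`}
variables={[
{ symbol: "x", description: "Original value" },
{ symbol: "x_q", description: "Quantized INT4 value (0-15)" },
{ symbol: "x_{\\min}, x_{\\max}", description: "Range bounds for quantization" }
]}
/>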
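The sketch below illustrates min-max INT4 quantization to codes 0-15 and shows why each parameter costs half a byte: two codes are packed into one byte. The helper names are hypothetical, and the packing order (first code in the high nibble) is an arbitrary choice for this example.

```python
def quantize_minmax_int4(values):
    """Affine quantization of floats to integer codes in [0, 15]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant inputs
    return [round((v - lo) / span * 15) for v in values], lo, hi


def dequantize_minmax_int4(codes, lo, hi):
    """Map codes back to floats on the 16-point grid between lo and hi."""
    return [lo + c / 15 * (hi - lo) for c in codes]


def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes (first code in the high nibble)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


weights = [-1.0, -0.2, 0.5, 1.0]
codes, lo, hi = quantize_minmax_int4(weights)
packed = pack_nibbles(codes)
print(codes, packed.hex(), len(packed), "bytes for", len(weights), "weights")
```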
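### NormalFloat 4-bit (NF4)
**Definition (NF4):** NF4 is a 4-bit data type, introduced with QLoRA, that is designed to be information-theoretically optimal for normally distributed data. It uses quantile-based quantization, in which each quantization bin has approximately equal probability mass under a standard normal distribution. The resulting levels are rescaled to the range [-1, 1].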
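<MathFormula
title="NF4 Quantization Levels"
tex={`q_i = \\Phi^{-1}\\left(\\frac{i + 1/2}{2^k}\\right), \\quad i = 0, \\dots, 2^k - 1`}
variables={[
{ symbol: "q_i", description: "Quantization level i" },
{ symbol: "\\Phi^{-1}", description: "Inverse normal CDF" },
{ symbol: "k", description: "Number of bits (4 for NF4)" }
]}
/>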
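<MathFormula
title="Quantization Error Bound"
tex={`\\|W - \\hat{W}\\|_F \\le \\epsilon`}
variables={[
{ symbol: "W", description: "Original weight matrix" },
{ symbol: "\\hat{W}", description: "Quantized weight matrix" },
{ symbol: "\\epsilon", description: "Quantization error bound" },
{ symbol: "\\|\\cdot\\|_F", description: "Frobenius norm" }
]}
/>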
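The quantile idea can be sketched with `statistics.NormalDist` from the standard library. This is a simplified construction (bin midpoints of equal-probability bins, rescaled to [-1, 1]) and is an assumption for illustration. The actual NF4 table shipped in bitsandbytes is built differently and includes an exact zero. The block also measures the Frobenius-norm error for one small block of weights.

```python
import math
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at the midpoints of 2**bits equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    inv_cdf = NormalDist().inv_cdf
    raw = [inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_block(weights, levels):
    """Normalize by the block absmax, then pick the nearest level for each weight."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda j: abs(levels[j] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look up each code's level and undo the absmax normalization."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels(4)
block = [0.02, -0.15, 0.4, -0.31, 0.07, 0.0, -0.05, 0.22]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
error = math.sqrt(sum((w - r) ** 2 for w, r in zip(block, restored)))
print(codes, error)
```

Because the levels are denser near zero, small weights (the most common ones under a normal distribution) are represented more finely than on a uniform INT4 grid.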
## Quantization Methods
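### GPTQ
GPTQ (Frantar et al., 2023) performs post-training quantization and builds on Optimal Brain Quantization (OBQ):
**Definition (GPTQ):** GPTQ quantizes each layer's weights column by column, minimizing the squared error between the layer's outputs under the original and the quantized weights on calibration data. After each column is quantized, it uses the inverse Hessian of this layer-wise error to update the remaining unquantized weights so that they compensate for the error just introduced.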
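<MathFormula
title="GPTQ Objective"
tex={`\\arg\\min_{\\hat{W}} \\|WX - \\hat{W}X\\|_2^2`}
variables={[
{ symbol: "W", description: "Original weight matrix" },
{ symbol: "\\hat{W}", description: "Quantized weight matrix" },
{ symbol: "X", description: "Input activations (calibration data)" }
]}
/>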
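To make the objective concrete, the toy below compares plain round-to-nearest against a two-weight version of the error-compensation idea: quantize the first weight, then shift the second to absorb the resulting output error before quantizing it. This is only an illustration under simplifying assumptions (one output row, two weights, an integer grid). It is not the full GPTQ algorithm, which works with the inverse Hessian over whole layers.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def output_error(w, w_hat, x_rows):
    """Squared error ||w X - w_hat X||^2 for one output row.

    x_rows[i] holds input feature i across the calibration samples.
    """
    n_samples = len(x_rows[0])
    diff = [
        sum((w[i] - w_hat[i]) * x_rows[i][s] for i in range(len(w)))
        for s in range(n_samples)
    ]
    return dot(diff, diff)


def quantize(value):
    """Toy quantizer: round to the nearest integer grid point."""
    return float(round(value))


w = [0.4, 0.4]
x_rows = [[1.0, 2.0, 3.0], [1.0, 2.0, 2.0]]  # made-up calibration inputs

# Baseline: round each weight independently.
w_rtn = [quantize(w[0]), quantize(w[1])]

# Compensated: quantize w[0], then shift w[1] by the least-squares correction
# that best cancels the output error caused by rounding w[0].
q0 = quantize(w[0])
shift = (w[0] - q0) * dot(x_rows[0], x_rows[1]) / dot(x_rows[1], x_rows[1])
w_comp = [q0, quantize(w[1] + shift)]

err_rtn = output_error(w, w_rtn, x_rows)
err_comp = output_error(w, w_comp, x_rows)
print(w_rtn, err_rtn)
print(w_comp, err_comp)
```

With correlated inputs, rounding errors on different weights can partly cancel in the output. Measuring the error on `WX` rather than on `W` alone is what lets the method exploit this.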
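### AWQ
AWQ (Lin et al., 2024) performs activation-aware weight quantization:
**Definition (AWQ):** AWQ identifies the small fraction of salient weight channels by looking at activation magnitudes rather than at the weights themselves. Instead of keeping those channels in higher precision, it scales them up before quantization and scales the matching activations down by the same factor. This reduces their relative quantization error while all weights stay in the same low-bit format.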
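<MathFormula
title="AWQ Scaling"
tex={`\\hat{w} = \\frac{Q(w \\cdot s)}{s}`}
variables={[
{ symbol: "w", description: "Original weight" },
{ symbol: "s", description: "Scale factor (larger for important channels)" },
{ symbol: "\\hat{w}", description: "Quantized weight (Q is the low-bit quantization function)" }
]}
/>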
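The effect of scaling a salient weight before quantization can be seen in a small group that shares one quantization step. The sketch below is a toy under stated assumptions: one weight per channel, a symmetric 4-bit grid with levels -7 to 7, and a hand-picked scale `s = 2.0`. AWQ itself searches for per-channel scales using calibration activations.

```python
def quantize_group(weights, levels=7):
    """Symmetric round-to-nearest with one shared absmax step; returns dequantized values."""
    step = max(abs(w) for w in weights) / levels
    return [round(w / step) * step for w in weights]


def quantize_with_channel_scale(weights, channel, s, levels=7):
    """Scale one weight by s before quantization and divide by s afterwards."""
    scaled = list(weights)
    scaled[channel] *= s
    restored = quantize_group(scaled, levels)
    restored[channel] /= s
    return restored


# Assume index 1 multiplies a large-magnitude activation, so its error matters most.
group = [0.9, 0.2, -0.5]
plain = quantize_group(group)
scaled = quantize_with_channel_scale(group, channel=1, s=2.0)

err_plain = abs(plain[1] - group[1])
err_scaled = abs(scaled[1] - group[1])
print(err_plain, err_scaled)
```

The scaled weight stays below the group maximum, so the shared step is unchanged. Dividing by `s` afterwards then shrinks the rounding error on that channel. If the scale were large enough to raise the group maximum, the other weights would lose resolution, which is why the scale has to be chosen with care.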
### GGML/GGUF
GGML and GGUF are model file formats from the llama.cpp ecosystem, designed for efficient local inference, primarily on CPU. GGUF is the successor to the GGML file format and is the format used by llama.cpp. It supports multiple quantization types (Q4_0, Q4_K_M, Q5_K_M, etc.), and the K-quant variants mix precision across tensors.
## BitsAndBytes Integration
The bitsandbytes library, exposed in Transformers through `BitsAndBytesConfig`, provides easy-to-use quantization for PyTorch models:
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model

# 4-bit quantization config (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

# 8-bit quantization config (LLM.int8()).
# The compute-dtype option exists only for 4-bit loading, so none is set here.
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Alternative to the 4-bit model above; on a small GPU, load only one of the two.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```
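### QLoRA Training Example
```python
from transformers import DataCollatorForLanguageModeling
from peft import prepare_model_for_kbit_training

# Tokenizer for batching; Llama has no pad token by default, so reuse EOS.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Prepare the 4-bit model for training (casts norm layers to FP32,
# enables gradient checkpointing and input gradients).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# LoRA adds 32 layers x 2 modules x 2 matrices x (16 x 4096) = 8,388,608
# trainable parameters here. The reported total parameter count depends on
# how the installed PEFT version counts packed 4-bit weights.

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    max_steps=1000,
    fp16=True,
    optim="paged_adamw_32bit",
    logging_steps=50,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,  # assumed: a tokenized dataset prepared elsewhere
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```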
## Memory Savings Calculation
<MathFormula
title="Memory Savings"
tex={`\\text{Savings} = \\left(1 - \\frac{\\text{bits}}{16}\\right) \\times 100\\%`}
variables={[
{ symbol: "bits", description: "Target quantization bits" }
]}
/>
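| Quantization | Bits per Param | Memory per Param | 7B Model Size | Savings vs FP16 |
|-------------|------|-------------------|---------------|-----------------|
| FP32 | 32 | 4 bytes | 28 GB | - |
| FP16 | 16 | 2 bytes | 14 GB | Baseline |
| INT8 | 8 | 1 byte | 7 GB | 50% |
| INT4/NF4 (weights only) | 4 | 0.5 bytes | 3.5 GB | 75% |
| NF4 + block constants | 4.5 | ~0.56 bytes | ~3.94 GB | ~72% |
| NF4 + double quantization | ~4.13 | ~0.52 bytes | ~3.61 GB | ~74% |

The last two rows assume a block size of 64 with one FP32 scaling constant per block, as in QLoRA. Double quantization reduces this overhead rather than adding to it.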
<MathNote type="tip">
Double quantization (QLoRA's `bnb_4bit_use_double_quant=True`) quantizes the quantization constants themselves, saving about 0.37 bits per parameter (roughly 0.05 GB per billion parameters) with negligible quality loss.
</MathNote>
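The arithmetic behind these figures can be reproduced in a few lines. The sketch below assumes the block layout described in the QLoRA paper: 64 weights per block with one FP32 constant, and, with double quantization, 8-bit constants plus one FP32 constant per 256 blocks. The function names are illustrative.

```python
def bits_per_param(weight_bits=4, block_size=64, constant_bits=32,
                   double_quant=False, dq_block_size=256, dq_constant_bits=8):
    """Effective storage bits per parameter, including per-block scaling constants."""
    if not double_quant:
        return weight_bits + constant_bits / block_size
    # First-level constants stored in dq_constant_bits each, plus one
    # constant_bits value shared by every dq_block_size first-level constants.
    return (weight_bits
            + dq_constant_bits / block_size
            + constant_bits / (block_size * dq_block_size))


def model_size_gb(n_params, bits):
    """Weight storage in GB (10^9 bytes) for n_params parameters."""
    return n_params * bits / 8 / 1e9


def savings_vs_fp16(bits):
    """Percentage of weight memory saved relative to 16-bit storage."""
    return (1 - bits / 16) * 100


for label, bits in [
    ("FP16", 16),
    ("INT8", 8),
    ("NF4, weights only", 4),
    ("NF4 + FP32 block constants", bits_per_param()),
    ("NF4 + double quantization", bits_per_param(double_quant=True)),
]:
    print(f"{label}: {bits:.3f} bits, "
          f"{model_size_gb(7e9, bits):.2f} GB, "
          f"{savings_vs_fp16(bits):.1f}% saved")
```

These numbers cover weight storage only. Activations, the KV cache, LoRA adapters, and optimizer states come on top.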
### Practical Example: Fine-tuning 7B on Consumer GPU
```python
# A 12 GB GPU such as the RTX 3060 can typically fine-tune Llama-2-7B with QLoRA
# at small batch sizes and moderate sequence lengths.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize and load
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Approximate memory: ~5 GB for the 4-bit model, LoRA adapters, and optimizer states.
# Activations add more, depending on batch size and sequence length.
```
## Practice Exercises
1. **Mathematical**: Calculate the VRAM required to load a 13B parameter model in NF4 with double quantization. Include the KV cache for sequence length 2048.
2. **Implementation**: Use GPTQ to quantize a 7B model to 4-bit and measure the perplexity degradation on WikiText-2.
3. **Analysis**: Compare the quality of NF4 vs INT4 quantization for models of different sizes (1B, 7B, 13B). At what model size does 4-bit quantization become nearly lossless?
4. **Research**: Investigate mixed-precision quantization strategies. Can you achieve better quality by using 8-bit for sensitive layers and 4-bit for others?
<MathSummary>
**Key Takeaways:**
- Quantization maps high-precision weights to lower-bit formats
- NF4 is optimal for normally distributed neural network weights
- GPTQ and AWQ are post-training quantization methods
- BitsAndBytes provides easy INT8/INT4 quantization for PyTorch
- QLoRA enables fine-tuning 7B models on a single consumer GPU
- INT4 quantization saves about 75% of weight memory relative to FP16, with minimal quality loss for models of 7B parameters and larger
</MathSummary>