
QLoRA and Quantization — Running LLMs on Consumer Hardware

Quantization reduces model memory by representing weights with fewer bits, enabling deployment on consumer hardware. This guide covers INT8, INT4, GPTQ, AWQ, and practical BitsAndBytes integration for accessible LLM fine-tuning.

NF4 Quantization — Optimal for normally distributed neural network weights
GPTQ & AWQ — Post-training quantization with minimal quality loss
Consumer GPU Training — Fine-tune 7B models on a single GPU with QLoRA

Democratizing AI means making it run on the hardware people already have.


This tutorial covers the theory and practice of quantization for LLMs, from low-bit number formats to QLoRA fine-tuning on consumer hardware.

Definition: Quantization

Quantization is the process of mapping continuous or high-precision values to a discrete set of lower-precision values. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats.

Quantization Formats

FP16 (Half Precision)

FP16 Range

$$\text{FP16}: \pm 65504, \text{ precision} \approx 10^{-3}$$

Here,

$FP16$ = 16-bit floating point (1 sign + 5 exponent + 10 mantissa bits)

BF16 (Brain Float 16)

BF16 Range

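$$\text{BF16}: \pm 3.4 \times 10^{38}, \text{ precision} \approx 10^{-2}$$

Here,

$BF16$ = 16-bit brain float (1 sign + 8 exponent + 7 mantissa bits)

BF16 has the same 8-bit exponent as FP32, so it covers essentially the same dynamic range, but with only 7 mantissa bits it is less precise than FP16. It is often preferred for training because the wider range makes overflow and underflow much less likely. FP16 is more precise but has a narrower range, so FP16 training usually needs loss scaling to keep small gradients from underflowing.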
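The range and precision figures above follow from the bit layouts. A minimal sketch in plain Python, assuming IEEE-style formats with an implicit leading mantissa bit; the helper name `float_format_stats` is hypothetical and not part of any library:

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (largest finite value, machine epsilon) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Machine epsilon: the gap between 1.0 and the next representable value.
    epsilon = 2.0 ** -mantissa_bits
    return max_value, epsilon


fp16_max, fp16_eps = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_eps = float_format_stats(exponent_bits=8, mantissa_bits=7)

print(f"FP16: max={fp16_max:.0f}, eps={fp16_eps:.2e}")
print(f"BF16: max={bf16_max:.2e}, eps={bf16_eps:.2e}")
```

The three extra exponent bits in BF16 buy roughly 34 orders of magnitude of range, at the cost of three mantissa bits of precision.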

INT8 Quantization

$$x_{\text{int8}} = \text{round}\left(\frac{x}{\text{amax}} \cdot 127\right)$$

Here,

$x$ = Original FP16/BF16 value
$x_{\text{int8}}$ = Quantized INT8 value
$amax$ = Absolute maximum value of the tensor (or block) being quantized, used for scaling
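The absmax formula can be sketched in a few lines of plain Python. This is an illustrative per-tensor version under simplifying assumptions (one scale for the whole list, no outlier handling); the function names are hypothetical:

```python
def quantize_absmax_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map [-amax, amax] onto [-127, 127]."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0] * len(values), 1.0
    scale = 127.0 / amax
    # Python's round() rounds ties to the nearest even integer.
    return [round(v * scale) for v in values], scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c / scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.1]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f}")
```

The value with the largest magnitude always maps to exactly ±127, and every other value is off by at most half a quantization step.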

INT4 Quantization

$$x_{\text{int4}} = \text{round}\left(\frac{x - \text{min}}{\text{max} - \text{min}} \cdot 15\right)$$

Here,

$x$ = Original value
$x_{\text{int4}}$ = Quantized INT4 value (0-15)
$min, max$ = Range bounds for quantization
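A matching sketch for the min-max (asymmetric) INT4 formula, again in plain Python with hypothetical function names. It assumes a single range for the whole list; practical schemes typically use one range per small block of weights:

```python
def quantize_minmax_int4(values: list[float]) -> tuple[list[int], float, float]:
    """Asymmetric min-max quantization onto the 16 codes 0..15."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values), lo, 1.0
    codes = [round((v - lo) / (hi - lo) * 15) for v in values]
    step = (hi - lo) / 15
    return codes, lo, step


def dequantize_int4(codes: list[int], lo: float, step: float) -> list[float]:
    """Reconstruct approximate values from codes, offset, and step size."""
    return [lo + c * step for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]
codes, lo, step = quantize_minmax_int4(weights)
restored = dequantize_int4(codes, lo, step)

print(codes)
print([f"{r:+.3f}" for r in restored])
```

With only 16 levels the step size is 17 times coarser than the 255-level INT8 grid over the same range, which is why the choice of levels matters much more at 4 bits.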

NormalFloat 4-bit (NF4)

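Definition: NF4

NF4 is a 4-bit data type introduced with QLoRA and described there as information-theoretically optimal for normally distributed data. It uses quantile-based quantization: the levels are placed so that each quantization bin has (approximately) equal probability mass under a standard normal distribution. The levels are normalized to the range [-1, 1], and each block of weights is rescaled by its absolute maximum before being mapped to the nearest level.

NF4 Quantization Levels

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{2^b}\right), \quad i = 0, 1, \ldots, 2^b - 1$$

Here,

$q_i$ = Quantization level $i$ (before normalization to [-1, 1])
$\Phi^{-1}$ = Inverse normal CDF
$b$ = Number of bits (4 for NF4)

This is the idealized equal-mass construction. The NF4 levels used in practice are built slightly asymmetrically so that zero is exactly representable.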
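The level formula can be evaluated directly with the standard library. The sketch below computes the idealized equal-mass levels and normalizes them to [-1, 1]; the function name `equal_mass_levels` is hypothetical. Note that this idealized grid contains no exact zero, whereas the NF4 table shipped in bitsandbytes is constructed so that zero is exactly representable, so its values will not match these:

```python
from statistics import NormalDist


def equal_mass_levels(bits: int = 4) -> list[float]:
    """Idealized quantile levels: bin midpoints in probability space."""
    n = 2 ** bits
    std_normal = NormalDist()  # mean 0, standard deviation 1
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in raw)
    # Normalize so the outermost levels sit at -1 and +1.
    return [q / largest for q in raw]


levels = equal_mass_levels(4)
for i, q in enumerate(levels):
    print(f"{i:2d}: {q:+.4f}")
```

The levels are densest near zero, where most normally distributed weights lie, and sparse in the tails.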

Quantization Error Bound

$$\|W - \hat{W}\|_F \leq \epsilon \cdot \|W\|_F$$

Here,

$W$ = Original weight matrix
$\hat{W}$ = Quantized weight matrix
$\epsilon$ = Relative quantization error bound
$\|\cdot\|_F$ = Frobenius norm
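To make the bound concrete, the relative error $\|W - \hat{W}\|_F / \|W\|_F$ can be measured for a small matrix. This is a sketch with made-up weights and a simple symmetric absmax quantizer (an assumption chosen for brevity, not a specific library's scheme); all function names are hypothetical:

```python
import math


def frobenius_norm(matrix: list[list[float]]) -> float:
    return math.sqrt(sum(v * v for row in matrix for v in row))


def relative_error(W: list[list[float]], W_hat: list[list[float]]) -> float:
    """Smallest epsilon for which the bound holds for this W and W_hat."""
    diff = [[a - b for a, b in zip(r, rh)] for r, rh in zip(W, W_hat)]
    return frobenius_norm(diff) / frobenius_norm(W)


def fake_quantize_absmax(W: list[list[float]], max_code: int) -> list[list[float]]:
    """Quantize to integer codes in [-max_code, max_code], then dequantize."""
    amax = max(abs(v) for row in W for v in row)
    return [[round(v / amax * max_code) * amax / max_code for v in row] for row in W]


W = [[0.12, -0.53, 0.88], [-0.27, 0.41, -0.95]]
eps_int8 = relative_error(W, fake_quantize_absmax(W, max_code=127))
eps_int4 = relative_error(W, fake_quantize_absmax(W, max_code=7))

print(f"INT8-style relative error: {eps_int8:.4f}")
print(f"INT4-style relative error: {eps_int4:.4f}")
```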

Quantization Methods

GPTQ

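GPTQ (Frantar et al., 2023) is a post-training quantization method that builds on Optimal Brain Quantization (OBQ):

Definition: GPTQ

GPTQ quantizes each layer's weight matrix column by column, minimizing the squared error between the layer's original and quantized outputs on a small calibration set. After each column is quantized, it uses the inverse Hessian of this layer-wise reconstruction error to update the remaining, not-yet-quantized weights so that they compensate for the error just introduced. Unlike OBQ, which greedily picks the next weight to quantize, GPTQ processes columns in a fixed order, which is what makes it practical for models with billions of parameters.

GPTQ Objective

$$\min_{\hat{W}} \|WX - \hat{W}X\|_2^2$$

Here,

$W$ = Original weight matrix
$\hat{W}$ = Quantized weight matrix
$X$ = Input activations (calibration data)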
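The objective measures error in the layer's output, not in the weights themselves, so the calibration activations decide which weight errors matter. The toy sketch below illustrates only that objective, not the GPTQ algorithm (there is no Hessian-based compensation here); the matrices and helper names are made up for illustration:

```python
def matmul(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def squared_error(A: list[list[float]], B: list[list[float]]) -> float:
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def quantize_column(W: list[list[float]], col: int, step: float) -> list[list[float]]:
    """Round one input column of W to a uniform grid; leave the rest untouched."""
    return [[round(w / step) * step if j == col else w for j, w in enumerate(row)]
            for row in W]


W = [[0.31, -0.62], [0.18, 0.44]]            # rows: outputs, columns: input features
X = [[1.0, 0.9, 1.1], [0.01, -0.02, 0.015]]  # rows: input features, columns: samples

W_a = quantize_column(W, col=1, step=0.25)  # rounds the weights on the quiet feature
W_b = quantize_column(W, col=0, step=0.25)  # rounds the weights on the active feature

Y = matmul(W, X)
weight_err_a, output_err_a = squared_error(W, W_a), squared_error(Y, matmul(W_a, X))
weight_err_b, output_err_b = squared_error(W, W_b), squared_error(Y, matmul(W_b, X))

print(f"A: weight error {weight_err_a:.4f}, output error {output_err_a:.6f}")
print(f"B: weight error {weight_err_b:.4f}, output error {output_err_b:.6f}")
```

Candidate A has the larger weight error but a far smaller output error, because its rounding falls on an input feature whose activations are tiny. This is why a method that minimizes $\|WX - \hat{W}X\|_2^2$ can beat plain round-to-nearest.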

AWQ

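AWQ (Lin et al., 2024) performs activation-aware weight quantization:

Definition: AWQ

AWQ identifies salient weight channels from activation magnitudes rather than from the weights themselves. Instead of keeping those channels in higher precision, which would require mixed-precision kernels, it multiplies them by a per-channel scale before quantization and divides the scale back out afterwards. This reduces their relative quantization error while all weights stay in the same low-bit format.

AWQ Scaling

$$\hat{w} = \text{quantize}(w \cdot s) / s$$

Here,

$w$ = Original weight
$s$ = Per-channel scale factor (larger for important channels)
$\hat{w}$ = Quantized weight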
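The effect of the scaling step can be checked numerically. The sketch below is a simplified illustration, not the AWQ implementation: it assumes a fixed 4-bit symmetric grid whose step is set by a group maximum of 1.0, assumes the scaled weights stay below that maximum so the step does not change, and uses a hand-picked scale of 2 rather than a searched one. All names are hypothetical:

```python
def quantize_to_grid(value: float, step: float, max_code: int = 7) -> float:
    """Round to the nearest grid point, clipping to the representable codes."""
    code = max(-max_code, min(max_code, round(value / step)))
    return code * step


group_absmax = 1.0       # assumed largest weight magnitude in the group
step = group_absmax / 7  # 4-bit symmetric grid: codes -7..7

# Weights of one "salient" channel, spread evenly over [-0.4, 0.4].
salient = [-0.4 + 0.8 * k / 200 for k in range(201)]


def mean_abs_error(weights: list[float], s: float) -> float:
    """Average |quantize(w * s) / s - w| over the given weights."""
    errors = [abs(quantize_to_grid(w * s, step) / s - w) for w in weights]
    return sum(errors) / len(errors)


error_plain = mean_abs_error(salient, s=1.0)
error_scaled = mean_abs_error(salient, s=2.0)

print(f"mean error without scaling: {error_plain:.4f}")
print(f"mean error with s = 2:      {error_scaled:.4f}")
```

Scaling by $s$ makes the grid look $s$ times finer to the scaled channel, so its average error drops by roughly that factor. In a real layer the division by $s$ is folded into the incoming activations, so the product of weights and activations is unchanged.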

GGML/GGUF

GGML and GGUF are model file formats that store quantized weights and are designed primarily for CPU inference:

GGUF is the successor to the GGML file format and is used by llama.cpp. It supports multiple quantization types (Q4_0, Q4_K_M, Q5_K_M, etc.) with mixed precision across layers.

BitsAndBytes Integration

BitsAndBytes provides easy-to-use quantization for PyTorch models:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model

# 4-bit quantization config (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the per-block scaling constants
)

# Load the model with 4-bit weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

# 8-bit quantization config.
# BitsAndBytesConfig has no 8-bit compute-dtype option; the
# bnb_4bit_compute_dtype setting applies to 4-bit loading only.
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Shown for comparison only: in practice load either the 4-bit or the
# 8-bit model, not both, so that only one copy occupies GPU memory.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```

QLoRA Training Example

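```python
from peft import prepare_model_for_kbit_training

# Prepare the quantized base model for training before adding adapters
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For Llama-2-7B with r=16 on q_proj and v_proj this reports 8,388,608
# trainable parameters (32 layers x 2 modules x 16 x (4096 + 4096)).
# The reported total and percentage depend on how the installed PEFT
# version counts packed 4-bit weights.

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    learning_rate=2e-4,
    warmup_steps=100,
    max_steps=1000,
    fp16=True,                      # matches bnb_4bit_compute_dtype=torch.float16
    optim="paged_adamw_32bit",
    logging_steps=50,
)

# tokenized_dataset is assumed to be a tokenized training set prepared
# elsewhere; each example must provide input_ids and labels.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```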
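As a cross-check on the count printed by `print_trainable_parameters()`, the number of LoRA parameters can be worked out by hand: each adapted linear layer with input size $d_{in}$ and output size $d_{out}$ gains two matrices of shapes $r \times d_{in}$ and $d_{out} \times r$. A small sketch, assuming Llama-2-7B's 32 decoder layers and 4096-wide `q_proj` and `v_proj`; the helper name is hypothetical:

```python
def lora_trainable_params(
    num_layers: int, rank: int, module_shapes: list[tuple[int, int]]
) -> int:
    """Count LoRA parameters: rank * (d_in + d_out) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return num_layers * per_layer


# (d_in, d_out) for q_proj and v_proj in each Llama-2-7B decoder layer
llama2_7b_qv = [(4096, 4096), (4096, 4096)]

count = lora_trainable_params(num_layers=32, rank=16, module_shapes=llama2_7b_qv)
print(f"{count:,}")  # 8,388,608
```

Halving the rank or adapting only one projection halves this number, which is a quick way to sanity-check a configuration before launching a run.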

## Memory Savings Calculation

<MathFormula
  title="Memory Savings"
  tex={`\\text{Savings} = \\left(1 - \\frac{\\text{bits}}{16}\\right) \\times 100\\%`}
  variables={[
    { symbol: "bits", description: "Target quantization bits" }
  ]}
/>

| Quantization | Bits per Param | Memory per Param | 7B Model Size | Savings vs FP16 |
|-------------|------|-------------------|---------------|-----------------|
| FP32 | 32 | 4 bytes | 28 GB | - |
| FP16 | 16 | 2 bytes | 14 GB | Baseline |
| INT8 | 8 | 1 byte | 7 GB | 50% |
| INT4/NF4 (weights only) | 4 | 0.5 bytes | 3.5 GB | 75% |
| NF4 + FP32 block constants (block size 64) | 4.5 | ~0.563 bytes | ~3.94 GB | ~72% |
| NF4 + double quantization | ~4.127 | ~0.516 bytes | ~3.61 GB | ~74% |

<MathNote type="tip">
Double quantization (QLoRA's bnb_4bit_use_double_quant=True) quantizes the quantization constants themselves, saving about 0.37 bits per parameter (roughly 0.047 GB per billion parameters, or about 0.33 GB for a 7B model) with negligible quality loss.
</MathNote>
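The per-parameter cost of the scaling constants can be worked out explicitly. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block, and 256 first-level constants per second-level block, with 8-bit quantized constants); the function names are hypothetical:

```python
def bits_per_param(
    weight_bits: int = 4,
    block_size: int = 64,
    constant_bits: int = 32,
    double_quant: bool = False,
    dq_block_size: int = 256,
    dq_constant_bits: int = 8,
) -> float:
    """Effective storage per weight, including per-block scaling constants."""
    if not double_quant:
        # One FP32 constant shared by every block of `block_size` weights.
        return weight_bits + constant_bits / block_size
    # First-level constants stored in 8 bits, plus one FP32 second-level
    # constant per `dq_block_size` first-level constants.
    return (
        weight_bits
        + dq_constant_bits / block_size
        + constant_bits / (block_size * dq_block_size)
    )


def model_size_gb(num_params: float, bits: float) -> float:
    return num_params * bits / 8 / 1e9


plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)

print(f"NF4:                {plain:.3f} bits/param, {model_size_gb(7e9, plain):.2f} GB for 7B")
print(f"NF4 + double quant: {double:.3f} bits/param, {model_size_gb(7e9, double):.2f} GB for 7B")
print(f"saved: {plain - double:.3f} bits/param")
```

These figures cover the quantized weights only; layers left unquantized, activations, and the KV cache add to the total.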

### Practical Example: Fine-tuning 7B on Consumer GPU

```python
# An RTX 3060 (12 GB VRAM) can fine-tune Llama-2-7B with QLoRA.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize to 4-bit NF4 while loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training, then add LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Rough memory budget: about 4 GB for the 4-bit weights, plus well under
# 1 GB for LoRA weights, gradients, and optimizer states. Activation memory
# grows with batch size and sequence length, so keep both modest to stay
# within 12 GB.
```
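A back-of-envelope estimate of the static training memory helps explain why this fits. The sketch below is deliberately rough and rests on stated assumptions: 7 billion weights stored at about 4.127 bits each, 8,388,608 LoRA parameters (r=16 on `q_proj` and `v_proj`), and 16 bytes per trainable parameter for an FP32 weight, an FP32 gradient, and two FP32 Adam states. It ignores activations, unquantized layers, and CUDA overhead. All names are hypothetical:

```python
GB = 1e9


def static_training_memory_gb(
    base_params: float,
    base_bits_per_param: float,
    trainable_params: int,
    bytes_per_trainable: int = 16,  # FP32 weight + gradient + two Adam states
) -> dict[str, float]:
    """Estimate memory that does not depend on batch size or sequence length."""
    base = base_params * base_bits_per_param / 8 / GB
    adapters = trainable_params * bytes_per_trainable / GB
    return {"base_weights": base, "adapters_and_optimizer": adapters,
            "total": base + adapters}


estimate = static_training_memory_gb(
    base_params=7e9,
    base_bits_per_param=4.127,
    trainable_params=8_388_608,
)

for name, gigabytes in estimate.items():
    print(f"{name:>24}: {gigabytes:.2f} GB")
```

Under these assumptions the static footprint comes to under 4 GB, leaving the rest of a 12 GB card for activations, which scale with batch size and sequence length.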

## Practice Exercises

1. **Mathematical**: Calculate the VRAM required to load a 13B parameter model in NF4 with double quantization. Include the KV cache for sequence length 2048.

2. **Implementation**: Use GPTQ to quantize a 7B model to 4-bit and measure the perplexity degradation on WikiText-2.

3. **Analysis**: Compare the quality of NF4 vs INT4 quantization for models of different sizes (1B, 7B, 13B). At what model size does 4-bit quantization become lossless?

4. **Research**: Investigate mixed-precision quantization strategies. Can you achieve better quality by using 8-bit for sensitive layers and 4-bit for others?

<MathSummary>
**Key Takeaways:**
- Quantization maps high-precision weights to lower-bit formats
- NF4 is optimal for normally distributed neural network weights
- GPTQ and AWQ are post-training quantization methods
- BitsAndBytes provides easy INT8/INT4 quantization for PyTorch
- QLoRA enables fine-tuning 7B models on a single consumer GPU
- 4-bit quantization saves about 75% of weight memory relative to FP16, typically with only a small quality loss for models of 7B parameters and up
</MathSummary>

