Fine-Tuning Transformers
Fine-tuning adapts pre-trained models to specific tasks. Modern approaches range from full fine-tuning to parameter-efficient methods like LoRA and QLoRA.
Hugging Face Trainer
The Trainer API provides a complete training loop with logging, evaluation, and checkpointing.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
# Load dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    # Pad or truncate every review to BERT's 512-token limit
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Use small random subsets to keep the example fast
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class per example
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}
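# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    # 1,000 examples at batch size 16 for 3 epochs is about 189 steps on one
    # device, so 500 warmup steps would never finish; use roughly 10% instead
    warmup_steps=20,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
)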
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)
# Train
trainer.train()
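When sizing warmup and logging intervals, it helps to know how many optimizer steps a run will take. The sketch below estimates that for the configuration above (1,000 training examples, batch size 16, 3 epochs). `training_steps` is a hypothetical helper, not part of the Trainer API, and it assumes a single device by default; multi-GPU or gradient-accumulation setups may round slightly differently.

```python
import math

def training_steps(num_examples, per_device_batch_size, num_epochs,
                   num_devices=1, grad_accum_steps=1):
    """Estimate (steps_per_epoch, total_steps) for a simple training run."""
    # Examples consumed per optimizer step across devices and accumulation
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    # The last, partially filled batch still counts as a step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = training_steps(1000, 16, 3)
print(steps_per_epoch, total_steps)  # 63 189
```

With only 189 steps in total, warmup lengths and logging intervals should be chosen relative to that number rather than copied from much longer runs.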
LoRA (Low-Rank Adaptation)
LoRA freezes pre-trained weights and injects trainable low-rank decomposition matrices into each transformer layer.
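LoRA Weight Update
$$W' = W + \Delta W, \qquad \Delta W = BA$$
Where:
- W: Original frozen weight (d × d)
- B: Low-rank matrix (d × r)
- A: Low-rank matrix (r × d)
- r: Rank (typically 4-64, much smaller than d)
LoRA Forward Pass
$$h = Wx + \frac{\alpha}{r} \, \Delta W x = Wx + \frac{\alpha}{r} \, BAx$$
Where α is a scaling hyperparameter.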
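A small numeric sketch can make the forward pass concrete. It assumes the usual LoRA formulation h = Wx + (α/r)·B·A·x, with B initialized to zero so that the adapted layer starts out identical to the frozen one. The helpers `matvec` and `lora_forward` are illustrative names written in plain Python, not part of any library.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x)."""
    r = len(A)                        # A has shape r x d
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then up to d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

h = lora_forward(W, A, B, x, alpha)
print(h == matvec(W, x))  # True: with B = 0 the layer behaves like the frozen W
```

Only A and B would receive gradients during training; W stays untouched, which is why the adapter can later be merged into W or removed.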
from peft import LoraConfig, get_peft_model, TaskType
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,  # Rank
    lora_alpha=32,  # Scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # Apply to attention layers
    bias="none",
)
# Create PEFT model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
peft_model = get_peft_model(model, lora_config)
# Print trainable parameters
peft_model.print_trainable_parameters()
# Reports trainable vs. total parameter counts; with this configuration
# well under 1% of the model is trainable (exact numbers vary by PEFT version)
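LoRA typically cuts the number of trainable parameters by two to three orders of magnitude (the exact factor depends on the rank and on which modules are adapted) while often matching the quality of full fine-tuning. The low-rank constraint rests on the assumption that the weight updates needed for adaptation have a low intrinsic rank.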
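The savings can be checked with simple arithmetic. The sketch below counts only the LoRA matrices added by the r=8 query/value configuration above, assuming BERT-base dimensions (12 layers, hidden size 768) and a rounded total of about 110M parameters. `lora_params` is a hypothetical helper; the classification head, which is also trained, is not included.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

hidden, layers, r = 768, 12, 8
modules_per_layer = 2  # query and value projections

per_module = lora_params(hidden, hidden, r)               # 12,288
lora_total = per_module * modules_per_layer * layers      # LoRA parameters added
adapted_full = hidden * hidden * modules_per_layer * layers  # weights they stand in for

print(lora_total)                 # 294912
print(adapted_full / lora_total)  # 48.0, i.e. d / (2r)
print(f"{100 * lora_total / 110e6:.2f}% of ~110M parameters")  # 0.27% of ~110M parameters
```

For a square d × d matrix the reduction is d / (2r), so larger models with wider layers benefit more from the same rank.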
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)
# Load quantized model (gated checkpoint: requires access approval on the Hugging Face Hub)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for training (PEFT helper for k-bit models)
model = prepare_model_for_kbit_training(model)
# LoRA config for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
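To give a feel for what 4-bit storage means, here is a simplified block-wise absmax quantizer in plain Python. It is only a sketch under stated assumptions: it uses 15 evenly spaced signed levels, whereas the `nf4` type configured above uses levels matched to a normal distribution, and the real kernels live in bitsandbytes. `quantize_block` and `dequantize_block` are illustrative names.

```python
def quantize_block(values, num_bits=4):
    """Absmax-quantize one block of floats to small signed integer codes."""
    levels = 2 ** (num_bits - 1) - 1            # 7 for 4 bits: codes in [-7, 7]
    scale = max(abs(v) for v in values) or 1.0  # one FP scale per block
    codes = [round(v / scale * levels) for v in values]
    return codes, scale

def dequantize_block(codes, scale, num_bits=4):
    """Map integer codes back to approximate float values."""
    levels = 2 ** (num_bits - 1) - 1
    return [c / levels * scale for c in codes]

weights = [0.12, -0.40, 0.05, 0.33, -0.07, 0.21, -0.29, 0.02]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)              # integer codes, each storable in 4 bits
print(round(max_err, 4))  # worst-case error, at most half a quantization step
```

Each block stores its codes plus one scale; double quantization, as enabled by `bnb_4bit_use_double_quant`, goes one step further and compresses those per-block scales as well.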
Memory Comparison
| Method | GPU memory (7B model, approx.) | Trainable params | Relative speed |
|---|---|---|---|
| Full fine-tuning | ~40GB | 100% | 1x |
| LoRA (r=8) | ~18GB | ~0.1% | ~1.2x |
| QLoRA (r=16) | ~6GB | ~0.1% | ~0.8x |
| Adapter | ~12GB | ~1% | ~1.1x |
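Much of the gap between these methods comes from how the base weights are stored. The back-of-the-envelope sketch below computes the weight storage alone for a 7B-parameter model at different precisions; `weight_memory_gb` is a hypothetical helper, and 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_7b = 7_000_000_000
fp32_gb = weight_memory_gb(params_7b, 32)
fp16_gb = weight_memory_gb(params_7b, 16)
int4_gb = weight_memory_gb(params_7b, 4)

print(f"fp32: {fp32_gb:.1f} GB")       # fp32: 28.0 GB
print(f"fp16/bf16: {fp16_gb:.1f} GB")  # fp16/bf16: 14.0 GB
print(f"4-bit: {int4_gb:.1f} GB")      # 4-bit: 3.5 GB
```

Gradients, optimizer state, activations and quantization constants all come on top of these figures, so actual usage depends heavily on batch size, sequence length and optimizer choice.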
Comparison of Fine-Tuning Methods
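# Method 1: Full fine-tuning (all parameters are trainable)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Method 2: Freeze the base model and train only the classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
for param in model.base_model.parameters():
    param.requires_grad = False
# Method 3: LoRA (start again from a freshly loaded model)
from peft import get_peft_model, LoraConfig
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
lora_config = LoraConfig(
    task_type="SEQ_CLS", r=8, lora_alpha=16, target_modules=["query", "value"]
)
model = get_peft_model(model, lora_config)
# Method 4: Adapter layers
# PEFT has no AdapterConfig class; one option for bottleneck adapters is the
# separate `adapters` library (AdapterHub), used here as an assumption
import adapters
from adapters import SeqBnConfig
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
adapters.init(model)  # add adapter support to the loaded model
adapter_config = SeqBnConfig(reduction_factor=3)  # bottleneck size 768 / 3 = 256
model.add_adapter("task_adapter", config=adapter_config)
model.train_adapter("task_adapter")  # freeze pre-trained weights, train the adapter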
Training Best Practices
| Practice | Recommendation | Impact |
|---|---|---|
| Learning rate | 2e-5 to 5e-5 (full), 1e-4 to 3e-4 (LoRA) | Stability |
| Batch size | 16-32 effective | Generalization |
| Warmup | 6-10% of steps | Convergence |
| Weight decay | 0.01-0.1 | Regularization |
| Epochs | 2-5 | Overfitting |
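The warmup recommendation is easier to reason about with the schedule written out. The sketch below implements linear warmup followed by linear decay, the shape commonly used for transformer fine-tuning. `linear_warmup_decay` is an illustrative function, and the step counts and peak rate are assumed values picked from the ranges in the table.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given step: ramp up linearly, then decay linearly to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000
warmup_steps = round(0.06 * total_steps)  # 6% warmup -> 60 steps
peak_lr = 2e-5                            # typical for full fine-tuning

for step in (0, 30, 60, 530, 1000):
    lr = linear_warmup_decay(step, total_steps, warmup_steps, peak_lr)
    print(step, f"{lr:.2e}")
```

The rate starts at zero, reaches the peak exactly when warmup ends, and returns to zero on the final step, so early updates cannot push the pre-trained weights far.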
Mixed Precision Training
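import torch
# Assumes `model`, `optimizer`, and `dataloader` are already defined and on the GPU
# Newer PyTorch releases prefer torch.amp.GradScaler("cuda") for the scaler
scaler = torch.cuda.amp.GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in FP16 where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(batch["input_ids"], labels=batch["labels"])
        loss = outputs.loss
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)  # unscale gradients; skip the step if they contain inf or NaN
    scaler.update()  # adjust the scale factor for the next iteration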
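Mixed precision training runs the forward and backward passes in FP16 while keeping the weights in FP32 for the updates. This can cut activation memory by up to ~50% and speeds up training on GPUs with dedicated FP16 hardware, while loss scaling keeps small gradients from underflowing and preserves training stability.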
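The role of loss scaling can be seen without a GPU. The sketch below uses the standard library's half-precision support in `struct` to round values to FP16: a tiny gradient underflows to zero, but survives if it is multiplied by a scale factor first and divided again in higher precision. The gradient value is an assumed example; 65536 is the default initial scale of PyTorch's `GradScaler`.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8      # a small gradient, below FP16's smallest positive value
scale = 65536.0  # 2**16

without_scaling = to_fp16(grad)             # stored directly in FP16
with_scaling = to_fp16(grad * scale) / scale  # scaled in FP16, unscaled afterwards

print(without_scaling)  # 0.0: the gradient is lost
print(with_scaling)     # close to 1e-8: the gradient is preserved
```

This is the same idea `scaler.scale(loss)` applies to the whole backward pass: multiplying the loss scales every gradient, and the optimizer step divides the factor back out.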