Fine-Tuning Transformers

Fine-tuning adapts pre-trained models to specific tasks. Modern approaches range from full fine-tuning to parameter-efficient methods like LoRA and QLoRA.

Hugging Face Trainer

The Trainer API provides a complete training loop with logging, evaluation, and checkpointing.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np

# Load dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_ratio=0.1,  # 10% warmup; a fixed 500 steps would exceed this run's ~189 total steps
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()
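
Before launching a run, it can help to check the metric function on a few hand-made predictions. The sketch below repeats the accuracy computation from the script above and feeds it made-up logits and labels. The arrays are illustrative values only, and a plain tuple stands in for the object the Trainer passes to compute_metrics.

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into (logits, labels), as in the training script
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for four examples and two classes (illustrative values only)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((logits, labels))
print(metrics)  # three of the four predictions match the labels
```

Casting the result to a Python float is a small design choice that keeps the logged metrics JSON-serializable.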

LoRA (Low-Rank Adaptation)

LoRA freezes pre-trained weights and injects trainable low-rank decomposition matrices into each transformer layer.

LoRA Weight Update

$$W' = W + \Delta W = W + BA$$

Where:

W: Original frozen weight (d × d)
B: Low-rank matrix (d × r)
A: Low-rank matrix (r × d)
r: Rank (typically 4-64, much smaller than d)

LoRA Forward Pass

$$h = Wx + \frac{\alpha}{r} BAx$$

Where α is a scaling hyperparameter.
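
The update W + BA and the scaled forward pass h = Wx + (α/r)BAx can be sketched in a few lines of NumPy. This is a minimal illustration, not PEFT's implementation: the sizes d, r, and alpha are arbitrary, and the initialization (B at zero, A random) follows the convention described in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8          # illustrative sizes, not values from a real model

W = rng.normal(size=(d, d))     # frozen pre-trained weight
B = np.zeros((d, r))            # B starts at zero, so the update BA starts at zero
A = rng.normal(size=(r, d))     # A starts with random values
x = rng.normal(size=(d,))

def lora_forward(x, W, A, B, alpha, r):
    # h = Wx + (alpha / r) * B(Ax); the d x d product BA is never formed
    return W @ x + (alpha / r) * (B @ (A @ x))

h0 = lora_forward(x, W, A, B, alpha, r)
assert np.allclose(h0, W @ x)   # with B = 0 the adapted layer matches the frozen one

B = rng.normal(size=(d, r)) * 0.01   # pretend training has moved B away from zero
h1 = lora_forward(x, W, A, B, alpha, r)
W_merged = W + (alpha / r) * (B @ A)  # same result as folding the update into W
assert np.allclose(h1, W_merged @ x)

full_params, lora_params = d * d, 2 * d * r
print(full_params, lora_params)
```

For this toy layer the full matrix has d² = 4096 entries, while the two low-rank factors together have 2dr = 512.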

from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                          # Rank
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # Apply to attention layers
    bias="none",
)

# Create PEFT model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
peft_model = get_peft_model(model, lora_config)

# Print trainable parameters
peft_model.print_trainable_parameters()
# trainable params: 667,906 || all params: 109,485,314 || trainable%: 0.61

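LoRA cuts the number of trainable parameters by roughly two to three orders of magnitude, depending on the rank r and on which modules are adapted, while often reaching performance comparable to full fine-tuning. The low-rank constraint rests on the assumption that the weight updates learned during fine-tuning have a low intrinsic rank, so a product of two thin matrices can represent them well.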
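
To make the low-rank assumption concrete, the sketch below builds a synthetic update matrix that is exactly rank 4 and shows that a rank-4 factorization recovers it, while a rank-2 factorization does not. The matrix is random and purely illustrative; it says nothing about the rank of updates in a real fine-tuning run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4   # illustrative sizes

# Build a synthetic "weight update" that is exactly rank 4
delta_W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))

U, S, Vt = np.linalg.svd(delta_W)
numerical_rank = int((S > 1e-8 * S[0]).sum())

def rank_r_approx(r):
    # Keep the r largest singular directions: B = U_r * S_r (d x r), A = Vt_r (r x d)
    B = U[:, :r] * S[:r]
    A = Vt[:r, :]
    return B @ A

norm = np.linalg.norm(delta_W)
err_r2 = np.linalg.norm(delta_W - rank_r_approx(2)) / norm
err_r4 = np.linalg.norm(delta_W - rank_r_approx(4)) / norm
print(numerical_rank, err_r2, err_r4)
```

If real updates behave similarly, a small r loses little; if they do not, increasing r trades parameters for fidelity.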

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

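# Prepare the quantized model for training before attaching the LoRA adapters
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()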
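
The NF4 data type used above spaces its 16 levels according to a normal distribution. The sketch below swaps in a simpler uniform absmax scheme to show the general idea of block-wise 4-bit quantization. It is not the bitsandbytes implementation; the function names, block_size=64, and the int8 storage of the 4-bit codes are assumptions made for illustration.

```python
import numpy as np

LEVELS = 7  # signed 4-bit codes in [-7, 7]; uniform spacing, unlike NF4

def quantize_blockwise(w, block_size=64):
    # Split the flat weights into blocks and store one absmax scale per block
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                      # avoid division by zero
    # Codes are held in int8 here; a real kernel would pack two codes per byte
    codes = np.round(blocks / scales * LEVELS).astype(np.int8)
    return codes, scales

def dequantize_blockwise(codes, scales, shape):
    return (codes.astype(np.float32) / LEVELS * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 64)).astype(np.float32)

codes, scales = quantize_blockwise(w.ravel())
w_hat = dequantize_blockwise(codes, scales, w.shape)

max_err = np.abs(w - w_hat).max()
print(codes.min(), codes.max(), max_err)
```

Keeping one scale per small block limits the damage an outlier weight can do: it only coarsens the grid for its own block.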

Memory Comparison

Method	GPU Memory (7B model)	Trainable Params	Relative Speed
Full fine-tuning	~40GB	100%	1x
LoRA (r=8)	~18GB	~0.1%	~1.2x
QLoRA (r=16)	~6GB	~0.1%	~0.8x
Adapter	~12GB	~1%	~1.1x
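
A back-of-the-envelope calculation shows where much of the difference comes from. The sketch below counts the memory for the weights alone at different precisions for a 7B-parameter model. It deliberately leaves out gradients, optimizer states, activations, and quantization constants, so the totals in the table are higher than these figures.

```python
def weight_memory_gb(num_params, bits_per_param):
    # Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # a 7B-parameter model, as in the table

fp32 = weight_memory_gb(num_params, 32)
bf16 = weight_memory_gb(num_params, 16)
int4 = weight_memory_gb(num_params, 4)
print(fp32, bf16, int4)  # 28.0 14.0 3.5
```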

Comparison of Fine-Tuning Methods

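# Method 1: Full fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# All parameters are trainable

# Method 2: Freeze base, train head only
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
for param in model.base_model.parameters():
    param.requires_grad = False

# Method 3: LoRA (reload the model so each method starts from a fresh copy)
from peft import get_peft_model, LoraConfig
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
lora_config = LoraConfig(
    task_type="SEQ_CLS",  # keeps the classification head trainable
    r=8, lora_alpha=16, target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)

# Method 4: Bottleneck adapter layers
# PEFT has no AdapterConfig class; this sketch assumes the separate
# `adapters` library (AdapterHub) is installed instead.
import adapters
from adapters import SeqBnConfig
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
adapters.init(model)  # add adapter support to the Hugging Face model
# reduction_factor=3 gives a 768 / 3 = 256-dimensional bottleneck;
# "task_adapter" is an arbitrary name chosen for this example
model.add_adapter("task_adapter", config=SeqBnConfig(reduction_factor=3))
model.train_adapter("task_adapter")  # freeze the base weights, train the adapter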

Training Best Practices

Practice	Recommendation	Impact
Learning rate	2e-5 to 5e-5 (full), 1e-4 to 3e-4 (LoRA)	Stability
Batch size	16-32 effective	Generalization
Warmup	6-10% of steps	Convergence
Weight decay	0.01-0.1	Regularization
Epochs	2-5	Overfitting
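
The batch-size and warmup rows interact: the effective batch size determines how many optimizer steps a run takes, and the warmup length is a fraction of that count. The helper below is a rough sketch of the arithmetic; training_schedule is a hypothetical function written for this example, and the Trainer's own step count can differ slightly, for instance when gradient accumulation does not divide the number of batches evenly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum_steps,
                      num_devices, epochs, warmup_ratio):
    # Effective batch size = examples consumed per optimizer step
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# Illustrative numbers: 1000 examples, batch 16 on one GPU, 3 epochs, 10% warmup
effective_batch, total_steps, warmup_steps = training_schedule(
    num_examples=1000, per_device_batch=16, grad_accum_steps=1,
    num_devices=1, epochs=3, warmup_ratio=0.1,
)
print(effective_batch, total_steps, warmup_steps)  # 16 189 19
```

On small datasets the total step count is short, so a warmup given as a ratio is usually safer than a fixed number of steps.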

[Figure: LoRA forward pass computation]

Mixed Precision Training

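import torch

# Assumes `model`, `optimizer`, and `dataloader` are already defined,
# and that the model and each batch live on a CUDA device.
scaler = torch.amp.GradScaler("cuda")  # torch.cuda.amp.GradScaler() on older PyTorch

for batch in dataloader:
    optimizer.zero_grad()

    # Run the forward pass in FP16 where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        loss = outputs.loss

    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)         # unscale gradients; skip the step on inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration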

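Mixed precision training runs the forward and backward passes in FP16 while keeping the weights and the optimizer update in FP32. Because the FP32 weights are retained, the savings come mainly from activations, which can shrink by up to roughly half. Loss scaling, handled by the GradScaler, keeps small gradients from underflowing in FP16 and so preserves training stability.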
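
The role of loss scaling can be seen with a single number. The sketch below casts a small gradient value to FP16 with and without scaling; the gradient value is chosen for illustration, and the scale of 2**16 is an assumed starting value.

```python
import numpy as np

grad = np.float32(1e-8)          # a small gradient value, chosen for illustration
scale = np.float32(65536.0)      # loss scale of 2**16, an assumed starting value

# Without scaling, the value is below FP16's smallest subnormal and becomes 0
unscaled_fp16 = np.float16(grad)

# With scaling, the value survives the cast to FP16 ...
scaled_fp16 = np.float16(grad * scale)
# ... and is divided by the scale in FP32 before the weight update
recovered = np.float32(scaled_fp16) / scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The unscaled value is lost entirely, while the scaled one comes back within FP16 rounding error of the original.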
