🎯 The Interview Question
"Explain the different techniques for model compression: quantization, pruning, and knowledge distillation. What is the difference between post-training quantization and quantization-aware training? How does structured vs unstructured pruning differ? What is the mathematical formulation of knowledge distillation?"
This topic comes up often in interviews for roles that deploy models on edge devices, for example at NVIDIA (Jetson) and Apple (iOS).
📚 Detailed Answer
Why Model Compression?
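Large models are often impractical to deploy as-is. The figures below are approximate, and inference time in particular depends on hardware, batch size, and input length: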
| Model | Parameters | Size (FP32) | Inference Time |
|---|---|---|---|
| BERT-Large | 340M | 1.3 GB | 50 ms |
| GPT-2 | 1.5B | 6 GB | 200 ms |
| Stable Diffusion | 1B | 4 GB | 500 ms |
Compression enables:
- Edge deployment (mobile, IoT)
- Lower latency
- Reduced energy consumption
- Cost savings
Quantization
Quantization reduces the numerical precision of weights and activations.
Types of Quantization
| Type | Precision | Size Reduction | Accuracy Loss |
|---|---|---|---|
| FP32 → FP16 | 16-bit | 2× | Minimal |
| FP32 → INT8 | 8-bit | 4× | Small (1-2%) |
| FP32 → INT4 | 4-bit | 8× | Moderate (2-5%) |
| FP32 → INT2 | 2-bit | 16× | Significant |
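The size-reduction column follows directly from bits per parameter. The sketch below works through that arithmetic for the BERT-Large parameter count quoted earlier; the helper name `model_size_gb` is made up for illustration, and the estimate ignores per-tensor overhead such as stored scales and zero-points.

```python
def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter count for BERT-Large as quoted in the table earlier in this article.
BERT_LARGE_PARAMS = 340_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    size = model_size_gb(BERT_LARGE_PARAMS, bits)
    ratio = 32 / bits  # reduction relative to the FP32 baseline
    print(f"{name}: {size:.3f} GB ({ratio:.0f}x smaller than FP32)")
```

In practice the real file is slightly larger than this estimate, because quantized formats also store scale factors and sometimes keep sensitive layers in higher precision.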
Post-Training Quantization (PTQ)
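Quantize after training by mapping each real value $x$ to an integer $q$, and back to an approximation $\hat{x}$:
$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right), \qquad \hat{x} = s\,(q - z)$$
where $z$ is the zero-point, $s$ is the scale factor, and $[q_{\min}, q_{\max}]$ is the integer range (for example $[0, 255]$ for unsigned 8-bit).
Calibration: Compute $s$ and $z$ from representative data:
- Min-Max: $s = \dfrac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}$, $\quad z = \mathrm{round}\left(q_{\min} - \dfrac{x_{\min}}{s}\right)$
- KL divergence: Choose the clipping range that minimizes the divergence between the original and quantized distributions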
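As a rough illustration of min-max calibration, the NumPy sketch below derives a scale and zero-point from sample data, then quantizes and dequantizes it. It assumes asymmetric unsigned quantization with a single scale per tensor; the function names are made up for this example, and KL-divergence calibration is not shown.

```python
import numpy as np

def minmax_calibrate(x: np.ndarray, num_bits: int = 8):
    """Derive (scale, zero_point) from the observed min and max of x."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0 so that real zero maps to an exact integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: every calibration value is zero
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int, num_bits: int = 8) -> np.ndarray:
    """Map real values to integers in [0, 2^num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map integers back to approximate real values."""
    return scale * (q.astype(np.float64) - zero_point)

# Synthetic stand-in for "representative data".
rng = np.random.default_rng(0)
calibration_data = rng.normal(0.0, 1.0, size=10_000)

scale, zero_point = minmax_calibrate(calibration_data)
q = quantize(calibration_data, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
max_err = float(np.abs(calibration_data - x_hat).max())
print(f"scale={scale:.5f}, zero_point={zero_point}, max abs error={max_err:.5f}")
```

Min-max is sensitive to outliers: a single extreme value stretches the range and coarsens the grid for every other value, which is the motivation for clipping-based calibration such as the KL approach.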
Limitations:
- May lose accuracy on sensitive layers
- No adaptation to quantization noise
Quantization-Aware Training (QAT)
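Simulate quantization during training by inserting a quantize-then-dequantize ("fake quantization") step into the forward pass, with scale $s$ and zero-point $z$:
$$\hat{w} = s\left(\mathrm{clip}\left(\mathrm{round}\left(\frac{w}{s}\right) + z,\ q_{\min},\ q_{\max}\right) - z\right)$$
Straight-through estimator for gradients: rounding has zero gradient almost everywhere, so the backward pass treats it as the identity inside the clipping range:
$$\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat{w}}$$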
Advantages:
- Better accuracy than PTQ
- Model learns to handle quantization noise
- Can reach INT4 with small accuracy loss on many models
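To make the idea concrete, here is a small NumPy sketch of fake quantization with a straight-through estimator on a toy linear regression. It assumes symmetric signed quantization (zero-point fixed at 0), a fixed hand-picked scale, and manually derived gradients; the names `fake_quantize` and `ste_grad` are illustrative only, and real QAT would rely on a framework's autograd.

```python
import numpy as np

def fake_quantize(w: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Quantize then immediately dequantize (symmetric, signed, zero-point 0)."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q * scale

def ste_grad(w: np.ndarray, grad_out: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Straight-through estimator: pass the gradient unchanged inside the
    clipping range and zero it for weights that were clipped."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    inside = (w / scale >= qmin) & (w / scale <= qmax)
    return grad_out * inside

# Toy problem: fit y = X @ w_true using 4-bit fake-quantized weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
w_true = np.array([0.5, -0.25, 0.75, 0.1])
y = X @ w_true

w = np.zeros(4)          # latent full-precision weights kept by the optimizer
scale, lr = 0.125, 0.1   # hypothetical fixed scale and learning rate

for step in range(200):
    w_q = fake_quantize(w, scale, num_bits=4)           # forward pass sees quantized weights
    err = X @ w_q - y
    grad_wq = 2 * X.T @ err / len(X)                    # dL/dw_q for mean squared error
    w -= lr * ste_grad(w, grad_wq, scale, num_bits=4)   # STE: treat dw_q/dw as 1

final_loss = float(np.mean((X @ fake_quantize(w, scale, num_bits=4) - y) ** 2))
print("quantized weights:", fake_quantize(w, scale, num_bits=4))
print("final MSE:", final_loss)
```

The target weight 0.1 is not on the 4-bit grid (the nearest levels are 0 and 0.125), so the latent weight hovers near a rounding threshold. That residual error is the quantization noise the model has to learn to live with.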
💡
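For LLMs, GPTQ and AWQ are widely used post-training weight quantization methods. GPTQ builds on Optimal Brain Quantization, using approximate second-order information to quantize weights layer by layer; AWQ (activation-aware weight quantization) protects salient weight channels, identified from activation statistics, by rescaling them before quantization.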
Pruning
Pruning removes redundant weights or neurons.
Unstructured Pruning
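Remove individual weights by applying a binary mask $M$ elementwise:
$$W' = W \odot M, \qquad M_{ij} \in \{0, 1\}$$
Magnitude pruning: Remove the smallest weights by absolute value, i.e. $M_{ij} = 1$ if $|W_{ij}| > \tau$ and $0$ otherwise, with the threshold $\tau$ chosen to reach the target sparsity.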
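A minimal NumPy sketch of one-shot magnitude pruning, assuming a single global threshold over one weight matrix. The function name is made up for this example, and the fine-tuning step that usually follows pruning is omitted.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Zero out (approximately) the `sparsity` fraction of weights with the
    smallest absolute value. Returns (pruned_weights, boolean_keep_mask)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(sparsity * weights.size)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # Threshold is the k-th smallest absolute value; ties at the threshold are pruned too.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print("achieved sparsity:", 1.0 - mask.mean())
```

The pruned matrix has the same shape as the original and is still stored densely here, which is why unstructured sparsity alone does not reduce latency on dense hardware.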
Advantages:
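- High sparsity (90%+) is often possible on over-parameterized models
- Minimal accuracy loss at moderate sparsity, usually after fine-tuning
Disadvantages:
- No speedup without sparse kernels or hardware support for sparsity
- Irregular memory access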
Structured Pruning
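Remove entire filters/channels/attention heads. A common importance score for filter $j$ is the $\ell_1$ norm of its weights (gradient-based scores are an alternative):
$$I_j = \lVert W_j \rVert_1 = \sum_{c,h,w} \lvert W_{j,c,h,w} \rvert$$
Remove filters with the lowest importance scores.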
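The sketch below shows one way to score and drop convolution filters, assuming the L1 norm of each filter as the importance score and a weight layout of `(out_channels, in_channels, kH, kW)`. The scoring rule and the function name are assumptions chosen for illustration; other importance criteria are equally valid.

```python
import numpy as np

def prune_filters_l1(conv_weight: np.ndarray, keep_ratio: float):
    """Keep the filters with the largest L1 norm.

    conv_weight has shape (out_channels, in_channels, kH, kW).
    Returns (smaller_dense_weight, sorted_indices_of_kept_filters).
    """
    out_channels = conv_weight.shape[0]
    n_keep = max(1, int(round(keep_ratio * out_channels)))
    # Importance score per filter: sum of absolute weights.
    importance = np.abs(conv_weight).reshape(out_channels, -1).sum(axis=1)
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return conv_weight[keep], keep

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8, 3, 3))
W_small, kept = prune_filters_l1(W, keep_ratio=0.5)
print(W.shape, "->", W_small.shape)
print("kept filters:", kept)
```

Unlike the masked matrix in unstructured pruning, the result is a genuinely smaller dense tensor. In a full network, the next layer's input channels would also need to be sliced with the same `kept` indices.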
Advantages:
- Real speedup on existing hardware
- Regular memory access
Disadvantages:
- Lower sparsity achievable
- More accuracy loss per parameter removed
Knowledge Distillation
Knowledge distillation trains a small (student) model to mimic a large (teacher) model.
Mathematical Formulation
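$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{CE}}\big(y,\ \mathrm{softmax}(v_s)\big) + (1 - \alpha)\,T^2\,\mathrm{KL}\big(p_t \,\|\, p_s\big)$$
where $v_t$ and $v_s$ are the teacher and student logits, $y$ is the ground-truth label, and:
- $p_t = \mathrm{softmax}(v_t / T)$: teacher soft targets with temperature $T$
- $p_s = \mathrm{softmax}(v_s / T)$: student soft targets
- $\alpha$: balance between hard and soft losses (here $\alpha$ weights the hard loss; some references swap the roles)
- $T$: temperature (higher = softer distribution); the $T^2$ factor keeps the soft-loss gradients on a comparable scale as $T$ changes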
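A NumPy sketch of the standard distillation loss, under the assumption that `alpha` weights the hard cross-entropy term and `1 - alpha` weights the temperature-scaled KL term (conventions differ between references). The function names and the example logits are invented for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha: float = 0.5, temperature: float = 4.0) -> float:
    """alpha * CE(labels, student) + (1 - alpha) * T^2 * KL(teacher || student)."""
    n = student_logits.shape[0]
    eps = 1e-12  # guards against log(0)

    # Hard loss: cross-entropy against ground-truth labels at temperature 1.
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(n), labels] + eps).mean()

    # Soft loss: KL divergence between temperature-softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()

    return float(alpha * hard + (1.0 - alpha) * temperature ** 2 * soft)

# Made-up logits for a 3-class problem (say cat, dog, car) and two samples.
teacher = np.array([[6.0, 4.0, -2.0], [-1.0, 0.5, 5.0]])
student = np.array([[2.0, 1.0, 0.5], [0.0, 0.2, 1.5]])
labels = np.array([0, 2])

print("loss:", distillation_loss(student, teacher, labels))
print("teacher soft targets at T=4:", softmax(teacher, 4.0))
```

Printing the teacher's softened probabilities shows why temperature matters: at `T=4` the second-most-likely class receives a visible share of probability mass, which is the inter-class signal the student learns from.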
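Soft targets carry information that one-hot labels lack:
- Inter-class relationships (cat is more similar to dog than to car)
- Dark knowledge (the relative probabilities the teacher assigns to incorrect classes, which reflect its uncertainty)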
Combined Approaches
A common practice is to combine the techniques in sequence:
- Prune → Remove redundant parameters
- Distill → Recover accuracy lost to pruning
- Quantize → Reduce precision of the pruned model
Evaluation Metrics
| Metric | Description |
|---|---|
| Compression Ratio | Original size / Compressed size |
| Speedup | Original time / Compressed time |
| Accuracy Drop | Original acc - Compressed acc |
| FLOPs Reduction | Original FLOPs / Compressed FLOPs |
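The four metrics in the table are simple ratios and differences. A small helper such as the one below can compute them together; the function name and all input numbers are hypothetical and serve only to show the arithmetic.

```python
def compression_metrics(original: dict, compressed: dict) -> dict:
    """Compute the table's metrics. Both dicts need the keys:
    size_mb, latency_ms, accuracy, flops."""
    return {
        "compression_ratio": original["size_mb"] / compressed["size_mb"],
        "speedup": original["latency_ms"] / compressed["latency_ms"],
        "accuracy_drop": original["accuracy"] - compressed["accuracy"],
        "flops_reduction": original["flops"] / compressed["flops"],
    }

# Hypothetical measurements, for illustration only.
baseline = {"size_mb": 1300.0, "latency_ms": 50.0, "accuracy": 0.90, "flops": 100e9}
small = {"size_mb": 325.0, "latency_ms": 20.0, "accuracy": 0.89, "flops": 40e9}

report = compression_metrics(baseline, small)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Compression ratio and speedup rarely match: a 4× smaller model is not necessarily 4× faster, so both should be measured on the target hardware.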
Follow-Up Questions
Q: What is the difference between INT8 and FP8 quantization? A: INT8 uses integers on a uniformly spaced grid. FP8 is a floating-point format (E4M3 or E5M2) with non-uniform spacing: finer steps near zero and a wider dynamic range. E4M3 has more mantissa bits (more precision), while E5M2 has more exponent bits (more range). FP8 tends to cope better with the outlier-heavy activations common in transformers.
Q: How does structured pruning affect attention heads? A: Entire heads can be removed if they contribute little to the output. Importance scores can be based on attention weight norms or on gradient-based methods. Removing a head drops its slices of the query, key, value, and output projections, so the layer stays dense.
Q: When should you use knowledge distillation vs fine-tuning? A: Use distillation when you need a smaller model for deployment. Use fine-tuning when you need better performance on a specific task. The two are often combined: distill, then fine-tune.