Model Compression: Quantization, Pruning, Distillation — Asked at NVIDIA & Apple

🎯 The Interview Question

"Explain the different techniques for model compression: quantization, pruning, and knowledge distillation. What is the difference between post-training quantization and quantization-aware training? How does structured vs unstructured pruning differ? What is the mathematical formulation of knowledge distillation?"

This question comes up for roles that deploy models on edge devices, such as NVIDIA (Jetson) and Apple (iOS).

📚 Detailed Answer

Why Model Compression?

Large models are often impractical to deploy on memory- and latency-constrained hardware:

Model	Parameters	Size (FP32)	Inference Time (illustrative, hardware-dependent)
BERT-Large	340M	1.3 GB	50 ms
GPT-2	1.5B	6 GB	200 ms
Stable Diffusion	1B	4 GB	500 ms

Compression enables:

Edge deployment (mobile, IoT)
Lower latency
Reduced energy consumption
Cost savings

Quantization

Quantization reduces the numerical precision of weights and activations, for example from 32-bit floats to 8-bit integers.

Types of Quantization

Type	Precision	Size Reduction	Accuracy Loss
FP32 → FP16	16-bit	2×	Minimal
FP32 → INT8	8-bit	4×	Small (1-2%)
FP32 → INT4	4-bit	8×	Moderate (2-5%)
FP32 → INT2	2-bit	16×	Significant
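As a rough sanity check on the size-reduction column, weight storage scales linearly with bits per parameter. The sketch below is illustrative only: the helper names are made up for this example, it counts weights alone, and it ignores the scales, zero-points, and higher-precision layers that real quantized checkpoints also store.

```python
# Bits used to store one parameter at each precision (weights only).
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def weight_size_gb(num_params: int, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def size_table(num_params: int) -> dict:
    """Map each precision to (size in GB, reduction factor relative to FP32)."""
    fp32_gb = weight_size_gb(num_params, 32)
    table = {}
    for name, bits in BITS_PER_PARAM.items():
        gb = weight_size_gb(num_params, bits)
        table[name] = (gb, fp32_gb / gb)
    return table


if __name__ == "__main__":
    # 340M parameters, the BERT-Large figure quoted earlier in this article.
    for name, (gb, factor) in size_table(340_000_000).items():
        print(f"{name}: {gb:.2f} GB ({factor:.0f}x smaller than FP32)")
```

For a 340M-parameter model this gives about 1.36 GB at FP32 and 0.34 GB at INT8, matching the 4× figure in the table.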

Post-Training Quantization (PTQ)

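PTQ quantizes an already-trained model without any further training. With the common affine scheme:

$$x_q = \text{clamp}\left(\text{round}\left(\frac{x}{s}\right) + z,\ q_{min},\ q_{max}\right), \qquad \hat{x} = s\,(x_q - z)$$

where $s$ is the scale factor, $z$ is the integer zero-point (the quantized value that represents real 0), $[q_{min}, q_{max}]$ is the integer range (e.g. $[0, 2^b - 1]$ for $b$ bits), and $\hat{x}$ is the dequantized approximation of $x$.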

Calibration: Compute $s$ and $z$ from representative data:

Min-Max: $s = \frac{\max(x) - \min(x)}{2^b - 1}$
KL divergence: Choose the clipping range that minimizes the KL divergence between the original and quantized activation distributions
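Min-max calibration can be sketched in a few lines. This is a minimal example, not a production implementation: it assumes the common affine convention $x_q = \text{round}(x/s) + z$ with an unsigned $b$-bit range and an integer zero-point $z = \text{round}(-\min(x)/s)$, and the function names are hypothetical. Plain Python lists are used so the example runs without dependencies. In practice this would operate on tensors.

```python
def minmax_params(values, num_bits=8):
    """Min-max calibration: s = (max - min) / (2^b - 1), z = round(-min / s)."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant input: any positive scale works
    zero_point = round(-lo / scale)  # Python rounds halves to the nearest even int
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to integers in [0, 2^b - 1]."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]


if __name__ == "__main__":
    calibration = [-0.4, 0.0, 0.3, 1.2]  # stand-in for representative activations
    s, z = minmax_params(calibration)
    q = quantize(calibration, s, z)
    print(q, dequantize(q, s, z))
```

The round trip through `quantize` and `dequantize` shows the quantization error directly: for values inside the calibrated range it is bounded by about half a step, $s/2$.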

Limitations:

May lose accuracy on sensitive layers
No adaptation to quantization noise

Quantization-Aware Training (QAT)

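QAT simulates quantization in the forward pass during training (often called fake quantization):

$$\hat{x} = s \cdot \text{clamp}\left(\text{round}\left(\frac{x}{s}\right), q_{min}, q_{max}\right)$$

Because rounding has zero gradient almost everywhere, the backward pass uses the straight-through estimator (STE), which treats rounding as the identity:

$$\frac{\partial \mathcal{L}}{\partial x} \approx \frac{\partial \mathcal{L}}{\partial \hat{x}}$$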
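To make the forward and backward behavior concrete, here is a toy sketch that trains a single weight through a fake-quantization step. It makes several assumptions: a signed 4-bit range, a fixed scale, a squared-error loss, and an STE variant that zeroes the gradient where the value was clipped (one common choice, not the only one). All names are hypothetical.

```python
def fake_quant(x, scale, qmin=-8, qmax=7):
    """Quantize-dequantize: s * clamp(round(x / s), qmin, qmax)."""
    q = max(qmin, min(qmax, round(x / scale)))
    return scale * q


def ste_grad(x, upstream_grad, scale, qmin=-8, qmax=7):
    """Straight-through estimator: pass the gradient through round() unchanged.

    Assumed variant: the gradient is zeroed where x fell outside the clamp range.
    """
    inside = qmin <= x / scale <= qmax
    return upstream_grad if inside else 0.0


def train_weight(w, x, target, scale=0.25, lr=0.05, steps=200):
    """Fit one latent FP weight so that fake_quant(w) * x approximates target."""
    for _ in range(steps):
        w_hat = fake_quant(w, scale)               # forward pass sees the quantized weight
        err = w_hat * x - target                   # dL/dy for L = 0.5 * err**2
        grad_w_hat = err * x                       # dL/dw_hat
        w -= lr * ste_grad(w, grad_w_hat, scale)   # update the latent weight
    return w


if __name__ == "__main__":
    w = train_weight(0.0, x=1.0, target=0.6)
    print(w, fake_quant(w, 0.25))
```

With a target of 0.6 and a grid step of 0.25, no grid point matches exactly, so the latent weight settles near the rounding boundary between 0.5 and 0.75. This shows the optimizer adapting to the quantization grid rather than to the unquantized optimum.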

Advantages:

Better accuracy than PTQ
Model learns to handle quantization noise
Can reach low bit-widths such as INT4 with less accuracy loss than PTQ typically allows

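💡 For LLMs, GPTQ and AWQ are widely used post-training weight quantization methods. GPTQ builds on Optimal Brain Quantization, using approximate second-order (Hessian) information to quantize weights layer by layer while compensating for the error introduced. AWQ (Activation-aware Weight Quantization) uses activation statistics to find salient weight channels and protects them during quantization.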

Pruning

Pruning removes redundant weights or neurons from a trained network.

Unstructured Pruning

Unstructured pruning removes individual weights by applying a binary mask $m$ with threshold $\theta$:

$$m_{ij} = \begin{cases} 1 & \text{if } |w_{ij}| > \theta \\ 0 & \text{otherwise} \end{cases}$$

$$w_{pruned} = w \odot m$$

Magnitude pruning: Remove smallest weights by absolute value.
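Magnitude pruning can be sketched as follows. This is an illustrative example under simple assumptions: the threshold $\theta$ is chosen from a target sparsity over one weight matrix, and the function name is made up. Real pipelines usually prune gradually and fine-tune between steps.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a 2-D weight matrix.

    weights: list of rows (lists of floats); sparsity: fraction in [0, 1).
    Returns (pruned_weights, mask), where mask[i][j] = 1 if |w_ij| > theta.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(sparsity * len(magnitudes))  # number of weights to remove
    # theta is the k-th smallest magnitude; -1.0 keeps everything when k == 0.
    # Ties at theta are pruned too, so slightly more than k weights may go.
    theta = magnitudes[k - 1] if k > 0 else -1.0
    mask = [[1 if abs(w) > theta else 0 for w in row] for row in weights]
    pruned = [[w * m for w, m in zip(row, mask_row)]
              for row, mask_row in zip(weights, mask)]
    return pruned, mask


if __name__ == "__main__":
    W = [[0.9, -0.05, 0.3],
         [-0.01, 0.6, -0.2]]
    pruned, mask = magnitude_prune(W, sparsity=0.5)
    print(mask)
    print(pruned)
```

Note that the pruned matrix has the same shape as the original. The zeros save compute only if the runtime stores and multiplies the matrix in a sparse format.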

Advantages:

High sparsity (90%+) is often achievable
Accuracy loss is usually small, especially with fine-tuning after pruning

Disadvantages:

Little or no speedup without sparse kernels or hardware support
Irregular memory access

Structured Pruning

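Structured pruning removes entire filters, channels, or attention heads. A common importance score for filter $k$ is the L1 norm of its weights:

$$\text{importance}(F_k) = \sum_{c,i,j} |w_{k,c,i,j}|$$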

Remove filters with lowest importance scores.
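A minimal sketch of L1-norm filter pruning follows. It assumes each filter is given as a flat list of its weights and that a fixed fraction of filters is kept. Both the representation and the function names are illustrative.

```python
def filter_l1_norms(filters):
    """Importance score per filter: the sum of absolute weights."""
    return [sum(abs(w) for w in f) for f in filters]


def prune_filters(filters, keep_ratio):
    """Keep the highest-scoring fraction of filters, preserving their order.

    Returns (kept_filters, kept_indices).
    """
    norms = filter_l1_norms(filters)
    n_keep = max(1, int(keep_ratio * len(filters)))  # always keep at least one
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return [filters[i] for i in kept_indices], kept_indices


if __name__ == "__main__":
    filters = [
        [0.1, -0.1, 0.0, 0.05],
        [1.0, -0.5, 0.3, 0.2],
        [0.01, 0.02, -0.01, 0.0],
        [-0.7, 0.4, 0.1, -0.3],
    ]
    kept, idx = prune_filters(filters, keep_ratio=0.5)
    print(idx, filter_l1_norms(filters))
```

Unlike the masked matrix in unstructured pruning, the result here is a genuinely smaller layer. In a real network the matching input channels of the next layer must be removed as well, which this sketch does not show.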

Advantages:

Real speedup on existing hardware
Regular memory access

Disadvantages:

Lower sparsity achievable
More accuracy loss per parameter removed

Knowledge Distillation

Train a small (student) model to mimic a large (teacher) model:

Mathematical Formulation

$$\mathcal{L}_{KD} = \alpha \mathcal{L}_{CE}(y, p_s) + (1-\alpha) T^2 D_{KL}(p_t^T \| p_s^T)$$

where:

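$p_s = \text{softmax}(z_s)$ : student predictions at temperature 1, compared against the hard label $y$
$p_t^T = \text{softmax}(z_t/T)$ : teacher soft targets with temperature $T$
$p_s^T = \text{softmax}(z_s/T)$ : student softened predictions at the same temperature
$\alpha$ : balance between hard and soft losses
$T$ : temperature (higher = softer distribution); the $T^2$ factor keeps the soft-loss gradients on a scale comparable to the hard loss as $T$ changes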
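The loss can be written out for a single example as below. This is a dependency-free sketch with hypothetical function names and arbitrary default values for $\alpha$ and $T$. A training framework would compute the same quantity over batches of tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """alpha * CE(y, p_s) + (1 - alpha) * T^2 * KL(p_t^T || p_s^T)."""
    # Hard-label term: cross-entropy on the student's T = 1 predictions.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[label])
    # Soft-label term: KL between temperature-softened distributions.
    p_t_T = softmax(teacher_logits, temperature)
    p_s_T = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t_T, p_s_T) if pt > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [5.0, 3.0, -2.0]  # e.g. cat, dog, car
    student = [2.0, 1.0, 0.0]
    print(softmax(teacher), softmax(teacher, temperature=4.0))
    print(kd_loss(student, teacher, label=0))
```

Printing the teacher distribution at $T = 1$ and $T = 4$ shows the softening effect: the higher temperature moves probability mass from the top class to the others, which exposes the relative similarity between classes to the student.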

Soft targets contain:

Inter-class relationships (cat is more similar to dog than to car)
Dark knowledge (uncertainty information)

Combined Approaches

In practice the techniques are often combined, for example in this order:

Prune → Remove redundant parameters
Distill → Recover accuracy from pruning
Quantize → Reduce precision of pruned model

Evaluation Metrics

Metric	Description
Compression Ratio	Original size / Compressed size
Speedup	Original time / Compressed time
Accuracy Drop	Original acc - Compressed acc
FLOPs Reduction	Original FLOPs / Compressed FLOPs
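These ratios are simple to compute once both models have been measured. The helper below is a hypothetical sketch, and the numbers in the usage example are made up for illustration rather than measured results.

```python
def compression_report(orig_bytes, comp_bytes, orig_ms, comp_ms,
                       orig_acc, comp_acc, orig_flops, comp_flops):
    """Compute the four metrics from the table for one original/compressed pair."""
    return {
        "compression_ratio": orig_bytes / comp_bytes,
        "speedup": orig_ms / comp_ms,
        "accuracy_drop": orig_acc - comp_acc,
        "flops_reduction": orig_flops / comp_flops,
    }


if __name__ == "__main__":
    # Hypothetical FP32 -> INT8 example: same operation count, smaller and faster.
    report = compression_report(
        orig_bytes=1.36e9, comp_bytes=0.34e9,
        orig_ms=50.0, comp_ms=20.0,
        orig_acc=0.910, comp_acc=0.902,
        orig_flops=1.0e11, comp_flops=1.0e11,
    )
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
```

The example illustrates why the metrics should be reported together: quantization alone leaves the operation count unchanged, so FLOPs reduction stays at 1 even when size and latency improve.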

Follow-Up Questions

Q: What is the difference between INT8 and FP8 quantization? A: INT8 represents values on a uniformly spaced integer grid, so accuracy depends heavily on a well-calibrated scale. FP8 is an 8-bit floating-point format (E4M3 or E5M2) with non-uniform spacing: finer resolution near zero and a wider dynamic range, which tolerates outliers better. E4M3 offers more precision and E5M2 more range. FP8 is often a good fit for transformers, whose activations tend to contain outliers.

Q: How does structured pruning affect attention heads? A: Entire heads can be removed if they contribute little to the output. Head importance is typically scored with attention-weight norms or gradient-based methods.

Q: When should you use knowledge distillation vs fine-tuning? A: Use distillation when you need a smaller model for deployment, and fine-tuning when you need better performance on a specific task. The two are often combined: distill, then fine-tune.
