
Model Compression: Quantization, Pruning, Distillation — Asked at NVIDIA & Apple



🎯 The Interview Question

"Explain the different techniques for model compression: quantization, pruning, and knowledge distillation. What is the difference between post-training quantization and quantization-aware training? How does structured vs unstructured pruning differ? What is the mathematical formulation of knowledge distillation?"

This topic comes up for roles that deploy models on edge devices, such as those at NVIDIA (Jetson) and Apple (iOS).


📚 Detailed Answer

Why Model Compression?

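Large models are often impractical to deploy on constrained hardware (sizes assume 4 bytes per FP32 parameter; inference times are indicative and depend on hardware):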

| Model | Parameters | Size (FP32) | Inference Time |
|---|---|---|---|
| BERT-Large | 340M | 1.3 GB | 50 ms |
| GPT-2 | 1.5B | 6 GB | 200 ms |
| Stable Diffusion | 1B | 4 GB | 500 ms |

Compression enables:

  • Edge deployment (mobile, IoT)
  • Lower latency
  • Reduced energy consumption
  • Cost savings

Quantization

Reduce precision of weights and activations:

Types of Quantization

| Type | Precision | Size Reduction | Accuracy Loss |
|---|---|---|---|
| FP32 → FP16 | 16-bit | 2× | Minimal |
| FP32 → INT8 | 8-bit | 4× | Small (1-2%) |
| FP32 → INT4 | 4-bit | 8× | Moderate (2-5%) |
| FP32 → INT2 | 2-bit | 16× | Significant |
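
The size-reduction column follows directly from the bit width. A minimal sketch of that arithmetic, assuming weights dominate storage and ignoring the small overhead of scales and zero-points (the helper name `model_size_gb` is illustrative, not from any library):

```python
def model_size_gb(num_params: int, bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * bits / 8 / 1e9


bert_large_params = 340_000_000  # parameter count quoted for BERT-Large

for bits in (32, 16, 8, 4, 2):
    size = model_size_gb(bert_large_params, bits)
    reduction = 32 / bits  # size reduction relative to FP32
    print(f"{bits:>2}-bit: {size:.2f} GB ({reduction:.0f}x smaller than FP32)")
```

In practice the achieved ratio is usually a little lower than the ideal one, because per-tensor or per-channel scales also have to be stored.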

Post-Training Quantization (PTQ)

Quantize after training:

$$x_q = \text{round}\left(\frac{x - z}{s}\right)$$

where $z$ is the zero-point and $s$ is the scale factor.

Calibration: Compute $s$ and $z$ from representative data:

  • Min-Max: $s = \frac{\max(x) - \min(x)}{2^b - 1}$
  • KL divergence: Minimize divergence between distributions
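
The min-max variant can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a production recipe: it takes $z = \min(x)$ as a real-valued offset so that the formula above maps the smallest calibration value to 0, and all function names are hypothetical.

```python
def minmax_calibrate(values, bits=8):
    """Derive scale s and offset z from calibration data (min-max rule)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**bits - 1)
    if scale == 0.0:
        scale = 1.0  # constant input: any positive scale works
    zero_point = lo  # assumption: z is a real-valued offset, so min(x) maps to 0
    return scale, zero_point


def quantize(values, scale, zero_point, bits=8):
    """x_q = round((x - z) / s), clamped to the representable range [0, 2^b - 1]."""
    qmax = 2**bits - 1
    return [min(max(round((x - zero_point) / scale), 0), qmax) for x in values]


def dequantize(q_values, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [scale * q + zero_point for q in q_values]


calibration = [-1.0, -0.2, 0.0, 0.4, 1.5]  # stand-in for representative activations
s, z = minmax_calibrate(calibration)
q = quantize(calibration, s, z)
x_hat = dequantize(q, s, z)
max_err = max(abs(a - b) for a, b in zip(calibration, x_hat))
print(q, max_err)
```

For in-range values the round-trip error is bounded by half a quantization step, $s/2$. A single outlier in the calibration data inflates $s$ and therefore the error for every other value, which is one motivation for the KL-divergence approach.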

Limitations:

  • May lose accuracy on sensitive layers
  • No adaptation to quantization noise

Quantization-Aware Training (QAT)

Simulate quantization during training:

$$\hat{x} = s \cdot \text{clamp}\left(\text{round}\left(\frac{x}{s}\right), q_{min}, q_{max}\right)$$

Straight-through estimator for gradients:

$$\frac{\partial \mathcal{L}}{\partial x} \approx \frac{\partial \mathcal{L}}{\partial \hat{x}}$$
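
To make the two formulas concrete, here is a small sketch of fake quantization with a straight-through backward pass on a one-weight toy problem. The toy loss, the learning rate, and the function names are assumptions made for illustration; real frameworks implement this inside their autograd machinery.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize then dequantize: x_hat = s * clamp(round(x / s), qmin, qmax)."""
    q = round(x / scale)
    q = max(qmin, min(qmax, q))
    return scale * q


def ste_backward(grad_wrt_x_hat):
    """Straight-through estimator: treat round() as identity in the backward pass.

    Some implementations also zero the gradient where x was clamped;
    this sketch uses the plain pass-through form from the text.
    """
    return grad_wrt_x_hat


# Hypothetical toy problem: learn one weight so that w_hat * x matches y.
scale, lr = 0.1, 0.05
w, x, y = 0.9, 2.0, 1.0  # w is the full-precision "latent" weight

for _ in range(20):
    w_hat = fake_quantize(w, scale)         # forward pass sees the quantized weight
    grad_w_hat = 2.0 * (w_hat * x - y) * x  # dL/dw_hat for squared error
    w -= lr * ste_backward(grad_w_hat)      # update the latent weight via the STE

print(w, fake_quantize(w, scale))
```

The latent weight stays in full precision during training, while the loss is always computed from values on the quantization grid. That is what lets the model adapt to quantization noise.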

Advantages:

  • Better accuracy than PTQ
  • Model learns to handle quantization noise
  • Can make INT4 viable with a small accuracy loss on many models

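💡 For LLMs, GPTQ and AWQ are widely used post-training quantization methods. GPTQ builds on Optimal Brain Quantization, using approximate second-order information to quantize weights layer by layer. AWQ (activation-aware weight quantization) uses activation statistics to find salient weight channels and scales them to reduce their quantization error.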

Pruning

Remove redundant weights/neurons:

Unstructured Pruning

Remove individual weights:

$$m_{ij} = \begin{cases} 1 & \text{if } |w_{ij}| > \theta \\ 0 & \text{otherwise} \end{cases}$$

$$w_{pruned} = w \odot m$$

Magnitude pruning: Remove smallest weights by absolute value.
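
A minimal sketch of magnitude pruning on a small weight matrix, assuming the threshold $\theta$ is chosen to hit a target sparsity (the function names and the example matrix are illustrative):

```python
def magnitude_mask(weights, sparsity):
    """Build a 0/1 mask that removes about `sparsity` of the smallest-magnitude weights."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [[1 for _ in row] for row in weights]
    theta = flat[k - 1]  # largest magnitude that still gets pruned
    # Ties at theta are pruned as well, so sparsity can slightly exceed the target.
    return [[1 if abs(w) > theta else 0 for w in row] for row in weights]


def apply_mask(weights, mask):
    """Element-wise product w * m."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


W = [[0.8, -0.05, 0.3],
     [-0.01, 0.6, -0.2]]
mask = magnitude_mask(W, sparsity=0.5)
W_pruned = apply_mask(W, mask)
print(mask)
print(W_pruned)
```

The pruned matrix keeps its original shape; only the values change. That is why a dense matrix multiply runs just as long after unstructured pruning.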

Advantages:

  • High sparsity (90%+) possible
  • Minimal accuracy loss

Disadvantages:

  • Little or no speedup without sparse kernels or hardware support
  • Irregular memory access

Structured Pruning

Remove entire filters/channels/attention heads:

$$\text{Filter importance} = \sum_{i,j} |w_{i,j}|$$

Remove filters with lowest importance scores.
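
The same idea in code, as a rough sketch: each filter is scored by the sum of its absolute weights, and only the top-scoring filters are kept. The nested-list layout and the names `filter_importance` and `prune_filters` are assumptions for illustration.

```python
def filter_importance(filt):
    """Sum of absolute weights (L1 norm) of one filter."""
    return sum(abs(w) for row in filt for w in row)


def prune_filters(filters, keep):
    """Keep the `keep` highest-scoring filters, preserving their original order."""
    scores = [filter_importance(f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [filters[i] for i in kept], kept


filters = [
    [[0.9, -0.8], [0.7, 0.6]],     # large weights
    [[0.01, 0.02], [-0.01, 0.0]],  # near-zero filter, likely redundant
    [[0.3, -0.2], [0.1, 0.4]],
]
pruned, kept_idx = prune_filters(filters, keep=2)
print(kept_idx)
```

Unlike the masking approach, the layer actually shrinks here. In a real network the matching input channels of the next layer must be removed too, using the kept indices.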

Advantages:

  • Real speedup on existing hardware
  • Regular memory access

Disadvantages:

  • Lower sparsity achievable
  • More accuracy loss per parameter removed

Knowledge Distillation

Train a small (student) model to mimic a large (teacher) model:

Mathematical Formulation

$$\mathcal{L}_{KD} = \alpha \mathcal{L}_{CE}(y, p_s) + (1-\alpha) T^2 D_{KL}(p_t^T \| p_s^T)$$

where:

  • $p_t^T = \text{softmax}(z_t/T)$: teacher soft targets with temperature $T$
  • $p_s^T = \text{softmax}(z_s/T)$: student soft targets
  • $\alpha$: balance between hard and soft losses
  • $T$: temperature (higher = softer distribution)

Soft targets contain:

  • Inter-class relationships (cat is more similar to dog than to car)
  • Dark knowledge (uncertainty information)
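
The loss above can be written out for a single example using only the standard library. This is a sketch for intuition, assuming $p_s$ in the hard-label term is the student's softmax at $T = 1$; the default values of `alpha` and `T` are arbitrary choices, not recommendations.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, probs):
    """Hard-label loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """alpha * CE(y, p_s) + (1 - alpha) * T^2 * KL(p_t^T || p_s^T)."""
    hard = cross_entropy(label, softmax(student_logits))  # student at T = 1
    soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * hard + (1 - alpha) * T**2 * soft


teacher = [5.0, 3.0, -2.0]  # hypothetical logits for (cat, dog, car)
student = [2.0, 0.5, 0.0]
print(kd_loss(student, teacher, label=0))
```

The $T^2$ factor compensates for the soft-target gradients shrinking roughly as $1/T^2$, so the balance set by $\alpha$ stays comparable when the temperature changes.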

Combined Approaches

Best practice: Combine multiple techniques:

  1. Prune → Remove redundant parameters
  2. Distill → Recover accuracy from pruning
  3. Quantize → Reduce precision of pruned model

Evaluation Metrics

| Metric | Description |
|---|---|
| Compression Ratio | Original size / Compressed size |
| Speedup | Original time / Compressed time |
| Accuracy Drop | Original acc - Compressed acc |
| FLOPs Reduction | Original FLOPs / Compressed FLOPs |
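
These metrics are simple ratios and differences, so a small helper keeps reporting consistent. In the sketch below, the dictionary keys and all measurement values are hypothetical placeholders, not benchmark results.

```python
def compression_metrics(orig, comp):
    """Compute the four metrics from the table for a baseline and a compressed model."""
    return {
        "compression_ratio": orig["size_bytes"] / comp["size_bytes"],
        "speedup": orig["latency_ms"] / comp["latency_ms"],
        "accuracy_drop": orig["accuracy"] - comp["accuracy"],
        "flops_reduction": orig["flops"] / comp["flops"],
    }


# Hypothetical measurements, for illustration only.
baseline = {"size_bytes": 1_360_000_000, "latency_ms": 50.0, "accuracy": 0.912, "flops": 80e9}
compressed = {"size_bytes": 340_000_000, "latency_ms": 20.0, "accuracy": 0.901, "flops": 40e9}

report = compression_metrics(baseline, compressed)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Compression ratio and speedup rarely match: a 4× smaller model is only faster if the target hardware has efficient kernels for the compressed format.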

Follow-Up Questions

Q: What is the difference between INT8 and FP8 quantization? A: INT8 uses uniformly spaced integer levels, so it depends on a well-chosen scale per tensor or channel. FP8 is a floating-point format (E4M3 or E5M2) whose non-uniform spacing gives a wider dynamic range: E4M3 has more mantissa precision, E5M2 has more exponent range. FP8 often works well for transformers, whose activations tend to contain outliers.

Q: How does structured pruning affect attention heads? A: Entire heads can be removed if they contribute little to the output. Rank heads with importance scores based on attention weight norms or gradient-based methods.

Q: When should you use knowledge distillation vs fine-tuning? A: Use distillation when you need a smaller model for deployment. Use fine-tuning when you need better performance on a specific task. The two are often combined: distill, then fine-tune.
