Principal Component Analysis: Dimensionality Reduction & Variance Explained
The most important unsupervised learning algorithm
Interview Question
"Explain the mathematical foundation of PCA. How do you determine the number of components to keep? What are the limitations of PCA and when would you use alternatives like t-SNE or UMAP?"
Difficulty: Hard | Frequently asked at NVIDIA, Meta, Google
Theoretical Foundation
What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique that finds orthogonal directions (principal components) of maximum variance in the data.
Goal: Find a linear projection that preserves maximum variance.
Mathematical Formulation
Given centered data (columns have zero mean):
- Compute covariance matrix:
- Eigenvalue decomposition:
where is the -th eigenvector and is the corresponding eigenvalue.
-
Sort by eigenvalues:
-
Projection: where contains the top eigenvectors.
Variance Explained
The proportion of variance explained by the -th component:
Cumulative variance explained:
ℹ️
Key Insight: PCA finds directions that maximize variance. The first principal component (PC1) has the maximum variance, PC2 has the maximum variance orthogonal to PC1, and so on.
Choosing the Number of Components
1. Variance Threshold
Keep components that explain at least of variance (typically 90-95%).
2. Scree Plot
Plot eigenvalues and look for an "elbow" where the rate of decrease slows.
3. Kaiser's Rule
Keep components with eigenvalues > 1 (for standardized data).
4. Cross-Validation
Use downstream task performance to choose the number of components.
PCA Algorithms
1. Full PCA (Eigenvalue Decomposition)
- Computes all eigenvectors
- Time complexity:
- Best for
2. Truncated SVD
- Computes only top components
- Time complexity:
- Best for large or sparse data
3. Randomized PCA
- Uses random projection + power iteration
- Time complexity:
- Best for very large datasets
Limitations of PCA
- Linear Only: Cannot capture non-linear relationships
- Variance-Based: May discard important features with low variance
- Sensitive to Scaling: Features must be standardized
- Interpretability: Components are linear combinations of original features
- Outlier Sensitivity: Outliers can dominate principal components
⚠️
Common Misconception: PCA doesn't always preserve the most important information. If a feature with low variance is crucial for the task, PCA might discard it. Always validate PCA with your specific task.
Alternatives to PCA
1. t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Preserves local structure (neighbors remain close)
- Non-linear dimensionality reduction
- Good for visualization (2-3D)
- Slow for large datasets
2. UMAP (Uniform Manifold Approximation and Projection)
- Preserves both local and global structure
- Faster than t-SNE
- Better preserves global geometry
- Can be used for general dimensionality reduction
3. Kernel PCA
- Uses kernel trick for non-linear PCA
- Maps to high-dimensional space
- Computationally expensive
4. Autoencoders
- Neural network-based dimensionality reduction
- Can learn complex non-linear mappings
- Requires more data and tuning
When to Use Each Method
| Method | Use Case | Preserves | Speed |
|---|---|---|---|
| PCA | Linear data, feature reduction | Global variance | Fast |
| t-SNE | Visualization, clusters | Local structure | Slow |
| UMAP | Visualization, general reduction | Local + global | Medium |
| Kernel PCA | Non-linear relationships | Non-linear variance | Medium |
| Autoencoders | Complex patterns, images | Task-specific | Slow |
Code Implementation
Explanation of Code
-
Basic PCA: Shows variance explained and component loadings.
-
Choosing Components: Demonstrates different methods for selecting the number of components.
-
Scree Plot: Visualizes variance explained and 2D projection.
-
Noise Reduction: Shows how PCA can denoise data by keeping only important components.
-
t-SNE Comparison: Compares PCA with t-SNE for visualization.
-
Kernel PCA: Demonstrates PCA on non-linear data.
-
ML Pipeline: Shows PCA as a preprocessing step for classification.
Real-World Applications
NVIDIA: Image Compression
NVIDIA uses PCA for:
- Feature Compression: Reducing CNN feature dimensions
- GPU Memory Optimization: Compressing activations
- Data Augmentation: PCA-based color augmentation
Meta: Dimensionality Reduction
Meta uses PCA for:
- Face Recognition: Eigenfaces approach
- Recommendation Systems: Reducing user/item embeddings
- Ad Targeting: Compressing user features
💡
NVIDIA Interview Tip: Be prepared to discuss PCA for GPU memory optimization. Mention how PCA can reduce the memory footprint of feature representations while preserving most information.
Common Follow-Up Questions
Q1: Why is scaling necessary before PCA?
PCA finds directions of maximum variance. If features have different scales, features with larger scales will dominate. Standardizing ensures all features contribute equally.
Q2: What is the difference between PCA and SVD?
SVD decomposes . PCA computes eigenvectors of . They're mathematically equivalent, but SVD is more numerically stable.
Q3: Can PCA be used for classification?
PCA is unsupervised and maximizes variance, not class separability. Linear Discriminant Analysis (LDA) is better for classification as it maximizes between-class variance.
Q4: How does PCA handle missing values?
Standard PCA cannot handle missing values. You must impute first or use robust PCA variants that handle missing data.
Company-Specific Tips
NVIDIA Interview Tips
- Discuss GPU-accelerated PCA implementations
- Be ready to explain truncated SVD for sparse data
- Mention randomized PCA for large datasets
- Talk about online PCA for streaming data
Meta Interview Tips
- Focus on scalability to billions of data points
- Discuss distributed PCA implementations
- Be prepared to explain incremental PCA
- Mention PCA for privacy (differential privacy)