πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

K-Nearest Neighbors: Distance Metrics & Curse of Dimensionality

Machine LearningK-Nearest Neighbors⭐ Premium

Advertisement

Amazon & Stripe Interview

K-Nearest Neighbors: Distance Metrics & Curse of Dimensionality

The simplest yet most insightful instance-based learning algorithm

Interview Question

"Explain the KNN algorithm and its decision boundary. What is the curse of dimensionality and how does it affect KNN? Compare different distance metrics and their use cases."

Difficulty: Medium | Frequently asked at Amazon, Stripe, Google


Theoretical Foundation

KNN Algorithm

K-Nearest Neighbors is a lazy learning algorithm that makes predictions based on the majority class (classification) or average value (regression) of the KK closest training examples.

Algorithm:

  1. Choose the number of neighbors KK
  2. Calculate the distance between the query point and all training points
  3. Select the KK nearest neighbors
  4. For classification: Return the majority class
  5. For regression: Return the average value

Key Properties:

  • Instance-based: No explicit training phase
  • Non-parametric: No assumptions about data distribution
  • Lazy learning: All computation happens at prediction time
  • Local decision: Decisions based on local neighborhoods

Decision Boundary

The decision boundary of KNN is piecewise linear and depends on:

  • K value: Larger K β†’ smoother boundary
  • Distance metric: Different metrics β†’ different boundaries
  • Feature scaling: Essential for meaningful distances

Choosing K

  • Small K (e.g., K=1): Low bias, high variance (overfitting)
  • Large K (e.g., K=n): High bias, low variance (underfitting)
  • Optimal K: Typically n\sqrt{n} or determined by cross-validation

ℹ️

Key Insight: KNN is a "voting" classifier. With K=1, it memorizes the training data (overfitting). With K=n, it always predicts the majority class (underfitting). The optimal K balances bias and variance.

Distance Metrics

1. Euclidean Distance (L2)

d(x,y)=βˆ‘i=1p(xiβˆ’yi)2d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
  • Most commonly used
  • Sensitive to outliers
  • Assumes features are on the same scale
  • Best for continuous features

2. Manhattan Distance (L1)

d(x,y)=βˆ‘i=1p∣xiβˆ’yi∣d(x, y) = \sum_{i=1}^{p} |x_i - y_i|
  • More robust to outliers than Euclidean
  • Used when features are sparse
  • Best for high-dimensional data

3. Minkowski Distance (Generalized)

d(x,y)=(βˆ‘i=1p∣xiβˆ’yi∣q)1/qd(x, y) = \left(\sum_{i=1}^{p} |x_i - y_i|^q\right)^{1/q}
  • q=1q=1: Manhattan distance
  • q=2q=2: Euclidean distance
  • qβ†’βˆžq \to \infty: Chebyshev distance

4. Cosine Similarity

cos⁑(ΞΈ)=xβ‹…yβˆ₯xβˆ₯βˆ₯yβˆ₯\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}
  • Measures angle between vectors
  • Ignores magnitude, only direction
  • Best for text classification (TF-IDF)
  • Range: [βˆ’1,1][-1, 1] (1 = identical direction)

5. Hamming Distance

d(x,y)=1pβˆ‘i=1p1[xiβ‰ yi]d(x, y) = \frac{1}{p} \sum_{i=1}^{p} \mathbb{1}[x_i \neq y_i]
  • For categorical features
  • Measures proportion of differing features
  • Best for binary/categorical data

Distance Metric Comparison

MetricRangeSensitive to OutliersBest ForTime Complexity
Euclidean[0,∞)[0, \infty)YesContinuous, low-dimO(p)O(p)
Manhattan[0,∞)[0, \infty)LessSparse, high-dimO(p)O(p)
Cosine[βˆ’1,1][-1, 1]NoText, high-dimO(p)O(p)
Hamming[0,1][0, 1]NoCategoricalO(p)O(p)

Curse of Dimensionality

As the number of features increases, several problems arise:

  1. Distance Concentration: All points become equidistant
dmaxβ‘βˆ’dmin⁑dmin⁑→0Β asΒ pβ†’βˆž\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \text{ as } p \to \infty
  1. Data Sparsity: Points become increasingly isolated

    • Required data grows exponentially with dimensions
    • Volume of unit hypersphere decreases relative to hypercube
  2. Spurious Correlations: In high dimensions, random correlations appear

⚠️

Critical Concept: In high dimensions, the concept of "nearest neighbor" becomes meaningless because all points are approximately equidistant. This is why KNN performs poorly without dimensionality reduction.

Mathematical Explanation:

For pp-dimensional uniform data in [0,1]p[0,1]^p:

  • Expected distance between two random points: p/6\sqrt{p/6}
  • Ratio of max to min distance: p/3βˆ’p/6p/6=2βˆ’1β‰ˆ0.414\frac{\sqrt{p/3} - \sqrt{p/6}}{\sqrt{p/6}} = \sqrt{2} - 1 \approx 0.414

This ratio doesn't depend on nn, meaning even with infinite data, the curse persists.

Optimizations for KNN

1. KD-Tree

  • Binary tree partitioning by feature values
  • Query time: O(log⁑n)O(\log n) average, O(n)O(n) worst case
  • Best for low to medium dimensions (p<20p < 20)

2. Ball Tree

  • Hierarchical clustering of data points
  • Better for high dimensions than KD-Tree
  • Handles non-Euclidean distances

3. Approximate Nearest Neighbors (ANN)

  • Locality-Sensitive Hashing (LSH)
  • Annoy (Approximate Nearest Neighbors Oh Yeah)
  • HNSW (Hierarchical Navigable Small World)
  • Trade accuracy for speed

πŸ’‘

Production Tip: For large datasets (>100K samples), use approximate nearest neighbor libraries like Faiss (Facebook) or Annoy (Spotify) instead of exact KNN.

Weighted KNN

Instead of equal votes, weight neighbors by distance:

y^=arg⁑max⁑kβˆ‘i∈Nk(x)wiβ‹…1[yi=k]\hat{y} = \arg\max_k \sum_{i \in N_k(x)} w_i \cdot \mathbb{1}[y_i = k]

where wi=1/d(x,xi)w_i = 1/d(x, x_i) or wi=eβˆ’d(x,xi)w_i = e^{-d(x, x_i)}

Weighted KNN gives more influence to closer neighbors and often outperforms unweighted KNN.


Code Implementation

Explanation of Code

  1. Effect of K: Shows how increasing K smooths the decision boundary and affects bias-variance tradeoff.

  2. Distance Metrics: Compares Euclidean, Manhattan, Chebyshev, and Cosine distances.

  3. Feature Scaling: Demonstrates that KNN is highly sensitive to feature scales.

  4. Curse of Dimensionality: Shows how accuracy degrades as dimensions increase.

  5. KNN Regression: Demonstrates KNN for regression tasks.

  6. Weighted KNN: Compares uniform vs distance-weighted voting.

  7. Algorithm Comparison: Benchmarks different KNN implementations.


Real-World Applications

Amazon: Recommendation Systems

Amazon uses KNN for:

  • Product Recommendations: Finding similar products based on purchase history
  • Collaborative Filtering: User-based and item-based KNN
  • Content-Based Filtering: Similar item features

Stripe: Fraud Detection

Stripe uses KNN variants for:

  • Anomaly Detection: Identifying unusual transaction patterns
  • Customer Segmentation: Grouping similar merchants
  • Risk Scoring: Similarity to known fraud cases

πŸ’‘

Amazon Interview Tip: Be prepared to discuss the tradeoff between KNN and model-based methods. KNN is interpretable but slow; model-based methods are faster but less interpretable.


Common Follow-Up Questions

Q1: Why is feature scaling essential for KNN?

KNN uses distance metrics, which are sensitive to feature scales. Without scaling, features with larger ranges dominate the distance calculation. For example, income (in thousands) would dominate age (in years).

Q2: How do you handle categorical features in KNN?

  • Use Hamming distance for binary/categorical features
  • One-hot encode categorical features (increases dimensionality)
  • Use Gower distance (handles mixed feature types)

Q3: What is the computational complexity of KNN?

  • Training: O(1)O(1) (lazy learning)
  • Prediction: O(np)O(np) for brute force, O(plog⁑n)O(p \log n) for KD-Tree
  • This makes KNN slow for large datasets

Q4: How do you choose the optimal K?

  • Use cross-validation
  • Start with K=nK = \sqrt{n}
  • Odd K avoids ties in binary classification
  • Larger K reduces variance but increases bias

Company-Specific Tips

Amazon Interview Tips

  • Discuss scalability challenges with large datasets
  • Be ready to explain approximate nearest neighbors
  • Mention hybrid approaches (KNN + deep learning)
  • Talk about real-time serving requirements

Stripe Interview Tips

  • Focus on anomaly detection applications
  • Discuss interpretability for fraud explanations
  • Be prepared to explain feature engineering for transactions
  • Mention online learning for concept drift

Related Topics

Advertisement