πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Clustering: K-Means, DBSCAN & Hierarchical

Machine LearningClustering⭐ Premium

Advertisement

Google & LinkedIn Interview

Clustering: K-Means, DBSCAN & Hierarchical

Unsupervised learning for discovering natural groupings

Interview Question

"Compare K-Means, DBSCAN, and hierarchical clustering. What are the strengths and weaknesses of each? How do you choose the number of clusters and evaluate clustering quality?"

Difficulty: Medium | Frequently asked at Google, LinkedIn, Amazon


Theoretical Foundation

K-Means Clustering

Algorithm:

  1. Initialize KK centroids randomly
  2. Assign each point to nearest centroid
  3. Update centroids as mean of assigned points
  4. Repeat until convergence

Objective:

minβ‘βˆ‘k=1Kβˆ‘xi∈Ckβˆ₯xiβˆ’ΞΌkβˆ₯2\min \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Properties:

  • Assumes spherical clusters
  • Sensitive to initialization (use K-Means++)
  • O(nKd)O(nKd) per iteration
  • Converges to local minimum

DBSCAN (Density-Based Spatial Clustering)

Key Concepts:

  • Core point: Has at least \minPts\minPts neighbors within Ο΅\epsilon
  • Border point: Within Ο΅\epsilon of a core point but not a core point itself
  • Noise point: Neither core nor border

Algorithm:

  1. Find all core points
  2. Expand clusters from core points
  3. Border points join nearest core point cluster
  4. Noise points remain unclustered

Properties:

  • Finds arbitrary-shaped clusters
  • Handles outliers naturally
  • No need to specify KK
  • Sensitive to Ο΅\epsilon and \minPts\minPts

Hierarchical Clustering

Agglomerative (Bottom-Up):

  1. Start with each point as its own cluster
  2. Merge closest pair of clusters
  3. Repeat until KK clusters remain

Divisive (Top-Down):

  1. Start with all points in one cluster
  2. Split clusters recursively
  3. Stop when KK clusters remain

Linkage Methods:

  • Single: Minimum distance between clusters
  • Complete: Maximum distance between clusters
  • Average: Average distance between clusters
  • Ward: Minimize within-cluster variance

Clustering Comparison

MethodShapeOutliersScalabilityParameters
K-MeansSphericalSensitiveHighKK
DBSCANArbitraryRobustMediumΟ΅\epsilon, \minPts\minPts
HierarchicalFlexibleModerateLowLinkage, KK

Choosing the Number of Clusters

Elbow Method

Plot within-cluster sum of squares (WCSS) vs KK. Look for the "elbow" where improvement slows.

Silhouette Score

s(i)=b(i)βˆ’a(i)max⁑(a(i),b(i))s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

where a(i)a(i) is average distance to same cluster, b(i)b(i) is average distance to nearest other cluster.

Gap Statistic

Compares WCSS to expected WCSS under null reference distribution.

ℹ️

Key Insight: The elbow method can be ambiguous. The silhouette score provides a more principled approach, measuring both cohesion (within-cluster) and separation (between-cluster).

Evaluation Metrics

Silhouette Coefficient

Range: [βˆ’1,1][-1, 1]. Higher is better.

Adjusted Rand Index (ARI)

Corrected for chance. Range: [βˆ’1,1][-1, 1]. 1 = perfect.

Normalized Mutual Information (NMI)

Measures agreement between cluster assignments. Range: [0,1][0, 1].


Code Implementation


Real-World Applications

Google: Search Clustering

  • Query Clustering: Grouping similar search queries
  • Document Clustering: Organizing web pages by topic
  • User Segmentation: Clustering users by behavior

LinkedIn: People You May Know

  • Social Network Clustering: Identifying communities
  • Skill Clustering: Grouping professionals by expertise
  • Company Clustering: Categorizing businesses

πŸ’‘

Google Interview Tip: Be prepared to discuss scalability. Mention Mini-Batch K-Means for large datasets and distributed clustering with MapReduce.


Common Follow-Up Questions

Q1: How do you handle the scalability of K-Means? Use Mini-Batch K-Means, which processes small random batches. Also consider distributed implementations with MapReduce or Spark.

Q2: When would you choose DBSCAN over K-Means? When clusters have arbitrary shapes, when you expect noise/outliers, or when you don't know the number of clusters.

Q3: How do you evaluate clustering without ground truth? Use intrinsic metrics: silhouette score, Davies-Bouldin index, Calinski-Harabasz index.

Q4: What is the curse of dimensionality for clustering? In high dimensions, distance metrics become less meaningful. Use dimensionality reduction before clustering.


Related Topics

Advertisement