🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Clustering — K-Means, DBSCAN, Hierarchical Complete Guide

ML FoundationsUnsupervised Learning🟢 Free Lesson

Advertisement

Unsupervised Learning

Grouping the Ungrouped — Finding Hidden Structure in Data

Clustering algorithms discover natural groupings in data without labels. They are essential for customer segmentation, anomaly detection, and exploratory analysis.

  • K-Means — Fast partitioning into K clusters using centroids
  • DBSCAN — Density-based clustering that finds arbitrary shapes
  • Hierarchical — Building dendrograms for multi-level groupings

"The greatest value of a picture is when it forces us to notice what we never expected to see."

Clustering — Complete Guide

Clustering groups similar data points together without labels. It is the most common unsupervised learning task.


K-Means Clustering

DfK-Means Clustering

Given dataset {x(i)}i=1N\{x^{(i)}\}_{i=1}^{N} and number of clusters KK, K-Means minimizes the within-cluster sum of squares (WCSS):

J=k=1KiCkx(i)μk2J = \sum_{k=1}^{K}\sum_{i \in C_k} \|x^{(i)} - \mu_k\|^2

where μk=1CkiCkx(i)\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} x^{(i)} is the centroid of cluster CkC_k.

K-Means Algorithm: Iterative Assignment and UpdateStep 1: InitializeRandom K=2 centroids ⊙Step 2: AssignAssign each point to nearest ⊙Step 3: UpdateRecompute ⊙ = mean of clusterStep 4: ConvergeConverged! ✓Choosing K: Elbow MethodK=4 (elbow)K-Means Limitations• Assumes spherical clusters • Sensitive to initialization (use K-Means++) • Must specify K • Sensitive to outliers• Fails with non-convex shapes • Use Mini-Batch K-Means for large datasets

DBSCAN

DfDBSCAN (Density-Based Spatial Clustering)

DBSCAN groups together points that are closely packed (high density), marking as outliers points that lie alone in low-density regions. Parameters: ε\varepsilon (neighborhood radius) and minPts\text{minPts} (minimum points to form a dense region).

DBSCAN: Density-Based ClusteringCluster 1Cluster 2 (crescent)Cluster 3noisenoisenoiseDBSCAN Properties✓ Advantages:• No need to specify K• Finds arbitrary shapes• Handles outliers• Robust to noise✗ Limitations:• Struggles with varying density clusters• Sensitive to ε, minPts• O(N²) without indexing• High-dim struggles

Hierarchical Clustering

DfAgglomerative Hierarchical Clustering

A bottom-up approach: start with each point as its own cluster, then repeatedly merge the closest pair of clusters until only KK remain. The distance between clusters is defined by a linkage criterion.

Linkage Criteria

  • Single: d(A,B)=minaA,bBd(a,b)d(A,B) = \min_{a \in A, b \in B} d(a,b) — nearest point
  • Complete: d(A,B)=maxaA,bBd(a,b)d(A,B) = \max_{a \in A, b \in B} d(a,b) — farthest point
  • Average (UPGMA): d(A,B)=1ABaAbBd(a,b)d(A,B) = \frac{1}{|A||B|}\sum_{a \in A}\sum_{b \in B} d(a,b)
  • Ward: Minimizes within-cluster variance after merging
Dendrogram: Hierarchical Clustering TreeABCDEFG23456HeightCut → K=3{ABC}, {DEF}, {G}

Evaluation Metrics

Silhouette Score

s(i)=b(i)a(i)max{a(i),b(i)}s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

Here,

  • a(i)a(i)=Mean intra-cluster distance for point i
  • b(i)b(i)=Mean nearest-cluster distance for point i
  • s(i)s(i)=Silhouette coefficient ∈ [−1, 1]

Interpreting Silhouette Score

  • s(i)1s(i) \approx 1: Well-clustered (tight, well-separated)
  • s(i)0s(i) \approx 0: On cluster boundary
  • s(i)<0s(i) < 0: Likely assigned to wrong cluster
  • Overall: sˉ=1Nis(i)\bar{s} = \frac{1}{N}\sum_i s(i) — higher is better
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X_raw)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters, noise: {(labels_db == -1).sum()}")

# Hierarchical
hier = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")

Key Takeaways

Summary: Clustering

  1. K-Means minimizes WCSS J=kiCkx(i)μk2J = \sum_k \sum_{i \in C_k}\|x^{(i)} - \mu_k\|^2 — fast, but assumes spherical clusters
  2. DBSCAN finds arbitrary shapes and handles outliers — parameters: ε\varepsilon, minPts
  3. Hierarchical produces interpretable dendrograms — cut at height hh to get KK clusters
  4. Silhouette score s(i)=b(i)a(i)max(a(i),b(i))s(i) = \frac{b(i)-a(i)}{\max(a(i),b(i))} is the most common internal evaluation
  5. Always scale features before clustering (especially K-Means, DBSCAN)
  6. Use the elbow method or silhouette analysis to choose KK
  7. Clustering is exploratory — results require domain interpretation
  8. No single algorithm works best for all data distributions

What to Learn Next

-> Dimensionality Reduction Reduce features while preserving information with PCA and t-SNE.

-> KNN Instance-based learning where your neighbors tell the story.

-> Recommendation Systems Collaborative and content-based filtering for personalized experiences.

Premium Content

Clustering — K-Means, DBSCAN, Hierarchical Complete Guide

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Machine Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement