Unsupervised Learning

Grouping the Ungrouped — Finding Hidden Structure in Data

Clustering algorithms discover natural groupings in data without labels. They are essential for customer segmentation, anomaly detection, and exploratory analysis.

K-Means — Fast partitioning into K clusters using centroids
DBSCAN — Density-based clustering that finds arbitrary shapes
Hierarchical — Building dendrograms for multi-level groupings

"The greatest value of a picture is when it forces us to notice what we never expected to see." (John W. Tukey)

Clustering — Complete Guide

Clustering groups similar data points together without labels. It is one of the most common unsupervised learning tasks.

K-Means Clustering

Definition: K-Means Clustering

Given dataset $\{x^{(i)}\}_{i=1}^{N}$ and number of clusters $K$, K-Means minimizes the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K}\sum_{i \in C_k} \|x^{(i)} - \mu_k\|^2$$

where $\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} x^{(i)}$ is the centroid of cluster $C_k$.
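The objective $J$ is commonly minimized with Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. The sketch below is a minimal NumPy version for illustration only, not how any particular library implements it. The function name `lloyd_kmeans`, the random initialization from data points, and the toy array `X_toy` are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm (illustrative): returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: J = sum_k sum_{i in C_k} ||x_i - mu_k||^2
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Hypothetical toy data: two obvious pairs of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids, wcss = lloyd_kmeans(X_toy, k=2)
print(labels, wcss)
```

Each iteration can only lower $J$ or leave it unchanged, so the loop converges, but only to a local minimum that depends on the initialization. That is why practical implementations usually run several initializations and keep the best result.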

DBSCAN

Definition: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups together points that are closely packed (high density), marking as outliers points that lie alone in low-density regions. Parameters: $\varepsilon$ (neighborhood radius) and $\text{minPts}$ (minimum points to form a dense region).
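To make the two parameters concrete, here is a small sketch using scikit-learn's `DBSCAN` on hand-built data: two tight groups and one isolated point. The array `X_toy` and the particular values `eps=0.3` and `min_samples=3` are assumptions chosen so the outcome is easy to reason about, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of five points each, plus one isolated point (toy data)
group = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]])
X_toy = np.vstack([group, group + 5.0, [[10.0, -10.0]]])

# eps is the neighborhood radius; min_samples plays the role of minPts
# (in scikit-learn the count includes the point itself)
db = DBSCAN(eps=0.3, min_samples=3).fit(X_toy)

print(db.labels_)               # cluster index per point; -1 marks noise
print(db.core_sample_indices_)  # indices of points with a dense neighborhood
```

Every point in a tight group has all four of its neighbors within the radius, so all ten grouped points are core points, while the isolated point has no neighbors and is labeled `-1`.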

Hierarchical Clustering

Definition: Agglomerative Hierarchical Clustering

A bottom-up approach: start with each point as its own cluster, then repeatedly merge the closest pair of clusters until only $K$ remain. The distance between clusters is defined by a linkage criterion.
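The merge sequence can be inspected directly with SciPy's `scipy.cluster.hierarchy` module. The following is a sketch under stated assumptions: the toy array `X_toy`, the choice of `method="average"`, and the threshold `5.0` are picked for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two pairs of nearby points, far from each other
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Each row of Z records one merge: (cluster a, cluster b, merge distance, new size)
Z = linkage(X_toy, method="average")

# Stop merging once K = 2 clusters remain
labels_k = fcluster(Z, t=2, criterion="maxclust")

# Alternative view: cut the merge tree at a distance threshold
labels_h = fcluster(Z, t=5.0, criterion="distance")

print(Z)
print(labels_k, labels_h)
```

With $N$ points there are always $N-1$ merges, so `Z` has three rows here. For this data both cuts yield the same two groups, because the final merge happens at a distance far above the threshold.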

Linkage Criteria

Single: $d(A,B) = \min_{a \in A, b \in B} d(a,b)$ — nearest point
Complete: $d(A,B) = \max_{a \in A, b \in B} d(a,b)$ — farthest point
Average (UPGMA): $d(A,B) = \frac{1}{|A||B|}\sum_{a \in A}\sum_{b \in B} d(a,b)$
Ward: merges the pair of clusters whose union gives the smallest increase in total within-cluster variance
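The four criteria can be compared on a tiny example. The sketch below evaluates each one for two hypothetical clusters `A` and `B` on a line; the Ward value uses the standard merge-cost expression $\frac{|A||B|}{|A|+|B|}\|\mu_A - \mu_B\|^2$, which is the increase in within-cluster sum of squares caused by the merge.

```python
import numpy as np

# Hypothetical clusters for illustration
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# Pairwise Euclidean distances d(a, b) for every a in A, b in B
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = D.min()      # nearest pair
complete = D.max()    # farthest pair
average = D.mean()    # mean over all |A||B| pairs

# Ward: increase in within-cluster sum of squares caused by merging A and B
n_a, n_b = len(A), len(B)
ward_cost = n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(single, complete, average, ward_cost)
```

For these points the four values are 2, 5, 3.5 and 12.25. Single linkage sees the clusters as closest and complete linkage as farthest, which is why single linkage tends to chain clusters together while complete linkage favors compact ones.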

Evaluation Metrics

Silhouette Score

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

Here,

$a(i)$ = mean distance from point $i$ to the other points in its own cluster
$b(i)$ = mean distance from point $i$ to the points of the nearest other cluster
$s(i)$ = silhouette coefficient, $s(i) \in [-1, 1]$

Interpreting Silhouette Score

$s(i) \approx 1$: Well-clustered (tight, well-separated)
$s(i) \approx 0$: On cluster boundary
$s(i) < 0$: Likely assigned to wrong cluster
Overall: $\bar{s} = \frac{1}{N}\sum_i s(i)$; higher is better
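To see how $a(i)$, $b(i)$ and $s(i)$ fit together, the sketch below computes the coefficient by hand from pairwise distances and compares it with scikit-learn's `silhouette_samples`. The helper name `silhouette_manual` and the toy data are assumptions for illustration; Euclidean distance is assumed throughout.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_manual(X, labels):
    """Per-point silhouette s(i) from pairwise Euclidean distances (illustrative)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a(i) excludes the point itself
        if not same.any():
            continue  # singleton cluster: s(i) is left at 0 by convention
        a = D[i, same].mean()
        # b(i): smallest mean distance to the points of any other cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical toy data: two compact, well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

s = silhouette_manual(X_toy, labels_toy)
print(s.round(3), s.mean().round(3))
print(np.allclose(s, silhouette_samples(X_toy, labels_toy)))
```

The pairwise distance matrix makes this quadratic in the number of points, so for large datasets the score is often estimated on a subsample.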

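from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# X_raw should be your (n_samples, n_features) array.
# Synthetic blobs stand in for it here (an assumption, so the example runs as-is).
X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Standardize so every feature contributes equally to Euclidean distances
X = StandardScaler().fit_transform(X_raw)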

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters, noise: {(labels_db == -1).sum()}")

# Hierarchical
hier = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")

Key Takeaways

Summary: Clustering

K-Means minimizes WCSS $J = \sum_k \sum_{i \in C_k}\|x^{(i)} - \mu_k\|^2$ — fast, but assumes spherical clusters
DBSCAN finds arbitrary shapes and handles outliers; its parameters are $\varepsilon$ and minPts
Hierarchical produces interpretable dendrograms — cut at height $h$ to get $K$ clusters
Silhouette score $s(i) = \frac{b(i)-a(i)}{\max(a(i),b(i))}$ is a widely used internal evaluation metric
Always scale features before clustering (especially K-Means, DBSCAN)
Use the elbow method or silhouette analysis to choose $K$
Clustering is exploratory — results require domain interpretation
No single algorithm works best for all data distributions
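The elbow method and silhouette analysis mentioned in the takeaways can both be sketched as a loop over candidate values of $K$. The example below is a minimal illustration on synthetic data: the blob centers, `cluster_std`, the range of $K$, and the names `X_demo` and `results` are assumptions for the demo, and real data rarely gives such a clean answer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (an assumption for the demo)
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=42)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    # inertia_ is the WCSS J; silhouette_score is the mean of s(i)
    results[k] = (km.inertia_, silhouette_score(X_demo, km.labels_))
    print(f"K={k}: WCSS={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

# Silhouette analysis: pick the K with the highest mean silhouette
best_k = max(results, key=lambda k: results[k][1])
print("K with highest mean silhouette:", best_k)
```

For the elbow method, look for the value of $K$ after which WCSS stops dropping sharply. WCSS keeps decreasing as $K$ grows, so its minimum is not informative on its own, whereas the mean silhouette peaks at a specific $K$.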

What to Learn Next

-> Dimensionality Reduction: Reduce features while preserving information with PCA and t-SNE.

-> KNN: Instance-based learning where your neighbors tell the story.

-> Recommendation Systems: Collaborative and content-based filtering for personalized experiences.
