Cluster Analysis — k-Means, Hierarchical, DBSCAN
Statistics
Grouping Similar Observations Without Labels
Cluster analysis partitions data into groups where members are more similar to each other than to outsiders. Each algorithm — k-means, hierarchical, DBSCAN — offers different strengths for different data structures.
-
Customer Segmentation — Group buyers by purchasing behavior for targeted campaigns
-
Genomics — Classify genes with similar expression patterns into functional groups
-
Anomaly Detection — Identify fraud or defects as points that don't belong to any cluster
Finding groups that nature created but didn't label is the essence of unsupervised discovery.
Cluster analysis groups similar observations together without predefined labels. The goal is to maximize within-cluster similarity and between-cluster differences.
DfClustering
A technique for partitioning a dataset into groups (clusters) such that observations within the same group are more similar to each other than to those in other groups.
k-Means Clustering
k-Means partitions data into k clusters by minimizing the within-cluster sum of squares.
k-Means Objective
Here,
- =Number of clusters
- =Set of observations assigned to cluster k
- =Centroid (mean) of cluster k
- =Observation i
How k-Means Works
-
Randomly initialize k centroids
-
Assign each point to the nearest centroid
-
Recompute centroids as the mean of assigned points
-
Repeat steps 2-3 until convergence
Choosing k: The Elbow Method
Plot against and look for the "elbow" where the rate of decrease slows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
inertias = []
K_range = range(1, 10)
for k in K_range:
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(X)
inertias.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()
Hierarchical Clustering
Builds a dendrogram — a tree-like hierarchy of clusters.
Agglomerative (Bottom-Up)
-
Start with each point as its own cluster
-
Merge the two closest clusters
-
Repeat until one cluster remains
Distance Metrics
| Metric | Formula | When to Use |
|--------|---------|-------------|
| Euclidean | | Continuous data |
| Manhattan | | High-dimensional data |
| Cosine | | Text data |
Linkage Methods
| Linkage | Definition |
|---------|-----------|
| Single | Minimum distance between any two points in different clusters |
| Complete | Maximum distance between any two points in different clusters |
| Average | Mean distance between all pairs of points |
| Ward | Minimizes total within-cluster variance |
Ward's Method
Ward's linkage tends to produce compact, equally-sized clusters and is often the best default choice for Euclidean distance data.
DBSCAN
Density-Based Spatial Clustering of Applications with Noise finds clusters of arbitrary shape and identifies outliers.
DBSCAN Parameters
Here,
- =Radius of the neighborhood
- =Set of points within distance e of x
- =Minimum number of points to form a dense region
| Point Type | Condition |
|-----------|-----------|
| Core | Has at least MinPts within e |
| Border | Within e of a core point but has < MinPts neighbors |
| Noise | Neither core nor border — treated as outlier |
DBSCAN Sensitivity
DBSCAN performance is highly sensitive to the choice of e and MinPts. Use a k-distance plot to find a good e value.
Cluster Evaluation
Silhouette Score
Measures how similar each point is to its own cluster vs. other clusters.
Silhouette Coefficient
Here,
- =Mean distance from i to other points in the same cluster
- =Mean distance from i to points in the nearest neighboring cluster
- =Silhouette coefficient, ranges from -1 to 1
| Score | Interpretation |
|-------|---------------|
| 0.7 - 1.0 | Strong structure |
| 0.5 - 0.7 | Reasonable structure |
| 0.25 - 0.5 | Weak structure |
| < 0.25 | No substantial structure |
Python Implementation
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# k-Means
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_km = km.fit_predict(X)
print(f"k-Means Silhouette: {silhouette_score(X, labels_km):.3f}")
# Hierarchical
hc = AgglomerativeClustering(n_clusters=4)
labels_hc = hc.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hc):.3f}")
# DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
labels_db = db.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters")
if n_clusters > 1:
mask = labels_db != -1
print(f"DBSCAN Silhouette: {silhouette_score(X[mask], labels_db[mask]):.3f}")
Comparison
| Method | Pros | Cons |
|--------|------|------|
| k-Means | Fast, simple | Must specify k; assumes spherical clusters |
| Hierarchical | Dendrogram; no k needed | Slow for large datasets; irreversible merges |
| DBSCAN | Arbitrary shapes; handles outliers | Sensitive to parameters; struggles with varying densities |
Key Takeaways
Summary: Cluster Analysis
-
k-Means is fast and effective but requires specifying k and assumes convex clusters
-
Hierarchical clustering produces a dendrogram and does not require k upfront
-
DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers
-
Use the elbow method and silhouette score to evaluate cluster quality
-
Ward's linkage is a robust default for hierarchical clustering
-
Always standardize variables before clustering to avoid scale effects
Related Topics
-
See Dimensionality Reduction for preprocessing before clustering
-
See Gaussian Mixture Models for probabilistic clustering
-
See k-Nearest Neighbors for distance-based classification