Cluster Analysis — k-Means, Hierarchical, DBSCAN

Statistics

Grouping Similar Observations Without Labels

Cluster analysis partitions data into groups where members are more similar to each other than to outsiders. Each algorithm — k-means, hierarchical, DBSCAN — offers different strengths for different data structures.

Customer Segmentation — Group buyers by purchasing behavior for targeted campaigns
Genomics — Classify genes with similar expression patterns into functional groups
Anomaly Detection — Identify fraud or defects as points that don't belong to any cluster

Finding groups that nature created but didn't label is the essence of unsupervised discovery.

Cluster analysis groups similar observations together without predefined labels. The goal is to maximize within-cluster similarity and between-cluster differences.

DfClustering

A technique for partitioning a dataset into groups (clusters) such that observations within the same group are more similar to each other than to those in other groups.

k-Means Clustering

k-Means partitions data into k clusters by minimizing the within-cluster sum of squares.

k-Means Objective

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Here,

$K$ =Number of clusters
$C_k$ =Set of observations assigned to cluster k
$\mu_k$ =Centroid (mean) of cluster k
$x_i$ =Observation i

How k-Means Works

Randomly initialize k centroids
Assign each point to the nearest centroid
Recompute centroids as the mean of assigned points
Repeat steps 2-3 until convergence

Choosing k: The Elbow Method

Plot $J(k)$ against $k$ and look for the "elbow" where the rate of decrease slows.


import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs



X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)



inertias = []

K_range = range(1, 10)

for k in K_range:

    km = KMeans(n_clusters=k, random_state=42, n_init=10)

    km.fit(X)

    inertias.append(km.inertia_)



plt.figure(figsize=(8, 4))

plt.plot(K_range, inertias, 'bo-')

plt.xlabel('Number of Clusters (k)')

plt.ylabel('Inertia (WCSS)')

plt.title('Elbow Method')

plt.show()

Hierarchical Clustering

Builds a dendrogram — a tree-like hierarchy of clusters.

Agglomerative (Bottom-Up)

Start with each point as its own cluster
Merge the two closest clusters
Repeat until one cluster remains

Distance Metrics

| Metric | Formula | When to Use |

|--------|---------|-------------|

| Euclidean | $d(x,y) = \sqrt{\sum(x_i - y_i)^2}$ | Continuous data |

| Cosine | $d(x,y) = 1 - \frac{x \cdot y}{\|x\|\|y\|}$ | Text data |

Linkage Methods

| Linkage | Definition |

|---------|-----------|

| Single | Minimum distance between any two points in different clusters |

| Complete | Maximum distance between any two points in different clusters |

| Average | Mean distance between all pairs of points |

| Ward | Minimizes total within-cluster variance |

Ward's Method

Ward's linkage tends to produce compact, equally-sized clusters and is often the best default choice for Euclidean distance data.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise finds clusters of arbitrary shape and identifies outliers.

DBSCAN Parameters

\text{Core point: } |N_\varepsilon(x)| \geq \text{MinPts}

Here,

$\varepsilon$ =Radius of the neighborhood
$N_\varepsilon(x)$ =Set of points within distance e of x
$MinPts$ =Minimum number of points to form a dense region

| Point Type | Condition |

|-----------|-----------|

| Core | Has at least MinPts within e |

| Border | Within e of a core point but has < MinPts neighbors |

| Noise | Neither core nor border — treated as outlier |

DBSCAN Sensitivity

DBSCAN performance is highly sensitive to the choice of e and MinPts. Use a k-distance plot to find a good e value.

Cluster Evaluation

Silhouette Score

Measures how similar each point is to its own cluster vs. other clusters.

Silhouette Coefficient

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Here,

$a(i)$ =Mean distance from i to other points in the same cluster
$b(i)$ =Mean distance from i to points in the nearest neighboring cluster
$s(i)$ =Silhouette coefficient, ranges from -1 to 1

| Score | Interpretation |

|-------|---------------|

| 0.7 - 1.0 | Strong structure |

| 0.5 - 0.7 | Reasonable structure |

| 0.25 - 0.5 | Weak structure |

| < 0.25 | No substantial structure |

Python Implementation


import numpy as np

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

from sklearn.metrics import silhouette_score

from sklearn.datasets import make_blobs



X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)



# k-Means

km = KMeans(n_clusters=4, random_state=42, n_init=10)

labels_km = km.fit_predict(X)

print(f"k-Means Silhouette: {silhouette_score(X, labels_km):.3f}")



# Hierarchical

hc = AgglomerativeClustering(n_clusters=4)

labels_hc = hc.fit_predict(X)

print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hc):.3f}")



# DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)

labels_db = db.fit_predict(X)

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)

print(f"DBSCAN found {n_clusters} clusters")

if n_clusters > 1:

    mask = labels_db != -1

    print(f"DBSCAN Silhouette: {silhouette_score(X[mask], labels_db[mask]):.3f}")

Comparison

| Method | Pros | Cons |

|--------|------|------|

| k-Means | Fast, simple | Must specify k; assumes spherical clusters |

| Hierarchical | Dendrogram; no k needed | Slow for large datasets; irreversible merges |

| DBSCAN | Arbitrary shapes; handles outliers | Sensitive to parameters; struggles with varying densities |

Key Takeaways

Summary: Cluster Analysis

k-Means is fast and effective but requires specifying k and assumes convex clusters
Hierarchical clustering produces a dendrogram and does not require k upfront
DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers
Use the elbow method and silhouette score to evaluate cluster quality
Ward's linkage is a robust default for hierarchical clustering
Always standardize variables before clustering to avoid scale effects

Cluster Analysis — k-Means, Hierarchical, DBSCAN

Cluster Analysis — k-Means, Hierarchical, DBSCAN

Grouping Similar Observations Without Labels

DfClustering

k-Means Clustering

k-Means Objective

Choosing k: The Elbow Method

Hierarchical Clustering

Agglomerative (Bottom-Up)

Distance Metrics

Linkage Methods

DBSCAN

DBSCAN Parameters

Cluster Evaluation

Silhouette Score

Silhouette Coefficient

Python Implementation

Comparison

Key Takeaways

Summary: Cluster Analysis

Related Topics

Premium Content

Need Expert Statistics Help?