🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Cluster Analysis — k-Means, Hierarchical, DBSCAN

StatisticsUnsupervised Learning🟢 Free Lesson

Advertisement

Cluster Analysis — k-Means, Hierarchical, DBSCAN

Statistics

Grouping Similar Observations Without Labels

Cluster analysis partitions data into groups where members are more similar to each other than to outsiders. Each algorithm — k-means, hierarchical, DBSCAN — offers different strengths for different data structures.

  • Customer Segmentation — Group buyers by purchasing behavior for targeted campaigns

  • Genomics — Classify genes with similar expression patterns into functional groups

  • Anomaly Detection — Identify fraud or defects as points that don't belong to any cluster

Finding groups that nature created but didn't label is the essence of unsupervised discovery.


Cluster analysis groups similar observations together without predefined labels. The goal is to maximize within-cluster similarity and between-cluster differences.

DfClustering

A technique for partitioning a dataset into groups (clusters) such that observations within the same group are more similar to each other than to those in other groups.


k-Means Clustering

k-Means partitions data into k clusters by minimizing the within-cluster sum of squares.

k-Means Objective

J=k=1KxiCkxiμk2J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

Here,

  • KK=Number of clusters
  • CkC_k=Set of observations assigned to cluster k
  • μk\mu_k=Centroid (mean) of cluster k
  • xix_i=Observation i

How k-Means Works

  1. Randomly initialize k centroids

  2. Assign each point to the nearest centroid

  3. Recompute centroids as the mean of assigned points

  4. Repeat steps 2-3 until convergence

Choosing k: The Elbow Method

Plot J(k)J(k) against kk and look for the "elbow" where the rate of decrease slows.


import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs



X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)



inertias = []

K_range = range(1, 10)

for k in K_range:

    km = KMeans(n_clusters=k, random_state=42, n_init=10)

    km.fit(X)

    inertias.append(km.inertia_)



plt.figure(figsize=(8, 4))

plt.plot(K_range, inertias, 'bo-')

plt.xlabel('Number of Clusters (k)')

plt.ylabel('Inertia (WCSS)')

plt.title('Elbow Method')

plt.show()

Hierarchical Clustering

Builds a dendrogram — a tree-like hierarchy of clusters.

Agglomerative (Bottom-Up)

  1. Start with each point as its own cluster

  2. Merge the two closest clusters

  3. Repeat until one cluster remains

Distance Metrics

| Metric | Formula | When to Use |

|--------|---------|-------------|

| Euclidean | d(x,y)=(xiyi)2d(x,y) = \sqrt{\sum(x_i - y_i)^2} | Continuous data |

| Manhattan | d(x,y)=xiyid(x,y) = \sum|x_i - y_i| | High-dimensional data |

| Cosine | d(x,y)=1xyxyd(x,y) = 1 - \frac{x \cdot y}{\|x\|\|y\|} | Text data |

Linkage Methods

| Linkage | Definition |

|---------|-----------|

| Single | Minimum distance between any two points in different clusters |

| Complete | Maximum distance between any two points in different clusters |

| Average | Mean distance between all pairs of points |

| Ward | Minimizes total within-cluster variance |

Ward's Method

Ward's linkage tends to produce compact, equally-sized clusters and is often the best default choice for Euclidean distance data.


DBSCAN

Density-Based Spatial Clustering of Applications with Noise finds clusters of arbitrary shape and identifies outliers.

DBSCAN Parameters

Core point: Nε(x)MinPts\text{Core point: } |N_\varepsilon(x)| \geq \text{MinPts}

Here,

  • ε\varepsilon=Radius of the neighborhood
  • Nε(x)N_\varepsilon(x)=Set of points within distance e of x
  • MinPtsMinPts=Minimum number of points to form a dense region

| Point Type | Condition |

|-----------|-----------|

| Core | Has at least MinPts within e |

| Border | Within e of a core point but has < MinPts neighbors |

| Noise | Neither core nor border — treated as outlier |

DBSCAN Sensitivity

DBSCAN performance is highly sensitive to the choice of e and MinPts. Use a k-distance plot to find a good e value.


Cluster Evaluation

Silhouette Score

Measures how similar each point is to its own cluster vs. other clusters.

Silhouette Coefficient

s(i)=b(i)a(i)max(a(i),b(i))s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Here,

  • a(i)a(i)=Mean distance from i to other points in the same cluster
  • b(i)b(i)=Mean distance from i to points in the nearest neighboring cluster
  • s(i)s(i)=Silhouette coefficient, ranges from -1 to 1

| Score | Interpretation |

|-------|---------------|

| 0.7 - 1.0 | Strong structure |

| 0.5 - 0.7 | Reasonable structure |

| 0.25 - 0.5 | Weak structure |

| < 0.25 | No substantial structure |


Python Implementation


import numpy as np

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

from sklearn.metrics import silhouette_score

from sklearn.datasets import make_blobs



X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)



# k-Means

km = KMeans(n_clusters=4, random_state=42, n_init=10)

labels_km = km.fit_predict(X)

print(f"k-Means Silhouette: {silhouette_score(X, labels_km):.3f}")



# Hierarchical

hc = AgglomerativeClustering(n_clusters=4)

labels_hc = hc.fit_predict(X)

print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hc):.3f}")



# DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)

labels_db = db.fit_predict(X)

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)

print(f"DBSCAN found {n_clusters} clusters")

if n_clusters > 1:

    mask = labels_db != -1

    print(f"DBSCAN Silhouette: {silhouette_score(X[mask], labels_db[mask]):.3f}")

Comparison

| Method | Pros | Cons |

|--------|------|------|

| k-Means | Fast, simple | Must specify k; assumes spherical clusters |

| Hierarchical | Dendrogram; no k needed | Slow for large datasets; irreversible merges |

| DBSCAN | Arbitrary shapes; handles outliers | Sensitive to parameters; struggles with varying densities |


Key Takeaways

Summary: Cluster Analysis

  • k-Means is fast and effective but requires specifying k and assumes convex clusters

  • Hierarchical clustering produces a dendrogram and does not require k upfront

  • DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers

  • Use the elbow method and silhouette score to evaluate cluster quality

  • Ward's linkage is a robust default for hierarchical clustering

  • Always standardize variables before clustering to avoid scale effects


Related Topics

Premium Content

Cluster Analysis — k-Means, Hierarchical, DBSCAN

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement