Grouping the Ungrouped — Finding Hidden Structure in Data
Clustering algorithms discover natural groupings in data without labels. They are essential for customer segmentation, anomaly detection, and exploratory analysis.
- K-Means — Fast partitioning into K clusters using centroids
- DBSCAN — Density-based clustering that finds arbitrary shapes
- Hierarchical — Building dendrograms for multi-level groupings
"The greatest value of a picture is when it forces us to notice what we never expected to see."
Clustering — Complete Guide
Clustering groups similar data points together without labels. It is the most common unsupervised learning task.
K-Means Clustering
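Definition: K-Means Clustering
Given a dataset $X = \{x_1, \dots, x_n\}$ and a number of clusters $K$, K-Means minimizes the within-cluster sum of squares (WCSS):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k$ is the centroid of cluster $C_k$.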
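The WCSS objective is commonly minimized with Lloyd's algorithm, which alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its points. The following is a minimal NumPy sketch under stated assumptions, not how any particular library implements it: the function name `kmeans_lloyd`, the plain random initialization, and the toy data `X_demo` are all illustrative choices.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize by sampling k distinct data points (not k-means++)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: squared distance of every point to its cluster's centroid, summed
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Toy data (hypothetical): two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
labels, centroids, wcss = kmeans_lloyd(X_demo, k=2)
print(f"WCSS: {wcss:.3f}")
print("Centroids:\n", centroids)
```

Lloyd's algorithm only finds a local minimum of the WCSS, which is why practical implementations typically restart from several initializations and keep the best run.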
DBSCAN
Definition: DBSCAN (Density-Based Spatial Clustering)
DBSCAN groups together points that are closely packed (high density) and marks points that lie alone in low-density regions as outliers. Parameters: $\varepsilon$ (neighborhood radius) and minPts (minimum number of points needed to form a dense region).
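To make the density idea concrete, here is a small NumPy sketch of the DBSCAN procedure: find core points (points with enough neighbors inside the radius), then grow clusters outward from them. The name `dbscan_sketch`, its arguments `eps` and `min_pts` (standing for the neighborhood radius and the minimum-points parameter), and the grid data are assumptions for illustration. It uses a full pairwise distance matrix, so it only suits small inputs.

```python
import numpy as np
from collections import deque

def dbscan_sketch(X, eps, min_pts):
    """Label points with cluster ids 0, 1, ...; noise points get -1."""
    n = len(X)
    # Pairwise Euclidean distances (O(n^2) memory; fine for small demos only)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A core point has at least min_pts points (itself included) within eps
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    # Only core points keep expanding; border points stop here
                    if is_core[q]:
                        queue.append(q)
        cluster_id += 1
    return labels

# Toy data (hypothetical): two 5x5 grids with spacing 0.1, plus one far-away point
g = np.linspace(0.0, 0.4, 5)
grid = np.array([(x, y) for x in g for y in g])
X_demo = np.vstack([grid, grid + 3.0, [[10.0, 10.0]]])

labels = dbscan_sketch(X_demo, eps=0.15, min_pts=4)
print(labels)  # first grid -> 0, second grid -> 1, isolated point -> -1
```

Points that are never reached from a core point keep the label -1, which matches the convention scikit-learn's `DBSCAN` uses for noise.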
Hierarchical Clustering
Definition: Agglomerative Hierarchical Clustering
A bottom-up approach: start with each point as its own cluster, then repeatedly merge the closest pair of clusters until only $K$ clusters remain. The distance between clusters is defined by a linkage criterion.
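The merge process can be tried out with SciPy's hierarchical clustering utilities. This is a sketch on hypothetical toy data (`X_demo` is six 1-D points forming three obvious pairs); `linkage` records the full merge history and `fcluster` turns it into flat cluster labels, either by target cluster count or by a distance threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (an assumption for illustration): three tight pairs on a line
X_demo = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])

# Z holds the merge history: each row is
# (cluster a, cluster b, merge distance, size of the new cluster)
Z = linkage(X_demo, method="average")

# Option 1: keep merging until at most 3 clusters remain
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut the dendrogram at height 1.0, so clusters that would only
# join at a larger distance stay separate
labels_h = fcluster(Z, t=1.0, criterion="distance")

print(labels_k)
print(labels_h)
```

Both cuts recover the three pairs here because each pair merges at a distance of about 0.1, while merges between pairs happen at distances of 5 or more.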
Linkage Criteria
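- Single: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$ (nearest pair of points)
- Complete: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$ (farthest pair of points)
- Average (UPGMA): $d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$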
- Ward: Minimizes within-cluster variance after merging
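A short worked example may help compare the criteria. The sketch below computes all four for two tiny hand-picked clusters `A` and `B` (hypothetical data). For Ward, it measures the increase in within-cluster sum of squares caused by the merge, which is one common way to express that criterion.

```python
import numpy as np

# Two small clusters on the x-axis (toy data, an assumption for illustration)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances d(a, b) for a in A, b in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()     # nearest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all |A| * |B| pairs

def sse(C):
    """Sum of squared distances from each point to the cluster mean."""
    return ((C - C.mean(axis=0)) ** 2).sum()

# Ward: how much the total within-cluster sum of squares grows if A and B merge
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

print(f"single={single:.2f}")          # single=2.00
print(f"complete={complete:.2f}")      # complete=5.00
print(f"average={average:.2f}")        # average=3.50
print(f"ward_increase={ward_increase:.2f}")  # ward_increase=12.25
```

Single linkage tends to chain clusters together through close pairs, while complete and Ward linkage favor compact clusters, so the choice changes which pair is merged next.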
Evaluation Metrics
Silhouette Score
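$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
Here,
- $a(i)$ = mean intra-cluster distance for point $i$
- $b(i)$ = mean nearest-cluster distance for point $i$ (mean distance to the points of the closest other cluster)
- $s(i)$ = silhouette coefficient ∈ [−1, 1]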
Interpreting Silhouette Score
- $s(i) \approx 1$: Well-clustered (tight, well-separated)
- $s(i) \approx 0$: On cluster boundary
- $s(i) < 0$: Likely assigned to wrong cluster
- Overall: $\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)$; higher is better
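The silhouette coefficient can be computed by hand to see where the interpretation comes from. In this sketch, `a` is the mean distance from a point to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette_values` and the four-point data set are illustrative assumptions; it expects at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette coefficient, computed from pairwise distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # common convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        # (the sum includes d(i, i) = 0, so divide by cluster size minus one)
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy data (hypothetical): two tight pairs on a line
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])

s_good = silhouette_values(X_demo, [0, 0, 1, 1])  # sensible grouping
s_bad = silhouette_values(X_demo, [0, 1, 0, 1])   # deliberately scrambled

print(f"good labels: mean s = {s_good.mean():.3f}")  # 0.900
print(f"bad labels:  mean s = {s_bad.mean():.3f}")   # -0.450
```

With the scrambled labels every point is closer to the other cluster than to its own, so every coefficient is negative.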
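from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Placeholder data (an assumption for illustration): replace X_raw with your own feature matrix
X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Scale features so that no single feature dominates the distance computations
X = StandardScaler().fit_transform(X_raw)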
# K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters, noise: {(labels_db == -1).sum()}")
# Hierarchical
hier = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")
Key Takeaways
Summary: Clustering
- K-Means minimizes WCSS — fast, but assumes spherical clusters
- DBSCAN finds arbitrary shapes and handles outliers (parameters: $\varepsilon$, minPts)
- Hierarchical produces interpretable dendrograms: cut at height $h$ to get $K$ clusters
- Silhouette score is the most common internal evaluation metric
- Always scale features before clustering (especially K-Means, DBSCAN)
- Use the elbow method or silhouette analysis to choose $K$
- Clustering is exploratory — results require domain interpretation
- No single algorithm works best for all data distributions
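As a rough illustration of the elbow method and silhouette analysis mentioned above, the sketch below fits K-Means for several candidate cluster counts and records both the WCSS (exposed by scikit-learn as `inertia_`) and the mean silhouette score. The synthetic data, the three hand-picked blob centers, and the candidate range 2 to 6 are assumptions chosen for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (hypothetical): three well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)],
    cluster_std=0.5,
    random_state=0,
)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the WCSS of the fitted model; look for the "elbow" where it flattens
    results[k] = (km.inertia_, silhouette_score(X_demo, km.labels_))
    print(f"k={k}  WCSS={results[k][0]:9.2f}  silhouette={results[k][1]:.3f}")

# Silhouette analysis: pick the k with the highest mean silhouette score
best_k = max(results, key=lambda k: results[k][1])
print(f"Best k by silhouette: {best_k}")
```

WCSS keeps shrinking as the cluster count grows, so it cannot be minimized directly; the elbow is the point where the decrease slows sharply. On real data the two criteria can disagree, and the final choice usually needs domain judgment.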
What to Learn Next
-> Dimensionality Reduction: Reduce features while preserving information with PCA and t-SNE.
-> KNN: Instance-based learning where your neighbors tell the story.
-> Recommendation Systems: Collaborative and content-based filtering for personalized experiences.