Supervised Learning

If-Then Rules That Learn — The Most Interpretable Algorithm

Decision trees split data using simple if-then-else rules. They are easy to visualize, handle mixed data types, and form the basis for powerful ensemble methods.

Gini Impurity — Measuring node purity for optimal splits
Information Gain — Entropy-based splitting criterion
Pruning — Preventing overfitting by limiting tree complexity

"A decision tree is the only ML algorithm that can be explained to your grandmother."

Decision Trees — Complete Guide

Decision trees make predictions by learning simple rules from data — like a flowchart of if-then-else decisions.

How Decision Trees Work

Definition: Decision Tree

A decision tree recursively partitions the feature space $\mathcal{X}$ into axis-aligned regions $R_1, R_2, \ldots, R_M$ by learning decision rules from data. Each leaf corresponds to one region $R_m$: for classification it predicts the majority class of the training samples in $R_m$; for regression, the mean target value of those samples.
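To make "axis-aligned regions" concrete, here is a minimal sketch of a depth-2 tree written by hand as nested rules. The feature names `x1` and `x2`, the thresholds, and the class labels are all invented for illustration; a real tree would learn them from data.

```python
def predict(x1: float, x2: float) -> str:
    """Hand-written depth-2 tree; thresholds and labels are hypothetical."""
    if x1 <= 2.5:
        return "A"   # region R_1: x1 <= 2.5
    if x2 <= 1.7:
        return "B"   # region R_2: x1 > 2.5 and x2 <= 1.7
    return "C"       # region R_3: x1 > 2.5 and x2 > 1.7


points = [(1.0, 0.3), (4.0, 1.2), (6.0, 2.1)]
labels = [predict(x1, x2) for x1, x2 in points]
print(labels)  # ['A', 'B', 'C']
```

Each test compares a single feature against a threshold, which is why every region is a rectangle whose sides are parallel to the feature axes.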

Splitting Criteria

Gini Impurity

Definition: Gini Impurity

Gini impurity measures how often a randomly chosen element of a node would be mislabeled if it were labeled at random according to the node's class distribution. For a node $t$ with class proportions $p_1, p_2, \ldots, p_K$:

$$
\text{Gini}(t) = 1 - \sum_{i=1}^{K} p_i^2 = \sum_{i=1}^{K} p_i(1-p_i)
$$

Range: $[0, 1-1/K]$ where 0 = pure, $1-1/K$ = maximally impure.
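The formula can be checked with a few lines of Python. This is a minimal sketch, not a library function: the helper name `gini` and the choice to return 0 for an empty node are assumptions made here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here for an empty node
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["a"] * 10))             # pure node -> 0.0
print(gini(["a"] * 5 + ["b"] * 5))  # 50/50 binary node -> 0.5
```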

Information Gain (Entropy)

Definition: Entropy

Entropy measures disorder in a set. For binary classification with proportion $p$ of positive class:

$$
H(p) = -p\log_2(p) - (1-p)\log_2(1-p)
$$

For $K$ classes: $H = -\sum_{i=1}^{K} p_i \log_2(p_i)$, with the convention $0 \log_2 0 = 0$. Range: $[0, \log_2 K]$, where 0 = pure and $\log_2 K$ = uniform.
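A short sketch of the $K$-class entropy, under the same assumptions as before (the helper name `entropy` is chosen here; it is not a library function). Classes that do not occur in the node are skipped, which amounts to treating $0 \log_2 0$ as 0.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here for an empty node
    # Counter only holds classes with count > 0, so log2 is never given 0.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy(["a", "a", "a", "a"]))  # pure node
print(entropy(["a", "b"]))            # 50/50 binary node
print(entropy(["a", "b", "c"]))       # uniform over 3 classes, equals log2(3)
```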

Definition: Information Gain

The reduction in entropy obtained by splitting node $t$ on feature $A$:

$$
\text{IG}(t, A) = H(t) - \sum_{j=1}^{V} \frac{|t_j|}{|t|} H(t_j)
$$

where $t_1, \ldots, t_V$ are the child nodes produced by the split on $A$ and $|t_j|$ is the number of samples in child $t_j$.
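As an illustration, the sketch below computes information gain for two hand-made splits of the same parent node: one that separates the classes perfectly and one that leaves the class mix unchanged. The function names and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """H(parent) minus the size-weighted entropy of the children.

    `children` is a list of label lists that together partition `parent`.
    """
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children if ch)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4
perfect = [["yes"] * 4, ["no"] * 4]                           # both children pure
useless = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]  # same mix as parent

ig_perfect = information_gain(parent, perfect)
ig_useless = information_gain(parent, useless)
print(ig_perfect, ig_useless)
```

The perfect split recovers the full 1 bit of parent entropy, while the split that preserves the 50/50 mix in both children gains nothing.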

Example: Gini vs Entropy

Pure node (all same class): Gini = $1 - (1^2) = 0$, Entropy = $0$

50/50 split (binary): Gini = $1 - (0.5^2 + 0.5^2) = 0.5$, Entropy = $1.0$

3-way uniform (K=3): Gini = $1 - 3(1/3)^2 = 2/3 \approx 0.667$, Entropy = $\log_2(3) \approx 1.585$

CART Algorithm

Definition: CART (Classification and Regression Trees)

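CART builds a binary tree by greedily choosing, at each node $t$ with $N$ samples, the feature $A$ and threshold $v$ that minimize the weighted impurity of the two children:

$$
(A^*, v^*) = \arg\min_{A,v} \left[\frac{N_{\text{left}}}{N} G(t_{\text{left}}) + \frac{N_{\text{right}}}{N} G(t_{\text{right}})\right]
$$

where $G$ is Gini impurity (classification) or MSE (regression), and $N_{\text{left}}$, $N_{\text{right}}$ are the numbers of samples sent to each child. The procedure is repeated recursively on each child until a stopping criterion is met (for example, a maximum depth or a minimum number of samples per node).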
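The greedy search at a single node can be sketched as follows for one numeric feature. This is a simplified illustration rather than a full CART implementation: it assumes candidate thresholds are the midpoints between consecutive distinct values, handles only one feature, and does not recurse. The function name `best_threshold` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (v, score): the threshold minimizing the weighted Gini of the
    children {x <= v} and {x > v}, and that minimum weighted Gini."""
    n = len(values)
    best_v, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values,
    # so both children are always non-empty.
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
v, score = best_threshold(values, labels)
print(v, score)  # 6.5 0.0
```

A full tree builder would run this search over every feature, keep the best (feature, threshold) pair, and then apply the same step to each child.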

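from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Fix random_state so the split (and the reported accuracy) is reproducible;
# stratify keeps the class proportions equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# A shallow tree (max_depth=3) that uses Gini impurity as the splitting criterion.
tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")  # mean accuracy on the test set

# Print the learned rules as indented if/else text.
print(export_text(tree, feature_names=iris.feature_names))

# Impurity-based (MDI) feature importances; they sum to 1.
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")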

Pruning

Definition: Cost-Complexity Pruning

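Cost-complexity pruning selects the subtree $T$ that minimizes $R_\alpha(T) = R(T) + \alpha|\tilde{T}|$, where $R(T)$ is the misclassification rate on the training data, $|\tilde{T}|$ is the number of leaf nodes, and $\alpha \ge 0$ is the complexity parameter controlling the trade-off between accuracy and tree size. With $\alpha = 0$ the full tree is kept; larger values of $\alpha$ favor smaller trees.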
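The trade-off is easiest to see with numbers. The sketch below scores three candidate subtrees with $R_\alpha(T) = R(T) + \alpha|\tilde{T}|$ and picks the minimizer for several values of $\alpha$. The subtrees, their error rates, and their leaf counts are made up for illustration; they do not come from a fitted model.

```python
def cost_complexity(error_rate: float, n_leaves: int, alpha: float) -> float:
    """R_alpha(T) = R(T) + alpha * (number of leaves)."""
    return error_rate + alpha * n_leaves


# Hypothetical candidates: (name, training misclassification rate, leaves).
candidates = [
    ("full", 0.00, 12),
    ("medium", 0.04, 5),
    ("stump", 0.20, 2),
]


def select(alpha: float) -> str:
    """Name of the candidate subtree with the smallest R_alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))[0]


for alpha in (0.0, 0.01, 0.1):
    print(alpha, select(alpha))
```

With $\alpha = 0$ the largest tree wins because only training error counts; as $\alpha$ grows, each extra leaf costs more and the selection moves toward smaller trees.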

Pruning Hyperparameters

max_depth: Maximum tree depth (typically 3-10); pre-pruning
min_samples_split: Minimum samples to split a node (typically 2-20); pre-pruning
min_samples_leaf: Minimum samples in a leaf (typically 1-10); pre-pruning
ccp_alpha: Cost-complexity parameter $\alpha$; post-pruning, applied after the tree is grown (0 disables it)

Feature Importance

Definition: Feature Importance (MDI)

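Mean Decrease in Impurity: the importance of feature $j$ is the total reduction in Gini (or entropy) over all nodes that split on $j$, each weighted by the fraction of training samples reaching that node:

$$
\text{FI}_j = \sum_{t \in \text{splits on } j} \frac{|R_t|}{N} \Delta G(t)
$$

where $|R_t|$ is the number of samples reaching node $t$, $N$ is the total number of training samples, and $\Delta G(t)$ is the impurity of $t$ minus the sample-weighted impurity of its children. scikit-learn's feature_importances_ reports these values normalized to sum to 1.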
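The sum can be worked through by hand on a tiny tree. In the sketch below, $\Delta G(t)$ is taken to be the Gini impurity of node $t$ minus the sample-weighted Gini of its two children. The tree itself (two splits on hypothetical features `x1` and `x2`, with the listed class counts) is invented for illustration.

```python
def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


# Hypothetical tree on N = 100 samples, described by its internal nodes:
# (feature, class counts at node, counts in left child, counts in right child).
splits = [
    ("x1", [50, 50], [40, 0], [10, 50]),  # root
    ("x2", [10, 50], [10, 0], [0, 50]),   # right child of the root
]
N = 100

importance = {}
for feature, node, left, right in splits:
    n_t, n_l, n_r = sum(node), sum(left), sum(right)
    # Impurity decrease at this node: parent Gini minus weighted child Gini.
    delta = gini(node) - (n_l / n_t) * gini(left) - (n_r / n_t) * gini(right)
    # Weight by the fraction of all samples that reach this node.
    importance[feature] = importance.get(feature, 0.0) + (n_t / N) * delta

total = sum(importance.values())
normalized = {f: v / total for f, v in importance.items()}
print(importance)
print(normalized)
```

Here `x1` receives 1/3 and `x2` receives 1/6 before normalization. Their sum, 0.5, equals the root impurity because every leaf in this toy tree is pure.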

Key Takeaways

Summary: Decision Trees

Decision trees partition $\mathcal{X}$ into axis-aligned regions using if-then-else rules
Splits are scored with Gini impurity $= 1 - \sum_i p_i^2$ or Information Gain $= H(\text{parent}) - \sum_j \frac{N_j}{N} H(\text{child}_j)$
CART builds binary trees greedily — locally optimal splits, not globally optimal
Pruning (cost-complexity $R_\alpha(T) = R(T) + \alpha|\tilde{T}|$) prevents overfitting
Feature importance shows which features drive predictions (MDI or permutation-based)
Decision trees can handle mixed data types (numerical + categorical) in principle; scikit-learn's trees require categorical features to be encoded as numbers first
Non-parametric — no assumptions about data distribution
Unstable: small changes in the training data can produce very different trees, so ensembles (Random Forest, XGBoost) are often used instead

What to Learn Next

-> Random Forest: Ensemble of decision trees for better accuracy and stability.

-> XGBoost: A highly optimized gradient boosting library built on decision trees.

-> Ensemble Methods: Bagging, boosting, and stacking for stronger models.
