🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Decision Trees — Complete Guide with Visualizations

ML FoundationsClassification🟢 Free Lesson

Advertisement

Supervised Learning

If-Then Rules That Learn — The Most Interpretable Algorithm

Decision trees split data using simple if-then-else rules. They are easy to visualize, handle mixed data types, and form the basis for powerful ensemble methods.

  • Gini Impurity — Measuring node purity for optimal splits
  • Information Gain — Entropy-based splitting criterion
  • Pruning — Preventing overfitting by limiting tree complexity

"A decision tree is the only ML algorithm that can be explained to your grandmother."

Decision Trees — Complete Guide

Decision trees make predictions by learning simple rules from data — like a flowchart of if-then-else decisions.


How Decision Trees Work

DfDecision Tree

A decision tree recursively partitions the feature space X\mathcal{X} into axis-aligned regions R1,R2,,RMR_1, R_2, \ldots, R_M by learning decision rules from data. For classification, each leaf node RmR_m predicts the majority class; for regression, the mean of training samples in RmR_m.

Decision Tree Structure: Should I Play Tennis?Outlook?SunnyOvercastRainHumidity?Yes ✓Wind?HighNormalNo ✗Yes ✓WeakStrongYes ✓No ✗Feature Importance:Outlook > Humidity > Wind > TempRoot node split = highest information gain

Splitting Criteria

Gini Impurity

DfGini Impurity

Gini impurity measures how often a randomly chosen element would be incorrectly labeled. For a node with class proportions p1,p2,,pKp_1, p_2, \ldots, p_K:

Gini(t)=1i=1Kpi2=i=1Kpi(1pi)\text{Gini}(t) = 1 - \sum_{i=1}^{K} p_i^2 = \sum_{i=1}^{K} p_i(1-p_i)

Range: [0,11/K][0, 1-1/K] where 0 = pure, 11/K1-1/K = maximally impure.

Information Gain (Entropy)

DfEntropy

Entropy measures disorder in a set. For binary classification with proportion pp of positive class:

H(p)=plog2(p)(1p)log2(1p)H(p) = -p\log_2(p) - (1-p)\log_2(1-p)

For KK classes: H=i=1Kpilog2(pi)H = -\sum_{i=1}^{K} p_i \log_2(p_i) Range: [0,log2K][0, \log_2 K] where 0 = pure, log2K\log_2 K = uniform.

DfInformation Gain

The reduction in entropy after splitting on feature AA:

IG(t,A)=H(t)j=1VtjtH(tj)\text{IG}(t, A) = H(t) - \sum_{j=1}^{V} \frac{|t_j|}{|t|} H(t_j)

where tjt_j are the child nodes after splitting on AA.

Gini Impurity vs Entropy ComparisonGini Impurityp (proportion of class 1)010.5Gini = 1 − p² − (1−p)²Max at p=0.5Entropyp (proportion of class 1)011.0H = −p log₂(p) − (1−p) log₂(1−p)Max at p=0.5Both curves are concave, maximized at p=0.5. CART uses Gini; ID3/C4.5 use Entropy.

Example: Gini vs Entropy

Pure node (all same class): Gini = 1(12)=01 - (1^2) = 0, Entropy = 00

50/50 split (binary): Gini = 1(0.52+0.52)=0.51 - (0.5^2 + 0.5^2) = 0.5, Entropy = 1.01.0

3-way uniform (K=3): Gini = 13(1/3)2=2/30.6671 - 3(1/3)^2 = 2/3 \approx 0.667, Entropy = log2(3)1.585\log_2(3) \approx 1.585


CART Algorithm

DfCART (Classification and Regression Trees)

CART builds a binary tree by greedily choosing the split (A,v)(A, v) that minimizes:

minA,v[NleftNG(tleft)+NrightNG(tright)]\min_{A,v} \left[\frac{N_{\text{left}}}{N} G(t_{\text{left}}) + \frac{N_{\text{right}}}{N} G(t_{\text{right}})\right]

where GG is Gini impurity (classification) or MSE (regression). This is repeated recursively until stopping criteria are met.

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=iris.feature_names))

for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")

Pruning

DfCost-Complexity Pruning

Pruning minimizes: Rα(T)=R(T)+αT~R_\alpha(T) = R(T) + \alpha|\tilde{T}| where R(T)R(T) is the misclassification rate, T~|\tilde{T}| is the number of leaf nodes, and α\alpha is the complexity parameter controlling the trade-off between accuracy and tree size.

Pruning: Before and AfterBefore Pruning (Overfitting)RootNode ANode BLeafLeafLeafLeafTrain: 100% | Test: 70%7 leaf nodes — memorizes noiseAfter Pruning (Good Fit)RootLeaf ✓Leaf ✓Train: 95% | Test: 93%3 leaf nodes — better generalization

Pre-pruning Hyperparameters

  • max_depth: Maximum tree depth (typically 3-10)
  • min_samples_split: Minimum samples to split a node (typically 2-20)
  • min_samples_leaf: Minimum samples in a leaf (typically 1-10)
  • ccp_alpha: Cost-complexity pruning parameter

Feature Importance

DfFeature Importance (MDI)

Mean Decrease in Impurity: the importance of feature jj is the total reduction in Gini (or entropy) weighted by the fraction of samples reaching each node:

FIj=tsplits on jRtNΔG(t)\text{FI}_j = \sum_{t \in \text{splits on } j} \frac{|R_t|}{N} \Delta G(t)

Key Takeaways

Summary: Decision Trees

  1. Decision trees partition X\mathcal{X} into axis-aligned regions using if-then-else rules
  2. Gini impurity =1pi2= 1 - \sum p_i^2 or Information Gain =H(parent)NjNH(childj)= H(parent) - \sum \frac{N_j}{N}H(child_j)
  3. CART builds binary trees greedily — locally optimal splits, not globally optimal
  4. Pruning (cost-complexity Rα=R(T)+αT~R_\alpha = R(T) + \alpha|\tilde{T}|) prevents overfitting
  5. Feature importance shows which features drive predictions (MDI or permutation-based)
  6. Decision trees handle mixed data types (numerical + categorical) natively
  7. Non-parametric — no assumptions about data distribution
  8. Unstable — small data changes create very different trees → use ensembles (RF, XGBoost)

What to Learn Next

-> Random Forest Ensemble of decision trees for better accuracy and stability.

-> XGBoost Gradient boosting taken to the extreme — state-of-the-art performance.

-> Ensemble Methods Bagging, boosting, and stacking for stronger models.

Premium Content

Decision Trees — Complete Guide with Visualizations

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Machine Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement