If-Then Rules That Learn — The Most Interpretable Algorithm
Decision trees split data using simple if-then-else rules. They are easy to visualize, handle mixed data types, and form the basis for powerful ensemble methods.
- Gini Impurity — Measuring node purity for optimal splits
- Information Gain — Entropy-based splitting criterion
- Pruning — Preventing overfitting by limiting tree complexity
"A decision tree is the only ML algorithm that can be explained to your grandmother."
Decision Trees — Complete Guide
Decision trees make predictions by learning simple rules from data — like a flowchart of if-then-else decisions.
How Decision Trees Work
DfDecision Tree
A decision tree recursively partitions the feature space into axis-aligned regions by learning decision rules from data. For classification, each leaf node predicts the majority class; for regression, the mean of training samples in .
Splitting Criteria
Gini Impurity
DfGini Impurity
Gini impurity measures how often a randomly chosen element would be incorrectly labeled. For a node with class proportions :
Range: where 0 = pure, = maximally impure.
Information Gain (Entropy)
DfEntropy
Entropy measures disorder in a set. For binary classification with proportion of positive class:
For classes: Range: where 0 = pure, = uniform.
DfInformation Gain
The reduction in entropy after splitting on feature :
where are the child nodes after splitting on .
Example: Gini vs Entropy
Pure node (all same class): Gini = , Entropy =
50/50 split (binary): Gini = , Entropy =
3-way uniform (K=3): Gini = , Entropy =
CART Algorithm
DfCART (Classification and Regression Trees)
CART builds a binary tree by greedily choosing the split that minimizes:
where is Gini impurity (classification) or MSE (regression). This is repeated recursively until stopping criteria are met.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=iris.feature_names))
for name, imp in zip(iris.feature_names, tree.feature_importances_):
print(f"{name}: {imp:.3f}")
Pruning
DfCost-Complexity Pruning
Pruning minimizes: where is the misclassification rate, is the number of leaf nodes, and is the complexity parameter controlling the trade-off between accuracy and tree size.
Pre-pruning Hyperparameters
max_depth: Maximum tree depth (typically 3-10)min_samples_split: Minimum samples to split a node (typically 2-20)min_samples_leaf: Minimum samples in a leaf (typically 1-10)ccp_alpha: Cost-complexity pruning parameter
Feature Importance
DfFeature Importance (MDI)
Mean Decrease in Impurity: the importance of feature is the total reduction in Gini (or entropy) weighted by the fraction of samples reaching each node:
Key Takeaways
Summary: Decision Trees
- Decision trees partition into axis-aligned regions using if-then-else rules
- Gini impurity or Information Gain
- CART builds binary trees greedily — locally optimal splits, not globally optimal
- Pruning (cost-complexity ) prevents overfitting
- Feature importance shows which features drive predictions (MDI or permutation-based)
- Decision trees handle mixed data types (numerical + categorical) natively
- Non-parametric — no assumptions about data distribution
- Unstable — small data changes create very different trees → use ensembles (RF, XGBoost)
What to Learn Next
-> Random Forest Ensemble of decision trees for better accuracy and stability.
-> XGBoost Gradient boosting taken to the extreme — state-of-the-art performance.
-> Ensemble Methods Bagging, boosting, and stacking for stronger models.