Ensemble Methods
Many Trees Make a Forest — The Power of Ensemble Learning
Random Forest builds hundreds of decision trees and merges their predictions to achieve higher accuracy and stability. By combining bagging with random feature selection, it reduces overfitting relative to a single tree, at the cost of some of the interpretability a single tree offers.
- Bootstrap Aggregating — reduces variance by averaging predictions from multiple trees trained on different data samples
- Random Feature Selection — decorrelates trees by considering only a subset of features at each split
- Out-of-Bag Evaluation — provides a free validation estimate without needing a separate holdout set
"The forest is much wiser than any single tree."
Random Forest — Complete Guide
Random Forest builds many decision trees and combines their predictions. It's one of the most popular and effective ML algorithms.
How Random Forest Works
Definition: Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
Definition: Bagging (Bootstrap Aggregating)
A technique where multiple models are trained on different random subsets of the training data (with replacement), then combined to produce a final prediction.
[Figure: Bootstrap Sampling Diagram]
Random Forest = Bagging + Random Feature Selection
Step 1: Bootstrap Sampling
Create N bootstrap samples, each drawn with replacement and the same size as the training set
Each sample contains ~63% of the distinct original data points
~37% are left out (out-of-bag samples)
Step 2: Train Decision Tree on Each Sample
At each split, consider only √p features (classification)
Or p/3 features (regression)
This decorrelates the trees
Step 3: Aggregate Predictions
Classification: Majority vote
Regression: Average
Why it works:
Each tree is different (bootstrap + random features)
Errors of individual trees cancel out
Combining reduces variance without increasing bias
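The three steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: the dataset, the tree count of 51 (odd, to avoid tied votes), and all variable names are assumptions made for the example. scikit-learn's own forest averages class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0/1)
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51                      # odd, so a majority vote cannot tie
n_train = X_train.shape[0]
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample, same size as the training set, with replacement
    idx = rng.integers(0, n_train, size=n_train)
    # Step 2: a tree that considers sqrt(p) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote across trees
votes = np.stack([t.predict(X_test) for t in trees])   # (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

forest_acc = (forest_pred == y_test).mean()
single_acc = np.mean([(t.predict(X_test) == y_test).mean() for t in trees])
print(f"Mean single-tree accuracy: {single_acc:.3f}")
print(f"Majority-vote accuracy:    {forest_acc:.3f}")
```

On data like this the vote usually beats the average individual tree, which is the "errors cancel out" effect in practice.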
[Figure: Parallel Forest Architecture]
[Figure: Feature Importance Diagram]
Mathematical Foundation
Variance Reduction via Bagging
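For $N$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:
$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right) = \rho\sigma^2 + \frac{1-\rho}{N}\sigma^2$$
As $N \to \infty$, the second term vanishes, leaving:
$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right) \to \rho\sigma^2$$
Key insight: Reducing $\rho$ (tree correlation) reduces ensemble variance. Random feature selection achieves this by pushing trees to split on different feature subsets.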
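The variance formula for an average of correlated predictors can be checked numerically. The sketch below does not train any trees: it assumes Gaussian stand-ins for tree predictions, built from a shared component plus an individual one so that each has variance `sigma2` and every pair has correlation `rho`. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho = 1.0, 0.3
n_trees, n_trials = 50, 50_000

# Shared + individual noise gives each column variance sigma2
# and every pair of columns correlation rho.
shared = rng.normal(size=(n_trials, 1))
individual = rng.normal(size=(n_trials, n_trees))
preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared
                           + np.sqrt(1 - rho) * individual)

# Variance of the ensemble average, measured across trials
empirical = preds.mean(axis=1).var()
# rho * sigma^2 + (1 - rho) / N * sigma^2
theoretical = rho * sigma2 + (1 - rho) / n_trees * sigma2

print(f"Empirical variance of the average: {empirical:.4f}")
print(f"Formula:                           {theoretical:.4f}")
print(f"Floor as N grows (rho * sigma^2):  {rho * sigma2:.4f}")
```

Raising `n_trees` moves both numbers toward `rho * sigma2` but never below it, which is why lowering the correlation matters more than adding trees past a certain point.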
Optimal Number of Features
For classification with $p$ total features, a common heuristic (a rule of thumb rather than a guaranteed optimum) is:
$$m \approx \sqrt{p}$$
For regression, $m \approx p/3$ typically works well.
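As a quick worked example of these rules of thumb, the hypothetical helper below computes the number of candidate features per split for a few values of p. Rounding down and never going below one feature are assumptions of this sketch, not something the rules themselves specify.

```python
import math

def features_per_split(p, task):
    """Hypothetical helper: heuristic number of features tried at each split."""
    if task == "classification":
        m = math.isqrt(p)      # floor(sqrt(p))
    elif task == "regression":
        m = p // 3             # floor(p / 3)
    else:
        raise ValueError(f"unknown task: {task!r}")
    return max(1, m)           # always consider at least one feature

for p in (4, 20, 100):
    print(p,
          features_per_split(p, "classification"),
          features_per_split(p, "regression"))
```

For the 20-feature dataset used in this guide, that means 4 features per split for classification and 6 for regression. In scikit-learn these choices correspond to `max_features='sqrt'` and a fractional value such as `max_features=1/3`.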
Python Implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
# Hold out 20% for testing; fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train
rf = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_depth=10,           # cap on tree depth
    min_samples_split=5,    # minimum samples needed to split a node
    min_samples_leaf=2,     # minimum samples in each leaf
    max_features='sqrt',    # features considered at each split
    random_state=42,
    n_jobs=-1               # use all available cores
)
rf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Feature importance (impurity-based, sums to 1)
importances = rf.feature_importances_
for i, imp in enumerate(importances):
    print(f"Feature {i}: {imp:.3f}")
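Printing all importances in index order is hard to read once there are many features. A small sketch of one way to rank them, using the same synthetic dataset as above; the spread across trees is computed from `rf.estimators_` and is only a rough indication of stability. Impurity-based importances can favor high-cardinality features, so treat the ranking as a starting point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importances = rf.feature_importances_
# Standard deviation of each feature's importance across individual trees
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

# Indices of features sorted from most to least important
order = np.argsort(importances)[::-1]
top5 = order[:5]
for i in top5:
    print(f"Feature {i}: {importances[i]:.3f} (+/- {std[i]:.3f})")
```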
Out-of-Bag (OOB) Evaluation
Definition: Out-of-Bag (OOB) Evaluation
A method of evaluating Random Forest models using the data not included in the bootstrap sample for each tree. About 37% of data is left out for each tree.
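The OOB error estimator:
$$\widehat{\mathrm{Err}}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{\mathrm{OOB}}\right)$$
where $\hat{y}_i^{\mathrm{OOB}}$ is the prediction for sample $i$ using only trees that did not include $i$ in their bootstrap sample, $n$ is the number of training samples, and $L$ is the loss (for example, zero-one loss for classification).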
Each tree sees ~63% of data
The remaining ~37% (OOB samples) can be used for evaluation
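Both the ~37% figure and the OOB error can be reproduced by hand. The sketch below is an illustration under stated assumptions: it uses plain `DecisionTreeClassifier` trees, binary 0/1 labels, zero-one loss, and hypothetical variable names. Each sample is scored only by the trees whose bootstrap sample missed it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=0)
n = X.shape[0]
rng = np.random.default_rng(0)
n_trees = 100

votes = np.zeros((n, 2))     # per-sample OOB vote counts for classes 0 and 1
oob_fractions = []

for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)      # bootstrap sample
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                      # never drawn => out-of-bag
    oob_fractions.append(oob.mean())

    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[idx], y[idx])

    # This tree votes only on its own out-of-bag samples
    pred = tree.predict(X[oob])
    votes[np.flatnonzero(oob), pred] += 1

has_vote = votes.sum(axis=1) > 0          # samples that were OOB at least once
oob_pred = votes.argmax(axis=1)
oob_error = np.mean(oob_pred[has_vote] != y[has_vote])   # zero-one loss

print(f"Mean OOB fraction per tree: {np.mean(oob_fractions):.3f}")
print(f"OOB error: {oob_error:.3f}")
```

The mean OOB fraction lands close to 1/e ≈ 0.368, the limit of (1 - 1/n)^n, which is where the 63%/37% split comes from.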
rf = RandomForestClassifier(oob_score=True)
rf.fit(X, y)
print(f"OOB Score: {rf.oob_score_:.3f}")
Advantage: No need for separate validation set!
Why OOB Evaluation is Useful
OOB evaluation provides a validation estimate at almost no extra cost, because every sample is scored only by the trees that never saw it during training. With enough trees, the OOB estimate is approximately equivalent to leave-one-out cross-validation.
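One way to see how close the OOB estimate gets to cross-validation is to compute both on the same data. This sketch uses 5-fold cross-validation instead of leave-one-out to keep the run short, so it is only a rough check; the dataset and the choice of 200 trees are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# OOB accuracy: one fit, no held-out data required
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X, y)

# 5-fold cross-validated accuracy: five separate fits
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    X, y, cv=5)

print(f"OOB score:       {rf.oob_score_:.3f}")
print(f"5-fold CV score: {cv_scores.mean():.3f}")
```

The two numbers typically agree to within a couple of percentage points, while the OOB score costs a single fit.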
Hyperparameters
n_estimators: Number of trees
More = better (up to a point)
100-500 is usually good
Diminishing returns after 500
max_depth: Maximum tree depth
None = unlimited (may overfit)
10-30 is usually good
Deeper = more complex
min_samples_split: Minimum samples to split
2 = default (grow fully)
5-20 = more regularization
Higher = simpler trees
max_features: Features per split
'sqrt' = √p features (classification)
'log2' = log₂(p) features
0.3 = 30% of features
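The ranges above can be explored with a small grid search. This is a sketch rather than a recommended grid: the candidate values are picked from the ranges listed, the grid is kept small so it runs quickly, and 3-fold cross-validation is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

# Illustrative candidate values for each hyperparameter
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", 0.3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=3,                  # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` from the same module samples combinations instead of trying all of them.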
Bias-Variance Analysis
Bias-Variance Tradeoff in Random Forest
Random Forest primarily reduces variance while keeping bias approximately equal to that of a single deep tree.
- Single tree: Low bias, high variance
- Random Forest: Low bias, low variance (due to averaging)
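The ensemble error decomposition:
$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \mathrm{Bias}^2 + \rho\sigma^2 + \frac{1-\rho}{N}\sigma^2 + \sigma_\varepsilon^2$$
where $\rho$ is the correlation between trees, $\sigma^2$ is the variance of a single tree, $N$ is the number of trees, and $\sigma_\varepsilon^2$ is the irreducible noise. Random feature selection reduces $\rho$.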
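The bias and variance terms can be estimated empirically by refitting each model on many independently drawn training sets and examining its predictions on a fixed set of test points. The sketch below assumes scikit-learn's `make_friedman1` generator, where calling it with `noise=0.0` returns the true function values; the sample sizes, 20 repeats, and the helper `bias_variance` are all illustrative choices, and the estimates are rough.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Fixed test points with noise-free targets (the true function values)
X_test, f_test = make_friedman1(n_samples=200, n_features=10,
                                noise=0.0, random_state=999)

def bias_variance(make_model, n_repeats=20):
    """Hypothetical helper: rough bias^2 and variance over fresh training sets."""
    preds = []
    for seed in range(n_repeats):
        # A new noisy training set drawn from the same distribution
        X_tr, y_tr = make_friedman1(n_samples=200, n_features=10,
                                    noise=1.0, random_state=seed)
        preds.append(make_model(seed).fit(X_tr, y_tr).predict(X_test))
    preds = np.array(preds)                           # (n_repeats, n_test)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

tree_bias2, tree_var = bias_variance(
    lambda s: DecisionTreeRegressor(random_state=s))
rf_bias2, rf_var = bias_variance(
    lambda s: RandomForestRegressor(n_estimators=100, max_features=1/3,
                                    random_state=s))

print(f"Single tree:   bias^2={tree_bias2:.2f}  variance={tree_var:.2f}")
print(f"Random forest: bias^2={rf_bias2:.2f}  variance={rf_var:.2f}")
```

The variance column is where the forest should show a clear drop; the bias estimates are noisier and depend on `max_features`, so they are best read as indicative only.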
Key Takeaways
Summary: Random Forest
- Random Forest combines many decision trees for better performance
- Bootstrap sampling + random feature selection decorrelates trees
- Feature importance shows which features matter most
- OOB evaluation provides free validation
- Robust to overfitting — more trees generally help
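- Handles mixed data types with little preprocessing; support for missing values and unencoded categorical features depends on the implementation and version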
- Parallel training — trees are independent
- Great baseline — often competitive with tuned models
What to Learn Next
-> Decision Trees Understand the building blocks of Random Forest — how individual trees split data and make predictions.
-> XGBoost Learn the gradient boosting alternative that often outperforms Random Forest on structured data.
-> Ensemble Methods Explore the broader theory behind bagging, boosting, and stacking ensemble strategies.
-> Model Evaluation Master cross-validation, bias-variance tradeoff, and metrics for assessing Random Forest performance.
-> Interpretability Use SHAP and LIME to explain what your Random Forest model learned from the data.
-> Feature Engineering Create better input features that help Random Forest models achieve even higher accuracy.