ML Foundations
Choosing the Right Model: The Art and Science of ML
Model selection balances algorithm choice with hyperparameter tuning to find the best fit for your data. A sound approach saves compute time and can substantially improve results.
- Algorithm Comparison: match data characteristics to model strengths (small data vs. large data, tabular vs. text)
- Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization with Optuna
- Cross-Validation: reliable performance estimation that guards against overfitting to a single split
"All models are wrong, but some are useful." (George Box)
Model Selection and Hyperparameter Tuning
Choosing the right model and tuning it properly is crucial for ML success.
Mathematical Foundations
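Bias-Variance Decomposition
For a model $\hat{f}(x)$ with true function $f(x)$ and observations $y = f(x) + \varepsilon$, the expected squared error at a point $x$ is:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2$$
where:
$\mathrm{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$
$\mathrm{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]$
$\sigma^2 = \mathrm{Var}(\varepsilon)$ is irreducible error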
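The decomposition can be checked numerically. The sketch below is a simulation under stated assumptions, not part of the original material: the ground-truth function `true_f`, the noise level, and the polynomial degrees are all hypothetical choices made for illustration. It refits polynomials of increasing degree on many noisy training sets and estimates squared bias and variance at fixed test points.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=30, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed design, only the noise changes
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # A fresh noisy training set per repeat approximates the expectation over datasets.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the low-degree fit should show the largest bias and the smallest variance, and the high-degree fit the reverse, which is the tradeoff the decomposition describes.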
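Cross-Validation Error
In its standard $K$-fold form:
$$\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} L\big(y_i, \hat{f}^{(-k)}(x_i)\big)$$
where $D_k$ is the $k$-th validation fold, $\hat{f}^{(-k)}$ is the model trained on all folds except $D_k$, and $L$ is the loss.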
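K-fold cross-validation error is the average of the validation errors measured on each held-out fold. A minimal sketch, assuming scikit-learn's built-in breast cancer dataset and a scaled logistic regression as stand-ins (neither comes from the original text), computes it by hand and compares it with `cross_val_score`:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and model; any estimator with fit/score would work.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Train on K-1 folds, measure misclassification rate on the held-out fold.
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(1.0 - model.score(X[val_idx], y[val_idx]))

cv_error = float(np.mean(fold_errors))

# The same quantity via the helper, using identical splits.
helper_error = float(1.0 - cross_val_score(model, X, y, cv=kf).mean())
print(f"manual CV error: {cv_error:.4f}, cross_val_score: {helper_error:.4f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics leak from the validation fold.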
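Regularized Objective (for tuning)
In a common general form:
$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_\theta(x_i)\big) + \lambda\, \Omega(\theta)$$
where $\Omega(\theta)$ is a complexity penalty (for example $\|\theta\|_2^2$ or $\|\theta\|_1$) and $\lambda \ge 0$ is the regularization strength, a hyperparameter chosen by tuning.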
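A regularized objective adds a complexity penalty, weighted by a strength $\lambda$, to the average training loss. As one concrete instance, the sketch below uses ridge regression (squared loss plus a squared L2 penalty) on synthetic data. The data, the `fit_ridge` and `objective` helpers, and the candidate values of `lam` are all assumptions made for illustration. It picks $\lambda$ by validation error, which is the tuning step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]          # only three informative features (synthetic)
y = X @ true_w + rng.normal(scale=1.0, size=n)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

def fit_ridge(X, y, lam):
    """Closed-form minimizer of mean squared error + lam * ||w||^2."""
    n_samples, n_features = X.shape
    return np.linalg.solve(X.T @ X + n_samples * lam * np.eye(n_features), X.T @ y)

def objective(X, y, w, lam):
    """Average squared loss plus the L2 penalty weighted by lam."""
    return float(np.mean((y - X @ w) ** 2) + lam * np.sum(w ** 2))

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
norms, val_mse = {}, {}
for lam in lambdas:
    w = fit_ridge(X_tr, y_tr, lam)
    norms[lam] = float(np.linalg.norm(w))                    # shrinks as lam grows
    val_mse[lam] = float(np.mean((y_val - X_val @ w) ** 2))  # used to choose lam

best_lam = min(val_mse, key=val_mse.get)
print(f"best lambda by validation MSE: {best_lam}")
```

The weight norm shrinks as `lam` increases, which is how the penalty controls model complexity.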
Model Selection Framework
Definition: Model Selection
The process of choosing the best machine learning algorithm for a given problem based on data characteristics, performance requirements, and constraints.
Quick Guide:
Small dataset (<1K samples):
- SVM with RBF kernel
- KNN
- Naive Bayes
- Random Forest
Medium dataset (1K-100K):
- XGBoost / LightGBM
- Random Forest
- Neural Networks (simple)
- SVM with linear kernel
Large dataset (>100K):
- XGBoost / LightGBM
- Neural Networks
- Linear models
- SGDClassifier
High dimensional (features > samples):
- Linear models (L1/L2)
- SVM
- Naive Bayes
Interpretability needed:
- Decision Trees
- Linear/Logistic Regression
- Rule-based models
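A guide like this narrows the candidates; cross-validation then decides between them. The sketch below compares the small-dataset candidates on scikit-learn's built-in breast cancer dataset (569 samples). The dataset and the mostly default hyperparameters are assumptions made for illustration, not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # illustrative small tabular dataset

# Distance- and kernel-based models get a scaler; tree and NB models do not need one.
candidates = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold accuracy for each candidate.
scores = {name: float(cross_val_score(model, X, y, cv=5).mean())
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {score:.3f}")
```

The ranking will differ from dataset to dataset, which is why the guide works better as a shortlist than as a final answer.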
Hyperparameter Tuning
Definition: Grid Search
An exhaustive search over specified parameter values. Tries every combination in the grid to find the best parameters.
Definition: Random Search
Randomly samples parameter combinations. Often finds good results faster than grid search and makes better use of computational budget.
Definition: Bayesian Optimization
Uses past results to guide the search for optimal parameters. It typically needs fewer trials than grid or random search, which matters most when each model fit is expensive.
[Figure: Bias-Variance Curve]
[Figure: Learning Curves]
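The data behind a learning curve can be computed directly with scikit-learn's `learning_curve`. A minimal sketch, assuming the built-in digits dataset and an unpruned decision tree as a deliberately high-variance model (both are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)  # illustrative dataset

# Fit the model on growing fractions of the training folds and score each fit.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True,
    random_state=0,
)

train_mean = train_scores.mean(axis=1)  # average over the 5 folds
val_mean = val_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={size:5d}  train={tr:.3f}  validation={va:.3f}")
```

A persistent gap between training and validation scores suggests high variance (more data or regularization may help). Two low, converged curves suggest high bias (a more flexible model may help).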
Grid Search:
- Tries every combination in the grid
- Guaranteed to find the best combination within the grid
- Cost grows exponentially with the number of parameters
- Use for small parameter spaces
Random Search:
- Samples random combinations
- Often finds good results faster
- Makes better use of a fixed budget
- Default choice for most cases
Bayesian Optimization:
- Uses past results to guide the search
- Typically the most sample-efficient of the three
- Best for expensive models
- Library: Optuna
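from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train are assumed to be defined earlier (the training split)
# Grid Search: evaluates all 3 * 4 * 3 = 36 combinations
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best: {grid.best_params_}")

# Random Search (faster): samples 20 of the 36 combinations
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_grid,
    n_iter=20, cv=5, scoring='accuracy', random_state=0
)
random_search.fit(X_train, y_train)
print(f"Best: {random_search.best_params_}")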
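To make the cost difference concrete, the number of model fits can be counted before running anything. This sketch repeats the same grid and uses scikit-learn's `ParameterGrid` and `ParameterSampler`, which enumerate and sample candidates the way the two searches do. The counts exclude the one extra refit on the full training set that both searches perform by default.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5
n_iter = 20

# Grid search: every combination, each fitted once per fold.
n_grid = len(ParameterGrid(param_grid))
grid_fits = n_grid * cv_folds

# Random search: a fixed number of sampled combinations, each fitted once per fold.
sampled = list(ParameterSampler(param_grid, n_iter=n_iter, random_state=0))
random_fits = len(sampled) * cv_folds

print(f"grid: {n_grid} combinations, {grid_fits} fits")   # grid: 36 combinations, 180 fits
print(f"random: {len(sampled)} combinations, {random_fits} fits")  # random: 20 combinations, 100 fits
```

Adding a fourth parameter with four values would raise the grid to 144 combinations (720 fits), while the random search budget stays at whatever `n_iter` is set to.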
Optuna (Bayesian Optimization)
Python Implementation
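import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# X, y are assumed to be defined earlier (features and labels)
def objective(trial):
    # Sample one hyperparameter configuration for this trial
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
    }
    model = xgb.XGBClassifier(**params)
    # The mean 5-fold CV score is the value Optuna maximizes
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best params: {study.best_params}")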
Key Takeaways
Summary: Model Selection
- Start with simple models as baselines
- Random search usually finds good settings faster than grid search for the same budget
- Bayesian optimization (Optuna) is typically the most efficient, especially when each trial is expensive
- Always use cross-validation for evaluation
- XGBoost/LightGBM are often the best tabular models
- Scale data for SVM, KNN, Neural Networks
- Feature engineering often matters more than model choice
- Ensemble multiple models for best performance
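Two of these takeaways, starting from a baseline and scaling data for distance-based models, can be illustrated together. The sketch below assumes scikit-learn's built-in wine dataset, whose features have very different ranges. It compares a majority-class baseline, KNN on raw features, and KNN inside a scaling pipeline, all under 5-fold cross-validation.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # illustrative dataset with unscaled features

# Baseline: always predict the most frequent class.
baseline = float(cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean())

# KNN on raw features: distances are dominated by the largest-scale feature.
knn_raw = float(cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean())

# KNN with scaling inside the pipeline, so the scaler is refit on each training fold.
knn_scaled = float(cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5
).mean())

print(f"baseline={baseline:.3f}  knn_raw={knn_raw:.3f}  knn_scaled={knn_scaled:.3f}")
```

On this dataset scaling alone should move KNN well above both the baseline and its unscaled version, without changing the algorithm.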
What to Learn Next
-> Model Evaluation: Master cross-validation, bias-variance tradeoff, and the metrics that guide model selection.
-> Regularization: Control model complexity with Ridge, Lasso, and Elastic Net to prevent overfitting.
-> Linear Regression: Start with the simplest baseline model and understand when linear approaches are sufficient.
-> Decision Trees: Learn interpretable models that are often strong baselines for structured data.
-> Ensemble Methods: Combine multiple models to achieve better performance than any single algorithm.
-> Model Deployment: Take your selected model from notebook to production with APIs and containerization.