Feature Selection Methods
Feature selection is the process of identifying the most relevant variables for your model. Too many features introduce noise, increase computational cost, and can cause overfitting. Too few discard predictive signal. The art lies in finding the right subset.
Why Feature Selection Matters
The curse of dimensionality is real: as features grow, data becomes sparse, models become complex, and generalization suffers. Feature selection reduces dimensionality while preserving (or even improving) predictive power.
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.feature_selection import (
SelectKBest, mutual_info_classif, f_classif,
RFE, SequentialFeatureSelector, VarianceThreshold
)
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
Generate a Realistic Dataset
We'll create a dataset with informative, redundant, and noise features to demonstrate selection methods.
X, y = make_classification(
n_samples=2000, n_features=50, n_informative=10,
n_redundant=10, n_clusters_per_class=3,
flip_y=0.05, random_state=42
)
feature_names = [f'feat_{i}' for i in range(50)]
X = pd.DataFrame(X, columns=feature_names)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Informative features: 10, Redundant: 10, Noise: 30")
Filter Methods
Filter methods evaluate features independently of any model, using statistical tests.
Variance Threshold
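Remove features with near-zero variance – they carry little information. The threshold is compared against raw variances, so it depends on each feature's units and should be applied before any standardization.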
vt = VarianceThreshold(threshold=0.1)
X_vt = vt.fit_transform(X)
selected = X.columns[vt.get_support()]
removed = X.columns[~vt.get_support()]
print(f"Removed {len(removed)} low-variance features: {list(removed[:10])}")
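Because the threshold applies to raw variances, the same quantity can pass or fail depending on its units. The sketch below illustrates this on made-up data; `length_m`, `length_km`, and `toy` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical feature measured in metres, and the same feature in kilometres
length_m = rng.normal(loc=50.0, scale=1.0, size=500)
length_km = length_m / 1000.0

toy = np.column_stack([length_m, length_km])
vt_toy = VarianceThreshold(threshold=0.1).fit(toy)

# Both columns carry identical information, but only one clears the threshold
print("variances:", vt_toy.variances_)
print("kept:", vt_toy.get_support())
```

The kilometre column is dropped purely because of its scale, which is why this filter is best reserved for features whose units are comparable.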
Mutual Information
Mutual information captures any statistical dependency – linear or non-linear – between features and the target. It measures how much knowing one variable reduces uncertainty about another:
$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
For continuous variables, the sum becomes an integral. Higher MI indicates stronger dependency – the feature is more informative about the target.
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi_scores, index=feature_names).sort_values(ascending=False)
print("Top 15 features by mutual information:")
print(mi_series.head(15))
# Select top k features
k = 15
selector = SelectKBest(mutual_info_classif, k=k)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print(f"\nSelected {len(selected_features)} features: {list(selected_features)}")
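For discrete variables, mutual information can be computed by hand from a joint probability table, which makes the "reduction in uncertainty" idea concrete. The following is a minimal sketch; the helper `mutual_information` and the two toy tables are assumptions made for illustration, and the result is cross-checked against `sklearn.metrics.mutual_info_score`, which also uses the natural logarithm.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(joint):
    """MI in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize, in case counts were passed
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y), shape (1, n_y)
    nz = joint > 0                           # skip zero cells, where p * log(p) -> 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Independent variables: the joint equals the product of the marginals
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])

# Perfectly dependent variables: knowing x determines y
mi_dep = mutual_information([[0.5, 0.0], [0.0, 0.5]])

# Same dependent case expressed as paired labels
mi_sklearn = mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])

print(f"independent: {mi_indep:.4f}")
print(f"dependent:   {mi_dep:.4f} (log 2 = {np.log(2):.4f})")
print(f"sklearn:     {mi_sklearn:.4f}")
```

Independence gives zero, while a perfectly predictive binary feature reaches the entropy of the target, log 2 nats. `mutual_info_classif` estimates the same quantity for continuous features with a nearest-neighbor method, so its scores are noisier than this exact calculation.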
F-Test (ANOVA)
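The ANOVA F-test checks whether a feature's mean differs across classes, so it captures linear (mean-shift) dependency between each feature and the target but can miss non-linear relationships.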
f_scores, p_values = f_classif(X, y)
f_series = pd.Series(f_scores, index=feature_names).sort_values(ascending=False)
p_series = pd.Series(p_values, index=feature_names)
print("Top 10 features by F-score:")
print(f_series.head(10))
# Features with p-value < 0.05
significant = p_series[p_series < 0.05]
print(f"\n{len(significant)} features statistically significant at p < 0.05")
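The difference between the F-test and mutual information shows up most clearly when the target depends on a feature's magnitude rather than its mean. The sketch below uses synthetic data invented for this illustration (`x_nonlinear`, `x_linear`, and `y_toy` are hypothetical names); exact scores will vary with the random seed.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Target depends on |x|: both classes have roughly the same mean for x
x_nonlinear = rng.uniform(-1, 1, size=n)
y_toy = (np.abs(x_nonlinear) > 0.5).astype(int)

# A second feature whose mean shifts with the class
x_linear = y_toy + rng.normal(size=n)

X_toy = np.column_stack([x_nonlinear, x_linear])
f_toy, p_toy = f_classif(X_toy, y_toy)
mi_toy = mutual_info_classif(X_toy, y_toy, random_state=0)

for name, f, mi in zip(["x_nonlinear", "x_linear"], f_toy, mi_toy):
    print(f"{name:12s} F = {f:8.2f}   MI = {mi:.3f}")
```

The F-test ranks the mean-shifted feature far above the other, even though `x_nonlinear` determines the label exactly. Mutual information reverses that ranking, which is a reason to run both filters rather than rely on one.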
Correlation-Based Selection
Remove highly correlated features to reduce multicollinearity.
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Removing {len(to_drop)} features with correlation > 0.95: {to_drop[:5]}")
X_uncorr = X.drop(columns=to_drop)
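The upper-triangle mask is what ensures only one member of each correlated pair is dropped. A small sketch on a three-column toy frame makes that visible; `correlated_columns` and the column names `a`, `b`, `c` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold=0.95):
    """Return columns highly correlated with any column that precedes them."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                      # unrelated
})

dropped = correlated_columns(toy)
print(dropped)
```

Only `b` is flagged, and `a` survives, because each column is compared only with the columns before it. Which member of a pair is kept therefore depends on column order, so it can be worth ordering columns by preference first.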
Wrapper Methods
Wrapper methods use a specific model to evaluate feature subsets.
Recursive Feature Elimination (RFE)
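RFE repeatedly fits the model and removes the least important features until the desired number remains. The step parameter sets how many features are dropped per round; step=5 below trades some precision for fewer model fits.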
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rfe = RFE(estimator=rf, n_features_to_select=15, step=5)
rfe.fit(X, y)
selected_rfe = X.columns[rfe.support_]
ranking = pd.Series(rfe.ranking_, index=feature_names).sort_values()
print("RFE ranking (1 = selected):")
print(ranking.head(20))
print(f"\nRFE selected {len(selected_rfe)} features")
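The elimination loop itself is short. The following is a simplified sketch of the idea, not scikit-learn's implementation: `manual_rfe` is a hypothetical helper that assumes the estimator exposes `feature_importances_`, and it runs on a small synthetic dataset so it finishes quickly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def manual_rfe(estimator, X, y, n_keep, step=1):
    """Drop the `step` weakest features per round until `n_keep` remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        # Refit on the surviving features so importances reflect the current subset
        model = clone(estimator).fit(X[:, remaining], y)
        # Never remove more than needed to land exactly on n_keep
        n_drop = min(step, len(remaining) - n_keep)
        weakest = np.argsort(model.feature_importances_)[:n_drop]
        remaining = np.delete(remaining, weakest)
    return remaining

X_small, y_small = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# 12 -> 9 -> 6 -> 5 features with step=3
kept = manual_rfe(forest, X_small, y_small, n_keep=5, step=3)
print("kept feature indices:", kept)
```

Refitting after every round is the point of the method: importances are recomputed once weaker features are gone, so a feature that looked unimportant next to a redundant twin can recover.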
Sequential Feature Selection
Greedy forward or backward selection using model performance as the criterion.
# Forward selection – computationally expensive but thorough
sfs_forward = SequentialFeatureSelector(
rf, n_features_to_select=15, direction='forward',
scoring='accuracy', cv=3, n_jobs=-1
)
sfs_forward.fit(X, y)
selected_fwd = X.columns[sfs_forward.get_support()]
print(f"Forward SFS selected: {len(selected_fwd)} features")
# Backward elimination
sfs_backward = SequentialFeatureSelector(
rf, n_features_to_select=15, direction='backward',
scoring='accuracy', cv=3, n_jobs=-1
)
sfs_backward.fit(X, y)
selected_bwd = X.columns[sfs_backward.get_support()]
print(f"Backward SFS selected: {len(selected_bwd)} features")
# Compare selections
overlap = set(selected_fwd) & set(selected_bwd)
print(f"Overlap between forward and backward: {len(overlap)} features")
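Forward and backward selection differ sharply in cost when the target subset is small relative to the feature count. The sketch below counts model fits under a simple assumption: each round scores every remaining candidate with `cv` fits, and any final refit is ignored. `sfs_model_fits` is a hypothetical helper, not part of scikit-learn.

```python
def sfs_model_fits(n_features, n_select, cv, direction="forward"):
    """Approximate number of model fits for greedy sequential selection."""
    # Forward adds n_select features; backward removes the rest
    n_rounds = n_select if direction == "forward" else n_features - n_select
    # Round i evaluates every feature not yet added (or not yet removed)
    candidates = sum(n_features - i for i in range(n_rounds))
    return candidates * cv

forward_fits = sfs_model_fits(50, 15, cv=3, direction="forward")
backward_fits = sfs_model_fits(50, 15, cv=3, direction="backward")

print(f"forward:  {forward_fits} fits")   # 645 candidates x 3 folds = 1935
print(f"backward: {backward_fits} fits")  # 1155 candidates x 3 folds = 3465
```

With 50 features and a target of 15, backward elimination needs nearly twice as many fits, and each early fit uses more features. Forward selection is usually the cheaper direction when keeping fewer than half the features.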
Embedded Methods
Embedded methods perform feature selection as part of the model training process.
L1 Regularization (Lasso)
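An L1 penalty drives the coefficients of irrelevant features to exactly zero. LassoCV below is a regression model fit to the 0/1 class labels, which works as a screening heuristic; an L1-penalized logistic regression is the more natural choice for a classification target.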
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Lasso with cross-validation to find optimal regularization
lasso = LassoCV(cv=5, random_state=42, n_jobs=-1)
lasso.fit(X_scaled, y)
lasso_coef = pd.Series(lasso.coef_, index=feature_names)
# Keep features whose coefficient was not shrunk to exactly zero
selected_lasso = lasso_coef[lasso_coef != 0].index
print(f"Lasso selected {len(selected_lasso)} features with non-zero coefficients")
print(f"Optimal alpha: {lasso.alpha_:.4f}")
print("\nCoefficient magnitudes:")
print(lasso_coef.abs().sort_values(ascending=False).head(15))
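For a classification target, the same L1 idea can be applied through logistic regression. This is a minimal sketch on a small synthetic dataset, assuming the `liblinear` solver and an arbitrarily chosen `C=0.05`; in practice `C` would be tuned by cross-validation. The names `X_demo`, `l1_logit`, and `selector_l1` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0
)
# L1 penalties are scale-sensitive, so standardize first
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Smaller C means a stronger penalty and more coefficients at exactly zero
l1_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector_l1 = SelectFromModel(l1_logit).fit(X_demo_scaled, y_demo)

mask = selector_l1.get_support()
print(f"kept {mask.sum()} of {mask.size} features:", np.flatnonzero(mask))
```

`SelectFromModel` keeps the features whose fitted coefficients are effectively non-zero, so the penalty strength alone controls how many survive.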
Tree-Based Feature Importance
Tree models provide built-in feature importance scores.
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Top 20 features by Random Forest importance:")
print(importances.head(20))
# Select features above a threshold
threshold = importances.mean()
selected_tree = importances[importances > threshold].index
print(f"\n{len(selected_tree)} features above mean importance ({threshold:.4f})")
Permutation Importance
Model-agnostic importance based on how much performance drops when a feature is shuffled.
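# Score on a held-out split so importances reflect generalization, not memorized training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
perm_result = permutation_importance(gb, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)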
perm_importance = pd.Series(
perm_result.importances_mean, index=feature_names
).sort_values(ascending=False)
print("Permutation importance (top 15):")
print(perm_importance.head(15))
# Select features with positive permutation importance
selected_perm = perm_importance[perm_importance > 0].index
print(f"\n{len(selected_perm)} features with positive importance")
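The mechanics behind permutation importance fit in a few lines: shuffle one column, rescore, and record the drop. The sketch below is a hand-rolled illustration on two synthetic features, one that determines the label and one that is pure noise. `permutation_drop` and the variable names are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X_pi = np.column_stack([signal, noise])
y_pi = (signal > 0).astype(int)   # label depends only on the first column

X_tr, X_te, y_tr, y_te = train_test_split(X_pi, y_pi, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

def permutation_drop(model, X, y, column, n_repeats=10):
    """Mean accuracy drop after shuffling one column of held-out data."""
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Shuffling breaks the link between this feature and the target
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        drops.append(baseline - model.score(X_perm, y))
    return float(np.mean(drops))

drop_signal = permutation_drop(model, X_te, y_te, column=0)
drop_noise = permutation_drop(model, X_te, y_te, column=1)
print(f"signal: {drop_signal:.3f}   noise: {drop_noise:.3f}")
```

Shuffling the informative column pushes accuracy toward chance, while shuffling the noise column barely changes it. Note that correlated features can share importance under this scheme, because the model can fall back on the unshuffled twin.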
Comparing Selection Methods
selections = {
'Mutual Info': set(selected_features),
'RFE': set(selected_rfe),
'Lasso': set(selected_lasso),
'Tree Importance': set(selected_tree),
'Permutation': set(selected_perm),
}
# Voting across methods
from collections import Counter
all_selected = []
for method, feats in selections.items():
    all_selected.extend(feats)
vote_counts = Counter(all_selected)
consensus_features = [f for f, c in vote_counts.items() if c >= 3]
print(f"Features selected by 3+ methods: {len(consensus_features)}")
print(f"Consensus features: {consensus_features}")
# Evaluate with consensus features. They were chosen using all rows,
# so the cross-validated score below is an optimistic estimate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score_all = cross_val_score(rf, X, y, cv=cv, scoring='accuracy').mean()
score_consensus = cross_val_score(rf, X[consensus_features], y, cv=cv, scoring='accuracy').mean()
print(f"\nAccuracy with all 50 features: {score_all:.4f}")
print(f"Accuracy with {len(consensus_features)} consensus features: {score_consensus:.4f}")
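The comparison above picks features using every row and then cross-validates on those same rows, so the validation folds have already influenced the selection. The sketch below shows how large that effect can be on pure noise, where no feature is truly predictive, and how wrapping the selector in a pipeline avoids it. The data and names (`X_noise`, `leaky_score`, `honest_score`) are invented for this illustration, and exact scores depend on the seed.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # 2000 features, none related to y
y_noise = np.repeat([0, 1], 50)          # balanced labels
cv5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: select on all rows first, then cross-validate the reduced matrix
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=cv5
).mean()

# Honest: selection is refit inside each training fold via the pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X_noise, y_noise, cv=cv5).mean()

print(f"selection outside CV: {leaky_score:.3f}")
print(f"selection inside CV:  {honest_score:.3f}")
```

The leaky estimate looks well above chance even though the data contain no signal, while the pipeline version stays near 0.5. Any selector with a `fit`/`transform` interface can be placed in the pipeline the same way.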
Stability Selection
Combines repeated resampling (here, bootstrap samples drawn with replacement) with Lasso to identify features that are consistently selected.
from sklearn.linear_model import Lasso
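n_bootstrap = 50
rng = np.random.default_rng(42)  # seeded so the resamples are reproducible
selection_frequency = pd.Series(0, index=feature_names)
for i in range(n_bootstrap):
    # Draw a bootstrap sample: rows sampled with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_boot, y_boot = X_scaled[idx], y[idx]
    lasso_boot = Lasso(alpha=0.01, random_state=i)
    lasso_boot.fit(X_boot, y_boot)
    # Count a feature as selected when its coefficient is non-zero
    selected_boot = np.abs(lasso_boot.coef_) > 0
    selection_frequency += selected_boot.astype(int)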
selection_freq = selection_frequency / n_bootstrap
stable_features = selection_freq[selection_freq > 0.7].index
print(f"Stably selected features (>70% frequency): {len(stable_features)}")
print(selection_freq.sort_values(ascending=False).head(10))
Practical Workflow
def feature_selection_pipeline(X, y, k=15):
    """Multi-method feature selection with voting."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    results = {}
    # Mutual information
    mi = mutual_info_classif(X, y, random_state=42)
    results['mi'] = set(pd.Series(mi, index=X.columns).nlargest(k).index)
    # Lasso
    lasso = LassoCV(cv=5, random_state=42)
    lasso.fit(X_scaled, y)
    results['lasso'] = set(pd.Series(
        np.abs(lasso.coef_), index=X.columns
    ).nlargest(k).index)
    # Random Forest importance
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X, y)
    results['rf'] = set(pd.Series(
        rf.feature_importances_, index=X.columns
    ).nlargest(k).index)
    # RFE
    rfe = RFE(rf, n_features_to_select=k)
    rfe.fit(X, y)
    results['rfe'] = set(X.columns[rfe.support_])
    # Vote
    from collections import Counter
    all_feats = []
    for feats in results.values():
        all_feats.extend(feats)
    votes = Counter(all_feats)
    selected = [f for f, c in votes.items() if c >= 3]
    print(f"Consensus features ({len(selected)}): {selected}")
    return selected

final_features = feature_selection_pipeline(X, y, k=15)
Best Practices
- Use multiple methods – no single method is universally best
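- Consider computational cost – exhaustive subset search is O(2^n); greedy wrappers such as RFE and sequential selection need far fewer fits but are still the slowest option
- Validate with cross-validation – run selection inside each training fold, since selecting on the full dataset leaks information into the validation score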
- Document selection criteria – reproducibility requires knowing why features were kept
- Monitor feature importance over time – feature relevance can shift with data drift
- Combine domain knowledge with statistics – the best selections blend both
Summary
Feature selection reduces noise, improves interpretability, and often boosts performance. Use filter methods for speed, wrapper methods for accuracy, and embedded methods for convenience. The strongest approach combines multiple methods with domain knowledge to build robust, interpretable models.