Feature Selection Methods

Feature selection is the process of identifying the most relevant variables for your model. Too many features introduce noise, increase computational cost, and cause overfitting. Too few lose predictive signal. The art lies in finding the right subset.


Why Feature Selection Matters

The curse of dimensionality is real: as the number of features grows, the data become sparse relative to the feature space, models grow more complex, and generalization suffers. Feature selection reduces dimensionality while preserving (or even improving) predictive power.

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, f_classif,
    RFE, SequentialFeatureSelector, VarianceThreshold
)
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

Generate a Realistic Dataset

We'll create a dataset with informative, redundant, and noise features to demonstrate selection methods.

X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_redundant=10, n_clusters_per_class=3,
    flip_y=0.05, random_state=42
)

feature_names = [f'feat_{i}' for i in range(50)]
X = pd.DataFrame(X, columns=feature_names)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print("Informative features: 10, Redundant: 10, Noise: 30")

Filter Methods

Filter methods evaluate features independently of any model, using statistical tests.

Variance Threshold

Remove features with near-zero variance – they carry little information.

vt = VarianceThreshold(threshold=0.1)
X_vt = vt.fit_transform(X)
selected = X.columns[vt.get_support()]
removed = X.columns[~vt.get_support()]
print(f"Removed {len(removed)} low-variance features: {list(removed[:10])}")
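One caveat is that variance depends on the unit a feature is measured in, so a fixed cutoff can keep or drop the same information depending on scale. The sketch below illustrates this with made-up height data; `statistics.pvariance` stands in for the variance computation, and the names `heights_m`, `heights_cm`, and `cutoff` are purely illustrative.

```python
from statistics import pvariance

# Hypothetical heights of the same five people, in two different units
heights_m = [1.60, 1.70, 1.75, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]

var_m = pvariance(heights_m)     # population variance in m^2
var_cm = pvariance(heights_cm)   # population variance in cm^2 (100^2 times larger)

cutoff = 0.1  # same value as the threshold used above
print(f"metres:      variance={var_m:.4f}, kept={var_m > cutoff}")
print(f"centimetres: variance={var_cm:.1f}, kept={var_cm > cutoff}")
```

The two columns carry identical information, yet only the centimetre version clears the cutoff. For that reason a variance cutoff is usually most meaningful when features share comparable units, and it should be applied before standardization, since standardized features all have unit variance.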

Mutual Information

Mutual information captures any statistical dependency – linear or non-linear – between features and the target. It measures how much knowing one variable reduces uncertainty about another:

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

For continuous variables, the sum becomes an integral. Higher MI indicates stronger dependency – the feature is more informative about the target.
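To make the formula concrete, here is a minimal sketch that evaluates the sum directly for three small, made-up joint distributions of a binary feature and a binary target. The function `mutual_information` and the example tables are illustrative and not part of scikit-learn.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is p(x=i, y=j); the entries must sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # terms with p(x,y) = 0 contribute 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Hypothetical joint distributions of a binary feature and a binary target
independent = [[0.25, 0.25], [0.25, 0.25]]   # feature says nothing about the target
dependent = [[0.40, 0.10], [0.10, 0.40]]     # feature mostly agrees with the target
identical = [[0.50, 0.00], [0.00, 0.50]]     # feature equals the target

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
mi_identical = mutual_information(identical)
print(mi_independent, mi_dependent, mi_identical)
```

The independent table gives zero, and the identical table gives log 2, the full entropy of a balanced binary target. In practice the joint distribution is unknown, so `mutual_info_classif` estimates the quantity from samples (using a nearest-neighbour based estimator for continuous features) rather than evaluating the sum like this.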

mi_scores = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi_scores, index=feature_names).sort_values(ascending=False)

print("Top 15 features by mutual information:")
print(mi_series.head(15))

# Select top k features
k = 15
selector = SelectKBest(mutual_info_classif, k=k)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print(f"\nSelected {len(selected_features)} features: {list(selected_features)}")

F-Test (ANOVA)

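The ANOVA F-test checks, one feature at a time, whether the feature's mean differs between classes. It captures linear separation well but can miss dependencies that leave the class means unchanged.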

f_scores, p_values = f_classif(X, y)
f_series = pd.Series(f_scores, index=feature_names).sort_values(ascending=False)
p_series = pd.Series(p_values, index=feature_names)

print("Top 10 features by F-score:")
print(f_series.head(10))

# Features with p-value < 0.05
significant = p_series[p_series < 0.05]
print(f"\n{len(significant)} features statistically significant at p < 0.05")
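For intuition about what the score measures, the one-way ANOVA F-statistic can be computed by hand as the ratio of between-class to within-class variance. The sketch below uses a hypothetical helper `anova_f` and tiny made-up samples; it is meant to illustrate the statistic, not to replace `f_classif`.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of per-class value lists."""
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-class sum of squares: how far class means sit from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-class sum of squares: spread of values around their own class mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within

# Hypothetical feature values split by class label
weak = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])    # class means 2 and 3
strong = anova_f([[1.0, 2.0, 3.0], [6.0, 7.0, 8.0]])  # class means 2 and 7
print(weak, strong)
```

Both features have the same within-class spread, so the score grows only because the class means move further apart. A feature whose classes differ in spread or shape but not in mean would score low here, which is where mutual information can help.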

Correlation-Based Selection

Drop one feature from each highly correlated pair to reduce multicollinearity.

corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Removing {len(to_drop)} features with correlation > 0.95: {to_drop[:5]}")
X_uncorr = X.drop(columns=to_drop)

Wrapper Methods

Wrapper methods use a specific model to evaluate feature subsets.

Recursive Feature Elimination (RFE)

RFE repeatedly fits the model, drops the least important features (five per iteration below, via step=5), and stops when the desired number remains.

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

rfe = RFE(estimator=rf, n_features_to_select=15, step=5)
rfe.fit(X, y)

selected_rfe = X.columns[rfe.support_]
ranking = pd.Series(rfe.ranking_, index=feature_names).sort_values()
print("RFE ranking (1 = selected):")
print(ranking.head(20))
print(f"\nRFE selected {len(selected_rfe)} features")

Sequential Feature Selection

Greedy forward or backward selection using model performance as the criterion.

# Forward selection – greedy and computationally expensive (many model fits per step)
sfs_forward = SequentialFeatureSelector(
    rf, n_features_to_select=15, direction='forward',
    scoring='accuracy', cv=3, n_jobs=-1
)
sfs_forward.fit(X, y)
selected_fwd = X.columns[sfs_forward.get_support()]
print(f"Forward SFS selected: {len(selected_fwd)} features")

# Backward elimination
sfs_backward = SequentialFeatureSelector(
    rf, n_features_to_select=15, direction='backward',
    scoring='accuracy', cv=3, n_jobs=-1
)
sfs_backward.fit(X, y)
selected_bwd = X.columns[sfs_backward.get_support()]
print(f"Backward SFS selected: {len(selected_bwd)} features")

# Compare selections
overlap = set(selected_fwd) & set(selected_bwd)
print(f"Overlap between forward and backward: {len(overlap)} features")
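The cost of the two greedy searches above can be estimated with a little arithmetic. The sketch below assumes that every remaining candidate feature is scored with `cv`-fold cross-validation at each step, which is how a greedy sequential search is usually described; `greedy_fits` is a hypothetical helper, and the counts ignore any final refit.

```python
from math import comb

def greedy_fits(n_features, n_select, cv, direction="forward"):
    """Count model fits for greedy selection, assuming every remaining
    candidate is scored with cv-fold cross-validation at each step."""
    if direction == "forward":
        steps = n_select                   # add one feature per step
    else:
        steps = n_features - n_select      # remove one feature per step
    # At step i there are (n_features - i) candidates to try
    candidates = sum(n_features - i for i in range(steps))
    return candidates * cv

forward_fits = greedy_fits(50, 15, cv=3, direction="forward")
backward_fits = greedy_fits(50, 15, cv=3, direction="backward")
exhaustive_fits = comb(50, 15) * 3   # every 15-feature subset, 3 folds each

print(f"forward:    {forward_fits:,} fits")
print(f"backward:   {backward_fits:,} fits")
print(f"exhaustive: {exhaustive_fits:,} fits")
```

Under these assumptions, forward selection is the cheaper direction whenever fewer than half of the features are kept, and both greedy searches are many orders of magnitude cheaper than trying every subset.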

Embedded Methods

Embedded methods perform feature selection as part of the model training process.

L1 Regularization (Lasso)

L1 penalty drives coefficients of irrelevant features to exactly zero.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

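# LassoCV is a regressor: here it fits the 0/1 labels as a numeric target,
# which works as a selection heuristic but is not a probabilistic classifier.
# Cross-validation picks the regularization strength alpha.
lasso = LassoCV(cv=5, random_state=42, n_jobs=-1)
lasso.fit(X_scaled, y)

lasso_coef = pd.Series(lasso.coef_, index=feature_names)
# Keep features whose coefficient is not negligibly small
selected_lasso = lasso_coef[lasso_coef.abs() > 0.01].index
print(f"Lasso selected {len(selected_lasso)} features with |coefficient| > 0.01")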
print(f"Optimal alpha: {lasso.alpha_:.4f}")
print("\nCoefficient magnitudes:")
print(lasso_coef.abs().sort_values(ascending=False).head(15))
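Why an L1 penalty produces exact zeros, rather than merely small coefficients, is easiest to see in a simplified setting. Assuming orthonormal features (an assumption made here only to keep the algebra simple), the Lasso solution is the unpenalized least-squares coefficient passed through a soft-threshold, while an L2 penalty only rescales it. The helpers and coefficient values below are illustrative, and the exact shrinkage factor depends on how the penalty is scaled.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_shrink(z, lam):
    """L2 counterpart: rescale z, which stays non-zero unless z is zero."""
    return z / (1.0 + lam)

# Hypothetical unpenalized coefficients: two strong features, three weak ones
ols_coefs = [2.0, -1.5, 0.3, -0.1, 0.05]
lam = 0.5

l1_coefs = [soft_threshold(z, lam) for z in ols_coefs]
l2_coefs = [ridge_shrink(z, lam) for z in ols_coefs]
print("L1:", l1_coefs)   # weak coefficients become exactly 0.0
print("L2:", l2_coefs)   # weak coefficients shrink but stay non-zero
```

Every coefficient whose magnitude falls below the penalty strength is set to zero under L1, which is what makes the fitted model double as a feature selector.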

Tree-Based Feature Importance

Tree models provide built-in feature importance scores.

rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)

print("Top 20 features by Random Forest importance:")
print(importances.head(20))

# Select features above a threshold
threshold = importances.mean()
selected_tree = importances[importances > threshold].index
print(f"\n{len(selected_tree)} features above mean importance ({threshold:.4f})")

Permutation Importance

Model-agnostic importance based on how much performance drops when a feature is shuffled.

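# Score on held-out data: importance measured on the training set
# can credit features the model has merely overfit to.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

perm_result = permutation_importance(
    gb, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)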
perm_importance = pd.Series(
    perm_result.importances_mean, index=feature_names
).sort_values(ascending=False)

print("Permutation importance (top 15):")
print(perm_importance.head(15))

# Select features with positive permutation importance
selected_perm = perm_importance[perm_importance > 0].index
print(f"\n{len(selected_perm)} features with positive importance")
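The mechanism itself is simple enough to sketch without any library: shuffle one column, re-score the model, and record how far the score falls. The toy data, the stand-in `toy_model`, and the helper names below are all hypothetical and exist only to show the idea.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == t for r, t in zip(rows, labels)) / len(rows)

def permutation_drop(predict, rows, labels, col, n_repeats=10, seed=0):
    """Mean accuracy drop after shuffling column col, over n_repeats shuffles."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)
        # Rebuild the rows with only the chosen column permuted
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats

# Hypothetical toy data: the label depends on column 0 only; column 1 is noise
data_rng = random.Random(42)
toy_rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
toy_labels = [int(r[0] > 0) for r in toy_rows]

def toy_model(row):
    """Stand-in for a fitted model: it reads column 0 and ignores column 1."""
    return int(row[0] > 0)

drop_signal = permutation_drop(toy_model, toy_rows, toy_labels, col=0)
drop_noise = permutation_drop(toy_model, toy_rows, toy_labels, col=1)
print(f"column 0: {drop_signal:.3f}, column 1: {drop_noise:.3f}")
```

Shuffling the column the model relies on costs a large share of its accuracy, while shuffling the ignored column costs nothing. With a real fitted model, noise features typically show importances that scatter around zero rather than landing exactly on it.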

Comparing Selection Methods

selections = {
    'Mutual Info': set(selected_features),
    'RFE': set(selected_rfe),
    'Lasso': set(selected_lasso),
    'Tree Importance': set(selected_tree),
    'Permutation': set(selected_perm),
}

# Voting across methods
from collections import Counter
all_selected = []
for method, feats in selections.items():
    all_selected.extend(feats)

vote_counts = Counter(all_selected)
consensus_features = [f for f, c in vote_counts.items() if c >= 3]
print(f"Features selected by 3+ methods: {len(consensus_features)}")
print(f"Consensus features: {consensus_features}")

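# Evaluate with consensus features.
# Note: the features were chosen using all rows, so this estimate is optimistic;
# an unbiased comparison would repeat the selection inside each training fold.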
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
score_all = cross_val_score(rf, X, y, cv=cv, scoring='accuracy').mean()
score_consensus = cross_val_score(rf, X[consensus_features], y, cv=cv, scoring='accuracy').mean()
print(f"\nAccuracy with all 50 features: {score_all:.4f}")
print(f"Accuracy with {len(consensus_features)} consensus features: {score_consensus:.4f}")

Stability Selection

Combines repeated resampling (here, bootstrap samples) with Lasso to identify features that are selected consistently.

from sklearn.linear_model import Lasso

n_bootstrap = 50
rng = np.random.default_rng(42)  # seeded so the resampling is reproducible
selection_frequency = pd.Series(0, index=feature_names)

for i in range(n_bootstrap):
    # Draw a bootstrap sample: len(X) row indices, with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_boot, y_boot = X_scaled[idx], y[idx]

    lasso_boot = Lasso(alpha=0.01)
    lasso_boot.fit(X_boot, y_boot)

    # Count the features whose coefficient survived the L1 penalty
    selected_boot = np.abs(lasso_boot.coef_) > 0
    selection_frequency += selected_boot.astype(int)

selection_freq = selection_frequency / n_bootstrap
stable_features = selection_freq[selection_freq > 0.7].index
print(f"Stably selected features (>70% frequency): {len(stable_features)}")
print(selection_freq.sort_values(ascending=False).head(10))

Practical Workflow

def feature_selection_pipeline(X, y, k=15):
    """Multi-method feature selection with voting."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    results = {}
    
    # Mutual information
    mi = mutual_info_classif(X, y, random_state=42)
    results['mi'] = set(pd.Series(mi, index=X.columns).nlargest(k).index)
    
    # Lasso
    lasso = LassoCV(cv=5, random_state=42)
    lasso.fit(X_scaled, y)
    # Rank by magnitude, but never count a feature Lasso zeroed out
    coef = pd.Series(np.abs(lasso.coef_), index=X.columns)
    results['lasso'] = set(coef[coef > 0].nlargest(k).index)
    
    # Random Forest importance
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X, y)
    results['rf'] = set(pd.Series(
        rf.feature_importances_, index=X.columns
    ).nlargest(k).index)
    
    # RFE
    rfe = RFE(rf, n_features_to_select=k)
    rfe.fit(X, y)
    results['rfe'] = set(X.columns[rfe.support_])
    
    # Vote
    from collections import Counter
    all_feats = []
    for feats in results.values():
        all_feats.extend(feats)
    
    votes = Counter(all_feats)
    selected = [f for f, c in votes.items() if c >= 3]
    
    print(f"Consensus features ({len(selected)}): {selected}")
    return selected

final_features = feature_selection_pipeline(X, y, k=15)

Best Practices

Use multiple methods – no single method is universally best
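Consider computational cost – exhaustive subset search is O(2^n); greedy wrapper methods need far fewer fits but are still much more expensive than filters
Validate with cross-validation – run the selection inside each training fold, never on the full dataset before splitting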
Document selection criteria – reproducibility requires knowing why features were kept
Monitor feature importance over time – feature relevance can shift with data drift
Combine domain knowledge with statistics – the best selections blend both
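The cross-validation advice in the list above is easiest to follow by placing the selector inside a scikit-learn `Pipeline`, so that it is refit on the training portion of every fold. The sketch below is one possible arrangement, not the only one: the dataset sizes, `k=10`, and the choice of `SelectKBest` with logistic regression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5,
    n_redundant=5, random_state=0
)

# Scaling and selection are refit on the training part of every fold,
# so the held-out fold never influences which features are chosen.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring="accuracy")
print(f"Accuracy with in-fold selection: {fold_scores.mean():.4f}")
```

Because the whole pipeline is cross-validated as a unit, the reported accuracy reflects both the selection step and the classifier. The same pattern should work with other selectors such as `RFE` in place of `SelectKBest`.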

Summary

Feature selection reduces noise, improves interpretability, and often boosts performance. Use filter methods for speed, wrapper methods for accuracy, and embedded methods for convenience. The strongest approach combines multiple methods with domain knowledge to build robust, interpretable models.