Feature Engineering — Where Domain Knowledge Meets Data Science
Feature engineering transforms raw data into representations that make patterns easier for a model to learn. It is often one of the most impactful steps in a machine learning pipeline.
- Numerical Scaling — StandardScaler, MinMaxScaler, and RobustScaler prepare features for distance-based models
- Categorical Encoding — one-hot, label, and target encoding convert categorical data into model-ready formats
- Feature Creation — interaction terms, date components, and aggregations unlock hidden patterns in your data
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng
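Mathematical Foundations
Standardization (Z-score)
$$z = \frac{x - \mu}{\sigma}$$
where
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$$
Min-Max Scaling
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Information Gain (Feature Selection)
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$
where
$$H(S) = -\sum_{i} p_i \log_2 p_i$$
is the entropy.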
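As a worked instance of information gain, here is a minimal sketch using only the standard library. It assumes a categorical feature and the usual definition (entropy of the labels minus the size-weighted entropy of each subset). The function names and the toy data are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S), in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v) for a categorical feature A."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy data, invented for this example
outlook = ["sunny", "sunny", "rain", "rain"]
noise = ["a", "b", "a", "b"]
play = ["no", "no", "yes", "yes"]

ig_outlook = information_gain(outlook, play)  # feature separates the classes perfectly
ig_noise = information_gain(noise, play)      # feature carries no information about the label
print(ig_outlook, ig_noise)
```

A feature with higher information gain is a stronger candidate to keep during feature selection.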
Feature Engineering Pipeline
Numerical Features
Definition: StandardScaler (Z-score)
Standardizes features by removing the mean and scaling to unit variance. Results in mean=0, std=1.
Z-score Standardization
$$z = \frac{x - \mu}{\sigma}$$
Here,
- $z$ = Standardized value
- $x$ = Original value
- $\mu$ = Mean of feature
- $\sigma$ = Standard deviation of feature
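To make the z-score concrete, the sketch below standardizes a small list by hand. The `standardize` helper and the sample values are assumptions for illustration. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std.

    Assumes the values are not all identical (sigma would be zero).
    """
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

ages = [20, 30, 40, 50, 60]  # made-up sample
z = standardize(ages)
print([round(v, 3) for v in z])
```

After the transform the values have mean 0 and standard deviation 1, regardless of the original units.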
Definition: MinMaxScaler
Scales features to a fixed range, typically [0, 1], by subtracting the minimum and dividing by the range.
Min-Max Scaling
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Here,
- $x'$ = Scaled value
- $x_{\min}, x_{\max}$ = Minimum and maximum values
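A hand-rolled version of min-max scaling can look like the sketch below. The `min_max_scale` helper, its `new_min`/`new_max` parameters, and the income values are illustrative assumptions, not part of any library.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the minimum lands on new_min and the maximum on new_max.

    Assumes max(values) > min(values); a constant feature would divide by zero.
    """
    lo, hi = min(values), max(values)
    span = new_max - new_min
    return [new_min + (x - lo) / (hi - lo) * span for x in values]

incomes = [30_000, 45_000, 60_000, 120_000]  # made-up sample
scaled = min_max_scale(incomes)
print(scaled)
```

Because the minimum and maximum define the range, a single extreme value compresses all other points toward one end of the interval.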
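Definition: RobustScaler
Centers each feature on its median and scales it by the interquartile range (IQR) instead of using the mean and standard deviation, so extreme values have little influence on the scaling.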
Scaling Methods Diagram
RobustScaler:
Uses median and IQR
Robust to outliers
Use for: Data with outliers
Log Transform:
x_log = log(x + 1)
Use for: Skewed distributions, power laws
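The two transforms above can be sketched with NumPy as follows. The array is made up and contains one deliberate outlier. The manual median/IQR computation mirrors what scikit-learn's RobustScaler does with its default 25th to 75th percentile range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])  # 1000 is an outlier

# Robust scaling: center on the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Log transform: log1p(x) computes log(x + 1) and is defined at x = 0
x_log = np.log1p(x)

print(robust)
print(x_log)
```

The outlier barely affects the median and IQR, so the five typical points keep a sensible spread after robust scaling. The log transform instead compresses the outlier itself.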
When to Use Each Scaler
- StandardScaler: Most algorithms (SVM, KNN, Neural Networks)
- MinMaxScaler: Neural networks, image data
- RobustScaler: Data with outliers
- Log Transform: Skewed distributions, power laws
Feature Creation
Date features:
Year, Month, Day, Hour
Day of week, Is weekend
Is holiday, Season
Days since event
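Most of these date features can be derived with the pandas `.dt` accessor. The sketch below assumes a hypothetical `timestamp` column and a hypothetical reference event date. The season mapping assumes northern-hemisphere meteorological seasons. Holiday flags are left out because they need an external calendar.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2024-01-05 08:30", "2024-01-06 14:00", "2024-07-04 23:15",
])})

ts = df["timestamp"].dt
df["year"] = ts.year
df["month"] = ts.month
df["day"] = ts.day
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = ts.dayofweek >= 5    # Saturday or Sunday

# Season from month: Dec-Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
season_names = {0: "winter", 1: "spring", 2: "summer", 3: "autumn"}
df["season"] = (ts.month % 12 // 3).map(season_names)

# Whole days elapsed since a reference event (hypothetical date)
reference = pd.Timestamp("2024-01-01")
df["days_since_event"] = (df["timestamp"] - reference).dt.days

print(df)
```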
Text features:
Word count, Character count
TF-IDF vectors
Word embeddings
Sentiment scores
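The simpler text features can be sketched as below, using pandas string methods for the counts and scikit-learn's `TfidfVectorizer` for TF-IDF vectors. The three documents are invented. Word embeddings and sentiment scores are omitted because they depend on external models.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.Series([
    "great product fast shipping",
    "terrible product",
    "fast shipping",
])

# Simple length-based features
word_count = docs.str.split().str.len()
char_count = docs.str.len()

# TF-IDF: one row per document, one column per vocabulary term
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix

print(word_count.tolist(), char_count.tolist())
print(sorted(vectorizer.vocabulary_), tfidf.shape)
```

In a real project the vectorizer should be fit on training documents only and then applied to validation or test documents with `transform`.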
Interaction features:
x₁ × x₂ (product)
x₁ / x₂ (ratio)
x₁ - x₂ (difference)
x₁², x₂² (polynomial)
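The interaction features listed above are plain column arithmetic in pandas. The column names `x1` and `x2` and the values are placeholders for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [2.0, 4.0, 5.0]})

df["x1_times_x2"] = df["x1"] * df["x2"]   # product
df["x1_over_x2"] = df["x1"] / df["x2"]    # ratio (assumes x2 is never zero)
df["x1_minus_x2"] = df["x1"] - df["x2"]   # difference
df["x1_sq"] = df["x1"] ** 2               # polynomial terms
df["x2_sq"] = df["x2"] ** 2

print(df)
```

For many columns, scikit-learn's `PolynomialFeatures` can generate the products and powers automatically, at the cost of a rapidly growing feature count.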
Aggregation features:
Mean, Median, Std per group
Count per category
Rolling statistics
Lag features
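Group statistics, lags, and rolling windows can be sketched with pandas `groupby`. The `sales` table and its column names are hypothetical. The rolling mean is computed on shifted values so that a row's own `amount` never leaks into its own feature.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "amount": [10.0, 20.0, 30.0, 5.0, 5.0, 20.0],
}).sort_values(["store", "day"])

g = sales.groupby("store")["amount"]

sales["store_mean"] = g.transform("mean")    # mean per group, broadcast to every row
sales["store_count"] = g.transform("count")  # rows per category
sales["lag_1"] = g.shift(1)                  # previous day's amount within the same store
# Mean of the two previous days, excluding the current row
sales["rolling_mean_2"] = g.transform(lambda s: s.shift(1).rolling(2).mean())

print(sales)
```

When the group statistic involves the target or future rows, compute it on the training split only and join it onto the other splits.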
Feature Selection
Definition: Feature Selection
The process of selecting a subset of relevant features for use in model construction. It can reduce overfitting, often improves accuracy, and shortens training time.
Feature Selection Methods
Method 1: Filter (statistical tests)
Correlation with target
Chi-squared test
Mutual information
ANOVA F-test
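A filter method can be sketched with scikit-learn's `SelectKBest`, here scored with the ANOVA F-test (`f_classif`). The synthetic data is an assumption made for the example: two columns depend on the label and two are pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(scale=0.5, size=n),       # informative
    -2 * y + rng.normal(scale=1.0, size=n),  # informative
    rng.normal(size=n),                      # noise
    rng.normal(size=n),                      # noise
])

# Score each feature independently against the target and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print(selector.scores_.round(1), selected_idx)
```

Filter methods are fast because no model is trained, but they score features one at a time and can miss features that matter only in combination.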
Method 2: Wrapper (model-based)
Forward selection
Backward elimination
Recursive feature elimination (RFE)
Genetic algorithms
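As one example of a wrapper method, recursive feature elimination is available in scikit-learn as `RFE`. The sketch assumes a logistic regression as the wrapped estimator and the same kind of synthetic data (two informative columns, two noise columns). Both choices are for illustration only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(scale=0.5, size=n),       # informative
    -2 * y + rng.normal(scale=1.0, size=n),  # informative
    rng.normal(size=n),                      # noise
    rng.normal(size=n),                      # noise
])

# Repeatedly fit the model and drop the feature with the smallest |coefficient|
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Wrapper methods retrain the model many times, so they cost more than filter methods but account for how features behave together.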
Method 3: Embedded (built into model)
L1 regularization (Lasso)
Feature importance (Tree-based)
Permutation importance (model-agnostic; computed on a fitted model rather than during training)
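L1 regularization illustrates the embedded approach: the penalty drives some coefficients to exactly zero while the model is being fit. The sketch below uses scikit-learn's `Lasso` on synthetic regression data. The data, the true coefficients, and `alpha=0.2` are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
# Only the first two columns influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

lasso = Lasso(alpha=0.2).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero coefficients

print(lasso.coef_.round(2), kept)
```

The L1 penalty is sensitive to feature scale, so with real data the features are usually standardized before fitting.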
Python Implementation
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X is assumed to be a pandas DataFrame containing the columns listed below,
# and y the matching target labels; neither is defined in this snippet.

# Define preprocessing: scale numeric columns, one-hot encode categorical ones
numerical = ['age', 'income', 'score']
categorical = ['gender', 'city', 'category']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical)
])

# Create pipeline: preprocessing is fit only on the data passed to fit()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split first, then train on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
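The pipeline above depends on `X` and `y` being loaded elsewhere. A self-contained variant on synthetic data is sketched below. The DataFrame, its columns, and the toy target are all invented for illustration. It scores the pipeline with cross-validation: each fold refits the scaler and encoder on that fold's training rows only, which is how a pipeline guards against data leakage.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "city": rng.choice(["paris", "tokyo", "lima"], size=n),
})
y = (X["income"] > 50_000).astype(int)  # toy target derived from income

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# 5-fold cross-validation; preprocessing is refit inside every fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence training, which the pipeline avoids by construction.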
Key Takeaways
Summary: Feature Engineering
- Feature engineering is often more important than model choice
- Scale numerical features for distance-based algorithms
- One-hot encode categorical variables for most models
- Create interaction features to capture relationships
- Feature selection removes noise and speeds up training
- Use pipelines to prevent data leakage
- Domain knowledge guides the best feature engineering
- Automated tools (featuretools) can generate features
What to Learn Next
-> Dimensionality Reduction: Reduce high-dimensional features using PCA, t-SNE, and UMAP while preserving key information.
-> Model Evaluation: Measure how much your engineered features actually improve model performance.
-> Linear Regression: See how feature scaling and encoding directly impact linear model accuracy.
-> Clustering: Use unsupervised techniques to discover hidden groups and create new features.
-> Model Selection: Choose the best algorithm and tune hyperparameters for your engineered features.
-> Model Deployment: Package your feature engineering pipeline into production-ready APIs and services.