Feature Engineering — Where Domain Knowledge Meets Data Science

Feature engineering transforms raw data into representations that a model can learn from more easily. It is often the most impactful step in a machine learning pipeline, and can matter more than the choice of algorithm.

Numerical Scaling — StandardScaler, MinMaxScaler, and RobustScaler prepare features for distance-based models
Categorical Encoding — one-hot, label, and target encoding convert categorical data into model-ready formats
Feature Creation — interaction terms, date components, and aggregations unlock hidden patterns in your data

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng

Mathematical Foundations

Standardization (Z-score)

z = \frac{x - \mu}{\sigma}

where

\mu = \frac{1}{n}\sum_{i=1}^{n} x_i

and

\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}
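
As a worked example of the formulas above, here is a minimal sketch that standardizes a small sample with the Python standard library. The sample values and the function name `standardize` are illustrative assumptions. `statistics.pstdev` is used because the formula for $\sigma$ divides by $n$ (the population standard deviation).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (divide by n)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # A constant feature has no spread, so the z-score is undefined.
        raise ValueError("constant feature: standard deviation is zero")
    return [(x - mu) / sigma for x in values]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data: mean 5, standard deviation 2
z_scores = standardize(sample)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The resulting values have mean 0 and standard deviation 1, which is what the definition requires.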

Min-Max Scaling

x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}
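
The same formula can be sketched in a few lines of plain Python. The function name `min_max_scale` and the sample values are assumptions made for illustration.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The range is zero for a constant feature, so the formula would divide by zero.
        raise ValueError("constant feature: max equals min")
    return [(x - lo) / (hi - lo) for x in values]

scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the output range, a single extreme value compresses every other point toward one end of the interval.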

Information Gain (Feature Selection)

IG(D, A) = H(D) - H(D|A)

where

H(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)

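is the entropy of dataset $D$, $p_k$ is the proportion of samples in class $k$, and $H(D|A)$ is the size-weighted average entropy of the subsets obtained by splitting $D$ on attribute $A$.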
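
To make the definition concrete, the sketch below computes information gain for two toy attributes. It assumes the usual conditional entropy $H(D|A) = \sum_v \frac{|D_v|}{|D|} H(D_v)$, where $D_v$ is the subset of $D$ with attribute value $v$. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D): entropy in bits of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute):
    """IG(D, A) = H(D) - H(D|A), with H(D|A) the size-weighted subset entropy."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, attribute):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]  # splits the classes exactly
useless = ["a", "b", "a", "b"]  # each subset is still half yes, half no

print(information_gain(labels, perfect))  # 1.0
print(information_gain(labels, useless))  # 0.0
```

An attribute that separates the classes completely recovers the full entropy of the labels (1 bit here), while an attribute unrelated to the labels gains nothing.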

Feature Engineering Pipeline

Numerical Features

Definition: StandardScaler (Z-score)

Standardizes each feature by subtracting its mean and dividing by its standard deviation, so the transformed feature has mean 0 and standard deviation 1.

Z-score Standardization

z = \frac{x - \mu}{\sigma}

Here,

$z$ = Standardized value
$x$ = Original value
$\mu$ = Mean of feature
$\sigma$ = Standard deviation of feature

Definition: MinMaxScaler

Scales features to a fixed range, typically [0, 1], by subtracting the minimum and dividing by the range.

Min-Max Scaling

x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}

Here,

$x_{\text{scaled}}$ = Scaled value
$\min(x), \max(x)$ = Minimum and maximum values of the feature

Definition: RobustScaler

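Centers each feature on its median and scales it by the interquartile range (IQR) instead of using the mean and standard deviation. Because the median and IQR are barely affected by extreme values, the result is robust to outliers.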
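
A minimal NumPy sketch of the idea, assuming the common choice of the 25th and 75th percentiles for the IQR. The function name `robust_scale` and the sample values are illustrative assumptions.

```python
import numpy as np

def robust_scale(x):
    """Subtract the median and divide by the interquartile range (Q3 - Q1)."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - median) / (q3 - q1)

values = [1, 2, 3, 4, 100]  # 100 is an outlier
scaled = robust_scale(values)
print(scaled)  # [-1.  -0.5  0.   0.5 48.5]
```

The four typical values stay evenly spread around zero, while the outlier remains clearly separated from them.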

Scaling Methods Diagram

RobustScaler:
Uses median and IQR
Robust to outliers
Use for: Data with outliers

Log Transform:
x_log = log(x + 1)
Use for: Skewed distributions, power laws
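
The log transform above can be sketched with NumPy. This assumes the natural logarithm and non-negative inputs. `np.log1p` computes log(1 + x) accurately for small x, and `np.expm1` inverts it. The income values are made up for illustration.

```python
import numpy as np

incomes = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])  # illustrative, heavily right-skewed

log_incomes = np.log1p(incomes)   # same as np.log(incomes + 1)
restored = np.expm1(log_incomes)  # inverse transform: exp(x) - 1

print(np.round(log_incomes, 3))
```

Each tenfold increase in the raw value adds a roughly constant amount on the log scale, which pulls in the long right tail.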

When to Use Each Scaler

StandardScaler: Most algorithms (SVM, KNN, Neural Networks)
MinMaxScaler: Neural networks, image data
RobustScaler: Data with outliers
Log Transform: Skewed distributions, power laws
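
To see why the choice matters, the following sketch applies three scikit-learn scalers to the same column containing one outlier. The data and the `results` dictionary are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # last value is an outlier

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # fit_transform learns the statistics from X and applies them in one step
    results[type(scaler).__name__] = scaler.fit_transform(X).ravel()

for name, scaled in results.items():
    print(f"{name:15s}", np.round(scaled, 2))
```

With MinMaxScaler the four typical values are squeezed into the bottom few percent of the [0, 1] range, while RobustScaler keeps them evenly spread around zero.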

Feature Creation

Date features:
  Year, Month, Day, Hour
  Day of week, Is weekend
  Is holiday, Season
  Days since event
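
Most of the date features listed above can be derived with the pandas `.dt` accessor. This is a minimal sketch: the column name `timestamp`, the sample dates, and the reference date used for "days since event" are all assumptions. Holiday and season flags are omitted because they depend on a locale-specific calendar.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-06 09:30", "2024-03-15 18:00", "2024-07-04 23:45"]
    )
})

ts = df["timestamp"].dt
df["year"] = ts.year
df["month"] = ts.month
df["day"] = ts.day
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek       # Monday = 0 ... Sunday = 6
df["is_weekend"] = ts.dayofweek >= 5   # Saturday or Sunday

reference = pd.Timestamp("2024-01-01")  # hypothetical event date
df["days_since_event"] = (df["timestamp"] - reference).dt.days

print(df)
```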

Text features:
  Word count, Character count
  TF-IDF vectors
  Word embeddings
  Sentiment scores

Interaction features:
  x₁ × x₂ (product)
  x₁ / x₂ (ratio)
  x₁ - x₂ (difference)
  x₁², x₂² (polynomial)
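
The interaction features above map directly onto column arithmetic in pandas. The column names `width` and `height` and their values are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.0, 4.0], "height": [1.0, 2.0, 8.0]})

df["product"] = df["width"] * df["height"]
df["ratio"] = df["width"] / df["height"]       # real data may need a guard against zero
df["difference"] = df["width"] - df["height"]
df["width_sq"] = df["width"] ** 2
df["height_sq"] = df["height"] ** 2

print(df)
```

For many columns at once, scikit-learn's `PolynomialFeatures` generates products and powers systematically, at the cost of a rapidly growing feature count.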

Aggregation features:
  Mean, Median, Std per group
  Count per category
  Rolling statistics
  Lag features
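
The aggregation features above can be sketched with pandas `groupby`. The `sales` table, its column names, and the window size of 2 are illustrative assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "amount": [10.0, 20.0, 30.0, 5.0, 5.0, 20.0],
})

# Lag and rolling features depend on row order within each group
sales = sales.sort_values(["store", "day"])
grouped = sales.groupby("store")["amount"]

sales["store_mean"] = grouped.transform("mean")    # mean per group
sales["store_count"] = grouped.transform("count")  # count per category
sales["lag_1"] = grouped.shift(1)                  # previous day's amount
sales["rolling_mean_2"] = grouped.transform(
    lambda s: s.rolling(window=2).mean()           # 2-day rolling mean
)

print(sales)
```

The first row of each store has no previous value, so `lag_1` and `rolling_mean_2` are NaN there. When the aggregated column is also the prediction target, lag or shift it before computing rolling statistics so the current target does not leak into its own features.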

Feature Selection

Definition: Feature Selection

The process of selecting a subset of relevant features for use in model construction. Removing irrelevant or redundant features can reduce overfitting, often improves accuracy, and shortens training time.

Feature Selection Methods

Method 1: Filter (statistical tests)
  Correlation with target
  Chi-squared test
  Mutual information
  ANOVA F-test
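
As one example of a filter method, the sketch below scores each feature with the ANOVA F-test and keeps the best one, using scikit-learn's `SelectKBest`. The synthetic data (one informative column followed by three noise columns) is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

informative = y + rng.normal(scale=0.3, size=n)  # tracks the class label
noise = rng.normal(size=(n, 3))                  # unrelated to the label
X = np.column_stack([informative, noise])

# Score every feature independently of any model, then keep the top k
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

print(np.round(selector.scores_, 1))
print(selector.get_support())  # boolean mask of the selected columns
```

Filter methods are fast because they look at each feature in isolation, which also means they can miss features that are useful only in combination.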

Method 2: Wrapper (model-based)
  Forward selection
  Backward elimination
  Recursive feature elimination (RFE)
  Genetic algorithms
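
Recursive feature elimination is available in scikit-learn as `RFE`. This sketch wraps a logistic regression and repeatedly drops the weakest feature. The synthetic data, in which only the first two columns determine the label, is an assumption.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only columns 0 and 1 matter

# Fit the model, remove the feature with the smallest coefficient, and repeat
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2).fit(X, y)

print(rfe.support_)  # mask of the retained features
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier
```

Wrapper methods refit the model many times, so they cost more than filter methods but account for how features behave together.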

Method 3: Embedded (built into model)
  L1 regularization (Lasso)
  Feature importance (Tree-based)
  Permutation importance (model-agnostic, computed after fitting)
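
L1 regularization illustrates the embedded approach: the penalty drives the coefficients of unhelpful features to exactly zero during training. A minimal sketch with scikit-learn's `Lasso`, where the synthetic data and the value `alpha=0.1` are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
# The target depends on the first two columns only; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices with non-zero coefficients

print(np.round(lasso.coef_, 2))
print(selected)
```

A larger `alpha` removes more features. In practice the value is usually tuned by cross-validation, for example with `LassoCV`.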

Python Implementation

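import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data for illustration only (an assumption): replace X and y with your own dataset
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'income': rng.normal(50_000, 15_000, size=n),
    'score': rng.uniform(0, 1, size=n),
    'gender': rng.choice(['F', 'M'], size=n),
    'city': rng.choice(['north', 'south', 'east'], size=n),
    'category': rng.choice(['a', 'b', 'c'], size=n),
})
y = (X['score'] > 0.5).astype(int)

# Define preprocessing: scale numerical columns, one-hot encode categorical ones
numerical = ['age', 'income', 'score']
categorical = ['gender', 'city', 'category']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical)
])

# Create pipeline: preprocessing statistics are learned only from the data passed to fit()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train on 80% of the data and evaluate on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")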
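
Wrapping preprocessing in a pipeline also matters for cross-validation: the scaler is refit on the training portion of every fold, so no statistics from the validation portion leak into training. A minimal sketch, in which the synthetic data and the choice of `LogisticRegression` are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0).astype(int)  # label depends on the first column only

# The whole pipeline is cloned and refit inside each fold
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print(np.round(scores, 3), round(scores.mean(), 3))
```

Scaling the full dataset before splitting would instead let the validation rows influence the mean and standard deviation used during training.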

Key Takeaways

Summary: Feature Engineering

Feature engineering is often more important than model choice
Scale numerical features for distance-based algorithms
One-hot encode categorical variables for most models
Create interaction features to capture relationships
Feature selection removes noise and speeds up training
Use pipelines to prevent data leakage
Domain knowledge guides the best feature engineering
Automated tools such as Featuretools can generate candidate features

What to Learn Next

-> Dimensionality Reduction: Reduce high-dimensional features using PCA, t-SNE, and UMAP while preserving key information.

-> Model Evaluation: Measure how much your engineered features actually improve model performance.

-> Linear Regression: See how feature scaling and encoding directly impact linear model accuracy.

-> Clustering: Use unsupervised techniques to discover hidden groups and create new features.

-> Model Selection: Choose the best algorithm and tune hyperparameters for your engineered features.

-> Model Deployment: Package your feature engineering pipeline into production-ready APIs and services.
