Supervised Learning

From Scatter Plots to Predictions — The Simplest ML Algorithm

Linear regression finds the best straight line through your data. It is fast, interpretable, and a powerful baseline for any regression problem.

Ordinary Least Squares — The closed-form solution for optimal parameters
Gradient Descent — The iterative optimization approach that scales
Evaluation Metrics — R², MSE, and MAE for measuring performance

"All models are wrong, but some are useful." — George Box

Linear Regression — Complete Guide

Linear regression is one of the simplest and most widely used ML algorithms. It models the relationship between an input and a target as a straight line (or, with several features, as a hyperplane).

Simple Linear Regression

Definition: Linear Regression

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}$ and $y^{(i)} \in \mathbb{R}$, linear regression seeks parameters $w$ (slope) and $b$ (intercept) that minimize the sum of squared residuals: $\min_{w,b} \sum_{i=1}^{N}(y^{(i)} - (wx^{(i)} + b))^2$

Simple Linear Regression

$$\hat{y} = wx + b$$

Here,

$\hat{y}$ = Predicted value
$x$ = Input feature
$w$ = Slope (weight)
$b$ = Y-intercept (bias)
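
For a single feature, the sum of squared residuals has a well-known closed-form minimizer: $w = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}$ and $b = \bar{y} - w\bar{x}$. A minimal pure-Python sketch of that formula follows; the function name `fit_simple_ols` and the toy data are illustrative assumptions, not part of the original guide.

```python
def fit_simple_ols(xs, ys):
    """Return (w, b) minimizing the sum of squared residuals for y ~ w*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: the slope is S_xy / S_xx.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
w, b = fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

Because the toy points lie exactly on a line, the fitted slope and intercept should match the line that generated them.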

Example: House Prices

For predicting house prices: $y$ = price, $x$ = square footage, $w$ = price per square foot, $b$ = base price. For a house with 2000 sq ft, if $w = 150$ and $b = 50000$:

$$\hat{y} = 150 \times 2000 + 50000 = \$350{,}000$$
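
The same arithmetic can be written as a tiny function. This is only a sketch: `predict_price` is a hypothetical name, and the default values of `w` and `b` are the example numbers above, not fitted parameters.

```python
def predict_price(sqft, w=150.0, b=50_000.0):
    """Predicted price for a house of `sqft` square feet under y_hat = w*x + b."""
    return w * sqft + b


price = predict_price(2000)
print(f"${price:,.0f}")
```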

Finding the Best Line

Ordinary Least Squares (OLS)

Definition: Normal Equation

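The OLS closed-form solution for multiple linear regression with design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ and target $\mathbf{y} \in \mathbb{R}^N$, assuming $\mathbf{X}^T\mathbf{X}$ is invertible:

$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

This minimizes $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. Setting the gradient $-2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$ to zero gives $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$, which yields the formula above.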
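
A minimal NumPy sketch of the normal equation, under stated assumptions: the data are synthetic, the names `true_w`, `w_normal`, and `w_lstsq` are made up for illustration, and no intercept column is included. It solves the linear system $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$ instead of forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): N = 200 samples, d = 3 features.
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=N)

# Normal equation: solve (X^T X) w = X^T y rather than inverting X^T X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both routes should agree to within floating-point error on well-conditioned data like this; `np.linalg.lstsq` is generally the safer option when columns of `X` are nearly dependent.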

Computational Cost

The normal equation requires forming $\mathbf{X}^T\mathbf{X}$, which costs $O(Nd^2)$, and then inverting (or factorizing) it, which costs $O(d^3)$. For high-dimensional data (on the order of $d > 10{,}000$), gradient descent is often preferred at $O(Nd)$ per iteration.

Gradient Descent

Definition: Gradient Descent for Linear Regression

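Minimize the mean squared error $L(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2$ with learning rate $\alpha > 0$. (The factor $\frac{1}{N}$ rescales the OLS objective without changing its minimizer.) Initialize $\mathbf{w} = \mathbf{0}$ and $b = 0$, then iterate until convergence:

$$w_j \leftarrow w_j - \alpha \cdot \frac{\partial L}{\partial w_j} = w_j + \frac{2\alpha}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$

$$b \leftarrow b - \alpha \cdot \frac{\partial L}{\partial b} = b + \frac{2\alpha}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})$$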
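
The two update rules can be sketched in vectorized NumPy as follows. The function name `gradient_descent`, the learning rate `alpha=0.1`, the fixed iteration count, and the synthetic data are all assumptions chosen for illustration; a practical implementation would usually add a convergence check and feature scaling.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize the MSE loss L = (1/N) * sum((y - y_hat)^2) over w and b."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        residual = y - (X @ w + b)  # y^(i) - y_hat^(i) for every sample
        # Both parameters are updated from the same residual vector.
        w = w + (2 * alpha / N) * (X.T @ residual)
        b = b + (2 * alpha / N) * residual.sum()
    return w, b


# Synthetic, noise-free data (assumed): y = 3*x1 - 2*x2 + 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0]) + 4.0

w, b = gradient_descent(X, y)
print(w, b)
```

With standardized features like these, a step size of 0.1 is comfortably inside the stable range; on unscaled features the same value may diverge.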

[Figure: Cost function surface and gradient descent path]

Multiple Linear Regression

$$\hat{y} = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b = \mathbf{w}^T\mathbf{x} + b$$

Here,

$\mathbf{x} \in \mathbb{R}^d$ = Input feature vector
$\mathbf{w} \in \mathbb{R}^d$ = Weight vector
$b$ = Bias term

Matrix Form

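In matrix notation: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$, where $\mathbf{X} \in \mathbb{R}^{N \times (d+1)}$ includes a bias column of 1s and $\mathbf{w} \in \mathbb{R}^{d+1}$ includes $b$ as the matching entry. The same matrix-vector form underlies other linear models and the linear layers of neural networks.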
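
To see that the matrix form and the per-sample form $\mathbf{w}^T\mathbf{x} + b$ agree, the sketch below augments a small feature matrix with a column of 1s. All values and the names `X_raw`, `X_aug`, and `w_aug` are made up for illustration.

```python
import numpy as np

# Illustrative values (assumed): 3 samples, d = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

# Prepend a column of ones so the bias becomes the first entry of the weight vector.
X_aug = np.column_stack([np.ones(len(X_raw)), X_raw])  # shape (N, d + 1)
w_aug = np.concatenate([[b], w])                       # shape (d + 1,)

y_hat_matrix = X_aug @ w_aug      # matrix form
y_hat_explicit = X_raw @ w + b    # w^T x + b for each row

print(y_hat_matrix)
print(y_hat_explicit)
```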

Evaluation Metrics

R-squared

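$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{N}(y^{(i)} - \bar{y})^2}$$

Here,

$R^2$ = Coefficient of determination (at most 1; between 0 and 1 for OLS with an intercept on the training data, and possibly negative on held-out data)
$SS_{res}$ = Sum of squared residuals
$SS_{tot}$ = Total sum of squares
$\bar{y}$ = Mean of the observed targets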
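
A minimal pure-Python sketch of the formula, with two reference points: a perfect fit and a model that always predicts the mean. The function name `r_squared` and the sample values are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)                  # every residual is zero
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicts the mean
print(perfect, mean_only)
```

The mean-only model is the natural baseline: any model scoring above it explains some of the variance that the mean alone does not.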

Adjusted R²

For multiple regression with $d$ features: $R^2_{adj} = 1 - \frac{(1-R^2)(N-1)}{N-d-1}$. This penalizes adding features that don't improve the model.
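
The penalty is easiest to see by holding $R^2$ fixed and varying the number of features. The sketch below does that; the function name `adjusted_r_squared` and the numbers (an $R^2$ of 0.90 on 50 samples) are assumptions picked for illustration.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - d - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Same raw R^2 = 0.90 on N = 50 samples, with 2 versus 20 features.
few = adjusted_r_squared(0.90, n_samples=50, n_features=2)
many = adjusted_r_squared(0.90, n_samples=50, n_features=20)
print(round(few, 4), round(many, 4))
```

The model with 20 features receives the lower adjusted score even though its raw $R^2$ is identical.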

Assumptions

Critical Assumptions (Gauss-Markov Theorem)

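Under the Gauss-Markov assumptions, the OLS estimator is BLUE (Best Linear Unbiased Estimator). The assumptions are: (1) Linearity in the parameters, (2) Independence of errors (zero-mean and uncorrelated), (3) Homoscedasticity (constant variance), and (4) No perfect multicollinearity. Normality of residuals is not required for BLUE; it is an additional assumption used for exact confidence intervals and hypothesis tests.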
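
One common way to screen for multicollinearity is the variance inflation factor, $\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. The NumPy sketch below is an illustration under assumptions: the helper name `variance_inflation_factors` and the toy data are made up, and thresholds for "too high" (values around 5 to 10 are often quoted) are rules of thumb rather than part of the theorem.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    N, d = X.shape
    vifs = []
    for j in range(d):
        target = X[:, j]
        # Remaining features plus an intercept column.
        others = np.column_stack([np.ones(N), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        vifs.append(ss_tot / ss_res)  # algebraically equal to 1 / (1 - R_j^2)
    return np.array(vifs)


rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.05 * rng.normal(size=300)  # nearly a copy of x1 (assumed toy setup)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first and third columns should show very large factors while the independent second column stays close to 1.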

Polynomial Regression

Definition: Polynomial Regression

Extends linear regression by adding polynomial terms: $\hat{y} = w_1x + w_2x^2 + \cdots + w_px^p + b$. Despite the nonlinearity in $x$, it is still linear in the parameters $w$, so OLS applies after feature transformation.

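from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# X: array of shape (n_samples, n_features); y: targets (both assumed to exist)
poly = PolynomialFeatures(degree=2)  # adds squared and interaction terms plus a constant column
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)  # OLS on the expanded features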
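
To make "linear in the parameters" concrete, the sketch below builds the columns $1$, $x$, and $x^2$ by hand and solves ordinary least squares with NumPy instead of scikit-learn. The data are synthetic and noise-free, and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic, noise-free data (assumed): y = 1 + 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Feature transformation: columns [1, x, x^2]; the model is linear in (b, w_1, w_2).
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed features.
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
b, w1, w2 = coef
print(b, w1, w2)
```

Since the data contain no noise, the recovered coefficients should match the generating polynomial up to floating-point error.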

Key Takeaways

Summary: Linear Regression

Linear regression finds the best $\hat{y} = \mathbf{w}^T\mathbf{x} + b$ by minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
OLS gives closed-form $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ ; gradient descent is iterative
The loss surface is convex, so gradient descent with a suitably small learning rate converges to a global minimum
R² measures proportion of variance explained: $R^2 = 1 - SS_{res}/SS_{tot}$
Check assumptions (linearity, independence, homoscedasticity, no multicollinearity; normality of residuals for inference)
Polynomial regression extends to nonlinear relationships
Regularization (Ridge L2, Lasso L1) helps reduce overfitting in high dimensions
Linear regression is fast, interpretable, and a great baseline

What to Learn Next

-> Logistic Regression: Classification with probability, from linear to sigmoid.

-> Regularization: Prevent overfitting with Ridge, Lasso, and Elastic Net.

-> Model Evaluation: How to know if your model actually works, beyond accuracy.
