
Linear Regression — Complete Guide with Math and Code


Supervised Learning

From Scatter Plots to Predictions — The Simplest ML Algorithm

Linear regression finds the best straight line through your data. It is fast, interpretable, and a powerful baseline for any regression problem.

  • Ordinary Least Squares — The closed-form solution for optimal parameters
  • Gradient Descent — The iterative optimization approach that scales
  • Evaluation Metrics — R², MSE, and MAE for measuring performance

"All models are wrong, but some are useful." — George Box

Linear Regression — Complete Guide

Linear regression is one of the simplest and most widely used ML algorithms. It models the relationship between input features and a continuous target as a straight line (a hyperplane when there are several features).


Simple Linear Regression

Definition: Linear Regression

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}$ and $y^{(i)} \in \mathbb{R}$, linear regression seeks parameters $w$ (slope) and $b$ (intercept) that minimize the sum of squared residuals:

$$\min_{w,b} \sum_{i=1}^{N}\left(y^{(i)} - (wx^{(i)} + b)\right)^2$$

Simple Linear Regression

$$\hat{y} = wx + b$$

Here,

  • $\hat{y}$ = Predicted value
  • $x$ = Input feature
  • $w$ = Slope (weight)
  • $b$ = Y-intercept (bias)
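
[Figure: Linear Regression: Fitting the Best Line. Axes: x, y. Training data points and the best-fit line $\hat{y} = wx + b$.]

Cost Function: Residuals

Each residual: $e_i = y_i - (wx_i + b)$

Minimize the sum of squared residuals: $L(w,b) = \sum_i e_i^2 = \sum_i (y_i - wx_i - b)^2$

Analytical solution (set the partial derivatives to zero):

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; b = \bar{y} - w\bar{x}$$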
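
The slope and intercept formulas from the analytical solution can be sketched directly in NumPy. The function name `fit_simple_linear_regression` and the toy data are assumptions made for illustration only; they are not part of the lesson.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (w, b) minimizing the sum of squared residuals for y = w*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of centered cross-products divided by sum of squared centered x
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point (x_mean, y_mean)
    b = y_mean - w * x_mean
    return w, b

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = fit_simple_linear_regression(x, y)
```

Because the toy points lie exactly on a line, the recovered `w` and `b` should match the generating slope and intercept up to floating-point error.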

Example: House Prices

For predicting house prices: $y$ = price, $x$ = square footage, $w$ = price per square foot, $b$ = base price. For a house with 2000 sq ft, if $w = 150$ and $b = 50000$:

$$\hat{y} = 150 \times 2000 + 50000 = \$350{,}000$$

Finding the Best Line

Ordinary Least Squares (OLS)

Definition: Normal Equation

The OLS closed-form solution for multiple linear regression with design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ and target $\mathbf{y} \in \mathbb{R}^N$:

$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

This minimizes $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$.
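
As a minimal sketch of the normal equation, the snippet below solves the linear system $(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the lesson prescribes. The helper name `normal_equation`, the random data, and `true_w` are assumptions for illustration.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w without computing the inverse."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noiseless data generated from known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w_hat = normal_equation(X, y)
```

With noiseless targets and a well-conditioned $\mathbf{X}$, `w_hat` should agree with `true_w` up to rounding.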

Computational Cost

The normal equation requires forming $\mathbf{X}^T\mathbf{X}$, which costs $O(Nd^2)$, and then inverting it (or solving the equivalent linear system), which costs $O(d^3)$. For high-dimensional data ($d > 10{,}000$), gradient descent is usually preferred at $O(Nd)$ per iteration.

Gradient Descent

Definition: Gradient Descent for Linear Regression

Take $L$ to be the mean squared error, $L = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2$, and let $\alpha$ be the learning rate. Initialize $\mathbf{w} = \mathbf{0}$ and $b = 0$, then iterate:

$$w_j \leftarrow w_j - \alpha \cdot \frac{\partial L}{\partial w_j} = w_j + \frac{2\alpha}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$
$$b \leftarrow b - \alpha \cdot \frac{\partial L}{\partial b} = b + \frac{2\alpha}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})$$
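
The two update rules can be sketched as a short batch gradient descent loop. The function name `gradient_descent`, the default `alpha=0.1` and `n_iters=2000`, and the synthetic data are all assumptions chosen so the example converges; real data usually needs feature scaling and a tuned learning rate.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on the mean squared error; returns (w, b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        residual = y - (X @ w + b)                 # y^(i) - y_hat^(i) for every i
        w += (2.0 * alpha / N) * (X.T @ residual)  # update every w_j at once
        b += (2.0 * alpha / N) * residual.sum()
    return w, b

# Hypothetical noiseless data: y = 3*x1 - 1*x2 + 4
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 4.0
w, b = gradient_descent(X, y)
```

Each iteration costs one pass over the data, which is where the $O(Nd)$ per-iteration figure comes from.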

Cost Function Surface and Gradient Descent Path

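[Figure: Cost Function Surface and Gradient Descent Trajectory. Left panel: contour plot of L(w, b) over w (weight), with the path from Start to Optimal. Right panel: 3D loss surface L(w, b), from t=0 to the minimum.]

The loss surface is convex for linear regression, so gradient descent with a suitable learning rate finds the global minimum.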

Multiple Linear Regression

Multiple Linear Regression

$$\hat{y} = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b = \mathbf{w}^T\mathbf{x} + b$$

Here,

  • $\mathbf{x} \in \mathbb{R}^d$ = Input feature vector
  • $\mathbf{w} \in \mathbb{R}^d$ = Weight vector
  • $b$ = Bias term

Matrix Form

In matrix notation: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$ where $\mathbf{X} \in \mathbb{R}^{N \times (d+1)}$ includes a bias column of 1s and $\mathbf{w} \in \mathbb{R}^{d+1}$ absorbs the bias $b$ as the matching entry. The same matrix form underlies other linear models and the linear layers of neural networks.
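
A small sketch of the matrix form: append a column of 1s so the bias becomes one more weight, then fit with a least-squares solver. Placing the bias column last is an arbitrary choice made here, and `add_bias_column` and the synthetic data are hypothetical names and values for illustration.

```python
import numpy as np

def add_bias_column(X):
    """Append a column of 1s so the bias is learned as the last weight."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Hypothetical noiseless data: y = 2*x1 + 0*x2 - 1*x3 + 5
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(100, 3))
y = X_raw @ np.array([2.0, 0.0, -1.0]) + 5.0

X = add_bias_column(X_raw)                  # shape (N, d + 1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # last entry of w is the bias
y_hat = X @ w                               # predictions for all N rows at once
```

`np.linalg.lstsq` minimizes the same squared-error objective as the normal equation but tends to behave better when $\mathbf{X}^T\mathbf{X}$ is poorly conditioned.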


Evaluation Metrics

Regression Evaluation Metrics

  • MSE (Mean Squared Error): $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. Penalizes large errors; differentiable; sensitive to outliers.
  • RMSE (Root MSE): $\sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}$. Same units as $y$; interpretable; most common metric.
  • MAE (Mean Absolute Error): $\frac{1}{N}\sum_i |y_i - \hat{y}_i|$. Robust to outliers; L1 loss variant; not differentiable at 0.
  • R² (Coefficient of Determination): $1 - SS_{res}/SS_{tot}$. Scale-independent; 1.0 = perfect; fraction of variance explained.
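
The three error metrics can be computed in a few lines. This is a sketch with a hypothetical helper `regression_errors` and made-up predictions; the last prediction is deliberately far off to show how a single large error affects MSE more than MAE.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for paired targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": np.mean(np.abs(err))}

# Hypothetical values; errors are [1, 0, -1, -4]
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])
metrics = regression_errors(y_true, y_pred)
# MSE = (1 + 0 + 1 + 16) / 4 = 4.5, MAE = (1 + 0 + 1 + 4) / 4 = 1.5
```

The single error of 4 contributes 16 of the 18 squared-error units, which is the outlier sensitivity noted for MSE.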

R-squared

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

Here,

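  • $R^2$ = Coefficient of determination (0 to 1 for an OLS fit with an intercept evaluated on its training data; it can be negative on held-out data)
  • $SS_{res}$ = Sum of squared residuals
  • $SS_{tot}$ = Total sum of squares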
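
As a sketch, $R^2$ follows directly from the two sums of squares. The function name `r_squared` and the three toy prediction vectors are assumptions for illustration; they are chosen to show a perfect fit, a predict-the-mean baseline, and predictions worse than that baseline.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y_true, y_true)                     # predictions equal targets
r2_mean = r_squared(y_true, np.full(4, y_true.mean()))     # always predict the mean
r2_reversed = r_squared(y_true, y_true[::-1])              # SS_res = 20, SS_tot = 5
```

Here `r2_perfect` is 1, `r2_mean` is 0, and `r2_reversed` is 1 - 20/5 = -3, which shows that $R^2$ drops below zero when predictions are worse than simply predicting the mean.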

Adjusted R²

For multiple regression with $d$ features: $R^2_{adj} = 1 - \frac{(1-R^2)(N-1)}{N-d-1}$. This penalizes adding features that don't improve the model.
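
A quick sketch of the adjusted $R^2$ formula, using a hypothetical helper `adjusted_r_squared` and made-up numbers, shows how the same raw $R^2$ is discounted more heavily as the feature count grows.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - d - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw R^2 of 0.90 on N = 50 samples, with 2 versus 20 features (illustrative values)
adj_few = adjusted_r_squared(0.90, n_samples=50, n_features=2)
adj_many = adjusted_r_squared(0.90, n_samples=50, n_features=20)
```

With these inputs `adj_few` stays close to 0.90 while `adj_many` is noticeably lower, so extra features have to raise $R^2$ enough to pay for themselves.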


Assumptions

Critical Assumptions (Gauss-Markov Theorem)

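Under the Gauss-Markov assumptions, the OLS estimator is BLUE (Best Linear Unbiased Estimator). The assumptions usually checked in practice are: (1) Linearity, (2) Independence of errors, (3) Homoscedasticity (constant variance), (4) Normality of residuals, (5) No multicollinearity. Normality is not needed for the BLUE property itself; it matters for exact confidence intervals and hypothesis tests in small samples.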

Assumption Diagnostics

  1. Linearity: residual plot
  2. Normality: Q-Q plot / histogram
  3. Homoscedasticity: equal spread of residuals
  4. Independence: Durbin-Watson test
  5. No multicollinearity: VIF < 10 (a common rule of thumb)
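
Two of these diagnostics are simple enough to sketch in plain NumPy. The Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals (values near 2 suggest no first-order autocorrelation), and the VIF of feature $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on the others. The function names and synthetic data below are assumptions for illustration, not a replacement for a statistics library.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    out = np.empty(d)
    for j in range(d):
        target = X[:, j]
        # Regress feature j on the remaining features plus an intercept column
        others = np.hstack([np.delete(X, j, axis=1), np.ones((N, 1))])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out[j] = ss_tot / ss_res   # equals 1 / (1 - R_j^2)
    return out

# Hypothetical features: x3 is nearly a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
dw_noise = durbin_watson(rng.normal(size=500))   # independent "residuals"
```

In this setup the VIFs for `x1` and `x3` should be far above 10 while the VIF for `x2` stays near 1, and `dw_noise` should land close to 2.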

Polynomial Regression

Definition: Polynomial Regression

Extends linear regression by adding polynomial terms: $\hat{y} = w_1x + w_2x^2 + \cdots + w_px^p + b$. Despite the nonlinearity in $x$, it is still linear in the parameters $w$, so OLS applies after feature transformation.

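from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# X: feature matrix of shape (N, d); y: target vector of shape (N,)
# Expand to degree-2 terms; skip the constant column because LinearRegression fits its own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)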
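
The same idea can be sketched without scikit-learn to make the "linear in the parameters" point explicit: build the columns $x, x^2, \ldots, x^p$ by hand, add a bias column, and solve an ordinary least-squares problem. The helper `polynomial_features` and the quadratic toy data are assumptions for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Stack x, x^2, ..., x^degree as columns (single input feature)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

# Hypothetical noiseless data: y = -1*x + 0.5*x^2 + 3
x = np.linspace(-2.0, 2.0, 21)
y = 0.5 * x ** 2 - x + 3.0

# Transformed design matrix with a trailing bias column
Phi = np.hstack([polynomial_features(x, 2), np.ones((x.size, 1))])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
w1, w2, b = coef
```

The fit is an ordinary linear regression on the transformed columns, so `w1`, `w2`, and `b` should recover -1, 0.5, and 3 up to rounding.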

Key Takeaways

Summary: Linear Regression

  1. Linear regression finds the best $\hat{y} = \mathbf{w}^T\mathbf{x} + b$ by minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
  2. OLS gives closed-form $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$; gradient descent is iterative
  3. The loss surface is convex — gradient descent finds the global minimum
  4. $R^2$ measures the proportion of variance explained: $R^2 = 1 - SS_{res}/SS_{tot}$
  5. Check assumptions (linearity, normality, homoscedasticity, independence)
  6. Polynomial regression extends to nonlinear relationships
  7. Regularization (Ridge L2, Lasso L1) prevents overfitting in high dimensions
  8. Linear regression is fast, interpretable, and a great baseline

What to Learn Next

-> Logistic Regression Classification with probability — from linear to sigmoid.

-> Regularization Prevent overfitting with Ridge, Lasso, and Elastic Net.

-> Model Evaluation How to know if your model actually works — beyond accuracy.
