Linear Regression: Math, Code and Assumptions
The Foundation of Machine Learning
Linear regression is one of the most fundamental algorithms in machine learning. Despite its simplicity, understanding it deeply provides insight into many other supervised learning methods.
Figure: ML Algorithm Landscape. Supervised learning algorithms grouped into four families: Linear (Linear Regression, Logistic Regression, Ridge/Lasso), Tree-Based (Decision Tree, Random Forest, XGBoost), Neural (Perceptron, MLP, Deep Learning), and Support Vector (Linear SVM, Kernel SVM, SVR). Linear regression is the foundation, so understand it first.
1. Simple Linear Regression
Mathematical Formulation
Model:
$$y = \beta_0 + \beta_1 x + \epsilon, \qquad \hat{y} = \beta_0 + \beta_1 x$$
Here $y$ is the observed target and $\hat{y}$ is the model's prediction. The error term $\epsilon$ belongs to the data, not to the prediction.
Where:
$\beta_0$ = intercept (bias): the predicted value of $y$ when $x = 0$
$\beta_1$ = slope (weight): the change in the predicted $y$ for a one-unit change in $x$
$\epsilon$ = error term, assumed to follow $\epsilon \sim N(0, \sigma^2)$
Figure: Simple Linear Regression: Finding the Best Fit Line. Axes: Feature (x) and Target (y). Shown: actual data points, the regression line with intercept β₀ and slope β₁, and the residuals (errors) eᵢ.
How this diagram works: This diagram shows the core concept of simple linear regression — fitting a straight line through scattered data points to model the relationship between a feature (x) and a target (y). The blue data points represent actual observations, while the purple regression line represents the model's predictions. The red dashed lines (residuals) show the vertical distance between each data point and the line, representing prediction errors. The goal of linear regression is to minimize these residuals by finding the optimal intercept (β₀) and slope (β₁) that produce the smallest total squared error.
2. Cost Function (Ordinary Least Squares)
Mean Squared Error (MSE):
$$J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$
Goal: Find $\beta_0, \beta_1$ that minimize $J$
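As a minimal sketch of the MSE cost, the snippet below evaluates $J$ for two candidate lines. The helper name `mse_cost` and the toy data are assumptions made up for illustration; they are not part of the original article.

```python
import numpy as np

def mse_cost(beta0, beta1, x, y):
    """Mean squared error J(beta0, beta1) for a simple linear model."""
    y_hat = beta0 + beta1 * x          # predictions for every sample
    return np.mean((y - y_hat) ** 2)   # (1/n) * sum of squared residuals

# Toy data (made up for illustration): y is exactly 1 + 2x, with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

cost_true = mse_cost(1.0, 2.0, x, y)   # the generating line: every residual is 0
cost_off = mse_cost(0.0, 2.0, x, y)    # intercept off by 1: every residual is 1

print(f"J at the generating line: {cost_true}")
print(f"J with intercept off by 1: {cost_off}")
```

Because every residual in the second case equals 1, the cost there is the mean of four ones. Any other choice of parameters on this noiseless data gives a strictly positive cost.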
Closed-Form Solution (Normal Equation):
$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
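The closed-form expressions can be sketched directly in NumPy. The function name `fit_simple_ols` and the synthetic data (true intercept 4, true slope 3) are assumptions for illustration. `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least squares estimates (beta0, beta1) for one feature."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Synthetic data (assumed for illustration): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

beta0, beta1 = fit_simple_ols(x, y)

# The same slope written as Cov(X, Y) / Var(X); both use the n - 1 divisor
beta1_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Independent cross-check: a degree-1 polynomial fit returns [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(f"closed form: beta0={beta0:.4f}, beta1={beta1:.4f}")
print(f"np.polyfit:  beta0={intercept_ref:.4f}, beta1={slope_ref:.4f}")
```

The covariance form gives the same slope regardless of whether both quantities are divided by $n$ or by $n - 1$, since the divisor cancels in the ratio.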
Figure: Cost Function: The Bowl-Shaped Surface. Axes: β₁ (slope) and J(β₀, β₁). Gradient descent paths lead down to the global minimum. The cost function is convex, so gradient descent finds the global minimum.
3. Gradient Descent
Update Rule:
$$\beta_j := \beta_j - \alpha \frac{\partial J}{\partial \beta_j}$$
Partial Derivatives:
$$\frac{\partial J}{\partial \beta_0} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
$$\frac{\partial J}{\partial \beta_1} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \cdot x_i$$
Where $\alpha$ = learning rate (step size)
Figure: Gradient Descent: Learning Rate Impact. Three panels: α = 0.1 (good learning rate), α = 1.0 (too large, oscillates), and α = 0.001 (too small, too slow).
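The update rule and partial derivatives can be sketched as a short batch gradient descent loop. The function name, the starting point (0, 0), the iteration counts, and the synthetic data are all assumptions for illustration. Which learning rates converge depends on the scale of the data, so the thresholds seen here apply to this data only.

```python
import numpy as np

def gradient_descent(x, y, alpha, n_iters):
    """Batch gradient descent on the MSE cost; returns (beta0, beta1, cost history)."""
    beta0, beta1 = 0.0, 0.0                      # arbitrary starting point
    history = []
    for _ in range(n_iters):
        residuals = y - (beta0 + beta1 * x)      # y_i - y_hat_i
        history.append(np.mean(residuals ** 2))  # cost J before this update
        grad0 = -2.0 * np.mean(residuals)        # dJ/d(beta0)
        grad1 = -2.0 * np.mean(residuals * x)    # dJ/d(beta1)
        beta0 -= alpha * grad0                   # simultaneous update of both parameters
        beta1 -= alpha * grad1
    return beta0, beta1, history

# Synthetic data (assumed for illustration): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

b0, b1, hist_good = gradient_descent(x, y, alpha=0.1, n_iters=2000)
_, _, hist_large = gradient_descent(x, y, alpha=1.0, n_iters=20)

# Closed-form reference: np.polyfit returns [slope, intercept] for deg=1
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(f"gradient descent: beta0={b0:.4f}, beta1={b1:.4f}")
print(f"closed form:      beta0={intercept_ref:.4f}, beta1={slope_ref:.4f}")
print(f"alpha=1.0: cost went from {hist_large[0]:.3e} to {hist_large[-1]:.3e}")
```

On this data the run with α = 0.1 settles on the closed-form solution, while the run with α = 1.0 overshoots on every step and its cost grows instead of shrinking.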
4. Multiple Linear Regression
Model:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$
Matrix Form:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$$
Where $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ (design matrix with intercept column)
Normal Equation (Matrix):
$$\boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
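A minimal sketch of the matrix normal equation in NumPy follows. The coefficients in `true_beta`, the sample size, and the noise level are made-up assumptions. The sketch solves the linear system $(\mathbf{X}^T\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$ instead of forming the inverse explicitly, which is the usual numerical practice, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (assumed for illustration): n samples, p features
rng = np.random.default_rng(1)
n, p = 200, 3
features = rng.normal(size=(n, p))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])    # [beta_0, beta_1, ..., beta_p], made up

# Design matrix: a leading column of ones carries the intercept beta_0
X = np.column_stack([np.ones(n), features])    # shape (n, p + 1)
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) beta = X^T y without computing the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares solver: a more robust choice when X^T X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(beta_normal, 3))
print("np.linalg.lstsq:", np.round(beta_lstsq, 3))
```

The normal equation requires $\mathbf{X}^T\mathbf{X}$ to be invertible. When features are nearly collinear, `np.linalg.solve` becomes unstable or fails, and a least squares solver is the safer option.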
Figure: Multiple Regression: Multiple Features → Single Output. Inputs x₁ (Size), x₂ (Beds), x₃ (Age), and x₄ (Baths), weighted by β₁ through β₄, feed a linear model ŷ = β₀ + Σ βⱼxⱼ that outputs ŷ (Price).
5. Model Evaluation Metrics
R² Score (Coefficient of Determination):
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
$R^2 = 1$: Perfect fit
$R^2 = 0$: Model is no better than always predicting the mean
$R^2 < 0$: Model is worse than predicting the mean
Adjusted R²:
$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$$
Where $n$ is the number of observations and $p$ is the number of features.
Figure: R² Score: How Well Does the Model Fit? SS_total = Σ(yᵢ - ȳ)² is the total variation. In the example shown, SS_explained = Σ(ŷᵢ - ȳ)² is 70% of the total and SS_residual is 30%, so R² = 1 - (30/100) = 0.70 (70% of the variance explained).
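Both metrics are short enough to compute by hand. The sketch below assumes hypothetical helper names (`r2_manual`, `adjusted_r2`) and a made-up set of five observations, and it treats the predictions as coming from a model with one feature.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p features."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Made-up observations and predictions (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5, 10.5])

r2 = r2_manual(y_true, y_pred)
r2_adj = adjusted_r2(r2, n=len(y_true), p=1)

# Baseline: always predicting the mean gives SS_res == SS_tot, so R^2 = 0
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}, mean baseline R^2 = {r2_mean:.4f}")
```

Here SS_tot is 40 and SS_res is 1, so R² is 0.975. The adjusted value is slightly lower because the factor (n - 1)/(n - p - 1) is greater than 1 whenever p is at least 1.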
6. Assumptions of Linear Regression
Five key assumptions to validate:
Linearity: the relationship between the features and y is linear
Independence: the errors are independent of one another
Homoscedasticity: the errors have constant variance
Normality of errors: ε ~ N(0, σ²)
No multicollinearity: the features are not strongly correlated with one another
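The multicollinearity assumption can be checked numerically. One common diagnostic, not named in this article, is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). The sketch below is a plain NumPy version under that assumption. The function name and the synthetic features (`size`, `beds`, `age`) are hypothetical.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    n, p = features.shape
    vifs = []
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # include an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features (assumed for illustration)
rng = np.random.default_rng(2)
size = rng.normal(size=500)
beds = 0.95 * size + 0.1 * rng.normal(size=500)   # nearly a rescaled copy of size
age = rng.normal(size=500)                        # generated independently of the others

vifs = variance_inflation_factors(np.column_stack([size, beds, age]))
for name, vif in zip(["size", "beds", "age"], vifs):
    print(f"VIF({name}) = {vif:.2f}")
```

A VIF near 1 means a feature is almost unrelated to the others. Values above roughly 5 to 10 are often treated as a warning sign, though that cutoff is a rule of thumb and not a strict test.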
Checking Assumptions with Residual Plots
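Figure: Residual Analysis: What to Look For. Three residual plots: random scatter (good), a funnel shape (bad, suggests non-constant variance), and a systematic pattern (bad, suggests a non-linear relationship).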
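The three residual shapes can also be summarized with rough numeric scores, as a complement to looking at the plots. The sketch below is an illustration under stated assumptions: `funnel_score` and `curvature_score` are hypothetical helpers, not standard statistical tests, and the three synthetic datasets are made up to mimic the random, funnel, and pattern cases.

```python
import numpy as np

def fit_line_residuals(x, y):
    """Fit y ~ beta0 + beta1 * x by least squares; return (fitted, residuals)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def funnel_score(fitted, residuals):
    """Correlation of |residual| with the fitted value; large means spread grows."""
    return np.corrcoef(fitted, np.abs(residuals))[0, 1]

def curvature_score(fitted, residuals):
    """Correlation of residual with squared distance from the mean fitted value."""
    centered_sq = (fitted - fitted.mean()) ** 2
    return np.corrcoef(centered_sq, residuals)[0, 1]

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
x = np.linspace(0.1, 10, 400)
datasets = {
    "random": 4 + 3 * x + rng.normal(scale=1.0, size=x.size),       # constant noise
    "funnel": 4 + 3 * x + rng.normal(scale=0.5 * x, size=x.size),   # noise grows with x
    "pattern": 4 + 3 * x + 0.5 * x ** 2 + rng.normal(size=x.size),  # missed curvature
}

scores = {}
for name, y in datasets.items():
    fitted, residuals = fit_line_residuals(x, y)
    scores[name] = (funnel_score(fitted, residuals), curvature_score(fitted, residuals))
    print(f"{name:8s} funnel={scores[name][0]:+.2f} curvature={scores[name][1]:+.2f}")
```

Scores near zero are consistent with well-behaved residuals. A clearly positive funnel score points to heteroscedasticity, and a large curvature score points to a relationship the straight line does not capture. A residual plot remains the more reliable check, since these two scores only detect the specific shapes they were built for.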
7. Implementation in Python
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate
print(f"Intercept (β₀): {model.intercept_[0]:.4f}")
print(f"Slope (β₁): {model.coef_[0][0]:.4f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Visualize
plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()
```
Key Takeaways
Linear regression finds the best-fit line through data points
Cost function (MSE) measures prediction error — minimize it
Gradient descent iteratively updates weights to find minimum
R² score tells you how much variance the model explains
Validate assumptions before trusting the model
Regularization (Ridge/Lasso) helps reduce overfitting
Next: Logistic Regression. Extend linear regression to classification with the sigmoid function.