Regression Analysis

Why It Matters

Regression models relationships between variables, enabling prediction and, when the study design supports it, causal inference. It is the foundation of predictive modeling, from simple trend lines to complex machine learning pipelines. Understanding regression assumptions and diagnostics helps ensure that model coefficients, p-values, and predictions are trustworthy. Without checking assumptions, regression results can be deeply misleading.

Overview

Simple linear regression models the relationship between one predictor and one outcome: $y = \beta_0 + \beta_1 x + \epsilon$. Multiple linear regression extends this to multiple predictors. Coefficients are estimated via ordinary least squares (OLS), which minimizes the sum of squared residuals. Key diagnostics include checking linearity (residuals vs. fitted plot), normality (Q-Q plot of residuals), homoscedasticity (constant residual variance), and independence (no autocorrelation). R² measures the proportion of variance explained; adjusted R² penalizes for adding predictors. Violated assumptions can bias coefficients (for example, unmodeled nonlinearity) or distort standard errors (heteroscedasticity, autocorrelation), and either problem invalidates inference.

Key Concepts

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \epsilon$$

Here,

$\beta_0$ = Intercept (value of y when x = 0)
$\beta_1$ = Slope (change in y per unit change in x)
$\epsilon$ = Error term, $\epsilon \sim N(0, \sigma^2)$

OLS Estimator for Slope

$$\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

Here,

$\hat{\beta}_1$ = Estimated slope (the best linear unbiased estimator, BLUE, under the Gauss-Markov assumptions)

OLS Estimator for Intercept

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

Here,

$\hat{\beta}_0$ = Estimated intercept
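
The two closed-form estimators above can be sketched directly in Python. This is a minimal illustration using only the standard library; the function name `fit_simple_ols` and the toy data are assumptions made for this example, not part of the lesson.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Denominator of the slope formula: sum of squared deviations of x
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    # Numerator of the slope formula: sum of cross-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical noise-free data generated from y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)  # 1.0 2.0
```

Because the toy data lie exactly on a line, the fit recovers the generating intercept and slope; with noisy data the estimates would only approximate them.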

Multiple Linear Regression

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Here,

$x_1, \ldots, x_p$ = Predictor variables
$\beta_1, \ldots, \beta_p$ = Partial regression coefficients (effect of each predictor holding others constant)
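
With several predictors there is no single-line slope formula, but the OLS coefficients can still be obtained by solving the normal equations $(X^\top X)\hat{\beta} = X^\top y$, where $X$ is the design matrix with a leading column of ones. The sketch below does this in plain Python with Gaussian elimination. The helper names `solve_linear_system` and `fit_ols` and the data are hypothetical, and numerical libraries typically use more stable decompositions than this.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] by solving the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from y = 2 + 3*x1 - x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (4, 2)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
coefs = fit_ols(rows, y)
print([round(c, 6) for c in coefs])
```

The singular-system error in the solver is the extreme case of multicollinearity: when one predictor is an exact linear combination of the others, the coefficients are not uniquely determined.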

R-Squared

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Here,

$SS_{res}$ = Sum of squared residuals
$SS_{tot}$ = Total sum of squares

Adjusted R-Squared

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Here,

$p$ = Number of predictors
$n$ = Sample size
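
Both fit measures translate directly into code. The following is a minimal sketch under the definitions above; the function names and the small data set are illustrative assumptions.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and fitted y_hat."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y is constant, so R^2 is undefined")
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed values and fitted values from a one-predictor model
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)
print(round(r2, 4), round(r2_adj, 4))
```

Adjusted R² is never larger than R² when R² is at most 1 and there is at least one predictor, and the gap widens as `p` grows relative to `n`.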

Diagnostic Checklist

Assumption	What to Check	How to Check	Remedy if Violated
Linearity	Linear relationship	Residuals vs. fitted plot	Add polynomial terms, transforms
Normality	Residuals ~ Normal	Q-Q plot, Shapiro-Wilk	Transform, robust regression
Homoscedasticity	Constant variance	Residuals vs. fitted (funnel = bad)	Weighted least squares, robust SE
Independence	No autocorrelation	Durbin-Watson test	Time series models, mixed effects
No multicollinearity	Predictors not highly correlated	Variance inflation factor (VIF); VIF > 10 is a common warning threshold	Remove/combine predictors, ridge regression
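
Two of the checks in the table reduce to short formulas and can be sketched without a statistics library. The Durbin-Watson statistic is $\sum_{t \ge 2}(e_t - e_{t-1})^2 / \sum_t e_t^2$, with values near 2 suggesting no first-order autocorrelation. The VIF of predictor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others; with exactly two predictors this reduces to $1/(1 - r^2)$ for their Pearson correlation $r$. The function names and data below are hypothetical, and the VIF helper assumes the two-predictor case only.

```python
from math import sqrt

def durbin_watson(residuals):
    """Durbin-Watson statistic; ranges from 0 to 4, near 2 is the benign case."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two (assumption)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical residual sequences
dw_runs = durbin_watson([1.0, 1.0, -1.0, -1.0])         # below 2: positive autocorrelation
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # above 2: negative autocorrelation

# Hypothetical predictors with correlation 0.6
vif = vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3])
print(dw_runs, dw_alternating, round(vif, 4))
```

These numbers are screening aids rather than verdicts: the residual plots in the table remain the primary check, and the VIF > 10 cutoff is a rule of thumb.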

Quick Example

Multiple Regression Prediction

Model: $\hat{y} = 2 + 3x_1 - x_2$. If $x_1 = 4$ and $x_2 = 2$:

$$\hat{y} = 2 + 3(4) - 2 = 2 + 12 - 2 = 12$$

Each coefficient represents the change in predicted $y$ for a one-unit increase in that predictor, holding all others constant. So $\hat{\beta}_1 = 3$ means each one-unit increase in $x_1$ raises $\hat{y}$ by 3 units at any fixed value of $x_2$, because the model has no interaction term.
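
The prediction and the partial-effect reading can both be checked in a few lines. This is a small sketch in which the function name `predict` is an assumption for illustration; the coefficients are those of the example model.

```python
def predict(x1, x2):
    """Fitted value from the example model y_hat = 2 + 3*x1 - x2."""
    return 2 + 3 * x1 - x2

y_hat = predict(4, 2)
print(y_hat)  # 12

# Raising x1 by one unit changes the prediction by the same amount
# whatever value x2 is held at, since the model is additive.
effects = [predict(5, x2) - predict(4, x2) for x2 in (0, 2, 10)]
print(effects)  # [3, 3, 3]
```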

R-Squared Interpretation

$R^2 = 0.85$ means 85% of the variance in $y$ is explained by the predictors. If adjusted $R^2$ drops to 0.82 after penalizing for 10 predictors, some of those predictors may not be adding value.
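
As a consistency check, and assuming the adjusted $R^2$ formula from the Key Concepts section applies, these two figures pin down the sample size. Substituting $R^2 = 0.85$, $R^2_{adj} = 0.82$, and $p = 10$:

$$0.82 = 1 - \frac{(1 - 0.85)(n - 1)}{n - 11} \;\Rightarrow\; 0.18(n - 11) = 0.15(n - 1) \;\Rightarrow\; 0.03n = 1.83 \;\Rightarrow\; n = 61$$

With a much larger sample the same 10 predictors would incur a smaller penalty, so the gap between the two values depends on $n$ as well as on $p$.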

Key Takeaways

Summary: Regression Analysis

Simple Linear Regression: $y = \beta_0 + \beta_1 x + \epsilon$. Slope captures the linear relationship between x and y.
Multiple Regression: Extends to multiple predictors. Each $\beta_j$ is a partial effect (holding others constant).
OLS: Minimizes $\sum(y_i - \hat{y}_i)^2$. The OLS estimators are BLUE (Best Linear Unbiased Estimators) under the Gauss-Markov assumptions.
R²: Proportion of variance explained. Adjusted R² penalizes for number of predictors.
Diagnostics: Always check linearity, normality, homoscedasticity, and independence before interpreting coefficients.
Multicollinearity: Correlated predictors inflate standard errors. Check the variance inflation factor (VIF); values above 10 are a common warning threshold.
Extrapolation: Predictions outside the range of observed x values are unreliable.

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Simple Linear Regression

Simple Linear Regression — Full derivation, OLS, geometric interpretation, and examples

OLS Estimation

OLS Estimation — Gauss-Markov theorem, BLUE properties, matrix formulation, and efficiency

Assumptions

Regression Assumptions — Gauss-Markov assumptions, what happens when they fail, and remedies

Diagnostics

Residual Analysis — Residual plots, Q-Q plots, influence measures, Cook's distance, and leverage
R-Squared and Adjusted R-Squared — Interpreting model fit, adjusted R², and information criteria

Multiple Regression

Multiple Linear Regression — Extending to multiple predictors, interpretation, and variable selection
