OLS Estimation: From First Principles


The Math Behind Regression Coefficients

Ordinary Least Squares finds the coefficient vector that minimizes the sum of squared residuals. Understanding OLS from first principles reveals why regression works and when it breaks down.

Data Science: Build a foundation for understanding regularization and advanced estimators
Econometrics: Derive the Gauss-Markov theorem and BLUE properties
Actuarial Science: Implement premium models with transparent coefficient derivation

The normal equations transform data into the best linear unbiased estimates.

Ordinary Least Squares (OLS) is the foundation of linear regression. Throughout, $\hat{\boldsymbol{\beta}}$ denotes the coefficient vector that minimizes the sum of squared residuals.

Matrix Formulation

Definition: The Linear Model in Matrix Notation

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

where:

$\mathbf{y}$ is the $n \times 1$ response vector
$\mathbf{X}$ is the $n \times p$ design matrix (first column is 1s for the intercept)
$\boldsymbol{\beta}$ is the $p \times 1$ coefficient vector
$\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$ is the error vector
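To make the notation concrete, the following is a minimal NumPy sketch that builds a design matrix with an intercept column and draws one response vector from the model. The helper names `make_design_matrix` and `simulate_linear_model`, the coefficient values, and the seed are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def make_design_matrix(features):
    """Prepend a column of ones (the intercept) to an (n, p-1) feature array."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def simulate_linear_model(X, beta, sigma, rng):
    """Draw y = X beta + eps with eps ~ N(0, sigma^2 I_n)."""
    eps = rng.normal(loc=0.0, scale=sigma, size=X.shape[0])
    return X @ beta + eps

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
n = 50
features = rng.uniform(0.0, 10.0, size=(n, 2))
X = make_design_matrix(features)         # n x p design matrix with p = 3
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical coefficients
y = simulate_linear_model(X, beta_true, sigma=1.5, rng=rng)

print(X.shape, y.shape)  # (50, 3) (50,)
```

With `sigma=0.0` the function returns exactly `X @ beta`, which is a quick way to check that the shapes line up with the definition.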

Derivation of the Normal Equations

Theorem: OLS Normal Equations

The OLS estimator minimizes:

\text{SSR} = \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Expanding:

\text{SSR} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}
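For readers who want the intermediate algebra, the expansion can be written out term by term. The block also lists the two standard matrix-calculus identities that differentiating this expression relies on, applied with $\mathbf{a} = \mathbf{X}^T\mathbf{y}$ and $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ (the symbols $\mathbf{a}$ and $\mathbf{A}$ are introduced here only for illustration).

```latex
\begin{aligned}
(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
  &= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta}
     - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}
     + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\
  &= \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}
     + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta},
  \qquad \text{since } \mathbf{y}^T\mathbf{X}\boldsymbol{\beta}
     = (\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y})^T \text{ is a scalar.}
\end{aligned}

\frac{\partial}{\partial \boldsymbol{\beta}}\,\boldsymbol{\beta}^T\mathbf{a} = \mathbf{a},
\qquad
\frac{\partial}{\partial \boldsymbol{\beta}}\,\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}
  = 2\mathbf{A}\boldsymbol{\beta} \quad (\mathbf{A} \text{ symmetric})
```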

Taking the derivative with respect to $\boldsymbol{\beta}$ and setting it to zero:

\frac{\partial \text{SSR}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}

This gives the normal equations:

\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}

If $\mathbf{X}^T\mathbf{X}$ is invertible (equivalently, $\mathbf{X}$ has full column rank, $\text{rank}(\mathbf{X}) = p$):

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Proof that this is a minimum: The Hessian is $\frac{\partial^2 \text{SSR}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = 2\mathbf{X}^T\mathbf{X}$, which is positive semi-definite, so SSR is convex and every solution of the normal equations is a global minimizer. If $\mathbf{X}$ has full column rank, the Hessian is positive definite and the minimizer is unique. $\square$
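The derivation can be checked numerically on a dataset small enough to verify by hand. The sketch below solves the normal equations as a linear system, confirms that the gradient vanishes at the solution, and compares the result with NumPy's least-squares routine. The function names and the four data points are assumptions made for illustration.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve X^T X beta = X^T y as a linear system (no explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ssr(X, y, beta):
    """Sum of squared residuals (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return float(r @ r)

def ssr_gradient(X, y, beta):
    """Gradient of SSR: -2 X^T y + 2 X^T X beta."""
    return -2.0 * X.T @ y + 2.0 * X.T @ X @ beta

# Hypothetical data: intercept column plus one predictor taking values 0..3.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

beta_hat = ols_normal_equations(X, y)
# By hand: X^T X = [[4, 6], [6, 14]], X^T y = [17, 37], so beta_hat = [0.8, 2.3].

grad_at_solution = ssr_gradient(X, y, beta_hat)          # numerically zero
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)        # reference solution

# Moving away from beta_hat in any direction should not decrease SSR.
perturbed = beta_hat + np.array([0.1, -0.05])
print(ssr(X, y, beta_hat) <= ssr(X, y, perturbed))  # True
```

Using `np.linalg.solve` rather than `np.linalg.inv` keeps the sketch close to the formula while avoiding an explicit inverse.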

The Hat Matrix

Hat Matrix (Projection Matrix)

\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}

Here,

$\mathbf{H}$ = $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, the hat matrix
$\hat{\mathbf{y}}$ = Fitted (predicted) values

$\mathbf{H}$ is an orthogonal projection matrix: $\mathbf{H}^2 = \mathbf{H}$ and $\mathbf{H}^T = \mathbf{H}$. It projects $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residuals $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ lie in its orthogonal complement, so $\mathbf{X}^T\mathbf{e} = \mathbf{0}$.
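The projection properties are easy to verify numerically. The sketch below forms $\mathbf{H}$ for a small hypothetical dataset and checks idempotence, symmetry, and orthogonality of the residuals to the columns of $\mathbf{X}$. The function name `hat_matrix` and the data are assumptions; forming the full $n \times n$ matrix is only sensible for small $n$.

```python
import numpy as np

def hat_matrix(X):
    """H = X (X^T X)^{-1} X^T, computed with a linear solve instead of an inverse."""
    return X @ np.linalg.solve(X.T @ X, X.T)

# Hypothetical data: intercept column plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

H = hat_matrix(X)
y_hat = H @ y                      # fitted values
e = (np.eye(len(y)) - H) @ y       # residuals

is_idempotent = np.allclose(H @ H, H)
is_symmetric = np.allclose(H.T, H)
residuals_orthogonal = np.allclose(X.T @ e, 0.0)

# A standard extra fact: the trace of an orthogonal projection equals the
# dimension of the space it projects onto, here p = 2.
trace_H = np.trace(H)

print(is_idempotent, is_symmetric, residuals_orthogonal)  # True True True
```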

Properties of OLS Estimators

Theorem: Gauss–Markov Theorem

Under the classical assumptions ($E[\boldsymbol{\varepsilon}] = \mathbf{0}$, $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, $\mathbf{X}$ fixed with full column rank), the OLS estimator $\hat{\boldsymbol{\beta}}$ is BLUE (Best Linear Unbiased Estimator), with covariance matrix:

\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}

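No other linear unbiased estimator of $\boldsymbol{\beta}$ has smaller variance: for any such estimator $\tilde{\boldsymbol{\beta}}$, the difference $\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}})$ is positive semi-definite.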

Theorem: Unbiasedness of $\hat{\boldsymbol{\beta}}$

E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}
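Both the unbiasedness result and the covariance formula $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ can be illustrated with a small Monte Carlo experiment: hold $\mathbf{X}$ fixed, redraw the errors many times, and refit. The following is a sketch under assumed values for $n$, $\sigma$, $\boldsymbol{\beta}$, the seed, and the number of replications. It illustrates the results rather than proving them, and it draws normal errors only for convenience.

```python
import numpy as np

def simulate_estimates(X, beta, sigma, n_reps, rng):
    """Refit OLS on n_reps datasets that share X but have fresh errors."""
    n = X.shape[0]
    eps = rng.normal(0.0, sigma, size=(n_reps, n))
    Y = X @ beta + eps                         # one simulated y per row
    # Solve the normal equations for all datasets at once.
    B = np.linalg.solve(X.T @ X, X.T @ Y.T)    # shape (p, n_reps)
    return B.T                                 # shape (n_reps, p)

rng = np.random.default_rng(42)                # fixed seed, for reproducibility
n, sigma = 40, 2.0                             # hypothetical settings
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=(n, 2))])
beta_true = np.array([0.5, -1.0, 3.0])         # hypothetical coefficients

estimates = simulate_estimates(X, beta_true, sigma, 20000, rng)

empirical_mean = estimates.mean(axis=0)            # should be close to beta_true
empirical_cov = np.cov(estimates, rowvar=False)    # sample covariance of beta_hat
# The inverse is the quantity of interest here, so forming it is appropriate.
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print(np.round(empirical_mean, 2))
print(np.round(empirical_cov - theoretical_cov, 3))
```

The printed differences shrink as the number of replications grows; they are not exactly zero because the empirical quantities are themselves estimates.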

Estimation of Error Variance

Unbiased Estimator of Error Variance

\hat{\sigma}^2 = s^2 = \frac{\mathbf{e}^T\mathbf{e}}{n - p} = \frac{\text{SSR}}{n - p}

Here,

$s^2$ = Estimated error variance
$n - p$ = Degrees of freedom
$\mathbf{e}$ = Residual vector

$E[s^2] = \sigma^2$, so $s^2$ is unbiased. The denominator $n - p$ accounts for the $p$ estimated parameters: $E[\mathbf{e}^T\mathbf{e}] = \sigma^2\,\text{tr}(\mathbf{I} - \mathbf{H}) = \sigma^2(n - p)$.
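The effect of the $n - p$ denominator can be seen in a short simulation. The sketch below computes $s^2$ for a hand-checkable dataset, then averages $s^2$ and the naive estimator $\mathbf{e}^T\mathbf{e}/n$ over repeated draws. The function name `error_variance`, the simulation settings, and the seed are assumptions chosen for illustration.

```python
import numpy as np

def error_variance(X, y):
    """Return s^2 = e^T e / (n - p) for the OLS fit of y on X."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return float(e @ e) / (n - p)

# Hand-checkable case: residuals are (0.2, -0.1, -0.4, 0.3), so e^T e = 0.30
# and s^2 = 0.30 / (4 - 2) = 0.15.
X_small = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_small = np.array([1.0, 3.0, 5.0, 8.0])
s2_small = error_variance(X_small, y_small)

# Simulation with hypothetical settings: true error variance sigma^2 = 4.
rng = np.random.default_rng(7)
n, p, sigma = 30, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -2.0])

s2_draws = np.array([
    error_variance(X, X @ beta_true + rng.normal(0.0, sigma, size=n))
    for _ in range(2000)
])

mean_s2 = s2_draws.mean()                         # close to sigma^2
mean_naive = (s2_draws * (n - p) / n).mean()      # e^T e / n, biased downward

print(round(mean_s2, 2), round(mean_naive, 2))
```

The naive average is smaller than the unbiased one by exactly the factor $(n - p)/n$, since both are computed from the same residual sums of squares.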

Numerical Considerations

Avoid Matrix Inversion

In practice, never compute $(\mathbf{X}^T\mathbf{X})^{-1}$ directly. Instead, use:

QR decomposition: $\mathbf{X} = \mathbf{Q}\mathbf{R}$, then solve $\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}$
Cholesky decomposition: $\mathbf{X}^T\mathbf{X} = \mathbf{L}\mathbf{L}^T$, then solve via forward/back substitution
SVD: Most numerically stable, handles rank-deficient cases

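For $n \ge p$, each of these methods costs $O(np^2)$ and avoids forming an explicit inverse. QR and SVD work on $\mathbf{X}$ directly. Cholesky is the cheapest of the three but works on $\mathbf{X}^T\mathbf{X}$, whose condition number is the square of that of $\mathbf{X}$, so it loses accuracy sooner on ill-conditioned designs.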
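The three approaches can be sketched in NumPy as follows. The function names, the hand-written triangular solvers, and the singular-value cutoff `rtol` are assumptions made to keep the example self-contained; a production implementation would normally call a library triangular solver or a least-squares routine instead.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for an upper-triangular U."""
    p = U.shape[0]
    x = np.zeros(p)
    for i in range(p - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def forward_substitute(L, b):
    """Solve L x = b for a lower-triangular L."""
    p = L.shape[0]
    x = np.zeros(p)
    for i in range(p):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def ols_qr(X, y):
    """QR route: X = Q R, then solve R beta = Q^T y."""
    Q, R = np.linalg.qr(X)               # reduced QR: Q is (n, p), R is (p, p)
    return back_substitute(R, Q.T @ y)

def ols_cholesky(X, y):
    """Cholesky route: X^T X = L L^T, then two triangular solves."""
    L = np.linalg.cholesky(X.T @ X)      # lower-triangular factor
    z = forward_substitute(L, X.T @ y)   # L z = X^T y
    return back_substitute(L.T, z)       # L^T beta = z

def ols_svd(X, y, rtol=1e-12):
    """SVD route: drop negligible singular values (minimum-norm solution)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > rtol * s[0]               # rtol is an assumed cutoff
    return Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])

# Hypothetical full-rank data.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])
beta_qr, beta_chol, beta_svd = ols_qr(X, y), ols_cholesky(X, y), ols_svd(X, y)

# Rank-deficient design: the predictor column is duplicated. QR and Cholesky
# as written would divide by a (near-)zero pivot; the SVD route still works.
X_deficient = np.column_stack([X, X[:, 1]])
beta_deficient = ols_svd(X_deficient, y)

# Forming X^T X squares the 2-norm condition number.
cond_X, cond_XtX = np.linalg.cond(X), np.linalg.cond(X.T @ X)
```

On the full-rank data all three routes agree. On the rank-deficient design the SVD route returns the minimum-norm coefficient vector, which reproduces the same fitted values as the full-rank fit.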

Key Takeaways

Summary: OLS Estimation

Normal equations: $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$, giving $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Gauss–Markov: OLS is BLUE under the classical assumptions
Variance of $\hat{\boldsymbol{\beta}}$: $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$, which depends on the design matrix
Error variance: $\hat{\sigma}^2 = \text{SSR}/(n-p)$, unbiased with $n-p$ degrees of freedom
The hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ projects $\mathbf{y}$ onto the column space of $\mathbf{X}$
Use QR or SVD, not explicit matrix inversion, for numerical stability
