Maximum Likelihood Estimation

Why It Matters

Maximum likelihood estimation (MLE) is one of the most widely used methods for fitting statistical models. It underlies logistic regression, the standard loss functions used to train neural networks, and much of machine learning. Understanding MLE gives you the theoretical foundation for why common loss functions work, how parameters are estimated, and what guarantees exist about estimator quality. It is the bridge between probability theory and practical model fitting.

Overview

Given i.i.d. data $x_1, \ldots, x_n$ from a distribution $f(x|\theta)$, the maximum likelihood estimator is the parameter value under which the observed data are most probable (for continuous data, have the highest joint density). The likelihood function is $L(\theta) = \prod_{i=1}^n f(x_i|\theta)$, and we maximize it or, equivalently, minimize the negative log-likelihood. For the Normal distribution, the MLEs have closed forms: $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ (biased, because it divides by $n$ rather than $n-1$). Fisher information $I(\theta)$ measures how much the sample tells us about $\theta$, and the Cramér-Rao bound sets a floor on estimator variance: no unbiased estimator can have variance less than $1/I(\theta)$, where $I(\theta)$ is the information in the full sample.

Key Concepts

MLE Estimator

$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n f(x_i|\theta)$$

Here,

$f(x_i|\theta)$ = Probability density/mass function of $x_i$ given parameter $\theta$
$\prod_{i=1}^n$ = Product over all $n$ observations
$\hat{\theta}_{MLE}$ = The parameter value that maximizes the likelihood

Log-Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i|\theta)$$

Here,

$\ell(\theta)$ = Log-likelihood (converts products to sums)
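
Beyond easier calculus, one practical reason to work with the log-likelihood is floating-point underflow: a product of many small densities rounds to zero, while the sum of their logs stays finite. The following is a minimal sketch using only the Python standard library; the standard-normal model, the sample size, and the seed are illustrative assumptions, not part of the lesson.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)  # illustrative seed
data = [random.gauss(0.0, 1.0) for _ in range(2000)]  # illustrative sample

# Raw likelihood: a product of 2000 densities, each below 0.4.
likelihood = math.prod(normal_pdf(x, 0.0, 1.0) for x in data)

# Log-likelihood: the same information expressed as a sum of logs.
log_likelihood = sum(math.log(normal_pdf(x, 0.0, 1.0)) for x in data)

print(likelihood, log_likelihood)
```

The product underflows to zero in double precision, so it cannot be used to compare parameter values, whereas the log-likelihood remains a usable finite number.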

Normal Distribution MLE

$$\hat{\mu} = \bar{x}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2$$

Here,

$\hat{\mu}$ = MLE of the mean (unbiased)
$\hat{\sigma}^2$ = MLE of the variance (biased: divides by $n$, not $n-1$)
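
To make the difference between the two variance estimators concrete, here is a small sketch with Python's `statistics` module, where `pvariance` divides by $n$ (the MLE) and `variance` divides by $n-1$ (the unbiased estimator). The data values are made up for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample
n = len(data)

mu_hat = statistics.fmean(data)                            # MLE of the mean
sigma2_mle = statistics.pvariance(data, mu=mu_hat)         # divides by n
sigma2_unbiased = statistics.variance(data, xbar=mu_hat)   # divides by n - 1

print(mu_hat, sigma2_mle, sigma2_unbiased)
```

The two variance estimates differ by the factor $(n-1)/n$, so the gap shrinks as the sample grows.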

Fisher Information

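$$I(\theta) = -E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]$$

Here,

$I(\theta)$ = Fisher information of the full sample, because $\ell$ sums over all $n$ observations; for i.i.d. data this is $n$ times the information in a single observation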
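
As a worked instance of this definition, here is a sketch of the Fisher information for $n$ i.i.d. Bernoulli($p$) observations; the Bernoulli model is chosen for illustration only. Because $\ell$ sums over the whole sample, the result grows linearly in $n$, and a single observation contributes $1/(p(1-p))$.

```latex
\begin{aligned}
\ell(p) &= \sum_{i=1}^n \left[ x_i \log p + (1 - x_i)\log(1 - p) \right] \\
\frac{\partial \ell}{\partial p} &= \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} \\
\frac{\partial^2 \ell}{\partial p^2} &= -\frac{\sum x_i}{p^2} - \frac{n - \sum x_i}{(1 - p)^2} \\
I(p) &= -E\left[\frac{\partial^2 \ell}{\partial p^2}\right]
      = \frac{np}{p^2} + \frac{n(1 - p)}{(1 - p)^2}
      = \frac{n}{p(1 - p)}
\end{aligned}
```

The last line uses $E\left[\sum x_i\right] = np$.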

Cramér-Rao Lower Bound

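$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

Here,

$\hat{\theta}$ = Any unbiased estimator of $\theta$
$I(\theta)$ = Fisher information of the full sample ($n$ times the per-observation information for i.i.d. data)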
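
The bound can be checked by simulation. The sketch below assumes a Bernoulli($p$) model, for which the full-sample information is $n/(p(1-p))$ and the bound is therefore $p(1-p)/n$. The MLE $\hat{p} = \text{successes}/n$ is unbiased and happens to attain the bound, so its empirical variance should land close to it. The values of `p_true`, `n`, `trials`, and the seed are illustrative assumptions.

```python
import random
import statistics

random.seed(1)  # illustrative seed
p_true, n, trials = 0.3, 50, 20000  # illustrative settings

def p_hat_once():
    """Draw n Bernoulli(p_true) observations and return the MLE successes/n."""
    successes = sum(random.random() < p_true for _ in range(n))
    return successes / n

estimates = [p_hat_once() for _ in range(trials)]
empirical_var = statistics.pvariance(estimates)

# Cramér-Rao bound using the full-sample information n / (p (1 - p)).
crlb = p_true * (1 - p_true) / n

print(empirical_var, crlb)
```

Many estimators do not attain the bound; this example was picked because the Bernoulli MLE does, which makes the comparison easy to read.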

Score Function

$$S(\theta) = \frac{\partial \ell}{\partial \theta}$$

Here,

$S(\theta)$ = Score function; set to 0 to find the MLE
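
When the score equation cannot be solved by hand, it can be solved iteratively. The sketch below applies Newton's method to the score of an Exponential model with rate $\lambda$, where $\ell(\lambda) = n\log\lambda - \lambda\sum x_i$. This model does have a closed-form MLE ($1/\bar{x}$), which is used only to check the iteration. The data, the starting guess, and the stopping tolerance are illustrative assumptions.

```python
data = [0.5, 1.5, 1.0, 2.0, 1.0]  # illustrative sample
n, total = len(data), sum(data)

def score(lam):
    """S(lambda): derivative of n*log(lambda) - lambda*sum(x)."""
    return n / lam - total

def score_derivative(lam):
    """S'(lambda) = -n / lambda^2, always negative, so the root is a maximum."""
    return -n / lam**2

lam = 0.5  # starting guess; this iteration needs 0 < lam < 2 / mean(data)
for _ in range(50):
    step = score(lam) / score_derivative(lam)
    lam -= step  # Newton update: lam_new = lam - S(lam) / S'(lam)
    if abs(step) < 1e-12:
        break

closed_form = n / total  # 1 / x-bar, the closed-form Exponential MLE
print(lam, closed_form)
```

Newton's method is sensitive to the starting point; a bracketing root finder is a more robust choice when a good initial guess is not available.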

MLE for Common Distributions

Distribution	Parameter	MLE	Notes
Normal	$\mu$	$\bar{x}$	Unbiased
Normal	$\sigma^2$	$\frac{1}{n}\sum(x_i-\bar{x})^2$	Biased (uses $n$, not $n-1$)
Poisson	$\lambda$	$\bar{x}$	Closed-form
Bernoulli	$p$	$\text{successes}/n$	Closed-form
Exponential	$\lambda$ (rate)	$1/\bar{x}$	Closed-form

Quick Example

MLE for Poisson Distribution

Data: $x_1, \ldots, x_n$ from Poisson($\lambda$). Log-likelihood:

$$\ell(\lambda) = \sum_{i=1}^n (x_i \log\lambda - \lambda - \log x_i!)$$

Setting $\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$:

$$\hat{\lambda} = \bar{x}$$

The second derivative, $-\sum x_i/\lambda^2$, is negative, so this stationary point is a maximum. The MLE of $\lambda$ is the sample mean, which is intuitive because the Poisson mean equals $\lambda$.
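
The closed-form result can be checked numerically. The sketch below evaluates the Poisson log-likelihood on a grid and confirms that the maximizer sits at the sample mean. The counts and the grid spacing are illustrative assumptions; `math.lgamma(x + 1)` is used for $\log x!$.

```python
import math

data = [2, 3, 0, 4, 1, 2]  # illustrative counts
n = len(data)

def poisson_log_lik(lam):
    """Sum of x*log(lam) - lam - log(x!) over the sample."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

lam_hat = sum(data) / n  # closed-form MLE: the sample mean

# Evaluate the log-likelihood on a grid from 0.50 to 4.49 in steps of 0.01.
grid = [0.5 + 0.01 * k for k in range(400)]
best = max(grid, key=poisson_log_lik)

print(lam_hat, best)
```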

Numerical MLE

For complex models without closed-form MLEs, minimize the negative log-likelihood numerically:

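import numpy as np
from scipy import stats
from scipy.optimize import minimize
# data is assumed to be a 1-D array of observations; params = [mu, sigma]
neg_log_lik = lambda params: -np.sum(stats.norm.logpdf(data, loc=params[0], scale=params[1]))
result = minimize(neg_log_lik, x0=[0, 1], bounds=[(None, None), (0.01, None)])  # result.x holds [mu_hat, sigma_hat]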

Key Takeaways

Summary: Maximum Likelihood Estimation

Definition: $\hat{\theta}_{MLE}$ is the parameter value that maximizes the likelihood of the observed data. It is one of the most widely used estimation methods.
Log-Likelihood: Convert products to sums via $\ell(\theta) = \sum \log f(x_i|\theta)$ . Preserves argmax, simplifies optimization.
Closed-Form MLEs: Normal ( $\hat{\mu} = \bar{x}$ , biased $\hat{\sigma}^2$ ), Poisson ( $\hat{\lambda} = \bar{x}$ ), Bernoulli ( $\hat{p} = \bar{x}$ ).
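Fisher Information: Measures how much information the sample carries about $\theta$ (the expected curvature of the log-likelihood). Higher $I(\theta)$ means tighter estimates. Cramér-Rao: $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ for any unbiased $\hat{\theta}$.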
Score Function: Set $S(\theta) = \partial\ell/\partial\theta = 0$ to find the MLE analytically.
Numerical Optimization: When no closed form exists, minimize negative log-likelihood with scipy.optimize.minimize.
Connection to ML: Many ML loss functions are negative log-likelihoods: cross-entropy corresponds to a Bernoulli or categorical model, and MSE to a Gaussian model with fixed variance. MLE unifies statistical and machine learning estimation.
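
The link between loss functions and likelihoods can be seen directly for binary cross-entropy. The sketch below computes the negative Bernoulli log-likelihood of some labels under predicted probabilities and compares it with the usual cross-entropy formula; the two differ only by the averaging factor $1/n$. The labels and probabilities are made up for illustration.

```python
import math

labels = [1, 0, 1, 1, 0]            # illustrative binary targets
probs = [0.9, 0.2, 0.7, 0.6, 0.1]   # illustrative predicted P(y = 1)

def bernoulli_log_lik(y, p):
    """Log of the Bernoulli pmf p^y * (1 - p)^(1 - y)."""
    return math.log(p) if y == 1 else math.log(1 - p)

# Negative log-likelihood of the labels under the predicted probabilities.
nll = -sum(bernoulli_log_lik(y, p) for y, p in zip(labels, probs))

# Binary cross-entropy as usually written for a loss (mean over examples).
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(labels, probs)) / len(labels)

print(nll / len(labels), bce)
```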

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Point Estimation — MLE, method of moments, and properties of estimators

Properties of Estimators — Unbiasedness, consistency, efficiency, sufficiency, and the Cramér-Rao bound
