Maximum Likelihood Estimation

Why It Matters

MLE is one of the most widely used methods for fitting statistical models. It underlies logistic regression, the standard training losses for neural networks, and much of machine learning. Understanding MLE gives you the theoretical foundation for why common loss functions work, how parameters are estimated, and what guarantees exist about estimator quality. It is the bridge between probability theory and practical model fitting.


Overview

Given data from a distribution $f(x|\theta)$, the maximum likelihood estimator finds the parameter value under which the observed data are most probable. For independent observations the likelihood function is $L(\theta) = \prod_{i=1}^n f(x_i|\theta)$, and we maximize it (or, equivalently, minimize the negative log-likelihood). For the Normal distribution, the MLEs have closed forms: $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ (biased: it divides by $n$, not $n-1$). Fisher information $I(\theta)$ measures how much the sample tells us about $\theta$, and the Cramér-Rao bound sets a floor on estimator variance: under standard regularity conditions, no unbiased estimator can have variance less than $1/I(\theta)$.


Key Concepts

MLE Estimator

$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n f(x_i|\theta)$$

Here,

  • $f(x_i|\theta)$ = Probability density/mass function of $x_i$ given parameter $\theta$
  • $\prod_{i=1}^n$ = Product over all $n$ observations
  • $\hat{\theta}_{MLE}$ = The parameter value that maximizes the likelihood
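
To make the argmax concrete, here is a minimal sketch that evaluates the likelihood on a grid of candidate values and picks the largest. The Bernoulli model, the ten data points, and the grid resolution are all assumptions chosen for illustration, not part of the lesson.

```python
import numpy as np

# Hypothetical sample of coin flips (1 = success); not from the lesson.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Candidate parameter values on a grid strictly inside (0, 1).
p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood of the whole sample at each candidate p:
# prod_i p^x_i * (1 - p)^(1 - x_i).
likelihood = np.array([np.prod(p**data * (1 - p) ** (1 - data)) for p in p_grid])

# The grid point with the largest likelihood approximates the MLE.
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat, data.mean())
```

A grid search like this only works for one or two parameters, but it shows directly that the maximizer lands on the sample proportion.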

Log-Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i|\theta)$$

Here,

  • $\ell(\theta)$ = Log-likelihood (converts products to sums)
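
One practical reason to work with sums rather than products is floating-point underflow. The sketch below uses simulated standard Normal data (an assumption for illustration) to compare the raw likelihood with the log-likelihood.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=2000)  # simulated, illustrative sample

# Raw likelihood: a product of 2000 densities, each below 0.4,
# which is far smaller than the smallest representable float.
likelihood = np.prod(stats.norm.pdf(data, loc=0.0, scale=1.0))

# Log-likelihood: the same information as a sum, which stays finite.
log_likelihood = np.sum(stats.norm.logpdf(data, loc=0.0, scale=1.0))

print(likelihood)      # 0.0 because of floating-point underflow
print(log_likelihood)  # a finite negative number
```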

Normal Distribution MLE

$$\hat{\mu} = \bar{x}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2$$

Here,

  • $\hat{\mu}$ = MLE of the mean (unbiased)
  • $\hat{\sigma}^2$ = MLE of the variance (biased: divides by $n$, not $n-1$)
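
The two closed forms can be checked directly in NumPy. This is a small sketch on simulated data; the true mean, standard deviation, and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=50)  # simulated sample (assumed values)

n = data.size
mu_hat = data.sum() / n                         # MLE of the mean: the sample mean
sigma2_hat = np.sum((data - mu_hat) ** 2) / n   # MLE of the variance: divides by n

# NumPy's default (ddof=0) matches the MLE; ddof=1 gives the n-1 version.
print(np.isclose(sigma2_hat, np.var(data)))
print(np.isclose(sigma2_hat * n / (n - 1), np.var(data, ddof=1)))
```

Multiplying the MLE by n/(n-1) recovers the unbiased sample variance, so the two differ noticeably only for small samples.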

Fisher Information

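$$I(\theta) = -E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]$$

Here,

  • $I(\theta)$ = Fisher information in the sample; because $\ell$ sums over all $n$ observations, for i.i.d. data this is $n$ times the information in a single observation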
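
As a worked illustration, assume $x_1, \ldots, x_n$ are i.i.d. Poisson($\lambda$) (a model chosen here as an example). The Fisher information then follows from two derivatives of the log-likelihood:

```latex
\begin{aligned}
\ell(\lambda) &= \sum_{i=1}^n \left( x_i \log\lambda - \lambda - \log x_i! \right) \\
\frac{\partial \ell}{\partial \lambda} &= \frac{\sum_{i=1}^n x_i}{\lambda} - n \\
\frac{\partial^2 \ell}{\partial \lambda^2} &= -\frac{\sum_{i=1}^n x_i}{\lambda^2} \\
I(\lambda) &= -E\left[ -\frac{\sum_{i=1}^n x_i}{\lambda^2} \right]
            = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}
\end{aligned}
```

Since $\ell$ sums over all $n$ observations, $n/\lambda$ is the information in the whole sample; a single observation contributes $1/\lambda$.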

Cramér-Rao Lower Bound

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

Here,

  • $I(\theta)$ = Fisher information in the sample; the bound applies to unbiased estimators $\hat{\theta}$
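
A quick simulation can show the bound in action. The sketch below assumes a Poisson model, for which the sample information is $n/\lambda$ and the MLE is the sample mean; the values of `lam`, `n`, and `n_reps` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n, n_reps = 4.0, 50, 20000   # illustrative values, not from the lesson

# Draw n_reps independent samples of size n and compute the MLE of each.
samples = rng.poisson(lam=lam, size=(n_reps, n))
lam_hats = samples.mean(axis=1)

empirical_var = lam_hats.var()
cramer_rao_bound = lam / n        # 1 / I(lambda), with I(lambda) = n / lambda

print(empirical_var, cramer_rao_bound)
```

In this case the two numbers should agree closely, because the Poisson sample mean is unbiased and attains the bound.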

Score Function

$$S(\theta) = \frac{\partial \ell}{\partial \theta}$$

Here,

  • $S(\theta)$ = Score function; set to 0 to find MLE
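
When the score equation is awkward to solve by hand, a root finder can solve it numerically. This sketch assumes an Exponential model with rate $\lambda$, whose log-likelihood is $n\log\lambda - \lambda\sum x_i$; the data are simulated and the bracketing interval is an assumption.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(7)
data = rng.exponential(scale=0.5, size=200)   # simulated; true rate is 1/0.5 = 2

def score(lam):
    """Derivative of the Exponential(rate=lam) log-likelihood."""
    return data.size / lam - data.sum()

# The score is positive for small lam and negative for large lam,
# so a root lies inside the bracket [1e-6, 1e3].
lam_hat = brentq(score, 1e-6, 1e3)

print(lam_hat, 1.0 / data.mean())  # numerical root vs. closed form 1 / x-bar
```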

MLE for Common Distributions

| Distribution | Parameter | MLE | Notes |
|---|---|---|---|
| Normal | $\mu$ | $\bar{x}$ | Unbiased |
| Normal | $\sigma^2$ | $\frac{1}{n}\sum(x_i-\bar{x})^2$ | Biased (uses $n$, not $n-1$) |
| Poisson | $\lambda$ | $\bar{x}$ | Closed-form |
| Bernoulli | $p$ | $\text{successes}/n$ | Closed-form |
| Exponential | $\lambda$ | $1/\bar{x}$ | Closed-form |

Quick Example

MLE for Poisson Distribution

Data: $x_1, \ldots, x_n$ from Poisson($\lambda$). Log-likelihood:

$$\ell(\lambda) = \sum(x_i \log\lambda - \lambda - \log x_i!)$$

Setting $\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$:

$$\hat{\lambda} = \bar{x}$$

The MLE of $\lambda$ is the sample mean, which is intuitive because the Poisson mean equals $\lambda$.
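
The result can be sanity-checked numerically. In this sketch the ten counts and the search interval are hypothetical; it minimizes the Poisson negative log-likelihood and compares the answer with the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

data = np.array([2, 3, 1, 4, 2, 0, 3, 5, 2, 3])   # hypothetical counts

def neg_log_lik(lam):
    """Negative Poisson log-likelihood of the sample at rate lam."""
    return -np.sum(stats.poisson.logpmf(data, mu=lam))

# Search a bounded interval that is assumed to contain the MLE.
result = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")

print(result.x, data.mean())  # numerical optimum vs. closed form x-bar
```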

Numerical MLE

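For models without closed-form MLEs, minimize the negative log-likelihood numerically. The Normal model here is only a stand-in, since its MLE is known in closed form:

import numpy as np
from scipy import stats
from scipy.optimize import minimize
# data: 1-D array of observations; params = [mu, sigma], with sigma bounded away from 0
neg_log_lik = lambda params: -np.sum(stats.norm.logpdf(data, params[0], params[1]))
result = minimize(neg_log_lik, [0, 1], bounds=[(None, None), (0.01, None)])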
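
For a case where no closed form exists, here is a minimal sketch that fits a Gamma distribution, whose shape parameter has no closed-form MLE. The simulated data, the true values (shape 2, scale 3), the starting point, and the choice of L-BFGS-B are assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=2000)  # simulated sample (assumed values)

def neg_log_lik(params):
    """Negative Gamma log-likelihood for params = [shape, scale]."""
    shape, scale = params
    return -np.sum(stats.gamma.logpdf(data, a=shape, scale=scale))

# Start from shape 1 with the scale matched to the sample mean;
# the bounds keep both parameters strictly positive.
result = minimize(
    neg_log_lik,
    x0=[1.0, data.mean()],
    method="L-BFGS-B",
    bounds=[(1e-6, None), (1e-6, None)],
)
shape_hat, scale_hat = result.x
print(shape_hat, scale_hat)
```

The estimates should land near the values used to simulate the data, with some sampling error.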

Key Takeaways

Summary: Maximum Likelihood Estimation

  • Definition: $\hat{\theta}_{MLE}$ is the parameter value under which the observed data are most probable. It is one of the most widely used estimation methods.
  • Log-Likelihood: Convert products to sums via $\ell(\theta) = \sum \log f(x_i|\theta)$. Because the logarithm is monotonic, this preserves the argmax and simplifies optimization.
  • Closed-Form MLEs: Normal ($\hat{\mu} = \bar{x}$, biased $\hat{\sigma}^2$), Poisson ($\hat{\lambda} = \bar{x}$), Bernoulli ($\hat{p} = \bar{x}$).
  • Fisher Information: Measures how much information the sample carries about $\theta$. Higher $I(\theta)$ -> tighter estimates. Cramér-Rao: $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ for unbiased $\hat{\theta}$.
  • Score Function: Set $S(\theta) = \partial\ell/\partial\theta = 0$ to find the MLE analytically.
  • Numerical Optimization: When no closed form exists, minimize the negative log-likelihood with scipy.optimize.minimize.
  • Connection to ML: Many ML loss functions are negative log-likelihoods: cross-entropy corresponds to a Bernoulli or categorical model, and MSE to a Normal model with fixed variance. MLE unifies statistical and machine learning estimation.
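
The link between loss functions and likelihoods can be checked on a toy case. In this sketch the labels and predicted probabilities are hypothetical; it computes binary cross-entropy by hand and compares it with the average Bernoulli negative log-likelihood.

```python
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0])                 # hypothetical binary labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])       # hypothetical predicted P(y = 1)

# Binary cross-entropy, written out as it usually appears as an ML loss.
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Average negative log-likelihood of the labels under Bernoulli(p_i).
bernoulli_nll = -np.mean(stats.bernoulli.logpmf(y, p))

print(np.isclose(cross_entropy, bernoulli_nll))
```

The two quantities are the same expression, so minimizing cross-entropy is maximum likelihood under a Bernoulli model.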

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Point Estimation

Properties of Estimators
