Robust Statistics — Resistant to Outliers

Advanced Statistical Methods

When Outliers Try to Ruin Your Analysis

Robust statistics provide methods that resist the influence of extreme observations, ensuring reliable inference even when data are contaminated or assumptions are violated. A single outlier can distort classical estimates by orders of magnitude.

Financial risk management — Robust estimators prevent extreme market events from skewing risk models
Quality control — Manufacturing data often contain contamination; robust methods maintain accuracy
Environmental monitoring — Sensor malfunctions produce outliers that robust techniques gracefully handle

Robust statistics keep your conclusions standing even when the data fight back.

Why Robustness Matters

Classical estimators such as the sample mean and OLS regression are optimal under normality but highly sensitive to outliers: a single extreme observation can move them arbitrarily far. Robust estimators give up a small amount of efficiency under ideal conditions in exchange for stability when data are contaminated or model assumptions are violated.

Robust Estimators of Location

Definition: Sample Median as a Robust Estimator

The sample median is the most fundamental robust estimator of location. It minimizes the sum of absolute deviations:

\hat{\mu}_{\text{med}} = \underset{\mu}{\arg\min} \sum_{i=1}^{n} |x_i - \mu|

The median has a 50% breakdown point: it takes corruption of at least half the data to render it arbitrarily wrong.
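As a rough numerical check of the arg-min characterization above, the sketch below searches a grid of candidate locations for the one with the smallest sum of absolute deviations and compares it with `np.median`. The sample values and the grid resolution are illustrative assumptions, not data from this article.

```python
import numpy as np

# Hypothetical sample: four typical readings plus one gross outlier.
x = np.array([2.1, 2.4, 2.5, 2.7, 250.0])

# Candidate locations on a grid with step 0.01.
grid = np.linspace(0.0, 10.0, 1001)

# Sum of absolute deviations |x_i - mu| for every candidate mu at once.
losses = np.abs(x[None, :] - grid[:, None]).sum(axis=1)
grid_minimizer = grid[np.argmin(losses)]

print("grid minimizer:", grid_minimizer)
print("np.median:     ", np.median(x))
print("np.mean:       ", np.mean(x))  # dragged toward the outlier
```

An odd sample size is used so that the minimizer is unique; with an even sample size every point between the two middle order statistics minimizes the objective.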

Trimmed Mean

\bar{x}_{\alpha} = \frac{1}{n - 2\lfloor n\alpha \rfloor} \sum_{i=\lfloor n\alpha \rfloor + 1}^{n - \lfloor n\alpha \rfloor} x_{(i)}

Here,

$\alpha$ = Trimming fraction (e.g., 0.1 for 10% trimming)
$x_{(i)}$ = The $i$-th order statistic
$\lfloor n\alpha \rfloor$ = Number of observations trimmed from each tail
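The trimmed-mean formula can be sketched directly in NumPy. The function name `trimmed_mean` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def trimmed_mean(x, alpha):
    """Alpha-trimmed mean: drop floor(n * alpha) order statistics from each tail."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = x_sorted.size
    g = int(np.floor(n * alpha))  # number trimmed from each tail
    if 2 * g >= n:
        raise ValueError("alpha is too large: no observations would remain")
    return x_sorted[g:n - g].mean()

# Hypothetical sample with one gross outlier (15.7).
data = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1, 4.3, 3.7])

print("mean:            ", data.mean())
print("10% trimmed mean:", trimmed_mean(data, 0.1))
```

With $n = 10$ and $\alpha = 0.1$, one observation is dropped from each tail, so the outlier no longer enters the average.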

Winsorized Mean

\bar{x}_W = \frac{1}{n} \sum_{i=1}^{n} \tilde{x}_i, \quad \tilde{x}_i = \begin{cases} x_{(\lfloor n\alpha \rfloor + 1)} & \text{if } x_i < x_{(\lfloor n\alpha \rfloor + 1)} \\ x_{(n - \lfloor n\alpha \rfloor)} & \text{if } x_i > x_{(n - \lfloor n\alpha \rfloor)} \\ x_i & \text{otherwise} \end{cases}

Here,

$\tilde{x}_i$ = Winsorized observation: extremes are replaced by the nearest non-extreme value
$\alpha$ = Winsorizing fraction
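A minimal sketch of the Winsorized mean follows: instead of discarding the extremes, it clips them to the nearest retained order statistics. The function name `winsorized_mean` and the sample values are assumptions made for illustration.

```python
import numpy as np

def winsorized_mean(x, alpha):
    """Winsorized mean: clip the floor(n * alpha) smallest and largest values."""
    x = np.asarray(x, dtype=float)
    n = x.size
    g = int(np.floor(n * alpha))
    if 2 * g >= n:
        raise ValueError("alpha is too large")
    x_sorted = np.sort(x)
    lower = x_sorted[g]          # x_(g+1) in 1-based order-statistic notation
    upper = x_sorted[n - g - 1]  # x_(n-g)
    return np.clip(x, lower, upper).mean()

# Hypothetical sample with one gross outlier (15.7).
data = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1, 4.3, 3.7])

print("mean:               ", data.mean())
print("10% Winsorized mean:", winsorized_mean(data, 0.1))
```

Unlike trimming, Winsorizing keeps the sample size at $n$: the two extreme values are pulled in to 2.8 and 4.3 rather than dropped.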

Breakdown Point

Definition: Breakdown Point

The finite-sample breakdown point of an estimator $\hat{\theta}_n$ is the smallest fraction $\epsilon^*$ of observations that can be replaced by arbitrary values to make the estimator arbitrarily large:

\epsilon^* = \min\left\{\frac{m}{n} : \sup_{\text{corruption}} |\hat{\theta}_{n,m}| = \infty\right\}

The sample mean has breakdown point $\epsilon^* = 1/n$ (one outlier suffices), which tends to 0 as $n$ grows. The median achieves $\epsilon^* \to 0.5$ as $n \to \infty$, the maximum possible for a translation-equivariant location estimator.
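The definition can be illustrated by replacing an increasing number of observations with an arbitrarily large value and watching when each estimator follows the corruption. This is a sketch on simulated data; the sample size, seed, and corruption value are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 101
clean = rng.normal(loc=10.0, scale=1.0, size=n)  # simulated clean sample

def corrupt(x, m, value=1e9):
    """Replace the first m observations with an arbitrarily large value."""
    out = x.copy()
    out[:m] = value
    return out

for m in (0, 1, 25, 50, 51):
    xc = corrupt(clean, m)
    print(f"m={m:3d}  mean={xc.mean():.3e}  median={np.median(xc):.3e}")
```

With $n = 101$, the mean is carried away by a single replaced value, while the median stays among the clean observations until 51 of the 101 values (just over half) have been replaced.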

Efficiency vs. Breakdown

There is a fundamental tradeoff: estimators with higher breakdown points tend to have lower efficiency under normality. Relative to the mean, the median has an asymptotic efficiency of $2/\pi \approx 64\%$ under normality, but it is far more robust.
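The efficiency figure can be checked approximately by simulation: draw many normal samples, compute the mean and median of each, and compare their sampling variances. The sample size, replication count, and seed below are arbitrary assumptions, and the result carries Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 201, 20_000

# Each row is one simulated standard-normal sample of size n.
samples = rng.normal(size=(reps, n))

var_mean = samples.mean(axis=1).var(ddof=1)
var_median = np.median(samples, axis=1).var(ddof=1)

# Relative efficiency of the median with respect to the mean.
rel_eff = var_mean / var_median
print(f"estimated relative efficiency: {rel_eff:.3f}")
print(f"asymptotic value 2/pi:         {2 / np.pi:.3f}")
```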

M-Estimators

Definition: M-Estimator

An M-estimator generalizes maximum likelihood by solving:

\sum_{i=1}^{n} \psi(x_i - \hat{\theta}) = 0

where $\psi$ is a function derived from a loss function $\rho$ via $\psi(u) = \rho'(u)$. For least squares, $\rho(u) = u^2$ and $\psi(u) = 2u$, yielding the mean. Robust M-estimators use $\psi$ functions that bound the influence of outliers.
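To make the estimating equation concrete, the sketch below solves $\sum_i \psi(x_i - \theta) = 0$ by bisection for two choices of $\psi$: the least-squares $\psi(u) = 2u$, which recovers the mean, and $\psi(u) = \text{sign}(u)$, the derivative of the absolute loss, which recovers the median. The helper `solve_m_estimate` is hypothetical and assumes $\psi$ is non-decreasing; the data are illustrative.

```python
import numpy as np

def solve_m_estimate(x, psi, tol=1e-10):
    """Solve sum(psi(x_i - theta)) = 0 by bisection (psi assumed non-decreasing)."""
    lo, hi = float(np.min(x)), float(np.max(x))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(psi(x - mid)) > 0:
            lo = mid  # residuals are net positive: theta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical sample with one gross outlier.
x = np.array([2.1, 2.4, 2.5, 2.7, 250.0])

theta_ls = solve_m_estimate(x, lambda u: 2 * u)  # least-squares psi
theta_lad = solve_m_estimate(x, np.sign)         # absolute-loss psi

print("least-squares solution:", theta_ls, " (mean:", x.mean(), ")")
print("absolute-loss solution:", theta_lad, " (median:", np.median(x), ")")
```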

M-Estimator Objective Function

\hat{\theta} = \underset{\theta}{\arg\min} \sum_{i=1}^{n} \rho\left(\frac{x_i - \theta}{\hat{\sigma}}\right)

Here,

$\rho$ = Robust loss function (e.g., Huber or Tukey bisquare)
$\hat{\sigma}$ = Robust scale estimate (e.g., MAD)
$\psi$ = Derivative of $\rho$; proportional to the influence function of the resulting estimator

Huber's $\psi$ Function

Huber's Loss Function

\rho_H(u) = \begin{cases} \frac{1}{2}u^2 & \text{if } |u| \leq k \\ k|u| - \frac{1}{2}k^2 & \text{if } |u| > k \end{cases}

Here,

$k$ = Tuning constant, typically $k = 1.345$ for 95% efficiency under normality
$\psi_H(u) = \min(|u|, k) \cdot \text{sign}(u)$: clips influence at $|u| = k$
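A short sketch of Huber's $\rho$ and $\psi$ in NumPy, with a finite-difference check that $\psi_H = \rho_H'$ away from the kinks at $|u| = k$. The function names are illustrative, not taken from any library.

```python
import numpy as np

K_HUBER = 1.345  # tuning constant quoted above

def huber_rho(u, k=K_HUBER):
    """Huber loss: quadratic for |u| <= k, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

def huber_psi(u, k=K_HUBER):
    """Huber psi: the identity clipped to [-k, k]."""
    return np.clip(np.asarray(u, dtype=float), -k, k)

# Central-difference derivative of rho at a few points away from |u| = k.
u = np.array([-4.0, -1.0, 0.0, 0.5, 3.0])
h = 1e-6
numeric_psi = (huber_rho(u + h) - huber_rho(u - h)) / (2 * h)

print("psi(u):     ", huber_psi(u))
print("d rho / du: ", np.round(numeric_psi, 6))
```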

Tukey's Bisquare (Biweight) $\psi$ Function

Tukey Bisquare Loss

\rho_T(u) = \begin{cases} \frac{k^2}{6}\left[1 - \left(1 - \frac{u^2}{k^2}\right)^3\right] & \text{if } |u| \leq k \\ \frac{k^2}{6} & \text{if } |u| > k \end{cases}

Here,

$k$ = Tuning constant, typically $k = 4.685$ for 95% efficiency under normality
$\psi_T(u) = u(1 - u^2/k^2)^2 \cdot \mathbf{1}(|u| \leq k)$: redescending, so extreme outliers are fully rejected
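The bisquare loss and its derivative can be sketched the same way. The check below confirms numerically that $\psi_T = \rho_T'$ inside $[-k, k]$ and that $\psi_T$ is exactly zero outside it. Function names are illustrative.

```python
import numpy as np

K_TUKEY = 4.685  # tuning constant quoted above

def tukey_rho(u, k=K_TUKEY):
    """Tukey bisquare loss: smooth inside [-k, k], constant k^2/6 outside."""
    u = np.asarray(u, dtype=float)
    inside = (k**2 / 6.0) * (1.0 - (1.0 - (u / k) ** 2) ** 3)
    return np.where(np.abs(u) <= k, inside, k**2 / 6.0)

def tukey_psi(u, k=K_TUKEY):
    """Tukey bisquare psi: redescends to zero at |u| = k and stays there."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= k, u * (1.0 - (u / k) ** 2) ** 2, 0.0)

# Central-difference derivative of rho at points inside [-k, k].
u = np.array([-3.0, -1.0, 0.0, 2.0, 4.0])
h = 1e-6
numeric_psi = (tukey_rho(u + h) - tukey_rho(u - h)) / (2 * h)

print("psi(u):     ", np.round(tukey_psi(u), 6))
print("d rho / du: ", np.round(numeric_psi, 6))
print("psi(6), psi(-10):", tukey_psi(6.0), tukey_psi(-10.0))
```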

Huber vs. Tukey Bisquare

Huber: $\psi$ is bounded but does not redescend — extreme outliers still have some (bounded) influence
Tukey bisquare: $\psi$ redescends to 0, so observations with $|u| > k$ have exactly zero influence
Use Huber when you want bounded influence; use Tukey when you want complete rejection of extreme outliers
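The practical difference shows up when both estimators are computed on the same contaminated sample. The sketch below uses iteratively reweighted averaging with weights $w(u) = \psi(u)/u$, a MAD scale held fixed, and the median as the starting value. These are common choices but they are assumptions here, as are the function names and the data; the code also assumes the MAD is nonzero.

```python
import numpy as np

def mad_scale(x):
    """Median absolute deviation, scaled by 1.4826 for consistency at the normal."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def huber_weight(u, k=1.345):
    # w(u) = psi_H(u) / u = min(1, k / |u|)
    au = np.abs(u)
    return np.where(au <= k, 1.0, k / np.maximum(au, 1e-12))

def tukey_weight(u, k=4.685):
    # w(u) = psi_T(u) / u = (1 - u^2/k^2)^2 inside [-k, k], 0 outside
    return np.where(np.abs(u) <= k, (1.0 - (u / k) ** 2) ** 2, 0.0)

def m_location(x, weight, max_iter=100, tol=1e-10):
    """M-estimate of location by iteratively reweighted averaging."""
    x = np.asarray(x, dtype=float)
    sigma = mad_scale(x)   # robust scale, held fixed during the iteration
    theta = np.median(x)   # robust starting value
    for _ in range(max_iter):
        w = weight((x - theta) / sigma)
        theta_new = np.sum(w * x) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Hypothetical sample with two gross outliers (15.7 and 16.2).
x = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1,
              4.3, 3.7, 16.2, 3.4, 4.2, 2.9, 3.8])

huber_est = m_location(x, huber_weight)
tukey_est = m_location(x, tukey_weight)
print(f"mean:   {x.mean():.3f}")
print(f"median: {np.median(x):.3f}")
print(f"Huber:  {huber_est:.3f}")
print(f"Tukey:  {tukey_est:.3f}")
```

In this sample the two outliers keep a small positive Huber weight but receive a Tukey weight of exactly zero, which is the distinction drawn in the comparison above.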

Influence Function

Theorem: Influence Function Properties

The influence function (IF) of an estimator $T$ at distribution $F$ is:

\text{IF}(x; T, F) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)F + \epsilon \delta_x) - T(F)}{\epsilon}

where $\delta_x$ is a point mass at $x$. This measures the infinitesimal effect of adding an outlier at $x$ to the distribution $F$.

Properties:

An estimator is bounded-influence if $\text{IF}$ is bounded for all $x$
The OLS estimator has $\text{IF}((x, y); \hat{\beta}, F) = (E[x x^T])^{-1} x \, (y - x^T \beta)$: unbounded in both the regressors and the residual
The Huber M-estimator of location has $\text{IF}(x; T_H, F) \propto \sigma \, \psi_H((x - \mu)/\sigma)$: bounded by a constant multiple of $k\sigma$
The Tukey bisquare has a redescending IF that returns to 0 for large $|x|$
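A finite-sample analogue of the influence function is the sensitivity curve, $SC_n(x) = n\,[T(x_1, \ldots, x_{n-1}, x) - T(x_1, \ldots, x_{n-1})]$, which records how much one added observation at $x$ moves the estimate. The sketch below computes it for the mean and the median on a simulated sample; the sample, seed, and grid of $x$ values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=99)  # simulated clean sample of size n - 1

def sensitivity_curve(estimator, sample, x_values):
    """n * (T(sample with x added) - T(sample)) for each x in x_values."""
    n = sample.size + 1
    t0 = estimator(sample)
    return np.array([n * (estimator(np.append(sample, x)) - t0)
                     for x in x_values])

x_values = np.linspace(-50.0, 50.0, 101)
sc_mean = sensitivity_curve(np.mean, base, x_values)
sc_median = sensitivity_curve(np.median, base, x_values)

print("max |SC| for the mean:  ", np.abs(sc_mean).max())
print("max |SC| for the median:", np.abs(sc_median).max())
```

For the mean the curve is the straight line $x - \bar{x}$, so it grows without bound; for the median it flattens once $x$ lies outside the range of the central observations.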

Robust Regression

Robust Regression with statsmodels

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(42)
n = 100

# Clean data
X_clean = np.linspace(0, 10, n)
y_clean = 2 + 1.5 * X_clean + np.random.normal(0, 1, n)

# Add 10% gross outliers
n_outliers = 10
outlier_idx = np.random.choice(n, n_outliers, replace=False)
y_contaminated = y_clean.copy()
y_contaminated[outlier_idx] += np.random.normal(0, 20, n_outliers)

X = sm.add_constant(X_clean)

# OLS — sensitive to outliers
ols = sm.OLS(y_contaminated, X).fit()

# Robust regression — Huber's T
rlm = sm.RLM(y_contaminated, X, M=sm.robust.norms.HuberT()).fit()

# Robust regression — Tukey Bisquare
rlm_bisquare = sm.RLM(y_contaminated, X, M=sm.robust.norms.TukeyBiweight()).fit()

print("OLS estimates:", np.round(ols.params, 4))
print("Huber estimates:", np.round(rlm.params, 4))
print("Tukey estimates:", np.round(rlm_bisquare.params, 4))
print("\nTrue: [2.0, 1.5]")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X_clean, y_contaminated, s=20, alpha=0.6, label='Data (with outliers)')
ax.scatter(X_clean[outlier_idx], y_contaminated[outlier_idx],
           s=80, c='red', marker='x', label='Outliers')
ax.plot(X_clean, ols.fittedvalues, 'b-', linewidth=2, label='OLS')
ax.plot(X_clean, rlm.fittedvalues, 'g--', linewidth=2, label='Huber')
ax.plot(X_clean, rlm_bisquare.fittedvalues, 'r:', linewidth=2, label='Tukey Bisquare')
ax.legend()
ax.set_title('Robust vs. OLS Regression')
plt.tight_layout()
plt.savefig('robust_regression.png', dpi=150)
plt.show()

Robust Standard Errors

Huber-White Robust Standard Errors

\widehat{\text{Var}}(\hat{\beta}) = (X^T X)^{-1} \left(\sum_{i=1}^{n} \hat{u}_i^2 \, x_i x_i^T \right) (X^T X)^{-1}

Here,

$\hat{u}_i$ = OLS residual for observation $i$
$x_i$ = Column vector of regressors for observation $i$ (the $i$-th row of $X$, transposed)
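The sandwich formula translates almost line for line into NumPy. The sketch below simulates heteroskedastic data (the data-generating process is an assumption made for illustration), fits OLS by the normal equations, and builds the covariance estimate from the "bread" $(X^T X)^{-1}$ and the "meat" $\sum_i \hat{u}_i^2 x_i x_i^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])  # intercept and one regressor

# Illustrative assumption: the error standard deviation grows with x.
y = 2.0 + 1.5 * x + rng.normal(0.0, 0.5 + 0.5 * x)

# OLS fit and residuals.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Sandwich estimator: bread @ meat @ bread.
bread = np.linalg.inv(X.T @ X)
meat = (X * resid[:, None] ** 2).T @ X  # sum_i u_i^2 x_i x_i^T
cov_robust = bread @ meat @ bread
se_robust = np.sqrt(np.diag(cov_robust))

# Classical standard errors for comparison (assume constant error variance).
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * bread))

print("coefficients:  ", np.round(beta_hat, 4))
print("classical SEs: ", np.round(se_classical, 4))
print("robust SEs:    ", np.round(se_robust, 4))
```

This variant, with no small-sample correction, is usually labeled HC0. In statsmodels the same estimate should be available by passing `cov_type="HC0"` to `sm.OLS(y, X).fit(...)`.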

When to Use Robust SEs

Robust standard errors (also called sandwich estimators or HC estimators) are valid under heteroskedasticity and mild misspecification. They do not require the error variance to be constant. Use them when:

You suspect heteroskedasticity
You want protection against mild misspecification of the error distribution
You are running OLS but want inference that is robust to non-normal errors

Bootstrap for Robust Inference

Definition: Nonparametric Bootstrap

The nonparametric bootstrap resamples the observed data directly (with replacement) to estimate the sampling distribution of any statistic, without distributional assumptions:

\hat{\theta}^{*b} = T(X^{*b}), \quad b = 1, \ldots, B

where $X^{*b}$ is the $b$-th bootstrap sample drawn from the empirical distribution $\hat{F}_n$.

Bootstrap Standard Errors for the Median

import numpy as np

np.random.seed(42)
data = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1,
                 4.3, 3.7, 16.2, 3.4, 4.2, 2.9, 3.8])

# Bootstrap standard error of the median
B = 10000
boot_medians = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(B)
])

se_median = np.std(boot_medians, ddof=1)
ci_95 = np.percentile(boot_medians, [2.5, 97.5])

print(f"Sample median: {np.median(data):.2f}")
print(f"Bootstrap SE:  {se_median:.4f}")
print(f"95% Bootstrap CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
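Because the bootstrap works for any statistic $T$, the same recipe can be wrapped in a small helper and reused. The sketch below, with a hypothetical function `bootstrap_se` and NumPy's `Generator` API, compares the bootstrap standard error of the mean and the median on the sample used above.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=5000, seed=0):
    """Bootstrap standard error of statistic(data) using n_boot resamples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    # Each row of idx indexes one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = np.apply_along_axis(statistic, 1, data[idx])
    return boot_stats.std(ddof=1)

data = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1,
                 4.3, 3.7, 16.2, 3.4, 4.2, 2.9, 3.8])

se_mean = bootstrap_se(data, np.mean)
se_median = bootstrap_se(data, np.median)
print(f"bootstrap SE of the mean:   {se_mean:.4f}")
print(f"bootstrap SE of the median: {se_median:.4f}")
```

On this sample the two large values inflate the variability of the resampled means much more than that of the resampled medians.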

Key Takeaways

Summary: Robust Statistics

Classical estimators (mean, OLS) are non-robust — a single outlier can distort results arbitrarily
Breakdown point quantifies an estimator's resistance to contamination; the median achieves the maximum (50%)
M-estimators generalize MLE by bounding the influence of outliers via $\psi$ functions
Huber's $\psi$ clips influence; Tukey's bisquare redescends to zero — complete rejection
Influence function formalizes the effect of an infinitesimal outlier on an estimator
Robust regression (RLM) provides coefficient estimates that are not driven by outliers
Robust standard errors (sandwich estimators) protect against heteroskedasticity without changing point estimates
Bootstrap provides distribution-free inference for any statistic, including robust estimators
