Robust Statistics — Resistant to Outliers
Advanced Statistical Methods
When Outliers Try to Ruin Your Analysis
Robust statistics provide methods that resist the influence of extreme observations, ensuring reliable inference even when data are contaminated or assumptions are violated. A single outlier can distort classical estimates by orders of magnitude.
- Financial risk management — Robust estimators prevent extreme market events from skewing risk models
- Quality control — Manufacturing data often contain contamination; robust methods maintain accuracy
- Environmental monitoring — Sensor malfunctions produce outliers that robust techniques gracefully handle
Robust statistics keep your conclusions standing even when the data fight back.
Why Robustness Matters
Classical estimators such as the sample mean and OLS regression are optimal under normality but highly sensitive to outliers. Shifting one observation by an amount c moves the sample mean by c/n, so a single extreme value can pull it arbitrarily far. Robust statistics provides estimators that remain reliable even when data are contaminated or model assumptions are violated.
Robust Estimators of Location
Definition: Sample Median as a Robust Estimator
The sample median is the most fundamental robust estimator of location. It minimizes the sum of absolute deviations:
$$\hat{\mu}_{\text{med}} = \arg\min_{\mu} \sum_{i=1}^{n} |x_i - \mu|$$
The median has a 50% breakdown point: it takes corruption of at least half the data to render it arbitrarily wrong.
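As a quick numerical check of both claims, the sketch below evaluates the sum of absolute deviations on a grid and then corrupts a single observation. The data and the helper name `sum_abs_dev` are made up for illustration only.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
x = np.array([2.1, 2.8, 3.2, 3.5, 3.9, 4.0, 4.3])

def sum_abs_dev(mu, data):
    """Sum of absolute deviations of data from a candidate location mu."""
    return np.abs(data - mu).sum()

# Grid search: the minimizer should coincide with the sample median
grid = np.linspace(x.min(), x.max(), 2201)
best = grid[np.argmin([sum_abs_dev(m, x) for m in grid])]
print("grid minimizer:", round(float(best), 3), "median:", np.median(x))

# Corrupt one observation and compare how far each estimator moves
x_bad = x.copy()
x_bad[-1] = 1e6
print("mean  :", x.mean(), "->", x_bad.mean())
print("median:", np.median(x), "->", np.median(x_bad))
```

With an odd sample size the minimizer is unique; with an even sample size any point between the two middle order statistics minimizes the criterion.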
Trimmed Mean
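$$\bar{x}_{\alpha} = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} x_{(i)}$$
Here,
- $\alpha$ = Trimming fraction (e.g., 0.1 for 10% trimming)
- $x_{(i)}$ = The i-th order statistic
- $k = \lfloor \alpha n \rfloor$ = Number of observations trimmed from each tail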
Winsorized Mean
$$\bar{x}_{W} = \frac{1}{n} \sum_{i=1}^{n} x_i^{W}$$
Here,
- $x_i^{W}$ = Winsorized observation: extremes replaced by the nearest non-extreme value
- $\alpha$ = Winsorizing fraction
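The two estimators differ only in what happens to the k most extreme points in each tail: trimming drops them, winsorizing pulls them in to the nearest retained value. A minimal NumPy sketch, assuming k = floor(alpha * n) points per tail; the function names and the sample are illustrative, not taken from a library:

```python
import numpy as np

def trimmed_mean(data, alpha):
    """Mean after dropping k = floor(alpha * n) points from each tail."""
    x = np.sort(np.asarray(data, dtype=float))
    k = int(np.floor(alpha * x.size))
    return x[k:x.size - k].mean()

def winsorized_mean(data, alpha):
    """Mean after replacing the k smallest/largest points by the nearest retained value."""
    x = np.sort(np.asarray(data, dtype=float))
    k = int(np.floor(alpha * x.size))
    if k > 0:
        x[:k] = x[k]          # pull the lower tail up
        x[-k:] = x[-k - 1]    # pull the upper tail down
    return x.mean()

# Hypothetical sample with one gross outlier
sample = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 6.0, 100.0])
print("mean           :", sample.mean())
print("10% trimmed    :", trimmed_mean(sample, 0.1))
print("10% winsorized :", winsorized_mean(sample, 0.1))
```

Both functions assume 2k < n. SciPy offers `scipy.stats.trim_mean` and `scipy.stats.mstats.winsorize` for the same purpose; their rounding conventions for k may differ from this sketch.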
Breakdown Point
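Definition: Breakdown Point
The finite-sample breakdown point of an estimator $T$ is the smallest fraction of observations that, when replaced by arbitrary values, can carry the estimate arbitrarily far from its value on the clean sample:
$$\varepsilon_n^{*}(T, X) = \min\left\{ \frac{m}{n} : \sup_{X'_m} \left\| T(X'_m) - T(X) \right\| = \infty \right\}$$
where $X'_m$ ranges over all samples obtained from $X$ by replacing $m$ of its $n$ observations.
The sample mean has breakdown point $1/n$ (one outlier suffices). The median achieves approximately $1/2$ (the maximum possible for a translation-equivariant location estimator).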
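The definition can be probed empirically by replacing more and more observations with a huge value until the estimate explodes. The sketch below is a rough illustration rather than a proof: it tries only one contamination pattern, and `points_to_break`, `bad_value`, and `threshold` are hypothetical names and settings.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=21)

def points_to_break(estimator, data, bad_value=1e12, threshold=1e6):
    """Smallest m such that replacing m points by bad_value pushes |estimate| past threshold."""
    n = data.size
    for m in range(1, n + 1):
        corrupted = data.copy()
        corrupted[:m] = bad_value
        if abs(estimator(corrupted)) > threshold:
            return m
    return None

m_mean = points_to_break(np.mean, clean)
m_median = points_to_break(np.median, clean)
print(f"mean breaks after replacing   {m_mean} of {clean.size} points")
print(f"median breaks after replacing {m_median} of {clean.size} points")
```

For n = 21 the median survives until the corrupted points form a majority, which matches a breakdown point of roughly one half.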
Efficiency vs. Breakdown
There is a fundamental tradeoff: estimators with higher breakdown points tend to have lower efficiency under normality. The median has an asymptotic relative efficiency of about 64% ($2/\pi$) compared to the mean under normality, but it is far more robust.
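The efficiency figure can be checked by simulation: draw many normal samples, compute the mean and the median of each, and compare the variances of the two estimators. This is a Monte Carlo sketch with an arbitrary sample size and replication count, so the ratio will only be close to the asymptotic value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 101, 20000                      # illustrative settings
samples = rng.normal(size=(reps, n))      # each row is one normal sample

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

# Relative efficiency of the median with respect to the mean
rel_eff = var_mean / var_median
print(f"simulated efficiency: {rel_eff:.3f}, asymptotic value 2/pi: {2 / np.pi:.3f}")
```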
M-Estimators
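Definition: M-Estimator
An M-estimator generalizes maximum likelihood by solving:
$$\sum_{i=1}^{n} \psi(x_i - \hat{\theta}) = 0$$
where $\psi$ is a function derived from a loss function $\rho$ via $\psi = \rho'$. For least squares, $\rho(u) = u^2/2$ and $\psi(u) = u$, yielding the mean. Robust M-estimators use $\psi$ functions that bound the influence of outliers.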
M-Estimator Objective Function
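$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\!\left(\frac{x_i - \theta}{\hat{\sigma}}\right)$$
Here,
- $\rho$ = Robust loss function (e.g., Huber or Tukey bisquare)
- $\hat{\sigma}$ = Robust scale estimate (e.g., MAD)
- $\psi = \rho'$ = Derivative of $\rho$; the influence function of the estimator is proportional to it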
Huber's Function
Huber's Loss Function
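$$\rho_k(u) = \begin{cases} \tfrac{1}{2} u^2 & |u| \le k \\ k|u| - \tfrac{1}{2} k^2 & |u| > k \end{cases}$$
Here,
- $k$ = Tuning constant, typically $k = 1.345$ for 95% efficiency under normality
- $\psi_k(u) = \min(|u|, k) \cdot \text{sign}(u)$: clips influence at $|u| = k$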
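One common way to compute a Huber M-estimate of location is iteratively reweighted averaging: each point gets weight psi(u)/u = min(1, k/|u|), where u is its residual divided by a robust scale. The sketch below assumes a fixed MAD scale (rescaled by 1.4826 for consistency under normality) and starts from the median; `mad_scale`, `huber_location`, and the sample are illustrative, and other algorithms and scale choices exist.

```python
import numpy as np

def mad_scale(x):
    """Median absolute deviation, rescaled to estimate sigma under normality."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location with a fixed MAD scale (assumes MAD > 0)."""
    x = np.asarray(x, dtype=float)
    scale = mad_scale(x)
    mu = np.median(x)                       # robust starting value
    for _ in range(max_iter):
        u = (x - mu) / scale
        # Huber weights: 1 inside [-k, k], k/|u| outside
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

# Hypothetical sample with one gross outlier
sample = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1, 4.3, 3.7])
mu_hat = huber_location(sample)
print("mean:", sample.mean(), "median:", np.median(sample), "Huber:", mu_hat)

# At the solution the clipped, scaled residuals sum to (nearly) zero
psi_sum = np.clip((sample - mu_hat) / mad_scale(sample), -1.345, 1.345).sum()
print("sum of psi at solution:", psi_sum)
```

The final check confirms that the returned value solves the estimating equation for the clipped residuals, which is what defines the estimate.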
Tukey's Bisquare (Biweight) Function
Tukey Bisquare Loss
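$$\rho_k(u) = \begin{cases} \frac{k^2}{6}\left[1 - \left(1 - \frac{u^2}{k^2}\right)^3\right] & |u| \le k \\ \frac{k^2}{6} & |u| > k \end{cases}$$
Here,
- $k$ = Tuning constant, typically $k = 4.685$ for 95% efficiency under normality
- $\psi_k(u) = u(1 - u^2/k^2)^2 \cdot \mathbf{1}(|u| \leq k)$: redescending, so extreme outliers are fully rejected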
Huber vs. Tukey Bisquare
- Huber: $\psi$ is bounded but does not redescend, so extreme outliers still have some (bounded) influence
- Tukey bisquare: $\psi$ redescends to 0, so outliers beyond $|u| = k$ have zero influence
- Use Huber when you want bounded influence; use Tukey when you want complete rejection of extreme outliers
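The difference is easiest to see by evaluating both $\psi$ functions on a few scaled residuals. A small sketch using the tuning constants quoted above; the function names `psi_huber` and `psi_tukey` are illustrative:

```python
import numpy as np

def psi_huber(u, k=1.345):
    """Huber psi: identity inside [-k, k], clipped to +/-k outside."""
    return np.clip(u, -k, k)

def psi_tukey(u, k=4.685):
    """Tukey bisquare psi: u * (1 - u^2/k^2)^2 inside [-k, k], zero outside."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= k, u * (1 - (u / k) ** 2) ** 2, 0.0)

u = np.array([0.5, 1.0, 2.0, 4.0, 6.0, 50.0])
for ui, h, t in zip(u, psi_huber(u), psi_tukey(u)):
    print(f"u = {ui:5.1f}   huber = {h:6.3f}   tukey = {t:6.3f}")
```

For large residuals the Huber value stays at k while the Tukey value is exactly zero, which is the sense in which the bisquare rejects extreme points.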
Influence Function
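Theorem: Influence Function Properties
The influence function (IF) of an estimator $T$ at distribution $F$ is:
$$\mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0^{+}} \frac{T\big((1 - \varepsilon) F + \varepsilon \delta_x\big) - T(F)}{\varepsilon}$$
where $\delta_x$ is a point mass at $x$. This measures the infinitesimal effect of adding an outlier at $x$ to the distribution $F$.
Properties:
- An estimator is bounded-influence if $\mathrm{IF}(x; T, F)$ is bounded for all $x$
- The OLS estimator has $\psi(u) = u$, so its IF grows linearly in the residual: unbounded
- The Huber M-estimator has $\mathrm{IF} \propto \psi_k(u)$: bounded, since $|\psi_k(u)| \le k$
- The Tukey bisquare has a redescending IF: it returns to 0 for large $|u|$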
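A finite-sample analogue of the influence function is the sensitivity curve: add one point x to a sample of size n and rescale the change in the estimate by n + 1. The sketch below compares the mean and the median on simulated data; the helper name `sensitivity_curve` and the chosen x values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=99)   # simulated clean sample

def sensitivity_curve(estimator, data, x_values):
    """(n + 1) * [T(data with x added) - T(data)] for each x in x_values."""
    n = data.size
    t0 = estimator(data)
    return np.array([(n + 1) * (estimator(np.append(data, x)) - t0)
                     for x in x_values])

xs = np.array([-100.0, -10.0, 0.0, 10.0, 100.0, 1000.0])
sc_mean = sensitivity_curve(np.mean, base, xs)
sc_median = sensitivity_curve(np.median, base, xs)

for x, a, b in zip(xs, sc_mean, sc_median):
    print(f"x = {x:7.1f}   mean: {a:9.2f}   median: {b:6.2f}")
```

For the mean the curve equals x minus the sample mean, so it grows without bound; for the median it flattens once x lies outside the bulk of the data.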
Robust Regression
Robust Regression with statsmodels
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
np.random.seed(42)
n = 100
# Clean data
X_clean = np.linspace(0, 10, n)
y_clean = 2 + 1.5 * X_clean + np.random.normal(0, 1, n)
# Add 10% gross outliers
n_outliers = 10
outlier_idx = np.random.choice(n, n_outliers, replace=False)
y_contaminated = y_clean.copy()
y_contaminated[outlier_idx] += np.random.normal(0, 20, n_outliers)
X = sm.add_constant(X_clean)
# OLS — sensitive to outliers
ols = sm.OLS(y_contaminated, X).fit()
# Robust regression — Huber's T
rlm = sm.RLM(y_contaminated, X, M=sm.robust.norms.HuberT()).fit()
# Robust regression — Tukey Bisquare
rlm_bisquare = sm.RLM(y_contaminated, X, M=sm.robust.norms.TukeyBiweight()).fit()
print("OLS estimates:", np.round(ols.params, 4))
print("Huber estimates:", np.round(rlm.params, 4))
print("Tukey estimates:", np.round(rlm_bisquare.params, 4))
print("\nTrue: [2.0, 1.5]")
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X_clean, y_contaminated, s=20, alpha=0.6, label='Data (with outliers)')
ax.scatter(X_clean[outlier_idx], y_contaminated[outlier_idx],
           s=80, c='red', marker='x', label='Outliers')
ax.plot(X_clean, ols.fittedvalues, 'b-', linewidth=2, label='OLS')
ax.plot(X_clean, rlm.fittedvalues, 'g--', linewidth=2, label='Huber')
ax.plot(X_clean, rlm_bisquare.fittedvalues, 'r:', linewidth=2, label='Tukey Bisquare')
ax.legend()
ax.set_title('Robust vs. OLS Regression')
plt.tight_layout()
plt.savefig('robust_regression.png', dpi=150)
plt.show()
Robust Standard Errors
Huber-White Robust Standard Errors
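$$\widehat{\mathrm{Var}}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_{i=1}^{n} \hat{e}_i^{2}\, x_i^\top x_i \right) (X^\top X)^{-1}$$
Here,
- $\hat{e}_i$ = OLS residual for observation i
- $x_i$ = Row vector of regressors for observation i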
When to Use Robust SEs
Robust standard errors (also called sandwich estimators or HC estimators) are valid under heteroskedasticity and mild misspecification. They do not require the error variance to be constant. Use them when:
- You suspect heteroskedasticity
- You want protection against mild misspecification of the error distribution
- You are running OLS but want inference that is robust to non-normal errors
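To make the sandwich structure concrete, the sketch below computes the basic Huber-White (HC0) covariance by hand with NumPy and compares the resulting standard errors with the classical ones. The data are simulated with noise that grows with the regressor, purely as an illustration; the variable names are not from any library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.linspace(0, 10, n)
# Simulated heteroskedastic errors: noise standard deviation grows with x
y = 2 + 1.5 * x + rng.normal(scale=0.5 + 0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                    # OLS point estimates
resid = y - X @ beta

# Classical covariance: sigma^2 (X'X)^{-1}, assumes constant error variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich: (X'X)^{-1} [sum_i e_i^2 x_i' x_i] (X'X)^{-1}
meat = (X * resid[:, None] ** 2).T @ X
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients :", np.round(beta, 4))
print("classical SE :", np.round(se_classical, 4))
print("HC0 SE       :", np.round(se_hc0, 4))
```

The point estimates are identical in both cases; only the standard errors change. In statsmodels the same type of covariance can be requested with `sm.OLS(y, X).fit(cov_type="HC0")`.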
Bootstrap for Robust Inference
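Definition: Nonparametric Bootstrap
The nonparametric bootstrap resamples the observed data directly (with replacement) to estimate the sampling distribution of any statistic, without distributional assumptions:
$$\hat{\theta}^{*}_{b} = T(X^{*}_{b}), \quad b = 1, \dots, B$$
where $X^{*}_{b}$ is the $b$-th bootstrap sample drawn from the empirical distribution $\hat{F}_n$. The spread of the $B$ replicates $\hat{\theta}^{*}_{b}$ approximates the sampling variability of $\hat{\theta} = T(X)$.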
Bootstrap Standard Errors for the Median
import numpy as np
np.random.seed(42)
data = np.array([3.2, 4.1, 2.8, 15.7, 3.5, 4.0, 3.9, 2.1,
                 4.3, 3.7, 16.2, 3.4, 4.2, 2.9, 3.8])
# Bootstrap standard error of the median
B = 10000
boot_medians = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(B)
])
se_median = np.std(boot_medians, ddof=1)
ci_95 = np.percentile(boot_medians, [2.5, 97.5])
print(f"Sample median: {np.median(data):.2f}")
print(f"Bootstrap SE: {se_median:.4f}")
print(f"95% Bootstrap CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
Key Takeaways
Summary: Robust Statistics
- Classical estimators (mean, OLS) are non-robust — a single outlier can distort results arbitrarily
- Breakdown point quantifies an estimator's resistance to contamination; the median achieves the maximum (50%)
- M-estimators generalize MLE by bounding the influence of outliers via $\psi$ functions
- Huber's $\psi$ clips influence; Tukey's bisquare redescends to zero, rejecting extreme outliers completely
- Influence function formalizes the effect of an infinitesimal outlier on an estimator
- Robust regression (RLM) provides coefficient estimates that are not driven by outliers
- Robust standard errors (sandwich estimators) protect against heteroskedasticity without changing point estimates
- Bootstrap provides distribution-free inference for a wide range of statistics, including robust estimators