Analysis of Covariance (ANCOVA)

Advanced Statistical Methods

Adjusting for Confounders While Testing Group Differences

ANCOVA combines ANOVA with regression to compare group means while statistically controlling for continuous covariates. It increases precision and can reduce bias from measured confounding variables.

Clinical trials — Compare treatment effects while adjusting for baseline severity scores
Education research — Assess school performance while controlling for socioeconomic status
Agriculture — Compare crop yields while adjusting for soil quality differences across fields

ANCOVA levels the playing field so you can see true group differences clearly.

Analysis of Covariance (ANCOVA) extends the general linear model to include both categorical independent variables (factors) and continuous independent variables (covariates). ANCOVA combines the explanatory power of ANOVA for group comparisons with the precision-enhancing capability of regression through covariate adjustment. By statistically controlling for the linear influence of one or more covariates, ANCOVA reduces error variance, increases statistical power, and enables more precise comparisons of treatment means adjusted for confounding variables.

Mathematical Framework

Definition: ANCOVA Model

The one-way ANCOVA model with one covariate $X$ and $g$ treatment groups is:

Y_{ij} = \mu + \tau_j + \beta(X_{ij} - \bar{X}) + \varepsilon_{ij}

where:

$Y_{ij}$ is the response for subject $i$ in group $j$
$\mu$ is the overall intercept (grand mean of $Y$ adjusted to the mean of $X$ )
$\tau_j$ is the treatment effect for group $j$ ( $\sum \tau_j = 0$ )
$\beta$ is the common regression coefficient (slope) for the covariate
$X_{ij}$ is the covariate value, $\bar{X}$ is the overall mean of $X$
$\varepsilon_{ij} \sim N(0, \sigma^2)$ independently

In matrix notation: $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ , where $\mathbf{X}$ includes indicator variables for groups and the covariate.
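As a minimal sketch of the matrix form, the design matrix can be assembled by hand and solved by least squares. The toy data below are hypothetical and noise-free, so the fit recovers the generating coefficients exactly. The sketch uses reference (dummy) coding with group 0 as the baseline, which is an equivalent reparameterization of the sum-to-zero constraint on $\tau_j$; all variable names are illustrative.

```python
import numpy as np

# Hypothetical noise-free toy data: 3 groups, 3 subjects each
group = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
x = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0])
x_centered = x - x.mean()

# Generating coefficients: intercept, group 1 offset, group 2 offset, slope
true_coef = np.array([10.0, 3.0, -3.0, 2.0])

# Design matrix: intercept, indicators for groups 1 and 2, centered covariate
design = np.column_stack([
    np.ones_like(x),
    (group == 1).astype(float),
    (group == 2).astype(float),
    x_centered,
])
y = design @ true_coef

# Ordinary least squares estimate of the coefficient vector
coef_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef_hat, 6))
```

With real data the response would contain noise, and the estimates would differ from the generating values by sampling error.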

ANCOVA Sum of Squares Decomposition

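With the covariate entered first (sequential sums of squares), the total sum of squares is partitioned as:

SS_{\text{Total}} = SS_{\text{Covariate}} + SS_{\text{Group}|\text{Covariate}} + SS_{\text{Error}}

The adjusted treatment sum of squares is the between-group variation that remains after removing covariate variation:

SS_{\text{Group}|\text{Covariate}} = SS_{\text{Total}} - SS_{\text{Covariate}} - SS_{\text{Error}}

or equivalently, as the difference between the adjusted total and the adjusted error sums of squares:

SS_{\text{Group}|\text{Covariate}} = \left[SS_{Y,\text{Total}} - \frac{(SP_{XY,\text{Total}})^2}{SS_{X,\text{Total}}}\right] - \left[SS_{Y,\text{Error}} - \frac{(SP_{XY,\text{Error}})^2}{SS_{X,\text{Error}}}\right]

where $SP_{XY}$ is the sum of cross-products between the covariate $X$ and the response $Y$, $SS_X$ and $SS_Y$ are the sums of squares of the covariate and the response, and the subscripts Total and Error denote deviations taken from the overall means and from the group means, respectively.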
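The adjusted treatment sum of squares can be checked numerically. The sketch below computes it two ways: from the classical sums of squares and cross-products of $X$ and $Y$ (adjusted total minus adjusted error), and as the drop in residual sum of squares when group indicators are added to a covariate-only regression. It uses the pretest and posttest values from the worked example in this document; the helper names `ss_and_sp` and `residual_ss` are illustrative.

```python
import numpy as np

# Pretest (X) and posttest (Y) for groups A, B, C (6 subjects each)
x = np.array([65, 70, 75, 80, 85, 90, 60, 65, 70, 75, 80, 85,
              55, 60, 65, 70, 75, 80], dtype=float)
y = np.array([72, 78, 82, 88, 94, 98, 68, 74, 80, 86, 92, 96,
              65, 70, 76, 82, 88, 94], dtype=float)
group = np.repeat([0, 1, 2], 6)

def ss_and_sp(x_dev, y_dev):
    """Return (SS_X, SS_Y, SP_XY) for arrays of deviations."""
    return np.sum(x_dev ** 2), np.sum(y_dev ** 2), np.sum(x_dev * y_dev)

# Total: deviations from the overall means
ssx_t, ssy_t, sp_t = ss_and_sp(x - x.mean(), y - y.mean())

# Error (within-group): deviations from each group's own means
x_group_mean = np.array([x[group == j].mean() for j in range(3)])[group]
y_group_mean = np.array([y[group == j].mean() for j in range(3)])[group]
ssx_e, ssy_e, sp_e = ss_and_sp(x - x_group_mean, y - y_group_mean)

adjusted_total = ssy_t - sp_t ** 2 / ssx_t
adjusted_error = ssy_e - sp_e ** 2 / ssx_e
ss_group_adjusted = adjusted_total - adjusted_error

def residual_ss(design, response):
    """Residual sum of squares of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return float(resid @ resid)

# Cross-check: RSS of Y ~ X minus RSS of Y ~ group + X
ones = np.ones_like(x)
dummies = np.column_stack([(group == j).astype(float) for j in (1, 2)])
rss_covariate_only = residual_ss(np.column_stack([ones, x]), y)
rss_ancova = residual_ss(np.column_stack([ones, dummies, x]), y)

print(f"Classical formula: {ss_group_adjusted:.4f}")
print(f"Model comparison:  {rss_covariate_only - rss_ancova:.4f}")
```

The two routes agree because the adjusted total is the residual sum of squares of the covariate-only regression and the adjusted error is the residual sum of squares of the ANCOVA model.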

Homogeneity of Regression Slopes

A critical assumption of ANCOVA is that the regression slopes relating the covariate to the dependent variable are equal across all groups.

Theorem: Test for Homogeneity of Slopes

The model including the interaction between group and covariate:

Y_{ij} = \mu + \tau_j + \beta X_{ij} + \gamma_j X_{ij} + \varepsilon_{ij}

where $\gamma_j$ represents group-specific deviations from the common slope $\beta$ (with $\sum \gamma_j = 0$ so that the parameters are identifiable). The null hypothesis of homogeneity of slopes:

H_0: \gamma_1 = \gamma_2 = \cdots = \gamma_g = 0

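The test statistic compares the residual sums of squares of the full model (with interactions) and the reduced model (without interactions):

F = \frac{(SS_{\text{Reduced}} - SS_{\text{Full}}) / (g-1)}{SS_{\text{Full}} / (n - 2g)} = \frac{MS_{\text{Interaction}}}{MS_{\text{Error}}}

Here $SS_{\text{Reduced}}$ and $SS_{\text{Full}}$ are the residual (error) sums of squares of the two models and $n$ is the total sample size; the full model estimates $g$ intercepts and $g$ slopes, leaving $n - 2g$ error degrees of freedom.

If $H_0$ is rejected, the group difference depends on the value of the covariate, so a single adjusted comparison from standard ANCOVA is not meaningful and a separate-slopes analysis is recommended.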
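The nested-model comparison can be sketched with plain least squares. The data below are hypothetical (three groups of four subjects), and `residual_ss` is an illustrative helper, not a library function. The reduced model has one intercept per group and a common slope; the full model has one intercept and one slope per group.

```python
import numpy as np
from scipy import stats

# Hypothetical data: g = 3 groups, 4 subjects each
x = np.tile([1.0, 2.0, 3.0, 4.0], 3)
y = np.array([2.1, 3.9, 6.2, 7.8,
              3.0, 5.1, 6.9, 9.2,
              1.2, 2.8, 5.1, 6.9])
group = np.repeat([0, 1, 2], 4)
n, g = len(y), 3

def residual_ss(design, response):
    """Residual sum of squares of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return float(resid @ resid)

indicators = np.column_stack([(group == j).astype(float) for j in range(g)])

# Reduced model: group intercepts plus one common slope
reduced = np.column_stack([indicators, x])
# Full model: group intercepts plus one slope per group
full = np.column_stack([indicators, indicators * x[:, None]])

rss_reduced = residual_ss(reduced, y)
rss_full = residual_ss(full, y)

df_num, df_den = g - 1, n - 2 * g
f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
p_value = stats.f.sf(f_stat, df_num, df_den)

print(f"F({df_num}, {df_den}) = {f_stat:.3f}, p = {p_value:.4f}")
```

A large p-value here is consistent with parallel slopes, though with samples this small the test has little power to detect differences.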

Implications of Non-Parallel Slopes

When the homogeneity of slopes assumption is violated:

Interaction model: Fit separate slopes: $Y_{ij} = \mu + \tau_j + \beta_j X_{ij} + \varepsilon_{ij}$
Johnson-Neyman technique: Identifies ranges of the covariate where group differences are statistically significant
Simple slopes analysis: Tests group differences at specific values of the covariate (e.g., $\pm 1$ SD from mean)
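A simple slopes analysis can be sketched for two groups as follows. The data are hypothetical, and the function name `group_difference_at` is illustrative. The sketch fits the separate-slopes model, then tests the group difference at the covariate mean and at one standard deviation below and above it, using the coefficient covariance matrix. It does not implement the Johnson-Neyman technique, which would instead solve for the covariate values at which the difference crosses the significance threshold.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two groups with visibly different slopes
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
g = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
y = np.array([2.0, 3.1, 3.9, 5.2, 6.0, 1.0, 3.2, 4.9, 7.1, 9.0])

# Separate-slopes model: y = b0 + b1*g + b2*x + b3*(g*x)
X = np.column_stack([np.ones_like(x), g, x, g * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ b
df_error = len(y) - X.shape[1]
mse = resid @ resid / df_error
cov_b = mse * np.linalg.inv(X.T @ X)  # covariance matrix of the estimates

def group_difference_at(x0):
    """Estimated (group 1 - group 0) difference at covariate value x0."""
    c = np.array([0.0, 1.0, 0.0, x0])  # contrast picks b1 + b3 * x0
    diff = float(c @ b)
    se = float(np.sqrt(c @ cov_b @ c))
    p = 2 * stats.t.sf(abs(diff / se), df_error)
    return diff, se, p

sd = x.std(ddof=1)
for label, x0 in [("-1 SD", x.mean() - sd), ("mean", x.mean()), ("+1 SD", x.mean() + sd)]:
    diff, se, p = group_difference_at(x0)
    print(f"{label:>5}: difference = {diff:.3f}, SE = {se:.3f}, p = {p:.4f}")
```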

Adjusted Means

ANCOVA produces adjusted (least-squares) means that account for the covariate, providing fairer comparisons when groups have different covariate distributions.

Definition: Adjusted (Marginal) Mean

The adjusted mean for group $j$ is the predicted group mean at the overall mean of the covariate:

\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\beta}(\bar{X}_j - \bar{X})

where $\bar{Y}_j$ is the unadjusted group mean, $\hat{\beta}$ is the estimated common slope, $\bar{X}_j$ is the group mean of the covariate, and $\bar{X}$ is the overall covariate mean.

With multiple covariates, the adjustment uses the vector of slopes and the vectors of covariate means:

\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{X}}_j - \bar{\mathbf{X}})

Standard Error of Adjusted Means

The standard error of the adjusted mean for group $j$ :

SE(\bar{Y}_{j,\text{adj}}) = \sqrt{MS_{\text{Error}}\left[\frac{1}{n_j} + \frac{(\bar{X}_j - \bar{X})^2}{SS_{X,\text{Error}}}\right]}

where $MS_{\text{Error}}$ is the mean square error from the ANCOVA model, $n_j$ is the group size, and $SS_{X,\text{Error}} = \sum_j \sum_i (X_{ij} - \bar{X}_j)^2$ is the within-group (error) sum of squares of the covariate. Confidence intervals for adjusted means:

\bar{Y}_{j,\text{adj}} \pm t_{\alpha/2, df_{\text{Error}}} \cdot SE(\bar{Y}_{j,\text{adj}})
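The standard error and confidence interval translate directly into a small function. This is a minimal sketch: the function name `adjusted_mean_ci` and the summary statistics passed to it are hypothetical, chosen only to exercise the formula.

```python
import math
from scipy import stats

def adjusted_mean_ci(y_bar_j, x_bar_j, x_bar, beta_hat, ms_error,
                     n_j, ss_x_error, df_error, alpha=0.05):
    """Adjusted mean, its standard error, and a (1 - alpha) confidence interval."""
    adj_mean = y_bar_j - beta_hat * (x_bar_j - x_bar)
    se = math.sqrt(ms_error * (1.0 / n_j + (x_bar_j - x_bar) ** 2 / ss_x_error))
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    return adj_mean, se, (adj_mean - t_crit * se, adj_mean + t_crit * se)

# Hypothetical summary statistics (3 groups of 20, one covariate: df_error = 56)
adj_mean, se, ci = adjusted_mean_ci(
    y_bar_j=80.0, x_bar_j=70.0, x_bar=68.0, beta_hat=0.8,
    ms_error=25.0, n_j=20, ss_x_error=4000.0, df_error=56,
)
print(f"adjusted mean = {adj_mean:.2f}, SE = {se:.3f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The second term under the square root grows as a group's covariate mean moves away from the overall mean, so groups that need a larger adjustment also get a wider interval.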

ANCOVA Calculation with Covariate Adjustment

Problem: A teaching experiment compares three methods ( $g = 3$ ) with pretest scores as covariate. Data ( $n = 6$ per group):

Group	Pretest ( $X$ )	Posttest ( $Y$ )
A	65, 70, 75, 80, 85, 90	72, 78, 82, 88, 94, 98
B	60, 65, 70, 75, 80, 85	68, 74, 80, 86, 92, 96
C	55, 60, 65, 70, 75, 80	65, 70, 76, 82, 88, 94

Step 1: Compute summary statistics:

Group	$\bar{X}_j$	$\bar{Y}_j$	$n_j$
A	77.5	85.33	6
B	72.5	82.67	6
C	67.5	79.17	6
Overall	72.5	82.39	18

Step 2: Fit common regression of $Y$ on $X$ within groups (test parallel slopes):

For this illustration, treat the slopes as homogeneous; in a real analysis, confirm this with the interaction test before proceeding.

Common slope estimate from the pooled within-group regression: the within-group sum of cross-products of $X$ and $Y$ (1472.5) divided by the within-group sum of squares of $X$ (1312.5) gives $\hat{\beta} = 1472.5 / 1312.5 \approx 1.122$

Step 3: Compute adjusted means:

\bar{Y}_{A,\text{adj}} = 85.33 - 1.122(77.5 - 72.5) = 85.33 - 5.61 = 79.72

\bar{Y}_{B,\text{adj}} = 82.67 - 1.122(72.5 - 72.5) = 82.67 - 0 = 82.67

\bar{Y}_{C,\text{adj}} = 79.17 - 1.122(67.5 - 72.5) = 79.17 + 5.61 = 84.78

Interpretation: Without covariate adjustment, Group A appears best ( $\bar{Y}_A = 85.33$ ). After adjusting for pretest scores (Group A had the highest pretest), Group C shows the highest adjusted posttest performance ( $\bar{Y}_{C,\text{adj}} = 84.78$ ) and Group A the lowest ( $\bar{Y}_{A,\text{adj}} = 79.72$ ). The covariate adjustment indicates that Group A's apparent superiority largely reflected its higher entering ability rather than the teaching method.
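The pooled within-group slope and the adjusted means can be recomputed directly from the raw table values. A minimal sketch using NumPy only; the variable names are illustrative:

```python
import numpy as np

# Pretest (X) and posttest (Y) values from the data table
x = {
    "A": np.array([65, 70, 75, 80, 85, 90], dtype=float),
    "B": np.array([60, 65, 70, 75, 80, 85], dtype=float),
    "C": np.array([55, 60, 65, 70, 75, 80], dtype=float),
}
y = {
    "A": np.array([72, 78, 82, 88, 94, 98], dtype=float),
    "B": np.array([68, 74, 80, 86, 92, 96], dtype=float),
    "C": np.array([65, 70, 76, 82, 88, 94], dtype=float),
}

# Pooled within-group slope: within-group cross-products over within-group SS of X
sp_within = sum(np.sum((x[g] - x[g].mean()) * (y[g] - y[g].mean())) for g in x)
ss_x_within = sum(np.sum((x[g] - x[g].mean()) ** 2) for g in x)
beta_hat = sp_within / ss_x_within

# Adjusted mean: group mean of Y shifted to the overall covariate mean
x_bar = np.mean(np.concatenate(list(x.values())))
adjusted = {g: y[g].mean() - beta_hat * (x[g].mean() - x_bar) for g in x}

print(f"pooled slope = {beta_hat:.3f}")
for g, value in adjusted.items():
    print(f"group {g}: unadjusted = {y[g].mean():.2f}, adjusted = {value:.2f}")
```

Because Group B's pretest mean equals the overall pretest mean, its adjusted and unadjusted means coincide whatever the slope.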

Effect Size: Partial Eta-Squared

Definition: Partial Eta-Squared ($\eta_p^2$)

Partial eta-squared measures the proportion of variance explained by a factor after removing variance attributable to other factors and covariates:

\eta_p^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Effect}} + SS_{\text{Error}}}

For the treatment effect in ANCOVA:

\eta_p^2(\text{Group}) = \frac{SS_{\text{Group}|\text{Covariate}}}{SS_{\text{Group}|\text{Covariate}} + SS_{\text{Error}}}

Interpretation guidelines (Cohen, 1988):

$\eta_p^2 = 0.01$ : Small effect
$\eta_p^2 = 0.06$ : Medium effect
$\eta_p^2 = 0.14$ : Large effect

Unlike regular $\eta^2$ , partial $\eta^2$ does not sum to 1.0 across effects because the denominator excludes variance explained by other effects.

Relationship Between $F$ and $\eta_p^2$

Partial eta-squared can be computed directly from the F-statistic:

\eta_p^2 = \frac{(g-1)F}{(g-1)F + df_{\text{Error}}} = \frac{F}{F + df_{\text{Error}}/(g-1)}

This relationship enables calculation of effect sizes from reported F-values and degrees of freedom.
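As a quick check of this identity, the sketch below converts an F-value to partial eta-squared and compares it with the sum-of-squares definition. The function name and the table entries are hypothetical.

```python
def partial_eta_squared_from_f(f_value, df_effect, df_error):
    """Partial eta-squared recovered from an F-statistic and its degrees of freedom."""
    return (df_effect * f_value) / (df_effect * f_value + df_error)

# Hypothetical ANCOVA table entries for the group effect
ss_effect, ss_error = 30.0, 70.0
df_effect, df_error = 2, 35

f_value = (ss_effect / df_effect) / (ss_error / df_error)  # F = MS_effect / MS_error
eta_from_f = partial_eta_squared_from_f(f_value, df_effect, df_error)
eta_from_ss = ss_effect / (ss_effect + ss_error)

print(f"{eta_from_f:.3f} {eta_from_ss:.3f}")  # 0.300 0.300
```

For a group effect, `df_effect` is $g - 1$, matching the formula above.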

Assumptions and Diagnostics

Theorem: ANCOVA Assumptions

1. Linearity: The relationship between the covariate and dependent variable is linear within each group. Test through scatter plots and residual analysis.

2. Homogeneity of Regression Slopes: The slope of $Y$ on $X$ is the same across groups. Test via the Group × Covariate interaction:

F_{\text{interaction}} = \frac{MS_{\text{Group} \times X}}{MS_{\text{Error}}}

Non-significance ( $p > 0.05$ ) supports the assumption.

3. Homoscedasticity: Error variances are equal across groups. Test using Levene's test on residuals.

4. Normality of Residuals: Residuals are normally distributed. Test using Shapiro-Wilk test, Q-Q plots.

5. Independence: Observations are independent of one another. Random assignment of individual subjects supports this; clustered or repeated measurements violate it.

6. Covariate Independence: The covariate is not affected by the treatment (measured before treatment or from a separate source).

ANCOVA vs. Blocking vs. Regression

ANCOVA adjusts for a continuous covariate, treating it as a nuisance variable to reduce error variance
Blocking adjusts for a categorical nuisance variable (blocks), treating it as a fixed factor
Multiple regression $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots$ can incorporate both factors and covariates using dummy coding

ANCOVA is the special case where some predictors are categorical (factors) and others are continuous (covariates). The general linear model unifies all three approaches.

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

# Generate ANCOVA data
np.random.seed(42)

n_per_group = 25
n_groups = 3

# Group-specific true effects
group_effects = {'A': 5.0, 'B': 8.0, 'C': 3.0}
beta_true = 0.75  # True common slope

# Generate covariate (pretest) - different distributions per group
covariate_means = {'A': 75, 'B': 65, 'C': 70}
data_list = []

for group, mean_x in covariate_means.items():
    X = np.random.normal(mean_x, 10, n_per_group)
    Y = 50 + group_effects[group] + beta_true * X + np.random.normal(0, 5, n_per_group)
    
    df = pd.DataFrame({
        'Pretest': X,
        'Posttest': Y,
        'Group': group
    })
    data_list.append(df)

data = pd.concat(data_list, ignore_index=True)

print("Data Summary:")
print(data.groupby('Group').agg(['mean', 'std', 'count']))

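# Test homogeneity of slopes via the Group x Pretest interaction term
# (model.fvalue is the overall model F, so the interaction row is read from the ANOVA table)
from statsmodels.stats.anova import anova_lm
model_interaction = ols('Posttest ~ C(Group) * Pretest', data=data).fit()
interaction_table = anova_lm(model_interaction, typ=2)
print("\nInteraction Model (Test Homogeneity of Slopes):")
print(f"Interaction F-statistic: {interaction_table.loc['C(Group):Pretest', 'F']:.3f}")
print(f"Interaction p-value: {interaction_table.loc['C(Group):Pretest', 'PR(>F)']:.4f}")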

# ANCOVA model (assuming parallel slopes)
model_ancova = ols('Posttest ~ C(Group) + Pretest', data=data).fit()
print("\nANCOVA Model Summary:")
print(model_ancova.summary())

# Extract ANCOVA table
from statsmodels.stats.anova import anova_lm
ancova_table = anova_lm(model_ancova, typ=2)
print("\nANCOVA Table (Type II):")
print(ancova_table)

# Compute partial eta-squared
ss_group = ancova_table.loc['C(Group)', 'sum_sq']
ss_covariate = ancova_table.loc['Pretest', 'sum_sq']
ss_error = ancova_table.loc['Residual', 'sum_sq']

eta2_partial_group = ss_group / (ss_group + ss_error)
eta2_partial_covariate = ss_covariate / (ss_covariate + ss_error)

print(f"\nPartial eta-squared (Group): {eta2_partial_group:.4f}")
print(f"Partial eta-squared (Pretest): {eta2_partial_covariate:.4f}")

# Compute adjusted means
group_means = data.groupby('Group')['Posttest'].mean()
group_cov_means = data.groupby('Group')['Pretest'].mean()
overall_cov_mean = data['Pretest'].mean()
beta_hat = model_ancova.params['Pretest']

adjusted_means = {}
for group in ['A', 'B', 'C']:
    adj_mean = group_means[group] - beta_hat * (group_cov_means[group] - overall_cov_mean)
    adjusted_means[group] = adj_mean

print("\nUnadjusted vs Adjusted Means:")
print(f"{'Group':<8} {'Unadjusted':<12} {'Adjusted':<12} {'Pretest Mean':<14}")
for group in ['A', 'B', 'C']:
    print(f"{group:<8} {group_means[group]:<12.3f} {adjusted_means[group]:<12.3f} "
          f"{group_cov_means[group]:<14.1f}")

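# Standard error of adjusted means
# (the Type II table has no 'mean_sq' column, so MS_Error comes from the fitted model)
ms_error = model_ancova.mse_resid
df_error = model_ancova.df_resid
n_g = data.groupby('Group').size()

# Within-group (error) sum of squares of the covariate, SS_X,Error
ss_x_error = ((data['Pretest'] - data.groupby('Group')['Pretest'].transform('mean'))**2).sum()

se_adjusted = {}
for group in ['A', 'B', 'C']:
    se = np.sqrt(ms_error * (1/n_g[group] + 
                            (group_cov_means[group] - overall_cov_mean)**2 / 
                            ss_x_error))
    se_adjusted[group] = se

# 95% CI uses the t critical value on the error degrees of freedom
t_crit = stats.t.ppf(0.975, df_error)

print("\nAdjusted Means with 95% CI:")
for group in ['A', 'B', 'C']:
    ci_low = adjusted_means[group] - t_crit * se_adjusted[group]
    ci_high = adjusted_means[group] + t_crit * se_adjusted[group]
    print(f"  Group {group}: {adjusted_means[group]:.3f} [{ci_low:.3f}, {ci_high:.3f}]")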

# Assumption diagnostics
# 1. Linearity - scatter plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = {'A': 'blue', 'B': 'red', 'C': 'green'}

for i, group in enumerate(['A', 'B', 'C']):
    mask = data['Group'] == group
    axes[i].scatter(data[mask]['Pretest'], data[mask]['Posttest'], 
                   c=colors[group], alpha=0.6, s=50)
    
    # Add ANCOVA fitted line (group A is the reference level, so it has no offset term)
    x_range = np.linspace(data[mask]['Pretest'].min(), 
                         data[mask]['Pretest'].max(), 100)
    group_offset = model_ancova.params.get(f'C(Group)[T.{group}]', 0.0)
    y_pred = model_ancova.params['Intercept'] + group_offset + beta_hat * x_range
    axes[i].plot(x_range, y_pred, 'k-', linewidth=2)
    
    axes[i].set_xlabel('Pretest (Covariate)')
    axes[i].set_ylabel('Posttest')
    axes[i].set_title(f'Group {group}')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ancova_linearity.png', dpi=150)
plt.show()

# 2. Homogeneity of slopes visualization
fig, ax = plt.subplots(figsize=(8, 6))

for group in ['A', 'B', 'C']:
    mask = data['Group'] == group
    ax.scatter(data[mask]['Pretest'], data[mask]['Posttest'], 
              c=colors[group], alpha=0.6, label=group, s=50)
    
    # Separate regression lines
    z = np.polyfit(data[mask]['Pretest'], data[mask]['Posttest'], 1)
    x_line = np.linspace(data[mask]['Pretest'].min(), 
                        data[mask]['Pretest'].max(), 100)
    ax.plot(x_line, np.polyval(z, x_line), color=colors[group], 
           linestyle='--', linewidth=2)

# ANCOVA fitted line for the reference group (A); other groups share this slope with shifted intercepts
x_common = np.linspace(data['Pretest'].min(), data['Pretest'].max(), 100)
y_common = model_ancova.params['Intercept'] + beta_hat * x_common
ax.plot(x_common, y_common, 'k-', linewidth=2, label='ANCOVA fit, group A (common slope)')

ax.set_xlabel('Pretest (Covariate)')
ax.set_ylabel('Posttest')
ax.set_title('Test of Homogeneity of Regression Slopes')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('ancova_slopes.png', dpi=150)
plt.show()

# 3. Residual diagnostics
residuals = model_ancova.resid
fitted = model_ancova.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Residuals vs Fitted
axes[0].scatter(fitted, residuals, alpha=0.6)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Normal Q-Q Plot')
axes[1].grid(True, alpha=0.3)

# Residuals by group
for group in ['A', 'B', 'C']:
    mask = data['Group'] == group
    group_residuals = residuals[mask]
    axes[2].boxplot(group_residuals, positions=[list(['A', 'B', 'C']).index(group)],
                   widths=0.5)

axes[2].set_xticklabels(['A', 'B', 'C'])
axes[2].set_xlabel('Group')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residuals by Group')
axes[2].axhline(0, color='red', linestyle='--')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ancova_diagnostics.png', dpi=150)
plt.show()

# Levene's test for homoscedasticity, applied to the ANCOVA residuals by group
levene_stat, levene_p = stats.levene(*[residuals[data['Group'] == g] 
                                       for g in ['A', 'B', 'C']])
print(f"\nLevene's test for homoscedasticity: F = {levene_stat:.3f}, p = {levene_p:.4f}")

# Shapiro-Wilk test for normality of residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk test for normality: W = {shapiro_stat:.3f}, p = {shapiro_p:.4f}")

# Comparison: ANOVA vs ANCOVA
from scipy.stats import f_oneway

# Separate ANOVAs (one-way)
anova_groups = [data[data['Group'] == g]['Posttest'] for g in ['A', 'B', 'C']]
f_anova, p_anova = f_oneway(*anova_groups)

print("\nComparison: One-way ANOVA vs ANCOVA")
print(f"One-way ANOVA: F({n_groups-1}, {len(data)-n_groups}) = {f_anova:.3f}, p = {p_anova:.4f}")
print(f"ANCOVA: F({n_groups-1}, {len(data)-n_groups-1}) = {ancova_table.loc['C(Group)', 'F']:.3f}, "
      f"p = {ancova_table.loc['C(Group)', 'PR(>F)']:.4f}")
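# Error reduction: ANCOVA residual SS relative to the one-way ANOVA within-group SS
ss_error_anova = ((data['Posttest'] - data.groupby('Group')['Posttest'].transform('mean'))**2).sum()
print(f"Error reduction: {(1 - ss_error/ss_error_anova)*100:.1f}%")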

Summary: Analysis of Covariance (ANCOVA)

ANCOVA Model: $Y_{ij} = \mu + \tau_j + \beta(X_{ij} - \bar{X}) + \varepsilon_{ij}$ combines categorical factors with continuous covariates
Adjusted Means: $\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\beta}(\bar{X}_j - \bar{X})$ provide fair group comparisons when groups differ on the covariate
Homogeneity of Slopes: Critical assumption tested by Group × Covariate interaction; violation requires separate slopes analysis
Partial Eta-Squared: $\eta_p^2 = SS_{\text{Effect}}/(SS_{\text{Effect}} + SS_{\text{Error}})$ measures effect size after removing covariate variance
Error Reduction: ANCOVA reduces $MS_{\text{Error}}$ by removing covariate-related variance, increasing power to detect group differences
Key Assumption: Covariate must be independent of treatment (measured before treatment) to avoid bias from treatment effects on the covariate
