Analysis of Covariance (ANCOVA)

Adjusting for Confounders While Testing Group Differences

ANCOVA combines ANOVA with regression to compare group means while statistically controlling for continuous covariates. Adjusting for a covariate that predicts the outcome increases precision and can reduce bias from measured confounding variables.

  • Clinical trials: Compare treatment effects while adjusting for baseline severity scores
  • Education research: Assess school performance while controlling for socioeconomic status
  • Agriculture: Compare crop yields while adjusting for soil quality differences across fields

ANCOVA levels the playing field so you can see true group differences clearly.


Analysis of Covariance (ANCOVA) extends the general linear model to include both categorical independent variables (factors) and continuous independent variables (covariates). ANCOVA combines the explanatory power of ANOVA for group comparisons with the precision-enhancing capability of regression through covariate adjustment. By statistically controlling for the linear influence of one or more covariates, ANCOVA reduces error variance, increases statistical power, and enables more precise comparisons of treatment means adjusted for confounding variables.

Mathematical Framework

ANCOVA Model

The one-way ANCOVA model with one covariate $X$ and $g$ treatment groups is:

$$Y_{ij} = \mu + \tau_j + \beta(X_{ij} - \bar{X}) + \varepsilon_{ij}$$

where:

  • $Y_{ij}$ is the response for subject $i$ in group $j$
  • $\mu$ is the overall intercept (grand mean of $Y$ adjusted to the mean of $X$)
  • $\tau_j$ is the treatment effect for group $j$ ($\sum_j \tau_j = 0$)
  • $\beta$ is the common regression coefficient (slope) for the covariate
  • $X_{ij}$ is the covariate value and $\bar{X}$ is the overall mean of $X$
  • $\varepsilon_{ij} \sim N(0, \sigma^2)$ independently

In matrix notation: $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where the design matrix $\mathbf{X}$ includes indicator variables for groups and the covariate.
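To make the matrix form concrete, the sketch below builds a design matrix and solves the least-squares problem with NumPy. The toy data and the helper name `ancova_design` are hypothetical. The sketch also assumes reference (dummy) coding with the first group as baseline rather than the sum-to-zero constraint; the slope estimate is the same under either coding.

```python
import numpy as np

# Hypothetical toy data: 3 groups of 4 subjects (values chosen for illustration only)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
x = np.array([10., 12., 14., 16., 9., 11., 13., 15., 8., 10., 12., 14.])
y = np.array([20., 23., 25., 28., 22., 24., 27., 29., 18., 21., 23., 26.])


def ancova_design(group, x):
    """Design matrix: intercept, one dummy per non-baseline group, centered covariate."""
    levels = np.unique(group)
    dummies = [(group == lev).astype(float) for lev in levels[1:]]
    return np.column_stack([np.ones_like(x), *dummies, x - x.mean()])


X = ancova_design(group, x)

# Ordinary least squares: minimizes ||y - X b||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_hat = coef[-1]  # common slope for the covariate

print("design matrix shape:", X.shape)
print("coefficients:", np.round(coef, 4))
```

With this coding, the first coefficient is the baseline group's mean at the overall covariate mean, the dummy coefficients are each group's offset from the baseline, and the last coefficient is the common slope.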

ANCOVA Sum of Squares Decomposition

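With the covariate entered first (sequential sums of squares), the total sum of squares is partitioned as:

$$SS_{\text{Total}} = SS_{\text{Covariate}} + SS_{\text{Group}|\text{Covariate}} + SS_{\text{Error}}$$

The adjusted treatment sum of squares is what remains after removing covariate and error variation:

$$SS_{\text{Group}|\text{Covariate}} = SS_{\text{Total}} - SS_{\text{Covariate}} - SS_{\text{Error}}$$

or equivalently, the adjusted total minus the adjusted error sum of squares:

$$SS_{\text{Group}|\text{Covariate}} = \left(SS_{Y,\text{Total}} - \frac{(SP_{XY,\text{Total}})^2}{SS_{X,\text{Total}}}\right) - \left(SS_{Y,\text{Error}} - \frac{(SP_{XY,\text{Error}})^2}{SS_{X,\text{Error}}}\right)$$

where $SP_{XY}$ is the sum of cross-products of the covariate and the response, $SS_X$ and $SS_Y$ are their sums of squares, the subscript Total means deviations are taken about the grand means, and the subscript Error means they are pooled within groups. In this notation $SS_{\text{Covariate}} = (SP_{XY,\text{Total}})^2 / SS_{X,\text{Total}}$ and $SS_{\text{Error}} = SS_{Y,\text{Error}} - (SP_{XY,\text{Error}})^2 / SS_{X,\text{Error}}$.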
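One way to compute the adjusted treatment sum of squares is the textbook identity "adjusted total minus adjusted error", built from sums of squares and cross-products. The sketch below is a minimal illustration on hypothetical toy data; the function name `adjusted_group_ss` is made up for this example.

```python
import numpy as np


def adjusted_group_ss(y, x, group):
    """Return (SS_Group|Covariate, SS_Error) for a one-way ANCOVA with one covariate."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)

    # Total sums of squares and cross-products (deviations from grand means)
    dx, dy = x - x.mean(), y - y.mean()
    t_xx, t_yy, t_xy = dx @ dx, dy @ dy, dx @ dy

    # Pooled within-group (error) sums of squares and cross-products
    e_xx = e_yy = e_xy = 0.0
    for lev in np.unique(group):
        m = group == lev
        wx, wy = x[m] - x[m].mean(), y[m] - y[m].mean()
        e_xx += wx @ wx
        e_yy += wy @ wy
        e_xy += wx @ wy

    ss_total_adj = t_yy - t_xy**2 / t_xx   # residual SS after regressing Y on X alone
    ss_error_adj = e_yy - e_xy**2 / e_xx   # residual SS of the ANCOVA model
    return ss_total_adj - ss_error_adj, ss_error_adj


# Hypothetical toy data (illustration only)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
x = np.array([10., 12., 14., 16., 9., 11., 13., 15., 8., 10., 12., 14.])
y = np.array([20., 23., 25., 28., 22., 24., 27., 29., 18., 21., 23., 26.])

ss_group_adj, ss_error_adj = adjusted_group_ss(y, x, group)
print(f"SS_Group|Covariate = {ss_group_adj:.4f}, SS_Error = {ss_error_adj:.4f}")
```

The same quantity is the drop in residual sum of squares when group indicators are added to a regression of the response on the covariate alone.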

Homogeneity of Regression Slopes

A critical assumption of ANCOVA is that the regression slopes relating the covariate to the dependent variable are equal across all groups.

Test for Homogeneity of Slopes

The model including the interaction between group and covariate is:

$$Y_{ij} = \mu + \tau_j + \beta X_{ij} + \gamma_j X_{ij} + \varepsilon_{ij}$$

where $\gamma_j$ represents the group-specific deviation from the common slope $\beta$ (with $\sum_j \gamma_j = 0$, so only $g - 1$ of them are free). The null hypothesis of homogeneity of slopes is:

$$H_0: \gamma_1 = \gamma_2 = \cdots = \gamma_g = 0$$

The test statistic compares the residual sums of squares of the full model (with interactions) and the reduced model (without interactions):

$$F = \frac{(SS_{\text{Reduced}} - SS_{\text{Full}}) / (g-1)}{SS_{\text{Full}} / (n - 2g)} = \frac{MS_{\text{Interaction}}}{MS_{\text{Error}}}$$

If $H_0$ is rejected, ANCOVA with a common slope is inappropriate and separate regression analyses for each group are recommended.
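The full-versus-reduced comparison can be sketched directly from residual sums of squares. This is a minimal illustration on hypothetical toy data, not a replacement for a statistics package; the function name `slopes_f_test` is made up, and the p-value comes from `scipy.stats.f.sf`.

```python
import numpy as np
from scipy import stats


def slopes_f_test(y, x, group):
    """F-test of equal slopes: separate-slopes model vs common-slope model."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)
    levels = np.unique(group)
    g, n = len(levels), len(y)

    # One indicator column per group (separate intercepts)
    d = np.column_stack([(group == lev).astype(float) for lev in levels])
    X_reduced = np.column_stack([d, x])            # g intercepts + 1 common slope
    X_full = np.column_stack([d, d * x[:, None]])  # g intercepts + g slopes

    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return resid @ resid

    ss_reduced, ss_full = rss(X_reduced), rss(X_full)
    f_stat = ((ss_reduced - ss_full) / (g - 1)) / (ss_full / (n - 2 * g))
    p_value = stats.f.sf(f_stat, g - 1, n - 2 * g)
    return f_stat, p_value


# Hypothetical toy data (illustration only)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
x = np.array([10., 12., 14., 16., 9., 11., 13., 15., 8., 10., 12., 14.])
y = np.array([20., 23., 25., 28., 22., 24., 27., 29., 18., 21., 23., 26.])

f_stat, p_value = slopes_f_test(y, x, group)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

The numerator has $g - 1$ degrees of freedom because the full model adds $g - 1$ slope parameters, and the denominator has $n - 2g$ because the full model estimates $g$ intercepts and $g$ slopes.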

Implications of Non-Parallel Slopes

When the homogeneity of slopes assumption is violated:

  1. Interaction model: Fit separate slopes: $Y_{ij} = \mu + \tau_j + \beta_j X_{ij} + \varepsilon_{ij}$
  2. Johnson-Neyman technique: Identifies ranges of the covariate where group differences are statistically significant
  3. Simple slopes analysis: Tests group differences at specific values of the covariate (e.g., $\pm 1$ SD from the mean)
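As a rough illustration of the simple-slopes idea, the sketch below fits a separate line to each of two groups and evaluates the estimated group difference at the covariate mean and at one standard deviation on either side. The data are hypothetical and chosen so the lines cross; the sketch reports point estimates only and does not carry out the significance tests.

```python
import numpy as np

# Hypothetical two-group data with clearly different slopes
x_a = np.array([1., 2., 3., 4., 5.])
y_a = np.array([2., 4., 6., 8., 10.])   # group A: slope 2
x_b = np.array([1., 2., 3., 4., 5.])
y_b = np.array([5., 6., 7., 8., 9.])    # group B: slope 1

# Separate regression line per group (np.polyfit returns slope first for degree 1)
slope_a, icpt_a = np.polyfit(x_a, y_a, 1)
slope_b, icpt_b = np.polyfit(x_b, y_b, 1)

# Probe points: covariate mean and +/- 1 SD, pooled over both groups
x_all = np.concatenate([x_a, x_b])
probe = x_all.mean() + np.array([-1.0, 0.0, 1.0]) * x_all.std(ddof=1)

# Estimated difference (A minus B) at each probe point
diff = (icpt_a + slope_a * probe) - (icpt_b + slope_b * probe)

for x0, d in zip(probe, diff):
    print(f"covariate = {x0:.2f}: A - B = {d:+.2f}")
```

Because the slopes differ, the sign of the difference depends on where along the covariate it is evaluated, which is why a single adjusted-mean comparison is not meaningful in this situation.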

Adjusted Means

ANCOVA produces adjusted (least-squares) means that account for the covariate, providing fairer comparisons when groups have different covariate distributions.

Adjusted (Marginal) Mean

The adjusted mean for group $j$ is the predicted group mean at the overall mean of the covariate:

$$\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\beta}(\bar{X}_j - \bar{X})$$

where $\bar{Y}_j$ is the unadjusted group mean, $\hat{\beta}$ is the estimated common slope, $\bar{X}_j$ is the group mean of the covariate, and $\bar{X}$ is the overall covariate mean.

With multiple covariates, the same adjustment in vector notation is:

$$\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{X}}_j - \bar{\mathbf{X}})$$

where $\hat{\boldsymbol{\beta}}$ is the vector of common slopes, and $\bar{\mathbf{X}}_j$ and $\bar{\mathbf{X}}$ are the vectors of group and overall covariate means.

Standard Error of Adjusted Means

The standard error of the adjusted mean for group $j$ is:

$$SE(\bar{Y}_{j,\text{adj}}) = \sqrt{MS_{\text{Error}}\left[\frac{1}{n_j} + \frac{(\bar{X}_j - \bar{X})^2}{SS_{X,\text{Error}}}\right]}$$

where $MS_{\text{Error}}$ is the mean square error from the ANCOVA model, $n_j$ is the group size, and $SS_{X,\text{Error}}$ is the error (pooled within-group) sum of squares for the covariate. Confidence intervals for adjusted means:

$$\bar{Y}_{j,\text{adj}} \pm t_{\alpha/2,\, df_{\text{Error}}} \cdot SE(\bar{Y}_{j,\text{adj}})$$
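The adjusted means, their standard errors, and the t-based intervals can be computed together from within-group sums of squares. The sketch below is a minimal illustration for one covariate on hypothetical toy data; the function name `adjusted_means_ci` is made up, and the error degrees of freedom are taken as $n - g - 1$.

```python
import numpy as np
from scipy import stats


def adjusted_means_ci(y, x, group, alpha=0.05):
    """Return {group: (adjusted mean, SE, CI low, CI high)} for one covariate."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)
    levels = np.unique(group)
    n, g = len(y), len(levels)

    # Within-group deviations of covariate and response
    xw, yw = x.copy(), y.copy()
    for lev in levels:
        m = group == lev
        xw[m] -= x[m].mean()
        yw[m] -= y[m].mean()
    e_xx, e_xy, e_yy = xw @ xw, xw @ yw, yw @ yw

    beta_hat = e_xy / e_xx                       # pooled within-group slope
    df_error = n - g - 1
    ms_error = (e_yy - e_xy**2 / e_xx) / df_error
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)

    out = {}
    for lev in levels:
        m = group == lev
        shift = x[m].mean() - x.mean()
        adj = y[m].mean() - beta_hat * shift
        se = np.sqrt(ms_error * (1 / m.sum() + shift**2 / e_xx))
        out[lev.item()] = (adj, se, adj - t_crit * se, adj + t_crit * se)
    return out


# Hypothetical toy data (illustration only)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
x = np.array([10., 12., 14., 16., 9., 11., 13., 15., 8., 10., 12., 14.])
y = np.array([20., 23., 25., 28., 22., 24., 27., 29., 18., 21., 23., 26.])

res = adjusted_means_ci(y, x, group)
for lev, (adj, se, lo, hi) in res.items():
    print(f"group {lev}: adjusted mean = {adj:.3f}, SE = {se:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A group whose covariate mean sits far from the overall mean gets a wider interval, because the second term inside the square root grows with that distance.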

ANCOVA Calculation with Covariate Adjustment

Problem: A teaching experiment compares three methods ($g = 3$) with pretest scores as covariate. Data ($n = 6$ per group):

| Group | Pretest ($X$) | Posttest ($Y$) |
|-------|---------------|----------------|
| A | 65, 70, 75, 80, 85, 90 | 72, 78, 82, 88, 94, 98 |
| B | 60, 65, 70, 75, 80, 85 | 68, 74, 80, 86, 92, 96 |
| C | 55, 60, 65, 70, 75, 80 | 65, 70, 76, 82, 88, 94 |

Step 1: Compute summary statistics:

| Group | $\bar{X}_j$ | $\bar{Y}_j$ | $n_j$ |
|-------|-------------|-------------|-------|
| A | 77.5 | 85.33 | 6 |
| B | 72.5 | 82.67 | 6 |
| C | 67.5 | 79.17 | 6 |
| Overall | 72.5 | 82.39 | 18 |

Step 2: Fit the common regression of $Y$ on $X$ within groups (after testing for parallel slopes):

Assume slopes are homogeneous ($p > 0.05$ for the interaction test).

Common slope estimate from the pooled within-group regression: the within-group cross-products are 460, 500, and 512.5 (sum 1472.5), and the within-group sums of squares of $X$ are 437.5 in each group (sum 1312.5), so $\hat{\beta} = 1472.5 / 1312.5 \approx 1.122$.

Step 3: Compute adjusted means:

$$\bar{Y}_{A,\text{adj}} = 85.33 - 1.122(77.5 - 72.5) = 85.33 - 5.61 = 79.72$$
$$\bar{Y}_{B,\text{adj}} = 82.67 - 1.122(72.5 - 72.5) = 82.67 - 0 = 82.67$$
$$\bar{Y}_{C,\text{adj}} = 79.17 - 1.122(67.5 - 72.5) = 79.17 + 5.61 = 84.78$$

Interpretation: Without covariate adjustment, Group A appears best ($\bar{Y}_A = 85.33$). After adjusting for pretest scores (Group A had the highest pretest), Group C actually shows the highest adjusted posttest performance ($\bar{Y}_{C,\text{adj}} = 84.78$). The covariate adjustment reveals that Group A's apparent superiority was largely due to selection bias (higher entering ability).
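The arithmetic in this example can be checked by recomputing the pooled within-group slope and the adjusted means directly from the data table. The sketch below uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# Data from the teaching-experiment table above
pre = {
    "A": [65, 70, 75, 80, 85, 90],
    "B": [60, 65, 70, 75, 80, 85],
    "C": [55, 60, 65, 70, 75, 80],
}
post = {
    "A": [72, 78, 82, 88, 94, 98],
    "B": [68, 74, 80, 86, 92, 96],
    "C": [65, 70, 76, 82, 88, 94],
}

# Pooled within-group sum of squares of X and cross-products of X and Y
e_xx = e_xy = 0.0
for g in pre:
    x = np.array(pre[g], dtype=float)
    y = np.array(post[g], dtype=float)
    e_xx += ((x - x.mean()) ** 2).sum()
    e_xy += ((x - x.mean()) * (y - y.mean())).sum()

beta_hat = e_xy / e_xx  # common (pooled within-group) slope

# Overall covariate mean across all 18 subjects
grand_x = np.mean([v for g in pre for v in pre[g]])

# Adjusted mean = group mean of Y minus slope times covariate-mean offset
adjusted = {
    g: np.mean(post[g]) - beta_hat * (np.mean(pre[g]) - grand_x)
    for g in pre
}

print(f"pooled slope: {beta_hat:.4f}")
for g, value in adjusted.items():
    print(f"group {g}: unadjusted = {np.mean(post[g]):.2f}, adjusted = {value:.2f}")
```

Group B's adjusted mean equals its unadjusted mean because its pretest mean coincides with the overall pretest mean.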

Effect Size: Partial Eta-Squared

Partial Eta-Squared ($\eta_p^2$)

Partial eta-squared measures the proportion of variance explained by a factor after removing variance attributable to other factors and covariates:

$$\eta_p^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Effect}} + SS_{\text{Error}}}$$

For the treatment effect in ANCOVA:

$$\eta_p^2(\text{Group}) = \frac{SS_{\text{Group}|\text{Covariate}}}{SS_{\text{Group}|\text{Covariate}} + SS_{\text{Error}}}$$

Interpretation guidelines (Cohen, 1988):

  • $\eta_p^2 = 0.01$: Small effect
  • $\eta_p^2 = 0.06$: Medium effect
  • $\eta_p^2 = 0.14$: Large effect

Unlike regular $\eta^2$, partial $\eta^2$ values do not sum to 1.0 across effects (their total can exceed 1.0) because each denominator excludes the variance explained by the other effects.

Relationship Between $F$ and $\eta_p^2$

Partial eta-squared can be computed directly from the F-statistic:

$$\eta_p^2 = \frac{(g-1)F}{(g-1)F + df_{\text{Error}}} = \frac{F}{F + df_{\text{Error}}/(g-1)}$$

This relationship enables calculation of effect sizes from reported F-values and degrees of freedom.
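A small helper makes the conversion from a reported F-value explicit. The function name `partial_eta_squared` and the example values are hypothetical, chosen only to illustrate the formula.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared from an F statistic and its degrees of freedom."""
    return (df_effect * f_value) / (df_effect * f_value + df_error)


# Hypothetical reported result: F(2, 71) = 4.50 for a three-group factor
eta_p2 = partial_eta_squared(4.50, df_effect=2, df_error=71)
print(f"partial eta-squared = {eta_p2:.4f}")
```

Here `df_effect` plays the role of $g - 1$ for the group factor; passing 1 instead gives the corresponding value for a single covariate.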

Assumptions and Diagnostics

ANCOVA Assumptions

1. Linearity: The relationship between the covariate and dependent variable is linear within each group. Test through scatter plots and residual analysis.

2. Homogeneity of Regression Slopes: The slope of $Y$ on $X$ is the same across groups. Test via the Group × Covariate interaction:

$$F_{\text{interaction}} = \frac{MS_{\text{Group} \times X}}{MS_{\text{Error}}}$$

Non-significance ($p > 0.05$) supports the assumption.

3. Homoscedasticity: Error variances are equal across groups. Test using Levene's test on residuals.

4. Normality of Residuals: Residuals are normally distributed. Test using Shapiro-Wilk test, Q-Q plots.

5. Independence: Observations are independent. Random assignment of individual subjects supports this; clustered or repeated measurements violate it.

6. Covariate Independence: The covariate is not affected by the treatment (measured before treatment or from a separate source).

ANCOVA vs. Blocking vs. Regression

  • ANCOVA adjusts for a continuous covariate, treating it as a nuisance variable to reduce error variance
  • Blocking adjusts for a categorical nuisance variable (blocks), treating it as a fixed factor
  • Multiple regression $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots$ can incorporate both factors and covariates using dummy coding

ANCOVA is the special case where some predictors are categorical (factors) and others are continuous (covariates). The general linear model unifies all three approaches.

Python Implementation

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

# Generate ANCOVA data
np.random.seed(42)

n_per_group = 25
n_groups = 3

# Group-specific true effects
group_effects = {'A': 5.0, 'B': 8.0, 'C': 3.0}
beta_true = 0.75  # True common slope

# Generate covariate (pretest) - different distributions per group
covariate_means = {'A': 75, 'B': 65, 'C': 70}
data_list = []

for group, mean_x in covariate_means.items():
    X = np.random.normal(mean_x, 10, n_per_group)
    Y = 50 + group_effects[group] + beta_true * X + np.random.normal(0, 5, n_per_group)
    
    df = pd.DataFrame({
        'Pretest': X,
        'Posttest': Y,
        'Group': group
    })
    data_list.append(df)

data = pd.concat(data_list, ignore_index=True)

print("Data Summary:")
print(data.groupby('Group').agg(['mean', 'std', 'count']))

# ANCOVA model (assuming parallel slopes)
model_ancova = ols('Posttest ~ C(Group) + Pretest', data=data).fit()

# Test homogeneity of slopes: partial F-test of the interaction model against
# the parallel-slopes model (model.fvalue would be the overall model F instead)
model_interaction = ols('Posttest ~ C(Group) * Pretest', data=data).fit()
f_int, p_int, df_int = model_interaction.compare_f_test(model_ancova)
print("\nInteraction Model (Test Homogeneity of Slopes):")
print(f"Interaction F-statistic: {f_int:.3f} "
      f"(df = {int(df_int)}, {int(model_interaction.df_resid)})")
print(f"Interaction p-value: {p_int:.4f}")

print("\nANCOVA Model Summary:")
print(model_ancova.summary())

# Extract ANCOVA table
from statsmodels.stats.anova import anova_lm
ancova_table = anova_lm(model_ancova, typ=2)
print("\nANCOVA Table (Type II):")
print(ancova_table)

# Compute partial eta-squared
ss_group = ancova_table.loc['C(Group)', 'sum_sq']
ss_covariate = ancova_table.loc['Pretest', 'sum_sq']
ss_error = ancova_table.loc['Residual', 'sum_sq']

eta2_partial_group = ss_group / (ss_group + ss_error)
eta2_partial_covariate = ss_covariate / (ss_covariate + ss_error)

print(f"\nPartial eta-squared (Group): {eta2_partial_group:.4f}")
print(f"Partial eta-squared (Pretest): {eta2_partial_covariate:.4f}")

# Compute adjusted means
group_means = data.groupby('Group')['Posttest'].mean()
group_cov_means = data.groupby('Group')['Pretest'].mean()
overall_cov_mean = data['Pretest'].mean()
beta_hat = model_ancova.params['Pretest']

adjusted_means = {}
for group in ['A', 'B', 'C']:
    adj_mean = group_means[group] - beta_hat * (group_cov_means[group] - overall_cov_mean)
    adjusted_means[group] = adj_mean

print("\nUnadjusted vs Adjusted Means:")
print(f"{'Group':<8} {'Unadjusted':<12} {'Adjusted':<12} {'Pretest Mean':<14}")
for group in ['A', 'B', 'C']:
    print(f"{group:<8} {group_means[group]:<12.3f} {adjusted_means[group]:<12.3f} "
          f"{group_cov_means[group]:<14.1f}")

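# Standard error of adjusted means
# (the Type II table has no mean-square column, so take MS_Error from the fit)
ms_error = model_ancova.mse_resid
n_g = data.groupby('Group').size()

# Pooled within-group sum of squares of the covariate (SS_X,Error)
pretest_within = data['Pretest'] - data.groupby('Group')['Pretest'].transform('mean')
ss_x_error = (pretest_within ** 2).sum()

se_adjusted = {}
for group in ['A', 'B', 'C']:
    se = np.sqrt(ms_error * (1/n_g[group] +
                            (group_cov_means[group] - overall_cov_mean)**2 /
                            ss_x_error))
    se_adjusted[group] = se

# t critical value with the ANCOVA error degrees of freedom
t_crit = stats.t.ppf(0.975, model_ancova.df_resid)

print("\nAdjusted Means with 95% CI:")
for group in ['A', 'B', 'C']:
    ci_low = adjusted_means[group] - t_crit * se_adjusted[group]
    ci_high = adjusted_means[group] + t_crit * se_adjusted[group]
    print(f"  Group {group}: {adjusted_means[group]:.3f} [{ci_low:.3f}, {ci_high:.3f}]")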

# Assumption diagnostics
# 1. Linearity - scatter plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = {'A': 'blue', 'B': 'red', 'C': 'green'}

for i, group in enumerate(['A', 'B', 'C']):
    mask = data['Group'] == group
    axes[i].scatter(data[mask]['Pretest'], data[mask]['Posttest'], 
                   c=colors[group], alpha=0.6, s=50)
    
    # Add fitted ANCOVA line (reference group A has no dummy coefficient, so default to 0)
    x_range = np.linspace(data[mask]['Pretest'].min(), 
                         data[mask]['Pretest'].max(), 100)
    group_shift = model_ancova.params.get(f'C(Group)[T.{group}]', 0.0)
    y_pred = model_ancova.params['Intercept'] + group_shift + beta_hat * x_range
    axes[i].plot(x_range, y_pred, 'k-', linewidth=2)
    
    axes[i].set_xlabel('Pretest (Covariate)')
    axes[i].set_ylabel('Posttest')
    axes[i].set_title(f'Group {group}')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ancova_linearity.png', dpi=150)
plt.show()

# 2. Homogeneity of slopes visualization
fig, ax = plt.subplots(figsize=(8, 6))

for group in ['A', 'B', 'C']:
    mask = data['Group'] == group
    ax.scatter(data[mask]['Pretest'], data[mask]['Posttest'], 
              c=colors[group], alpha=0.6, label=group, s=50)
    
    # Separate regression lines
    z = np.polyfit(data[mask]['Pretest'], data[mask]['Posttest'], 1)
    x_line = np.linspace(data[mask]['Pretest'].min(), 
                        data[mask]['Pretest'].max(), 100)
    ax.plot(x_line, np.polyval(z, x_line), color=colors[group], 
           linestyle='--', linewidth=2)

# ANCOVA fitted line for the reference group (A); the other groups share
# this slope with shifted intercepts
x_common = np.linspace(data['Pretest'].min(), data['Pretest'].max(), 100)
y_common = model_ancova.params['Intercept'] + beta_hat * x_common
ax.plot(x_common, y_common, 'k-', linewidth=2, label='ANCOVA fit, group A (common slope)')

ax.set_xlabel('Pretest (Covariate)')
ax.set_ylabel('Posttest')
ax.set_title('Test of Homogeneity of Regression Slopes')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('ancova_slopes.png', dpi=150)
plt.show()

# 3. Residual diagnostics
residuals = model_ancova.resid
fitted = model_ancova.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Residuals vs Fitted
axes[0].scatter(fitted, residuals, alpha=0.6)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Normal Q-Q Plot')
axes[1].grid(True, alpha=0.3)

# Residuals by group (one box per group, drawn in a single call)
group_labels = ['A', 'B', 'C']
axes[2].boxplot([residuals[data['Group'] == g] for g in group_labels],
               positions=[0, 1, 2], widths=0.5)

axes[2].set_xticks([0, 1, 2])
axes[2].set_xticklabels(group_labels)
axes[2].set_xlabel('Group')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residuals by Group')
axes[2].axhline(0, color='red', linestyle='--')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ancova_diagnostics.png', dpi=150)
plt.show()

# Levene's test for homoscedasticity, applied to the ANCOVA residuals by group
levene_stat, levene_p = stats.levene(*[residuals[data['Group'] == g]
                                       for g in ['A', 'B', 'C']])
print(f"\nLevene's test for homoscedasticity: F = {levene_stat:.3f}, p = {levene_p:.4f}")

# Shapiro-Wilk test for normality of residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk test for normality: W = {shapiro_stat:.3f}, p = {shapiro_p:.4f}")

# Comparison: ANOVA vs ANCOVA
from scipy.stats import f_oneway

# Separate ANOVAs (one-way)
anova_groups = [data[data['Group'] == g]['Posttest'] for g in ['A', 'B', 'C']]
f_anova, p_anova = f_oneway(*anova_groups)

print("\nComparison: One-way ANOVA vs ANCOVA")
print(f"One-way ANOVA: F({n_groups-1}, {len(data)-n_groups}) = {f_anova:.3f}, p = {p_anova:.4f}")
print(f"ANCOVA: F({n_groups-1}, {len(data)-n_groups-1}) = {ancova_table.loc['C(Group)', 'F']:.3f}, "
      f"p = {ancova_table.loc['C(Group)', 'PR(>F)']:.4f}")
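# Error reduction: ANCOVA residual SS relative to the one-way ANOVA residual SS
ss_error_anova = ols('Posttest ~ C(Group)', data=data).fit().ssr
print(f"Error reduction: {(1 - ss_error/ss_error_anova)*100:.1f}%")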

Summary: Analysis of Covariance (ANCOVA)

  1. ANCOVA Model: $Y_{ij} = \mu + \tau_j + \beta(X_{ij} - \bar{X}) + \varepsilon_{ij}$ combines categorical factors with continuous covariates
  2. Adjusted Means: $\bar{Y}_{j,\text{adj}} = \bar{Y}_j - \hat{\beta}(\bar{X}_j - \bar{X})$ provide fair group comparisons when groups differ on the covariate
  3. Homogeneity of Slopes: Critical assumption tested by Group × Covariate interaction; violation requires separate slopes analysis
  4. Partial Eta-Squared: $\eta_p^2 = SS_{\text{Effect}}/(SS_{\text{Effect}} + SS_{\text{Error}})$ measures effect size after removing covariate variance
  5. Error Reduction: ANCOVA reduces $MS_{\text{Error}}$ by removing covariate-related variance, increasing power to detect group differences
  6. Key Assumption: Covariate must be independent of treatment (measured before treatment) to avoid bias from treatment effects on the covariate