Pearson Correlation — r Coefficient Formula and Testing

The Gold Standard for Measuring Linear Association

Pearson's r measures the strength and direction of the linear relationship between two continuous variables. It is the most widely used correlation coefficient in all of statistics.

  • Bounded between -1 and +1: easy to interpret at a glance
  • Unitless: compare correlations across completely different scales
  • Hypothesis testing: test whether an observed correlation could have arisen when there is no relationship in the population
  • Causation warning: correlation alone never proves causation

Pearson r is powerful, but it only measures linear relationships. Always visualize before you calculate.


What is Pearson Correlation?

Definition

Pearson's $r$ measures the strength and direction of the linear relationship between two continuous variables.

Pearson Correlation Coefficient

The Pearson product-moment correlation coefficient between random variables $X$ and $Y$ is:

$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

The sample version is:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$

Pearson Correlation Formula (Computational)

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - (\sum x_i)^2\right]\left[n\sum y_i^2 - (\sum y_i)^2\right]}}$$

Here,

  • $r$ = Sample Pearson correlation coefficient
  • $n$ = Sample size
  • $x_i, y_i$ = Individual data points
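
As a quick check that the definitional and computational formulas agree, here is a minimal sketch in Python. The six data points are made up for illustration and are not from the lesson; `np.corrcoef` is used only as a reference value.

```python
import numpy as np

# Small illustrative dataset (made-up values, not from the lesson)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = len(x)

# Definitional form: sum of cross-deviations over the product of
# the square roots of the two sums of squared deviations
dx, dy = x - x.mean(), y - y.mean()
r_def = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

# Computational form: uses only raw sums of x, y, x*y, x^2, y^2
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
              (n * np.sum(y**2) - np.sum(y)**2))
r_comp = num / den

# Reference value from NumPy's correlation matrix
r_np = np.corrcoef(x, y)[0, 1]

print(r_def, r_comp, r_np)
```

The three values should agree up to floating-point rounding. The computational form avoids a separate pass to compute the means, but it can lose precision when the values are large relative to their spread, so the definitional form is usually the safer one to code by hand.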

Properties of $r$

  1. Bounded: $-1 \leq r \leq +1$
  2. Symmetric: $\text{corr}(X, Y) = \text{corr}(Y, X)$
  3. Dimensionless: $r$ is unitless and unchanged by shifting either variable or rescaling it by a positive constant (multiplying one variable by a negative constant flips the sign of $r$)
  4. $r = +1$: Perfect positive linear relationship: $Y = a + bX$ with $b > 0$
  5. $r = -1$: Perfect negative linear relationship: $Y = a + bX$ with $b < 0$
  6. $r = 0$: No linear relationship (but the relationship may be non-linear)
  7. $r$ equals the slope of the standardized regression: $r = \hat{\beta}_1$ when both variables are standardized

r as a Regression Slope

When both $X$ and $Y$ are standardized (z-scores), the regression line is $Z_Y = r \cdot Z_X$. This provides a direct connection between correlation and simple linear regression.
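
The properties above can be checked numerically. The sketch below uses simulated data (the seed, sample size, and coefficients are arbitrary assumptions) and a small hypothetical helper `pearson_r` that wraps `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)

def pearson_r(a, b):
    """Sample Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r = pearson_r(x, y)

# Symmetry: swapping the arguments gives the same value
r_swapped = pearson_r(y, x)

# Shifting and positive rescaling (e.g. a change of units) leaves r unchanged
r_rescaled = pearson_r(3.0 * x + 10.0, 0.5 * y - 2.0)

# Multiplying one variable by a negative constant flips the sign
r_flipped = pearson_r(-2.0 * x, y)

# Regress standardized y on standardized x: the fitted slope equals r
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
slope, intercept = np.polyfit(zx, zy, deg=1)

print(r, r_swapped, r_rescaled, r_flipped, slope, intercept)
```

The intercept of the standardized regression is zero up to rounding, because both z-score variables have mean zero.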


Coefficient of Determination

R-Squared

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

Here,

  • $r^2$ = Coefficient of determination
  • $SS_{\text{reg}}$ = Regression sum of squares
  • $SS_{\text{res}}$ = Residual sum of squares
  • $SS_{\text{tot}}$ = Total sum of squares

In simple linear regression, $r^2$ is the proportion of variance in $Y$ explained by the linear relationship with $X$.
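
To see the sums-of-squares identity in action, here is a short sketch that fits a least-squares line with `np.polyfit` and compares $r^2$ with $1 - SS_{\text{res}}/SS_{\text{tot}}$. The simulated data and its parameters are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=100)

# Least-squares line y_hat = intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_tot = np.sum((y - y.mean())**2)      # total variation
ss_res = np.sum((y - y_hat)**2)         # unexplained (residual) variation
ss_reg = np.sum((y_hat - y.mean())**2)  # explained variation

r = np.corrcoef(x, y)[0, 1]
r2_from_ss = 1 - ss_res / ss_tot

print(r**2, r2_from_ss, ss_reg / ss_tot)
```

All three printed values should match up to rounding, which also confirms that $SS_{\text{reg}} + SS_{\text{res}} = SS_{\text{tot}}$ for a least-squares line with an intercept.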


Hypothesis Testing for Correlation

Testing $H_0: \rho = 0$

Under $H_0: \rho = 0$ (no linear correlation in the population):

$$t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} \sim t_{n-2}$$

This $t$-test has $n - 2$ degrees of freedom (two parameters estimated: the intercept and slope).
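
The test statistic is easy to compute by hand. The sketch below does so on simulated data (seed, sample size, and effect size are arbitrary assumptions), derives a two-sided p-value from the $t_{n-2}$ distribution, and compares the result with `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = r / np.sqrt((1 - r**2) / (n - 2))

# Two-sided p-value: twice the upper-tail probability of |t|
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# SciPy's built-in test, used here as a reference
r_ref, p_ref = stats.pearsonr(x, y)

print(f"r = {r:.4f}, t = {t_stat:.4f}, p = {p_value:.6f}")
print(f"scipy: r = {r_ref:.4f}, p = {p_ref:.6f}")
```

The hand-computed p-value and the SciPy p-value should agree to many decimal places, since SciPy's default two-sided test is equivalent to this $t$-test.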

Confidence Interval for $\rho$ (Fisher z-transformation)

$$z_r = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \text{arctanh}(r)$$

Here,

  • $z_r$ = Fisher z-transformed correlation
  • $r$ = Sample correlation

Fisher's $z$-transformation converts $r$ (which has a skewed sampling distribution for $\rho \neq 0$) to an approximately normal variable with standard error $1/\sqrt{n-3}$, enabling approximate confidence intervals on the $z$ scale:

$$z_r \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}$$

Transform each endpoint back to the correlation scale: $r = \tanh(z_r)$.
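
The three steps (transform, build the interval on the $z$ scale, transform back) can be sketched as follows. The function name `fisher_ci` and the example inputs $r = 0.60$, $n = 50$ are hypothetical, chosen only to illustrate the procedure.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for rho via Fisher's z."""
    z_r = math.atanh(r)                            # step 1: transform r
    se = 1.0 / math.sqrt(n - 3)                    # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. about 1.96 for 95%
    lo, hi = z_r - z_crit * se, z_r + z_crit * se  # step 2: interval for z
    return math.tanh(lo), math.tanh(hi)            # step 3: back-transform

lower, upper = fisher_ci(r=0.60, n=50)
print(f"95% CI for rho: ({lower:.3f}, {upper:.3f})")
```

The resulting interval is not symmetric around $r$: the upper endpoint sits closer to 0.60 than the lower one, reflecting the skew of the sampling distribution near the boundary at 1.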


Interpretation Guidelines

| $\lvert r \rvert$ | Interpretation |
|---|---|
| 0.00–0.19 | Very weak |
| 0.20–0.39 | Weak |
| 0.40–0.59 | Moderate |
| 0.60–0.79 | Strong |
| 0.80–1.00 | Very strong |
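
If these labels are needed in a report, the table can be encoded as a small lookup. The helper `strength_label` below is hypothetical, and it assumes each band is half-open (for example, 0.195 counts as "Very weak"), since the table does not say how values between 0.19 and 0.20 should be treated.

```python
def strength_label(r):
    """Map a correlation to the verbal label from the guideline table."""
    a = abs(r)  # strength depends on magnitude only, not on sign
    if a > 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    if a < 0.20:
        return "Very weak"
    if a < 0.40:
        return "Weak"
    if a < 0.60:
        return "Moderate"
    if a < 0.80:
        return "Strong"
    return "Very strong"

labels = [strength_label(r) for r in (-0.85, 0.45, 0.10)]
print(labels)
```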

Correlation ≠ Causation

A large $|r|$ does not imply that $X$ causes $Y$. The relationship may be:

  1. Confounded: a third variable $Z$ drives both $X$ and $Y$
  2. Reverse causation: $Y$ causes $X$, not the other way around
  3. Coincidental: spurious correlation in small samples
  4. Non-causal: mathematical coupling or selection bias

Only controlled experiments or causal inference methods can establish causation.
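
The confounding case can be illustrated with a simulation. In this sketch (all parameters are assumptions), $X$ and $Y$ each depend only on a shared variable $Z$ and have no effect on each other, yet they are strongly correlated. Correlating the residuals after regressing each on $Z$, which is one simple way to adjust for a measured confounder, makes the association disappear.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=n)                  # confounder
x = z + rng.normal(scale=0.5, size=n)   # X depends on Z only
y = z + rng.normal(scale=0.5, size=n)   # Y depends on Z only

# Raw correlation between X and Y is high despite no causal link
r_xy = np.corrcoef(x, y)[0, 1]

def residualize(v, z):
    """Remove the least-squares linear effect of z from v."""
    slope, intercept = np.polyfit(z, v, deg=1)
    return v - (intercept + slope * z)

# Correlation of what is left after removing Z from both variables
r_xy_given_z = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"r(X, Y) = {r_xy:.3f}, after adjusting for Z = {r_xy_given_z:.3f}")
```

This only works when the confounder is known and measured; it does not by itself establish or rule out causation.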


Assumptions

| Assumption | What It Means | How to Check |
|---|---|---|
| Linearity | The relationship between $X$ and $Y$ is linear | Scatter plot |
| Continuous variables | Both $X$ and $Y$ are measured on interval or ratio scales | Variable type |
| Bivariate normality | $(X, Y)$ follow a bivariate normal distribution | Q-Q plots; Shapiro–Wilk |
| Homoscedasticity | Variance of residuals is constant | Residual plot |
| No significant outliers | Outliers can distort $r$ dramatically | Scatter plot; influence analysis |
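
The effect of outliers can be dramatic, as the following sketch suggests. The eight base points are made up so that their correlation is exactly zero; a single extreme point at (20, 20) is then appended.

```python
import numpy as np

# Eight made-up points constructed to have zero correlation
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([4, 6, 3, 5, 5, 3, 6, 4], dtype=float)
r_without = np.corrcoef(x, y)[0, 1]

# Append one extreme point far from the rest of the data
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier = {r_without:.3f}")
print(f"r with outlier    = {r_with:.3f}")
```

One point out of nine is enough to move $r$ from zero to the "very strong" band, which is why a scatter plot should come before the calculation.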

Limitations

Pearson's r Misses Non-Linear Relationships

Pearson's $r$ measures linear association only. A dataset can have $r = 0$ yet have a strong non-linear relationship (e.g., $Y = X^2$ on a symmetric range). Always visualize the data before relying on $r$.
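
The $Y = X^2$ example can be verified directly. In this sketch the grid of $x$ values is an arbitrary choice, symmetric about zero.

```python
import numpy as np

# x is symmetric about zero, and y is a deterministic function of x
x = np.linspace(-3, 3, 61)
y = x**2

r_full = np.corrcoef(x, y)[0, 1]

# On the positive half only, the same relationship is increasing,
# so the linear correlation is strong
mask = x > 0
r_positive_half = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"r on [-3, 3] = {r_full:.6f}")
print(f"r on (0, 3]  = {r_positive_half:.3f}")
```

On the full range $r$ is zero up to rounding even though $Y$ is fully determined by $X$; the positive and negative halves cancel.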

When assumptions are violated, use:

  • Spearman's rank correlation ($r_s$): for monotonic (not necessarily linear) relationships
  • Kendall's $\tau$: for ordinal data or small samples
  • Distance correlation: detects any dependence, not just linear
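
To illustrate the first two alternatives, the sketch below compares the three coefficients on a monotonic but strongly curved relationship, $y = e^x$. The data are synthetic and noise-free by assumption. Distance correlation is not included because it is not part of SciPy's `stats` module.

```python
import numpy as np
from scipy import stats

# Monotonic, strongly non-linear relationship
x = np.linspace(0, 5, 50)
y = np.exp(x)

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)
tau_kendall, _ = stats.kendalltau(x, y)

print(f"Pearson r    = {r_pearson:.3f}")
print(f"Spearman r_s = {rho_spearman:.3f}")
print(f"Kendall tau  = {tau_kendall:.3f}")
```

Both rank-based coefficients equal 1 because the ordering of $y$ matches the ordering of $x$ exactly, while Pearson's $r$ falls short of 1 because the points do not lie on a straight line.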

Pearson Correlation in Machine Learning

| ML Application | Correlation Usage | Why |
|---|---|---|
| Feature selection | High correlation with target → important | Identify predictive features |
| Multicollinearity | High correlation between features → remove | Stability of linear models |
| Feature engineering | Create interaction features | Correlated pairs suggest interactions |
| Data validation | Check feature-target relationship | Sanity check before modeling |

A minimal example of the first two uses on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Synthetic regression data: 5 features and a continuous target
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
df = pd.DataFrame(X, columns=[f"X{i}" for i in range(5)])
df["target"] = y

# Pearson correlation of each feature with the target
corr_with_target = df.corr()["target"].drop("target")
print("Correlation with target:")
print(corr_with_target.sort_values(ascending=False).round(3))

# Flag highly correlated feature pairs (candidates for removal)
corr_matrix = df.drop(columns="target").corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr = [(i, j) for i in upper.index for j in upper.columns
             if upper.loc[i, j] > 0.8]
print(f"\nHighly correlated feature pairs (|r| > 0.8): {high_corr if high_corr else 'None'}")
```

Key Takeaways

Summary: Pearson Correlation

  • $r$ ranges from $-1$ to $+1$: sign indicates direction, magnitude indicates strength
  • $r$ measures linear association only: non-linear relationships may have $r \approx 0$
  • Correlation $\neq$ causation: a strong $r$ does not imply one variable causes the other
  • $r^2$ is the coefficient of determination: the proportion of variance in $Y$ explained by the linear relationship with $X$
  • Fisher's $z$-transformation gives approximate confidence intervals for $\rho$; it is needed because the sampling distribution of $r$ is skewed when $\rho \neq 0$
  • Always visualize: scatterplots reveal patterns (curves, outliers, clusters) that $r$ alone misses
  • For monotonic but non-linear relationships, use Spearman's $r_s$ or Kendall's $\tau$ instead
