Pearson Correlation

The Gold Standard for Measuring Linear Association

Pearson's r measures the strength and direction of the linear relationship between two continuous variables. It is among the most widely used correlation coefficients in statistics.

Bounded between -1 and +1: easy to interpret at a glance
Unitless: compare correlations across completely different scales
Hypothesis testing: test whether an observed correlation could have arisen when there is no underlying relationship
Causation warning: correlation alone never proves causation

Pearson's r is powerful, but it only measures linear relationships. Always visualize the data before you calculate it.

What is Pearson Correlation?

Definition

Pearson's $r$ measures the strength and direction of the linear relationship between two continuous variables.

Pearson Correlation Coefficient

The Pearson product-moment correlation coefficient between random variables $X$ and $Y$ is:

$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

The sample version is:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$

Pearson Correlation Formula (Computational)

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

Here,

$r$ = sample Pearson correlation coefficient
$n$ = sample size
$x_i, y_i$ = individual data points
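The two sample formulas above are algebraically equivalent, which a few lines of Python can confirm. This is a minimal sketch using only the standard library; the function names `pearson_r` and `pearson_r_computational` and the five-point dataset are illustrative assumptions, not part of the original text.

```python
import math

def pearson_r(x, y):
    """Sample Pearson r from the deviation (definitional) form."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_computational(x, y):
    """Sample Pearson r from the computational (raw sums) form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_def = pearson_r(x, y)
r_comp = pearson_r_computational(x, y)
print(f"definitional:  {r_def:.6f}")
print(f"computational: {r_comp:.6f}")
```

Both forms give the same value here. With large values or nearly constant data the computational form can lose precision through cancellation, so the deviation form is usually the safer choice in floating-point code.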

Properties of $r$

Properties of Pearson's r

Bounded: $-1 \leq r \leq +1$
Symmetric: $\text{corr}(X, Y) = \text{corr}(Y, X)$
Dimensionless: $r$ is unitless and unchanged by shifting or positively rescaling either variable (multiplying one variable by a negative constant flips its sign)
$r = +1$ : Perfect positive linear relationship: $Y = a + bX$ with $b > 0$
$r = -1$ : Perfect negative linear relationship: $Y = a + bX$ with $b < 0$
$r = 0$ : No linear relationship (a non-linear relationship may still exist)
$r$ equals the slope of the standardized regression: $r = \hat{\beta}_1$ when both variables are standardized
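The symmetry and scale properties listed above can be checked numerically. The sketch below is only an illustration under assumed data: the temperature and sales figures are invented, and `pearson_r` is a helper defined here rather than a library function.

```python
import math

def pearson_r(x, y):
    """Sample Pearson r from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: temperature in Celsius and units sold
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
sales = [12.0, 18.0, 21.0, 30.0, 33.0]

# Shift plus positive rescaling: Celsius to Fahrenheit
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

r_celsius = pearson_r(celsius, sales)
r_fahrenheit = pearson_r(fahrenheit, sales)            # same as r_celsius
r_negated = pearson_r([-c for c in celsius], sales)    # sign flips
r_swapped = pearson_r(sales, celsius)                  # symmetric in X and Y

print(r_celsius, r_fahrenheit, r_negated, r_swapped)
```

Changing units leaves $r$ untouched, negating one variable reverses its sign, and swapping the arguments makes no difference.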

r as a Regression Slope

When both $X$ and $Y$ are standardized (z-scores), the regression line is $Z_Y = r \cdot Z_X$. This provides a direct connection between correlation and simple linear regression.
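To see the link with regression concretely, one can standardize both variables, fit an ordinary least-squares line, and compare the slope with $r$. The following is a rough sketch; the helper names `zscores` and `ols_slope_intercept` and the data are assumptions made for illustration.

```python
import math
import statistics

def zscores(values):
    """Standardize values using the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # n - 1 denominator
    return [(v - m) / s for v in values]

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(x, y):
    """Sample Pearson r from deviations about the means."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
slope_raw, intercept_raw = ols_slope_intercept(x, y)
slope_std, intercept_std = ols_slope_intercept(zscores(x), zscores(y))

print(f"r               = {r:.6f}")
print(f"raw slope       = {slope_raw:.6f}")
print(f"standardized slope = {slope_std:.6f}, intercept = {intercept_std:.6f}")
```

The raw slope depends on the units of $X$ and $Y$, while the standardized slope matches $r$ and the standardized intercept is zero up to rounding.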

Coefficient of Determination

R-Squared

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

Here,

$r^2$ = coefficient of determination
$SS_{\text{reg}}$ = regression (explained) sum of squares
$SS_{\text{res}}$ = residual sum of squares
$SS_{\text{tot}}$ = total sum of squares

$r^2$ is the proportion of variance in $Y$ explained by the linear relationship with $X$.
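The decomposition can be verified by fitting a least-squares line and computing each sum of squares directly. This is a small sketch on invented data; the variable names are illustrative, and the identity holds for simple linear regression with an intercept.

```python
import math

# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Least-squares fit of y on x
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * a for a in x]

ss_tot = syy                                            # total variation
ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))   # unexplained
ss_reg = sum((f - mean_y) ** 2 for f in fitted)         # explained

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(f"r^2               = {r_squared:.4f}")
print(f"SS_reg / SS_tot   = {ss_reg / ss_tot:.4f}")
print(f"1 - SS_res/SS_tot = {1 - ss_res / ss_tot:.4f}")
```

All three quantities agree, and `ss_reg + ss_res` reproduces `ss_tot`.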

Hypothesis Testing for Correlation

Testing $H_0: \rho = 0$

Under $H_0: \rho = 0$ (no linear correlation in the population), and assuming $(X, Y)$ is bivariate normal:

$$t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} \sim t_{n-2}$$

This $t$-test has $n - 2$ degrees of freedom (two parameters are estimated: the intercept and the slope).
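The test statistic is simple to compute by hand or in code. The sketch below uses only the standard library, so it compares $|t|$ against a critical value rather than computing a p-value; the function name `correlation_t_test` is an assumption, and the critical value 3.182 (two-sided, 5% level, 3 degrees of freedom) is taken from a standard $t$ table.

```python
import math

def correlation_t_test(r, n):
    """Return (t statistic, degrees of freedom) for H0: rho = 0."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    t_stat = r / math.sqrt((1.0 - r ** 2) / df)
    return t_stat, df

# Hypothetical example: r of about 0.775 observed in a sample of 5 pairs
r = math.sqrt(0.6)
n = 5

t_stat, df = correlation_t_test(r, n)

# Two-sided 5% critical value for 3 degrees of freedom (standard t table)
T_CRIT_DF3 = 3.182
reject_h0 = abs(t_stat) > T_CRIT_DF3

print(f"t = {t_stat:.4f}, df = {df}, reject H0 at 5%: {reject_h0}")
```

Even a fairly large sample correlation is not significant here because the sample is so small, which is a useful reminder that the size of $r$ and the strength of evidence against $H_0$ are different things.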

Confidence Interval for $\rho$ (Fisher z-transformation)

$$z_r = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \operatorname{arctanh}(r)$$

Here,

$z_r$ = Fisher z-transformed correlation
$r$ = sample correlation

Fisher's $z$-transformation converts $r$ (whose sampling distribution is skewed when $\rho \neq 0$) to an approximately normal variable with standard error $1/\sqrt{n-3}$, which gives an approximate confidence interval on the $z$ scale:

$$z_r \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}$$

Transform both endpoints back to the correlation scale with $r = \tanh(z_r)$ to obtain the interval for $\rho$.
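The three steps (transform, build the interval, transform back) can be sketched as follows. The function name `fisher_confidence_interval` and the example values $r = 0.5$, $n = 103$ are assumptions chosen for illustration; the normal quantile comes from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def fisher_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for rho via Fisher's z."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z_r = math.atanh(r)                       # step 1: transform r
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit / math.sqrt(n - 3)    # step 2: interval on z scale
    lower = math.tanh(z_r - half_width)       # step 3: transform back
    upper = math.tanh(z_r + half_width)
    return lower, upper

# Hypothetical example: r = 0.5 from n = 103 pairs
lower, upper = fisher_confidence_interval(0.5, 103)
print(f"95% CI for rho: ({lower:.3f}, {upper:.3f})")
```

The resulting interval is not symmetric about $r$: it extends further toward zero than toward one, which reflects the skew that the transformation is designed to handle.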

Interpretation Guidelines

$|r|$	Interpretation
$0.00 - 0.19$	Very weak
$0.20 - 0.39$	Weak
$0.40 - 0.59$	Moderate
$0.60 - 0.79$	Strong
$0.80 - 1.00$	Very strong
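If these labels are needed in code, the table can be turned into a small lookup. This is a sketch only: the function name `describe_strength` is hypothetical, and the cut-offs at 0.20, 0.40, 0.60 and 0.80 are one reading of the table that closes the gaps between its rounded ranges.

```python
def describe_strength(r):
    """Map a correlation to the verbal label from the guideline table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    magnitude = abs(r)  # the label ignores the direction of the relationship
    if magnitude < 0.20:
        return "Very weak"
    if magnitude < 0.40:
        return "Weak"
    if magnitude < 0.60:
        return "Moderate"
    if magnitude < 0.80:
        return "Strong"
    return "Very strong"

for value in (0.10, -0.35, 0.50, -0.65, 0.85):
    print(f"r = {value:+.2f}: {describe_strength(value)}")
```

Such labels are conventions rather than statistical results, so what counts as strong still depends on the field and the question being asked.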

Correlation ≠ Causation

A high $r$ does not imply that $X$ causes $Y$. The relationship may be:

Confounded: a third variable $Z$ drives both $X$ and $Y$
Reverse: $Y$ causes $X$, not the other way around
Coincidental: a spurious correlation that arises by chance, especially in small samples
Non-causal: an artifact of mathematical coupling or selection bias

Only controlled experiments or causal inference methods can establish causation.

Assumptions

Assumption	What It Means	How to Check
Linearity	The relationship between $X$ and $Y$ is linear	Scatter plot
Continuous variables	Both $X$ and $Y$ are measured on interval or ratio scales	Variable type
Bivariate normality	$(X, Y)$ follow a bivariate normal distribution	Q-Q plots; Shapiro–Wilk
Homoscedasticity	Variance of residuals is constant	Residual plot
No significant outliers	Outliers can distort $r$ dramatically	Scatter plot; influence analysis

Limitations

Pearson's r Misses Non-Linear Relationships

Pearson's $r$ measures linear association only. A dataset can have $r = 0$ yet show a strong non-linear relationship (e.g., $Y = X^2$ on a symmetric range). Always visualize the data before relying on $r$.
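The $Y = X^2$ case is easy to reproduce. In this minimal sketch, `pearson_r` is a helper defined here and the integer grid from -3 to 3 is an arbitrary symmetric range chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson r from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Y is a deterministic function of X, but the relationship is not linear
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

The positive and negative halves of the parabola cancel exactly in the cross-product sum, so $r$ is zero even though $Y$ is fully determined by $X$.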

When assumptions are violated, use:

Spearman's rank correlation ($r_s$): for monotonic (not necessarily linear) relationships
Kendall's $\tau$: for ordinal data or small samples
Distance correlation: detects any form of dependence, not just linear
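As an illustration of the first alternative, Spearman's $r_s$ can be computed as Pearson's $r$ applied to the ranks of the data, with tied values sharing their average rank. The sketch below is a plain-Python illustration with hypothetical helper names (`average_ranks`, `spearman_r`) and invented data; it is not a substitute for a tested statistics library.

```python
import math

def pearson_r(x, y):
    """Sample Pearson r from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's rank correlation as Pearson's r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotonic but non-linear relationship: y = x cubed
x = [1, 2, 3, 4, 5]
y = [value ** 3 for value in x]

r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(f"Pearson r   = {r_pearson:.4f}")
print(f"Spearman rs = {r_spearman:.4f}")
```

Because the relationship is perfectly monotonic, the ranks line up exactly and Spearman's coefficient reaches 1, while Pearson's $r$ falls short of 1 because the points do not lie on a straight line.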