Pearson Correlation
The Gold Standard for Measuring Linear Association
Pearson's r measures the strength and direction of the linear relationship between two continuous variables. It is the most widely used correlation coefficient in all of statistics.
- Bounded between -1 and +1: easy to interpret at a glance
- Unitless: compare correlations across completely different scales
- Hypothesis testing: test whether an observed correlation could have arisen when there is no relationship in the population
- Causation warning: correlation alone never proves causation
Pearson r is powerful, but it only measures linear relationships. Always visualize before you calculate.
What is Pearson Correlation?
Definition
Pearson's $r$ measures the strength and direction of the linear relationship between two continuous variables.
Definition: Pearson Correlation Coefficient
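The Pearson product-moment correlation coefficient between random variables $X$ and $Y$ is:
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
The sample version is:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Pearson Correlation Formula (Computational)
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2} \, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$
Here,
- $r$ = Sample Pearson correlation coefficient
- $n$ = Sample size
- $x_i, y_i$ = Individual data points
- $\bar{x}, \bar{y}$ = Sample means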
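As a quick numerical check, the deviation-score form and the raw-sums (computational) form of the sample coefficient can be compared against NumPy's built-in `np.corrcoef`. This is a minimal sketch: the helper names `pearson_r` and `pearson_r_computational` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson r from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

def pearson_r_computational(x, y):
    """Same quantity computed from raw sums (no explicit means)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = (np.sqrt(n * np.sum(x**2) - np.sum(x)**2)
           * np.sqrt(n * np.sum(y**2) - np.sum(y)**2))
    return num / den

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_def = pearson_r(x, y)
r_comp = pearson_r_computational(x, y)
r_np = np.corrcoef(x, y)[0, 1]
print(round(r_def, 4), round(r_comp, 4), round(r_np, 4))
```

All three values agree (about 0.775 for this toy data). The raw-sums form is convenient by hand but can lose precision in floating point when the means are large relative to the spread, so the deviation form is usually the safer choice in code.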
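Properties of $r$
Theorem: Properties of Pearson's r
- Bounded: $-1 \le r \le 1$
- Symmetric: $r_{XY} = r_{YX}$
- Dimensionless: $r$ is unitless and invariant to positive linear transformations of either variable (rescaling one variable by a negative factor flips the sign)
- $r = 1$: Perfect positive linear relationship: $Y = a + bX$ with $b > 0$
- $r = -1$: Perfect negative linear relationship: $Y = a + bX$ with $b < 0$
- $r = 0$: No linear relationship (but the relationship may be non-linear)
- $r$ equals the slope of the standardized regression: $\hat{z}_Y = r \, z_X$ when both variables are standardized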
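The properties listed above can be checked numerically. The sketch below uses simulated data (the seed, sample size, and coefficients are arbitrary assumptions) and a small hypothetical helper `corr` that wraps `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)

def corr(a, b):
    """Pearson r between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r_xy = corr(x, y)
r_yx = corr(y, x)                                 # symmetry
r_scaled = corr(3.0 * x + 10.0, 0.5 * y - 2.0)    # positive rescaling and shifts
r_flipped = corr(-2.0 * x + 1.0, y)               # negative scale factor on x
r_perfect = corr(x, 4.0 * x - 1.0)                # exact line with positive slope

print(f"r(x, y) = {r_xy:.4f}, r(y, x) = {r_yx:.4f}")
print(f"after positive linear transforms: {r_scaled:.4f}")
print(f"after negating x: {r_flipped:.4f}")
print(f"exact positive line: {r_perfect:.4f}")
```

Swapping the arguments or applying positive linear transformations leaves the value unchanged, negating one variable flips the sign, and an exact line with positive slope gives 1 up to floating-point error.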
r as a Regression Slope
When both $X$ and $Y$ are standardized (z-scores), the regression line is $\hat{z}_Y = r \, z_X$. This provides a direct connection between correlation and simple linear regression.
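The link between $r$ and the regression slope is easy to verify: standardize both variables, fit a least-squares line, and compare the fitted slope with the correlation. The data below are simulated under arbitrary assumed parameters, and `np.polyfit` stands in for any ordinary least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=200)
y = 3.0 + 0.8 * x + rng.normal(scale=8, size=200)

# Convert both variables to z-scores (sample standard deviation, ddof=1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Least-squares line through the standardized data
slope, intercept = np.polyfit(zx, zy, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"standardized slope = {slope:.6f}")
print(f"Pearson r          = {r:.6f}")
print(f"intercept          = {intercept:.2e}")
```

The slope matches $r$ and the intercept is zero up to floating-point error, because standardized variables have mean 0 and equal variances.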
Coefficient of Determination
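R-Squared
$$R^2 = r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}}$$
Here,
- $R^2$ = Coefficient of determination
- $\mathrm{SSR}$ = Regression sum of squares
- $\mathrm{SST}$ = Total sum of squares
$R^2$ is the proportion of variance in $Y$ explained by the linear relationship with $X$. In simple linear regression it equals the square of Pearson's $r$.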
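The identity between the squared correlation and the ratio of regression to total sum of squares can be confirmed on simulated data. This is a sketch for simple (one-predictor) linear regression only; the data-generating parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=150)
y = 2.0 + 1.5 * x + rng.normal(scale=3.0, size=150)

# Fit y = b0 + b1 * x by ordinary least squares
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)   # regression (explained) sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r_squared = ssr / sst

r = np.corrcoef(x, y)[0, 1]
print(f"SSR / SST = {r_squared:.6f}")
print(f"r ** 2    = {r ** 2:.6f}")
```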
Hypothesis Testing for Correlation
Theorem: Testing $H_0: \rho = 0$
Under $H_0: \rho = 0$ (no linear correlation in the population):
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \sim t_{n-2}$$
This $t$-test has $n - 2$ degrees of freedom (two parameters estimated: the intercept and slope).
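The test statistic can be computed by hand and compared with `scipy.stats.pearsonr`, which reports a two-sided p-value for the same null hypothesis. The sketch below assumes SciPy is installed and uses simulated data with an arbitrary seed and effect size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]

# t statistic with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# SciPy's built-in test for comparison
r_scipy, p_scipy = stats.pearsonr(x, y)

print(f"r = {r:.4f}, t = {t_stat:.4f}")
print(f"p (manual) = {p_manual:.6f}, p (scipy) = {p_scipy:.6f}")
```

The two p-values agree because SciPy's exact distribution of $r$ under the null is equivalent to this $t$-test.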
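Confidence Interval for $\rho$ (Fisher z-transformation)
$$z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) = \operatorname{arctanh}(r)$$
Here,
- $z$ = Fisher z-transformed correlation
- $r$ = Sample correlation
Fisher's $z$-transformation converts $r$ (which has a skewed distribution for $\rho \neq 0$) to an approximately normal variable with standard error $1/\sqrt{n - 3}$, enabling valid confidence intervals:
$$z \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{n - 3}}$$
Transform back: $r = \tanh(z) = \dfrac{e^{2z} - 1}{e^{2z} + 1}$.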
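The Fisher procedure (transform, build a normal interval, transform back) can be sketched in a few lines. The function name `fisher_ci` and the example inputs `r=0.6`, `n=50` are hypothetical; the standard error $1/\sqrt{n-3}$ is the usual large-sample approximation.

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for rho via Fisher's z-transformation."""
    z = np.arctanh(r)                       # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)               # approximate standard error of z
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)         # back-transform to the r scale

lower, upper = fisher_ci(r=0.6, n=50)
print(f"95% CI for rho: ({lower:.3f}, {upper:.3f})")
```

For these inputs the interval is roughly 0.39 to 0.75. It is not symmetric around 0.6, which reflects the skewed sampling distribution of $r$.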
Interpretation Guidelines
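| Interpretation | $\lvert r \rvert$ (common rule of thumb; thresholds vary by field) |
|---|---|
| Very weak | 0.00 to 0.19 |
| Weak | 0.20 to 0.39 |
| Moderate | 0.40 to 0.59 |
| Strong | 0.60 to 0.79 |
| Very strong | 0.80 to 1.00 |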
Correlation ≠ Causation
A high $r$ does not imply that $X$ causes $Y$. The relationship may be:
- Confounded: a third variable drives both $X$ and $Y$
- Reverse: $Y$ causes $X$, not the other way around
- Coincidental: spurious correlation in small samples
- Non-causal: mathematical coupling or selection bias
Only controlled experiments or causal inference methods can establish causation.
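Confounding is easy to simulate. In the sketch below a hypothetical third variable `z` drives both `x` and `y`, which have no direct effect on each other. The raw correlation is sizeable, but it largely vanishes once `z` is regressed out of both variables (a simple partial correlation). All parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)        # hypothetical confounder
x = z + rng.normal(size=n)    # x depends on z only
y = z + rng.normal(size=n)    # y depends on z only, not on x

r_xy = np.corrcoef(x, y)[0, 1]

def residuals(v, z):
    """Residuals of v after a least-squares regression on z."""
    slope, intercept = np.polyfit(z, v, deg=1)
    return v - (intercept + slope * z)

# Partial correlation of x and y, controlling for z
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"raw correlation r(x, y)     = {r_xy:.3f}")
print(f"partial correlation given z = {r_partial:.3f}")
```

Under this setup the population correlation between `x` and `y` is 0.5 even though neither causes the other. Controlling for a confounder only helps when the confounder has been measured, which is why experiments or explicit causal inference methods are needed.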
Assumptions
| Assumption | What It Means | How to Check |
|---|---|---|
| Linearity | The relationship between $X$ and $Y$ is linear | Scatter plot |
| Continuous variables | Both $X$ and $Y$ are measured on interval or ratio scales | Variable type |
| Bivariate normality | $(X, Y)$ follow a bivariate normal distribution | Q-Q plots; Shapiro–Wilk |
| Homoscedasticity | Variance of residuals is constant | Residual plot |
| No significant outliers | Outliers can distort $r$ dramatically | Scatter plot; influence analysis |
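The outlier row of the table can be illustrated with a small constructed dataset. The ten points below are hypothetical and chosen so that their correlation is exactly zero; adding a single extreme point at (10, 10) then produces a strong apparent correlation.

```python
import numpy as np

# Hypothetical data constructed so that the cross-products cancel (r = 0)
x = np.array([-2, -1, 0, 1, 2, -2, -1, 0, 1, 2], dtype=float)
y = np.array([1, -1, 0, -1, 1, -1, 1, 0, 1, -1], dtype=float)
r_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme point far from the rest of the data
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier = {r_clean:.3f}")
print(f"r with outlier    = {r_outlier:.3f}")
```

One point out of eleven moves $r$ from 0 to about 0.87, which is why a scatter plot or an influence analysis should precede any interpretation of the coefficient.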
Limitations
Theorem: Pearson's r Misses Non-Linear Relationships
Pearson's $r$ measures linear association only. A dataset can have $r = 0$ yet have a strong non-linear relationship (e.g., $y = x^2$ on a symmetric range). Always visualize the data before relying on $r$.
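The quadratic example can be reproduced directly. In the sketch below `y` is a deterministic function of `x`, so the dependence is perfect, yet Pearson's coefficient is zero up to floating-point error because the range is symmetric around 0. The grid of 61 points on [-3, 3] is an arbitrary choice.

```python
import numpy as np

x = np.linspace(-3, 3, 61)   # symmetric range around zero
y = x ** 2                   # perfect, but non-linear, dependence

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2 on [-3, 3]: {r:.2e}")

# On a one-sided range the same curve is monotonic, and r is large
x_pos = np.linspace(0, 3, 61)
r_pos = np.corrcoef(x_pos, x_pos ** 2)[0, 1]
print(f"Pearson r for y = x^2 on [0, 3]:  {r_pos:.3f}")
```

The second result shows that the zero comes from the symmetry of the range and not from the curve itself. On [0, 3] the relationship is increasing and $r$ is high, though still below 1.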
When assumptions are violated, use:
- Spearman's rank correlation ($\rho_s$): for monotonic (not necessarily linear) relationships
- Kendall's $\tau$: for ordinal data or small samples
- Distance correlation: detects any dependence, not just linear
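The difference between the coefficients shows up on a monotonic but strongly curved relationship. The sketch below assumes SciPy is available and uses `scipy.stats.spearmanr` and `scipy.stats.kendalltau`; the exponential curve is an arbitrary illustrative choice. Distance correlation is omitted because it is not part of SciPy.

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 10, 50)
y = np.exp(x)                 # strictly increasing, far from linear

r_pearson, _ = stats.pearsonr(x, y)
rho_spearman, _ = stats.spearmanr(x, y)
tau_kendall, _ = stats.kendalltau(x, y)

print(f"Pearson r    = {r_pearson:.3f}")
print(f"Spearman rho = {rho_spearman:.3f}")
print(f"Kendall tau  = {tau_kendall:.3f}")
```

Both rank-based coefficients equal 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson's $r$ is clearly lower because the points do not lie on a straight line.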
Pearson Correlation in Machine Learning
| ML Application | Correlation Usage | Why |
|---|---|---|
| Feature selection | High absolute correlation with target → candidate feature | Identify linearly predictive features |
| Multicollinearity | High absolute correlation between features → drop one of the pair | Stability of linear models |
| Feature engineering | Create interaction features | Correlated pairs suggest interactions |
| Data validation | Check feature-target relationship | Sanity check before modeling |
```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Synthetic regression data: 200 samples, 5 features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
df = pd.DataFrame(X, columns=[f'X{i}' for i in range(5)])
df['target'] = y

# Pearson correlation of each feature with the target
corr_with_target = df.corr()['target'].drop('target')
print("Correlation with target:")
print(corr_with_target.sort_values(ascending=False).round(3))

# Flag highly correlated feature pairs (candidates for removal).
# Only the upper triangle is kept so that each pair is checked once.
corr_matrix = df.drop('target', axis=1).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr = [(i, j) for i in upper.index for j in upper.columns
             if upper.loc[i, j] > 0.8]
print(f"\nHighly correlated feature pairs (|r| > 0.8): {high_corr if high_corr else 'None'}")
```
Key Takeaways
Summary: Pearson Correlation
- $r$ ranges from -1 to +1: the sign indicates direction, the magnitude indicates strength
- $r$ measures linear association only: non-linear relationships may have $r \approx 0$
- Correlation ≠ causation: a strong $r$ does not imply one variable causes the other
- $r^2$ is the coefficient of determination: the proportion of variance in $Y$ explained by $X$
- Fisher's $z$-transformation is needed for confidence intervals when $\rho \neq 0$
- Always visualize: scatterplots reveal patterns (curves, outliers, clusters) that $r$ alone misses
- For monotonic but non-linear relationships, use Spearman's $\rho_s$ or Kendall's $\tau$ instead