Point-Biserial Correlation

Bridging Categories and Numbers in One Statistic

When one variable splits your data into two groups (treated/untreated, pass/fail) and the other is continuous (test score, blood pressure), the point-biserial correlation summarizes their relationship in a single number. Testing whether that number differs from zero is mathematically equivalent to running an independent-samples t-test with pooled variance.

Key things point-biserial correlation helps you understand:

Group differences — Whether binary membership (e.g., gender, treatment status) is associated with different means on a continuous outcome.
Effect size — The square of r_pb tells you the proportion of variance explained by group membership.
T-test equivalence — You can convert a t-statistic directly into r_pb, making it easy to compare effect sizes across studies.

Whenever you run an independent-samples t-test, the point-biserial correlation is implicit in the result: the t-statistic and its degrees of freedom determine the magnitude of r_pb exactly.

What is Point-Biserial Correlation?

Definition

The point-biserial correlation measures the association between a binary variable and a continuous variable.

Point-Biserial Correlation

The point-biserial correlation coefficient is a special case of Pearson's r that measures the relationship between a dichotomous (binary) variable and a continuous variable.
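
To make the "special case of Pearson's r" statement concrete, here is a minimal sketch that runs both `stats.pearsonr` and `stats.pointbiserialr` on the same 0/1-coded data. The data and the names `group` and `outcome` are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: a 0/1 group indicator and a continuous outcome
group = np.repeat([0, 1], 25)                   # 25 zeros, then 25 ones
outcome = rng.normal(loc=50 + 5 * group, scale=8)

# Pearson's r on the 0/1 codes versus the dedicated point-biserial function
r_pearson, p_pearson = stats.pearsonr(group, outcome)
r_pb_demo, p_pb_demo = stats.pointbiserialr(group, outcome)

print(f"Pearson r        = {r_pearson:.6f}")
print(f"Point-biserial r = {r_pb_demo:.6f}")
```

The two coefficients and their p-values should agree to floating-point precision, so any routine that computes Pearson's r can be used once the binary variable is coded as 0 and 1.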

Point-Biserial Formula

$$r_{pb} = \frac{\bar{x}_1 - \bar{x}_0}{s_n} \sqrt{\frac{n_0 n_1}{n^2}}$$

Here,

$\bar{x}_1$ = Mean of the continuous variable for group 1 (code = 1)
$\bar{x}_0$ = Mean of the continuous variable for group 0 (code = 0)
$s_n$ = Standard deviation of the continuous variable over all $n$ observations (population formula, dividing by $n$)
$n_0, n_1$ = Sample sizes of group 0 and group 1
$n$ = Total sample size ($n = n_0 + n_1$)

import numpy as np
from scipy import stats

np.random.seed(42)

# Binary variable: gender (0=Male, 1=Female)
gender = np.array([0]*30 + [1]*30)

# Continuous variable: test scores
scores_male = np.random.normal(75, 10, 30)
scores_female = np.random.normal(82, 10, 30)
scores = np.concatenate([scores_male, scores_female])

r_pb, p_value = stats.pointbiserialr(gender, scores)
print(f"Point-biserial r = {r_pb:.4f}")
print(f"p-value          = {p_value:.6f}")
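
As a check on the formula above, the following sketch computes r_pb term by term and compares it with `stats.pointbiserialr`. The helper `point_biserial_manual` and the demo data are hypothetical names introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def point_biserial_manual(binary, values):
    """Compute r_pb from group means, the population SD, and group sizes."""
    binary = np.asarray(binary)
    values = np.asarray(values, dtype=float)
    x1 = values[binary == 1]          # observations coded 1
    x0 = values[binary == 0]          # observations coded 0
    n1, n0 = len(x1), len(x0)
    n = n0 + n1
    s_n = values.std(ddof=0)          # population SD (divide by n)
    return (x1.mean() - x0.mean()) / s_n * np.sqrt(n0 * n1 / n**2)

# Illustrative data with unequal group sizes
rng = np.random.default_rng(1)
binary_demo = np.array([0] * 20 + [1] * 30)
values_demo = rng.normal(60, 12, size=50) + 6 * binary_demo

r_manual = point_biserial_manual(binary_demo, values_demo)
r_scipy, _ = stats.pointbiserialr(binary_demo, values_demo)

print(f"manual r_pb = {r_manual:.6f}")
print(f"scipy  r_pb = {r_scipy:.6f}")
```

Note that the match depends on using the population standard deviation (`ddof=0`). With the sample standard deviation (`ddof=1`), the square-root factor would have to change accordingly.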

Relationship to Independent t-Test

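# The magnitude of the point-biserial r can be recovered from the t-statistic:
# |r_pb| = sqrt(t² / (t² + df)), where df = n - 2.
# The sign of r_pb matches the sign of t when the group coded 1 is passed first.

# ttest_ind pools the variances by default (equal_var=True),
# which is the version of the test that corresponds to r_pb.
t_stat, p_t = stats.ttest_ind(scores_female, scores_male)
df = len(scores_female) + len(scores_male) - 2
r_from_t = np.sign(t_stat) * np.sqrt(t_stat**2 / (t_stat**2 + df))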

print(f"t-statistic = {t_stat:.4f}, p = {p_t:.6f}")
print(f"r from t-test: {r_from_t:.4f}")
print(f"r from pointbiserial: {r_pb:.4f}")
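
The conversion also works in the other direction. The sketch below, on made-up data, recovers t from r_pb using $t = r_{pb}\sqrt{df / (1 - r_{pb}^2)}$ and checks that the two tests report the same p-value. The helpers `r_to_t` and `t_to_r` are hypothetical names, not SciPy functions.

```python
import numpy as np
from scipy import stats

def r_to_t(r, n):
    """Convert a point-biserial r to the pooled-variance t-statistic (df = n - 2)."""
    df = n - 2
    return r * np.sqrt(df / (1 - r**2))

def t_to_r(t, df):
    """Convert a t-statistic back to r, keeping the sign of t."""
    return np.sign(t) * np.sqrt(t**2 / (t**2 + df))

rng = np.random.default_rng(2)
low = rng.normal(70, 10, 40)    # group coded 0
high = rng.normal(76, 10, 35)   # group coded 1
codes = np.array([0] * len(low) + [1] * len(high))
values = np.concatenate([low, high])

r, p_r = stats.pointbiserialr(codes, values)
# Group coded 1 goes first so that t and r share the same sign
t, p_t = stats.ttest_ind(high, low)

print(f"t from ttest_ind = {t:.4f}, t from r = {r_to_t(r, len(values)):.4f}")
print(f"p from r = {p_r:.6f}, p from t = {p_t:.6f}")
```

If `ttest_ind` were called with `equal_var=False` (Welch's test), the statistic and p-value would generally no longer match r_pb.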

Equivalence to t-test

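The significance test for r_pb is mathematically equivalent to the independent-samples t-test with pooled variance: both yield the same t-statistic and p-value, and the two statistics are linked by r_pb² = t² / (t² + df) with df = n - 2. The square of r_pb equals the proportion of variance in the continuous variable that is explained by group membership.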
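
The "variance explained" reading can be verified directly. In this sketch, on made-up data, the between-group sum of squares divided by the total sum of squares (eta squared in ANOVA terms) is compared with r_pb². The variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
codes = np.array([0] * 30 + [1] * 30)
values = rng.normal(75, 10, 60) + 7 * codes

grand_mean = values.mean()

# Total variability of the continuous variable around the grand mean
ss_total = np.sum((values - grand_mean) ** 2)

# Variability attributable to the difference between the two group means
ss_between = sum(
    np.sum(codes == g) * (values[codes == g].mean() - grand_mean) ** 2
    for g in (0, 1)
)

r, _ = stats.pointbiserialr(codes, values)

print(f"r_pb squared          = {r**2:.6f}")
print(f"SS_between / SS_total = {ss_between / ss_total:.6f}")
```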

Interpretation

|r_pb| Value	Interpretation
0.10 – 0.29	Small effect
0.30 – 0.49	Medium effect
0.50+	Large effect

# Effect size interpretation
r_squared = r_pb**2
print(f"r² = {r_squared:.4f}")
print(f"{r_squared*100:.1f}% of variance in scores explained by gender")
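
The thresholds in the table above can be wrapped in a small helper. This is only a sketch: the function name `effect_size_label` is hypothetical, the "negligible" label for values below 0.10 is an assumption (the table does not name that range), and values that fall between the listed ranges, such as 0.295, are assigned to the lower category.

```python
def effect_size_label(r):
    """Map a correlation to the small/medium/large labels from the table."""
    magnitude = abs(r)      # the sign only reflects how the groups are coded
    if magnitude >= 0.50:
        return "large"
    if magnitude >= 0.30:
        return "medium"
    if magnitude >= 0.10:
        return "small"
    return "negligible"     # assumed label; not listed in the table

for example_r in (0.05, -0.22, 0.35, -0.61):
    print(f"r = {example_r:+.2f} -> {effect_size_label(example_r)}")
```

These cut-offs are conventions rather than rules, so the label is best reported alongside the coefficient itself and the context of the study.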