Point-Biserial Correlation
Bridging Categories and Numbers in One Statistic
When one variable splits your data into two groups (treated/untreated, pass/fail) and the other is continuous (test score, blood pressure), point-biserial correlation gives you a single number that captures the relationship. Its significance test is mathematically equivalent to the independent-samples t-test with pooled variance.
Key things point-biserial correlation helps you understand:
- Group differences — Whether binary membership (e.g., gender, treatment status) is associated with different means on a continuous outcome.
- Effect size — The square of r_pb tells you the proportion of variance explained by group membership.
- T-test equivalence — You can convert a t-statistic directly into r_pb, making it easy to compare effect sizes across studies.
Whenever you run an independent t-test, you already have everything needed to obtain the point-biserial correlation, even if you never compute it explicitly.
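As a rough illustration of the t-to-r conversion mentioned above, the sketch below turns a reported t-statistic and its degrees of freedom into r_pb. The helper name `r_pb_from_t` and the two "studies" are hypothetical; the numbers are made up purely to show the arithmetic.

```python
import math


def r_pb_from_t(t: float, df: int) -> float:
    """Convert an independent-samples t statistic to a point-biserial r.

    The sign of t is preserved, so r is positive when the group listed
    first in the t-test has the higher mean.
    """
    return t / math.sqrt(t**2 + df)


# Hypothetical reported results (illustrative values, not real studies)
study_a = r_pb_from_t(t=2.10, df=38)
study_b = r_pb_from_t(t=3.40, df=198)

print(f"Study A: r_pb = {study_a:.3f}")
print(f"Study B: r_pb = {study_b:.3f}")
```

In this made-up comparison the study with the larger t has the smaller r_pb, because its t-statistic is inflated by a much larger sample. That is the reason r_pb is more convenient than t for comparing effects across studies.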
What is Point-Biserial Correlation?
Definition
The point-biserial correlation measures the association between a binary variable and a continuous variable.
Definition: Point-Biserial Correlation
The point-biserial correlation coefficient is a special case of Pearson's r that measures the relationship between a dichotomous (binary) variable and a continuous variable.
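Point-Biserial Formula
$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$$
Here,
- $M_1$ = Mean of the continuous variable for group 1 (code=1)
- $M_0$ = Mean of the continuous variable for group 0 (code=0)
- $s_n$ = Standard deviation of the continuous variable (population formula, dividing by $n$)
- $n_1, n_0$ = Sample sizes of each group
- $n$ = Total sample size ($n = n_1 + n_0$)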
import numpy as np
from scipy import stats
np.random.seed(42)
# Binary variable: gender (0=Male, 1=Female)
gender = np.array([0]*30 + [1]*30)
# Continuous variable: test scores
scores_male = np.random.normal(75, 10, 30)
scores_female = np.random.normal(82, 10, 30)
scores = np.concatenate([scores_male, scores_female])
r_pb, p_value = stats.pointbiserialr(gender, scores)
print(f"Point-biserial r = {r_pb:.4f}")
print(f"p-value = {p_value:.6f}")
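To connect the definition to the library call, the following sketch computes r_pb by hand as (M1 - M0) / s_n * sqrt(n1 * n0 / n²), with s_n the population standard deviation, and compares it with `scipy.stats.pointbiserialr` and with plain Pearson's r from `np.corrcoef`. The simulated data and the seed are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Simulated data: 30 observations per group, coded 0 and 1
group = np.array([0] * 30 + [1] * 30)
y = np.concatenate([rng.normal(75, 10, 30), rng.normal(82, 10, 30)])

# Ingredients of the formula
m1 = y[group == 1].mean()          # mean of group coded 1
m0 = y[group == 0].mean()          # mean of group coded 0
n1 = int((group == 1).sum())
n0 = int((group == 0).sum())
n = n1 + n0
s_n = y.std(ddof=0)                # population SD (divide by n)

r_manual = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)
r_scipy, _ = stats.pointbiserialr(group, y)
r_pearson = np.corrcoef(group, y)[0, 1]

print(f"manual formula : {r_manual:.6f}")
print(f"pointbiserialr : {r_scipy:.6f}")
print(f"Pearson's r    : {r_pearson:.6f}")
```

All three values should agree to floating-point precision, which is what "special case of Pearson's r" means in practice.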
Relationship to Independent t-Test
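# The point-biserial r can be recovered from the t-statistic:
# r_pb = t / sqrt(t² + df), which keeps the sign of t
# (equivalently, r_pb² = t² / (t² + df))
t_stat, p_t = stats.ttest_ind(scores_female, scores_male)  # group coded 1 goes first so the signs match
df = len(scores_female) + len(scores_male) - 2
r_from_t = t_stat / np.sqrt(t_stat**2 + df)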
print(f"t-statistic = {t_stat:.4f}, p = {p_t:.6f}")
print(f"r from t-test: {r_from_t:.4f}")
print(f"r from pointbiserial: {r_pb:.4f}")
Equivalence to t-test
Testing whether r_pb differs from zero is mathematically equivalent to the independent-samples t-test with pooled variance: both give the same p-value. The square of r_pb equals the proportion of variance in the continuous variable explained by group membership.
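The equivalence can be checked numerically. The sketch below, on simulated data with deliberately unequal group sizes, compares the p-value from `pointbiserialr` with the one from `ttest_ind` using `equal_var=True` (the pooled-variance test), and rebuilds t from r via t = r * sqrt(df / (1 - r²)). Welch's test (`equal_var=False`) is included only for contrast; the data and seed are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed for this sketch

# Unequal group sizes on purpose: 25 coded 0, 35 coded 1
group = np.array([0] * 25 + [1] * 35)
y = np.where(group == 1,
             rng.normal(82, 10, group.size),
             rng.normal(75, 10, group.size))

r, p_r = stats.pointbiserialr(group, y)

# Pooled-variance (Student's) t-test, group coded 1 first
t, p_t = stats.ttest_ind(y[group == 1], y[group == 0], equal_var=True)

# Recover t from r using the same degrees of freedom
df = group.size - 2
t_from_r = r * np.sqrt(df / (1 - r**2))

# Welch's t-test does not share this exact equivalence
p_welch = stats.ttest_ind(y[group == 1], y[group == 0], equal_var=False).pvalue

print(f"t = {t:.4f}, t rebuilt from r = {t_from_r:.4f}")
print(f"p (correlation) = {p_r:.6f}, p (pooled t-test) = {p_t:.6f}")
print(f"p (Welch, for contrast) = {p_welch:.6f}")
```

The first two lines of output should match pairwise; the Welch p-value will generally differ slightly because it does not pool the variances.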
Interpretation
| r_pb Value (absolute) | Interpretation |
|---|---|
| 0.10 – 0.29 | Small effect |
| 0.30 – 0.49 | Medium effect |
| 0.50+ | Large effect |
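The thresholds in the table can be wrapped in a small helper. This is only a sketch: the function name `effect_label` is hypothetical, and the "negligible" label for values below 0.10 is an assumption, since the table does not name that range. The helper works on the absolute value so that negative correlations are labeled by magnitude.

```python
def effect_label(r_pb: float) -> str:
    """Map the magnitude of r_pb to the labels used in the table.

    'negligible' (below 0.10) is an assumed label; the table starts at 0.10.
    """
    size = abs(r_pb)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


# Illustrative values, including a negative correlation
for value in (-0.62, 0.35, 0.12, 0.04):
    print(f"r_pb = {value:+.2f} -> {effect_label(value)} effect")
```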
# Effect size interpretation
r_squared = r_pb**2
print(f"r² = {r_squared:.4f}")
print(f"{r_squared*100:.1f}% of variance in scores explained by gender")
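To see why r² can be read as "variance explained", the sketch below compares r_pb² with the ratio of the between-group sum of squares to the total sum of squares. The data are simulated and the seed is arbitrary; both are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed for this sketch

group = np.array([0] * 30 + [1] * 30)
y = np.concatenate([rng.normal(75, 10, 30), rng.normal(82, 10, 30)])

r, _ = stats.pointbiserialr(group, y)

grand_mean = y.mean()

# Total variation of the continuous variable around the grand mean
ss_total = ((y - grand_mean) ** 2).sum()

# Variation attributable to the two group means
ss_between = sum(
    (group == g).sum() * (y[group == g].mean() - grand_mean) ** 2
    for g in (0, 1)
)

explained = ss_between / ss_total

print(f"r_pb squared          = {r**2:.6f}")
print(f"SS_between / SS_total = {explained:.6f}")
```

The two printed values should coincide: squaring r_pb gives exactly the share of total variation that the group means account for.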
Point-Biserial Correlation in Machine Learning
| ML Application | Usage | Why |
|---|---|---|
| Feature selection | Binary target vs continuous feature | Identify discriminative features |
| A/B testing | Binary variant (A/B) vs continuous metric | Measure treatment effect |
| Classification | Binary class separation | Quick feature importance |
import numpy as np
from scipy.stats import pointbiserialr
np.random.seed(42)
# Binary outcome (e.g., pass/fail) and continuous feature (e.g., hours studied)
passed = np.random.binomial(1, 0.6, 200)
hours = np.where(passed == 1,
                 np.random.normal(8, 2, 200),
                 np.random.normal(4, 2, 200))
r, p = pointbiserialr(passed, hours)
print(f"Point-biserial r: {r:.3f}, p-value: {p:.4f}")
strength = "strongly" if abs(r) >= 0.5 else "moderately" if abs(r) >= 0.3 else "weakly"
print(f"Hours studied is {strength} correlated with passing")
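For the feature-selection use case, one possible approach is to compute r_pb between the binary target and each continuous feature, then rank features by absolute value. The sketch below uses three synthetic features whose names and effect sizes are hypothetical, chosen so that the expected ordering is obvious.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(3)  # arbitrary seed for this sketch
n = 500
target = rng.binomial(1, 0.5, n)  # binary class label

# Hypothetical features: the shift added for class 1 controls the signal strength
features = {
    "strong_signal": rng.normal(0, 1, n) + 1.5 * target,
    "weak_signal": rng.normal(0, 1, n) + 0.3 * target,
    "pure_noise": rng.normal(0, 1, n),
}

# Point-biserial r and p-value for every feature
scores = {}
for name, values in features.items():
    r, p = pointbiserialr(target, values)
    scores[name] = (r, p)

# Rank by magnitude, since a strong negative r is as useful as a strong positive one
ranking = sorted(scores, key=lambda name: abs(scores[name][0]), reverse=True)

for name in ranking:
    r, p = scores[name]
    print(f"{name:>14}: r_pb = {r:+.3f}, p = {p:.4g}")
```

This kind of ranking only captures a linear shift in means between the two classes, so it is best treated as a quick first filter rather than a complete measure of feature importance.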
Key Takeaways
- Measures association between a binary and continuous variable; it is a special case of Pearson's r.
- Equivalent to the independent t-test: r_pb² = t²/(t²+df), so every t-test already produces a point-biserial correlation.
- Positive r_pb means group 1 (coded 1) has a higher mean; negative means group 0 has the higher mean.
- r_pb² gives the proportion of variance in the continuous variable explained by group membership: your effect size in one number.
"The t-test and point-biserial correlation are two sides of the same coin — one tells you if the difference is significant, the other tells you how big it actually is."
Summary: Point-Biserial Correlation
- Measures association between a binary and continuous variable — a special case of Pearson's r
- Equivalent to the independent t-test — r_pb² = t²/(t²+df)
- Positive r_pb: group 1 (coded 1) has higher mean; negative: group 0 has higher mean
- Assumptions: continuous variable is approximately normal within each group with similar variances across groups, observations are independent
- Effect size: r_pb² gives the proportion of variance in the continuous variable explained by group membership
- Use when: one variable is naturally dichotomous (pass/fail, male/female, treated/untreated)
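The assumptions listed above can be screened before relying on r_pb. The sketch below is one possible check, not a prescribed procedure: it runs a Shapiro-Wilk test per group for approximate normality and, as an addition on my part, Levene's test for similar variances, since the pooled-variance t-test that r_pb corresponds to also relies on that. Data and seed are simulated assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed for this sketch

group = np.array([0] * 40 + [1] * 40)
y = np.concatenate([rng.normal(75, 10, 40), rng.normal(82, 10, 40)])

y0, y1 = y[group == 0], y[group == 1]

# Shapiro-Wilk: a small p-value suggests departure from normality within a group
normality_p = {}
for label, sample in (("group 0", y0), ("group 1", y1)):
    w_stat, p_value = stats.shapiro(sample)
    normality_p[label] = p_value
    print(f"Shapiro-Wilk {label}: W = {w_stat:.3f}, p = {p_value:.3f}")

# Levene: a small p-value suggests the group variances differ
lev_stat, lev_p = stats.levene(y0, y1)
print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.3f}")
```

These tests are sensitive to sample size, so their p-values are best read alongside a histogram or Q-Q plot of each group rather than as a strict pass/fail gate.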