Phi Coefficient
When Both Variables Are Yes-or-No Questions
Much real-world data is binary: yes/no, pass/fail, true/false. The phi coefficient takes two binary variables and produces a single correlation value that tells you whether they tend to agree, and how strongly.
Key things the phi coefficient helps you understand:
- Binary association: whether two dichotomous variables tend to take the same value (positive φ) or opposite values (negative φ).
- Chi-square bridge: φ² = χ²/n, where χ² is the Pearson chi-square statistic without continuity correction, so phi turns the test of independence into an effect size.
- Symmetric and bounded: φ ranges from -1 to +1, just like Pearson's r, and φ(A,B) = φ(B,A).
For 2×2 tables, phi is the simplest measure of association: a single signed number computed from the four cell counts.
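The symmetry and sign properties listed above can be checked numerically. The following is a minimal sketch: the helper name `phi_from_table` and the example counts are illustrative assumptions, not part of any library.

```python
import numpy as np

def phi_from_table(table):
    """Signed phi for a 2x2 table laid out as [[n00, n01], [n10, n11]].

    Hypothetical helper written for this example.
    """
    t = np.asarray(table, dtype=float)
    n00, n01, n10, n11 = t[0, 0], t[0, 1], t[1, 0], t[1, 1]
    # Product of the two row totals and the two column totals
    denom = np.sqrt((n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11))
    return (n11 * n00 - n10 * n01) / denom

# Illustrative counts: rows are A=0/A=1, columns are B=0/B=1
table = np.array([[30, 10],
                  [15, 45]])

phi_ab = phi_from_table(table)                # phi(A, B)
phi_ba = phi_from_table(table.T)              # swapping the variables transposes the table
phi_flipped = phi_from_table(table[:, ::-1])  # recoding B (0 <-> 1) reverses the columns

print(f"phi(A,B)             = {phi_ab:+.4f}")
print(f"phi(B,A)             = {phi_ba:+.4f}")
print(f"phi after recoding B = {phi_flipped:+.4f}")
```

Transposing the table leaves phi unchanged, while recoding one variable keeps the magnitude and flips the sign.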
What is the Phi Coefficient?
Definition
The phi coefficient (φ) measures the association between two binary variables using a 2×2 contingency table.
Phi Coefficient
The phi coefficient is a measure of association for two binary variables, computed from the 2×2 contingency table. It ranges from -1 (perfect negative association) to +1 (perfect positive association).
Phi Coefficient Formula
$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$
Here,
- $a$ = Count where both variables are 1 (true positives)
- $b$ = Count where X=1, Y=0
- $c$ = Count where X=0, Y=1
- $d$ = Count where both variables are 0 (true negatives)
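import numpy as np
from scipy import stats

# 2x2 contingency table of observed counts (rows: X, columns: Y)
#        Y=0  Y=1
# X=0     d    c
# X=1     b    a
table = np.array([[30, 10],   # X=0: Y=0 -> 30, Y=1 -> 10
                  [15, 45]])  # X=1: Y=0 -> 15, Y=1 -> 45

# correction=False disables the Yates continuity correction that SciPy applies
# to 2x2 tables by default; phi² = chi²/n holds only for the uncorrected statistic.
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
n = table.sum()

# sqrt(chi2 / n) gives the magnitude |phi|; the sign comes from ad - bc.
phi = np.sqrt(chi2 / n)

print(f"2x2 Table:\n{table}")
print(f"\nChi-square = {chi2:.4f}")
print(f"|φ| = {phi:.4f}")
print(f"p-value = {p_value:.6f}")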
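Relationship to Chi-Square
Phi from Chi-Square
$$\phi^2 = \frac{\chi^2}{n} \quad\Longrightarrow\quad |\phi| = \sqrt{\frac{\chi^2}{n}}$$
Here,
- $\chi^2$ = Chi-square statistic from the 2×2 table (Pearson's statistic, without continuity correction)
- $n$ = Total sample size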
# Verify: phi² = chi²/n (the square root recovers the magnitude only, not the sign)
phi_squared = chi2 / n
print(f"φ² = {phi_squared:.4f}")
print(f"|φ| = {np.sqrt(phi_squared):.4f}")
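As a hand check, the same quantities can be worked out for the example table used in the code above. With the cell definitions given earlier, that table gives a = 45, b = 15, c = 10, d = 30 and n = 100. The arithmetic assumes the uncorrected Pearson chi-square.

$$
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
     = \frac{45 \cdot 30 - 15 \cdot 10}{\sqrt{60 \cdot 40 \cdot 55 \cdot 45}}
     = \frac{1200}{\sqrt{5\,940\,000}} \approx 0.4924
$$

$$
\chi^2 = n\,\phi^2 = 100 \times \frac{1200^2}{5\,940\,000} \approx 24.24
$$

SciPy's `chi2_contingency` applies the Yates continuity correction to 2×2 tables by default. With that correction the statistic for this table is about 22.26, and the identity φ² = χ²/n no longer holds exactly.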
Relationship to Other Measures
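For a 2×2 table, the magnitude of phi equals √(χ²/n), the square root of the chi-square statistic divided by the sample size. Phi is also identical to the Pearson correlation between the two variables when each is coded as 0/1, which is why it carries a sign and shares the range of r.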
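The equivalence with Pearson's r can be sketched by expanding a 2×2 table into one 0/1 observation per subject and comparing `np.corrcoef` with the cell-count formula. The counts below are assumed example data.

```python
import numpy as np

# Illustrative 2x2 counts: rows are X=0/X=1, columns are Y=0/Y=1 (assumed data)
counts = np.array([[30, 10],
                   [15, 45]])

# Expand the table into one 0/1 observation per subject.
# Cell order after ravel(): (X=0,Y=0), (X=0,Y=1), (X=1,Y=0), (X=1,Y=1)
x = np.repeat([0, 0, 1, 1], counts.ravel())
y = np.repeat([0, 1, 0, 1], counts.ravel())

# Pearson correlation of the 0/1-coded variables
pearson_r = np.corrcoef(x, y)[0, 1]

# Phi from the cell counts: (n11*n00 - n10*n01) / sqrt(product of the four margins)
n00, n01, n10, n11 = counts.ravel().astype(float)
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)
)

print(f"Pearson r = {pearson_r:.4f}")
print(f"phi       = {phi:.4f}")
```

The two values agree up to floating-point error, so any routine that computes Pearson's r can be used on 0/1-coded data to obtain phi.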
Interpretation
| Absolute value of φ (rule of thumb) | Interpretation |
|---|---|
| 0.00 to below 0.10 | Negligible association |
| 0.10 to below 0.30 | Weak association |
| 0.30 to below 0.50 | Moderate association |
| 0.50 and above | Strong association |
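The bands in the table above can be turned into a small labelling function. This is a sketch only: `describe_phi` is a hypothetical helper, and treating each lower bound as inclusive is an assumption, since the table does not say which band a boundary value belongs to.

```python
def describe_phi(phi):
    """Return (strength, direction) for a phi value.

    Hypothetical helper; lower bounds are treated as inclusive (an assumption).
    """
    magnitude = abs(phi)
    if magnitude < 0.10:
        strength = "negligible"
    elif magnitude < 0.30:
        strength = "weak"
    elif magnitude < 0.50:
        strength = "moderate"
    else:
        strength = "strong"

    # The sign gives the direction; the bands apply to the absolute value
    if phi > 0:
        direction = "positive"
    elif phi < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

for value in (0.05, -0.25, 0.49, -0.80):
    strength, direction = describe_phi(value)
    print(f"phi = {value:+.2f}: {strength} association, direction: {direction}")
```

These cutoffs are conventions rather than fixed rules, so what counts as a meaningful value depends on the field and the sample size.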
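# Manual calculation from the four cell counts
# a = both 1, b = (X=1, Y=0), c = (X=0, Y=1), d = both 0; rows of `table` are X, columns are Y
a, b, c, d = table[1, 1], table[1, 0], table[0, 1], table[0, 0]
phi_manual = (a*d - b*c) / np.sqrt((a+b)*(c+d)*(a+c)*(b+d))
print(f"Manual φ = {phi_manual:.4f}")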
Phi Coefficient in Machine Learning
| ML Application | Phi Usage | Why |
|---|---|---|
| Binary classification | Phi coefficient of confusion matrix | Equals the Matthews correlation coefficient (MCC); uses all four cells, so it stays informative under class imbalance |
| Feature selection | Binary vs binary association | Quick screening |
| Medical ML | Disease vs symptom association | Clinical feature selection |
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# Note: the denominator is zero if the model predicts only one class
phi = (tp*tn - fp*fn) / np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))

print(f"Confusion matrix: TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Phi coefficient: {phi:.4f}")
# scikit-learn exposes the same quantity as the Matthews correlation coefficient
print(f"matthews_corrcoef: {matthews_corrcoef(y_test, y_pred):.4f}")
print("Phi = 1 is perfect, 0 is no better than chance, -1 is total disagreement")
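The feature-selection row in the table above can be illustrated with a quick screening pass that ranks binary features by the magnitude of their phi with a binary target. This is a sketch on synthetic data: `phi_binary` and the feature names are hypothetical, introduced only for this example.

```python
import numpy as np

def phi_binary(x, y):
    """Signed phi between two 0/1 arrays (hypothetical helper).

    Returns nan when either variable is constant, because a margin is then zero.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    denom = np.sqrt(float(n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else float("nan")

rng = np.random.default_rng(0)  # synthetic data, for illustration only
n = 1000
target = rng.integers(0, 2, size=n)
noise_mask = rng.random(n) < 0.2  # flip about 20% of the values

features = {
    "noisy_copy_of_target": np.where(noise_mask, 1 - target, target),
    "unrelated_coin_flip": rng.integers(0, 2, size=n),
    "inverted_target": 1 - target,
}

# Rank features by |phi|: the sign shows direction, the magnitude shows strength
scores = {name: phi_binary(col, target) for name, col in features.items()}
for name, score in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:24s} phi = {score:+.3f}")
```

Ranking by absolute value matters here: a feature that is strongly negatively associated with the target is just as informative as one that is strongly positively associated.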
Key Takeaways
Phi measures association between two binary variables — requires a 2×2 contingency table and nothing else.
φ ranges from -1 to +1 — sign indicates the direction of association, just like Pearson's r.
φ² = χ²/n — directly related to the chi-square statistic, making it a natural effect size for the chi-square test of independence.
Only for 2×2 tables. For larger tables, use Cramér's V, which generalizes the magnitude of phi to any r×c table and therefore carries no sign.
"Two binary variables, one elegant number — phi is Pearson's r when everything is a yes-or-no question."
Summary: Phi Coefficient
- Phi measures association between two binary variables — requires a 2×2 contingency table
- φ ranges from -1 to +1 — sign indicates direction of association
- φ² = χ²/n — directly related to the chi-square statistic
- Equivalent to Pearson's r when both variables are binary (0/1)
- Symmetric: φ(A,B) = φ(B,A)
- Limitations: only works for 2×2 tables; use Cramér's V for larger tables
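For tables larger than 2×2, Cramér's V can be sketched from the same chi-square statistic as V = √(χ² / (n · min(r−1, c−1))). The helper `cramers_v` and both example tables are illustrative assumptions. The continuity correction is switched off so that V reduces to |φ| in the 2×2 case.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for an r x c table of counts (hypothetical helper)."""
    t = np.asarray(table)
    # Uncorrected Pearson chi-square; SciPy would otherwise apply Yates to 2x2 tables
    chi2, _, _, _ = stats.chi2_contingency(t, correction=False)
    n = t.sum()
    k = min(t.shape) - 1  # min(r - 1, c - 1)
    return np.sqrt(chi2 / (n * k))

two_by_two = np.array([[30, 10],
                       [15, 45]])        # illustrative counts
three_by_three = np.array([[20, 10, 5],
                           [10, 25, 10],
                           [5, 10, 30]])  # illustrative counts

v_2x2 = cramers_v(two_by_two)      # equals |phi| for a 2x2 table
v_3x3 = cramers_v(three_by_three)  # unsigned, between 0 and 1

print(f"Cramér's V (2x2) = {v_2x2:.4f}")
print(f"Cramér's V (3x3) = {v_3x3:.4f}")
```

Because V is built from χ² alone, it reports strength of association but not direction.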