Phi Coefficient

When Both Variables Are Yes-or-No Questions

Much real-world data is binary: yes/no, pass/fail, true/false. The phi coefficient takes two binary variables and produces a single correlation value that tells you whether they tend to agree, and how strongly.

Key things the phi coefficient helps you understand:

Binary association — Whether two dichotomous variables move together or in opposite directions.
Chi-square bridge — Phi is directly derived from the chi-square statistic (φ² = χ²/n), linking test of independence to effect size.
Symmetric and bounded — φ ranges from -1 to +1, just like Pearson's r, and φ(A,B) = φ(B,A).

For 2×2 tables, phi is a simple and widely used measure of association.

What is the Phi Coefficient?

Definition

The phi coefficient (φ) is a measure of association between two binary variables, computed from their 2×2 contingency table. It ranges from -1 (perfect negative association) to +1 (perfect positive association), with 0 indicating no association.

Phi Coefficient Formula

\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

Here,

$a$ = Count where both variables are 1 (true positives)
$b$ = Count where X=1, Y=0
$c$ = Count where X=0, Y=1
$d$ = Count where both variables are 0 (true negatives)

import numpy as np
from scipy import stats

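# 2x2 contingency table (cell labels follow the formula definitions:
# a = both 1, b = X=1/Y=0, c = X=0/Y=1, d = both 0)
#               Y=0    Y=1
# X=0           d      c
# X=1           b      a

table = np.array([[30, 10],   # X=0: Y=0=30, Y=1=10
                  [15, 45]])  # X=1: Y=0=15, Y=1=45

# correction=False disables Yates' continuity correction, which SciPy applies
# to 2x2 tables by default and which would break the identity phi^2 = chi^2 / n.
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
n = table.sum()

# sqrt(chi2 / n) gives only |phi|; the sign is the sign of ad - bc.
sign = np.sign(table[1, 1] * table[0, 0] - table[1, 0] * table[0, 1])
phi = sign * np.sqrt(chi2 / n)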

print(f"2x2 Table:\n{table}")
print(f"\nChi-square = {chi2:.4f}")
print(f"Phi (φ)    = {phi:.4f}")
print(f"p-value    = {p_value:.6f}")
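
As a hand check on the formula, the counts from the example table can be plugged in directly. Reading the cells according to the definitions above gives a = 45 (both 1), b = 15 (X=1, Y=0), c = 10 (X=0, Y=1), and d = 30 (both 0). The figures below are rounded.

```latex
\begin{aligned}
ad - bc &= 45 \cdot 30 - 15 \cdot 10 = 1200 \\
(a+b)(c+d)(a+c)(b+d) &= 60 \cdot 40 \cdot 55 \cdot 45 = 5{,}940{,}000 \\
\phi &= \frac{1200}{\sqrt{5{,}940{,}000}} \approx \frac{1200}{2437.2} \approx 0.492
\end{aligned}
```

With n = 100, the matching uncorrected chi-square is nφ² ≈ 24.24. SciPy's chi2_contingency applies Yates' continuity correction to 2×2 tables by default, so it reproduces this value only when called with correction=False.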

Relationship to Chi-Square

Phi from Chi-Square

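|\phi| = \sqrt{\frac{\chi^2}{n}}

Here,

$\chi^2$ = Pearson chi-square statistic from the 2×2 table, computed without continuity correction
$n$ = Total sample size

The square root recovers only the magnitude of φ; the sign is the sign of $ad - bc$.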

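# Verify: phi² = chi²/n (holds for the uncorrected chi-square only)
chi2_uncorrected = stats.chi2_contingency(table, correction=False)[0]
phi_squared = chi2_uncorrected / n
print(f"φ²  = {phi_squared:.4f}")
print(f"|φ| = {np.sqrt(phi_squared):.4f}")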
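
The identity φ² = χ²/n fixes only the magnitude of φ, because chi-square is never negative. The sketch below illustrates this with a hypothetical table in which the two variables tend to disagree. The helper names (`signed_phi`, `pearson_chi2`) and the counts are made up for illustration, and the cells are simply read in row-major order.

```python
import math

def signed_phi(a, b, c, d):
    """Phi from the four cell counts of a 2x2 table (row-major order)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def pearson_chi2(table):
    """Uncorrected Pearson chi-square: sum of (observed - expected)^2 / expected."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

# Hypothetical counts: most observations fall in the off-diagonal cells.
table = [[10, 30],
         [45, 15]]
(a, b), (c, d) = table

chi2, n = pearson_chi2(table)
phi_magnitude = math.sqrt(chi2 / n)   # always non-negative
phi_signed = signed_phi(a, b, c, d)   # keeps the direction

print(f"sqrt(chi2 / n) = {phi_magnitude:.4f}")
print(f"signed phi     = {phi_signed:.4f}")
```

When direction matters, it is usually safer to compute φ from the cell counts and to use χ²/n only as a check on the magnitude.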

Relationship to Other Measures

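For a 2×2 table, the magnitude of phi equals the square root of the chi-square statistic (without continuity correction) divided by n. Phi is also identical to the Pearson correlation between the two variables when each is coded as 0/1, and to the Matthews correlation coefficient (MCC) when the table is a binary confusion matrix.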
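
The Pearson equivalence can be checked numerically. The sketch below expands the example counts into one 0/1 pair per observation and compares NumPy's correlation with the cell-count formula. The variable names are illustrative, and the cell labels follow the formula definitions (a = both 1, d = both 0).

```python
import numpy as np

# Example counts: rows are X=0, X=1; columns are Y=0, Y=1.
table = np.array([[30, 10],
                  [15, 45]])

# Expand the table into one (x, y) pair per observation.
x_values, y_values = [], []
for x in (0, 1):
    for y in (0, 1):
        count = int(table[x, y])
        x_values.extend([x] * count)
        y_values.extend([y] * count)
x_arr = np.array(x_values)
y_arr = np.array(y_values)

# Pearson's r on the 0/1-coded data.
pearson_r = np.corrcoef(x_arr, y_arr)[0, 1]

# Phi from the cell counts.
a, b, c, d = (float(table[1, 1]), float(table[1, 0]),
              float(table[0, 1]), float(table[0, 0]))
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"Pearson r = {pearson_r:.4f}")
print(f"Phi       = {phi:.4f}")
```

Both lines should print the same value, up to floating-point rounding.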

Interpretation

|φ| Value	Interpretation (rule of thumb)
0.00 – 0.10	Negligible association
0.10 – 0.30	Weak association
0.30 – 0.50	Moderate association
0.50+	Strong association
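
One way to apply these bands in code is a small lookup helper. This is only a sketch: `describe_phi` is a hypothetical name, the bands are applied to the absolute value of φ, and sending a boundary value such as 0.30 to the higher band is an assumption, since the table leaves the boundaries ambiguous.

```python
def describe_phi(phi):
    """Return (strength, direction) labels for a phi value.

    Assumption: boundary values (0.10, 0.30, 0.50) fall into the higher band.
    """
    magnitude = abs(phi)
    if magnitude < 0.10:
        strength = "negligible"
    elif magnitude < 0.30:
        strength = "weak"
    elif magnitude < 0.50:
        strength = "moderate"
    else:
        strength = "strong"

    if phi > 0:
        direction = "positive"
    elif phi < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

for value in (0.05, -0.22, 0.4924, -0.62):
    strength, direction = describe_phi(value)
    print(f"phi = {value:+.4f}: {strength} association, direction: {direction}")
```

These thresholds are conventions rather than hard rules, so the labels are best read alongside the raw value and the sample size.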

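# Manual calculation, with cells labeled as in the formula:
# a = both 1, b = X=1/Y=0, c = X=0/Y=1, d = both 0
# (floats avoid integer overflow when the four margins are multiplied)
a, b, c, d = (float(v) for v in (table[1,1], table[1,0], table[0,1], table[0,0]))
phi_manual = (a*d - b*c) / np.sqrt((a+b)*(c+d)*(a+c)*(b+d))
print(f"Manual φ = {phi_manual:.4f}")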

Phi Coefficient in Machine Learning

ML Application	Phi Usage	Why
Binary classification	Phi coefficient of confusion matrix	Equals the Matthews correlation coefficient (MCC); uses all four cells, so it stays informative under class imbalance
Feature selection	Binary vs binary association	Quick screening
Medical ML	Disease vs symptom association	Clinical feature selection

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

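# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Phi of the confusion matrix is the Matthews correlation coefficient;
# sklearn.metrics.matthews_corrcoef(y_test, y_pred) computes the same value.
# Casting the margins to float avoids integer overflow on large test sets.
phi = (tp*tn - fp*fn) / np.sqrt(float(tp+fp)*float(tp+fn)*float(tn+fp)*float(tn+fn))
print(f"Confusion matrix: TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Phi coefficient: {phi:.4f}")
print("Phi = 1 is perfect, 0 is no better than chance, -1 is total disagreement")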
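
Two practical points are easy to miss when phi is used as a classification metric: accuracy can look strong on imbalanced data while phi stays modest, and the formula is undefined when any margin of the confusion matrix is zero. The sketch below illustrates both with made-up counts. `phi_from_confusion` is a hypothetical helper, and returning 0.0 for a zero denominator is a convention assumed here, not a mathematical result.

```python
import math

def phi_from_confusion(tp, tn, fp, fn):
    """Phi (MCC) from confusion-matrix counts.

    Assumption: returns 0.0 when any margin is zero, since the
    formula itself is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
# The classifier finds 1 of the 5 positives and raises 1 false alarm.
tp, fn, fp, tn = 1, 4, 1, 94
accuracy = (tp + tn) / (tp + tn + fp + fn)
phi = phi_from_confusion(tp, tn, fp, fn)
print(f"Accuracy = {accuracy:.2f}")
print(f"Phi      = {phi:.4f}")

# Degenerate case: the classifier predicts "negative" for every sample.
phi_all_negative = phi_from_confusion(tp=0, tn=95, fp=0, fn=5)
print(f"Phi (all-negative predictions) = {phi_all_negative:.1f}")
```

Here accuracy is high because negatives dominate, while phi reflects that most positives were missed.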

Key Takeaways

Phi measures association between two binary variables — requires a 2×2 contingency table and nothing else.

φ ranges from -1 to +1 — sign indicates the direction of association, just like Pearson's r.

φ² = χ²/n — directly related to the chi-square statistic, making it a natural effect size for the chi-square test of independence.

Only for 2×2 tables — for larger tables, graduate to Cramér's V, which generalizes phi to any r×c table.
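
For reference, Cramér's V is commonly defined as V = sqrt(χ² / (n · min(r − 1, c − 1))), using the uncorrected chi-square. The sketch below implements that definition in plain Python and shows that it reduces to |φ| on a 2×2 table. The function name `cramers_v` and the 3×3 counts are hypothetical.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c table of counts (uncorrected chi-square)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# On a 2x2 table, V equals the magnitude of phi.
v_2x2 = cramers_v([[30, 10],
                   [15, 45]])
print(f"Cramér's V (2x2) = {v_2x2:.4f}")

# Hypothetical 3x3 table with perfect association: V reaches its maximum of 1.
v_3x3 = cramers_v([[10, 0, 0],
                   [0, 10, 0],
                   [0, 0, 10]])
print(f"Cramér's V (3x3) = {v_3x3:.4f}")
```

Unlike φ, V is never negative, so it reports the strength of association but not its direction.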

"Two binary variables, one elegant number — phi is Pearson's r when everything is a yes-or-no question."

Summary: Phi Coefficient

Phi measures association between two binary variables — requires a 2×2 contingency table
φ ranges from -1 to +1 — sign indicates direction of association
φ² = χ²/n — directly related to the chi-square statistic
Equivalent to Pearson's r when both variables are binary (0/1)
Symmetric: φ(A,B) = φ(B,A)
Limitations: only works for 2×2 tables; use Cramér's V for larger tables
