Phi Coefficient — Correlation Between Two Binary Variables

When Both Variables Are Yes-or-No Questions

Many real-world variables are binary: yes/no, pass/fail, true/false. The phi coefficient takes two such variables and produces a single correlation value that tells you whether they tend to agree, and how strongly.

Key things the phi coefficient helps you understand:

  • Binary association: whether two dichotomous variables tend to move together or in opposite directions.
  • Chi-square bridge: phi is derived directly from the chi-square statistic (φ² = χ²/n), linking the test of independence to an effect size.
  • Symmetric and bounded: φ ranges from -1 to +1, just like Pearson's r, and φ(A,B) = φ(B,A).

For 2×2 tables, phi is a simple and widely used measure of association.
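
The boundedness and symmetry properties listed above can be checked numerically. The sketch below uses a hypothetical helper, `phi_coefficient`, written for this illustration (it is not a library function), and made-up tables for perfect agreement and perfect disagreement.

```python
import numpy as np

def phi_coefficient(table):
    """Signed phi for a 2x2 table of counts [[a, b], [c, d]] (illustrative helper)."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    denom = np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return float("nan")  # an empty row or column leaves phi undefined
    return float((a * d - b * c) / denom)

agree = [[50, 0], [0, 50]]      # made-up counts: the variables always match
disagree = [[0, 50], [50, 0]]   # made-up counts: the variables never match
mixed = [[30, 10], [15, 45]]

phi_agree = phi_coefficient(agree)
phi_disagree = phi_coefficient(disagree)
phi_mixed = phi_coefficient(mixed)
# Transposing the table swaps the roles of the two variables
phi_mixed_swapped = phi_coefficient(np.transpose(mixed))

print(f"perfect agreement:    {phi_agree:+.4f}")
print(f"perfect disagreement: {phi_disagree:+.4f}")
print(f"mixed table:          {phi_mixed:+.4f}")
print(f"mixed, roles swapped: {phi_mixed_swapped:+.4f}")
```

Transposing leaves the numerator ad - bc and the product of the four margins unchanged, which is why the last two values coincide.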


What is the Phi Coefficient?

Definition: Phi Coefficient

The phi coefficient (φ) is a measure of association between two binary variables, computed from their 2×2 contingency table. It ranges from -1 (perfect negative association) to +1 (perfect positive association).

Phi Coefficient Formula

$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

Here,

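  • a = count where both variables are 1 (true positives)
  • b = count where X=1, Y=0
  • c = count where X=0, Y=1
  • d = count where both variables are 0 (true negatives)

import numpy as np
from scipy import stats

# 2x2 contingency table of counts
#               Y=0    Y=1
# X=0           30     10
# X=1           15     45
# Phi is unchanged when 0 and 1 are swapped for BOTH variables, so it does not
# matter that the top-left cell here holds the (0, 0) counts rather than (1, 1).
table = np.array([[30, 10],   # X=0: Y=0 -> 30, Y=1 -> 10
                  [15, 45]])  # X=1: Y=0 -> 15, Y=1 -> 45

# correction=False turns off Yates' continuity correction, which SciPy applies
# to 2x2 tables by default and which would break the identity phi^2 = chi2 / n.
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
n = table.sum()
phi = np.sqrt(chi2 / n)  # magnitude only: the square root discards the sign

print(f"2x2 Table:\n{table}")
print(f"\nChi-square = {chi2:.4f}")
print(f"|Phi| (|φ|) = {phi:.4f}")
print(f"p-value    = {p_value:.6f}")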
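
To make the formula concrete, the counts from the example table can be substituted by hand. This worked example takes a = 30, b = 10, c = 15, d = 45, reading the cells left to right and top to bottom. That reading is an assumption about the layout, but φ comes out the same if 0 and 1 are swapped for both variables.

```latex
\begin{aligned}
\phi &= \frac{(30)(45) - (10)(15)}{\sqrt{(30+10)(15+45)(30+15)(10+45)}} \\
     &= \frac{1350 - 150}{\sqrt{40 \cdot 60 \cdot 45 \cdot 55}} \\
     &= \frac{1200}{\sqrt{5\,940\,000}} \\
     &\approx 0.492
\end{aligned}
```

The positive sign reflects that the diagonal cells (30 and 45) dominate the off-diagonal cells (10 and 15).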

Relationship to Chi-Square

Phi from Chi-Square

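$$|\phi| = \sqrt{\frac{\chi^2}{n}}$$

Here,

  • χ² = Pearson chi-square statistic from the 2×2 table, computed without a continuity correction
  • n = total sample size

The square root recovers only the magnitude of φ; the sign comes from ad - bc.

# Verify: phi² = chi²/n (holds for the uncorrected chi-square only)
chi2_uncorrected, _, _, _ = stats.chi2_contingency(table, correction=False)
phi_squared = chi2_uncorrected / n
print(f"φ² = {phi_squared:.4f}")
print(f"|φ| = {np.sqrt(phi_squared):.4f}")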
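
Two practical details are easy to miss when going from chi-square to phi: SciPy's `chi2_contingency` applies Yates' continuity correction to 2×2 tables unless `correction=False` is passed, and the square root cannot recover the sign. The sketch below illustrates both. `signed_phi` is a hypothetical helper written for this example, and the table is the one used in this lesson.

```python
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [15, 45]])
n = table.sum()

# Pearson chi-square without, then with, Yates' continuity correction
chi2_plain, _, _, _ = stats.chi2_contingency(table, correction=False)
chi2_yates, _, _, _ = stats.chi2_contingency(table, correction=True)

def signed_phi(t):
    """Signed phi from the four cell counts (illustrative helper)."""
    a, b, c, d = np.asarray(t, dtype=float).ravel()
    return float((a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d)))

phi = signed_phi(table)

# Recode X by swapping its 0 and 1 rows: chi-square stays put, phi changes sign
recoded = table[::-1]
phi_recoded = signed_phi(recoded)
chi2_recoded, _, _, _ = stats.chi2_contingency(recoded, correction=False)

print(f"sqrt(chi2/n), no correction: {np.sqrt(chi2_plain / n):.4f}")
print(f"sqrt(chi2/n), Yates:         {np.sqrt(chi2_yates / n):.4f}")
print(f"signed phi:                  {phi:+.4f}")
print(f"signed phi after recoding X: {phi_recoded:+.4f}")
```

Only the uncorrected value matches |φ|. When the direction of the association matters, computing φ from the cell counts is the safer route.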

Relationship to Other Measures

For a 2×2 table, phi is equivalent to the Pearson correlation between the two binary variables when each is coded as 0/1. Its magnitude also equals the square root of the chi-square statistic divided by n.
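
The equivalence with Pearson's r can be checked by expanding the table into one 0/1 pair per observation. This is a minimal sketch: the row/column layout (rows X=0, X=1; columns Y=0, Y=1) is an assumption made for the illustration.

```python
import numpy as np

table = np.array([[30, 10],
                  [15, 45]])
counts = table.ravel()

# One (x, y) pair per observation; assumed cell order: (0,0), (0,1), (1,0), (1,1)
x = np.repeat([0, 0, 1, 1], counts)
y = np.repeat([0, 1, 0, 1], counts)

# Ordinary Pearson correlation on the 0/1 codes
r = np.corrcoef(x, y)[0, 1]

# Phi from the cell counts
a, b, c, d = counts
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"Pearson r on 0/1 data = {r:.4f}")
print(f"phi from the table    = {phi:.4f}")
```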


Interpretation

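|φ| value        Interpretation
0.00 to 0.10     Negligible association
0.10 to 0.30     Weak association
0.30 to 0.50     Moderate association
0.50 and above   Strong association

These bands apply to the magnitude of φ; the sign gives the direction of the association.

# Manual calculation from the four cell counts (uses `table` from the first example)
a, b, c, d = table[0,0], table[0,1], table[1,0], table[1,1]
phi_manual = (a*d - b*c) / np.sqrt((a+b)*(c+d)*(a+c)*(b+d))
print(f"Manual φ = {phi_manual:.4f}")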
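
The interpretation bands can be wrapped in a small helper. `interpret_phi` is a hypothetical function for illustration. The cutoffs are the lesson's rules of thumb rather than fixed standards, and assigning a boundary value such as 0.30 to the higher band is an assumption.

```python
def interpret_phi(phi):
    """Return (strength, direction) using the rule-of-thumb bands for |phi|."""
    magnitude = abs(phi)
    if magnitude < 0.10:
        strength = "negligible"
    elif magnitude < 0.30:
        strength = "weak"
    elif magnitude < 0.50:
        strength = "moderate"
    else:
        strength = "strong"

    if phi > 0:
        direction = "positive"
    elif phi < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

for value in (0.4924, -0.05, 0.5, -0.72):
    strength, direction = interpret_phi(value)
    print(f"phi = {value:+.4f}: {strength}, direction: {direction}")
```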

Phi Coefficient in Machine Learning

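ML Application          Phi Usage                              Why
Binary classification   Phi coefficient of confusion matrix    Uses all four cells; known as the Matthews correlation coefficient (MCC)
Feature selection       Binary vs binary association           Quick screening
Medical ML              Disease vs symptom association         Clinical feature selection

import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# Cast to float so the product of the four margins cannot overflow on large counts
denom = np.sqrt(float(tp+fp) * float(tp+fn) * float(tn+fp) * float(tn+fn))
# Phi is undefined when a row or column of the matrix is empty; report 0.0 then
phi = (tp*tn - fp*fn) / denom if denom > 0 else 0.0
print(f"Confusion matrix: TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Phi coefficient: {phi:.4f}")
# scikit-learn's MCC is the same quantity and should match the value above
print(f"matthews_corrcoef: {matthews_corrcoef(y_test, y_pred):.4f}")
print("Phi = 1 is perfect, 0 is no better than chance, -1 is total disagreement")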
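
The feature-selection row of the table can be sketched in the same spirit. The example below ranks a few binary features by |φ| against a binary target. The data and feature names are synthetic and purely illustrative, and `phi_binary` is a hypothetical helper that relies on φ equalling Pearson's r for 0/1 data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)

# Synthetic binary features: about 20% of rows are flipped relative to the target
flip = rng.random(n) < 0.2
features = {
    "mostly_agrees": np.where(flip, 1 - target, target),
    "mostly_disagrees": np.where(flip, target, 1 - target),
    "unrelated": rng.integers(0, 2, size=n),
}

def phi_binary(x, y):
    """Phi for two 0/1 arrays, computed as their Pearson correlation."""
    return float(np.corrcoef(x, y)[0, 1])

scores = {name: phi_binary(col, target) for name, col in features.items()}

# Rank by magnitude: a strongly negative phi is as informative as a positive one
ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, phi in ranked:
    print(f"{name:>18}: phi = {phi:+.3f}")
```

Ranking by |φ| is only a quick first pass: it looks at one feature at a time and ignores interactions between features.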

Key Takeaways

Phi measures association between two binary variables — requires a 2×2 contingency table and nothing else.

φ ranges from -1 to +1 — sign indicates the direction of association, just like Pearson's r.

φ² = χ²/n — directly related to the chi-square statistic, making it a natural effect size for the chi-square test of independence.

Only for 2×2 tables — for larger tables, graduate to Cramér's V, which generalizes phi to any r×c table.

"Two binary variables, one elegant number — phi is Pearson's r when everything is a yes-or-no question."

Summary: Phi Coefficient

  • Phi measures association between two binary variables — requires a 2×2 contingency table
  • φ ranges from -1 to +1 — sign indicates direction of association
  • φ² = χ²/n — directly related to the chi-square statistic
  • Equivalent to Pearson's r when both variables are binary (0/1)
  • Symmetric: φ(A,B) = φ(B,A)
  • Limitations: only works for 2×2 tables; use Cramér's V for larger tables
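
For tables larger than 2×2, Cramér's V rescales chi-square by the smaller table dimension: V = sqrt(χ² / (n · min(r-1, c-1))). The sketch below is a minimal, uncorrected version. `cramers_v` is a hypothetical helper, and the 3×2 counts are made up for illustration.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for an r x c table of counts (no continuity or bias correction)."""
    table = np.asarray(table)
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1  # equals 1 for a 2x2 table, so V reduces to |phi|
    return float(np.sqrt(chi2 / (n * k)))

two_by_two = [[30, 10], [15, 45]]
three_by_two = [[20, 10], [15, 15], [5, 35]]  # made-up counts

v_2x2 = cramers_v(two_by_two)
v_3x2 = cramers_v(three_by_two)

print(f"Cramér's V, 2x2 table: {v_2x2:.4f}")
print(f"Cramér's V, 3x2 table: {v_3x2:.4f}")
```

Unlike φ, V is never negative, because direction is not well defined once a variable has more than two categories.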
