Discriminant Analysis

Introduction to Discriminant Analysis

Classifying Observations With Statistical Precision

Discriminant analysis finds linear or quadratic functions that best separate known groups, providing probabilistic classification rules grounded in multivariate normal theory. Fisher's criterion maximizes between-group separation relative to within-group variability.

  • Medical diagnosis: Classify patients into disease groups based on multiple clinical measurements
  • Biology: Identify species from morphometric measurements using LDA or QDA
  • Finance: Classify credit applicants as good or bad risks based on financial indicators

When its distributional assumptions hold, discriminant analysis draws the optimal boundary between groups in multivariate space.


Discriminant analysis is a supervised classification technique that finds linear combinations of features that best separate two or more predefined classes. R.A. Fisher introduced the method in 1936 and illustrated it on the iris dataset. It has since grown into a foundational tool in statistical pattern recognition, with deep connections to Bayesian decision theory and multivariate normal theory.

The central problem is: given a set of observations $\mathbf{x}_1, \dots, \mathbf{x}_n$ belonging to known classes $\mathcal{G} = \{1, \dots, K\}$, construct a rule that assigns a new observation $\mathbf{x}_{\text{new}}$ to one of these classes with minimal misclassification probability.

Probabilistic Foundations

Bayes' Classification Rule

Let $\pi_k = P(\mathcal{G} = k)$ denote the prior probability of class $k$, and $f_k(\mathbf{x}) = f(\mathbf{x} \mid \mathcal{G} = k)$ the class-conditional density. (With a slight abuse of notation, $\mathcal{G}$ denotes both the set of classes and the class label of an observation.) The posterior probability of class membership is given by Bayes' theorem:

$$P(\mathcal{G} = k \mid \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{\sum_{\ell=1}^{K} \pi_\ell f_\ell(\mathbf{x})}$$

The Bayes optimal classifier assigns $\mathbf{x}$ to the class with the highest posterior probability:

Definition: Bayes Optimal Classifier

The Bayes classifier $\delta^*(\mathbf{x})$ minimizes the expected misclassification rate and is given by:

$$\delta^*(\mathbf{x}) = \arg\max_{k \in \mathcal{G}} \, \pi_k f_k(\mathbf{x})$$

The denominator of the posterior is the same for every class, so maximizing the posterior is the same as maximizing $\pi_k f_k(\mathbf{x})$. Equivalently, assigning to class $k$ is optimal when $\pi_k f_k(\mathbf{x}) > \pi_\ell f_\ell(\mathbf{x})$ for all $\ell \neq k$.
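As a rough illustration of the rule above, the following sketch evaluates the posterior for a two-class problem with Gaussian class densities. The priors, means, and covariances are made-up values chosen for the example, and `gaussian_pdf` and `posterior` are hypothetical helper names, not part of any library.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at a single point x."""
    p = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm_const = np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm_const

def posterior(x, priors, means, covs):
    """Posterior P(G = k | x) for every class k via Bayes' theorem."""
    joint = np.array([pi * gaussian_pdf(x, m, S)
                      for pi, m, S in zip(priors, means, covs)])
    return joint / joint.sum()  # normalise by the sum over classes

# Illustrative (made-up) parameters for two classes in two dimensions.
priors = np.array([0.7, 0.3])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

x_new = np.array([1.0, 1.0])
post = posterior(x_new, priors, means, covs)
bayes_class = int(np.argmax(post))
print("posterior:", post, "-> class", bayes_class)
```

The point `x_new` is equidistant from both class means, so the two densities are equal there and the posterior reduces to the priors. The rule therefore picks the class with the larger prior.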

Discriminant Functions and Log-Odds

Define the discriminant function $\delta_k(\mathbf{x}) = \log[\pi_k f_k(\mathbf{x})]$. Because the logarithm is monotone, the classification rule becomes $\hat{y} = \arg\max_k \delta_k(\mathbf{x})$. The decision boundary between classes $k$ and $\ell$ satisfies:

$$\delta_k(\mathbf{x}) = \delta_\ell(\mathbf{x}) \quad \Longleftrightarrow \quad \log\frac{\pi_k f_k(\mathbf{x})}{\pi_\ell f_\ell(\mathbf{x})} = 0$$

This log-ratio equals the log posterior odds of class $k$ versus class $\ell$, and its sign determines which of the two classes is preferred.

Linear Discriminant Analysis (LDA)

Gaussian Assumption with Common Covariance

LDA assumes each class-conditional density is multivariate normal with a shared covariance matrix:

Definition: LDA Assumptions

For each class $k \in \{1, \dots, K\}$:

$$f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

where $\boldsymbol{\mu}_k$ is the class-$k$ mean vector and $\boldsymbol{\Sigma}$ is the common covariance matrix shared across all $K$ classes.

Substituting into the discriminant function and simplifying (dropping terms constant in $k$, namely the normalizing constant and the quadratic term $-\frac{1}{2}\mathbf{x}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x}$):

$$\delta_k(\mathbf{x}) = \mathbf{x}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k$$

This is linear in $\mathbf{x}$, hence the name. The decision boundary between any two classes is a hyperplane.
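To see that the dropped terms really do not affect the classification, the sketch below compares the linear score with the full log of the prior times the density. The parameter values are invented for illustration, and `lda_scores` and `log_joint` are hypothetical helper names.

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminant scores delta_k(x), one per class."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

def log_joint(x, means, cov, priors):
    """Full log[pi_k f_k(x)], including the terms that are constant in k."""
    p = len(x)
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    out = []
    for mu, pi in zip(means, priors):
        d = x - mu
        out.append(-0.5 * d @ cov_inv @ d - 0.5 * logdet
                   - 0.5 * p * np.log(2 * np.pi) + np.log(pi))
    return np.array(out)

# Made-up parameters: three classes, two features, one shared covariance.
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([-1.0, 2.0])]
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
priors = np.array([0.5, 0.3, 0.2])
x = np.array([1.0, 0.5])

scores = lda_scores(x, means, cov, priors)
full = log_joint(x, means, cov, priors)
offset = scores - full  # the same number for every class
print("offset per class:", offset)
print("same decision:", np.argmax(scores) == np.argmax(full))
```

The two score vectors differ by one offset shared by all classes, so the argmax is unchanged.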

Fisher's Linear Discriminant

Fisher's approach formulates discriminant analysis as an optimization problem. Project data onto a direction w\mathbf{w} such that the separation between classes is maximized relative to within-class variability:

Definition: Fisher's Criterion

Fisher's discriminant maximizes the ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\top} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\top} \mathbf{S}_W \mathbf{w}}$$

where $\mathbf{S}_B = \sum_{k=1}^{K} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^{\top}$ is the between-class scatter matrix and $\mathbf{S}_W = \sum_{k=1}^{K} \sum_{i: y_i = k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\top}$ is the within-class scatter matrix. Here $\boldsymbol{\mu}$ is the overall mean and $n_k$ is the number of observations in class $k$.

The optimal $\mathbf{w}$ is the leading eigenvector of $\mathbf{S}_W^{-1} \mathbf{S}_B$. For $K$ classes, we extract up to $K-1$ discriminant directions, because $\mathbf{S}_B$ is built from $K$ centered class means and so has rank at most $K-1$.

Theorem: Equivalence of Fisher and Bayes LDA

Under the LDA assumptions (Gaussian classes, common covariance), the Fisher discriminant directions span the same subspace as the normals to the Bayes optimal linear decision boundaries. Specifically, classifying to the nearest class centroid in the space spanned by the first $K-1$ eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$, with the directions scaled to unit within-class variance and the same $\log \pi_k$ adjustment, gives the same classification as the LDA rule.
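For two classes the criterion has a well-known closed-form maximizer proportional to $\mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$. The sketch below checks this numerically on simulated data by solving the generalized eigenproblem $\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$ with `scipy.linalg.eigh`. The data, the seed, and the helper name `scatter_matrices` are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

# Simulated two-class data with a shared covariance (illustrative values).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 1.0], cov, size=200)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

def scatter_matrices(X, y):
    """Within-class (S_W) and between-class (S_B) scatter matrices."""
    p = X.shape[1]
    mean_all = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
    return S_W, S_B

S_W, S_B = scatter_matrices(X, y)

# eigh(a, b) solves a v = lambda b v and returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(S_B, S_W)
w = eigvecs[:, -1]  # direction with the largest Fisher ratio

# Two-class closed form: w is proportional to S_W^{-1} (mean_1 - mean_0).
w_closed = np.linalg.solve(S_W, X1.mean(axis=0) - X0.mean(axis=0))
cosine = abs(w @ w_closed) / (np.linalg.norm(w) * np.linalg.norm(w_closed))
print("eigenvalues:", eigvals)
print("|cosine| between the two directions:", cosine)
```

With $K = 2$ only one eigenvalue is meaningfully nonzero, which matches the limit of $K-1$ discriminant directions.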

Mahalanobis Distance

The LDA classification rule can be recast in terms of distances:

Definition: Mahalanobis Distance

The Mahalanobis distance from observation $\mathbf{x}$ to class $k$ is:

$$d_M(\mathbf{x}, \boldsymbol{\mu}_k) = \sqrt{(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)}$$

Under LDA, the classification rule assigns $\mathbf{x}$ to the class whose centroid is closest in Mahalanobis distance, adjusted for prior probabilities:

$$\hat{y} = \arg\min_k \left[ d_M^2(\mathbf{x}, \boldsymbol{\mu}_k) - 2\log\pi_k \right]$$

The Mahalanobis distance accounts for correlations between variables and differing scales. When $\boldsymbol{\Sigma} = \mathbf{I}_p$, it reduces to ordinary Euclidean distance. The Mahalanobis distance is invariant to invertible linear transformations of the feature space, provided the mean and covariance are transformed accordingly.
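A small numerical check of these two properties, assuming made-up values for the mean, covariance, and transformation (the matrix `A` below has determinant 5, so it is invertible). The function name `mahalanobis` is a hypothetical helper.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance between x and mu under covariance cov."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Illustrative values: a positive definite covariance and an arbitrary point.
mu = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.3],
                [0.1, 0.3, 1.5]])
x = np.array([0.3, 1.2, -0.7])

# An invertible linear map plus a shift: x -> A x + b.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
b = np.array([1.0, -1.0, 2.0])

d_original = mahalanobis(x, mu, cov)
# The mean maps to A mu + b and the covariance to A cov A^T.
d_transformed = mahalanobis(A @ x + b, A @ mu + b, A @ cov @ A.T)

# With the identity covariance the distance is plain Euclidean distance.
d_identity = mahalanobis(x, mu, np.eye(3))
d_euclidean = float(np.linalg.norm(x - mu))

print(d_original, d_transformed)
print(d_identity, d_euclidean)
```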

Quadratic Discriminant Analysis (QDA)

When the covariance matrices differ across classes, the decision boundaries become quadratic surfaces:

Definition: QDA Classification Rule

Under the assumption $f_k(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ with class-specific covariances, the discriminant function is:

$$\delta_k(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| + \log\pi_k$$

The term $(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ is quadratic in $\mathbf{x}$ and no longer cancels between classes, producing quadratic decision boundaries.

QDA requires estimating $K$ covariance matrices $\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K$, each with $\frac{p(p+1)}{2}$ free parameters, totaling $\frac{Kp(p+1)}{2}$ covariance parameters. LDA shares a single covariance matrix, requiring only $\frac{p(p+1)}{2}$ parameters. The bias-variance tradeoff favors LDA when $n$ is small relative to $p$, and QDA when the true covariances genuinely differ and each class has enough observations to estimate its own matrix.
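The following sketch evaluates the QDA score for two classes that share a mean but differ in spread, a case a linear rule cannot separate, and also counts covariance parameters. All numbers are invented for illustration, and `qda_scores` and `n_cov_params` are hypothetical helper names.

```python
import numpy as np

def qda_scores(x, means, covs, priors):
    """Quadratic discriminant scores delta_k(x), one per class."""
    scores = []
    for mu, S, pi in zip(means, covs, priors):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)  # numerically stable log|Sigma_k|
        scores.append(-0.5 * d @ np.linalg.solve(S, d)
                      - 0.5 * logdet + np.log(pi))
    return np.array(scores)

def n_cov_params(p, K, shared):
    """Free covariance parameters: one matrix (LDA) or K matrices (QDA)."""
    per_matrix = p * (p + 1) // 2
    return per_matrix if shared else K * per_matrix

# Same mean, different spread: class 0 is tight, class 1 is diffuse.
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), 4.0 * np.eye(2)]
priors = np.array([0.5, 0.5])

near = int(np.argmax(qda_scores(np.array([0.0, 0.0]), means, covs, priors)))
far = int(np.argmax(qda_scores(np.array([3.0, 3.0]), means, covs, priors)))
print("class near the origin:", near, "| class far from it:", far)
print("LDA params:", n_cov_params(4, 3, shared=True),
      "| QDA params:", n_cov_params(4, 3, shared=False))
```

Points near the common mean go to the tight class and distant points to the diffuse class, so the boundary is a closed curve, not a hyperplane.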

Regularized Discriminant Analysis (RDA)

Friedman (1989) proposed a compromise between LDA and QDA by shrinking toward a common covariance:

$$\boldsymbol{\Sigma}_k(\alpha) = (1 - \alpha)\boldsymbol{\Sigma}_k + \alpha\boldsymbol{\Sigma}$$

where $\alpha \in [0, 1]$ controls the degree of pooling. When $\alpha = 0$, we obtain QDA; when $\alpha = 1$, we obtain LDA.
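A minimal sketch of the blend above, assuming the class and pooled covariance estimates are already available. The matrices are made-up values and `rda_covariance` is a hypothetical helper name. In practice $\alpha$ would typically be chosen by cross-validation.

```python
import numpy as np

def rda_covariance(cov_k, cov_pooled, alpha):
    """Blend a class covariance with the pooled one; alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * cov_k + alpha * cov_pooled

# Illustrative inputs: a nearly singular class covariance and a pooled one.
cov_k = np.array([[1.0, 0.999], [0.999, 1.0]])
cov_pooled = np.eye(2)

blended = rda_covariance(cov_k, cov_pooled, alpha=0.5)
print("condition number, class only:", np.linalg.cond(cov_k))
print("condition number, blended:  ", np.linalg.cond(blended))
```

Pooling pulls the ill-conditioned class estimate toward the better-behaved common matrix, which stabilizes the inverse used in the discriminant function.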

Parameter Estimation

Maximum Likelihood Estimation

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$:

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i: y_i = k} \mathbf{x}_i, \qquad \hat{\pi}_k = \frac{n_k}{n}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i: y_i = k}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^{\top}$$

The pooled covariance estimator $\hat{\boldsymbol{\Sigma}}$ divides by $n - K$ because one degree of freedom is lost for each estimated class mean, which makes it unbiased under the common-covariance assumption. The maximum likelihood estimator of the covariance divides by $n$ instead; $\hat{\boldsymbol{\mu}}_k$ and $\hat{\pi}_k$ are the maximum likelihood estimates.
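The estimators above can be computed directly with NumPy. This sketch uses simulated data, and `lda_estimates` is a hypothetical helper name. It returns the pooled covariance with both the $n - K$ and the $n$ divisor so the two can be compared.

```python
import numpy as np

def lda_estimates(X, y):
    """Class priors, class means, and pooled covariance (two divisors)."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    S_W = np.zeros((p, p))
    for c, mu in zip(classes, means):
        R = X[y == c] - mu          # residuals around the class mean
        S_W += R.T @ R
    return priors, means, S_W / (n - K), S_W / n

# Simulated data: three classes of unequal size, three features.
rng = np.random.default_rng(7)
sizes = [30, 50, 20]
X = np.vstack([rng.normal(loc=k, size=(m, 3)) for k, m in enumerate(sizes)])
y = np.repeat([0, 1, 2], sizes)

priors, means, cov_unbiased, cov_n = lda_estimates(X, y)

# Cross-check: pooled covariance as a weighted sum of per-class np.cov results.
n, K = len(y), 3
check = sum((m - 1) * np.cov(X[y == k], rowvar=False)
            for k, m in enumerate(sizes)) / (n - K)
print("priors:", priors)
print("matches weighted per-class covariances:", np.allclose(cov_unbiased, check))
```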

Linear Shrinkage Estimation

When $p$ is large relative to $n$, the sample covariance can be poorly conditioned. Ledoit-Wolf shrinkage provides a well-conditioned estimator:

$$\hat{\boldsymbol{\Sigma}}_{\text{shrunk}} = (1 - \lambda)\hat{\boldsymbol{\Sigma}} + \lambda \nu \mathbf{I}_p$$

where $\nu = \text{tr}(\hat{\boldsymbol{\Sigma}})/p$ is the average eigenvalue and $\lambda \in [0, 1]$ is the shrinkage intensity estimated analytically.
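The sketch below applies the shrinkage formula with a fixed, hand-picked $\lambda$ to show its effect on conditioning, then lets scikit-learn's `LedoitWolf` estimator choose the intensity. The data are simulated, the value $\lambda = 0.5$ is arbitrary, and `shrink_to_identity` is a hypothetical helper name.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrink_to_identity(S, lam):
    """(1 - lam) * S + lam * nu * I, with nu the average eigenvalue of S."""
    p = S.shape[0]
    nu = np.trace(S) / p
    return (1.0 - lam) * S + lam * nu * np.eye(p)

# Few observations relative to the dimension: n = 20, p = 15.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 15))
S = np.cov(X, rowvar=False)

S_half = shrink_to_identity(S, lam=0.5)   # fixed intensity, illustration only
print("condition number, sample covariance:", np.linalg.cond(S))
print("condition number, shrunk (lam=0.5): ", np.linalg.cond(S_half))

# Data-driven intensity from scikit-learn's Ledoit-Wolf estimator.
lw = LedoitWolf().fit(X)
print("estimated shrinkage intensity:", lw.shrinkage_)
```

Shrinkage keeps the trace of the covariance unchanged while pulling the eigenvalues toward their average. For LDA itself, scikit-learn's `LinearDiscriminantAnalysis` accepts `shrinkage='auto'` with the `'lsqr'` or `'eigen'` solver.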

Classification Assessment

Error Rate Estimation

The apparent (resubstitution) error rate $\hat{R}_{\text{app}} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}(\hat{y}_i \neq y_i)$ is optimistically biased because the same observations are used to fit and to evaluate the rule. Cross-validation and bootstrap methods provide less biased estimates.

Leave-one-out cross-validation for LDA can be computed efficiently. Each observation is classified using parameters estimated on the remaining $n-1$ points, and removing one observation changes its class mean and the pooled covariance only by a rank-one term, so the leave-one-out quantities can be obtained by incremental updates rather than $n$ full refits.
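For comparison, here is a brute-force version that simply refits the model $n$ times with scikit-learn's `LeaveOneOut` splitter. It does not use the incremental-update shortcut. It is only meant to show the two error estimates side by side on the iris data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()

# Apparent error: fit and evaluate on the same observations.
resub_error = 1.0 - lda.fit(X, y).score(X, y)

# Leave-one-out error: each fold holds out exactly one observation,
# so every fold's accuracy is either 0 or 1.
loo_acc = cross_val_score(lda, X, y, cv=LeaveOneOut())
loo_error = 1.0 - loo_acc.mean()

print(f"resubstitution error: {resub_error:.4f}")
print(f"leave-one-out error:  {loo_error:.4f}")
```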

Confusion Matrix and Beyond

For a $K$-class problem, the confusion matrix $\mathbf{C}$ where $C_{k\ell}$ counts true class $k$ predicted as class $\ell$ provides:

$$\text{Sensitivity}_k = \frac{C_{kk}}{\sum_{\ell} C_{k\ell}}, \qquad \text{Specificity}_k = \frac{\sum_{j\neq k, \ell\neq k} C_{j\ell}}{\sum_{j\neq k}\sum_{\ell} C_{j\ell}}$$
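These per-class rates can be read off a confusion matrix in a few lines. The matrix below is a made-up three-class example with rows as true classes and columns as predictions, and `per_class_rates` is a hypothetical helper name.

```python
import numpy as np

def per_class_rates(C):
    """Per-class sensitivity and specificity from a K x K confusion matrix."""
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)                  # true class k predicted as k
    fn = C.sum(axis=1) - tp          # true class k predicted as something else
    fp = C.sum(axis=0) - tp          # other classes predicted as k
    tn = C.sum() - tp - fn - fp      # other classes predicted as other classes
    return tp / (tp + fn), tn / (tn + fp)

# Made-up confusion matrix: rows are true classes, columns are predictions.
C = np.array([[50,  0,  0],
              [ 0, 48,  2],
              [ 0,  1, 49]])

sensitivity, specificity = per_class_rates(C)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
```

For class 1, for instance, 48 of its 50 true members are recovered (sensitivity 0.96), and 99 of the 100 observations from other classes are kept out of it (specificity 0.99).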

Python Implementation

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- LDA ---
lda = LinearDiscriminantAnalysis(solver='svd', n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

print("LDA explained variance ratio:", lda.explained_variance_ratio_)
print("LDA class priors:", lda.priors_)
print("LDA means shape:", lda.means_.shape)

lda_cv = cross_val_score(lda, X_scaled, y, cv=10, scoring='accuracy')
print(f"LDA 10-fold CV accuracy: {lda_cv.mean():.4f} (+/- {lda_cv.std():.4f})")

# --- QDA ---
qda = QuadraticDiscriminantAnalysis()
qda_cv = cross_val_score(qda, X_scaled, y, cv=10, scoring='accuracy')
print(f"QDA 10-fold CV accuracy: {qda_cv.mean():.4f} (+/- {qda_cv.std():.4f})")

# --- Fisher's LDA (manual) ---
def fisher_lda(X, y, n_components=2):
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_overall = X.mean(axis=0)

    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))

    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        n_c = X_c.shape[0]
        S_B += n_c * np.outer(mean_c - mean_overall, mean_c - mean_overall)

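    # Form S_W^{-1} S_B with a linear solve instead of an explicit inverse.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_W, S_B))
    # The product is not symmetric, so eig may return tiny imaginary parts;
    # sort on the real parts, largest first.
    idx = np.argsort(eigenvalues.real)[::-1][:n_components]
    return eigenvectors[:, idx].real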

W = fisher_lda(X_scaled, y, n_components=2)
X_fisher = X_scaled @ W
print("Fisher discriminant projections shape:", X_fisher.shape)

# --- Mahalanobis distance ---
def mahalanobis_to_class(x, mean_k, cov_inv):
    diff = x - mean_k
    return np.sqrt(diff @ cov_inv @ diff)

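# Compute per-class Mahalanobis distances for the first sample.
# With solver='svd', covariance_ is only stored when store_covariance=True.
lda_cov = LinearDiscriminantAnalysis(solver='svd', store_covariance=True)
lda_cov.fit(X_scaled, y)
cov_inv = np.linalg.inv(lda_cov.covariance_)  # invert once, reuse per class
x0 = X_scaled[0]
for k, c in enumerate(lda_cov.classes_):
    d = mahalanobis_to_class(x0, lda_cov.means_[k], cov_inv)
    print(f"  Mahalanobis distance to class {c}: {d:.4f}")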

Assumptions and Diagnostics

Key assumptions of LDA and QDA:

  1. Multivariate normality: Each class-conditional distribution is approximately multivariate normal. Check with Mardia's test, Q-Q plots of Mahalanobis distances, or Henze-Zirkler's test.

  2. Homoscedasticity (LDA only): The covariance matrices are equal across classes. Test with Box's M test (sensitive to non-normality; use with caution).

  3. No multicollinearity: Features should not be perfectly collinear. Examine the condition number of $\hat{\boldsymbol{\Sigma}}$.

  4. Independent observations: Each observation is independent of others.

  5. No significant outliers: Outliers distort mean and covariance estimates. Use robust Mahalanobis distances (e.g., minimum covariance determinant) for detection.

When assumptions are violated:

  • Non-normality: Consider nonparametric discriminant analysis or support vector machines.
  • Unequal covariances: Use QDA or regularized DA.
  • High dimensionality: Apply shrinkage estimators, variable selection, or dimension reduction prior to LDA.
  • Small samples: Use regularized LDA with cross-validated shrinkage parameter.
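Two of the checks listed above can be sketched numerically: comparing squared Mahalanobis distances with chi-square quantiles (the basis of the Q-Q plot) and inspecting the condition number of the estimated covariance. The data here are simulated from a Gaussian, so the check should pass. In practice it would be applied to one class at a time. The helper name `squared_mahalanobis` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row to the sample mean."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    R = X - mu
    return np.einsum("ij,ij->i", R @ np.linalg.inv(S), R)

# Simulated Gaussian sample standing in for the observations of one class.
rng = np.random.default_rng(42)
n, p = 500, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Under multivariate normality, d^2 is approximately chi-square with p
# degrees of freedom, so sorted distances should track chi-square quantiles.
d2 = np.sort(squared_mahalanobis(X))
theoretical = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
qq_corr = np.corrcoef(theoretical, d2)[0, 1]

cond_number = np.linalg.cond(np.cov(X, rowvar=False))
print("Q-Q correlation:", qq_corr)
print("covariance condition number:", cond_number)
```

A Q-Q correlation well below 1, or a few distances far above the upper chi-square quantiles, would point to non-normality or outliers. A very large condition number signals near-collinear features.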

Connection to Other Methods

Discriminant analysis is closely connected to several other classifiers. Under the LDA model, the posterior probabilities are a softmax of linear functions of $\mathbf{x}$, which is the same functional form as multinomial logistic regression or a single-layer neural network with softmax output. Logistic regression shares LDA's linear log-odds structure but does not require the normality assumption: it estimates the coefficients by maximizing the conditional likelihood, so its fitted boundary generally differs from LDA's. Gaussian naive Bayes restricts the covariance model by assuming diagonal $\boldsymbol{\Sigma}_k$, while kernel discriminant analysis handles non-Gaussian distributions through reproducing kernel Hilbert space embeddings.

The Bayesian framework also extends to mixture discriminant analysis (Hastie & Tibshirani, 1996), in which each class is modeled as a mixture of Gaussians, providing flexibility for multimodal class distributions.
