Introduction to Discriminant Analysis

Advanced Statistical Methods

Classifying Observations With Statistical Precision

Discriminant analysis finds linear or quadratic functions that best separate known groups, providing probabilistic classification rules grounded in multivariate normal theory. Fisher's criterion maximizes between-group separation relative to within-group variability.

Medical diagnosis: Classify patients into disease groups based on multiple clinical measurements
Biology: Identify species from morphometric measurements using LDA or QDA
Finance: Classify credit applicants as good or bad risks based on financial indicators

Discriminant analysis draws the optimal boundary between groups in multivariate space.

Discriminant analysis is a supervised classification technique that finds linear combinations of features that best separate two or more predefined classes. Introduced by R.A. Fisher in 1936 and illustrated on the iris dataset, the method has grown into a foundational tool in statistical pattern recognition, with deep connections to Bayesian decision theory and multivariate normal theory.

The central problem is: given a set of observations $\mathbf{x}_1, \dots, \mathbf{x}_n$ belonging to known classes $\mathcal{G} = \{1, \dots, K\}$, construct a rule that assigns a new observation $\mathbf{x}_{\text{new}}$ to one of these classes with minimal misclassification probability.

Probabilistic Foundations

Bayes' Classification Rule

Let $G$ denote the class label of an observation, taking values in $\mathcal{G}$. Let $\pi_k = P(G = k)$ denote the prior probability of class $k$, and $f_k(\mathbf{x}) = f(\mathbf{x} \mid G = k)$ the class-conditional density. The posterior probability of class membership is given by Bayes' theorem:

$$P(G = k \mid \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{\sum_{\ell=1}^{K} \pi_\ell f_\ell(\mathbf{x})}$$

The Bayes optimal classifier assigns $\mathbf{x}$ to the class with the highest posterior probability:

Definition: Bayes Optimal Classifier

The Bayes classifier $\delta^*(\mathbf{x})$ minimizes the expected misclassification rate and is given by:

$$\delta^*(\mathbf{x}) = \arg\max_{k \in \mathcal{G}} \, \pi_k f_k(\mathbf{x})$$

Equivalently, assigning to class $k$ is optimal when $\pi_k f_k(\mathbf{x}) > \pi_\ell f_\ell(\mathbf{x})$ for all $\ell \neq k$.
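As a minimal sketch of the rule above, the following computes posterior probabilities for a hypothetical two-class, one-dimensional problem. The priors, means, and variances are illustrative values, and the helper names `gaussian_pdf` and `posterior` are assumptions introduced here, not part of the original text.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate normal density, written out to keep the sketch dependency-light."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior(x, priors, means, variances):
    """Return P(G = k | x) for every class k via Bayes' theorem."""
    joint = priors * gaussian_pdf(x, means, variances)  # pi_k * f_k(x)
    return joint / joint.sum()                          # normalize over classes

# Hypothetical two-class problem (all numbers are illustrative).
priors = np.array([0.7, 0.3])
means = np.array([0.0, 2.0])
variances = np.array([1.0, 1.0])

# x = 1.0 is equidistant from both means, so the densities are equal
# and the posterior reduces to the prior.
post = posterior(1.0, priors, means, variances)
bayes_class = int(np.argmax(post))
print("posterior:", post, "-> assigned class", bayes_class)
```

Because the two densities coincide at the midpoint, the example isolates the role of the priors: with equal evidence from the data, the more common class wins.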

Discriminant Functions and Log-Odds

Define the discriminant function $\delta_k(\mathbf{x}) = \log[\pi_k f_k(\mathbf{x})]$. The classification rule becomes $\hat{y} = \arg\max_k \delta_k(\mathbf{x})$. The decision boundary between classes $k$ and $\ell$ satisfies:

$$\delta_k(\mathbf{x}) = \delta_\ell(\mathbf{x}) \quad \Longleftrightarrow \quad \log\frac{\pi_k f_k(\mathbf{x})}{\pi_\ell f_\ell(\mathbf{x})} = 0$$

This log-ratio is the posterior log-odds of class $k$ against class $\ell$: a positive value favors $k$, a negative value favors $\ell$.

Linear Discriminant Analysis (LDA)

Gaussian Assumption with Common Covariance

LDA assumes each class-conditional density is multivariate normal with a shared covariance matrix:

Definition: LDA Assumptions

For each class $k \in \{1, \dots, K\}$:

$$f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

where $\boldsymbol{\mu}_k$ is the class-$k$ mean vector and $\boldsymbol{\Sigma}$ is the common covariance matrix shared across all $K$ classes.

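Substituting into the discriminant function and dropping the terms that do not depend on $k$ (the normalizing constant and the quadratic term $-\frac{1}{2}\mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x}$, which is common to all classes because $\boldsymbol{\Sigma}$ is shared) gives:

$$\delta_k(\mathbf{x}) = \mathbf{x}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k$$

This is linear in $\mathbf{x}$, hence the name. The decision boundary between any two classes is a hyperplane.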
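The linear discriminant scores can be evaluated directly from the formula above. The sketch below assumes the class means, shared covariance, and priors are already known; the function name `lda_scores` and the toy parameter values are illustrative assumptions.

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminant scores delta_k(x) for a single observation x.

    means: (K, p) class means; cov: (p, p) shared covariance; priors: (K,).
    """
    # Solve Sigma a_k = mu_k for all k rather than forming an explicit inverse.
    A = np.linalg.solve(cov, means.T).T           # row k is Sigma^{-1} mu_k
    linear = A @ x                                 # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.einsum("kp,kp->k", means, A)   # 0.5 * mu_k^T Sigma^{-1} mu_k
    return linear - quad + np.log(priors)

# Illustrative two-class setup: identity covariance, equal priors.
means = np.array([[0.0, 0.0], [2.0, 0.0]])
cov = np.eye(2)
priors = np.array([0.5, 0.5])

for x in (np.array([0.5, 3.0]), np.array([1.5, -3.0])):
    scores = lda_scores(x, means, cov, priors)
    print(x, "->", scores, "class", int(np.argmax(scores)))
```

With these values the boundary is the vertical line where the first coordinate equals 1, the midpoint between the two means; the second coordinate has no effect on the assignment.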

Fisher's Linear Discriminant

Fisher's approach formulates discriminant analysis as an optimization problem. Project data onto a direction $\mathbf{w}$ such that the separation between classes is maximized relative to within-class variability:

Definition: Fisher's Criterion

Fisher's discriminant maximizes the ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\top} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\top} \mathbf{S}_W \mathbf{w}}$$

where $\mathbf{S}_B = \sum_{k=1}^{K} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^{\top}$ is the between-class scatter matrix, $\mathbf{S}_W = \sum_{k=1}^{K} \sum_{i: y_i = k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\top}$ is the within-class scatter matrix, $n_k$ is the number of observations in class $k$, and $\boldsymbol{\mu}$ is the overall mean.

The optimal $\mathbf{w}$ is the leading eigenvector of $\mathbf{S}_W^{-1} \mathbf{S}_B$. Because $\mathbf{S}_B$ has rank at most $K-1$, at most $\min(p, K-1)$ discriminant directions have a nonzero eigenvalue.

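Theorem: Equivalence of Fisher and Bayes LDA

Under the LDA assumptions (Gaussian classes, common covariance), the Fisher discriminant directions span the same subspace as the vectors $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k - \boldsymbol{\mu}_\ell)$, which are the normals of the Bayes optimal decision hyperplanes. Specifically, nearest-centroid classification (adjusted for the priors) in the coordinates given by the $K-1$ suitably normalized eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ with nonzero eigenvalue yields the same assignments as the LDA rule.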

Mahalanobis Distance

The LDA classification rule can be recast in terms of distances:

Definition: Mahalanobis Distance

The Mahalanobis distance from observation $\mathbf{x}$ to class $k$ is:

$$d_M(\mathbf{x}, \boldsymbol{\mu}_k) = \sqrt{(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)}$$

Under LDA, the classification rule assigns $\mathbf{x}$ to the class whose centroid is closest in Mahalanobis distance, adjusted for prior probabilities:

$$\hat{y} = \arg\min_k \left[ d_M^2(\mathbf{x}, \boldsymbol{\mu}_k) - 2\log\pi_k \right]$$

The Mahalanobis distance accounts for correlations between variables and differing scales. When $\boldsymbol{\Sigma} = \mathbf{I}_p$, it reduces to ordinary Euclidean distance. The Mahalanobis distance is invariant to invertible affine transformations of the feature space, provided the mean and covariance are transformed accordingly.
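The invariance property can be checked numerically. The sketch below is an illustration under stated assumptions: the matrix `A` is an arbitrary invertible matrix chosen for the demonstration (its determinant is 7), the shift `b` is arbitrary, and the helper `mahalanobis` is introduced here for convenience.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance, using a linear solve instead of an explicit inverse."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(3, 3))
cov = L @ L.T + np.eye(3)          # symmetric positive definite by construction
x = rng.normal(size=3)

# Invertible linear map plus a shift (illustrative values).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
b = np.array([0.3, 0.1, -0.7])

d_original = mahalanobis(x, mean, cov)
# Under z = A x + b the mean becomes A mu + b and the covariance A Sigma A^T.
d_transformed = mahalanobis(A @ x + b, A @ mean + b, A @ cov @ A.T)
d_identity = mahalanobis(x, mean, np.eye(3))   # reduces to Euclidean distance

print(d_original, d_transformed, d_identity, np.linalg.norm(x - mean))
```

The first two printed values agree up to floating-point error, and the last two agree because an identity covariance leaves the Euclidean norm unchanged.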

Quadratic Discriminant Analysis (QDA)

When the covariance matrices differ across classes, the decision boundaries become quadratic surfaces:

Definition: QDA Classification Rule

Under the assumption $f_k(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ with class-specific covariances, the discriminant function is:

$$\delta_k(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| + \log\pi_k$$

The term $(\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ is quadratic in $\mathbf{x}$, producing quadratic decision boundaries.

QDA requires estimating $K$ covariance matrices $\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K$, each with $\frac{p(p+1)}{2}$ free parameters, totaling $\frac{Kp(p+1)}{2}$ covariance parameters. LDA shares a single covariance matrix, requiring only $\frac{p(p+1)}{2}$ parameters. The bias-variance tradeoff favors LDA when $n$ is small relative to $p$, and QDA when the true covariances genuinely differ and each class has enough observations to estimate its own $\boldsymbol{\Sigma}_k$ reliably.
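A minimal sketch of the QDA scores and of the parameter counts quoted above follows. The function names `qda_scores` and `covariance_param_counts` and the toy parameters are assumptions made for illustration; the two classes share a mean and differ only in spread, so no linear rule could separate them.

```python
import numpy as np

def qda_scores(x, means, covs, priors):
    """Quadratic discriminant scores delta_k(x) for a single observation x."""
    scores = []
    for mu, cov, pi in zip(means, covs, priors):
        diff = x - mu
        maha_sq = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(cov)            # numerically stable log|Sigma_k|
        scores.append(-0.5 * maha_sq - 0.5 * logdet + np.log(pi))
    return np.array(scores)

def covariance_param_counts(p, K):
    """Free covariance parameters for LDA (one matrix) and QDA (K matrices)."""
    per_matrix = p * (p + 1) // 2
    return {"lda": per_matrix, "qda": K * per_matrix}

# Illustrative setup: same mean, tight class 0 and diffuse class 1.
means = np.zeros((2, 2))
covs = [np.eye(2), 4.0 * np.eye(2)]
priors = np.array([0.5, 0.5])

near = qda_scores(np.array([0.0, 0.0]), means, covs, priors)
far = qda_scores(np.array([3.0, 3.0]), means, covs, priors)
print("near origin -> class", int(np.argmax(near)))
print("far from origin -> class", int(np.argmax(far)))
print(covariance_param_counts(p=4, K=3))
```

Points near the shared mean go to the tight class and distant points to the diffuse class, so the boundary is a closed curve around the origin rather than a hyperplane.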

Regularized Discriminant Analysis (RDA)

Friedman (1989) proposed a compromise between LDA and QDA by shrinking toward a common covariance:

$$\boldsymbol{\Sigma}_k(\alpha) = (1 - \alpha)\boldsymbol{\Sigma}_k + \alpha\boldsymbol{\Sigma}$$

where $\alpha \in [0, 1]$ controls the degree of pooling. When $\alpha = 0$, we obtain QDA; when $\alpha = 1$, we obtain LDA.
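The convex combination written above is simple to express in code. This sketch covers only that blending step, not a full regularized classifier; the function name `rda_covariance` and the example matrices are illustrative assumptions. In practice $\alpha$ would typically be chosen by cross-validation.

```python
import numpy as np

def rda_covariance(cov_k, cov_pooled, alpha):
    """Blend a class covariance with the pooled covariance.

    alpha = 0 returns the class covariance (QDA);
    alpha = 1 returns the pooled covariance (LDA).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * cov_k + alpha * cov_pooled

# Illustrative matrices only.
cov_k = np.array([[4.0, 1.0], [1.0, 2.0]])
cov_pooled = np.array([[2.0, 0.0], [0.0, 2.0]])

for alpha in (0.0, 0.5, 1.0):
    print(alpha)
    print(rda_covariance(cov_k, cov_pooled, alpha))
```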

Parameter Estimation

Maximum Likelihood Estimation

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$:

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i: y_i = k} \mathbf{x}_i, \qquad \hat{\pi}_k = \frac{n_k}{n}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i: y_i = k}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^{\top}$$

The pooled covariance estimator $\hat{\boldsymbol{\Sigma}}$ divides by $n - K$ degrees of freedom (one is lost for each estimated class mean), which makes it unbiased under the common-covariance assumption. The strict maximum likelihood estimator divides by $n$ instead and is biased downward; the two differ by the factor $(n-K)/n$, which is negligible when $n \gg K$.
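The estimators above can be sketched in a few lines of NumPy. The function name `fit_lda_parameters` and the eight-point toy dataset are assumptions chosen so the result can be checked by hand; both divisors are returned to make the difference between them visible.

```python
import numpy as np

def fit_lda_parameters(X, y):
    """Estimate class means, priors, and the pooled covariance.

    Returns the pooled covariance with both the n - K divisor
    and the n divisor for comparison.
    """
    classes = np.unique(y)
    n, p = X.shape
    K = classes.size
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    scatter = np.zeros((p, p))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        scatter += centered.T @ centered          # within-class scatter
    return means, priors, scatter / (n - K), scatter / n

# Toy data: two square clusters of four points each.
X_toy = np.array([[0, 0], [2, 0], [0, 2], [2, 2],
                  [5, 5], [7, 5], [5, 7], [7, 7]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

means_hat, priors_hat, cov_nk, cov_n = fit_lda_parameters(X_toy, y_toy)
print(means_hat)
print(priors_hat)
print(cov_nk)   # divisor n - K = 6
print(cov_n)    # divisor n = 8
```

Here the within-class scatter is 8 times the identity, so the two estimates are (4/3) I and I respectively.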

Linear Shrinkage Estimation

When $p$ is large relative to $n$, the sample covariance can be poorly conditioned. Ledoit-Wolf shrinkage provides a well-conditioned estimator:

$$\hat{\boldsymbol{\Sigma}}_{\text{shrunk}} = (1 - \lambda)\hat{\boldsymbol{\Sigma}} + \lambda \nu \mathbf{I}_p$$

where $\nu = \text{tr}(\hat{\boldsymbol{\Sigma}})/p$ is the average eigenvalue and $\lambda \in [0, 1]$ is the shrinkage intensity estimated analytically.
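As a rough illustration, the shrinkage formula can be applied by hand and compared with scikit-learn's `LedoitWolf` estimator, which estimates the intensity from the data. The sketch applies shrinkage to the covariance of a single simulated sample rather than to a pooled within-class covariance, and the helper `shrink_toward_identity` and the simulated data are assumptions made for the example.

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

def shrink_toward_identity(cov, lam):
    """Apply (1 - lam) * cov + lam * nu * I with nu = tr(cov) / p."""
    p = cov.shape[0]
    nu = np.trace(cov) / p
    return (1.0 - lam) * cov + lam * nu * np.eye(p)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 15))          # fewer samples than features

sample_cov = empirical_covariance(X_demo)   # divides by n; rank-deficient here
lw = LedoitWolf().fit(X_demo)               # shrinkage intensity estimated analytically
manual = shrink_toward_identity(sample_cov, lw.shrinkage_)
fixed = shrink_toward_identity(sample_cov, 0.2)   # hand-picked intensity

print("estimated shrinkage intensity:", lw.shrinkage_)
print("smallest eigenvalue, sample covariance:", np.linalg.eigvalsh(sample_cov).min())
print("smallest eigenvalue, lambda = 0.2:     ", np.linalg.eigvalsh(fixed).min())
```

Any positive intensity lifts the smallest eigenvalue away from zero while leaving the trace unchanged, which is what makes the shrunk matrix safely invertible. For LDA itself, scikit-learn exposes the same idea through `LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")`.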

Classification Assessment

Error Rate Estimation

The apparent (resubstitution) error rate $\hat{R}_{\text{app}} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}(\hat{y}_i \neq y_i)$ is optimistically biased because the same observations are used to fit and to evaluate the rule. Cross-validation and bootstrap methods provide better estimates.

Leave-one-out cross-validation (LOOCV) is particularly cheap for LDA. Each observation is classified using parameters estimated on the remaining $n-1$ points, but removing the $i$-th observation changes only its own class mean and perturbs the pooled covariance by a rank-one term. The leave-one-out mean and inverse covariance can therefore be obtained by incremental updates (for example, with the Sherman-Morrison formula), so the LOOCV error rate can be computed without refitting the model $n$ times.
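The incremental update is not implemented here. As a reference point, the sketch below computes the same quantity by brute force, refitting LDA once per held-out observation with scikit-learn, and compares it with the apparent error rate on the iris data. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X_iris, y_iris = load_iris(return_X_y=True)
model = LinearDiscriminantAnalysis()

# Apparent error: fit and evaluate on the same observations.
apparent_error = np.mean(model.fit(X_iris, y_iris).predict(X_iris) != y_iris)

# Leave-one-out error: each prediction comes from a model fit on the other n - 1 points.
loo_pred = cross_val_predict(model, X_iris, y_iris, cv=LeaveOneOut())
loo_error = np.mean(loo_pred != y_iris)

print(f"apparent error: {apparent_error:.4f}")
print(f"LOOCV error:    {loo_error:.4f}")
```

Brute-force refitting costs $n$ model fits, which is harmless for 150 observations but is exactly the cost the closed-form update avoids on larger datasets.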

Confusion Matrix and Beyond

For a $K$-class problem, the confusion matrix $\mathbf{C}$, where $C_{k\ell}$ counts observations of true class $k$ predicted as class $\ell$, yields the per-class sensitivity and specificity:

$$\text{Sensitivity}_k = \frac{C_{kk}}{\sum_{\ell} C_{k\ell}}, \qquad \text{Specificity}_k = \frac{\sum_{j\neq k, \ell\neq k} C_{j\ell}}{\sum_{j\neq k}\sum_{\ell} C_{j\ell}}$$
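Both rates can be read off the confusion matrix with a few array operations. The sketch assumes rows index the true class and columns the predicted class, as in the definition above; the function name `per_class_rates` and the example counts are hypothetical.

```python
import numpy as np

def per_class_rates(C):
    """Per-class sensitivity and specificity from a K x K confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)                    # correctly predicted as class k
    actual_pos = C.sum(axis=1)         # all observations truly in class k
    fp = C.sum(axis=0) - tp            # predicted k but truly another class
    actual_neg = C.sum() - actual_pos  # all observations not in class k
    tn = actual_neg - fp               # neither truly k nor predicted k
    return tp / actual_pos, tn / actual_neg

# Hypothetical three-class confusion matrix with 50 observations per class.
C_example = np.array([[50, 0, 0],
                      [0, 48, 2],
                      [0, 1, 49]])

sensitivity, specificity = per_class_rates(C_example)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
```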

Python Implementation

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- LDA ---
lda = LinearDiscriminantAnalysis(solver='svd', n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

print("LDA explained variance ratio:", lda.explained_variance_ratio_)
print("LDA class priors:", lda.priors_)
print("LDA means shape:", lda.means_.shape)

lda_cv = cross_val_score(lda, X_scaled, y, cv=10, scoring='accuracy')
print(f"LDA 10-fold CV accuracy: {lda_cv.mean():.4f} (+/- {lda_cv.std():.4f})")

# --- QDA ---
qda = QuadraticDiscriminantAnalysis()
qda_cv = cross_val_score(qda, X_scaled, y, cv=10, scoring='accuracy')
print(f"QDA 10-fold CV accuracy: {qda_cv.mean():.4f} (+/- {qda_cv.std():.4f})")

# --- Fisher's LDA (manual) ---
def fisher_lda(X, y, n_components=2):
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_overall = X.mean(axis=0)

    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))

    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        n_c = X_c.shape[0]
        S_B += n_c * np.outer(mean_c - mean_overall, mean_c - mean_overall)

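    # S_W^{-1} S_B is not symmetric, so eig can return complex values with
    # negligible imaginary parts; sort on the real parts. A linear solve
    # avoids forming the explicit inverse of S_W.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_W, S_B))
    idx = np.argsort(eigenvalues.real)[::-1][:n_components]
    return eigenvectors[:, idx].real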

W = fisher_lda(X_scaled, y, n_components=2)
X_fisher = X_scaled @ W
print("Fisher discriminant projections shape:", X_fisher.shape)

# --- Mahalanobis distance ---
def mahalanobis_to_class(x, mean_k, cov_inv):
    diff = x - mean_k
    return np.sqrt(diff @ cov_inv @ diff)

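# Compute per-class Mahalanobis distances for the first sample.
# With solver='svd', covariance_ is only stored when store_covariance=True.
lda_cov = LinearDiscriminantAnalysis(solver='svd', store_covariance=True)
lda_cov.fit(X_scaled, y)
cov_inv = np.linalg.inv(lda_cov.covariance_)  # invert once, outside the loop
x0 = X_scaled[0]
for k, c in enumerate(lda_cov.classes_):
    d = mahalanobis_to_class(x0, lda_cov.means_[k], cov_inv)
    print(f"  Mahalanobis distance to class {c}: {d:.4f}")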

Assumptions and Diagnostics

Key assumptions of LDA and QDA:

Multivariate normality: Each class-conditional distribution is approximately multivariate normal. Check with Mardia's test, Q-Q plots of Mahalanobis distances, or Henze-Zirkler's test.
Homoscedasticity (LDA only): The covariance matrices are equal across classes. Test with Box's M test (sensitive to non-normality; use with caution).
No multicollinearity: Features should not be perfectly collinear. Examine the condition number of $\hat{\boldsymbol{\Sigma}}$.
Independent observations: Each observation is independent of others.
No significant outliers: Outliers distort mean and covariance estimates. Use robust Mahalanobis distances (e.g., minimum covariance determinant) for detection.
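Two of these checks are easy to sketch: a chi-square Q-Q comparison of squared Mahalanobis distances, and the condition number of the covariance estimate. The example runs on simulated Gaussian data for a single class, and the helper `chi2_qq_points` is an assumption introduced here; it is an informal diagnostic, not a replacement for the formal tests named above.

```python
import numpy as np
from scipy import stats

def chi2_qq_points(X):
    """Return (theoretical chi-square quantiles, sorted squared Mahalanobis distances).

    Under multivariate normality the squared distances are approximately
    chi-square distributed with p degrees of freedom.
    """
    n, p = X.shape
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                      # divisor n - 1
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
    return stats.chi2.ppf(probs, df=p), np.sort(d2)

rng = np.random.default_rng(0)
X_class = rng.normal(size=(200, 4))                    # simulated single-class sample

theoretical, observed = chi2_qq_points(X_class)
condition_number = np.linalg.cond(np.cov(X_class, rowvar=False))

# Points near the 45-degree line are consistent with normality.
print(np.column_stack([theoretical, observed])[::40])
print("condition number of covariance estimate:", condition_number)
```

In an applied setting the same computation would be run separately within each class, with large upward departures in the right tail pointing at outliers or heavy tails.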

When assumptions are violated:

Non-normality: Consider nonparametric discriminant analysis or support vector machines.
Unequal covariances: Use QDA or regularized DA.
High dimensionality: Apply shrinkage estimators, variable selection, or dimension reduction prior to LDA.
Small samples: Use regularized LDA with cross-validated shrinkage parameter.

Connection to Other Methods

Discriminant analysis is closely connected to several other methods. Under the LDA model, the posterior class probabilities are a softmax of linear functions of $\mathbf{x}$, the same functional form as multinomial logistic regression or a single-layer neural network with softmax output. Logistic regression assumes only this linear log-odds structure and fits it by maximizing the conditional likelihood, without the normality assumption, so its estimated boundary is linear like LDA's but generally not identical to it. Gaussian naive Bayes restricts each $\boldsymbol{\Sigma}_k$ to be diagonal, while kernel discriminant analysis handles non-Gaussian distributions through reproducing kernel Hilbert space embeddings.
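The softmax connection can be verified numerically: applying a softmax to the linear discriminant scores gives the same posterior as Bayes' theorem with the full Gaussian densities, because the terms dropped from the scores are identical across classes. The parameter values below are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal

# Illustrative three-class LDA model with a shared covariance.
means = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 2.0]])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
priors = np.array([0.5, 0.3, 0.2])
x = np.array([0.5, 1.0])

# Linear discriminant scores: x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k.
A = np.linalg.solve(cov, means.T).T
scores = A @ x - 0.5 * np.einsum("kp,kp->k", means, A) + np.log(priors)
posterior_softmax = softmax(scores)

# Direct application of Bayes' theorem with the Gaussian densities.
joint = np.array([pi * multivariate_normal(mean=mu, cov=cov).pdf(x)
                  for mu, pi in zip(means, priors)])
posterior_bayes = joint / joint.sum()

print(posterior_softmax)
print(posterior_bayes)
```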

The generative framework also extends to mixture discriminant analysis (Hastie & Tibshirani, 1996), in which each class is modeled as a mixture of Gaussians, providing flexibility for multimodal class distributions.
