Introduction to Discriminant Analysis
Advanced Statistical Methods
Classifying Observations With Statistical Precision
Discriminant analysis finds linear or quadratic functions that best separate known groups, providing probabilistic classification rules grounded in multivariate normal theory. Fisher's criterion maximizes between-group separation.
- Medical diagnosis: Classify patients into disease groups based on multiple clinical measurements
- Biology: Identify species from morphometric measurements using LDA or QDA
- Finance: Classify credit applicants as good or bad risks based on financial indicators
Discriminant analysis draws the optimal boundary between groups in multivariate space.
Discriminant analysis is a supervised classification technique that seeks linear combinations of features that best separate two or more predefined classes. Introduced by R.A. Fisher in 1936, who illustrated it on the iris dataset, the method has grown into a foundational tool in statistical pattern recognition, with deep connections to Bayesian decision theory and multivariate normal theory.
The central problem is: given a set of observations $x_1, \dots, x_n \in \mathbb{R}^p$ belonging to known classes $1, \dots, K$, construct a rule that assigns a new observation $x$ to one of these classes with minimal misclassification probability.
Probabilistic Foundations
Bayes' Classification Rule
Let $\pi_k$ denote the prior probability of class $k$, and $f_k(x)$ the class-conditional density. The posterior probability of class membership is given by Bayes' theorem:
$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
The Bayes optimal classifier assigns $x$ to the class with the highest posterior probability:
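Definition: Bayes Optimal Classifier
The Bayes classifier minimizes the expected misclassification rate and is given by:
$$\hat{y}(x) = \arg\max_{k} P(Y = k \mid X = x) = \arg\max_{k} \pi_k f_k(x)$$
Equivalently, assigning $x$ to class $k$ is optimal when $\pi_k f_k(x) \ge \pi_l f_l(x)$ for all $l \ne k$.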
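As a minimal sketch of the Bayes rule, the snippet below computes posteriors for a univariate two-class problem. The priors, means, and standard deviations are made-up illustration values, and the helper names `gaussian_pdf` and `bayes_posteriors` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Univariate normal density, written out to avoid extra dependencies."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

def bayes_posteriors(x, priors, means, stds):
    """Posterior P(Y = k | X = x) for univariate Gaussian class densities."""
    priors = np.asarray(priors, dtype=float)
    densities = gaussian_pdf(x, np.asarray(means, dtype=float),
                             np.asarray(stds, dtype=float))
    joint = priors * densities          # prior times class-conditional density
    return joint / joint.sum()          # normalize by the marginal density of x

# Illustrative (assumed) setup: priors 0.7 / 0.3, means 0 and 2, unit variances.
priors = [0.7, 0.3]
means = [0.0, 2.0]
stds = [1.0, 1.0]

# x = 1 is equidistant from both means, so the densities cancel
# and the posterior equals the prior.
post = bayes_posteriors(1.0, priors, means, stds)
pred = int(np.argmax(post))             # Bayes rule: pick the largest posterior
```

At the midpoint the data carry no information about the class, so the decision is driven entirely by the priors.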
Discriminant Functions and Log-Odds
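Define the discriminant function $\delta_k(x) = \log \pi_k + \log f_k(x)$. The classification rule becomes $\hat{y}(x) = \arg\max_k \delta_k(x)$. The decision boundary between classes $k$ and $l$ satisfies:
$$\delta_k(x) - \delta_l(x) = \log \frac{\pi_k f_k(x)}{\pi_l f_l(x)} = 0$$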
This log-ratio is the log-odds of class membership, and its sign determines the assignment.
Linear Discriminant Analysis (LDA)
Gaussian Assumption with Common Covariance
LDA assumes each class-conditional density is multivariate normal with a shared covariance matrix:
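Definition: LDA Assumptions
For each class $k = 1, \dots, K$:
$$X \mid Y = k \sim \mathcal{N}_p(\mu_k, \Sigma)$$
where $\mu_k$ is the class-$k$ mean vector and $\Sigma$ is the common covariance matrix shared across all classes.
Substituting into the discriminant function and simplifying (dropping terms constant in $k$):
$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$
This is linear in $x$, hence the name. The decision boundary between any two classes is a hyperplane.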
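The linear discriminant score can be sketched directly in NumPy. The means, covariance, and priors below are hypothetical illustration values, and `lda_scores` is a name introduced here, not part of any library.

```python
import numpy as np

def lda_scores(X, means, cov, priors):
    """Linear discriminant scores for every row of X and every class."""
    # Columns of A are Sigma^{-1} mu_k; solving avoids an explicit inverse.
    A = np.linalg.solve(cov, means.T)               # shape (p, K)
    const = -0.5 * np.sum(means.T * A, axis=0)      # -0.5 * mu_k^T Sigma^{-1} mu_k
    return X @ A + const + np.log(priors)           # shape (n, K)

# Assumed parameters, chosen only for illustration.
means = np.array([[0.0, 0.0], [2.0, 2.0]])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.5, 0.5])

X_new = np.array([[0.2, -0.1], [1.9, 2.3], [1.0, 1.0]])
scores = lda_scores(X_new, means, cov, priors)
labels = scores.argmax(axis=1)                      # class with the largest score
```

The third point is the midpoint of the two means, so with equal priors its two scores coincide: it lies on the separating hyperplane.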
Fisher's Linear Discriminant
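Fisher's approach formulates discriminant analysis as an optimization problem. Project data onto a direction $w$ such that the separation between classes is maximized relative to within-class variability:
Definition: Fisher's Criterion
Fisher's discriminant maximizes the ratio:
$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$
where $S_B = \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^\top$ is the between-class scatter matrix and $S_W = \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^\top$ is the within-class scatter matrix.
The optimal $w$ is the leading eigenvector of $S_W^{-1} S_B$. For $K$ classes, we extract up to $K - 1$ discriminant directions, since $S_B$ has rank at most $K - 1$.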
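For two classes the between-class scatter has rank one, so the leading eigenvector is proportional to the within-class scatter inverse applied to the difference of the class means. The sketch below checks this numerically on synthetic data (the data and variable names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data, used only to illustrate the identity.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Closed-form two-class Fisher direction: S_W^{-1} (m1 - m0).
w_closed = np.linalg.solve(S_W, m1 - m0)

# Same direction from the eigenproblem S_W^{-1} S_B w = lambda w.
m = np.vstack([X0, X1]).mean(axis=0)
S_B = (len(X0) * np.outer(m0 - m, m0 - m)
       + len(X1) * np.outer(m1 - m, m1 - m))
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# Directions are only defined up to sign and scale, so compare |cosine|.
cosine = abs(w_closed @ w_eig) / (np.linalg.norm(w_closed) * np.linalg.norm(w_eig))
```

The cosine should be 1 up to floating-point error, confirming that both routes give the same projection direction.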
Theorem: Equivalence of Fisher and Bayes LDA
Under the LDA assumptions (Gaussian classes, common covariance), the Fisher discriminant directions span the same subspace as the normal vectors of the Bayes optimal linear decision boundaries. Specifically, classifying in the space spanned by the first $K - 1$ eigenvectors of $S_W^{-1} S_B$ gives the same classification as the LDA rule.
Mahalanobis Distance
The LDA classification rule can be recast in terms of distances:
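Definition: Mahalanobis Distance
The Mahalanobis distance from observation $x$ to class $k$ is:
$$d_M(x, \mu_k) = \sqrt{(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)}$$
Under LDA, the classification rule assigns $x$ to the class whose centroid is closest in Mahalanobis distance, adjusted for prior probabilities:
$$\hat{y}(x) = \arg\min_{k} \left[ \tfrac{1}{2}\, d_M^2(x, \mu_k) - \log \pi_k \right]$$
The Mahalanobis distance accounts for correlations between variables and differing scales. When $\Sigma = I$, it reduces to ordinary Euclidean distance. It is invariant to invertible linear transformations of the feature space, provided the means and covariance are transformed accordingly.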
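Both properties can be checked numerically. In this sketch the point, mean, covariance, and the invertible matrix `A` are arbitrary values chosen for illustration.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance between x and mean under covariance cov."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.5])
cov = np.array([[2.0, 0.6], [0.6, 1.0]])

# An arbitrary invertible matrix (determinant 7), assumed for illustration.
A = np.array([[3.0, 1.0], [-1.0, 2.0]])

d_original = mahalanobis(x, mu, cov)
# Transform point, mean, and covariance consistently: Sigma -> A Sigma A^T.
d_transformed = mahalanobis(A @ x, A @ mu, A @ cov @ A.T)

# With the identity covariance the distance is the ordinary Euclidean norm.
d_identity = mahalanobis(x, mu, np.eye(2))
d_euclid = float(np.linalg.norm(x - mu))
```

The invariance only holds when the covariance is transformed along with the data; rescaling the features while keeping the old covariance changes the distance.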
Quadratic Discriminant Analysis (QDA)
When the covariance matrices differ across classes, the decision boundaries become quadratic surfaces:
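Definition: QDA Classification Rule
Under the assumption $X \mid Y = k \sim \mathcal{N}_p(\mu_k, \Sigma_k)$ with class-specific covariances, the discriminant function is:
$$\delta_k(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
The term $(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)$ is quadratic in $x$, producing quadratic decision boundaries.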
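A minimal sketch of the quadratic score follows. The two classes share a mean and differ only in spread, which is a deliberately artificial setup (all parameter values are assumptions) that no linear rule could separate.

```python
import numpy as np

def qda_scores(x, means, covs, priors):
    """Quadratic discriminant scores for a single observation x."""
    scores = []
    for mu, cov, pi in zip(means, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)             # log|Sigma_k|, computed stably
        maha_sq = diff @ np.linalg.solve(cov, diff)    # squared Mahalanobis distance
        scores.append(-0.5 * logdet - 0.5 * maha_sq + np.log(pi))
    return np.array(scores)

# Assumed parameters: identical means, very different covariances.
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), 9.0 * np.eye(2)]
priors = [0.5, 0.5]

near = qda_scores(np.array([0.1, 0.0]), means, covs, priors)
far = qda_scores(np.array([4.0, 0.0]), means, covs, priors)
```

Points near the origin are assigned to the tight class and distant points to the diffuse class, so the decision boundary is a circle rather than a hyperplane.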
QDA requires estimating $K$ covariance matrices $\Sigma_k$, each with $p(p+1)/2$ free parameters, totaling $K p(p+1)/2$ covariance parameters. LDA shares a single covariance matrix, requiring only $p(p+1)/2$ parameters. The bias-variance tradeoff favors LDA when $n$ is small relative to $p$, and QDA when the true covariances genuinely differ and each class has enough observations to estimate its own $\Sigma_k$.
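The parameter counts are simple arithmetic. The helper below (a hypothetical name) evaluates them for the iris dimensions, with four features and three classes, and for a wider illustrative problem.

```python
def covariance_param_counts(p, K):
    """Free covariance parameters under LDA (shared) and QDA (per class)."""
    per_matrix = p * (p + 1) // 2       # a symmetric p x p matrix
    return {"lda": per_matrix, "qda": K * per_matrix}

iris_counts = covariance_param_counts(p=4, K=3)     # {'lda': 10, 'qda': 30}
wide_counts = covariance_param_counts(p=50, K=3)    # {'lda': 1275, 'qda': 3825}
```

With 50 features QDA already needs several thousand covariance parameters, which is why it tends to overfit unless each class is well sampled.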
Regularized Discriminant Analysis (RDA)
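Friedman (1989) proposed a compromise between LDA and QDA by shrinking toward a common covariance:
$$\hat{\Sigma}_k(\lambda) = (1 - \lambda)\, \hat{\Sigma}_k + \lambda\, \hat{\Sigma}$$
where $\lambda \in [0, 1]$ controls the degree of pooling. When $\lambda = 0$, we obtain QDA; when $\lambda = 1$, we obtain LDA.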
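The pooling step can be sketched as a convex combination of a class covariance and the pooled covariance. This assumes the simple single-parameter form in which a weight of 0 recovers QDA and a weight of 1 recovers LDA; the matrices and the function name are illustrative.

```python
import numpy as np

def rda_covariance(cov_k, cov_pooled, lam):
    """Blend a class-specific covariance with the pooled covariance."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * cov_k + lam * cov_pooled

# Assumed example matrices.
cov_k = np.array([[4.0, 1.0], [1.0, 2.0]])
cov_pooled = np.array([[1.0, 0.0], [0.0, 1.0]])

at_qda = rda_covariance(cov_k, cov_pooled, lam=0.0)     # class-specific covariance
at_lda = rda_covariance(cov_k, cov_pooled, lam=1.0)     # pooled covariance
halfway = rda_covariance(cov_k, cov_pooled, lam=0.5)    # element-wise average
```

In practice the weight is usually chosen by cross-validation over a grid of values.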
Parameter Estimation
Maximum Likelihood Estimation
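Given training data $\{(x_i, y_i)\}_{i=1}^{n}$ with $n_k = \#\{i : y_i = k\}$:
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad \hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$$
The pooled covariance estimator uses $n - K$ degrees of freedom, because one degree of freedom is lost for each estimated class mean. This divisor makes the estimator unbiased under the common-covariance assumption; the strict maximum likelihood estimator divides by $n$ instead and is biased downward.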
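These estimators take only a few lines of NumPy. The sketch below returns the pooled covariance with both divisors so the two can be compared; the synthetic data and the function name `lda_estimates` are assumptions for illustration.

```python
import numpy as np

def lda_estimates(X, y):
    """Class priors, class means, and pooled covariance (unbiased and ML)."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == c) for c in classes])         # n_k / n
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    scatter = np.zeros((p, p))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        scatter += centered.T @ centered                          # within-class scatter
    return priors, means, scatter / (n - K), scatter / n

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 2.0                      # shift one class so the means differ

priors, means, cov_unbiased, cov_ml = lda_estimates(X, y)
```

The two covariance estimates differ only by the factor $n / (n - K)$, which matters little for large samples but is noticeable when $n$ is close to $K$.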
Linear Shrinkage Estimation
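When $p$ is large relative to $n$, the sample covariance can be poorly conditioned. Ledoit-Wolf shrinkage provides a well-conditioned estimator:
$$\hat{\Sigma}_{\text{shrink}} = (1 - \alpha)\, \hat{\Sigma} + \alpha\, m\, I_p$$
where $m = \operatorname{tr}(\hat{\Sigma}) / p$ is the average eigenvalue and $\alpha \in [0, 1]$ is the shrinkage intensity estimated analytically.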
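The effect of linear shrinkage on conditioning can be sketched with a hand-picked intensity. The value 0.3 below is an arbitrary assumption, not the analytically estimated Ledoit-Wolf intensity, and `shrink_covariance` is a hypothetical helper.

```python
import numpy as np

def shrink_covariance(S, alpha):
    """Shrink S toward a scaled identity with the same average eigenvalue."""
    p = S.shape[0]
    m = np.trace(S) / p                       # average eigenvalue of S
    return (1.0 - alpha) * S + alpha * m * np.eye(p)

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 10))                 # few observations relative to features
S = np.cov(X, rowvar=False)

S_shrunk = shrink_covariance(S, alpha=0.3)    # alpha fixed by hand for illustration

cond_before = np.linalg.cond(S)
cond_after = np.linalg.cond(S_shrunk)         # smaller: eigenvalues are pulled together
```

Shrinkage preserves the trace while pulling extreme eigenvalues toward their mean. In scikit-learn, `LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')` selects the intensity with the Ledoit-Wolf formula instead of a fixed value.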
Classification Assessment
Error Rate Estimation
The apparent (resubstitution) error rate is optimistically biased because the same observations are used to fit and to evaluate the rule. Cross-validation and bootstrap methods provide less biased estimates.
Leave-one-out cross-validation for LDA has an elegant closed form. Since each observation is classified using parameters estimated on the remaining $n - 1$ points, the LOOCV error rate can be computed without refitting: the $i$-th observation's leave-one-out classification depends only on the leave-one-out mean and covariance, which can be obtained from the full-sample estimates by rank-one updates.
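For reference, here is a brute-force version that simply refits LDA n times. It is not the closed-form shortcut, only the baseline that such a shortcut should reproduce; the data are synthetic and the helper names are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate classes, priors, means, and pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    scatter = np.zeros((p, p))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        scatter += centered.T @ centered
    return classes, priors, means, scatter / (n - len(classes))

def predict_lda(x, classes, priors, means, cov):
    """Assign x to the class with the largest linear discriminant score."""
    A = np.linalg.solve(cov, means.T)
    scores = x @ A - 0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return classes[np.argmax(scores)]

def loocv_error(X, y):
    """Leave-one-out error rate obtained by refitting on n - 1 points each time."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i                      # drop observation i
        params = fit_lda(X[keep], y[keep])
        mistakes += int(predict_lda(X[i], *params) != y[i])
    return mistakes / n

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(3.0, 1.0, size=(40, 2))])
y = np.repeat([0, 1], 40)
err = loocv_error(X, y)
```

The cost is n full refits, which is what the incremental update formulas avoid.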
Confusion Matrix and Beyond
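For a $K$-class problem, the confusion matrix $C$, where $C_{jk}$ counts observations of true class $j$ predicted as class $k$, provides the basis for summary measures such as overall accuracy $\operatorname{tr}(C) / n$, per-class recall $C_{kk} / \sum_{l} C_{kl}$, and per-class precision $C_{kk} / \sum_{j} C_{jk}$.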
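A small sketch of these quantities follows, using rows for true classes and columns for predicted classes. The label vectors are made-up illustration data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, K):
    """K x K count matrix: rows are true classes, columns are predictions."""
    C = np.zeros((K, K), dtype=int)
    np.add.at(C, (y_true, y_pred), 1)     # accumulate counts, handling repeated pairs
    return C

# Made-up labels for eight observations in three classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

C = confusion_matrix(y_true, y_pred, K=3)
accuracy = np.trace(C) / C.sum()          # correct predictions over all predictions
recall = np.diag(C) / C.sum(axis=1)       # per true class (row-wise)
precision = np.diag(C) / C.sum(axis=0)    # per predicted class (column-wise)
```

Here six of eight predictions are correct, so the accuracy is 0.75, while the per-class recall and precision show which classes absorb the errors.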
Python Implementation
```python
import numpy as np
from scipy import linalg
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- LDA ---
# store_covariance=True is required for lda.covariance_ to exist with the 'svd' solver
lda = LinearDiscriminantAnalysis(solver='svd', n_components=2, store_covariance=True)
X_lda = lda.fit_transform(X_scaled, y)
print("LDA explained variance ratio:", lda.explained_variance_ratio_)
print("LDA class priors:", lda.priors_)
print("LDA means shape:", lda.means_.shape)
lda_cv = cross_val_score(lda, X_scaled, y, cv=10, scoring='accuracy')
print(f"LDA 10-fold CV accuracy: {lda_cv.mean():.4f} (+/- {lda_cv.std():.4f})")

# --- QDA ---
qda = QuadraticDiscriminantAnalysis()
qda_cv = cross_val_score(qda, X_scaled, y, cv=10, scoring='accuracy')
print(f"QDA 10-fold CV accuracy: {qda_cv.mean():.4f} (+/- {qda_cv.std():.4f})")

# --- Fisher's LDA (manual) ---
def fisher_lda(X, y, n_components=2):
    """Return the leading Fisher discriminant directions as columns."""
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_overall = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        n_c = X_c.shape[0]
        S_B += n_c * np.outer(mean_c - mean_overall, mean_c - mean_overall)
    # Solve the symmetric generalized problem S_B w = lambda S_W w.
    # Unlike eig(inv(S_W) @ S_B), this returns real eigenpairs and avoids
    # an explicit inverse; it requires S_W to be positive definite.
    eigenvalues, eigenvectors = linalg.eigh(S_B, S_W)
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    return eigenvectors[:, idx]

W = fisher_lda(X_scaled, y, n_components=2)
X_fisher = X_scaled @ W
print("Fisher discriminant projections shape:", X_fisher.shape)

# --- Mahalanobis distance ---
def mahalanobis_to_class(x, mean_k, cov_inv):
    diff = x - mean_k
    return np.sqrt(diff @ cov_inv @ diff)

# Compute per-class Mahalanobis distances for the first sample
lda.fit(X_scaled, y)
x0 = X_scaled[0]
cov_inv = np.linalg.inv(lda.covariance_)  # invert once, outside the loop
for k, c in enumerate(lda.classes_):
    d = mahalanobis_to_class(x0, lda.means_[k], cov_inv)
    print(f"  Mahalanobis distance to class {c}: {d:.4f}")
```
Assumptions and Diagnostics
Key assumptions of LDA and QDA:
- Multivariate normality: Each class-conditional distribution is approximately multivariate normal. Check with Mardia's test, Q-Q plots of Mahalanobis distances, or Henze-Zirkler's test.
- Homoscedasticity (LDA only): The covariance matrices are equal across classes. Test with Box's M test (sensitive to non-normality; use with caution).
- No multicollinearity: Features should not be perfectly collinear. Examine the condition number of the pooled covariance estimate $\hat{\Sigma}$.
- Independent observations: Each observation is independent of others.
- No significant outliers: Outliers distort mean and covariance estimates. Use robust Mahalanobis distances (e.g., minimum covariance determinant) for detection.
When assumptions are violated:
- Non-normality: Consider nonparametric discriminant analysis or support vector machines.
- Unequal covariances: Use QDA or regularized DA.
- High dimensionality: Apply shrinkage estimators, variable selection, or dimension reduction prior to LDA.
- Small samples: Use regularized LDA with cross-validated shrinkage parameter.
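The multicollinearity check listed among the assumptions can be sketched with a condition number. For brevity this example uses the overall sample covariance of synthetic data rather than a pooled within-class estimate, and the near-duplicate feature is constructed artificially.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))

# Append a feature that is almost an exact linear combination of the first two.
noise = 1e-6 * rng.normal(size=200)
X_collinear = np.column_stack([X, X[:, 0] + X[:, 1] + noise])

# Ratio of largest to smallest singular value of each covariance matrix.
cond_ok = np.linalg.cond(np.cov(X, rowvar=False))
cond_bad = np.linalg.cond(np.cov(X_collinear, rowvar=False))
```

A very large condition number signals that inverting the covariance will amplify noise, which is the situation where shrinkage or dropping redundant features helps.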
Connection to Other Methods
Discriminant analysis occupies a rich position in the statistical landscape. Under the LDA assumptions, the posterior class probabilities are a softmax of linear functions of $x$, the same functional form as a single-layer neural network with softmax output. Logistic regression models the same linear log-odds structure as LDA but fits it by conditional maximum likelihood without requiring the normality assumption, so the two fitted boundaries generally differ. Gaussian naive Bayes simplifies the covariance structure by assuming diagonal $\Sigma_k$, while kernel discriminant analysis handles non-Gaussian distributions through reproducing kernel Hilbert space embeddings.
The Bayesian framework naturally extends to naive Bayes when features are assumed independent within classes, and to mixture discriminant analysis (Hastie & Tibshirani, 1996) when each class is modeled as a mixture of Gaussians, providing flexibility for multimodal class distributions.