Functional Data Analysis (FDA)
Advanced Statistical Methods
Treating Curves and Signals as Data Points
Functional data analysis treats entire curves, surfaces, or functions as individual observations. Basis expansions, functional PCA, and registration methods enable statistical analysis of inherently functional data.
- Biomechanics — Analyze gait patterns by treating each walking cycle as a functional observation
- Meteorology — Compare daily temperature profiles across weather stations as functional data
- Finance — Treat intraday trading curves as functional observations for pattern recognition
FDA lets you do statistics on curves, not just numbers.
Functional data analysis treats observations as functions defined on a continuous domain (typically an interval), rather than as vectors of discrete measurements. The paradigm shift from multivariate to functional statistics arises naturally in longitudinal studies, spectrometry, kinematics, and any setting where the sampling frequency is dense relative to the underlying signal.
Functional Data Representation
DfFunctional Data
Functional data consists of curves , where each is a smooth function belonging to a Hilbert space with inner product .
In practice, we observe discretized or noisy versions:
where is the true curve and are measurement errors. FDA begins by converting these discrete observations into smooth functional representations.
Basis Expansions
The most common approach represents each curve as a linear combination of known basis functions:
Basis Expansion
where are basis functions and are coefficient vectors.
B-Spline Basis
DfB-Spline Basis Functions
B-splines of order are piecewise polynomial functions of degree defined recursively on a knot sequence :
B-splines offer local support: changing one coefficient affects only nearby curve segments. The roughness penalty on the second derivative enforces smoothness:
Roughness Penalty
where . The smoothing parameter controls the bias–variance tradeoff.
Fourier Basis
DfFourier Basis
For periodic functions on :
The Fourier basis is optimal for capturing oscillatory behavior and provides natural frequency-domain interpretation. The roughness penalty is proportional to .
Basis Selection
- B-splines: Non-periodic, irregularly sampled, local features (growth curves, gait data)
- Fourier: Periodic signals ( circadian rhythms, cardiac cycles)
- Wavelets: Non-stationary signals with abrupt changes (EEG, financial time series)
- Wavelet packets: Adaptive selection of time-frequency resolution
The number of basis functions (or equivalently ) is selected via GCV, AIC, or BIC.
Functional PCA
DfFunctional PCA
FPCA decomposes the variability of functional data into orthogonal modes:
where is the mean function, are orthonormal eigenfunctions, and are functional principal component scores.
The eigenfunctions satisfy the Fredholm integral equation:
Eigenvalue Problem
where is the sample covariance function.
ThKarhunen-Loève Representation
For a second-order process with covariance :
where are uncorrelated with and . The truncation to terms captures proportion of total variance.
Functional Linear Models
Scalar-on-Function Regression
Scalar-on-Function Regression
The coefficient function is estimated by penalized splines:
where is the design matrix.
Function-on-Function Regression
Function-on-Function Regression
where is a bivariate coefficient surface. Truncated basis expansion converts this to a multivariate regression problem.
Depth for Functional Data
DfFunctional Depth
The modified band depth (MBD) of a curve among measures the proportion of intervals that contain :
Functional depth generalizes the concept of median and quantiles to function spaces, enabling robust descriptive statistics, boxplots for functional data, and outlier detection.
Registration / Warping
DfCurve Registration
Registration (or alignment) removes phase variation by mapping each observed curve to a standardized time domain via a warping function :
where is a strictly increasing diffeomorphism.
The registration problem decomposes the observed variation into amplitude variation (vertical differences after alignment) and phase variation (horizontal shifts in features):
Registration Methods
- Landmark registration: Aligns pre-specified features (peaks, valleys) using monotone interpolation.
- Continuous registration (PROCRUSTES): Minimizes jointly over and the target.
- Elastic registration: Uses the square-root velocity function (SRVF) framework, where , and alignment is performed in the Hilbert space of SRVFs via geodesic matching.
Python Implementation
import numpy as np
import pandas as pd
from scipy.interpolate import BSpline
from scipy.optimize import minimize_scalar
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Generate functional data ---
t = np.linspace(0, 1, 200)
n = 30
# True signal + phase variation
def true_signal(t):
return np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
# Generate curves with random warping and amplitude perturbation
x = np.zeros((n, len(t)))
for i in range(n):
phase_shift = np.random.normal(0, 0.05)
amp = 1 + np.random.normal(0, 0.1)
freq = 1 + np.random.normal(0, 0.05)
x[i, :] = amp * np.sin(2 * np.pi * freq * (t + phase_shift)) + \
0.5 * np.cos(4 * np.pi * freq * (t + phase_shift)) + \
np.random.normal(0, 0.05, len(t))
# --- B-spline basis expansion ---
n_basis = 25
knots = np.linspace(0, 1, n_basis - 4 + 2)
knots = np.concatenate([[0, 0, 0], knots[1:-1], [1, 1, 1]])
from scipy.interpolate import BSpline
B = np.zeros((len(t), n_basis))
for k in range(n_basis):
coeffs = np.zeros(n_basis)
coeffs[k] = 1.0
spl = BSpline(knots, coeffs, 3)
B[:, k] = spl(t)
# Smooth each curve
lambdas = np.logspace(-4, 0, 20)
x_smooth = np.zeros_like(x)
for i in range(n):
best_gcv = np.inf
best_lam = lambdas[0]
for lam in lambdas:
R = np.zeros((n_basis, n_basis))
for k in range(n_basis - 2):
R[k:k+3, k:k+3] += lam * np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]])
c = np.linalg.solve(B.T @ B + R, B.T @ x[i, :])
residual = x[i, :] - B @ c
gcv = np.mean(residual**2) / (1 - n_basis / len(t))**2
if gcv < best_gcv:
best_gcv = gcv
best_lam = lam
R = np.zeros((n_basis, n_basis))
for k in range(n_basis - 2):
R[k:k+3, k:k+3] += best_lam * np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]])
c = np.linalg.solve(B.T @ B + R, B.T @ x[i, :])
x_smooth[i, :] = B @ c
print(f"Smoothed {n} curves with {n_basis} B-splines")
# --- Functional PCA ---
mean_func = x_smooth.mean(axis=0)
centered = x_smooth - mean_func
# Covariance matrix in basis coordinates
c_basis = np.linalg.lstsq(B, centered.T, rcond=None)[0].T
cov_matrix = np.cov(c_basis.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Reverse for descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Eigenfunctions
psi = np.zeros((len(t), min(5, n_basis)))
for k in range(min(5, n_basis)):
psi[:, k] = B @ eigenvectors[:, k]
psi[:, k] /= np.sqrt(np.trapz(psi[:, k] ** 2, t))
var_explained = eigenvalues / eigenvalues.sum()
print(f"\nFPCA variance explained (first 5 PCs): {var_explained[:5].round(4)}")
print(f"Cumulative: {np.cumsum(var_explained)[:5].round(4)}")
# --- Functional Depth (simplified MBD) ---
def modified_band_depth(X, t_grid):
n, m = X.shape
mbd = np.zeros(n)
for i in range(n):
count = 0
total = 0
for j in range(n):
if j == i:
continue
for k in range(j + 1, n):
if k == i:
continue
lower = np.minimum(X[j], X[k])
upper = np.maximum(X[j], X[k])
count += np.sum((X[i] >= lower) & (X[i] <= upper))
total += m
# Approximate proportion for speed
mbd[i] = count / total if total > 0 else 0
return mbd
# Use subset for speed
mbd = modified_band_depth(x_smooth[:10, ::5], t[::5])
print(f"\nFunctional depths (first 10): {mbd.round(4)}")
# --- Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Raw curves
for i in range(min(10, n)):
axes[0, 0].plot(t, x[i, :], alpha=0.3, lw=0.5)
axes[0, 0].set_title("Raw Curves")
axes[0, 0].set_xlabel("t")
axes[0, 0].set_ylabel("x(t)")
# Smoothed curves
for i in range(min(10, n)):
axes[0, 1].plot(t, x_smooth[i, :], alpha=0.5, lw=0.8)
axes[0, 1].set_title("Smoothed (B-splines)")
axes[0, 1].set_xlabel("t")
# Eigenfunctions
colors = ['red', 'blue', 'green', 'orange', 'purple']
for k in range(min(3, psi.shape[1])):
axes[1, 0].plot(t, psi[:, k], color=colors[k], lw=1.5,
label=f"PC{k+1} ({var_explained[k]*100:.1f}%)")
axes[1, 0].legend(fontsize=9)
axes[1, 0].set_title("FPCA Eigenfunctions")
axes[1, 0].axhline(y=0, color='gray', ls='--', lw=0.5)
# Scree plot
axes[1, 1].bar(range(1, 8), var_explained[:7] * 100, color='steelblue')
axes[1, 1].set_xlabel("Component")
axes[1, 1].set_ylabel("% Variance")
axes[1, 1].set_title("Scree Plot")
plt.tight_layout()
plt.savefig("fda_analysis.png", dpi=150, bbox_inches="tight")
plt.show()
Summary
Key Takeaways: FDA
- FDA treats observations as smooth functions in ; basis expansions (B-splines, Fourier) convert discretized data into functional representations with roughness penalties controlling smoothness.
- FPCA decomposes functional variability via the Fredholm integral eigenvalue problem; the Karhunen-Loève representation provides an optimal low-dimensional approximation.
- Functional linear models (scalar-on-function, function-on-function) extend classical regression; penalized spline estimation handles the ill-posed inverse problem.
- Functional depth (MBD) generalizes median and quantiles to function spaces, enabling robust inference and outlier detection.
- Registration removes phase variation (horizontal shifts) from amplitude variation (vertical differences), which is critical for meaningful functional comparisons.