Functional Data Analysis (FDA)

Advanced Statistical Methods

Treating Curves and Signals as Data Points

Functional data analysis treats entire curves, surfaces, or functions as individual observations. Basis expansions, functional PCA, and registration methods enable statistical analysis of inherently functional data.

Biomechanics — Analyze gait patterns by treating each walking cycle as a functional observation
Meteorology — Compare daily temperature profiles across weather stations as functional data
Finance — Treat intraday trading curves as functional observations for pattern recognition

FDA lets you do statistics on curves, not just numbers.

Functional data analysis treats observations as functions $x_i(t)$ defined on a continuous domain $\mathcal{T}$ (typically an interval), rather than as vectors of discrete measurements. The paradigm shift from multivariate to functional statistics arises naturally in longitudinal studies, spectrometry, kinematics, and any setting where the sampling frequency is dense relative to the underlying signal.

Functional Data Representation

DfFunctional Data

Functional data consists of $n$ curves $\{x_i(t): t \in \mathcal{T}\}_{i=1}^{n}$ , where each $x_i$ is a smooth function belonging to a Hilbert space $\mathcal{H} = L^2(\mathcal{T})$ with inner product $\langle x, y \rangle = \int_\mathcal{T} x(t)y(t)\,dt$ .

In practice, we observe discretized or noisy versions:

x_i(t_{ij}) = f_i(t_{ij}) + \varepsilon_{ij}, \quad j = 1, \dots, m_i

where $f_i$ is the true curve and $\varepsilon_{ij} \sim N(0, \sigma^2)$ are measurement errors. FDA begins by converting these discrete observations into smooth functional representations.

Basis Expansions

The most common approach represents each curve as a linear combination of known basis functions:

Basis Expansion

x_i(t) = \sum_{k=1}^{K} c_{ik}\,\phi_k(t) = \boldsymbol{\phi}(t)^\top \mathbf{c}_i

where $\{\phi_k\}$ are basis functions and $\mathbf{c}_i = (c_{i1}, \dots, c_{iK})^\top$ are coefficient vectors.

B-Spline Basis

DfB-Spline Basis Functions

B-splines of order $r$ are piecewise polynomial functions of degree $r - 1$ defined recursively on a knot sequence $\{t_1 \leq t_2 \leq \dots \leq t_{K+r}\}$ :

B_{i,1}(t) = \begin{cases} 1 & t_i \leq t < t_{i+1} \\ 0 & \text{otherwise} \end{cases}

B_{i,r}(t) = \frac{t - t_i}{t_{i+r-1} - t_i}B_{i,r-1}(t) + \frac{t_{i+r} - t}{t_{i+r} - t_{i+1}}B_{i+1,r-1}(t)

B-splines offer local support: changing one coefficient affects only nearby curve segments. The roughness penalty on the second derivative enforces smoothness:

Roughness Penalty

\text{Pen}(\mathbf{c}) = \lambda \int_\mathcal{T} \left[x''(t)\right]^2 dt = \lambda\,\mathbf{c}^\top \mathbf{R}\,\mathbf{c}

where $R_{jk} = \int \phi_j''(t)\phi_k''(t)\,dt$ . The smoothing parameter $\lambda$ controls the bias–variance tradeoff.

Fourier Basis

DfFourier Basis

For periodic functions on $[0, T]$ :

\phi_0(t) = 1, \quad \phi_{2k-1}(t) = \cos\left(\frac{2\pi k t}{T}\right), \quad \phi_{2k}(t) = \sin\left(\frac{2\pi k t}{T}\right)

The Fourier basis is optimal for capturing oscillatory behavior and provides natural frequency-domain interpretation. The roughness penalty is proportional to $\sum_k k^4 (c_{2k-1}^2 + c_{2k}^2)$ .

Basis Selection

B-splines: Non-periodic, irregularly sampled, local features (growth curves, gait data)
Fourier: Periodic signals ( circadian rhythms, cardiac cycles)
Wavelets: Non-stationary signals with abrupt changes (EEG, financial time series)
Wavelet packets: Adaptive selection of time-frequency resolution

The number of basis functions $K$ (or equivalently $\lambda$ ) is selected via GCV, AIC, or BIC.

Functional PCA

DfFunctional PCA

FPCA decomposes the variability of functional data into orthogonal modes:

x_i(t) = \bar{x}(t) + \sum_{k=1}^{\infty} \xi_{ik}\,\psi_k(t)

where $\bar{x}(t) = \frac{1}{n}\sum_i x_i(t)$ is the mean function, $\{\psi_k\}$ are orthonormal eigenfunctions, and $\xi_{ik} = \int (x_i(t) - \bar{x}(t))\psi_k(t)\,dt$ are functional principal component scores.

The eigenfunctions satisfy the Fredholm integral equation:

Eigenvalue Problem

\int_\mathcal{T} C(s, t)\,\psi_k(t)\,dt = \lambda_k\,\psi_k(s)

where $C(s,t) = \frac{1}{n-1}\sum_i (x_i(s) - \bar{x}(s))(x_i(t) - \bar{x}(t))$ is the sample covariance function.

ThKarhunen-Loève Representation

For a second-order process $X(t)$ with covariance $C(s,t) = \sum_{k=1}^{\infty} \lambda_k \psi_k(s)\psi_k(t)$ :

X(t) = \mu(t) + \sum_{k=1}^{\infty} \sqrt{\lambda_k}\,\xi_k\,\psi_k(t)

where $\xi_k$ are uncorrelated with $\mathbb{E}[\xi_k] = 0$ and $\text{Var}[\xi_k] = 1$ . The truncation to $K$ terms captures proportion $\sum_{k=1}^K \lambda_k / \sum_{k=1}^\infty \lambda_k$ of total variance.

Functional Linear Models

Scalar-on-Function Regression

Y_i = \alpha + \int_\mathcal{T} x_i(t)\,\beta(t)\,dt + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)

The coefficient function $\beta(t)$ is estimated by penalized splines:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{R})^{-1}\mathbf{X}^\top\mathbf{Y}

where $\mathbf{X}_{ik} = \int x_i(t)\phi_k(t)\,dt$ is the design matrix.

Function-on-Function Regression

Y_i(t) = \alpha(t) + \int_\mathcal{T} x_i(s)\,\beta(s, t)\,ds + \varepsilon_i(t)

where $\beta(s,t)$ is a bivariate coefficient surface. Truncated basis expansion converts this to a multivariate regression problem.

Depth for Functional Data

DfFunctional Depth

The modified band depth (MBD) of a curve $x_i$ among $\{x_1, \dots, x_n\}$ measures the proportion of intervals $[x_j(t), x_k(t)]$ that contain $x_i(t)$ :

\text{MBD}_i = \binom{n}{2}^{-1}\sum_{j < k} \frac{1}{m}\sum_{\ell=1}^{m} \mathbf{1}\left[x_i(t_\ell) \in [\min(x_j(t_\ell), x_k(t_\ell)), \max(x_j(t_\ell), x_k(t_\ell))]\right]

Functional depth generalizes the concept of median and quantiles to function spaces, enabling robust descriptive statistics, boxplots for functional data, and outlier detection.

Registration / Warping

DfCurve Registration

Registration (or alignment) removes phase variation by mapping each observed curve $x_i(t)$ to a standardized time domain via a warping function $h_i$ :

x_i^*(t) = x_i(h_i(t))

where $h_i: \mathcal{T} \to \mathcal{T}$ is a strictly increasing diffeomorphism.

The registration problem decomposes the observed variation into amplitude variation (vertical differences after alignment) and phase variation (horizontal shifts in features):

Registration Methods

Landmark registration: Aligns pre-specified features (peaks, valleys) using monotone interpolation.
Continuous registration (PROCRUSTES): Minimizes $\int [x_i(t) - \bar{x}(h_i(t))]^2 dt$ jointly over $h_i$ and the target.
Elastic registration: Uses the square-root velocity function (SRVF) framework, where $q_i(t) = \dot{x}_i(t)/\sqrt{|x_i(t)|}$ , and alignment is performed in the Hilbert space of SRVFs via geodesic matching.

Python Implementation

import numpy as np
import pandas as pd
from scipy.interpolate import BSpline
from scipy.optimize import minimize_scalar
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Generate functional data ---
t = np.linspace(0, 1, 200)
n = 30

# True signal + phase variation
def true_signal(t):
    return np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)

# Generate curves with random warping and amplitude perturbation
x = np.zeros((n, len(t)))
for i in range(n):
    phase_shift = np.random.normal(0, 0.05)
    amp = 1 + np.random.normal(0, 0.1)
    freq = 1 + np.random.normal(0, 0.05)
    x[i, :] = amp * np.sin(2 * np.pi * freq * (t + phase_shift)) + \
              0.5 * np.cos(4 * np.pi * freq * (t + phase_shift)) + \
              np.random.normal(0, 0.05, len(t))

# --- B-spline basis expansion ---
n_basis = 25
knots = np.linspace(0, 1, n_basis - 4 + 2)
knots = np.concatenate([[0, 0, 0], knots[1:-1], [1, 1, 1]])

from scipy.interpolate import BSpline
B = np.zeros((len(t), n_basis))
for k in range(n_basis):
    coeffs = np.zeros(n_basis)
    coeffs[k] = 1.0
    spl = BSpline(knots, coeffs, 3)
    B[:, k] = spl(t)

# Smooth each curve
lambdas = np.logspace(-4, 0, 20)
x_smooth = np.zeros_like(x)
for i in range(n):
    best_gcv = np.inf
    best_lam = lambdas[0]
    for lam in lambdas:
        R = np.zeros((n_basis, n_basis))
        for k in range(n_basis - 2):
            R[k:k+3, k:k+3] += lam * np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]])
        c = np.linalg.solve(B.T @ B + R, B.T @ x[i, :])
        residual = x[i, :] - B @ c
        gcv = np.mean(residual**2) / (1 - n_basis / len(t))**2
        if gcv < best_gcv:
            best_gcv = gcv
            best_lam = lam
    R = np.zeros((n_basis, n_basis))
    for k in range(n_basis - 2):
        R[k:k+3, k:k+3] += best_lam * np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]])
    c = np.linalg.solve(B.T @ B + R, B.T @ x[i, :])
    x_smooth[i, :] = B @ c

print(f"Smoothed {n} curves with {n_basis} B-splines")

# --- Functional PCA ---
mean_func = x_smooth.mean(axis=0)
centered = x_smooth - mean_func

# Covariance matrix in basis coordinates
c_basis = np.linalg.lstsq(B, centered.T, rcond=None)[0].T
cov_matrix = np.cov(c_basis.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reverse for descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Eigenfunctions
psi = np.zeros((len(t), min(5, n_basis)))
for k in range(min(5, n_basis)):
    psi[:, k] = B @ eigenvectors[:, k]
    psi[:, k] /= np.sqrt(np.trapz(psi[:, k] ** 2, t))

var_explained = eigenvalues / eigenvalues.sum()
print(f"\nFPCA variance explained (first 5 PCs): {var_explained[:5].round(4)}")
print(f"Cumulative: {np.cumsum(var_explained)[:5].round(4)}")

# --- Functional Depth (simplified MBD) ---
def modified_band_depth(X, t_grid):
    n, m = X.shape
    mbd = np.zeros(n)
    for i in range(n):
        count = 0
        total = 0
        for j in range(n):
            if j == i:
                continue
            for k in range(j + 1, n):
                if k == i:
                    continue
                lower = np.minimum(X[j], X[k])
                upper = np.maximum(X[j], X[k])
                count += np.sum((X[i] >= lower) & (X[i] <= upper))
                total += m
                # Approximate proportion for speed
        mbd[i] = count / total if total > 0 else 0
    return mbd

# Use subset for speed
mbd = modified_band_depth(x_smooth[:10, ::5], t[::5])
print(f"\nFunctional depths (first 10): {mbd.round(4)}")

# --- Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Raw curves
for i in range(min(10, n)):
    axes[0, 0].plot(t, x[i, :], alpha=0.3, lw=0.5)
axes[0, 0].set_title("Raw Curves")
axes[0, 0].set_xlabel("t")
axes[0, 0].set_ylabel("x(t)")

# Smoothed curves
for i in range(min(10, n)):
    axes[0, 1].plot(t, x_smooth[i, :], alpha=0.5, lw=0.8)
axes[0, 1].set_title("Smoothed (B-splines)")
axes[0, 1].set_xlabel("t")

# Eigenfunctions
colors = ['red', 'blue', 'green', 'orange', 'purple']
for k in range(min(3, psi.shape[1])):
    axes[1, 0].plot(t, psi[:, k], color=colors[k], lw=1.5,
                    label=f"PC{k+1} ({var_explained[k]*100:.1f}%)")
axes[1, 0].legend(fontsize=9)
axes[1, 0].set_title("FPCA Eigenfunctions")
axes[1, 0].axhline(y=0, color='gray', ls='--', lw=0.5)

# Scree plot
axes[1, 1].bar(range(1, 8), var_explained[:7] * 100, color='steelblue')
axes[1, 1].set_xlabel("Component")
axes[1, 1].set_ylabel("% Variance")
axes[1, 1].set_title("Scree Plot")

plt.tight_layout()
plt.savefig("fda_analysis.png", dpi=150, bbox_inches="tight")
plt.show()

Summary

Key Takeaways: FDA

FDA treats observations as smooth functions in $L^2(\mathcal{T})$ ; basis expansions (B-splines, Fourier) convert discretized data into functional representations with roughness penalties controlling smoothness.
FPCA decomposes functional variability via the Fredholm integral eigenvalue problem; the Karhunen-Loève representation provides an optimal low-dimensional approximation.
Functional linear models (scalar-on-function, function-on-function) extend classical regression; penalized spline estimation handles the ill-posed inverse problem.
Functional depth (MBD) generalizes median and quantiles to function spaces, enabling robust inference and outlier detection.
Registration removes phase variation (horizontal shifts) from amplitude variation (vertical differences), which is critical for meaningful functional comparisons.

Functional Data Analysis (FDA)

Functional Data Analysis (FDA)

Treating Curves and Signals as Data Points

Functional Data Representation

DfFunctional Data

Basis Expansions

Basis Expansion

B-Spline Basis

DfB-Spline Basis Functions

Roughness Penalty

Fourier Basis

DfFourier Basis

Functional PCA

DfFunctional PCA

Eigenvalue Problem

ThKarhunen-Loève Representation

Functional Linear Models

Scalar-on-Function Regression

Scalar-on-Function Regression

Function-on-Function Regression

Function-on-Function Regression

Depth for Functional Data

DfFunctional Depth

Registration / Warping

DfCurve Registration

Python Implementation

Summary

Key Takeaways: FDA

Premium Content

Need Expert Statistics Help?