Finding the Optimal Boundary — Maximum Margin Classification

A Support Vector Machine (SVM) finds the hyperplane that maximizes the margin between classes. It is theoretically elegant and performs well in high-dimensional spaces.

Maximum Margin — The widest possible gap between classes
Kernel Trick — Nonlinear classification without explicit transformation
Support Vectors — The critical points that define the decision boundary

"The art of discovery consists of seeing what everyone has seen and thinking what nobody has thought."

Support Vector Machines — Complete Guide

Maximum Margin Classifier

Definition: Support Vector Machine (SVM)

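Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $y^{(i)} \in \{-1, +1\}$, SVM finds the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ that maximizes the margin $\gamma = \frac{2}{\|\mathbf{w}\|}$, subject to $y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1$ for all $i$. Maximizing $\frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, which is why the objective is usually written in that form.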
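As a minimal sketch of this definition, the following fits a linear SVM to a tiny separable dataset and reads the margin off the learned weights. The four toy points and the very large `C` (used here to approximate a hard margin with scikit-learn's `SVC`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (assumed for illustration): two points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset
margin = 2.0 / np.linalg.norm(w)

# Functional margins y_i (w^T x_i + b); each should be at least 1.
functional_margins = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin width =", margin)
print("functional margins =", functional_margins)
```

For these points the widest separating gap lies between the lines $x_1 = 0$ and $x_1 = 2$, so the reported margin width should come out close to 2, up to solver tolerance.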

Soft Margin SVM

Definition: Soft Margin SVM (C-SVM)

For non-separable data, allow margin violations with slack variables $\xi_i \geq 0$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t. } y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
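To make the slack variables concrete, here is a rough sketch that fits a linear soft-margin SVM and then recovers each $\xi_i$ as $\max(0, 1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b))$, which is the smallest slack satisfying the constraints. The overlapping blob data and the choice $C = 1$ are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class data (assumed for illustration), labels in {-1, +1}.
X, y01 = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                    cluster_std=1.2, random_state=0)
y = 2 * y01 - 1

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack of each point: xi_i = max(0, 1 - y_i (w^T x_i + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Value of the primal objective at the fitted solution.
objective = 0.5 * (w @ w) + C * xi.sum()

print("points with xi > 0 (inside or beyond the margin):", int((xi > 0).sum()))
print("points with xi > 1 (misclassified):", int((xi > 1).sum()))
print("primal objective:", objective)
```

Points with $0 < \xi_i \leq 1$ sit inside the margin but on the correct side; points with $\xi_i > 1$ are on the wrong side of the hyperplane.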

The C Parameter

Large C: Less regularization, fewer margin violations (may overfit)
Small C: More regularization, more margin violations (smoother boundary)
$C \to \infty$: Hard margin SVM (requires linearly separable data)
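The effect of `C` on the margin can be sketched by refitting the same linear SVM with several values. The dataset and the particular `C` values below are assumptions chosen only to make the trend visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class data (assumed for illustration).
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.2, random_state=0)

margins = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width 2 / ||w|| for the fitted hyperplane.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6} margin width={margins[C]:.3f} "
          f"support vectors={clf.n_support_.sum()}")
```

A smaller `C` penalizes violations less, so the optimizer accepts a wider margin with more points inside it; a larger `C` narrows the margin to reduce violations.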

The Kernel Trick

Definition: Kernel Trick

The kernel trick allows SVM to learn nonlinear decision boundaries by implicitly mapping inputs into high-dimensional feature spaces without explicitly computing the transformation. The dual formulation only needs dot products $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, which can be computed efficiently via kernel functions.
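A small worked example may help here. For 2-D inputs, the degree-2 polynomial kernel $(\mathbf{x}^T\mathbf{z} + 1)^2$ corresponds to an explicit 6-dimensional feature map. The sketch below (with hypothetical helper names `phi` and `poly_kernel`) checks that the dot product in feature space equals the kernel evaluated in the original space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x^T z + 1)^2 on 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, z):
    """The same inner product, computed in the original 2-D space."""
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # dot product in the 6-D feature space
implicit = poly_kernel(x, z)   # kernel evaluated directly

print(explicit, implicit)
```

Here $\mathbf{x}^T\mathbf{z} = 1$, so both quantities equal $(1 + 1)^2 = 4$. The kernel needs one 2-D dot product, while the explicit route builds two 6-D vectors first; the gap grows quickly with degree and input dimension.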

Common Kernel Functions

Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$ — no mapping, original space
Polynomial: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$ — polynomial features
RBF (Gaussian): $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ — infinite-dimensional feature space
Sigmoid: $K(\mathbf{x}, \mathbf{z}) = \tanh(\alpha\mathbf{x}^T\mathbf{z} + c)$ — neural network-like
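The four formulas above can be written directly in NumPy and compared against scikit-learn's pairwise kernel functions. This is only a sketch: the random inputs and the parameter values `c`, `d`, `gamma`, and `alpha` are hypothetical, picked for the check.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rng.normal(size=(4, 3))

# Hypothetical parameter values, chosen only for this check.
c, d, gamma, alpha = 1.0, 3, 0.5, 0.1

G = X @ Z.T  # all pairwise dot products x^T z
sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)

K_linear = G
K_poly = (G + c) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(alpha * G + c)

# scikit-learn's polynomial and sigmoid kernels scale the dot product by
# `gamma`, so gamma=1 and gamma=alpha recover the forms used above.
checks = {
    "linear": np.allclose(K_linear, linear_kernel(X, Z)),
    "poly": np.allclose(
        K_poly, polynomial_kernel(X, Z, degree=d, gamma=1.0, coef0=c)),
    "rbf": np.allclose(K_rbf, rbf_kernel(X, Z, gamma=gamma)),
    "sigmoid": np.allclose(
        K_sigmoid, sigmoid_kernel(X, Z, gamma=alpha, coef0=c)),
}
print(checks)
```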

RBF Kernel

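$$K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$$

Here,

$\gamma = \frac{1}{2\sigma^2}$: inverse bandwidth; a large $\gamma$ gives a more complex boundary (this $\gamma$ is unrelated to the margin symbol used in the SVM definition)
$\sigma$: bandwidth parameter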
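The relation between the two parameterizations, and the effect of $\gamma$ on the fitted model, can be sketched as follows. The helper `gamma_from_sigma`, the two-moons dataset, and the specific $\gamma$ values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def gamma_from_sigma(sigma):
    """Convert the bandwidth sigma to gamma = 1 / (2 sigma^2)."""
    return 1.0 / (2.0 * sigma**2)

# Both forms of the kernel give the same value.
x, z, sigma = np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.5
sq_dist = np.sum((x - z) ** 2)
k_sigma = np.exp(-sq_dist / (2.0 * sigma**2))
k_gamma = np.exp(-gamma_from_sigma(sigma) * sq_dist)
print(k_sigma, k_gamma)

# Noisy two-moons data (assumed for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
train_acc = {}
for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    train_acc[gamma] = clf.score(X, y)
    print(f"gamma={gamma:<6} train accuracy={train_acc[gamma]:.3f} "
          f"support vectors={clf.n_support_.sum()}")
```

With a large $\gamma$ each training point influences only its immediate neighborhood, so training accuracy tends to rise while the boundary becomes more irregular; held-out accuracy is the quantity to watch when choosing $\gamma$.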

Kernel Decision Boundaries
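One way to see how the kernel choice changes the decision boundary is to compare a linear and an RBF SVM on data that no straight line can separate. The concentric-circles dataset below is an assumption for illustration, not part of the original material.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles (assumed for illustration): not linearly separable in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel:>6}: test accuracy = {scores[kernel]:.3f}")
```

The linear kernel can only draw a straight line through the rings, while the RBF kernel can enclose the inner ring with a closed boundary.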

Python Implementation

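from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Example data (an assumption for illustration; substitute your own X and y)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Linear SVM: the scaler is fit on the training data only, inside the pipeline
pipe_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear', C=1.0))
])
pipe_linear.fit(X_train, y_train)
print(f"Linear: {pipe_linear.score(X_test, y_test):.3f}")

# RBF SVM (the default kernel of SVC); gamma='scale' sets gamma from the data
pipe_rbf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
pipe_rbf.fit(X_train, y_train)
print(f"RBF: {pipe_rbf.score(X_test, y_test):.3f}")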
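Since `C` and `gamma` both shape the boundary, they are usually tuned together. A minimal sketch with scikit-learn's `GridSearchCV` follows; the breast cancer dataset and the grid values are assumptions, and a real search would adapt the grid to the problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example data (assumed for illustration).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Hypothetical grid; parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from the validation fold leaks into the scaling statistics.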

Always Scale Features

SVM is sensitive to feature magnitudes because it uses distances. Always standardize features (zero mean, unit variance) before training SVM.
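The impact of scaling is easy to check empirically. The sketch below compares an RBF SVM with and without standardization on the wine dataset, whose features span very different ranges; the dataset choice is an assumption for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features in this dataset range from fractions to over a thousand.
X, y = load_wine(return_X_y=True)

# Same model, with and without standardization, scored by 5-fold CV.
unscaled = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")), X, y, cv=5
).mean()

print(f"without scaling: {unscaled:.3f}")
print(f"with scaling:    {scaled:.3f}")
```

Without scaling, the feature with the largest numeric range dominates the distance $\|\mathbf{x}-\mathbf{z}\|^2$, so the kernel effectively ignores the others.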

Key Takeaways

Summary: SVM

SVM finds the maximum margin hyperplane: $\max \frac{2}{\|\mathbf{w}\|}$ s.t. $y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b) \geq 1$
Support vectors are the training points on the margin boundary (and, in the soft margin case, inside or beyond it); only they determine the decision boundary
Kernel trick enables nonlinear classification: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$
RBF kernel is the default in scikit-learn's SVC and corresponds to an infinite-dimensional feature space
C parameter controls regularization: large C = fewer margin violations
Always scale features: SVM is distance-based
SVMs work well for high-dimensional data, especially when $d > N$
Slow for large datasets: training with the dual formulation scales between $O(N^2)$ and $O(N^3)$ in the number of samples
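The takeaways about support vectors and kernels can be checked together: the decision function of a fitted kernel SVM is $f(\mathbf{x}) = \sum_i \alpha_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}) + b$, where the sum runs over support vectors only. The sketch below rebuilds it by hand from scikit-learn's fitted attributes; the dataset and the value of `gamma` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two-moons data (assumed for illustration).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)  # (n_SV, n_samples)
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

print("support vectors:", len(clf.support_), "of", len(X))
print("matches decision_function:",
      np.allclose(manual, clf.decision_function(X)))
```

Every training point that is not a support vector has $\alpha_i = 0$ and drops out of the sum, which is why prediction cost depends on the number of support vectors rather than on the full training set.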

What to Learn Next

Logistic Regression: Classification with probability, from linear to sigmoid.
Naive Bayes: Bayes' theorem in action; fast, simple, surprisingly powerful.
Dimensionality Reduction: Reduce features while preserving information with PCA and t-SNE.
