Support Vector Machines — Complete Guide

Supervised Learning

Finding the Optimal Boundary — Maximum Margin Classification

SVM finds the hyperplane that maximizes the margin between classes. It is theoretically elegant and powerful in high-dimensional spaces.

  • Maximum Margin — The widest possible gap between classes
  • Kernel Trick — Nonlinear classification without explicit transformation
  • Support Vectors — The critical points that define the decision boundary

"The art of discovery consists of seeing what everyone has seen and thinking what nobody has thought."



Maximum Margin Classifier

Definition: Support Vector Machine (SVM)

Given training data $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $y^{(i)} \in \{-1, +1\}$, SVM finds the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ that maximizes the margin $\gamma = \frac{2}{\|\mathbf{w}\|}$, subject to $y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1$ for all $i$.
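The definition can be checked numerically on a tiny example. The sketch below is a minimal illustration, assuming scikit-learn's `SVC` with a very large `C` as a stand-in for the hard-margin problem; the six data points are made up for this purpose.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]

# Geometric margin width: 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

# Functional margins y_i (w^T x_i + b); each should be >= 1 up to solver tolerance
functional_margins = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin width =", margin)
print("smallest functional margin =", functional_margins.min())
print("support vectors:\n", clf.support_vectors_)
```

The points whose functional margin is (approximately) 1 are the ones reported in `support_vectors_`.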

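Figure: SVM maximum-margin hyperplane and support vectors. The decision boundary $\mathbf{w}^T\mathbf{x} + b = 0$ lies midway between the margin boundaries $\mathbf{w}^T\mathbf{x} + b = -1$ (Class −1) and $\mathbf{w}^T\mathbf{x} + b = +1$ (Class +1), which are separated by a margin of $2/\|\mathbf{w}\|$.

SVM Optimization

Primal problem:

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1$$

Dual problem (Lagrangian):

$$\max_{\boldsymbol{\alpha}} \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \quad \text{s.t. } 0 \leq \alpha_i \leq C, \quad \sum_{i} \alpha_i y^{(i)} = 0$$

Support vectors: the points with $\alpha_i > 0$. They lie on the margin boundary (or, when $C$ is finite, possibly inside it). Only the support vectors determine the decision boundary.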
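To see that only the support vectors enter the decision function, the sketch below rebuilds $f(\mathbf{x}) = \sum_i \alpha_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}) + b$ by hand from a fitted scikit-learn `SVC`. The synthetic blobs are an assumption chosen for illustration; `dual_coef_`, `support_vectors_`, and `intercept_` are the attributes scikit-learn exposes for the dual solution.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ stores alpha_i * y_i, for the support vectors only
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_
b = clf.intercept_[0]

# f(x) = sum_i alpha_i y_i K(x_i, x) + b, here with the linear kernel K(x_i, x) = x_i^T x
manual_decision = (X @ sv.T) @ alpha_y + b

print("support vectors used:", len(sv), "of", len(X), "training points")
print("matches decision_function:",
      np.allclose(manual_decision, clf.decision_function(X), atol=1e-6))
print("sum of alpha_i * y_i:", alpha_y.sum())
```

The last line checks the dual constraint $\sum_i \alpha_i y^{(i)} = 0$, which should hold up to floating-point error.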

Soft Margin SVM

Definition: Soft Margin SVM (C-SVM)

For non-separable data, allow margin violations with slack variables $\xi_i \geq 0$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t. } y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
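At the optimum each slack variable equals the hinge loss, $\xi_i = \max(0, 1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b))$, so the slacks can be recovered from a fitted model. The following is a rough sketch on overlapping synthetic blobs (an assumption made for illustration), using scikit-learn's linear `SVC`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some margin violations are unavoidable (illustrative data)
X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)
y = 2 * y01 - 1  # relabel to {-1, +1} to match the formulas

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack per point: zero when the margin constraint holds, positive otherwise
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Value of the soft-margin primal objective at the fitted solution
objective = 0.5 * (w @ w) + C * xi.sum()

print("points inside the margin (xi > 0):", int((xi > 0).sum()))
print("misclassified points (xi > 1):", int((xi > 1).sum()))
print("primal objective:", objective)
```

Points with $0 < \xi_i \leq 1$ sit inside the margin but on the correct side; points with $\xi_i > 1$ are misclassified.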

The C Parameter

  • Large C: Less regularization, fewer margin violations (may overfit)
  • Small C: More regularization, more margin violations (smoother boundary)
  • $C = \infty$: Hard margin SVM (requires perfect separation)
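The trade-off in the list above can be observed directly. This sketch, assuming synthetic blobs and scikit-learn's linear `SVC`, fits the same data with three values of C and reports the margin width $2/\|\mathbf{w}\|$ and the number of support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two moderately overlapping classes (illustrative data)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

margins = []
n_support = []
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])  # margin width 2 / ||w||
    margins.append(width)
    n_support.append(int(clf.n_support_.sum()))
    print(f"C={C:>6}: margin width={width:.3f}, support vectors={n_support[-1]}")
```

On data like this, a small C should give a wide margin with many points inside it, while a large C should narrow the margin and leave fewer support vectors.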

The Kernel Trick

Definition: Kernel Trick

The kernel trick allows SVM to learn nonlinear decision boundaries by implicitly mapping inputs into high-dimensional feature spaces without explicitly computing the transformation. The dual formulation only needs dot products $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, which can be computed efficiently via kernel functions.
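A small worked case makes the identity concrete. For 2D inputs, the homogeneous degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ corresponds to the explicit map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below (the function names `phi` and `poly2_kernel` are hypothetical) checks that both routes give the same number.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2D homogeneous degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Same quantity computed in the original 2D space: (x^T z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the 3D feature space
implicit = poly2_kernel(x, z)             # kernel evaluated in the 2D input space

print("phi(x)^T phi(z) =", explicit)
print("K(x, z)         =", implicit)
```

The kernel route never builds the 3D vectors, which is the whole point when the feature space is very large or infinite-dimensional.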

Figure: The kernel trick, mapping to higher dimensions. Data that is not linearly separable in the original 2D space becomes linearly separable by a plane after mapping through $\phi(\mathbf{x})$ into a 3D space.

Common Kernel Functions

  • Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$ (no mapping, original space)
  • Polynomial: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$ (polynomial features)
  • RBF (Gaussian): $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ (infinite-dimensional feature space)
  • Sigmoid: $K(\mathbf{x}, \mathbf{z}) = \tanh(\alpha\mathbf{x}^T\mathbf{z} + c)$ (neural network-like)
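The four kernels in the list can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function names, default parameter values, and random inputs are illustrative, and the results are compared against the corresponding functions in `sklearn.metrics.pairwise` (where the document's $\alpha$ is called `gamma` and $c$ is called `coef0`).

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 points in 3 dimensions
Z = rng.normal(size=(4, 3))  # 4 points in 3 dimensions

def linear(X, Z):
    return X @ Z.T

def polynomial(X, Z, c=1.0, d=3):
    return (X @ Z.T + c) ** d

def rbf(X, Z, gamma=0.5):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, computed for all pairs at once
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * (X @ Z.T))
    return np.exp(-gamma * sq_dists)

def sigmoid(X, Z, alpha=0.1, c=0.0):
    return np.tanh(alpha * (X @ Z.T) + c)

checks = {
    "linear": np.allclose(linear(X, Z), linear_kernel(X, Z)),
    "polynomial": np.allclose(
        polynomial(X, Z), polynomial_kernel(X, Z, degree=3, gamma=1.0, coef0=1.0)),
    "rbf": np.allclose(rbf(X, Z), rbf_kernel(X, Z, gamma=0.5)),
    "sigmoid": np.allclose(
        sigmoid(X, Z), sigmoid_kernel(X, Z, gamma=0.1, coef0=0.0)),
}
print(checks)
```

Each function returns a 5×4 matrix whose entry (i, j) is the kernel value between row i of `X` and row j of `Z`.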

RBF Kernel

$$K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$$

Here,

  • $\gamma = \frac{1}{2\sigma^2}$: inverse bandwidth; large $\gamma$ → complex boundary
  • $\sigma$: bandwidth parameter
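The two parameterizations and the effect of $\gamma$ can be sketched as follows. The first part checks that the $\sigma$ form and the $\gamma$ form agree when $\gamma = 1/(2\sigma^2)$; the second part fits scikit-learn's `SVC` on a noisy two-moons dataset (an assumption chosen for illustration) with increasing $\gamma$.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Part 1: sigma form vs. gamma form of the same kernel value
sigma = 0.5
gamma = 1.0 / (2.0 * sigma**2)
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])
k_sigma = np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))
k_gamma = rbf_kernel(x, z, gamma=gamma)[0, 0]
print("sigma form:", k_sigma, "gamma form:", k_gamma)

# Part 2: larger gamma means narrower kernels and a more flexible boundary
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
train_acc = {}
for g in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=g).fit(X, y)
    train_acc[g] = clf.score(X, y)
    print(f"gamma={g:>6}: train accuracy={train_acc[g]:.3f}, "
          f"support vectors={int(clf.n_support_.sum())}")
```

High training accuracy at large $\gamma$ is not by itself a good sign: it usually reflects a boundary that wraps around individual points, so held-out accuracy should be checked as well.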

Kernel Decision Boundaries

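Figure: SVM kernel comparison on the same data. Panels: Linear, $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$; Polynomial (d=3), $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + 1)^3$; RBF (γ=1), $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$. Effect of C: C=0.01 (wide margin), C=1.0 (balanced), C=100 (narrow margin).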

Python Implementation

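from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example data: an assumption for illustration; substitute your own X and y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)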

# Linear SVM
pipe_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear', C=1.0))
])
pipe_linear.fit(X_train, y_train)
print(f"Linear: {pipe_linear.score(X_test, y_test):.3f}")

# RBF SVM (default)
pipe_rbf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
pipe_rbf.fit(X_train, y_train)
print(f"RBF: {pipe_rbf.score(X_test, y_test):.3f}")

Always Scale Features

SVM is sensitive to feature magnitudes because it uses distances. Always standardize features (zero mean, unit variance) before training SVM.
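A quick way to see why scaling matters is to measure how much each feature contributes to squared distances. The sketch below uses made-up data with one feature on a scale of about 1 and another on a scale of about 1000, then repeats the measurement after `StandardScaler`; the helper name `distance_share` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 has scale ~1, feature 1 has scale ~1000 (illustrative data)
X = np.column_stack([rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1000.0, 500)])

def distance_share(X, feature):
    """Fraction of total squared pairwise distance contributed by one feature."""
    diffs = X[:, None, :] - X[None, :, :]  # all pairwise differences
    sq = diffs ** 2
    return sq[..., feature].sum() / sq.sum()

raw_share = distance_share(X, feature=1)
scaled_share = distance_share(StandardScaler().fit_transform(X), feature=1)

print(f"share of feature 1 before scaling: {raw_share:.6f}")
print(f"share of feature 1 after scaling:  {scaled_share:.6f}")
```

Before scaling, the large-magnitude feature accounts for nearly all of the distance, so an RBF or linear SVM would effectively ignore the other feature. After standardization the two features contribute equally.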


Key Takeaways

Summary: SVM

  1. SVM finds the maximum-margin hyperplane: $\max \frac{2}{\|\mathbf{w}\|}$ s.t. $y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)}+b) \geq 1$
  2. Support vectors are the points on the margin boundary (or inside it, for a soft margin); only they determine the decision boundary
  3. The kernel trick enables nonlinear classification: $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$
  4. The RBF kernel is the default in scikit-learn's SVC and corresponds to an infinite-dimensional feature space
  5. The C parameter controls regularization: large C means fewer margin violations
  6. Always scale features: SVM is distance-based
  7. SVMs work well for high-dimensional data, especially when $d > N$
  8. Slow for large datasets: training the dual formulation scales between $O(N^2)$ and $O(N^3)$
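As a rough back-of-the-envelope illustration of the last point, the kernel (Gram) matrix used by the dual has $N^2$ entries. The sketch below estimates its size in memory, assuming 8-byte floats and a fully stored matrix; practical solvers typically cache only part of it, so treat the numbers as an upper-bound intuition rather than a measurement. The function name `gram_matrix_gib` is hypothetical.

```python
def gram_matrix_gib(n_samples, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix, in GiB (assumes float64 entries)."""
    return n_samples**2 * bytes_per_entry / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"N={n:>7,}: dense kernel matrix is about {gram_matrix_gib(n):,.2f} GiB")
```

Multiplying N by 10 multiplies the matrix size by 100, which is why kernel SVMs become impractical well before other costs do.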
