πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Convex Optimization

OptimizationConvex🟒 Free Lesson

Advertisement

Convex Optimization

Why It Matters

Convex optimization is the backbone of modern machine learning. Every local minimum of a convex function is a global minimum, making these problems tractable and reliable. SVMs, linear regression, logistic regression, LASSO, ridge regression, and many neural network subproblems are all convex. Understanding convexity is the single most important concept for anyone building or analyzing optimization-based ML systems.

Convex optimization bridges pure mathematics and practical ML engineering. While non-convex problems (like training deep neural networks) dominate headlines, convex problems form the foundation: they have efficient solvers with provable guarantees, they appear as subroutines in larger algorithms, and many heuristics for non-convex problems work by solving convex relaxations. Mastering this topic means understanding why some optimization problems are "easy" and others are "hard," and having the tools to transform hard problems into tractable ones.


Convex Set

DfConvex Set

A set CβŠ†RnC \subseteq \mathbb{R}^n is convex if for all x,y∈Cx, y \in C and all θ∈[0,1]\theta \in [0, 1]:

ΞΈx+(1βˆ’ΞΈ)y∈C\theta x + (1 - \theta) y \in C

Geometrically, this means the line segment connecting any two points in CC lies entirely within CC. There are no "dents," indentations, or holes.

Intuition

Imagine drawing a rubber band around the boundary of the set. If the rubber band touches every point on the boundary without cutting through the interior, the set is convex. A circle is convex; a crescent moon is not. The intersection of any collection of convex sets is convex, which is a powerful tool for constructing complex convex feasible regions.

ThExamples of Convex Sets

SetDescriptionConvex?
Rn\mathbb{R}^nEntire Euclidean spaceYes
Halfspace {x:aTx≀b}\{x : a^Tx \leq b\}Linear inequalityYes
Hyperplane {x:aTx=b}\{x : a^Tx = b\}Linear equalityYes
Ball {x:βˆ₯xβˆ’cβˆ₯2≀r}\{x : \|x - c\|_2 \leq r\}Euclidean ballYes
Nonnegative orthant R+n\mathbb{R}^n_+{x:xiβ‰₯0}\{x : x_i \geq 0\}Yes
Simplex Ξ”n\Delta_n{x:xiβ‰₯0,βˆ‘xi=1}\{x : x_i \geq 0, \sum x_i = 1\}Yes
Positive semidefinite cone S+n\mathcal{S}^n_+{X:Xβͺ°0}\{X : X \succeq 0\}Yes
Set with a bite taken out{x:βˆ₯xβˆ₯2≀1}βˆ–{x:βˆ₯xβˆ₯2≀0.5}\{x : \|x\|_2 \leq 1\} \setminus \{x : \|x\|_2 \leq 0.5\}No

Problem: Is the Intersection Convex?

Is the set C={x∈R2:x12+x22≀1,β€…β€Šx1+x2β‰₯1}C = \{x \in \mathbb{R}^2 : x_1^2 + x_2^2 \leq 1, \; x_1 + x_2 \geq 1\} convex?

Solution

Yes. The set CC is the intersection of a disk {x:x12+x22≀1}\{x : x_1^2 + x_2^2 \leq 1\} (convex) and a halfspace {x:x1+x2β‰₯1}\{x : x_1 + x_2 \geq 1\} (convex). The intersection of convex sets is convex. Geometrically, CC is a circular segment.


Convex Function

DfConvex Function

A function f:Rnβ†’Rf: \mathbb{R}^n \to \mathbb{R} is convex if its domain dom(f)\text{dom}(f) is a convex set and for all x,y∈dom(f)x, y \in \text{dom}(f) and θ∈[0,1]\theta \in [0, 1]:

f(ΞΈx+(1βˆ’ΞΈ)y)≀θf(x)+(1βˆ’ΞΈ)f(y)f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)

This is the epigraph condition: the line segment connecting (x,f(x))(x, f(x)) and (y,f(y))(y, f(y)) lies on or above the graph of ff. A function is concave if βˆ’f-f is convex.

Intuition

A convex function curves "upward" everywhere. If you stand on the graph and look along the curve, the ground always rises in both directions. This means there are no "dips" that create local minima distinct from the global minimum. The epigraph {(x,t):f(x)≀t}\{(x, t) : f(x) \leq t\} is a convex set β€” this connects convex functions to convex sets.

First-Order Condition

ThFirst-Order Optimality Condition

A differentiable function f:Rnβ†’Rf: \mathbb{R}^n \to \mathbb{R} is convex if and only if for all x,y∈dom(f)x, y \in \text{dom}(f):

f(y)β‰₯f(x)+βˆ‡f(x)T(yβˆ’x)f(y) \geq f(x) + \nabla f(x)^T (y - x)

The tangent hyperplane at any point lies on or below the graph. This is the global linear lower bound characterization. Moreover, xβˆ—x^* is a global minimum of a convex function ff if and only if:

βˆ‡f(xβˆ—)=0\nabla f(x^*) = 0

Second-Order Condition

ThSecond-Order Condition

A twice-differentiable function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is convex if and only if its Hessian is positive semidefinite everywhere:

βˆ‡2f(x)βͺ°0βˆ€x∈dom(f)\nabla^2 f(x) \succeq 0 \quad \forall x \in \text{dom}(f)

For strict convexity, we need βˆ‡2f(x)≻0\nabla^2 f(x) \succ 0 (positive definite) everywhere. For functions of one variable, this reduces to fβ€²β€²(x)β‰₯0f''(x) \geq 0.

Strict vs. Strong Convexity

Strict convexity (βˆ‡2f≻0\nabla^2 f \succ 0) guarantees a unique minimizer but says nothing about the "flatness" near the minimum. Strong convexity (defined below) adds a curvature lower bound and gives much stronger convergence guarantees.

Common Convex and Non-Convex Functions

FunctionConvex?Notes
f(x)=cf(x) = cYesConstant functions are convex (and concave)
f(x)=x2f(x) = x^2YesStrictly convex
f(x)=exf(x) = e^xYesStrictly convex, fβ€²β€²(x)=ex>0f''(x) = e^x > 0
f(x)=βˆ’log⁑(x)f(x) = -\log(x)YesStrictly convex on (0,∞)(0, \infty)
f(x)=βˆ₯xβˆ₯f(x) = \|x\|YesConvex but not differentiable at 0
f(x)=x4f(x) = x^4YesStrictly convex, though fβ€²β€²(0)=0f''(0) = 0
f(x)=x3f(x) = x^3NoInflection point at x=0x = 0
f(x)=sin⁑(x)f(x) = \sin(x)NoConcave and convex regions
f(x)=1/xf(x) = 1/xNoOn R\mathbb{R}; convex on (0,∞)(0, \infty)
f(X)=log⁑det⁑(X)f(X) = \log\det(X)YesOn S++n\mathcal{S}^n_{++} (positive definite cone)
f(x)=βˆ₯xβˆ₯22+βˆ₯xβˆ₯1f(x) = \|x\|_2^2 + \|x\|_1YesSum of convex functions
f(x)=x1x2f(x) = x_1 x_2NoBilinear (saddle-shaped)

Properties of Convex Functions

ThFundamental Properties of Convex Functions

Let f,gf, g be convex functions and Ξ±β‰₯0\alpha \geq 0.

1. Non-negative weighted sums are convex:

h(x)=Ξ±f(x)+Ξ²g(x)Β isΒ convexΒ forΒ Ξ±,Ξ²β‰₯0h(x) = \alpha f(x) + \beta g(x) \text{ is convex for } \alpha, \beta \geq 0

2. Pointwise maximum preserves convexity:

h(x)=max⁑(f(x),g(x)) is convexh(x) = \max(f(x), g(x)) \text{ is convex}

3. Affine composition preserves convexity:

h(x)=f(Ax+b)Β isΒ convexh(x) = f(Ax + b) \text{ is convex}

4. Perspective scaling preserves convexity:

h(x,t)=tβ‹…f(x/t)Β isΒ convexΒ onΒ {(x,t):t>0}h(x, t) = t \cdot f(x/t) \text{ is convex on } \{(x, t) : t > 0\}

5. Partial minimization preserves convexity: If g(x,y)g(x, y) is convex in (x,y)(x, y) and CC is a convex set, then h(x)=inf⁑y∈Cg(x,y)h(x) = \inf_{y \in C} g(x, y) is convex.

6. Supremum of convex functions is convex:

f(x)=sup⁑α∈Afα(x) is convex if each fα is convexf(x) = \sup_{\alpha \in \mathcal{A}} f_\alpha(x) \text{ is convex if each } f_\alpha \text{ is convex}

7. Restriction to a line preserves convexity: ff is convex if and only if g(t)=f(x+tv)g(t) = f(x + tv) is convex in tt for all x∈dom(f)x \in \text{dom}(f) and all directions vv.

Building Complex Convex Functions

These closure properties are extraordinarily powerful. You can build complex convex functions from simple building blocks: norms, quadratics, logs, and exponentials. For example, f(x)=log⁑(ex1+ex2)f(x) = \log(e^{x_1} + e^{x_2}) is convex because it is the composition of the convex function log⁑(β‹…)\log(\cdot) applied to the convex sum of exponentials (log-sum-exp). Most convex functions encountered in ML are constructed this way.


Strongly Convex Functions

DfStrongly Convex Function

A function f:Rnβ†’Rf: \mathbb{R}^n \to \mathbb{R} is mm-strongly convex (m>0m > 0) if for all x,y∈dom(f)x, y \in \text{dom}(f) and θ∈[0,1]\theta \in [0, 1]:

f(ΞΈx+(1βˆ’ΞΈ)y)≀θf(x)+(1βˆ’ΞΈ)f(y)βˆ’m2ΞΈ(1βˆ’ΞΈ)βˆ₯xβˆ’yβˆ₯2f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \frac{m}{2}\theta(1-\theta)\|x - y\|^2

Equivalently, a differentiable ff is mm-strongly convex if and only if:

f(y)β‰₯f(x)+βˆ‡f(x)T(yβˆ’x)+m2βˆ₯yβˆ’xβˆ₯2βˆ€x,yf(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|^2 \quad \forall x, y

Or in terms of the Hessian: βˆ‡2f(x)βͺ°mI\nabla^2 f(x) \succeq mI for all xx.

ThStrong Convexity Gives Linear Convergence

If ff is mm-strongly convex and LL-smooth (gradient is LL-Lipschitz), then gradient descent with step size Ξ±=1/L\alpha = 1/L satisfies:

f(xk)βˆ’fβˆ—β‰€(Lβˆ’mL+m)2k(f(x0)βˆ’fβˆ—)f(x_k) - f^* \leq \left(\frac{L - m}{L + m}\right)^{2k} (f(x_0) - f^*)

The convergence rate ρ=Lβˆ’mL+m<1\rho = \frac{L-m}{L+m} < 1 is linear. The condition number ΞΊ=L/m\kappa = L/m controls the rate: smaller ΞΊ\kappa means faster convergence.

Problem: Strong Convexity of Quadratic

Show that f(x)=12xTAx+bTx+cf(x) = \frac{1}{2}x^TAx + b^Tx + c with A≻0A \succ 0 is mm-strongly convex where m=Ξ»min⁑(A)m = \lambda_{\min}(A).

Solution

The Hessian is βˆ‡2f(x)=A\nabla^2 f(x) = A. Since AA is positive definite with minimum eigenvalue Ξ»min⁑(A)>0\lambda_{\min}(A) > 0, we have Aβͺ°Ξ»min⁑(A)IA \succeq \lambda_{\min}(A) I. Thus ff is Ξ»min⁑(A)\lambda_{\min}(A)-strongly convex. This gives linear convergence rate ρ=Ξ»max⁑(A)βˆ’Ξ»min⁑(A)Ξ»max⁑(A)+Ξ»min⁑(A)\rho = \frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)}.


Convex Optimization Problem

DfConvex Optimization Problem

A convex optimization problem has the form:

min⁑xf(x)s.t.gi(x)≀0β€…β€Š(i=1,…,m),hj(x)=0β€…β€Š(j=1,…,p)\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0 \; (i = 1, \ldots, m), \quad h_j(x) = 0 \; (j = 1, \ldots, p)

where ff and each gig_i are convex functions, and each hjh_j is affine (linear). The feasible set is convex (intersection of convex sets and hyperplanes), and the objective is convex over this set.

Why Affine Equality Constraints Only

Equality constraints must be affine because a nonlinear equality h(x)=0h(x) = 0 defines a non-convex feasible set (it's a lower-dimensional manifold with no "interior"). For example, x2+y2=1x^2 + y^2 = 1 (a circle) is not a convex set β€” you can find two points on the circle whose midpoint is not on the circle.

ThKey Properties of Convex Optimization

1. Local = Global: Every local minimum of a convex problem is a global minimum.

2. Uniqueness: If ff is strictly convex, the minimizer xβˆ—x^* is unique.

3. Optimality conditions: xβˆ—x^* is optimal if and only if:

βˆ‡f(xβˆ—)T(yβˆ’xβˆ—)β‰₯0βˆ€y∈F\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in \mathcal{F}

where F\mathcal{F} is the feasible set. For unconstrained problems, this reduces to βˆ‡f(xβˆ—)=0\nabla f(x^*) = 0.

4. Computational tractability: Convex problems can be solved in polynomial time (in theory and practice) using interior-point methods.

5. Duality: The dual of a convex problem provides lower bounds, optimality certificates, and insight into constraint sensitivity.


Common Convex Problems

Linear Programming (LP)

DfLinear Program

A linear program has the form:

min⁑xcTxs.t.Ax≀b,Aeqx=beq\min_x c^Tx \quad \text{s.t.} \quad Ax \leq b, \quad A_{eq}x = b_{eq}

The objective and constraints are all linear. LP is the simplest class of convex optimization and can be solved in polynomial time using interior-point methods (or the simplex method in practice).

LP Applications

LP appears in resource allocation, scheduling, network flow, production planning, and as a relaxation for combinatorial problems. The dual of an LP is also an LP (LP duality is exact).

Quadratic Programming (QP)

DfQuadratic Program

A quadratic program has the form:

min⁑x12xTQx+cTxs.t.Ax≀b\min_x \frac{1}{2}x^TQx + c^Tx \quad \text{s.t.} \quad Ax \leq b

where Qβͺ°0Q \succeq 0 (positive semidefinite) ensures convexity. If Q≻0Q \succ 0 (positive definite), the problem is strictly convex.

QP Applications

Ridge regression (min⁑βˆ₯yβˆ’Xwβˆ₯2+Ξ»βˆ₯wβˆ₯2\min \|y - Xw\|^2 + \lambda\|w\|^2) is an unconstrained QP. SVMs with linear kernel are QPs. Portfolio optimization (Markowitz) is a QP. Model predictive control (MPC) solves a QP at each time step.

Second-Order Cone Programming (SOCP)

DfSecond-Order Cone Program

An SOCP has the form:

min⁑xcTxs.t.βˆ₯Aix+biβˆ₯2≀ciTx+di(i=1,…,m)\min_x c^Tx \quad \text{s.t.} \quad \|A_ix + b_i\|_2 \leq c_i^Tx + d_i \quad (i = 1, \ldots, m)

SOCPs generalize LPs and QPs. The constraint βˆ₯Ax+bβˆ₯2≀cTx+d\|Ax + b\|_2 \leq c^Tx + d is a second-order (Lorentz) cone constraint.

SOCP Applications

SOCPs arise in robust optimization (worst-case constraints), portfolio optimization with transaction costs, antenna design, and signal processing. Many problems that appear nonlinear can be reformulated as SOCPs.

Semidefinite Programming (SDP)

DfSemidefinite Program

An SDP has the form:

min⁑Xtr(CX)s.t.tr(AiX)=biβ€…β€Š(i=1,…,m),Xβͺ°0\min_X \text{tr}(CX) \quad \text{s.t.} \quad \text{tr}(A_iX) = b_i \; (i = 1, \ldots, m), \quad X \succeq 0

The variable XX is a symmetric positive semidefinite matrix. SDPs generalize LPs (when XX is diagonal) and are among the most powerful convex optimization formulations.

SDP Applications

SDPs appear in relaxation of combinatorial problems (MAX-CUT, graph coloring), control theory (LMI conditions), quantum information, and polynomial optimization. The Goemans-Williamson algorithm for MAX-CUT uses an SDP relaxation.

Problem Hierarchy

ThConvex Optimization Hierarchy

Each problem class is a special case of the one below it:

LPβŠ‚QPβŠ‚SOCPβŠ‚SDPβŠ‚Conic\text{LP} \subset \text{QP} \subset \text{SOCP} \subset \text{SDP} \subset \text{Conic}

Solvers like MOSEK and ECOS handle SOCPs and SDPs natively. For LPs, solvers like Gurobi and CLPK are extremely fast. CVXPY automatically transforms problems into the appropriate form.


Duality

DfLagrangian Dual

Given the convex optimization problem:

min⁑xf(x)s.t.gi(x)≀0β€…β€Š(i=1,…,m),hj(x)=0β€…β€Š(j=1,…,p)\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0 \; (i=1,\ldots,m), \quad h_j(x) = 0 \; (j=1,\ldots,p)

the Lagrangian is:

L(x,Ξ»,Ξ½)=f(x)+βˆ‘i=1mΞ»igi(x)+βˆ‘j=1pΞ½jhj(x)\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)

The Lagrangian dual function is:

g(λ,ν)=inf⁑x∈dom(f)L(x,λ,ν)g(\lambda, \nu) = \inf_{x \in \text{dom}(f)} \mathcal{L}(x, \lambda, \nu)

The dual problem is:

max⁑λβ‰₯0, νg(Ξ»,Ξ½)\max_{\lambda \geq 0, \, \nu} g(\lambda, \nu)

Intuition

The dual function g(Ξ»,Ξ½)g(\lambda, \nu) provides a lower bound on the optimal value pβˆ—p^* of the primal problem for any Ξ»β‰₯0,Ξ½\lambda \geq 0, \nu. By maximizing this lower bound, we find the tightest possible bound β€” this is the dual problem. The dual is always convex (even if the primal is not), because it is the pointwise infimum of affine functions in (Ξ»,Ξ½)(\lambda, \nu).

Weak and Strong Duality

ThWeak and Strong Duality

Let pβˆ—p^* be the optimal value of the primal and dβˆ—d^* be the optimal value of the dual.

Weak Duality (always holds):

dβˆ—β‰€pβˆ—d^* \leq p^*

The dual optimal value is always a lower bound on the primal optimal value. The gap pβˆ—βˆ’dβˆ—p^* - d^* is called the duality gap.

Strong Duality (for convex problems under constraint qualifications):

dβˆ—=pβˆ—d^* = p^*

Strong duality holds for convex problems when Slater's condition is satisfied: there exists a strictly feasible point xx with gi(x)<0g_i(x) < 0 for all ii and hj(x)=0h_j(x) = 0 for all jj.

KKT Conditions and Strong Duality

When strong duality holds, the KKT conditions are both necessary and sufficient for optimality. The dual variables Ξ»i,Ξ½j\lambda_i, \nu_j at the optimum are the Lagrange multipliers. Complementary slackness (Ξ»igi(xβˆ—)=0\lambda_i g_i(x^*) = 0) ensures that only active constraints have nonzero multipliers. The dual variables quantify how much the optimal value changes per unit relaxation of each constraint β€” they are shadow prices.

Dual Interpretation in ML

Dual Variables in Machine Learning

In SVMs, the dual variables Ξ±i\alpha_i are nonzero only for support vectors. In regularization, the dual variable Ξ»\lambda represents the trade-off between fit and complexity. In fairness-constrained ML, dual variables measure the "cost of fairness" β€” how much accuracy is sacrificed per unit of fairness enforcement. Understanding duality gives you insight into which constraints are binding and which data points matter.


Python Implementation

Using CVXPY

import cvxpy as cp
import numpy as np

# ============================================
# Example 1: Linear Program
# Minimize cost subject to constraints
# ============================================

n = 4  # number of variables
c = np.array([1.0, 2.0, 3.0, 4.0])  # cost vector
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
b = np.array([10.0, 8.0, 12.0])

x = cp.Variable(n)
constraints = [A @ x <= b, x >= 0]
objective = cp.Minimize(c @ x)
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"Status: {prob.status}")
print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x.value}")

# ============================================
# Example 2: Quadratic Program (Ridge Regression)
# ============================================

np.random.seed(42)
m, n = 50, 10
X = np.random.randn(m, n)
y = X @ np.random.randn(n) + 0.1 * np.random.randn(m)

w = cp.Variable(n)
lam = 0.5  # regularization parameter
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ w) + lam * cp.squares(w))
prob = cp.Problem(objective)
prob.solve()

print(f"\nRidge regression:")
print(f"Optimal w: {w.value[:3]}...")
print(f"Optimal loss: {prob.value:.4f}")

# ============================================
# Example 3: Second-Order Cone Program
# ============================================

x_socp = cp.Variable(3)
A_socp = np.array([[1, 0, 0], [0, 1, 0]])
b_socp = np.array([0.5, 0.5])
c_socp = np.array([1.0, 2.0, 3.0])

objective = cp.Minimize(c_socp @ x_socp)
constraints = [cp.norm(A_socp @ x_socp - b_socp, 2) <= 1.0]
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"\nSOCP:")
print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x_socp.value}")

# ============================================
# Example 4: Semidefinite Program
# ============================================

X_sdp = cp.Variable((3, 3), symmetric=True)
C = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
A1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])

objective = cp.Minimize(cp.trace(C @ X_sdp))
constraints = [cp.trace(A1 @ X_sdp) == 1, X_sdp >> 0]  # X >> 0 means PSD
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"\nSDP:")
print(f"Optimal value: {prob.value:.4f}")
print(f"X is PSD: {np.all(np.linalg.eigvals(X_sdp.value) > -1e-8)}")

Using SciPy

from scipy.optimize import minimize, LinearConstraint, NonlinearConstraint

# ============================================
# Example 1: Unconstrained convex function
# ============================================

f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2
grad_f = lambda x: np.array([2*(x[0] - 1), 2*(x[1] - 2)])

result = minimize(f, x0=[0.0, 0.0], jac=grad_f, method='L-BFGS-B')
print(f"Unconstrained minimum: x = {result.x}, f(x) = {result.fun:.6f}")

# ============================================
# Example 2: Linear constraints (box constraints)
# ============================================

f2 = lambda x: (x[0] - 2)**2 + (x[1] - 3)**2
bounds = [(0, 4), (0, 4)]

result = minimize(f2, x0=[0, 0], bounds=bounds, method='L-BFGS-B')
print(f"Box-constrained: x = {result.x}, f(x) = {result.fun:.6f}")

# ============================================
# Example 3: Equality and inequality constraints
# ============================================

f3 = lambda x: x[0]**2 + x[1]**2 + x[2]**2
jac_f3 = lambda x: np.array([2*x[0], 2*x[1], 2*x[2]])

constraints = [
    {'type': 'eq', 'fun': lambda x: x[0] + x[1] + x[2] - 1},
    {'type': 'ineq', 'fun': lambda x: x[0] - 0.1},  # x[0] >= 0.1
    {'type': 'ineq', 'fun': lambda x: x[1] - 0.1},  # x[1] >= 0.1
]

result = minimize(f3, x0=[1/3, 1/3, 1/3], jac=jac_f3, constraints=constraints, method='SLSQP')
print(f"Constrained: x = {result.x}, f(x) = {result.fun:.6f}")

Applications in AI/ML

Support Vector Machines

SVM as Convex Optimization

The SVM primal problem is:

min⁑w,b12βˆ₯wβˆ₯2s.t.yi(wTxi+b)β‰₯1β€…β€Šβˆ€i\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 \; \forall i

This is a convex QP. The dual is also a convex QP, and complementary slackness reveals that only support vectors (Ξ±i>0\alpha_i > 0) determine the decision boundary. The kernel trick works because the dual depends only on dot products xiTxjx_i^Tx_j.

Ridge Regression (Tikhonov Regularization)

Ridge Regression

wβˆ—=arg⁑min⁑wβˆ₯yβˆ’Xwβˆ₯22+Ξ»βˆ₯wβˆ₯22w^* = \arg\min_w \|y - Xw\|_2^2 + \lambda\|w\|_2^2

Here,

  • XX=Design matrix (n x p)
  • yy=Response vector (n x 1)
  • Ξ»\lambda=Regularization strength
  • ww=Parameter vector (p x 1)

The closed-form solution is wβˆ—=(XTX+Ξ»I)βˆ’1XTyw^* = (X^TX + \lambda I)^{-1}X^Ty. This is a strictly convex problem (quadratic with XTX+Ξ»I≻0X^TX + \lambda I \succ 0), guaranteeing a unique global minimum. Ridge regression is an unconstrained QP and is convex for any Ξ»>0\lambda > 0.

LASSO Regression

LASSO Regression

wβˆ—=arg⁑min⁑wβˆ₯yβˆ’Xwβˆ₯22+Ξ»βˆ₯wβˆ₯1w^* = \arg\min_w \|y - Xw\|_2^2 + \lambda\|w\|_1

Here,

  • βˆ₯wβˆ₯1\|w\|_1=L1 norm inducing sparsity
  • Ξ»\lambda=Regularization strength

LASSO is convex (the L1 norm is convex) but not differentiable at wi=0w_i = 0. It promotes sparsity β€” many coefficients are driven exactly to zero, performing automatic feature selection. The proximal gradient method (ISTA/FISTA) is the standard solver.

Logistic Regression

Logistic Regression as Convex Optimization

The negative log-likelihood for logistic regression is:

f(w)=βˆ’βˆ‘i=1n[yilog⁑(Οƒ(wTxi))+(1βˆ’yi)log⁑(1βˆ’Οƒ(wTxi))]f(w) = -\sum_{i=1}^{n} \left[ y_i \log(\sigma(w^Tx_i)) + (1-y_i)\log(1-\sigma(w^Tx_i)) \right]

This is convex in ww (though nonlinear), and is a fundamental building block of ML pipelines. It can be solved by gradient descent, Newton's method, or interior-point methods.

Neural Network Training (Non-Convex but Informed)

Convexity in Deep Learning

Training a neural network is non-convex β€” the loss landscape has many local minima and saddle points. However, convex optimization concepts still apply: (1) many local minima are nearly as good as the global minimum; (2) batch normalization and skip connections make the landscape "more convex"; (3) second-order methods use Hessian information; (4) convex relaxations provide bounds on the global minimum.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Assuming fβ€²β€²(x)>0f''(x) > 0 means convex everywherefβ€²β€²(x)β‰₯0f''(x) \geq 0 is necessary for convexity, but a function can be convex even if fβ€²β€²=0f'' = 0 at isolated points (e.g., f(x)=x4f(x) = x^4)Check βˆ‡2f(x)βͺ°0\nabla^2 f(x) \succeq 0 for all xx in the domain
Confusing convex and concaveMany people mix up "convex up" vs "convex down"Convex = bowl-shaped (holds water); concave = umbrella-shaped
Treating local minima as global minima in non-convex problemsLocal minima in non-convex problems may be far from global minimaUse convex relaxations, multi-start, or second-order methods
Forgetting that equality constraints must be affineh(x)=x2+y2βˆ’1=0h(x) = x^2 + y^2 - 1 = 0 does NOT define a convex feasible setNonlinear equalities create non-convex manifolds
Ignoring Slater's conditionStrong duality requires strict feasibilityVerify βˆƒx:gi(x)<0\exists x: g_i(x) < 0 and hj(x)=0h_j(x) = 0
Assuming all norms lead to convex problemsSome non-convex norms (like L0L_0) do notUse L1L_1, L2L_2, or other convex norms
Using gradient descent without checking Lipschitz continuityNon-Lipschitz gradients can cause divergenceVerify or bound the Lipschitz constant LL
Not checking if the Hessian is PSDUsing Newton's method on a non-convex function may find saddle pointsCheck eigenvalues of βˆ‡2f\nabla^2 f or use trust-region methods
Assuming the sum of convex functions is always strictly convexSum of convex functions is convex, but may not be strictly convexf(x)=x4+(βˆ’x4)=0f(x) = x^4 + (-x^4) = 0 is convex but not strictly convex
Forgetting perspective scalingf(x/t)β‹…tf(x/t) \cdot t is convex in (x,t)(x, t) but many miss thisUse perspective to handle fractional objectives

Interview Questions

Q1: What is the difference between convex and strictly convex?

Answer

A convex function satisfies f(ΞΈx+(1βˆ’ΞΈ)y)≀θf(x)+(1βˆ’ΞΈ)f(y)f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y). A strictly convex function satisfies the strict inequality f(ΞΈx+(1βˆ’ΞΈ)y)<ΞΈf(x)+(1βˆ’ΞΈ)f(y)f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y) for θ∈(0,1)\theta \in (0,1) and xβ‰ yx \neq y. Convexity guarantees local = global; strict convexity additionally guarantees the global minimum is unique. However, strict convexity does not imply strong convexity β€” f(x)=x4f(x) = x^4 is strictly convex but not strongly convex (the Hessian vanishes at 0).

Q2: Why are convex problems easier to solve than non-convex problems?

Answer

Convex problems have three key properties: (1) every local minimum is a global minimum, so first-order methods (gradient descent) can find the optimum without getting stuck in local minima; (2) the KKT conditions are necessary and sufficient for optimality, providing a verifiable optimality certificate; (3) polynomial-time algorithms exist (interior-point methods). Non-convex problems may have exponentially many local minima, no efficient way to certify global optimality, and can be NP-hard in the worst case.

Q3: What is Slater's condition and why does it matter?

Answer

Slater's condition requires the existence of a strictly feasible point: βˆƒx\exists x in the relative interior of dom(f)\text{dom}(f) such that gi(x)<0g_i(x) < 0 for all inequality constraints and hj(x)=0h_j(x) = 0 for all equality constraints. When Slater's condition holds for a convex problem, strong duality holds (pβˆ—=dβˆ—p^* = d^*), meaning the dual optimal value equals the primal optimal value and the duality gap is zero. This enables: (1) recovering the primal solution from the dual; (2) using dual methods to solve the primal; (3) obtaining sensitivity information from dual variables. For linear programs, strong duality always holds (no Slater's condition needed).

Q4: How does strong convexity improve convergence?

Answer

Strong convexity (βˆ‡2fβͺ°mI\nabla^2 f \succeq mI) ensures the gradient grows linearly away from the optimum: βˆ₯βˆ‡f(x)βˆ₯β‰₯mβˆ₯xβˆ’xβˆ—βˆ₯\|\nabla f(x)\| \geq m\|x - x^*\|. This gives linear convergence for gradient descent with rate ρ=Lβˆ’mL+m\rho = \frac{L-m}{L+m}, compared to sublinear O(1/k)O(1/k) for general convex functions. The condition number ΞΊ=L/m\kappa = L/m controls the rate: ΞΊ=1\kappa = 1 (perfectly conditioned) gives immediate convergence; large ΞΊ\kappa gives slow convergence. Preconditioning reduces ΞΊ\kappa by transforming the problem.

Q5: Explain the dual of an SVM and its significance.

Answer

The SVM primal maximizes the margin 2βˆ₯wβˆ₯\frac{2}{\|w\|} subject to yi(wTxi+b)β‰₯1y_i(w^Tx_i + b) \geq 1. Introducing dual variables Ξ±iβ‰₯0\alpha_i \geq 0 and applying KKT yields the dual: maxβ‘Ξ±βˆ‘Ξ±iβˆ’12βˆ‘i,jΞ±iΞ±jyiyjxiTxj\max_\alpha \sum \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_iy_j x_i^Tx_j subject to Ξ±iβ‰₯0\alpha_i \geq 0 and βˆ‘Ξ±iyi=0\sum \alpha_iy_i = 0. Significance: (1) it depends only on dot products, enabling the kernel trick; (2) complementary slackness (Ξ±i>0\alpha_i > 0 only for support vectors) identifies which training points matter; (3) the dual is a QP with nn variables, independent of the feature dimension β€” useful when nβ‰ͺdn \ll d.

Q6: What happens when strong duality does not hold?

Answer

When strong duality fails, there is a nonzero duality gap pβˆ—βˆ’dβˆ—>0p^* - d^* > 0. The dual provides a lower bound but not the exact value. This can happen when: (1) the problem is not convex (Slater's condition is sufficient but not necessary for convex problems); (2) the problem is convex but Slater's condition fails (no strictly feasible point exists); (3) the problem involves integer constraints (combinatorial optimization). In practice, the duality gap can be used as a stopping criterion for interior-point methods β€” terminate when pβˆ—βˆ’dβˆ—<Ο΅p^* - d^* < \epsilon.


Practice Problems

Problem 1: Verify Convexity

Convexity Check

Show that f(x,y)=ex+y+x2+y2f(x, y) = e^{x+y} + x^2 + y^2 is convex on R2\mathbb{R}^2.

Solution

Compute the Hessian:

βˆ‡2f=(ex+y+2ex+yex+yex+y+2)\nabla^2 f = \begin{pmatrix} e^{x+y} + 2 & e^{x+y} \\ e^{x+y} & e^{x+y} + 2 \end{pmatrix}

The eigenvalues of (abba)\begin{pmatrix} a & b \\ b & a \end{pmatrix} are a+ba + b and aβˆ’ba - b where a=ex+y+2a = e^{x+y} + 2 and b=ex+yb = e^{x+y}.

Eigenvalues: Ξ»1=2ex+y+2>0\lambda_1 = 2e^{x+y} + 2 > 0 and Ξ»2=2>0\lambda_2 = 2 > 0.

Since βˆ‡2f≻0\nabla^2 f \succ 0 everywhere, ff is strictly convex.


Problem 2: Duality Gap

Non-Convex Problem

Consider f(x)=x3f(x) = x^3 on the interval [βˆ’1,1][-1, 1]. What can you say about the duality gap if we form a Lagrangian dual?

Solution

f(x)=x3f(x) = x^3 is not convex (it has an inflection point at x=0x = 0). The unconstrained minimum on [βˆ’1,1][-1, 1] is at x=βˆ’1x = -1 with f(βˆ’1)=βˆ’1f(-1) = -1. However, since the problem is non-convex, we cannot guarantee that the dual provides a tight lower bound β€” there may be a duality gap. The KKT conditions may identify saddle points or local minima that are not global.


Problem 3: Ridge Regression Closed Form

Derive Ridge Regression Solution

Derive the closed-form solution for wβˆ—=arg⁑min⁑wβˆ₯yβˆ’Xwβˆ₯2+Ξ»βˆ₯wβˆ₯2w^* = \arg\min_w \|y - Xw\|^2 + \lambda\|w\|^2.

Solution

Expand the objective:

f(w)=yTyβˆ’2yTXw+wTXTXw+Ξ»wTwf(w) = y^Ty - 2y^TXw + w^TX^TXw + \lambda w^Tw

Take the gradient and set to zero:

βˆ‡f(w)=βˆ’2XTy+2XTXw+2Ξ»w=0\nabla f(w) = -2X^Ty + 2X^TXw + 2\lambda w = 0
XTXw+Ξ»w=XTyX^TXw + \lambda w = X^Ty
(XTX+Ξ»I)w=XTy(X^TX + \lambda I)w = X^Ty
wβˆ—=(XTX+Ξ»I)βˆ’1XTyw^* = (X^TX + \lambda I)^{-1}X^Ty

This exists for all Ξ»>0\lambda > 0 because XTX+Ξ»I≻0X^TX + \lambda I \succ 0. The problem is strictly convex, so this is the unique global minimum.


Problem 4: Strong Convexity Certificate

Strong Convexity of Logistic Loss

Show that the logistic loss β„“(z)=log⁑(1+eβˆ’z)\ell(z) = \log(1 + e^{-z}) is convex but not strongly convex.

Solution

First derivative: β„“β€²(z)=βˆ’eβˆ’z1+eβˆ’z=βˆ’Οƒ(βˆ’z)\ell'(z) = \frac{-e^{-z}}{1 + e^{-z}} = -\sigma(-z).

Second derivative: β„“β€²β€²(z)=Οƒ(z)(1βˆ’Οƒ(z))\ell''(z) = \sigma(z)(1 - \sigma(z)) where Οƒ(z)=11+eβˆ’z\sigma(z) = \frac{1}{1+e^{-z}}.

Since Οƒ(z)∈(0,1)\sigma(z) \in (0, 1), we have β„“β€²β€²(z)>0\ell''(z) > 0 for all zz, so the logistic loss is strictly convex.

However, β„“β€²β€²(z)β†’0\ell''(z) \to 0 as ∣zβˆ£β†’βˆž|z| \to \infty (since Οƒ(z)(1βˆ’Οƒ(z))β†’0\sigma(z)(1-\sigma(z)) \to 0). There is no m>0m > 0 such that β„“β€²β€²(z)β‰₯m\ell''(z) \geq m for all zz. Therefore the logistic loss is convex and strictly convex, but not strongly convex.


Quick Reference

Key Takeaways

  • Convex Set: A set CC is convex if every convex combination ΞΈx+(1βˆ’ΞΈ)y∈C\theta x + (1-\theta)y \in C for θ∈[0,1]\theta \in [0,1]. Intersections of convex sets are convex.
  • Convex Function: ff is convex if f(ΞΈx+(1βˆ’ΞΈ)y)≀θf(x)+(1βˆ’ΞΈ)f(y)f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y). First-order condition: tangent hyperplanes are global underestimators. Second-order: βˆ‡2f(x)βͺ°0\nabla^2 f(x) \succeq 0.
  • Strong Convexity: ff is mm-strongly convex if βˆ‡2f(x)βͺ°mI\nabla^2 f(x) \succeq mI. This gives linear convergence rate ρ=Lβˆ’mL+m\rho = \frac{L-m}{L+m} for gradient descent.
  • Convex Optimization Problem: Minimize convex ff subject to convex gi≀0g_i \leq 0 and affine hj=0h_j = 0. Every local minimum is a global minimum.
  • Problem Hierarchy: LP βŠ‚\subset QP βŠ‚\subset SOCP βŠ‚\subset SDP. Each class includes the previous one.
  • Duality: The Lagrangian dual provides a lower bound. Strong duality (dβˆ—=pβˆ—d^* = p^*) holds for convex problems under Slater's condition.
  • KKT Conditions: Stationarity, primal feasibility, dual feasibility (Ξ»β‰₯0\lambda \geq 0), and complementary slackness (Ξ»igi=0\lambda_i g_i = 0) are necessary and sufficient for optimality when strong duality holds.
  • Applications: SVMs (QP with kernel trick), ridge regression (closed-form QP), LASSO (convex but non-smooth), logistic regression (convex nonlinear).
  • Python: Use CVXPY for modeling and solving convex problems; use SciPy for smaller problems with manual constraint specification.
  • Key Insight: Convexity is the dividing line between "tractable" and "intractable" optimization. If your problem is convex, you can solve it efficiently and certify global optimality.

Cross-References

  • Constrained Optimization: KKT conditions, penalty methods, and interior-point methods -> Constrained Optimization
  • Gradient Descent: First-order methods for convex optimization -> Gradient Descent
  • Newton's Method: Second-order methods using Hessian curvature -> Newton's Method
  • Linear Algebra: Positive Definite Matrices: Hessian and PSD conditions for convexity -> Positive Definite
  • Linear Algebra: Norms: Norms used in regularization constraints -> Norms
  • Matrix Calculus: Derivatives of matrix-valued functions for SDP and SVM -> Matrix Calculus
  • Linear Programming: Specialized methods for LP problems -> Linear Programming
  • Quadratic Programming: Specialized methods for QP problems -> Quadratic Programming
  • Lagrange Multipliers: Foundation for duality and KKT -> Lagrange Multipliers
  • Multivariable Calculus: Gradients, Jacobians, and Hessians -> Multivariable Calculus
⭐

Premium Content

Convex Optimization

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Mathematics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement