Calculus Optimization | ChatWhole Learn

Why It Matters

Optimization is the mathematical backbone of machine learning. Every time a neural network trains, every time a model tunes its hyperparameters, every time an algorithm minimizes a loss function — optimization is happening. Understanding calculus-based optimization gives you the theoretical foundation to not just use ML tools, but to understand why they work, when they fail, and how to improve them. From finding the minimum of a simple quadratic to training billion-parameter models, the core principles are the same: find where the derivative is zero, verify it is a minimum, and iterate.

The Big Picture

Optimization connects calculus to every quantitative discipline: physics (least action), economics (utility maximization), engineering (design optimization), and AI (loss minimization). Mastering this topic unlocks deep understanding across all these fields.

What is Optimization

DfOptimization

Optimization is the process of finding the input values that make a function achieve its maximum or minimum value. Formally, given a function $f: \mathbb{R}^n \to \mathbb{R}$ , we seek a point $\vec{x}^*$ such that $f(\vec{x}^*) \leq f(\vec{x})$ for all $\vec{x}$ in some domain (minimization) or $f(\vec{x}^*) \geq f(\vec{x})$ for all $\vec{x}$ (maximization).

Standard Optimization Problem

\min_{\vec{x} \in \mathcal{D}} f(\vec{x})

Here,

$f(\vec{x})$ =The objective function to minimize
$\vec{x}$ =The decision variable (input vector)
$\mathcal{D}$ =The feasible domain or constraint set
$\vec{x}^*$ =The optimal solution (minimizer)

Minimization vs Maximization

Maximizing $f(x)$ is equivalent to minimizing $-f(x)$ . In practice, almost all optimization literature and software frames problems as minimization. When you see a "maximization" problem in ML (like maximizing log-likelihood), it is converted to minimization by taking the negative.

Critical Points

DfCritical Points

A critical point of a function $f(x)$ is any point $c$ in the domain where either $f'(c) = 0$ or $f'(c)$ does not exist. Critical points are candidates for local maxima and minima — they are necessary but not sufficient conditions for optimality.

Critical Point Conditions

f'(c) = 0 \quad \text{or} \quad f'(c) \text{ is undefined}

Here,

$c$ =A point in the domain of f
$f'(c) = 0$ =Horizontal tangent (stationary point)
$f'(c) \text{ undefined}$ =Cusp, corner, or vertical tangent

Critical ≠ Optimal

Not every critical point is an extremum. The function $f(x) = x^3$ has $f'(0) = 0$ , but $x = 0$ is an inflection point, not a max or min. You must use the first or second derivative test to classify critical points.

Finding Critical Points

Problem: Find all critical points of $f(x) = x^3 - 6x^2 + 9x + 1$ .

Solution

f'(x) = 3x^2 - 12x + 9 = 3(x^2 - 4x + 3) = 3(x-1)(x-3)

Setting $f'(x) = 0$ : $x = 1$ and $x = 3$ .

$f(1) = 1 - 6 + 9 + 1 = 5$ and $f(3) = 27 - 54 + 27 + 1 = 1$ .

So the critical points are $(1, 5)$ and $(3, 1)$ .

First Derivative Test

ThFirst Derivative Test

Let $f$ be continuous at a critical point $c$ where $f'(c) = 0$ . If $f'$ changes sign at $c$ :

If $f'$ changes from negative to positive at $c$ , then $f$ has a local minimum at $c$ .
If $f'$ changes from positive to negative at $c$ , then $f$ has a local maximum at $c$ .
If $f'$ does not change sign at $c$ , then $c$ is neither a local maximum nor minimum (it is an inflection point).

Intuition

The first derivative test works because the sign of $f'$ tells you whether the function is increasing or decreasing. If $f$ is decreasing (negative derivative) before $c$ and increasing (positive derivative) after $c$ , the function must have "bottomed out" at $c$ — a local minimum.

First Derivative Test

Problem: Classify the critical points of $f(x) = x^4 - 4x^3 + 6x^2 - 4x$ .

Solution

f'(x) = 4x^3 - 12x^2 + 12x - 4 = 4(x-1)^3

Setting $f'(x) = 0$ : $x = 1$ is the only critical point.

Test signs: For $x < 1$ , $f'(x) = 4(x-1)^3 < 0$ . For $x > 1$ , $f'(x) = 4(x-1)^3 > 0$ .

Since $f'$ changes from negative to positive at $x = 1$ , there is a local minimum at $x = 1$ with $f(1) = 1 - 4 + 6 - 4 = -1$ .

Second Derivative Test

ThSecond Derivative Test

Let $f$ be twice differentiable at a critical point $c$ where $f'(c) = 0$ :

If $f''(c) > 0$ , then $f$ has a local minimum at $c$ .
If $f''(c) < 0$ , then $f$ has a local maximum at $c$ .
If $f''(c) = 0$ , the test is inconclusive (use the first derivative test or higher-order analysis).

Why It Works

If $f''(c) > 0$ , the function is concave up at $c$ , meaning it curves upward like a cup — the bottom of the cup is a minimum. If $f''(c) < 0$ , the function is concave down, curving downward like a dome — the top is a maximum.

Second Derivative Test Formula

f''(c) \begin{cases} > 0 \implies \text{local min} \\ < 0 \implies \text{local max} \\ = 0 \implies \text{inconclusive} \end{cases}

Here,

$f''(c)$ =Second derivative evaluated at critical point c
$f'(c) = 0$ =Prerequisite: c must be a stationary point

Second Derivative Test

Problem: Find and classify the extrema of $f(x) = 2x^3 - 9x^2 + 12x - 3$ .

Solution

f'(x) = 6x^2 - 18x + 12 = 6(x-1)(x-2) = 0 \implies x = 1, x = 2

f''(x) = 12x - 18

At $x = 1$ : $f''(1) = 12 - 18 = -6 < 0$ -> local maximum, $f(1) = 2$ .

At $x = 2$ : $f''(2) = 24 - 18 = 6 > 0$ -> local minimum, $f(2) = 1$ .

Global vs Local Extrema

ThGlobal vs Local Extrema

A local (relative) minimum at $c$ means $f(c) \leq f(x)$ for all $x$ in some open interval containing $c$ .
A global (absolute) minimum at $c$ means $f(c) \leq f(x)$ for all $x$ in the entire domain.
Every global extremum is a local extremum, but not every local extremum is global.

ThExtreme Value Theorem

If $f$ is continuous on a closed, bounded interval $[a, b]$ , then $f$ attains both a global maximum and a global minimum on $[a, b]$ . The global extrema occur either at critical points or at the endpoints $a$ and $b$ .

Procedure for Global Extrema on [a,b]

To find the absolute maximum and minimum of $f$ on $[a, b]$ :

Find all critical points of $f$ in $(a, b)$ .
Evaluate $f$ at each critical point.
Evaluate $f$ at the endpoints $a$ and $b$ .
The largest value is the global maximum; the smallest is the global minimum.

Unbounded Domains

On unbounded domains (e.g., $(-\infty, \infty)$ ), a continuous function may have no global extrema. For example, $f(x) = x^2$ has a global minimum at $x = 0$ , but $f(x) = x^3$ has neither a global max nor min. Always check whether the function grows without bound in both directions.

Global Extrema

Problem: Find the global extrema of $f(x) = x^3 - 3x + 2$ on $[-2, 2]$ .

Solution

f'(x) = 3x^2 - 3 = 3(x-1)(x+1) = 0 \implies x = -1, x = 1

Evaluate: $f(-2) = -8 + 6 + 2 = 0$ , $f(-1) = -1 + 3 + 2 = 4$ , $f(1) = 1 - 3 + 2 = 0$ , $f(2) = 8 - 6 + 2 = 4$ .

Global maximum: $4$ at $x = -1$ and $x = 2$ . Global minimum: $0$ at $x = -2$ and $x = 1$ .

Optimization Process

Step-by-Step Optimization Procedure

For any optimization problem, follow this systematic approach:

Step 1: Identify the objective function. Determine what you are maximizing or minimizing. Write it as a function of one or more variables.

Step 2: Define the domain. Identify constraints on the variables (physical, logical, or problem-imposed). Determine whether the domain is bounded or unbounded.

Step 3: Express as a single-variable problem (if possible). Use constraints to eliminate variables. For example, if $x + y = 10$ , substitute $y = 10 - x$ .

Step 4: Find critical points. Compute the derivative and solve $f'(x) = 0$ . Also check where $f'(x)$ is undefined.

Step 5: Classify critical points. Use the first derivative test (sign analysis) or the second derivative test.

Step 6: Check endpoints (if bounded). Evaluate the function at the boundaries of the domain.

Step 7: Compare all candidates. The largest value is the global maximum; the smallest is the global minimum.

Step 8: Verify and interpret. Check that the answer makes sense in the context of the problem.

Worked Examples

Example 1: Maximum Volume of a Box

Open Box from Cardboard

Problem: A square piece of cardboard of side 12 inches has squares cut from each corner. The sides are folded up to form an open box. Find the dimensions that maximize the volume.

Solution

Let $x$ = side length of cut squares. Then the box has dimensions $(12-2x) \times (12-2x) \times x$ .

V(x) = x(12-2x)^2 = x(144 - 48x + 4x^2) = 4x^3 - 48x^2 + 144x

Domain: $0 < x < 6$ (cut cannot exceed half the side).

V'(x) = 12x^2 - 96x + 144 = 12(x^2 - 8x + 12) = 12(x-2)(x-6)

$V'(x) = 0 \implies x = 2$ (since $x = 6$ is at the boundary).

$V''(x) = 24x - 96$ , so $V''(2) = 48 - 96 = -48 < 0$ -> local maximum.

Dimensions: $x = 2$ , base $= 8 \times 8$ , height $= 2$ . Maximum volume $= 32$ cubic inches.

Example 2: Shortest Distance

Shortest Distance from Point to Curve

Problem: Find the point on $y = x^2$ closest to $(0, 3)$ .

Solution

Distance squared: $D(x) = x^2 + (x^2 - 3)^2 = x^2 + x^4 - 6x^2 + 9 = x^4 - 5x^2 + 9$

Minimizing $D(x)$ is equivalent to minimizing distance (since $\sqrt{\cdot}$ is monotonic).

$D'(x) = 4x^3 - 10x = 2x(2x^2 - 5) = 0 \implies x = 0$ or $x = \pm\sqrt{5/2}$

D''(x) = 12x^2 - 10

At $x = 0$ : $D''(0) = -10 < 0$ -> local maximum (not what we want).

At $x = \pm\sqrt{5/2}$ : $D''(\pm\sqrt{5/2}) = 12(5/2) - 10 = 20 > 0$ -> local minimum.

Closest points: $\left(\sqrt{5/2}, 5/2\right)$ and $\left(-\sqrt{5/2}, 5/2\right)$ .

Example 3: Minimum Surface Area

Cylinder with Fixed Volume

Problem: Find the radius and height of a cylinder with volume $V = 16\pi$ that minimizes surface area.

Solution

Volume: $\pi r^2 h = 16\pi \implies h = \frac{16}{r^2}$

Surface area: $S = 2\pi r^2 + 2\pi r h = 2\pi r^2 + \frac{32\pi}{r}$

S'(r) = 4\pi r - \frac{32\pi}{r^2} = 0 \implies 4r^3 = 32 \implies r^3 = 8 \implies r = 2

h = \frac{16}{4} = 4

$S''(r) = 4\pi + \frac{64\pi}{r^3}$ , so $S''(2) = 4\pi + 8\pi = 12\pi > 0$ -> minimum.

Optimal cylinder: radius $= 2$ , height $= 4$ (height = diameter).

Constrained Optimization

DfConstrained Optimization

Constrained optimization involves minimizing or maximizing a function subject to one or more constraints on the variables. The constraints define a feasible region, and the optimum must lie within or on the boundary of this region.

Standard Constrained Problem

\min_{\vec{x}} f(\vec{x}) \quad \text{subject to} \quad g_i(\vec{x}) = 0, \quad h_j(\vec{x}) \leq 0

Here,

$f(\vec{x})$ =The objective function
$g_i(\vec{x}) = 0$ =Equality constraints
$h_j(\vec{x}) \leq 0$ =Inequality constraints

DfLagrange Multipliers

The method of Lagrange multipliers converts a constrained optimization problem into a system of equations. To minimize $f(\vec{x})$ subject to $g(\vec{x}) = 0$ , we solve $\nabla f = \lambda \nabla g$ and $g(\vec{x}) = 0$ simultaneously, where $\lambda$ is the Lagrange multiplier.

Lagrange Condition

\nabla f(\vec{x}^*) = \lambda \nabla g(\vec{x}^*), \quad g(\vec{x}^*) = 0

Here,

$abla f$ =Gradient of the objective function
$lambda$ =Lagrange multiplier (sensitivity measure)
$abla g$ =Gradient of the constraint function
$ec{x}^*$ =The optimal point

Geometric Interpretation

At the optimum, the gradient of the objective function is parallel to the gradient of the constraint. The multiplier $\lambda$ measures how much the optimal value changes if the constraint is relaxed by one unit — it is the "shadow price" of the constraint.

ThKarush-Kuhn-Tucker (KKT) Conditions

For inequality constraints $h_j(\vec{x}) \leq 0$ , the KKT conditions generalize Lagrange multipliers:

Stationarity: $\nabla f + \sum_j \mu_j \nabla h_j = 0$
Primal feasibility: $h_j(\vec{x}^*) \leq 0$ for all $j$
Dual feasibility: $\mu_j \geq 0$ for all $j$
Complementary slackness: $\mu_j h_j(\vec{x}^*) = 0$ for all $j$

Lagrange Multiplier Problem

Problem: Maximize $f(x, y) = xy$ subject to $x + y = 10$ .

Solution

$\nabla f = (y, x)$ , $\nabla g = (1, 1)$ where $g(x,y) = x + y - 10$ .

$\nabla f = \lambda \nabla g \implies y = \lambda, x = \lambda \implies x = y = 5$ .

Constraint: $5 + 5 = 10$ ✓. Maximum value: $f(5,5) = 25$ .

Python Implementation

Univariate Optimization with SciPy

import numpy as np
from scipy.optimize import minimize_scalar

# Minimize f(x) = x^4 - 4x^3 + 6x^2 - 4x + 1
f = lambda x: x**4 - 4*x**3 + 6*x**2 - 4*x + 1

# Bounded search
result = minimize_scalar(f, bounds=(-10, 10), method='bounded')
print(f"Minimum at x = {result.x:.6f}")
print(f"Minimum value = {result.fun:.6f}")

# With derivative information for faster convergence
def f_and_grad(x):
    val = x**4 - 4*x**3 + 6*x**2 - 4*x + 1
    grad = 4*x**3 - 12*x**2 + 12*x - 4
    return val, grad

from scipy.optimize import minimize
result = minimize(lambda x: f_and_grad(x)[0], x0=0.5,
                 jac=lambda x: f_and_grad(x)[1], method='BFGS')
print(f"Optimum at x = {result.x[0]:.6f}")

Multivariate Optimization

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: f(x,y) = (1-x)^2 + 100(y-x^2)^2
def rosenbrock(x):
    return (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dy = 200*(x[1] - x[0]**2)
    return np.array([dx, dy])

# Minimize with gradient
result = minimize(rosenbrock, x0=[-1, 1], jac/rosenbrock_grad, method='L-BFGS-B')
print(f"Optimum: {result.x}, Value: {result.fun:.2e}")

# Without gradient (numerical approximation)
result = minimize(rosenbrock, x0=[-1, 1], method='Nelder-Mead')
print(f"Optimum: {result.x}, Value: {result.fun:.2e}")

Constrained Optimization

import numpy as np
from scipy.optimize import minimize

# Minimize f(x,y) = x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def constraint_eq(x):
    return x[0] + x[1] - 1  # Must equal 0

from scipy.optimize import LinearConstraint, NonlinearConstraint

# Method 1: SLSQP with constraint dict
constraints = {'type': 'eq', 'fun': constraint_eq}
result = minimize(objective, x0=[0.5, 0.5], method='SLSQP', constraints=constraints)
print(f"Optimum: {result.x}, Value: {result.fun:.4f}")
# Expected: [0.5, 0.5] with value 0.5

# Method 2: Lagrange multiplier verification
from scipy.optimize import minimize

def lagrangian(x, lam):
    return x[0]**2 + x[1]**2 + lam * (x[0] + x[1] - 1)

# Solve the KKT system analytically
# x* = y* = 0.5, lambda* = -1
print(f"Verification: f(0.5, 0.5) = {0.5**2 + 0.5**2}")

Gradient Descent Implementation

import numpy as np

def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-8):
    """Basic gradient descent with convergence check."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]

    for i in range(n_iters):
        g = grad_f(x)
        x_new = x - lr * g
        history.append(x_new.copy())

        if np.linalg.norm(x_new - x) < tol:
            print(f"Converged at iteration {i}")
            break
        x = x_new

    return x, np.array(history)

# Minimize f(x,y) = x^2 + 4y^2
f = lambda x: x[0]**2 + 4*x[1]**2
grad = lambda x: np.array([2*x[0], 8*x[1]])

x_opt, hist = gradient_descent(f, grad, [5.0, 5.0], lr=0.1)
print(f"Optimum: {x_opt}")  # Should be near [0, 0]

Applications in AI/ML

Loss Minimization

Loss Functions as Optimization

Every ML model training process is an optimization problem. The loss function $\mathcal{L}(\theta)$ measures how wrong the model is, and training seeks $\theta^* = \arg\min_\theta \mathcal{L}(\theta)$ . Common losses include:

MSE for regression: $\mathcal{L} = \frac{1}{n}\sum(y_i - \hat{y}_i)^2$
Cross-entropy for classification: $\mathcal{L} = -\sum y_i \log \hat{y}_i$
Hinge loss for SVMs: $\mathcal{L} = \sum \max(0, 1 - y_i \hat{y}_i)$

Gradient Descent in Neural Networks

Parameter Update Rule

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)

Here,

$heta_t$ =Model parameters at iteration t
$eta$ =Learning rate (step size)
$\nabla_\theta \mathcal{L}$ =Gradient of loss w.r.t. parameters

Learning Rate Selection

The learning rate $\eta$ is the most important hyperparameter. Too large: the optimizer overshoots and diverges. Too small: convergence is painfully slow. Best practices: start with $\eta = 0.001$ for Adam, $\eta = 0.01$ for SGD; use learning rate scheduling (cosine annealing, warm-up); monitor the loss curve for divergence or plateau.

Hyperparameter Tuning as Optimization

Black-Box Optimization

Hyperparameter tuning (learning rate, batch size, architecture choices) is optimization where the objective function is the validation loss, and the "gradient" is not available. Methods include:

Grid search: Exhaustive search over a predefined grid.
Random search: Sample random configurations (often more efficient than grid).
Bayesian optimization: Build a surrogate model of the objective and use it to choose the next evaluation point.
Evolutionary algorithms: Maintain a population of candidates and evolve them.

Common Mistakes

Mistake	Incorrect	Correct	Why It Matters
Assuming critical point is extremum	$f'(c) = 0 \implies$ min or max	Must test with 1st or 2nd derivative test	Inflection points also have $f' = 0$
Forgetting endpoints on bounded domains	Only check critical points	Also check $f(a)$ and $f(b)$	Global extremum may be at boundary
Confusing local and global extrema	Local min = global min	Local min is only optimal in a neighborhood	ML training often gets stuck in local minima
Wrong second derivative test	$f''(c) = 0 \implies$ minimum	Test is inconclusive; use 1st derivative test	$f(x) = x^4$ has $f''(0) = 0$ but $x = 0$ is a minimum
Ignoring domain constraints	Optimize over all $\mathbb{R}$	Respect physical/logical constraints	Negative lengths, probabilities $> 1$ are invalid
Not checking convergence	Run gradient descent for fixed iterations	Monitor gradient norm or loss change	May stop before reaching optimum
Using wrong learning rate	$\eta = 1.0$ for all problems	Tune $\eta$ per problem (typically $10^{-3}$ to $10^{-1}$ )	Too large diverges, too small is slow
Ignoring saddle points in high dimensions	All critical points are extrema	Saddle points are common in high dimensions	Neural network loss landscapes have many saddle points

Interview Questions

Q1: What is the difference between a local and global minimum?

Answer

A local minimum at $x^*$ means $f(x^*) \leq f(x)$ for all $x$ in some neighborhood of $x^*$ . A global minimum means $f(x^*) \leq f(x)$ for all $x$ in the entire domain. Every global minimum is also a local minimum, but the reverse is not true. In non-convex optimization (like neural network training), there can be many local minima, and finding the global minimum is generally NP-hard.

Q2: Explain the second derivative test. When does it fail?

Answer

The second derivative test classifies a critical point $c$ (where $f'(c) = 0$ ): if $f''(c) > 0$ , $c$ is a local minimum (concave up); if $f''(c) < 0$ , $c$ is a local maximum (concave down). It fails when $f''(c) = 0$ — the point could be a min, max, or inflection point. Example: $f(x) = x^4$ has $f''(0) = 0$ but $x = 0$ is a minimum. $f(x) = -x^4$ has $f''(0) = 0$ but $x = 0$ is a maximum. $f(x) = x^3$ has $f''(0) = 0$ and it is an inflection point. In these cases, use the first derivative test or examine higher-order derivatives.

Q3: Why is non-convex optimization hard for neural networks?

Answer

Neural network loss landscapes are non-convex, meaning they have many local minima, saddle points, and flat regions. Key challenges:

Local minima: Gradient descent may converge to a suboptimal local minimum.
Saddle points: Points where the gradient is zero but it is neither a min nor max; common in high dimensions.
Vanishing/exploding gradients: Deep networks can have gradients that shrink or explode, stalling learning.
Loss plateaus: Flat regions where gradients are near zero, slowing convergence.

Despite this, in practice, most local minima in large neural networks have similar loss values, so finding any "good enough" local minimum often suffices.

Q4: How does the learning rate affect gradient descent convergence?

Answer

The learning rate $\eta$ controls the step size in gradient descent. If $\eta$ is too large, the iterates overshoot the minimum and may diverge (oscillate with increasing amplitude). If $\eta$ is too small, convergence is extremely slow, and the algorithm may get stuck in flat regions. An optimal learning rate allows fast convergence without oscillation. Adaptive methods (Adam, RMSprop, Adagrad) adjust the learning rate per-parameter based on historical gradient information, often converging faster and more reliably than fixed learning rates. Common practice: start with $\eta = 0.001$ for Adam, use learning rate decay, and reduce $\eta$ when the loss plateaus.

Q5: What are KKT conditions and when are they used?

Answer

The Karush-Kuhn-Tucker (KKT) conditions are first-order necessary conditions for a solution in nonlinear programming to be optimal, provided some regularity conditions are satisfied. For a problem with inequality constraints $h_j(\vec{x}) \leq 0$ :

Stationarity: $\nabla f + \sum \mu_j \nabla h_j = 0$
Primal feasibility: $h_j(\vec{x}^*) \leq 0$
Dual feasibility: $\mu_j \geq 0$
Complementary slackness: $\mu_j h_j(\vec{x}^*) = 0$

They are used in constrained optimization, SVM derivation (the dual formulation), and any problem with inequality constraints. Complementary slackness means either the constraint is active ( $h_j = 0$ ) or its multiplier is zero — the constraint does not affect the solution.

Q6: Explain the vanishing gradient problem and its connection to optimization.

Answer

The vanishing gradient problem occurs when gradients become exponentially small as they propagate backward through deep network layers. Since parameter updates are proportional to the gradient ( $\theta \leftarrow \theta - \eta \nabla\mathcal{L}$ ), near-zero gradients mean early layers learn extremely slowly or not at all.

Causes: Saturating activation functions (sigmoid, tanh) have derivatives near zero for large inputs. Deep compositions multiply many small derivatives together.

Solutions: Use ReLU activations (derivative is 0 or 1), batch normalization, residual connections, careful weight initialization (Xavier/He), and gradient clipping.

Practice Problems

Problem 1: Find and Classify Critical Points

Critical Point Analysis

Problem: Find all critical points of $f(x) = 2x^3 - 3x^2 - 12x + 5$ and classify each as a local maximum, local minimum, or neither.

Solution

f'(x) = 6x^2 - 6x - 12 = 6(x^2 - x - 2) = 6(x-2)(x+1) = 0

Critical points: $x = 2$ and $x = -1$ .

f''(x) = 12x - 6

At $x = -1$ : $f''(-1) = -18 < 0$ -> local maximum, $f(-1) = -2 - 3 + 12 + 5 = 12$ .

At $x = 2$ : $f''(2) = 18 > 0$ -> local minimum, $f(2) = 16 - 12 - 24 + 5 = -15$ .

Problem 2: Optimization Word Problem

Minimizing Cost

Problem: A cylindrical can must hold 1000 cm³ of liquid. The material for the top and bottom costs $0.05/cm² and the side costs$ 0.03/cm². Find the dimensions that minimize cost.

Solution

Volume: $\pi r^2 h = 1000 \implies h = \frac{1000}{\pi r^2}$

Cost: $C = 2(0.05)\pi r^2 + 0.03(2\pi r h) = 0.1\pi r^2 + \frac{60}{r}$

$C'(r) = 0.2\pi r - \frac{60}{r^2} = 0 \implies r^3 = \frac{60}{0.2\pi} = \frac{300}{\pi} \implies r = \left(\frac{300}{\pi}\right)^{1/3} \approx 4.57$ cm

$h = \frac{1000}{\pi (4.57)^2} \approx 15.24$ cm

$C''(r) = 0.2\pi + \frac{120}{r^3} > 0$ -> confirmed minimum.

Problem 3: Prove Convexity Implies Global Minimum

Convex Function Property

Problem: Prove that if $f$ is strictly convex on an interval and $f'(c) = 0$ , then $c$ is the unique global minimum.

Solution

By strict convexity, for any $x \neq c$ : $f(x) > f(c) + f'(c)(x - c) = f(c) + 0 = f(c)$ .

Therefore $f(x) > f(c)$ for all $x \neq c$ , which means $c$ is the unique global minimum.

This is why convex optimization is so powerful — any local minimum is automatically the global minimum. This property underpins the reliability of logistic regression, SVMs, and other convex ML models.

Problem 4: Multivariable Optimization

Find the Extrema

Problem: Find and classify all critical points of $f(x, y) = x^3 + y^3 - 3xy$ .

Solution

f_x = 3x^2 - 3y = 0 \implies y = x^2

f_y = 3y^2 - 3x = 0 \implies x = y^2

Substituting: $x = (x^2)^2 = x^4 \implies x^4 - x = 0 \implies x(x^3 - 1) = 0$

So $x = 0$ or $x = 1$ . Corresponding: $(0, 0)$ and $(1, 1)$ .

Hessian: $H = \begin{pmatrix} 6x & -3 \\ -3 & 6y \end{pmatrix}$

At $(0, 0)$ : $\det(H) = 0 - 9 = -9 < 0$ -> saddle point.

At $(1, 1)$ : $\det(H) = 36 - 9 = 27 > 0$ and $f_{xx} = 6 > 0$ -> local minimum, $f(1,1) = -1$ .

Quick Reference

Key Takeaways

Critical Points: Where $f'(x) = 0$ or is undefined; candidates for extrema but not guaranteed to be extrema.
First Derivative Test: Check sign change of $f'$ at critical point. Negative -> positive = minimum; positive -> negative = maximum; no change = inflection.
Second Derivative Test: $f''(c) > 0$ = minimum (concave up), $f''(c) < 0$ = maximum (concave down), $f''(c) = 0$ = inconclusive.
Global Extrema on [a,b]: Compare $f$ at all critical points and endpoints; largest is global max, smallest is global min.
Lagrange Multipliers: For constrained optimization, solve $\nabla f = \lambda \nabla g$ with $g = 0$ ; $\lambda$ measures constraint sensitivity.
KKT Conditions: Generalize Lagrange to inequality constraints; complementary slackness means either constraint is active or its multiplier is zero.
Gradient Descent: $\theta_{t+1} = \theta_t - \eta \nabla\mathcal{L}$ ; the workhorse of ML training; learning rate $\eta$ is the critical hyperparameter.
Convex Functions: Any local minimum is the global minimum; this property makes convex optimization tractable and reliable.
Saddle Points: Critical points that are neither max nor min; common in high-dimensional non-convex problems like neural networks.
Practical Tip: In ML, we rarely find exact minima — we iterate until the loss stops decreasing meaningfully (early stopping).

Cross-References

Derivatives: Foundation for computing gradients needed in optimization -> Derivatives and Differentiation
Chain Rule: Essential for backpropagation and computing gradients of composite functions -> Chain Rule
Partial Derivatives: Gradients in multivariable optimization -> Partial Derivatives
Taylor Series: Quadratic approximation used in Newton's method -> Taylor Series
Lagrange Multipliers: Deep dive into constrained optimization -> Lagrange Multipliers
Multivariable Calculus: Gradients, Jacobians, and Hessians for higher dimensions -> Multivariable Calculus
Convex Optimization: Properties that make optimization tractable -> Convex Optimization
Gradient Descent Algorithms: SGD, Adam, and advanced optimizers -> Gradient Descent
Newton's Method: Second-order optimization using Hessians -> Newton's Method
Constrained Optimization: Linear programming, quadratic programming -> Constrained Optimization
Hyperparameter Tuning: Bayesian optimization and search strategies -> Hyperparameter Optimization