Gradient Descent

Why It Matters

Gradient descent is the foundational optimization algorithm that powers nearly all of modern machine learning. From training a simple linear regression to fine-tuning billion-parameter large language models, gradient descent and its variants are the engine behind every weight update. Understanding its mechanics, convergence properties, and practical pitfalls is essential for any ML practitioner or researcher.

At its core, gradient descent is an iterative method for finding local minima of differentiable functions. The intuition is beautifully simple: if you are standing on a mountain in thick fog and want to reach the valley, feel the slope under your feet and take a step downhill. Repeat until you reach the bottom. Despite this simplicity, the algorithm's interplay with learning rate choice, batch size, and loss landscape geometry creates deep and rich behavior that demands careful study.

The Algorithm

Gradient Descent Update Rule

x_{k+1} = x_k - \alpha_k \nabla f(x_k)

Here,

$x_k$ =Current parameter vector at iteration k
$\alpha_k$ =Learning rate (step size) at iteration k
$\nabla f(x_k)$ =Gradient of the objective function f at x_k
$x_{k+1}$ =Updated parameter vector after one step

Gradient Direction

The negative gradient $-\nabla f(x_k)$ points in the direction of steepest descent. Moving opposite the gradient is guaranteed to decrease $f$ for sufficiently small step sizes, since $\nabla f(x_k)^T \cdot [-\nabla f(x_k)] = -\|\nabla f(x_k)\|^2 < 0$ whenever $\nabla f(x_k) \neq 0$ . This is the fundamental justification for the algorithm.

Learning Rate

DfLearning Rate

The learning rate $\alpha > 0$ is the step size that controls how far we move along the negative gradient direction at each iteration. It is the single most important hyperparameter in gradient descent. The choice of $\alpha$ determines whether the algorithm converges, diverges, or oscillates without reaching a minimum.

Choosing the Learning Rate

Too Large

If $\alpha$ is too large, gradient descent overshoots the minimum and may diverge. For a quadratic $f(x) = \frac{1}{2}ax^2$ , the update becomes $x_{k+1} = x_k(1 - \alpha a)$ . If $|1 - \alpha a| > 1$ , the iterates grow without bound. The algorithm oscillates with increasing amplitude and never converges.

Too Small

If $\alpha$ is too small, convergence is painfully slow. Each step makes negligible progress, and the algorithm may require millions of iterations to reach the minimum. This wastes compute and can get stuck in plateaus or shallow local minima before reaching the true optimum.

Just Right

A well-chosen learning rate balances speed and stability. The iterates make large enough progress to converge in reasonable time while remaining stable enough to approach the minimum without overshooting. In practice, the optimal $\alpha$ depends on the curvature of the loss landscape and is typically found via learning rate schedules or adaptive methods.

Batch Gradient Descent

Batch Gradient Descent (Full Gradient)

x_{k+1} = x_k - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)

Here,

$n$ =Total number of training samples
$f_i(x)$ =Loss for the i-th training sample
$\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)$ =Full gradient averaged over the entire dataset

Batch gradient descent computes the exact gradient using all $n$ training samples before each update. This guarantees the gradient direction is accurate but becomes prohibitively expensive for large datasets. Computing the full gradient requires $O(n)$ forward and backward passes per iteration, which is infeasible when $n$ is in the millions.

Stochastic and Mini-batch Variants

Stochastic Gradient Descent (SGD)

x_{k+1} = x_k - \alpha \cdot \nabla f_{i_k}(x_k)

Here,

$i_k$ =Randomly sampled index from {1, ..., n}
$\nabla f_{i_k}(x_k)$ =Gradient from a single sample (noisy estimate)

Mini-batch Gradient Descent

x_{k+1} = x_k - \alpha \cdot \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

$B_k$ =Random mini-batch of size b sampled at iteration k
$|B_k| = b$ =Mini-batch size (typically 32 to 512)

Variant	Batch Size	Gradient Noise	Per-Iteration Cost	Use Case
Batch GD	All $n$	None	$O(n)$	Small datasets, convex problems
Stochastic GD	1	High	$O(1)$	Online learning, large-scale
Mini-batch GD	$b$	Moderate	$O(b)$	Deep learning (default)

Convergence Analysis

ThConvergence Rate for Convex Functions

Let $f$ be $L$ -Lipschitz continuous and $\mu$ -strongly convex. Then gradient descent with constant step size $\alpha = \frac{1}{L}$ satisfies:

f(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}

This gives a convergence rate of $O(1/k)$ — sublinear convergence. The iterates decrease the optimality gap at a rate inversely proportional to the number of iterations.

ThLinear Convergence for Strongly Convex Functions

If $f$ is $\mu$ -strongly convex and $L$ -smooth, gradient descent with step size $\alpha = \frac{2}{L + \mu}$ satisfies:

f(x_k) - f(x^*) \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^*\|^2

The condition number $\kappa = L/\mu$ governs the convergence rate: $\frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1}$ . When $\kappa = 1$ (perfect conditioning), convergence is immediate. When $\kappa \gg 1$ (ill-conditioned), convergence is extremely slow — the iterates zigzag through narrow valleys.

Non-convex Landscape

For non-convex problems (like neural network training), gradient descent is not guaranteed to find the global minimum. It converges to a stationary point where $\nabla f(x) = 0$ , which could be a local minimum, global minimum, or saddle point. In high-dimensional spaces, saddle points are exponentially more prevalent than local minima, and saddle points with negative curvature are often the primary obstacle to convergence.

Line Search Methods

DfExact Line Search

After computing the search direction $d_k = -\nabla f(x_k)$ , exact line search finds the step size that minimizes $f$ along that direction:

\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)

This is optimal but expensive — it requires evaluating $f$ at many points along the ray. In practice, exact line search is rarely used because each evaluation costs $O(n)$ and the minimization itself is an inner optimization problem.

DfBacktracking Line Search

Backtracking is a practical alternative. Starting with an initial step size (e.g., $\alpha = 1$ ), it repeatedly shrinks $\alpha$ by a factor $\beta \in (0, 1)$ until the Armijo condition is satisfied:

f(x_k + \alpha d_k) \leq f(x_k) + c \alpha \nabla f(x_k)^T d_k

where $c \in (0, 1)$ is a constant (typically $c = 10^{-4}$ ). The parameter $\beta$ (typically $0.5$ ) controls how aggressively we shrink. This guarantees sufficient decrease at each step while avoiding the cost of exact minimization.

Momentum

Momentum Update Rule

\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

$v_k$ =Velocity (accumulated gradient) at iteration k
$\beta$ =Momentum coefficient (typically 0.9)
$\alpha$ =Learning rate

Physics Analogy

Imagine a ball rolling downhill on a frictionless surface. It accumulates velocity as it rolls, gaining speed in directions of consistent descent and slowing down through oscillations. The momentum term $\beta v_{k-1}$ retains a fraction of the previous velocity, so the ball speeds up in directions where the gradient consistently points downhill and decelerates where the gradient flips sign.

Why momentum helps: In narrow valleys, vanilla gradient descent oscillates wildly because the gradient alternates direction. Momentum averages these oscillations, smoothing the trajectory. On flat plateaus, momentum accumulates velocity and accelerates through regions where vanilla GD stalls. The effective learning rate is amplified by $\frac{1}{1-\beta}$ in consistent directions — with $\beta = 0.9$ , this is a 10x speedup along persistent gradient directions.

Nesterov Accelerated Gradient

Nesterov Momentum

\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k - \alpha \beta v_{k-1}) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

$\nabla f(x_k - \alpha \beta v_{k-1})$ =Gradient evaluated at the 'look-ahead' position

Nesterov momentum evaluates the gradient not at the current position but at the position we would arrive at if we continued with the current velocity. This "look-ahead" correction provides better convergence guarantees: $O(1/k^2)$ for convex functions compared to $O(1/k)$ for standard gradient descent. It is the theoretical foundation behind many practical accelerations.

Learning Rate Schedules

Step Decay Schedule

\alpha_k = \alpha_0 \cdot \gamma^{\lfloor k / s \rfloor}

Here,

$\alpha_0$ =Initial learning rate
$\gamma$ =Decay factor (typically 0.1 or 0.5)
$s$ =Step size — decay every s iterations
$\lfloor k/s \rfloor$ =Floor division (integer part)

Cosine Annealing Schedule

\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi k}{T}\right)\right)

Here,

$\alpha_{\min}$ =Minimum learning rate (floor)
$\alpha_{\max}$ =Maximum learning rate (typically = \alpha_0)
$T$ =Total number of iterations for one cycle
$k$ =Current iteration number

Cosine Annealing with Warm Restarts

\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi \, (k \bmod T)}{T}\right)\right)

Here,

$k \bmod T$ =Iteration within the current cycle (resets at each restart)

Warm restarts periodically reset the learning rate to a high value. The high learning rate helps escape sharp minima and saddle points, while the annealing within each cycle allows fine-grained convergence to flat, generalizable minima.

Schedule	Formula	Key Benefit
Constant	$\alpha_k = \alpha_0$	Simple baseline
Step decay	$\alpha_k = \alpha_0 \gamma^{\lfloor k/s \rfloor}$	Adapts to progress
Exponential decay	$\alpha_k = \alpha_0 e^{-\lambda k}$	Smooth, monotonic decrease
Cosine annealing	$\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k / T))$	Smooth, periodic
Warm restart	Periodic cosine reset	Escapes local minima
Linear warmup	$\alpha_k = \alpha_0 \cdot k / K$ for $k < K$	Stabilizes early training

Python Implementation

import numpy as np
import matplotlib.pyplot as plt


def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-6):
    """Vanilla gradient descent with convergence check."""
    x = x0.copy()
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        x = x - lr * grad
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def momentum_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def nesterov_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with Nesterov momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x + lr * beta * (-v))
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def cosine_annealing_lr(epoch, lr_min, lr_max, T):
    """Cosine annealing learning rate schedule."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / T))


def step_decay_lr(epoch, lr_0, gamma, step_size):
    """Step decay learning rate schedule."""
    return lr_0 * (gamma ** (epoch // step_size))


# --- Example usage ---
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 3) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 3)])

x0 = np.array([0.0, 0.0])

x_vanilla, hist_vanilla, losses_vanilla = gradient_descent(
    f, grad_f, x0, lr=0.1, n_iters=100
)
x_mom, hist_mom, losses_mom = momentum_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)
x_nest, hist_nest, losses_nest = nesterov_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)

print(f"Vanilla GD optimum: {x_vanilla}, iterations: {len(losses_vanilla)}")
print(f"Momentum GD optimum: {x_mom}, iterations: {len(losses_mom)}")
print(f"Nesterov GD optimum: {x_nest}, iterations: {len(losses_nest)}")

# Learning rate schedule visualization
epochs = range(200)
lr_cosine = [cosine_annealing_lr(e, 1e-4, 1e-1, 200) for e in epochs]
lr_step = [step_decay_lr(e, 1e-1, 0.5, 50) for e in epochs]

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, lr_cosine, label="Cosine Annealing")
plt.plot(epochs, lr_step, label="Step Decay")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.legend()
plt.title("Learning Rate Schedules")

plt.subplot(1, 2, 2)
plt.plot(losses_vanilla, label="Vanilla GD")
plt.plot(losses_mom, label="Momentum GD")
plt.plot(losses_nest, label="Nesterov GD")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.title("Convergence Comparison")
plt.tight_layout()
plt.savefig("gd_comparison.png", dpi=150)
plt.show()

Applications in AI/ML

Neural Network Training

Backpropagation and Gradient Descent

Training a neural network is a gradient descent problem over millions (or billions) of parameters. The loss $\mathcal{L}(\theta)$ measures the discrepancy between predictions and labels. Backpropagation computes $\nabla_\theta \mathcal{L}$ efficiently via the chain rule. Each training step updates all weights in the direction that reduces the loss: $\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)$ .

In practice, deep learning uses mini-batch SGD with momentum or Adam because:

Full-batch gradients are too expensive to compute for large datasets.
The noise in mini-batch gradients acts as implicit regularization.
Momentum accelerates convergence through consistent gradient directions.
Adam's per-parameter adaptive learning rates handle sparse features and varying curvatures.

Convex Optimization Connection

For convex problems like linear regression and SVMs, gradient descent has strong theoretical guarantees. The objective function has no local minima that are not global minima, and the convergence rate depends on the condition number $\kappa = L/\mu$ . Gradient descent with exact line search achieves linear convergence for strongly convex functions.

Regularized Optimization

Regularized Loss Minimization

\min_\theta \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i) + \lambda R(\theta)

Here,

$\mathcal{L}$ =Per-sample loss function (e.g., cross-entropy, MSE)
$R(\theta)$ =Regularization term (L1, L2, or other penalty)
$\lambda$ =Regularization strength (trade-off parameter)

Gradient descent naturally handles regularized objectives. The gradient becomes $\nabla_\theta \left[\frac{1}{n}\sum \mathcal{L}_i + \lambda R(\theta)\right] = \frac{1}{n}\sum \nabla_\theta \mathcal{L}_i + \lambda \nabla_\theta R(\theta)$ . For L2 regularization ( $R(\theta) = \|\theta\|^2$ ), the update becomes $x_{k+1} = (1 - 2\alpha\lambda) x_k - \alpha \nabla f(x_k)$ , which shrinks weights toward zero at every step — a phenomenon called weight decay.

Online Learning

Online Gradient Descent

In online learning, data arrives one sample (or one mini-batch) at a time, and we must update the model before seeing the next sample. Online gradient descent processes each sample immediately: $x_{t+1} = x_t - \alpha_t \nabla f_t(x_t)$ , where $f_t$ is the loss on the $t$ -th sample. The regret — cumulative loss compared to the best fixed decision in hindsight — grows as $O(\sqrt{T})$ with appropriately decaying step sizes $\alpha_t = O(1/\sqrt{t})$ . This sublinear regret means the algorithm "almost" converges to the best decision, making it a cornerstone of online learning theory.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using a constant high learning rate	Causes divergence and oscillation around the minimum	Use a learning rate schedule (cosine, step decay)
Ignoring gradient magnitude	A gradient norm of 0.001 and 100 require very different step sizes	Use adaptive methods (Adam) or gradient clipping
Not shuffling training data	Creates systematic bias in mini-batch gradients	Shuffle the dataset at each epoch
Setting batch size too small	Excessive gradient noise prevents convergence	Use batch sizes of 32–512 for stable training
Training without momentum	Slow convergence in ill-conditioned loss landscapes	Add momentum ( $\beta = 0.9$ ) as a default
Using vanilla GD for deep learning	Full-batch computation is prohibitively expensive	Use mini-batch SGD or Adam
Skipping learning rate warmup	Large initial gradients can destabilize early training	Warm up learning rate over the first few epochs
Not monitoring loss curves	Missing divergence, plateaus, or overfitting	Plot training and validation loss every epoch
Assuming convergence to global minimum	Non-convex problems may have many local minima and saddle points	Use multiple restarts or annealing to escape poor minima
Confusing gradient descent with Newton's method	GD uses first-order information only; Newton uses second-order curvature	Use Newton/BFGS for small problems, GD for large-scale

Interview Questions

Q1: Why is the learning rate the most important hyperparameter in gradient descent?

Answer

The learning rate controls the step size at each iteration. If it's too large, the iterates overshoot and diverge. If it's too small, convergence is impractically slow. Unlike other hyperparameters that mainly affect model capacity or regularization, the learning rate directly determines whether the optimization converges at all. For a quadratic $f(x) = \frac{1}{2}ax^2$ , the iteration becomes $x_{k+1} = x_k(1 - \alpha a)$ , which converges only if $|1 - \alpha a| < 1$ , i.e., $\alpha < 2/a$ . This critical dependence makes learning rate tuning essential.

Q2: What is the difference between batch gradient descent, SGD, and mini-batch GD?

Answer

Batch GD computes the exact gradient over all $n$ samples per update — stable but $O(n)$ per step. SGD uses a single random sample — fast but noisy. Mini-batch GD uses $b$ samples — the practical compromise. The noise in SGD/mini-batch is not a bug but a feature: it provides implicit regularization, helps escape saddle points, and enables training on data streams. Modern deep learning universally uses mini-batch GD with $b \in [32, 512]$ .

Q3: How does momentum improve gradient descent?

Answer

Momentum accumulates past gradients: $v_k = \beta v_{k-1} + \nabla f(x_k)$ , then updates $x_{k+1} = x_k - \alpha v_k$ . This has two key benefits: (1) It averages out oscillations in narrow valleys, smoothing the trajectory toward the minimum. (2) It accelerates through flat regions where the gradient is small but consistent, as velocity accumulates over iterations. The effective learning rate in consistent directions is amplified by $\frac{1}{1-\beta} \approx 10\times$ when $\beta = 0.9$ . Nesterov momentum improves this further by evaluating the gradient at a look-ahead position, achieving $O(1/k^2)$ convergence for convex functions.

Q4: Explain the convergence guarantees of gradient descent for convex functions.

Answer

For $L$ -Lipschitz smooth functions, GD with $\alpha = 1/L$ achieves $O(1/k)$ convergence: $f(x_k) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2k}$ . For $\mu$ -strongly convex and $L$ -smooth functions, GD with $\alpha = 2/(L+\mu)$ achieves linear convergence at rate $\left(\frac{\kappa-1}{\kappa+1}\right)^2$ per iteration, where $\kappa = L/\mu$ is the condition number. For non-convex functions, GD converges to a stationary point ( $\|\nabla f\| < \epsilon$ ) in $O(1/\epsilon^2)$ iterations, but there is no guarantee of reaching the global minimum.

Q5: When would you choose SGD over Adam?

Answer

SGD with momentum often achieves better generalization than Adam in the final stages of training. Adam's adaptive per-parameter learning rates can converge faster initially but may generalize worse because they escape sharp minima less effectively. The common practice is: use Adam for initial rapid convergence, then switch to SGD with momentum (and a decaying learning rate) for the final training phase. This "warm start with Adam, finish with SGD" strategy combines the speed of adaptive methods with the generalization of SGD.

Second-Order Effects: Condition Number

DfCondition Number

The condition number $\kappa = L/\mu$ measures the ratio of the largest to smallest curvature of the function. For $f(x) = \frac{1}{2}x^T A x$ with eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$ , $\kappa = \lambda_{\max}/\lambda_{\min}$ . A high condition number means the function has very different curvatures in different directions — creating narrow, elongated valleys where gradient descent oscillates.

Ill-conditioned Problems

When $\kappa \gg 1$ , gradient descent requires $O(\kappa \log(1/\epsilon))$ iterations to reach accuracy $\epsilon$ . For example, if $\kappa = 1000$ , you need roughly 1000 times more iterations than if $\kappa = 1$ . Preconditioning (transforming coordinates to reduce $\kappa$ ) or using second-order methods is essential for such problems.

Practice Problems

Problem 1: Manual Gradient Descent Step

One Step of Gradient Descent

Problem: Given $f(x, y) = (x - 3)^2 + 2(y + 1)^2$ , starting point $(x_0, y_0) = (0, 0)$ , and learning rate $\alpha = 0.1$ , perform one gradient descent update.

Solution

Gradient: $\nabla f = \left(2(x-3), \; 4(y+1)\right)$

At $(0, 0)$ : $\nabla f(0, 0) = (2(0-3), \; 4(0+1)) = (-6, \; 4)$

Update: $x_1 = x_0 - \alpha \cdot (-6) = 0 + 0.6 = 0.6$

y_1 = y_0 - \alpha \cdot 4 = 0 - 0.4 = -0.4

New point: $(0.6, -0.4)$

Check: $f(0, 0) = 9 + 2 = 11$ , $f(0.6, -0.4) = (0.6-3)^2 + 2(-0.4+1)^2 = 5.76 + 0.72 = 6.48 < 11$ . The function decreased, confirming the step was in the right direction.

Problem 2: Learning Rate Divergence

Divergent Learning Rate

Problem: For $f(x) = x^2$ with $x_0 = 1$ and gradient descent $x_{k+1} = x_k - \alpha \cdot 2x_k$ , find the range of $\alpha$ for convergence.

Solution

The update is $x_{k+1} = x_k(1 - 2\alpha)$ . This is a geometric sequence: $x_k = (1 - 2\alpha)^k \cdot x_0$ .

Convergence condition: $|1 - 2\alpha| < 1 \implies -1 < 1 - 2\alpha < 1 \implies 0 < \alpha < 1$ .

If $\alpha = 0.5$ : $x_k = 0$ for all $k \geq 1$ (optimal, converges in one step).
If $\alpha = 0.3$ : $x_k = (0.4)^k \to 0$ (converges slowly).
If $\alpha = 1.0$ : $x_k = (-1)^k$ (oscillates, does not converge).
If $\alpha = 1.5$ : $x_k = (-2)^k$ (diverges).

Optimal: $\alpha = 0.5$ gives fastest convergence for this quadratic.

Problem 3: Momentum vs Vanilla GD

Momentum Acceleration

Problem: Consider a function with narrow valley along $y = 0$ where $f(x, y) = x^2 + 100y^2$ . Explain why momentum helps and compute the effective step size after 5 iterations with $\beta = 0.9$ and constant gradient $(2, 0)$ .

Solution

The function has $L = 200$ (from the $y$ -direction) and $\mu = 2$ (from the $x$ -direction), so $\kappa = 100$ . Vanilla GD oscillates wildly in the $y$ -direction due to the large curvature.

Momentum with constant gradient $g = (2, 0)$ : $v_0 = 0$ , $v_1 = \beta \cdot 0 + g = g$

v_2 = \beta v_1 + g = (1 + \beta)g

v_k = (1 + \beta + \beta^2 + \cdots + \beta^{k-1})g = \frac{1 - \beta^k}{1 - \beta} g

After $k = 5$ iterations: $v_5 = \frac{1 - 0.9^5}{1 - 0.9} \cdot (2, 0) = \frac{1 - 0.59049}{0.1} \cdot (2, 0) = 4.0951 \cdot (2, 0) = (8.19, 0)$

The effective step in the $x$ -direction is $\alpha \cdot 8.19$ — about $4\times$ the gradient magnitude. As $k \to \infty$ , $v_k \to \frac{1}{1-\beta} g = 10 \cdot g$ , amplifying the consistent direction by $10\times$ .

Problem 4: Cosine Annealing Schedule

Learning Rate at Specific Epochs

Problem: For cosine annealing with $\alpha_{\min} = 10^{-5}$ , $\alpha_{\max} = 0.1$ , and $T = 100$ epochs, compute the learning rate at epochs $k = 0, 25, 50, 75, 100$ .

Solution

\alpha_k = 10^{-5} + \frac{0.1 - 10^{-5}}{2}\left(1 + \cos\left(\frac{\pi k}{100}\right)\right)

$k = 0$ : $\alpha_0 = 10^{-5} + 0.05 \cdot (1 + \cos 0) = 10^{-5} + 0.05 \cdot 2 = 0.10001 \approx 0.1$
$k = 25$ : $\alpha_{25} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/4)) = 10^{-5} + 0.05 \cdot 1.7071 = 0.08536$
$k = 50$ : $\alpha_{50} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/2)) = 10^{-5} + 0.05 \cdot 1 = 0.05001$
$k = 75$ : $\alpha_{75} = 10^{-5} + 0.05 \cdot (1 + \cos(3\pi/4)) = 10^{-5} + 0.05 \cdot 0.2929 = 0.01466$
$k = 100$ : $\alpha_{100} = 10^{-5} + 0.05 \cdot (1 + \cos \pi) = 10^{-5} + 0.05 \cdot 0 = 10^{-5}$

The schedule starts near maximum, decays smoothly through the middle, and ends at the minimum — exactly the behavior needed for training neural networks.

Quick Reference

Key Takeaways

Update Rule: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ — move opposite the gradient by step size $\alpha_k$ .
Learning Rate: The most critical hyperparameter. Too large diverges, too small is slow. Use schedules (cosine, step decay) instead of constant values.
Batch vs. SGD vs. Mini-batch: Batch GD uses all data (stable but expensive), SGD uses one sample (fast but noisy), mini-batch GD ( $b = 32$ – $512$ ) is the practical default for deep learning.
Convergence Rate: $O(1/k)$ for convex, linear convergence $\left(\frac{\kappa-1}{\kappa+1}\right)^{2k}$ for strongly convex, $O(1/\epsilon^2)$ to reach $\|\nabla f\| < \epsilon$ for non-convex.
Line Search: Backtracking with Armijo condition $f(x + \alpha d) \leq f(x) + c\alpha \nabla f^T d$ provides adaptive step sizes without exact minimization.
Momentum: $v_k = \beta v_{k-1} + \nabla f(x_k)$ with $\beta \approx 0.9$ averages oscillations and accelerates through consistent gradient directions by $\frac{1}{1-\beta}$ .
Nesterov Momentum: Evaluate gradient at the look-ahead position for $O(1/k^2)$ convergence — the theoretical foundation for accelerated methods.
Schedules: Cosine annealing $\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k/T))$ is the most popular schedule in deep learning. Add warm restarts to escape poor minima.
Deep Learning Practice: Start with Adam (lr=0.001) for fast convergence; switch to SGD+momentum for final training. Use gradient clipping when gradients explode.
Python: Use scipy.optimize.minimize() for small convex problems; implement custom loops with NumPy for full control over training dynamics.

Cross-References

Stochastic Gradient Descent: Adam, RMSProp, and adaptive learning rate methods -> SGD
Newton's Method: Second-order optimization with Hessian information -> Newton's Method
Constrained Optimization: KKT conditions, penalty methods, and interior-point methods -> Constrained Optimization
Convex Optimization: Global optimality guarantees and convex function properties -> Convex Optimization
Hyperparameter Optimization: Learning rate tuning via grid search, Bayesian optimization -> Hyperparameter Optimization
Calculus Optimization: First and second derivative tests, unconstrained optimization -> Calculus Optimization
Partial Derivatives: Gradients and Jacobians needed for computing $\nabla f$ -> Partial Derivatives
Linear Algebra Norms: Vector norms used in convergence analysis -> Linear Algebra Norms

Gradient Descent