πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Gradient Descent

OptimizationFirst-order🟒 Free Lesson

Advertisement

Gradient Descent

Why It Matters

Gradient descent is the foundational optimization algorithm that powers nearly all of modern machine learning. From training a simple linear regression to fine-tuning billion-parameter large language models, gradient descent and its variants are the engine behind every weight update. Understanding its mechanics, convergence properties, and practical pitfalls is essential for any ML practitioner or researcher.

At its core, gradient descent is an iterative method for finding local minima of differentiable functions. The intuition is beautifully simple: if you are standing on a mountain in thick fog and want to reach the valley, feel the slope under your feet and take a step downhill. Repeat until you reach the bottom. Despite this simplicity, the algorithm's interplay with learning rate choice, batch size, and loss landscape geometry creates deep and rich behavior that demands careful study.


The Algorithm

Gradient Descent Update Rule

xk+1=xkβˆ’Ξ±kβˆ‡f(xk)x_{k+1} = x_k - \alpha_k \nabla f(x_k)

Here,

  • xkx_k=Current parameter vector at iteration k
  • Ξ±k\alpha_k=Learning rate (step size) at iteration k
  • βˆ‡f(xk)\nabla f(x_k)=Gradient of the objective function f at x_k
  • xk+1x_{k+1}=Updated parameter vector after one step

Gradient Direction

The negative gradient βˆ’βˆ‡f(xk)-\nabla f(x_k) points in the direction of steepest descent. Moving opposite the gradient is guaranteed to decrease ff for sufficiently small step sizes, since βˆ‡f(xk)Tβ‹…[βˆ’βˆ‡f(xk)]=βˆ’βˆ₯βˆ‡f(xk)βˆ₯2<0\nabla f(x_k)^T \cdot [-\nabla f(x_k)] = -\|\nabla f(x_k)\|^2 < 0 whenever βˆ‡f(xk)β‰ 0\nabla f(x_k) \neq 0. This is the fundamental justification for the algorithm.


Learning Rate

DfLearning Rate

The learning rate Ξ±>0\alpha > 0 is the step size that controls how far we move along the negative gradient direction at each iteration. It is the single most important hyperparameter in gradient descent. The choice of Ξ±\alpha determines whether the algorithm converges, diverges, or oscillates without reaching a minimum.

Choosing the Learning Rate

Too Large

If Ξ±\alpha is too large, gradient descent overshoots the minimum and may diverge. For a quadratic f(x)=12ax2f(x) = \frac{1}{2}ax^2, the update becomes xk+1=xk(1βˆ’Ξ±a)x_{k+1} = x_k(1 - \alpha a). If ∣1βˆ’Ξ±a∣>1|1 - \alpha a| > 1, the iterates grow without bound. The algorithm oscillates with increasing amplitude and never converges.

Too Small

If Ξ±\alpha is too small, convergence is painfully slow. Each step makes negligible progress, and the algorithm may require millions of iterations to reach the minimum. This wastes compute and can get stuck in plateaus or shallow local minima before reaching the true optimum.

Just Right

A well-chosen learning rate balances speed and stability. The iterates make large enough progress to converge in reasonable time while remaining stable enough to approach the minimum without overshooting. In practice, the optimal Ξ±\alpha depends on the curvature of the loss landscape and is typically found via learning rate schedules or adaptive methods.


Batch Gradient Descent

Batch Gradient Descent (Full Gradient)

xk+1=xkβˆ’Ξ±β‹…1nβˆ‘i=1nβˆ‡fi(xk)x_{k+1} = x_k - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)

Here,

  • nn=Total number of training samples
  • fi(x)f_i(x)=Loss for the i-th training sample
  • 1nβˆ‘i=1nβˆ‡fi(xk)\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)=Full gradient averaged over the entire dataset

Batch gradient descent computes the exact gradient using all nn training samples before each update. This guarantees the gradient direction is accurate but becomes prohibitively expensive for large datasets. Computing the full gradient requires O(n)O(n) forward and backward passes per iteration, which is infeasible when nn is in the millions.

Stochastic and Mini-batch Variants

Stochastic Gradient Descent (SGD)

xk+1=xkβˆ’Ξ±β‹…βˆ‡fik(xk)x_{k+1} = x_k - \alpha \cdot \nabla f_{i_k}(x_k)

Here,

  • iki_k=Randomly sampled index from {1, ..., n}
  • βˆ‡fik(xk)\nabla f_{i_k}(x_k)=Gradient from a single sample (noisy estimate)

Mini-batch Gradient Descent

xk+1=xkβˆ’Ξ±β‹…1∣Bkβˆ£βˆ‘i∈Bkβˆ‡fi(xk)x_{k+1} = x_k - \alpha \cdot \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

  • BkB_k=Random mini-batch of size b sampled at iteration k
  • ∣Bk∣=b|B_k| = b=Mini-batch size (typically 32 to 512)
VariantBatch SizeGradient NoisePer-Iteration CostUse Case
Batch GDAll nnNoneO(n)O(n)Small datasets, convex problems
Stochastic GD1HighO(1)O(1)Online learning, large-scale
Mini-batch GDbbModerateO(b)O(b)Deep learning (default)

Convergence Analysis

ThConvergence Rate for Convex Functions

Let ff be LL-Lipschitz continuous and ΞΌ\mu-strongly convex. Then gradient descent with constant step size Ξ±=1L\alpha = \frac{1}{L} satisfies:

f(xk)βˆ’f(xβˆ—)≀Lβˆ₯x0βˆ’xβˆ—βˆ₯22kf(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}

This gives a convergence rate of O(1/k)O(1/k) β€” sublinear convergence. The iterates decrease the optimality gap at a rate inversely proportional to the number of iterations.

ThLinear Convergence for Strongly Convex Functions

If ff is ΞΌ\mu-strongly convex and LL-smooth, gradient descent with step size Ξ±=2L+ΞΌ\alpha = \frac{2}{L + \mu} satisfies:

f(xk)βˆ’f(xβˆ—)≀(Lβˆ’ΞΌL+ΞΌ)2kβˆ₯x0βˆ’xβˆ—βˆ₯2f(x_k) - f(x^*) \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^*\|^2

The condition number ΞΊ=L/ΞΌ\kappa = L/\mu governs the convergence rate: Lβˆ’ΞΌL+ΞΌ=ΞΊβˆ’1ΞΊ+1\frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1}. When ΞΊ=1\kappa = 1 (perfect conditioning), convergence is immediate. When κ≫1\kappa \gg 1 (ill-conditioned), convergence is extremely slow β€” the iterates zigzag through narrow valleys.

Non-convex Landscape

For non-convex problems (like neural network training), gradient descent is not guaranteed to find the global minimum. It converges to a stationary point where βˆ‡f(x)=0\nabla f(x) = 0, which could be a local minimum, global minimum, or saddle point. In high-dimensional spaces, saddle points are exponentially more prevalent than local minima, and saddle points with negative curvature are often the primary obstacle to convergence.


Line Search Methods

DfExact Line Search

After computing the search direction dk=βˆ’βˆ‡f(xk)d_k = -\nabla f(x_k), exact line search finds the step size that minimizes ff along that direction:

αk=arg⁑min⁑α>0f(xk+αdk)\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)

This is optimal but expensive β€” it requires evaluating ff at many points along the ray. In practice, exact line search is rarely used because each evaluation costs O(n)O(n) and the minimization itself is an inner optimization problem.

DfBacktracking Line Search

Backtracking is a practical alternative. Starting with an initial step size (e.g., α=1\alpha = 1), it repeatedly shrinks α\alpha by a factor β∈(0,1)\beta \in (0, 1) until the Armijo condition is satisfied:

f(xk+Ξ±dk)≀f(xk)+cΞ±βˆ‡f(xk)Tdkf(x_k + \alpha d_k) \leq f(x_k) + c \alpha \nabla f(x_k)^T d_k

where c∈(0,1)c \in (0, 1) is a constant (typically c=10βˆ’4c = 10^{-4}). The parameter Ξ²\beta (typically 0.50.5) controls how aggressively we shrink. This guarantees sufficient decrease at each step while avoiding the cost of exact minimization.


Momentum

Momentum Update Rule

vk=Ξ²vkβˆ’1+βˆ‡f(xk)xk+1=xkβˆ’Ξ±vk\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

  • vkv_k=Velocity (accumulated gradient) at iteration k
  • Ξ²\beta=Momentum coefficient (typically 0.9)
  • Ξ±\alpha=Learning rate

Physics Analogy

Imagine a ball rolling downhill on a frictionless surface. It accumulates velocity as it rolls, gaining speed in directions of consistent descent and slowing down through oscillations. The momentum term Ξ²vkβˆ’1\beta v_{k-1} retains a fraction of the previous velocity, so the ball speeds up in directions where the gradient consistently points downhill and decelerates where the gradient flips sign.

Why momentum helps: In narrow valleys, vanilla gradient descent oscillates wildly because the gradient alternates direction. Momentum averages these oscillations, smoothing the trajectory. On flat plateaus, momentum accumulates velocity and accelerates through regions where vanilla GD stalls. The effective learning rate is amplified by 11βˆ’Ξ²\frac{1}{1-\beta} in consistent directions β€” with Ξ²=0.9\beta = 0.9, this is a 10x speedup along persistent gradient directions.


Nesterov Accelerated Gradient

Nesterov Momentum

vk=Ξ²vkβˆ’1+βˆ‡f(xkβˆ’Ξ±Ξ²vkβˆ’1)xk+1=xkβˆ’Ξ±vk\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k - \alpha \beta v_{k-1}) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

  • βˆ‡f(xkβˆ’Ξ±Ξ²vkβˆ’1)\nabla f(x_k - \alpha \beta v_{k-1})=Gradient evaluated at the 'look-ahead' position

Nesterov momentum evaluates the gradient not at the current position but at the position we would arrive at if we continued with the current velocity. This "look-ahead" correction provides better convergence guarantees: O(1/k2)O(1/k^2) for convex functions compared to O(1/k)O(1/k) for standard gradient descent. It is the theoretical foundation behind many practical accelerations.


Learning Rate Schedules

Step Decay Schedule

Ξ±k=Ξ±0β‹…Ξ³βŒŠk/sβŒ‹\alpha_k = \alpha_0 \cdot \gamma^{\lfloor k / s \rfloor}

Here,

  • Ξ±0\alpha_0=Initial learning rate
  • Ξ³\gamma=Decay factor (typically 0.1 or 0.5)
  • ss=Step size β€” decay every s iterations
  • ⌊k/sβŒ‹\lfloor k/s \rfloor=Floor division (integer part)

Cosine Annealing Schedule

Ξ±k=Ξ±min⁑+12(Ξ±maxβ‘βˆ’Ξ±min⁑)(1+cos⁑(Ο€kT))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi k}{T}\right)\right)

Here,

  • Ξ±min⁑\alpha_{\min}=Minimum learning rate (floor)
  • Ξ±max⁑\alpha_{\max}=Maximum learning rate (typically = \alpha_0)
  • TT=Total number of iterations for one cycle
  • kk=Current iteration number

Cosine Annealing with Warm Restarts

Ξ±k=Ξ±min⁑+12(Ξ±maxβ‘βˆ’Ξ±min⁑)(1+cos⁑(π (kβ€Šmodβ€ŠT)T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi \, (k \bmod T)}{T}\right)\right)

Here,

  • kβ€Šmodβ€ŠTk \bmod T=Iteration within the current cycle (resets at each restart)

Warm restarts periodically reset the learning rate to a high value. The high learning rate helps escape sharp minima and saddle points, while the annealing within each cycle allows fine-grained convergence to flat, generalizable minima.

ScheduleFormulaKey Benefit
ConstantΞ±k=Ξ±0\alpha_k = \alpha_0Simple baseline
Step decayΞ±k=Ξ±0γ⌊k/sβŒ‹\alpha_k = \alpha_0 \gamma^{\lfloor k/s \rfloor}Adapts to progress
Exponential decayΞ±k=Ξ±0eβˆ’Ξ»k\alpha_k = \alpha_0 e^{-\lambda k}Smooth, monotonic decrease
Cosine annealingΞ±k=Ξ±min⁑+12(Ξ±maxβ‘βˆ’Ξ±min⁑)(1+cos⁑(Ο€k/T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k / T))Smooth, periodic
Warm restartPeriodic cosine resetEscapes local minima
Linear warmupΞ±k=Ξ±0β‹…k/K\alpha_k = \alpha_0 \cdot k / K for k<Kk < KStabilizes early training

Python Implementation

import numpy as np
import matplotlib.pyplot as plt


def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-6):
    """Vanilla gradient descent with convergence check."""
    x = x0.copy()
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        x = x - lr * grad
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def momentum_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def nesterov_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with Nesterov momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x + lr * beta * (-v))
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def cosine_annealing_lr(epoch, lr_min, lr_max, T):
    """Cosine annealing learning rate schedule."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / T))


def step_decay_lr(epoch, lr_0, gamma, step_size):
    """Step decay learning rate schedule."""
    return lr_0 * (gamma ** (epoch // step_size))


# --- Example usage ---
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 3) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 3)])

x0 = np.array([0.0, 0.0])

x_vanilla, hist_vanilla, losses_vanilla = gradient_descent(
    f, grad_f, x0, lr=0.1, n_iters=100
)
x_mom, hist_mom, losses_mom = momentum_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)
x_nest, hist_nest, losses_nest = nesterov_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)

print(f"Vanilla GD optimum: {x_vanilla}, iterations: {len(losses_vanilla)}")
print(f"Momentum GD optimum: {x_mom}, iterations: {len(losses_mom)}")
print(f"Nesterov GD optimum: {x_nest}, iterations: {len(losses_nest)}")

# Learning rate schedule visualization
epochs = range(200)
lr_cosine = [cosine_annealing_lr(e, 1e-4, 1e-1, 200) for e in epochs]
lr_step = [step_decay_lr(e, 1e-1, 0.5, 50) for e in epochs]

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, lr_cosine, label="Cosine Annealing")
plt.plot(epochs, lr_step, label="Step Decay")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.legend()
plt.title("Learning Rate Schedules")

plt.subplot(1, 2, 2)
plt.plot(losses_vanilla, label="Vanilla GD")
plt.plot(losses_mom, label="Momentum GD")
plt.plot(losses_nest, label="Nesterov GD")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.title("Convergence Comparison")
plt.tight_layout()
plt.savefig("gd_comparison.png", dpi=150)
plt.show()

Applications in AI/ML

Neural Network Training

Backpropagation and Gradient Descent

Training a neural network is a gradient descent problem over millions (or billions) of parameters. The loss L(ΞΈ)\mathcal{L}(\theta) measures the discrepancy between predictions and labels. Backpropagation computes βˆ‡ΞΈL\nabla_\theta \mathcal{L} efficiently via the chain rule. Each training step updates all weights in the direction that reduces the loss: ΞΈt+1=ΞΈtβˆ’Ξ±βˆ‡ΞΈL(ΞΈt)\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t).

In practice, deep learning uses mini-batch SGD with momentum or Adam because:

  1. Full-batch gradients are too expensive to compute for large datasets.
  2. The noise in mini-batch gradients acts as implicit regularization.
  3. Momentum accelerates convergence through consistent gradient directions.
  4. Adam's per-parameter adaptive learning rates handle sparse features and varying curvatures.

Convex Optimization Connection

For convex problems like linear regression and SVMs, gradient descent has strong theoretical guarantees. The objective function has no local minima that are not global minima, and the convergence rate depends on the condition number ΞΊ=L/ΞΌ\kappa = L/\mu. Gradient descent with exact line search achieves linear convergence for strongly convex functions.

Regularized Optimization

Regularized Loss Minimization

minβ‘ΞΈβ€…β€Š1nβˆ‘i=1nL(fΞΈ(xi),yi)+Ξ»R(ΞΈ)\min_\theta \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i) + \lambda R(\theta)

Here,

  • L\mathcal{L}=Per-sample loss function (e.g., cross-entropy, MSE)
  • R(ΞΈ)R(\theta)=Regularization term (L1, L2, or other penalty)
  • Ξ»\lambda=Regularization strength (trade-off parameter)

Gradient descent naturally handles regularized objectives. The gradient becomes βˆ‡ΞΈ[1nβˆ‘Li+Ξ»R(ΞΈ)]=1nβˆ‘βˆ‡ΞΈLi+Ξ»βˆ‡ΞΈR(ΞΈ)\nabla_\theta \left[\frac{1}{n}\sum \mathcal{L}_i + \lambda R(\theta)\right] = \frac{1}{n}\sum \nabla_\theta \mathcal{L}_i + \lambda \nabla_\theta R(\theta). For L2 regularization (R(ΞΈ)=βˆ₯ΞΈβˆ₯2R(\theta) = \|\theta\|^2), the update becomes xk+1=(1βˆ’2Ξ±Ξ»)xkβˆ’Ξ±βˆ‡f(xk)x_{k+1} = (1 - 2\alpha\lambda) x_k - \alpha \nabla f(x_k), which shrinks weights toward zero at every step β€” a phenomenon called weight decay.

Online Learning

Online Gradient Descent

In online learning, data arrives one sample (or one mini-batch) at a time, and we must update the model before seeing the next sample. Online gradient descent processes each sample immediately: xt+1=xtβˆ’Ξ±tβˆ‡ft(xt)x_{t+1} = x_t - \alpha_t \nabla f_t(x_t), where ftf_t is the loss on the tt-th sample. The regret β€” cumulative loss compared to the best fixed decision in hindsight β€” grows as O(T)O(\sqrt{T}) with appropriately decaying step sizes Ξ±t=O(1/t)\alpha_t = O(1/\sqrt{t}). This sublinear regret means the algorithm "almost" converges to the best decision, making it a cornerstone of online learning theory.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using a constant high learning rateCauses divergence and oscillation around the minimumUse a learning rate schedule (cosine, step decay)
Ignoring gradient magnitudeA gradient norm of 0.001 and 100 require very different step sizesUse adaptive methods (Adam) or gradient clipping
Not shuffling training dataCreates systematic bias in mini-batch gradientsShuffle the dataset at each epoch
Setting batch size too smallExcessive gradient noise prevents convergenceUse batch sizes of 32–512 for stable training
Training without momentumSlow convergence in ill-conditioned loss landscapesAdd momentum (Ξ²=0.9\beta = 0.9) as a default
Using vanilla GD for deep learningFull-batch computation is prohibitively expensiveUse mini-batch SGD or Adam
Skipping learning rate warmupLarge initial gradients can destabilize early trainingWarm up learning rate over the first few epochs
Not monitoring loss curvesMissing divergence, plateaus, or overfittingPlot training and validation loss every epoch
Assuming convergence to global minimumNon-convex problems may have many local minima and saddle pointsUse multiple restarts or annealing to escape poor minima
Confusing gradient descent with Newton's methodGD uses first-order information only; Newton uses second-order curvatureUse Newton/BFGS for small problems, GD for large-scale

Interview Questions

Q1: Why is the learning rate the most important hyperparameter in gradient descent?

Answer

The learning rate controls the step size at each iteration. If it's too large, the iterates overshoot and diverge. If it's too small, convergence is impractically slow. Unlike other hyperparameters that mainly affect model capacity or regularization, the learning rate directly determines whether the optimization converges at all. For a quadratic f(x)=12ax2f(x) = \frac{1}{2}ax^2, the iteration becomes xk+1=xk(1βˆ’Ξ±a)x_{k+1} = x_k(1 - \alpha a), which converges only if ∣1βˆ’Ξ±a∣<1|1 - \alpha a| < 1, i.e., Ξ±<2/a\alpha < 2/a. This critical dependence makes learning rate tuning essential.

Q2: What is the difference between batch gradient descent, SGD, and mini-batch GD?

Answer

Batch GD computes the exact gradient over all nn samples per update β€” stable but O(n)O(n) per step. SGD uses a single random sample β€” fast but noisy. Mini-batch GD uses bb samples β€” the practical compromise. The noise in SGD/mini-batch is not a bug but a feature: it provides implicit regularization, helps escape saddle points, and enables training on data streams. Modern deep learning universally uses mini-batch GD with b∈[32,512]b \in [32, 512].

Q3: How does momentum improve gradient descent?

Answer

Momentum accumulates past gradients: vk=Ξ²vkβˆ’1+βˆ‡f(xk)v_k = \beta v_{k-1} + \nabla f(x_k), then updates xk+1=xkβˆ’Ξ±vkx_{k+1} = x_k - \alpha v_k. This has two key benefits: (1) It averages out oscillations in narrow valleys, smoothing the trajectory toward the minimum. (2) It accelerates through flat regions where the gradient is small but consistent, as velocity accumulates over iterations. The effective learning rate in consistent directions is amplified by 11βˆ’Ξ²β‰ˆ10Γ—\frac{1}{1-\beta} \approx 10\times when Ξ²=0.9\beta = 0.9. Nesterov momentum improves this further by evaluating the gradient at a look-ahead position, achieving O(1/k2)O(1/k^2) convergence for convex functions.

Q4: Explain the convergence guarantees of gradient descent for convex functions.

Answer

For LL-Lipschitz smooth functions, GD with Ξ±=1/L\alpha = 1/L achieves O(1/k)O(1/k) convergence: f(xk)βˆ’f(xβˆ—)≀Lβˆ₯x0βˆ’xβˆ—βˆ₯22kf(x_k) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2k}. For ΞΌ\mu-strongly convex and LL-smooth functions, GD with Ξ±=2/(L+ΞΌ)\alpha = 2/(L+\mu) achieves linear convergence at rate (ΞΊβˆ’1ΞΊ+1)2\left(\frac{\kappa-1}{\kappa+1}\right)^2 per iteration, where ΞΊ=L/ΞΌ\kappa = L/\mu is the condition number. For non-convex functions, GD converges to a stationary point (βˆ₯βˆ‡fβˆ₯<Ο΅\|\nabla f\| < \epsilon) in O(1/Ο΅2)O(1/\epsilon^2) iterations, but there is no guarantee of reaching the global minimum.

Q5: When would you choose SGD over Adam?

Answer

SGD with momentum often achieves better generalization than Adam in the final stages of training. Adam's adaptive per-parameter learning rates can converge faster initially but may generalize worse because they escape sharp minima less effectively. The common practice is: use Adam for initial rapid convergence, then switch to SGD with momentum (and a decaying learning rate) for the final training phase. This "warm start with Adam, finish with SGD" strategy combines the speed of adaptive methods with the generalization of SGD.


Second-Order Effects: Condition Number

DfCondition Number

The condition number ΞΊ=L/ΞΌ\kappa = L/\mu measures the ratio of the largest to smallest curvature of the function. For f(x)=12xTAxf(x) = \frac{1}{2}x^T A x with eigenvalues Ξ»max⁑\lambda_{\max} and Ξ»min⁑\lambda_{\min}, ΞΊ=Ξ»max⁑/Ξ»min⁑\kappa = \lambda_{\max}/\lambda_{\min}. A high condition number means the function has very different curvatures in different directions β€” creating narrow, elongated valleys where gradient descent oscillates.

Ill-conditioned Problems

When κ≫1\kappa \gg 1, gradient descent requires O(ΞΊlog⁑(1/Ο΅))O(\kappa \log(1/\epsilon)) iterations to reach accuracy Ο΅\epsilon. For example, if ΞΊ=1000\kappa = 1000, you need roughly 1000 times more iterations than if ΞΊ=1\kappa = 1. Preconditioning (transforming coordinates to reduce ΞΊ\kappa) or using second-order methods is essential for such problems.


Practice Problems

Problem 1: Manual Gradient Descent Step

One Step of Gradient Descent

Problem: Given f(x,y)=(xβˆ’3)2+2(y+1)2f(x, y) = (x - 3)^2 + 2(y + 1)^2, starting point (x0,y0)=(0,0)(x_0, y_0) = (0, 0), and learning rate Ξ±=0.1\alpha = 0.1, perform one gradient descent update.

Solution

Gradient: βˆ‡f=(2(xβˆ’3),β€…β€Š4(y+1))\nabla f = \left(2(x-3), \; 4(y+1)\right)

At (0,0)(0, 0): βˆ‡f(0,0)=(2(0βˆ’3),β€…β€Š4(0+1))=(βˆ’6,β€…β€Š4)\nabla f(0, 0) = (2(0-3), \; 4(0+1)) = (-6, \; 4)

Update: x1=x0βˆ’Ξ±β‹…(βˆ’6)=0+0.6=0.6x_1 = x_0 - \alpha \cdot (-6) = 0 + 0.6 = 0.6

y1=y0βˆ’Ξ±β‹…4=0βˆ’0.4=βˆ’0.4y_1 = y_0 - \alpha \cdot 4 = 0 - 0.4 = -0.4

New point: (0.6,βˆ’0.4)(0.6, -0.4)

Check: f(0,0)=9+2=11f(0, 0) = 9 + 2 = 11, f(0.6,βˆ’0.4)=(0.6βˆ’3)2+2(βˆ’0.4+1)2=5.76+0.72=6.48<11f(0.6, -0.4) = (0.6-3)^2 + 2(-0.4+1)^2 = 5.76 + 0.72 = 6.48 < 11. The function decreased, confirming the step was in the right direction.


Problem 2: Learning Rate Divergence

Divergent Learning Rate

Problem: For f(x)=x2f(x) = x^2 with x0=1x_0 = 1 and gradient descent xk+1=xkβˆ’Ξ±β‹…2xkx_{k+1} = x_k - \alpha \cdot 2x_k, find the range of Ξ±\alpha for convergence.

Solution

The update is xk+1=xk(1βˆ’2Ξ±)x_{k+1} = x_k(1 - 2\alpha). This is a geometric sequence: xk=(1βˆ’2Ξ±)kβ‹…x0x_k = (1 - 2\alpha)^k \cdot x_0.

Convergence condition: ∣1βˆ’2α∣<1β€…β€ŠβŸΉβ€…β€Šβˆ’1<1βˆ’2Ξ±<1β€…β€ŠβŸΉβ€…β€Š0<Ξ±<1|1 - 2\alpha| < 1 \implies -1 < 1 - 2\alpha < 1 \implies 0 < \alpha < 1.

  • If Ξ±=0.5\alpha = 0.5: xk=0x_k = 0 for all kβ‰₯1k \geq 1 (optimal, converges in one step).
  • If Ξ±=0.3\alpha = 0.3: xk=(0.4)kβ†’0x_k = (0.4)^k \to 0 (converges slowly).
  • If Ξ±=1.0\alpha = 1.0: xk=(βˆ’1)kx_k = (-1)^k (oscillates, does not converge).
  • If Ξ±=1.5\alpha = 1.5: xk=(βˆ’2)kx_k = (-2)^k (diverges).

Optimal: Ξ±=0.5\alpha = 0.5 gives fastest convergence for this quadratic.


Problem 3: Momentum vs Vanilla GD

Momentum Acceleration

Problem: Consider a function with narrow valley along y=0y = 0 where f(x,y)=x2+100y2f(x, y) = x^2 + 100y^2. Explain why momentum helps and compute the effective step size after 5 iterations with Ξ²=0.9\beta = 0.9 and constant gradient (2,0)(2, 0).

Solution

The function has L=200L = 200 (from the yy-direction) and ΞΌ=2\mu = 2 (from the xx-direction), so ΞΊ=100\kappa = 100. Vanilla GD oscillates wildly in the yy-direction due to the large curvature.

Momentum with constant gradient g=(2,0)g = (2, 0): v0=0v_0 = 0, v1=Ξ²β‹…0+g=gv_1 = \beta \cdot 0 + g = g

v2=Ξ²v1+g=(1+Ξ²)gv_2 = \beta v_1 + g = (1 + \beta)gvk=(1+Ξ²+Ξ²2+β‹―+Ξ²kβˆ’1)g=1βˆ’Ξ²k1βˆ’Ξ²gv_k = (1 + \beta + \beta^2 + \cdots + \beta^{k-1})g = \frac{1 - \beta^k}{1 - \beta} g

After k=5k = 5 iterations: v5=1βˆ’0.951βˆ’0.9β‹…(2,0)=1βˆ’0.590490.1β‹…(2,0)=4.0951β‹…(2,0)=(8.19,0)v_5 = \frac{1 - 0.9^5}{1 - 0.9} \cdot (2, 0) = \frac{1 - 0.59049}{0.1} \cdot (2, 0) = 4.0951 \cdot (2, 0) = (8.19, 0)

The effective step in the xx-direction is Ξ±β‹…8.19\alpha \cdot 8.19 β€” about 4Γ—4\times the gradient magnitude. As kβ†’βˆžk \to \infty, vkβ†’11βˆ’Ξ²g=10β‹…gv_k \to \frac{1}{1-\beta} g = 10 \cdot g, amplifying the consistent direction by 10Γ—10\times.


Problem 4: Cosine Annealing Schedule

Learning Rate at Specific Epochs

Problem: For cosine annealing with Ξ±min⁑=10βˆ’5\alpha_{\min} = 10^{-5}, Ξ±max⁑=0.1\alpha_{\max} = 0.1, and T=100T = 100 epochs, compute the learning rate at epochs k=0,25,50,75,100k = 0, 25, 50, 75, 100.

Solution

Ξ±k=10βˆ’5+0.1βˆ’10βˆ’52(1+cos⁑(Ο€k100))\alpha_k = 10^{-5} + \frac{0.1 - 10^{-5}}{2}\left(1 + \cos\left(\frac{\pi k}{100}\right)\right)
  • k=0k = 0: Ξ±0=10βˆ’5+0.05β‹…(1+cos⁑0)=10βˆ’5+0.05β‹…2=0.10001β‰ˆ0.1\alpha_0 = 10^{-5} + 0.05 \cdot (1 + \cos 0) = 10^{-5} + 0.05 \cdot 2 = 0.10001 \approx 0.1
  • k=25k = 25: Ξ±25=10βˆ’5+0.05β‹…(1+cos⁑(Ο€/4))=10βˆ’5+0.05β‹…1.7071=0.08536\alpha_{25} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/4)) = 10^{-5} + 0.05 \cdot 1.7071 = 0.08536
  • k=50k = 50: Ξ±50=10βˆ’5+0.05β‹…(1+cos⁑(Ο€/2))=10βˆ’5+0.05β‹…1=0.05001\alpha_{50} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/2)) = 10^{-5} + 0.05 \cdot 1 = 0.05001
  • k=75k = 75: Ξ±75=10βˆ’5+0.05β‹…(1+cos⁑(3Ο€/4))=10βˆ’5+0.05β‹…0.2929=0.01466\alpha_{75} = 10^{-5} + 0.05 \cdot (1 + \cos(3\pi/4)) = 10^{-5} + 0.05 \cdot 0.2929 = 0.01466
  • k=100k = 100: Ξ±100=10βˆ’5+0.05β‹…(1+cos⁑π)=10βˆ’5+0.05β‹…0=10βˆ’5\alpha_{100} = 10^{-5} + 0.05 \cdot (1 + \cos \pi) = 10^{-5} + 0.05 \cdot 0 = 10^{-5}

The schedule starts near maximum, decays smoothly through the middle, and ends at the minimum β€” exactly the behavior needed for training neural networks.


Quick Reference

Key Takeaways

  • Update Rule: xk+1=xkβˆ’Ξ±kβˆ‡f(xk)x_{k+1} = x_k - \alpha_k \nabla f(x_k) β€” move opposite the gradient by step size Ξ±k\alpha_k.
  • Learning Rate: The most critical hyperparameter. Too large diverges, too small is slow. Use schedules (cosine, step decay) instead of constant values.
  • Batch vs. SGD vs. Mini-batch: Batch GD uses all data (stable but expensive), SGD uses one sample (fast but noisy), mini-batch GD (b=32b = 32–512512) is the practical default for deep learning.
  • Convergence Rate: O(1/k)O(1/k) for convex, linear convergence (ΞΊβˆ’1ΞΊ+1)2k\left(\frac{\kappa-1}{\kappa+1}\right)^{2k} for strongly convex, O(1/Ο΅2)O(1/\epsilon^2) to reach βˆ₯βˆ‡fβˆ₯<Ο΅\|\nabla f\| < \epsilon for non-convex.
  • Line Search: Backtracking with Armijo condition f(x+Ξ±d)≀f(x)+cΞ±βˆ‡fTdf(x + \alpha d) \leq f(x) + c\alpha \nabla f^T d provides adaptive step sizes without exact minimization.
  • Momentum: vk=Ξ²vkβˆ’1+βˆ‡f(xk)v_k = \beta v_{k-1} + \nabla f(x_k) with Ξ²β‰ˆ0.9\beta \approx 0.9 averages oscillations and accelerates through consistent gradient directions by 11βˆ’Ξ²\frac{1}{1-\beta}.
  • Nesterov Momentum: Evaluate gradient at the look-ahead position for O(1/k2)O(1/k^2) convergence β€” the theoretical foundation for accelerated methods.
  • Schedules: Cosine annealing Ξ±k=Ξ±min⁑+12(Ξ±maxβ‘βˆ’Ξ±min⁑)(1+cos⁑(Ο€k/T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k/T)) is the most popular schedule in deep learning. Add warm restarts to escape poor minima.
  • Deep Learning Practice: Start with Adam (lr=0.001) for fast convergence; switch to SGD+momentum for final training. Use gradient clipping when gradients explode.
  • Python: Use scipy.optimize.minimize() for small convex problems; implement custom loops with NumPy for full control over training dynamics.

Cross-References

  • Stochastic Gradient Descent: Adam, RMSProp, and adaptive learning rate methods -> SGD
  • Newton's Method: Second-order optimization with Hessian information -> Newton's Method
  • Constrained Optimization: KKT conditions, penalty methods, and interior-point methods -> Constrained Optimization
  • Convex Optimization: Global optimality guarantees and convex function properties -> Convex Optimization
  • Hyperparameter Optimization: Learning rate tuning via grid search, Bayesian optimization -> Hyperparameter Optimization
  • Calculus Optimization: First and second derivative tests, unconstrained optimization -> Calculus Optimization
  • Partial Derivatives: Gradients and Jacobians needed for computing βˆ‡f\nabla f -> Partial Derivatives
  • Linear Algebra Norms: Vector norms used in convergence analysis -> Linear Algebra Norms
⭐

Premium Content

Gradient Descent

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Mathematics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement