πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Multivariable Calculus

CalculusAdvanced🟒 Free Lesson

Advertisement

Multivariable Calculus

Why It Matters

Multivariable calculus is the mathematical foundation of modern machine learning and artificial intelligence. Every neural network training step involves computing gradients in high-dimensional spaces, where the loss function depends on millions or billions of parameters. Understanding partial derivatives, the gradient vector, directional derivatives, and second-order information (Hessian matrix) enables you to diagnose training issues, design better optimization algorithms, and understand why certain methods converge while others fail. In generative models like VAEs and GANs, change of variables through the Jacobian determinant is essential for computing probability densities. Double and multiple integrals appear in Bayesian inference, expectation calculations, and probabilistic models. Without multivariable calculus, advanced AI research and implementation would be impossible.


Multivariable Functions

DfMultivariable Function

A multivariable function is a mapping f:Rnβ†’Rmf: \mathbb{R}^n \to \mathbb{R}^m that takes a vector of nn inputs and produces a vector of mm outputs. When m=1m = 1, it is a scalar-valued function; when m>1m > 1, it is a vector-valued function. Common notation includes f(x1,x2,…,xn)f(x_1, x_2, \ldots, x_n) for scalar functions and fβƒ—(xβƒ—)\vec{f}(\vec{x}) for vector functions.

Scalar Function

f:Rnβ†’R,xβƒ—=(x1,x2,…,xn)↦f(xβƒ—)f: \mathbb{R}^n \to \mathbb{R}, \quad \vec{x} = (x_1, x_2, \ldots, x_n) \mapsto f(\vec{x})

Here,

  • nn=Number of input variables (dimension)
  • xβƒ—\vec{x}=Input vector in n-dimensional space
  • f(xβƒ—)f(\vec{x})=Scalar output value

Vector Function

f⃗:RntoRm,quadx⃗mapsto(f1(x⃗),f2(x⃗),ldots,fm(x⃗))\vec{f}: \mathbb{R}^n \\to \mathbb{R}^m, \\quad \vec{x} \\mapsto (f_1(\vec{x}), f_2(\vec{x}), \\ldots, f_m(\vec{x}))

Here,

  • mm=Number of output components
  • fif_i=The i-th component function

Example: A neural network layer with 256 inputs and 64 outputs is a vector function f⃗:R256→R64\vec{f}: \mathbb{R}^{256} \to \mathbb{R}^{64}.


Partial Derivatives Review

Quick Summary

The partial derivative of f(x1,…,xn)f(x_1, \ldots, x_n) with respect to xix_i is the derivative of ff while holding all other variables constant. It measures how ff changes as only xix_i varies.

Partial Derivative

βˆ‚fβˆ‚xi=lim⁑hto0f(x1,ldots,xi+h,ldots,xn)βˆ’f(x1,ldots,xi,ldots,xn)h\frac{\partial f}{\partial x_i} = \lim_{h \\to 0} \frac{f(x_1, \\ldots, x_i + h, \\ldots, x_n) - f(x_1, \\ldots, x_i, \\ldots, x_n)}{h}

Here,

  • βˆ‚fβˆ‚xi\frac{\partial f}{\partial x_i}=Partial derivative with respect to x_i
  • hh=Infinitesimal increment in x_i

Key Rules:

  • Sum Rule: βˆ‚βˆ‚xi(f+g)=βˆ‚fβˆ‚xi+βˆ‚gβˆ‚xi\frac{\partial}{\partial x_i}(f + g) = \frac{\partial f}{\partial x_i} + \frac{\partial g}{\partial x_i}
  • Product Rule: βˆ‚βˆ‚xi(fβ‹…g)=βˆ‚fβˆ‚xig+fβˆ‚gβˆ‚xi\frac{\partial}{\partial x_i}(f \cdot g) = \frac{\partial f}{\partial x_i} g + f \frac{\partial g}{\partial x_i}
  • Chain Rule: βˆ‚βˆ‚xif(g1(xβƒ—),…)=βˆ‘jβˆ‚fβˆ‚gjβˆ‚gjβˆ‚xi\frac{\partial}{\partial x_i} f(g_1(\vec{x}), \ldots) = \sum_j \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x_i}
  • Clairaut's Theorem: βˆ‚2fβˆ‚xiβˆ‚xj=βˆ‚2fβˆ‚xjβˆ‚xi\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} (for continuous second derivatives)

The Gradient Vector

DfGradient Vector

The gradient of a scalar-valued function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is the vector of all first-order partial derivatives. It points in the direction of steepest ascent and has magnitude equal to the maximum rate of change of ff.

Gradient

βˆ‡f=(βˆ‚fβˆ‚x1,βˆ‚fβˆ‚x2,ldots,βˆ‚fβˆ‚xn)\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \\ldots, \frac{\partial f}{\partial x_n} \right)

Here,

  • βˆ‡f\nabla f=Gradient vector (nabla f)
  • βˆ‚fβˆ‚xi\frac{\partial f}{\partial x_i}=Partial derivative along i-th axis

Geometric Interpretation

  • The gradient βˆ‡f(xβƒ—0)\nabla f(\vec{x}_0) is perpendicular (normal) to the level curve f(xβƒ—)=cf(\vec{x}) = c passing through xβƒ—0\vec{x}_0.
  • The gradient points in the direction of steepest increase of ff.
  • The negative gradient βˆ’βˆ‡f-\nabla f points in the direction of steepest decrease.
  • The magnitude βˆ£βˆ‡f∣|\nabla f| equals the maximum rate of change at that point.

Gradient Magnitude as Directional Max

βˆ£βˆ‡f∣=max⁑βˆ₯uβƒ—βˆ₯=1Duβƒ—f=max⁑βˆ₯uβƒ—βˆ₯=1βˆ‡fβ‹…uβƒ—|\nabla f| = \max_{\|\vec{u}\| = 1} D_{\vec{u}} f = \max_{\|\vec{u}\| = 1} \nabla f \cdot \vec{u}

Here,

  • uβƒ—\vec{u}=Unit direction vector
  • Duβƒ—fD_{\vec{u}} f=Directional derivative in direction u

Example: For f(x,y)=x2y+sin⁑(y)f(x, y) = x^2 y + \sin(y):

βˆ‡f=(2xy,β€…β€Šx2+cos⁑(y))\nabla f = \left( 2xy, \; x^2 + \cos(y) \right)

At point (1,Ο€)(1, \pi): βˆ‡f(1,Ο€)=(2Ο€,β€…β€Š1βˆ’1)=(2Ο€,0)\nabla f(1, \pi) = (2\pi, \; 1 - 1) = (2\pi, 0)


Directional Derivative

DfDirectional Derivative

The directional derivative of ff at point x⃗0\vec{x}_0 in the direction of unit vector u⃗\vec{u} measures the rate of change of ff as we move from x⃗0\vec{x}_0 in direction u⃗\vec{u}.

Directional Derivative

Duβƒ—f(xβƒ—0)=lim⁑hto0f(xβƒ—0+huβƒ—)βˆ’f(xβƒ—0)h=βˆ‡f(xβƒ—0)β‹…uβƒ—D_{\vec{u}} f(\vec{x}_0) = \lim_{h \\to 0} \frac{f(\vec{x}_0 + h\vec{u}) - f(\vec{x}_0)}{h} = \nabla f(\vec{x}_0) \cdot \vec{u}

Here,

  • Duβƒ—fD_{\vec{u}} f=Directional derivative in direction u
  • uβƒ—\vec{u}=Unit vector (||u|| = 1)
  • βˆ‡fβ‹…uβƒ—\nabla f \cdot \vec{u}=Dot product with gradient

ThCauchy-Schwarz Bound

The directional derivative is bounded by:

∣Duβƒ—fβˆ£β‰€βˆ£βˆ‡fβˆ£β‹…βˆ£uβƒ—βˆ£=βˆ£βˆ‡f∣|D_{\vec{u}} f| \leq |\nabla f| \cdot |\vec{u}| = |\nabla f|

Equality holds when uβƒ—\vec{u} is parallel to βˆ‡f\nabla f (maximum ascent) or anti-parallel (maximum descent).

Important

The direction vector uβƒ—\vec{u} must be a unit vector for the formula Duβƒ—f=βˆ‡fβ‹…uβƒ—D_{\vec{u}} f = \nabla f \cdot \vec{u} to give the rate of change per unit distance. If vβƒ—\vec{v} is not normalized, use uβƒ—=vβƒ—/∣vβƒ—βˆ£\vec{u} = \vec{v}/|\vec{v}|.

Example: For f(x,y)=x2+y2f(x,y) = x^2 + y^2 at (1,1)(1,1), direction u⃗=(1/2,1/2)\vec{u} = (1/\sqrt{2}, 1/\sqrt{2}):

βˆ‡f(1,1)=(2,2)\nabla f(1,1) = (2, 2), so Duβƒ—f=2β‹…12+2β‹…12=22D_{\vec{u}} f = 2 \cdot \frac{1}{\sqrt{2}} + 2 \cdot \frac{1}{\sqrt{2}} = 2\sqrt{2}


Hessian Matrix

DfHessian Matrix

The Hessian matrix of a scalar-valued function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is the n×nn \times n matrix of all second-order partial derivatives. It captures the local curvature (convexity/concavity) of ff and is essential for classifying critical points and designing second-order optimization methods.

Hessian

Hij=βˆ‚2fβˆ‚xiβˆ‚xj=βˆ‚βˆ‚xi(βˆ‚fβˆ‚xj)H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i} \left( \frac{\partial f}{\partial x_j} \right)

Here,

  • HijH_{ij}=The (i,j) entry of the Hessian
  • ff=The scalar-valued function
  • nn=Dimension of the input space

Properties

  • The Hessian is always symmetric (by Clairaut's theorem): Hij=HjiH_{ij} = H_{ji} for all i,ji, j.
  • It has n(n+1)/2n(n+1)/2 unique entries (not n2n^2).
  • At a critical point (βˆ‡f=0\nabla f = 0), the Hessian determines whether the point is a minimum, maximum, or saddle point.
  • The trace of the Hessian equals the Laplacian: tr(H)=βˆ‡2f=βˆ‘iβˆ‚2fβˆ‚xi2\text{tr}(H) = \nabla^2 f = \sum_i \frac{\partial^2 f}{\partial x_i^2}.

Example: For f(x,y)=x4+y4βˆ’4xy+1f(x, y) = x^4 + y^4 - 4xy + 1:

βˆ‡f=(4x3βˆ’4y,β€…β€Š4y3βˆ’4x)\nabla f = (4x^3 - 4y, \; 4y^3 - 4x) H=[12x2βˆ’4βˆ’412y2]H = \begin{bmatrix} 12x^2 & -4 \\ -4 & 12y^2 \end{bmatrix}

At critical point (1,1)(1,1): H=[12βˆ’4βˆ’412]H = \begin{bmatrix} 12 & -4 \\ -4 & 12 \end{bmatrix}


Multivariable Taylor Series

Second-Order Taylor Expansion

f(xβƒ—0+hβƒ—)=f(xβƒ—0)+βˆ‡f(xβƒ—0)Thβƒ—+12hβƒ—TH(xβƒ—0)hβƒ—+O(βˆ₯hβƒ—βˆ₯3)f(\vec{x}_0 + \vec{h}) = f(\vec{x}_0) + \nabla f(\vec{x}_0)^T \vec{h} + \frac{1}{2} \vec{h}^T H(\vec{x}_0) \vec{h} + O(\|\vec{h}\|^3)

Here,

  • xβƒ—0\vec{x}_0=Expansion point
  • hβƒ—\vec{h}=Small displacement vector
  • βˆ‡f(xβƒ—0)\nabla f(\vec{x}_0)=Gradient at x_0
  • H(xβƒ—0)H(\vec{x}_0)=Hessian at x_0

Interpretation of Terms

  • Zeroth order: f(xβƒ—0)f(\vec{x}_0) β€” the function value at the expansion point.
  • First order: βˆ‡fThβƒ—\nabla f^T \vec{h} β€” linear approximation using the gradient.
  • Second order: 12hβƒ—THhβƒ—\frac{1}{2} \vec{h}^T H \vec{h} β€” curvature correction using the Hessian.
  • Higher-order terms O(βˆ₯hβƒ—βˆ₯3)O(\|\vec{h}\|^3) become negligible for small hβƒ—\vec{h}.

ThTaylor's Theorem with Remainder

For f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} with continuous partial derivatives up to order k+1k+1:

f(xβƒ—0+hβƒ—)=βˆ‘βˆ£Ξ±βˆ£β‰€kDΞ±f(xβƒ—0)Ξ±!hβƒ—Ξ±+Rk(hβƒ—)f(\vec{x}_0 + \vec{h}) = \sum_{|\alpha| \leq k} \frac{D^\alpha f(\vec{x}_0)}{\alpha!} \vec{h}^\alpha + R_k(\vec{h})

where Ξ±\alpha is a multi-index and RkR_k is the remainder term.

Second-order approximation (scalar case):

f(x0+h)β‰ˆf(x0)+fβ€²(x0)h+12fβ€²β€²(x0)h2f(x_0 + h) \approx f(x_0) + f'(x_0)h + \frac{1}{2}f''(x_0)h^2

Vector form: hβƒ—THhβƒ—=βˆ‘i,jHijhihj\vec{h}^T H \vec{h} = \sum_{i,j} H_{ij} h_i h_j is a quadratic form.


Critical Points Classification

DfCritical Point

A point xβƒ—0\vec{x}_0 is a critical point of ff if βˆ‡f(xβƒ—0)=0βƒ—\nabla f(\vec{x}_0) = \vec{0} (all partial derivatives are zero simultaneously).

ThSecond Derivative Test

At a critical point xβƒ—0\vec{x}_0 where βˆ‡f(xβƒ—0)=0βƒ—\nabla f(\vec{x}_0) = \vec{0}, classify using the Hessian H(xβƒ—0)H(\vec{x}_0):

  • Positive definite (all eigenvalues >0> 0): xβƒ—0\vec{x}_0 is a local minimum.
  • Negative definite (all eigenvalues <0< 0): xβƒ—0\vec{x}_0 is a local maximum.
  • Indefinite (eigenvalues of mixed signs): xβƒ—0\vec{x}_0 is a saddle point.
  • Semi-definite (some eigenvalues =0= 0): Test is inconclusive; need higher-order analysis.

Determinant Criterion (2D)

det(H)=H11H22βˆ’H122\\det(H) = H_{11}H_{22} - H_{12}^2

Here,

  • H11H_{11}=Second partial w.r.t. x
  • H22H_{22}=Second partial w.r.t. y
  • H12H_{12}=Mixed partial

Classification rules for f(x,y)f(x,y):

H11H_{11}det⁑(H)\det(H)Classification
>0> 0>0> 0Local minimum
<0< 0>0> 0Local maximum
β€”<0< 0Saddle point
β€”=0= 0Inconclusive

Example: f(x,y)=x2+y2βˆ’2xβˆ’4y+5f(x,y) = x^2 + y^2 - 2x - 4y + 5

βˆ‡f=(2xβˆ’2,β€…β€Š2yβˆ’4)=(0,0)β€…β€ŠβŸΉβ€…β€Š(x,y)=(1,2)\nabla f = (2x - 2, \; 2y - 4) = (0,0) \implies (x,y) = (1,2)

H=[2002]H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}, eigenvalues {2,2}>0β€…β€ŠβŸΉβ€…β€Š\{2, 2\} > 0 \implies local minimum


Double Integrals

Double Integral over a Region

iintDf(x,y),dA=∫ab∫g1(x)g2(x)f(x,y),dy,dx\\iint_D f(x,y) \\, dA = \int_a^b \int_{g_1(x)}^{g_2(x)} f(x,y) \\, dy \\, dx

Here,

  • DD=The region of integration in the xy-plane
  • f(x,y)f(x,y)=The integrand function
  • g1(x),g2(x)g_1(x), g_2(x)=Lower and upper bounds for y

Geometric Interpretation

  • If f(x,y)β‰₯0f(x,y) \geq 0, the double integral ∬Df(x,y) dA\iint_D f(x,y) \, dA represents the volume under the surface z=f(x,y)z = f(x,y) and above the region DD.
  • If f(x,y)=1f(x,y) = 1, the integral ∬D1 dA=Area(D)\iint_D 1 \, dA = \text{Area}(D) gives the area of region DD.
  • Probability density: ∬Dp(x,y) dA\iint_D p(x,y) \, dA gives the probability that (X,Y)∈D(X,Y) \in D.

Triple Integral

iiintVf(x,y,z),dV=∫ab∫g1(x)g2(x)∫h1(x,y)h2(x,y)f(x,y,z),dz,dy,dx\\iiint_V f(x,y,z) \\, dV = \int_a^b \int_{g_1(x)}^{g_2(x)} \int_{h_1(x,y)}^{h_2(x,y)} f(x,y,z) \\, dz \\, dy \\, dx

Here,

  • VV=Volume region in 3D space
  • dVdV=Volume element dx dy dz

Polar Coordinates

For circular regions, use polar coordinates: x=rcos⁑θx = r\cos\theta, y=rsin⁑θy = r\sin\theta.

∬Df(x,y) dx dy=∫αβ∫0Rf(rcos⁑θ,rsin⁑θ) r dr dΞΈ\iint_D f(x,y) \, dx \, dy = \int_\alpha^\beta \int_0^R f(r\cos\theta, r\sin\theta) \, r \, dr \, d\theta

The factor rr is the Jacobian determinant of the polar coordinate transformation.


Change of Variables

DfJacobian Determinant

When changing variables in a multiple integral, the Jacobian determinant accounts for the distortion of area/volume caused by the coordinate transformation. For a transformation (x,y)=T(u,v)(x, y) = T(u, v), the area element transforms as dx dy=∣JTβˆ£β€‰du dvdx \, dy = |J_T| \, du \, dv.

Jacobian Determinant

det(JT)=det[βˆ‚xβˆ‚uβˆ‚xβˆ‚vβˆ‚yβˆ‚uβˆ‚yβˆ‚v]\\det(J_T) = \\det \begin{bmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\\\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{bmatrix}

Here,

  • JTJ_T=Jacobian matrix of the transformation T
  • ∣det⁑(JT)∣|\det(J_T)|=Absolute value of Jacobian determinant

Change of Variables Formula (2D)

iintT(D)f(x,y),dx,dy=iintDf(T(u,v)),∣det(JT)∣,du,dv\\iint_{T(D)} f(x,y) \\, dx \\, dy = \\iint_D f(T(u,v)) \\, |\\det(J_T)| \\, du \\, dv

Here,

  • TT=Transformation from (u,v) to (x,y)
  • DD=Region in the (u,v) plane
  • T(D)T(D)=Transformed region in the (x,y) plane

Common transformations:

Transformationxxyy∣det⁑(J)∣|\det(J)|
Polarrcos⁑θr\cos\thetarsin⁑θr\sin\thetarr
Cylindricalrcos⁑θr\cos\thetarsin⁑θr\sin\thetarr
Sphericalρsin⁑ϕcos⁑θ\rho\sin\phi\cos\thetaρsin⁑ϕsin⁑θ\rho\sin\phi\sin\thetaρ2sin⁑ϕ\rho^2\sin\phi

ThInverse Function Theorem

If T:Rnβ†’RnT: \mathbb{R}^n \to \mathbb{R}^n is continuously differentiable and det⁑(JT(xβƒ—0))β‰ 0\det(J_T(\vec{x}_0)) \neq 0, then TT is locally invertible near xβƒ—0\vec{x}_0 and the Jacobian of Tβˆ’1T^{-1} is JTβˆ’1(T(xβƒ—0))=JT(xβƒ—0)βˆ’1J_{T^{-1}}(T(\vec{x}_0)) = J_T(\vec{x}_0)^{-1}.


Python Implementation

import numpy as np
import sympy as sp

# ============================================
# 1. Partial Derivatives with SymPy
# ============================================
x, y = sp.symbols('x y')
f = x**2 * y + sp.sin(y)

# Partial derivatives
df_dx = sp.diff(f, x)
df_dy = sp.diff(f, y)
print(f"f = {f}")
print(f"βˆ‚f/βˆ‚x = {df_dx}")      # 2*x*y
print(f"βˆ‚f/βˆ‚y = {df_dy}")      # x**2 + cos(y)

# Gradient
grad_f = [df_dx, df_dy]
print(f"βˆ‡f = {grad_f}")

# ============================================
# 2. Hessian Matrix with SymPy
# ============================================
f_hess = x**4 + y**4 - 4*x*y + 1
H = sp.hessian(f_hess, (x, y))
print(f"\nHessian of {f_hess}:")
sp.pprint(H)

eigenvals = H.eigenvals()
print(f"Eigenvalues at (1,1): {H.subs({x:1, y:1}).eigenvals()}")

# ============================================
# 3. Gradient Descent Implementation
# ============================================
def gradient_descent(grad_f, x0, lr=0.01, max_iter=1000, tol=1e-8):
    """Simple gradient descent optimization."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    
    for i in range(max_iter):
        g = np.array(grad_f(x), dtype=float)
        if np.linalg.norm(g) < tol:
            print(f"Converged in {i} iterations")
            break
        x = x - lr * g
        history.append(x.copy())
    
    return x, np.array(history)

# Minimize f(x,y) = x^2 + y^2 - 2x - 4y + 5
f = lambda v: v[0]**2 + v[1]**2 - 2*v[0] - 4*v[1] + 5
grad = lambda v: np.array([2*v[0]-2, 2*v[1]-4])

x_min, hist = gradient_descent(grad, [5.0, 5.0], lr=0.1)
print(f"\nMinimum at: {x_min}")  # Should be [1, 2]

# ============================================
# 4. Numerical Jacobian
# ============================================
def numerical_jacobian(f, x, h=1e-7):
    """Compute Jacobian matrix numerically using central differences."""
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    return J

# Example: f(x,y) = [x^2 + y, x*y^2]
f_vec = lambda v: np.array([v[0]**2 + v[1], v[0]*v[1]**2])
x_point = np.array([1.0, 2.0])
J = numerical_jacobian(f_vec, x_point)
print(f"\nJacobian at (1,2):\n{J}")
# Expected: [[2, 1], [4, 4]]

# ============================================
# 5. Numerical Hessian
# ============================================
def numerical_hessian(f, x, h=1e-5):
    """Compute Hessian matrix numerically."""
    n = len(x)
    H = np.zeros((n, n))
    f0 = f(x)
    
    for i in range(n):
        for j in range(n):
            x_pp = x.copy()
            x_pm = x.copy()
            x_mp = x.copy()
            x_mm = x.copy()
            
            x_pp[i] += h; x_pp[j] += h
            x_pm[i] += h; x_pm[j] -= h
            x_mp[i] -= h; x_mp[j] += h
            x_mm[i] -= h; x_mm[j] -= h
            
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    
    return H

# Test on f(x,y) = x^4 + y^4 - 4xy + 1
f_test = lambda v: v[0]**4 + v[1]**4 - 4*v[0]*v[1] + 1
H_num = numerical_hessian(f_test, np.array([1.0, 1.0]))
print(f"\nNumerical Hessian at (1,1):\n{H_num}")

# ============================================
# 6. Double Integration with SciPy
# ============================================
from scipy import integrate

def integrand(y, x):
    return np.exp(-(x**2 + y**2))

# Integrate over circle of radius R
result, error = integrate.dblquad(
    integrand,
    -2, 2,           # x bounds
    lambda x: -np.sqrt(4-x**2),  # y lower
    lambda x: np.sqrt(4-x**2)    # y upper
)
print(f"\nDouble integral over circle: {result:.6f}")
print(f"Expected (β‰ˆ Ο€(1-e^(-4))): {np.pi*(1-np.exp(-4)):.6f}")

Applications in AI/ML

Optimization Landscapes

The loss function L(ΞΈ)L(\theta) in machine learning is a scalar function of the parameter vector θ∈Rd\theta \in \mathbb{R}^d. The geometry of this function β€” its gradient, curvature, and critical points β€” directly determines the behavior of training algorithms.

1. Gradient Descent: The most fundamental optimization algorithm uses ΞΈt+1=ΞΈtβˆ’Ξ·βˆ‡L(ΞΈt)\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t). The gradient βˆ‡L\nabla L provides the direction of steepest descent.

2. Second-Order Methods: Newton's method uses the Hessian to achieve quadratic convergence:

ΞΈt+1=ΞΈtβˆ’Hβˆ’1(ΞΈt)βˆ‡L(ΞΈt)\theta_{t+1} = \theta_t - H^{-1}(\theta_t) \nabla L(\theta_t)

The Hessian's eigenvalues determine convergence rate: eigenvalues close to zero slow convergence, while large condition numbers indicate ill-conditioning.

3. Natural Gradient: The Fisher information matrix (an expected Hessian) accounts for the geometry of the parameter space, providing better updates than standard gradient descent.

4. Saddle Points in High Dimensions: In high-dimensional loss landscapes, saddle points are exponentially more common than local minima. The Hessian's indefinite eigenvalues reveal saddle points that trap first-order methods.

5. Normalizing Flows: Use the change of variables formula with the Jacobian determinant to transform simple distributions into complex ones:

log⁑pX(x)=log⁑pZ(fβˆ’1(x))+log⁑∣det⁑(Jfβˆ’1(x))∣\log p_X(x) = \log p_Z(f^{-1}(x)) + \log |\det(J_{f^{-1}}(x))|

6. Backpropagation: Each layer's Jacobian matrix propagates gradients through the chain rule. Vanishing/exploding gradients correspond to Jacobian eigenvalues less than or greater than 1.


Common Mistakes

Common Mistakes in Multivariable Calculus

MistakeCorrect Approach
Forgetting chain rule: ddxf(g(x))=fβ€²(g(x))\frac{d}{dx}f(g(x)) = f'(g(x))ddxf(g(x))=fβ€²(g(x))β‹…gβ€²(x)\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)
Ignoring mixed partialsVerify fxy=fyxf_{xy} = f_{yx} for continuous functions
Wrong integration orderCheck bounds: ∫ab∫g1(x)g2(x)dy dx\int_a^b \int_{g_1(x)}^{g_2(x)} dy \, dx vs ∫cd∫h1(y)h2(y)dx dy\int_c^d \int_{h_1(y)}^{h_2(y)} dx \, dy
Missing Jacobian determinantAlways include ∣det⁑(J)∣|\det(J)| when changing variables
Confusing βˆ‡f\nabla f with dfdfβˆ‡f\nabla f is a vector; dfdf is a differential (covector)
Not normalizing direction vectoruβƒ—\vec{u} must satisfy βˆ₯uβƒ—βˆ₯=1\|\vec{u}\| = 1 for directional derivative
Assuming Hessian is positive definiteCheck eigenvalues, not just diagonal entries
Forgetting absolute value in JacobianUse ∣det⁑(J)∣|\det(J)| for area/volume element

Interview Questions

Question 1: Gradient Properties

Q: Why is the gradient perpendicular to level curves?

Answer

A level curve is defined by f(xβƒ—)=cf(\vec{x}) = c. Taking the total differential: βˆ‡fβ‹…dxβƒ—=0\nabla f \cdot d\vec{x} = 0. Since dxβƒ—d\vec{x} is tangent to the level curve, βˆ‡f\nabla f must be perpendicular to it. This is because the directional derivative along the level curve is zero (the function value doesn't change), so βˆ‡fβ‹…tβƒ—=0\nabla f \cdot \vec{t} = 0 for any tangent vector tβƒ—\vec{t}.

Question 2: Hessian in Optimization

Q: Why can't we always use Newton's method with the Hessian?

Answer

Three main reasons: (1) Computing and storing the nΓ—nn \times n Hessian requires O(n2)O(n^2) memory, infeasible for millions of parameters. (2) Inverting the Hessian costs O(n3)O(n^3). (3) At saddle points, the Hessian is not positive definite, causing Newton's method to move toward the saddle instead of a minimum. Quasi-Newton methods (L-BFGS) approximate the Hessian to address these issues.

Question 3: Jacobian in Neural Networks

Q: How does the Jacobian relate to backpropagation?

Answer

Backpropagation applies the chain rule: if layer ll computes aβƒ—l=Οƒ(Wlaβƒ—lβˆ’1+bβƒ—l)\vec{a}^l = \sigma(W^l \vec{a}^{l-1} + \vec{b}^l), then βˆ‚Lβˆ‚aβƒ—lβˆ’1=(Wl)Tdiag(Οƒβ€²)βˆ‚Lβˆ‚aβƒ—l\frac{\partial L}{\partial \vec{a}^{l-1}} = (W^l)^T \text{diag}(\sigma') \frac{\partial L}{\partial \vec{a}^l}. The Jacobian of layer ll is Jl=diag(Οƒβ€²)WlJ_l = \text{diag}(\sigma') W^l. If eigenvalues of JlJ_l are all <1< 1, gradients vanish; if >1> 1, they explode.

Question 4: Directional Derivative

Q: In what direction does ff increase most rapidly at a point?

Answer

The direction of most rapid increase is the gradient direction βˆ‡f/βˆ₯βˆ‡fβˆ₯\nabla f / \|\nabla f\|. The rate of increase is βˆ₯βˆ‡fβˆ₯\|\nabla f\|. This follows from the Cauchy-Schwarz inequality: Duβƒ—f=βˆ‡fβ‹…u⃗≀βˆ₯βˆ‡fβˆ₯β‹…βˆ₯uβƒ—βˆ₯=βˆ₯βˆ‡fβˆ₯D_{\vec{u}} f = \nabla f \cdot \vec{u} \leq \|\nabla f\| \cdot \|\vec{u}\| = \|\nabla f\|, with equality when uβƒ—=βˆ‡f/βˆ₯βˆ‡fβˆ₯\vec{u} = \nabla f / \|\nabla f\|.

Question 5: Saddle Points

Q: Why are saddle points more common than local minima in high dimensions?

Answer

A critical point is a saddle point if the Hessian has at least one positive and one negative eigenvalue. In dd dimensions, the probability that all eigenvalues have the same sign decreases exponentially with dd. For a random matrix, the probability of being positive definite is approximately 2βˆ’d2^{-d}. Thus, in high dimensions (e.g., d=106d = 10^6), virtually all critical points are saddle points.

Question 6: Taylor Series

Q: When does the second-order Taylor approximation underestimate the true function value?

Answer

The second-order approximation f(xβƒ—0+hβƒ—)β‰ˆf(xβƒ—0)+βˆ‡fThβƒ—+12hβƒ—THhβƒ—f(\vec{x}_0 + \vec{h}) \approx f(\vec{x}_0) + \nabla f^T \vec{h} + \frac{1}{2}\vec{h}^T H \vec{h} underestimates when the remainder R2=16βˆ‡3f(ΞΎ)h3R_2 = \frac{1}{6} \nabla^3 f(\xi) h^3 (in 1D) is positive. For multivariable functions, if ff is convex (Hβͺ°0H \succeq 0) and we expand around a minimum (βˆ‡f=0\nabla f = 0), the approximation is exact for quadratic functions and underestimates for functions with positive higher-order derivatives.


Practice Problems

Problem 1: Gradient and Directional Derivative

Compute the gradient of f(x,y,z)=x2y+yz2f(x,y,z) = x^2 y + yz^2 at the point (1,2,3)(1,2,3) and find the directional derivative in the direction of v⃗=(1,1,1)\vec{v} = (1,1,1).

Solution

βˆ‡f=(2xy,β€…β€Šx2+z2,β€…β€Š2yz)\nabla f = (2xy, \; x^2 + z^2, \; 2yz)

At (1,2,3)(1,2,3): βˆ‡f=(4,β€…β€Š10,β€…β€Š12)\nabla f = (4, \; 10, \; 12)

Normalize: u⃗=13(1,1,1)\vec{u} = \frac{1}{\sqrt{3}}(1,1,1)

Duβƒ—f=βˆ‡fβ‹…uβƒ—=13(4+10+12)=263β‰ˆ15.01D_{\vec{u}} f = \nabla f \cdot \vec{u} = \frac{1}{\sqrt{3}}(4 + 10 + 12) = \frac{26}{\sqrt{3}} \approx 15.01

Problem 2: Hessian Classification

Classify the critical points of f(x,y)=x3βˆ’3xy2+y3f(x,y) = x^3 - 3xy^2 + y^3.

Solution

βˆ‡f=(3x2βˆ’3y2,β€…β€Šβˆ’6xy+3y2)=(0,0)\nabla f = (3x^2 - 3y^2, \; -6xy + 3y^2) = (0,0)

From first equation: x2=y2β€…β€ŠβŸΉβ€…β€Šx=Β±yx^2 = y^2 \implies x = \pm y

If x=yx = y: βˆ’6y2+3y2=βˆ’3y2=0β€…β€ŠβŸΉβ€…β€Šy=0β€…β€ŠβŸΉβ€…β€Š(0,0)-6y^2 + 3y^2 = -3y^2 = 0 \implies y = 0 \implies (0,0)

If x=βˆ’yx = -y: 6y2+3y2=9y2=0β€…β€ŠβŸΉβ€…β€Šy=0β€…β€ŠβŸΉβ€…β€Š(0,0)6y^2 + 3y^2 = 9y^2 = 0 \implies y = 0 \implies (0,0)

H=[6xβˆ’6yβˆ’6yβˆ’6x+6y]H = \begin{bmatrix} 6x & -6y \\ -6y & -6x + 6y \end{bmatrix}

At (0,0)(0,0): H=[0000]H = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

The Hessian is the zero matrix β€” test is inconclusive. Higher-order analysis shows (0,0)(0,0) is a saddle point (the function takes both positive and negative values in every neighborhood of the origin).

Problem 3: Change of Variables

Evaluate ∬D(x2+y2) dA\iint_D (x^2 + y^2) \, dA where DD is the disk x2+y2≀4x^2 + y^2 \leq 4 using polar coordinates.

Solution

Convert to polar: x=rcos⁑θx = r\cos\theta, y=rsin⁑θy = r\sin\theta, x2+y2=r2x^2 + y^2 = r^2, dA=r dr dΞΈdA = r \, dr \, d\theta

∫02Ο€βˆ«02r2β‹…r dr dΞΈ=∫02Ο€dθ∫02r3 dr\int_0^{2\pi} \int_0^2 r^2 \cdot r \, dr \, d\theta = \int_0^{2\pi} d\theta \int_0^2 r^3 \, dr=2Ο€β‹…[r44]02=2Ο€β‹…164=8Ο€β‰ˆ25.13= 2\pi \cdot \left[\frac{r^4}{4}\right]_0^2 = 2\pi \cdot \frac{16}{4} = 8\pi \approx 25.13

Problem 4: Taylor Expansion

Find the second-order Taylor approximation of f(x,y)=ex+yf(x,y) = e^{x+y} at (0,0)(0,0) and evaluate at (0.1,0.1)(0.1, 0.1).

Solution

At (0,0)(0,0): f=1f = 1, βˆ‡f=(1,1)\nabla f = (1,1), H=[1111]H = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

f(hβƒ—)β‰ˆ1+(h1+h2)+12(h12+2h1h2+h22)f(\vec{h}) \approx 1 + (h_1 + h_2) + \frac{1}{2}(h_1^2 + 2h_1 h_2 + h_2^2)

At (0.1,0.1)(0.1, 0.1): fβ‰ˆ1+0.2+12(0.01+0.02+0.01)=1+0.2+0.02=1.22f \approx 1 + 0.2 + \frac{1}{2}(0.01 + 0.02 + 0.01) = 1 + 0.2 + 0.02 = 1.22

Exact: e0.2β‰ˆ1.2214e^{0.2} \approx 1.2214 (error β‰ˆ0.0014\approx 0.0014)

Problem 5: Jacobian Determinant

Compute the Jacobian determinant of the transformation x=u2βˆ’v2x = u^2 - v^2, y=2uvy = 2uv and use it to evaluate ∬R(x+y) dx dy\iint_R (x+y) \, dx \, dy where RR is the region bounded by x=0x = 0, x=1x = 1, y=0y = 0, y=1y = 1.

Solution

βˆ‚(x,y)βˆ‚(u,v)=det⁑[2uβˆ’2v2v2u]=4u2+4v2=4(u2+v2)\frac{\partial(x,y)}{\partial(u,v)} = \det \begin{bmatrix} 2u & -2v \\ 2v & 2u \end{bmatrix} = 4u^2 + 4v^2 = 4(u^2+v^2)

The transformation maps to the first quadrant. Setting u2βˆ’v2=0β€…β€ŠβŸΉβ€…β€Šu=vu^2 - v^2 = 0 \implies u = v, and u2βˆ’v2=1u^2 - v^2 = 1 defines a hyperbola.

∬R(x+y) dx dy=∬D(u2βˆ’v2+2uv)β‹…4(u2+v2) du dv\iint_R (x+y) \, dx \, dy = \iint_D (u^2 - v^2 + 2uv) \cdot 4(u^2+v^2) \, du \, dv

This integral evaluates to approximately 0.50.5 after computing the appropriate bounds in (u,v)(u,v) space.


Quick Reference

Multivariable Calculus Quick Reference

ConceptFormulaKey Property
Gradientβˆ‡f=(βˆ‚fβˆ‚x1,…,βˆ‚fβˆ‚xn)\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)Direction of steepest ascent
Directional DerivativeDuβƒ—f=βˆ‡fβ‹…uβƒ—D_{\vec{u}} f = \nabla f \cdot \vec{u}, βˆ₯uβƒ—βˆ₯=1\|\vec{u}\| = 1Rate of change in direction uβƒ—\vec{u}
HessianHij=βˆ‚2fβˆ‚xiβˆ‚xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}Symmetric matrix; determines curvature
Taylor (2nd order)f(xβƒ—0+hβƒ—)β‰ˆf+βˆ‡fThβƒ—+12hβƒ—THhβƒ—f(\vec{x}_0 + \vec{h}) \approx f + \nabla f^T \vec{h} + \frac{1}{2}\vec{h}^T H \vec{h}Quadratic approximation
Critical pointβˆ‡f(xβƒ—0)=0βƒ—\nabla f(\vec{x}_0) = \vec{0}Candidate for min/max/saddle
MinH≻0H \succ 0 (all Ξ»i>0\lambda_i > 0)Convex curvature
MaxH≺0H \prec 0 (all λi<0\lambda_i < 0)Concave curvature
SaddleHH indefinite (mixed-sign eigenvalues)Curves up in some directions, down in others
Double integral∬Df(x,y) dA=∫∫f(x,y) dy dx\iint_D f(x,y) \, dA = \int \int f(x,y) \, dy \, dxVolume under surface
JacobianJij=βˆ‚fiβˆ‚xjJ_{ij} = \frac{\partial f_i}{\partial x_j}Local linear approximation of fβƒ—\vec{f}
Jacobian det.det⁑(J)\det(J)Area/volume scaling factor
Polar coordsx=rcos⁑θx = r\cos\theta, y=rsin⁑θy = r\sin\theta, dA=r dr dΞΈdA = r \, dr \, d\thetaUse for circular regions

Cross-References

⭐

Premium Content

Multivariable Calculus

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Mathematics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement