Deep Learning

Neural Networks — The Foundation of Modern AI

Discover how neural networks form the backbone of modern AI systems, enabling machines to learn complex patterns from data.

Universal function approximation — learn any mapping from inputs to outputs
Backpropagation — efficient gradient computation for training
Deep architectures — stack layers for hierarchical feature learning

The brain is a computer made of meat, and it is very good at being a brain.

Neural Networks Fundamentals

Neural networks learn complex patterns by stacking simple computational units (neurons) in layers. At the mathematical core, a neural network is a parameterized nonlinear function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$ that is optimized via gradient-based methods.

The Perceptron

The perceptron is the atomic unit of neural computation. Given input vector $\mathbf{x} \in \mathbb{R}^n$, weights $\mathbf{w} \in \mathbb{R}^n$, and bias $b \in \mathbb{R}$:

z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b

\hat{y} = \sigma(z)
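The two equations above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes $\sigma$ is the logistic sigmoid; the function name `perceptron` and the input values are made up for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def perceptron(x, w, b):
    """Return (z, y_hat) for one neuron: z = w.x + b, y_hat = sigmoid(z)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)

# Illustrative values (assumptions, not taken from the text).
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
z, y_hat = perceptron(x, w, b)
print(z, y_hat)
```

With these values the weighted sum is exactly zero, so the point lies on the decision boundary and the sigmoid output is 0.5.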

Geometric Interpretation

A single perceptron computes a linear decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ in $\mathbb{R}^n$. The activation function introduces nonlinearity. A single perceptron can only solve linearly separable problems (XOR is impossible with one neuron; Minsky and Papert, 1969).
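The XOR limitation can be checked on a small scale. The sketch below assumes a hard-threshold (step) activation and searches a coarse grid of weights: it finds single-neuron solutions for AND but none for XOR, then shows a hand-built two-layer network that does compute XOR. The grid search is only an illustration, not a proof, and all names and weight values are assumptions made for the example.

```python
from itertools import product

def step(z):
    """Hard threshold: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def neuron(x, w, b):
    """One threshold neuron on a 2-D input."""
    return step(w[0] * x[0] + w[1] * x[1] + b)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def solves(w, b, targets):
    """True if the neuron reproduces every target on the four inputs."""
    return all(neuron(x, w, b) == t for x, t in zip(INPUTS, targets))

# Coarse grid over w1, w2, b in {-4.0, -3.5, ..., 4.0}.
grid = [i * 0.5 for i in range(-8, 9)]
and_solutions = [(w1, w2, b) for w1, w2, b in product(grid, repeat=3)
                 if solves((w1, w2), b, AND)]
xor_solutions = [(w1, w2, b) for w1, w2, b in product(grid, repeat=3)
                 if solves((w1, w2), b, XOR)]

def xor_two_layer(x):
    """XOR as (x1 OR x2) AND NOT (x1 AND x2), using three threshold neurons."""
    h_or = neuron(x, (1.0, 1.0), -0.5)
    h_and = neuron(x, (1.0, 1.0), -1.5)
    return step(h_or - h_and - 0.5)

print(len(and_solutions) > 0, len(xor_solutions))
print([xor_two_layer(x) for x in INPUTS])
```

The hidden layer remaps the four inputs so that the output neuron sees a linearly separable problem, which is the basic reason depth helps.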

Activation Functions

Activation functions introduce nonlinearity, enabling networks to approximate arbitrary functions. Without them, a multi-layer network collapses to a single linear transformation.

Properties:

• ReLU: Range [0, ∞), gradient ∈ {0, 1}, dead neurons possible

• Sigmoid: Range (0, 1), gradient ∈ (0, 0.25], vanishing gradients

• Tanh: Range (-1, 1), zero-centered, still vanishing gradients

• GELU: Smooth approximation of ReLU, used in Transformers (BERT, GPT)

• Swish: f(x) = x · σ(x), self-gated, used in EfficientNet
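The activations listed above can be written directly with the standard library. This is a minimal scalar sketch; it assumes the exact GELU form $x \cdot \Phi(x)$ with $\Phi$ the standard normal CDF, whereas libraries often ship faster approximations.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Branch on the sign so math.exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # Exact form x * Phi(x), with Phi computed from the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)

for name, f in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh),
                ("gelu", gelu), ("swish", swish)]:
    print(name, [round(f(v), 4) for v in (-2.0, 0.0, 2.0)])
```

For large positive inputs GELU and Swish both approach ReLU, while for negative inputs they stay slightly below zero instead of clamping to exactly zero.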

Activation Function Derivatives

For backpropagation, we need derivatives:

ReLU: $f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
Sigmoid: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
Tanh: $\tanh'(x) = 1 - \tanh^2(x)$
GELU: $f'(x) = \Phi(x) + x \cdot \phi(x)$ where $\Phi$ is the CDF and $\phi$ is the PDF of $\mathcal{N}(0,1)$

The vanishing gradient problem arises because each sigmoid layer multiplies the backpropagated gradient by a factor $\sigma'(x) \leq 0.25$; across many layers these factors compound, so the gradient can shrink exponentially with depth.
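A quick way to sanity-check the derivative formulas is to compare them with central finite differences. The sketch below does that in plain Python and also evaluates the $0.25^{\text{depth}}$ bound on the product of sigmoid derivative factors; the helper names (`numeric_derivative`, `sigmoid_chain_bound`) are assumptions for the example.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * std_normal_cdf(x)

# Analytic derivatives, matching the formulas in the text.
def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_gelu(x):
    return std_normal_cdf(x) + x * std_normal_pdf(x)

def numeric_derivative(f, x, h=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def sigmoid_chain_bound(depth):
    """Upper bound on the product of `depth` sigmoid derivative factors."""
    return 0.25 ** depth

for x in (-2.0, -0.3, 0.7, 1.5):
    print(x,
          abs(d_sigmoid(x) - numeric_derivative(sigmoid, x)),
          abs(d_tanh(x) - numeric_derivative(math.tanh, x)),
          abs(d_gelu(x) - numeric_derivative(gelu, x)))
print(sigmoid_chain_bound(10))
```

Ten sigmoid layers already cap the activation-derivative product below one millionth, before the weight matrices are taken into account.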

Multi-Layer Perceptron (MLP)

An MLP stacks layers of neurons to form a deep network. Each layer computes an affine transformation followed by a nonlinear activation:

\mathbf{h}^{(l)} = \sigma\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $\mathbf{h}^{(0)} = \mathbf{x}$ is the input.
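The layer recurrence can be sketched without any libraries by storing each $W^{(l)}$ as a list of rows. This is a minimal illustration: the weights are arbitrary example values, and, following the formula above, the activation is applied at every layer including the last.

```python
import math

def affine(W, h, b):
    """Compute W h + b, with W given as a list of rows."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def mlp_forward(x, layers, activation=math.tanh):
    """Apply h = activation(W h + b) for each (W, b) in `layers`."""
    h = list(x)
    for W, b in layers:
        h = [activation(z) for z in affine(W, h, b)]
    return h

# Example network (assumed values): 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]], [0.0, 0.0, 0.0]),  # W1 is 3 x 2
    ([[1.0, 1.0, 1.0]], [0.0]),                                # W2 is 1 x 3
]

def relu(z):
    return max(0.0, z)

out_tanh = mlp_forward([0.0, 0.0], layers)
out_relu = mlp_forward([1.0, 2.0], layers, activation=relu)
print(out_tanh, out_relu)
```

Each row of `W` holds the incoming weights of one neuron, so the shape $n_l \times n_{l-1}$ from the text corresponds to `len(W)` rows of length `len(h)`.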

Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik, 1991): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, given appropriate weights and a suitable activation function (for example, a continuous, bounded, non-constant one such as the sigmoid).

Formally: For any continuous $f: [0,1]^n \to \mathbb{R}$ and $\epsilon > 0$, there exist $N \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$, $\mathbf{v}_i \in \mathbb{R}^n$, and $b_i \in \mathbb{R}$ such that:

F(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \cdot \sigma(\mathbf{v}_i^\top \mathbf{x} + b_i)

satisfies $|F(\mathbf{x}) - f(\mathbf{x})| < \epsilon$ for all $\mathbf{x} \in [0,1]^n$.

Caveat: This theorem is non-constructive: it guarantees existence but not efficiency. The required $N$ may be exponential in $n$ (curse of dimensionality).
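One way to build intuition for the sum-of-sigmoids form is a staircase construction in one input dimension: steep sigmoids act as near-steps, and their weighted sum traces the target function. The sketch below is an illustration only, not the proof technique of the cited papers; `staircase_terms`, the target `square`, and the `sharpness` value are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp never overflows for steep sigmoids.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def staircase_terms(f, n_terms, sharpness):
    """Build (alpha_i, v_i, b_i) so that sum_i alpha_i * sigmoid(v_i * x + b_i)
    follows f on [0, 1] as a smoothed staircase with n_terms steps."""
    mids = [(i + 0.5) / n_terms for i in range(n_terms)]
    # Constant offset expressed in the same form: sigmoid(0 * x + 0) = 0.5.
    terms = [(2.0 * f(mids[0]), 0.0, 0.0)]
    for i in range(1, n_terms):
        knot = i / n_terms
        jump = f(mids[i]) - f(mids[i - 1])
        terms.append((jump, sharpness, -sharpness * knot))
    return terms

def F(x, terms):
    """Evaluate the single-hidden-layer sum at a scalar input x."""
    return sum(alpha * sigmoid(v * x + b) for alpha, v, b in terms)

def max_error(f, terms, samples=1001):
    """Largest |F(x) - f(x)| on an evenly spaced grid over [0, 1]."""
    xs = [i / (samples - 1) for i in range(samples)]
    return max(abs(F(x, terms) - f(x)) for x in xs)

def square(x):
    return x * x

coarse = staircase_terms(square, 10, 2000.0)
fine = staircase_terms(square, 50, 2000.0)
print(max_error(square, coarse), max_error(square, fine))
```

In this construction the error shrinks roughly in proportion to $1/N$ in one dimension. Covering $[0,1]^n$ with a comparable grid would need a number of cells that grows exponentially in $n$, which is consistent with the caveat above.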

Backpropagation

Backpropagation computes the gradient of the loss $\mathcal{L}$ with respect to every parameter via the chain rule applied recursively from output to input.

Backpropagation Algorithm

Given a network with $L$ layers, the forward pass computes:

\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})

The loss $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ is computed at the output. Backward pass:

Output error: $\delta^{(L)} = \nabla_{\mathbf{a}} \mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})$
Propagate: $\delta^{(l)} = (W^{(l+1)})^\top \delta^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})$
Gradients: $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^\top$, $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$

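Computational complexity: the forward pass costs $O(\sum_{l=1}^{L} n_l \cdot n_{l-1})$ per example, and the backward pass has the same asymptotic cost, so a full forward-plus-backward step costs a small constant multiple of the forward pass (about $2\times$ by this count).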
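The backward-pass equations can be checked numerically on a tiny network. The sketch below assumes sigmoid activations at every layer and a squared-error loss $\mathcal{L} = \frac{1}{2}\|\mathbf{a}^{(L)} - \mathbf{y}\|^2$ (the text does not fix a loss); the weights, input, and helper names are made up for the example. It compares the analytic weight gradients with central finite differences.

```python
import copy
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Return the activations [a0, a1, ..., aL] with a0 = x."""
    activations = [list(x)]
    for W, b in params:
        a_prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(v) for v in z])
    return activations

def loss(a_out, y):
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a_out, y))

def backward(x, y, params):
    """Return [(dL/dW, dL/db), ...] for each layer via the delta recursion."""
    acts = forward(x, params)
    # Output error: dL/da * sigma'(z), using sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        a_prev = acts[l]
        # dL/dW = delta * a_prev^T (outer product), dL/db = delta.
        grads[l] = ([[d * a for a in a_prev] for d in delta], list(delta))
        if l > 0:
            W = params[l][0]
            # Propagate: (W^T delta) * sigma'(z) of the layer below.
            delta = [
                sum(W[i][j] * delta[i] for i in range(len(delta)))
                * a_prev[j] * (1.0 - a_prev[j])
                for j in range(len(a_prev))
            ]
    return grads

def numeric_grad_W(x, y, params, layer, i, j, eps=1e-6):
    """Central finite difference for one weight entry."""
    def loss_with(shift):
        p = copy.deepcopy(params)
        p[layer][0][i][j] += shift
        return loss(forward(x, p)[-1], y)
    return (loss_with(eps) - loss_with(-eps)) / (2.0 * eps)

# Assumed example: 2 inputs -> 2 hidden units -> 1 output.
params = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),
    ([[0.5, -0.6]], [0.2]),
]
x, y = [1.0, 0.5], [1.0]

grads = backward(x, y, params)
max_diff = max(
    abs(grads[l][0][i][j] - numeric_grad_W(x, y, params, l, i, j))
    for l, (W, _) in enumerate(params)
    for i in range(len(W))
    for j in range(len(W[0]))
)
print(max_diff)
```

The finite-difference check needs two forward passes per weight, while the delta recursion obtains every gradient from one forward and one backward pass; that gap is what makes backpropagation practical.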

Gradient Descent Variants

SGD with Momentum

Vanilla SGD: $\theta_{t+1} = \theta_t - \alpha \nabla \mathcal{L}(\theta_t)$

Momentum (Polyak, 1964): Accelerates convergence by accumulating velocity:

\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla \mathcal{L}(\theta_t)

\theta_{t+1} = \theta_t - \alpha \mathbf{v}_t

where $\beta$ (typically between 0.9 and 0.99) controls how much past gradient is retained. Accumulating velocity dampens oscillations across narrow valleys of the loss landscape while speeding up progress along them.
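The two update rules can be compared on a toy problem. The sketch below uses the exact (non-stochastic) gradient of an ill-conditioned quadratic $\frac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$ as a stand-in for $\nabla \mathcal{L}$; the objective, step size, and step count are assumptions for the example. Setting $\beta = 0$ recovers the vanilla update.

```python
def grad(theta):
    """Gradient of 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

def sgd_momentum(theta0, lr, beta, steps):
    """Run v = beta * v + grad, theta = theta - lr * v for `steps` iterations."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [beta * v_i + g_i for v_i, g_i in zip(v, g)]
        theta = [t_i - lr * v_i for t_i, v_i in zip(theta, v)]
    return theta

plain = sgd_momentum([1.0, 1.0], lr=0.01, beta=0.0, steps=200)
heavy = sgd_momentum([1.0, 1.0], lr=0.01, beta=0.9, steps=200)
print(plain, heavy)
```

After the same number of steps, the momentum run is much closer to the minimum at the origin along the shallow direction. Part of that gain comes from a larger effective step: under a constant gradient the velocity settles near $\nabla \mathcal{L} / (1 - \beta)$, so the comparison at equal $\alpha$ is not a like-for-like benchmark.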

Weight Initialization

Why Initialization Matters

Poor initialization causes exploding/vanishing activations. For a network with $L$ layers:

\text{Var}(h^{(l)}) = \text{Var}(h^{(0)}) \prod_{k=1}^{l} \left( n_{k-1} \, \text{Var}(W^{(k)}) \right)

If $n_{k-1} \, \text{Var}(W^{(k)}) \neq 1$ consistently across layers, the variance grows or shrinks exponentially with depth.

Xavier/Glorot (sigmoid/tanh): $W \sim \mathcal{N}(0, \frac{2}{n_{in} + n_{out}})$

He/Kaiming (ReLU): $W \sim \mathcal{N}(0, \frac{2}{n_{in}})$ — accounts for ReLU halving variance.
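The effect of the weight scale can be simulated with a small random ReLU network. The sketch below is a rough experiment in plain Python under stated assumptions: square layers, no biases, Gaussian weights, and the mean square of the activations as the tracked quantity. The width, depth, seed, and function names are arbitrary choices for the example.

```python
import math
import random

def relu_layer(h, std, rng):
    """One square ReLU layer with freshly sampled N(0, std^2) weights."""
    out = []
    for _ in range(len(h)):
        z = sum(rng.gauss(0.0, std) * h_j for h_j in h)
        out.append(max(0.0, z))
    return out

def mean_square(h):
    return sum(v * v for v in h) / len(h)

def second_moment_ratio(width, depth, std, seed=0):
    """Mean square of the final activations divided by that of the input."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(h)
    for _ in range(depth):
        h = relu_layer(h, std, rng)
    return mean_square(h) / start

WIDTH, DEPTH = 128, 8
he_std = math.sqrt(2.0 / WIDTH)  # He/Kaiming: variance 2 / n_in

he_ratio = second_moment_ratio(WIDTH, DEPTH, he_std)
small_ratio = second_moment_ratio(WIDTH, DEPTH, 0.5 * he_std)
large_ratio = second_moment_ratio(WIDTH, DEPTH, 2.0 * he_std)
print(he_ratio, small_ratio, large_ratio)
```

With the He scale the ratio stays within an order of magnitude of 1. Halving or doubling the standard deviation changes the per-layer factor to about 1/4 or 4, so after eight layers the signal collapses or blows up by several orders of magnitude.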

PyTorch Implementation

Example: Building a Neural Network

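import torch
import torch.nn as nn
import torch.optim as optim

class NeuralNet(nn.Module):
    """MLP for binary classification: [Linear -> BatchNorm -> ReLU -> Dropout] blocks, then a sigmoid output."""

    def __init__(self, input_dim=10, hidden_dims=(64, 32), output_dim=1):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        layers.append(nn.Sigmoid())  # outputs probabilities, as nn.BCELoss expects
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Placeholder data (an assumption; the original does not define X_train / y_train).
# Replace with real features of shape (N, 10) and float labels in {0, 1} of shape (N, 1).
X_train = torch.randn(256, 10)
y_train = torch.randint(0, 2, (256, 1)).float()

model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(100):
    model.train()  # enable dropout and batch-norm batch statistics
    y_pred = model(X_train)  # full-batch forward pass
    loss = criterion(y_pred, y_train)

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()  # backpropagation
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
    optimizer.step()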

Key Takeaways

Summary: Neural Networks

Neural networks are universal function approximators (theorem guarantees existence, not efficiency)
ReLU is the default activation for hidden layers; GELU for Transformers
Backpropagation computes all gradients in time linear in the number of parameters via the chain rule (the same asymptotic cost as the forward pass)
Xavier/He initialization prevents vanishing/exploding activations
Learning rate is often the most important hyperparameter to tune
SGD with momentum often generalizes better; Adam often converges faster
Overfitting is controlled by dropout, weight decay, data augmentation, early stopping
GPU acceleration is essential for training large networks

What to Learn Next

-> Convolutional Neural Networks Learn how CNNs process visual data with parameter sharing.

-> RNNs and LSTMs Explore networks designed for sequential data.

-> Training Deep Networks Master optimizers, batch norm, and regularization.

-> Transformers Learn the architecture that replaced RNNs.

-> Weight Initialization Understand Xavier, He, and modern initialization.

-> Optimizers for Deep Learning SGD, Adam, AdamW, and beyond.
