Deep Learning
Neural Networks — The Foundation of Modern AI
Discover how neural networks form the backbone of modern AI systems, enabling machines to learn complex patterns from data.
- Universal function approximation — learn any mapping from inputs to outputs
- Backpropagation — efficient gradient computation for training
- Deep architectures — stack layers for hierarchical feature learning
The brain is a computer made of meat, and it is very good at being a brain.
Neural Networks Fundamentals
Neural networks learn complex patterns by stacking simple computational units (neurons) in layers. At the mathematical core, a neural network is a parameterized nonlinear function that is optimized via gradient-based methods.
The Perceptron
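The perceptron is the atomic unit of neural computation. Given input vector $x \in \mathbb{R}^n$, weights $w \in \mathbb{R}^n$, and bias $b \in \mathbb{R}$, it computes:
$$y = \sigma(w^\top x + b)$$
where $\sigma$ is the activation function.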
Geometric Interpretation
A single perceptron computes a linear decision boundary in $\mathbb{R}^n$: the hyperplane $w^\top x + b = 0$. The activation function makes the output nonlinear, but the boundary itself stays linear, so a single perceptron can only solve linearly separable problems (XOR is impossible with one neuron; Minsky and Papert, 1969).
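To make the linear-separability limit concrete, here is a minimal sketch of a step-activation perceptron on two binary inputs. The AND weights are hand-picked for illustration, and the grid search is only a coarse numerical check (weights and bias in [-2, 2] with step 0.25), not a proof.

```python
from itertools import product

def perceptron(x, w, b):
    """Step-activation perceptron: returns 1 if w.x + b > 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def solves(w, b, targets):
    """True if the perceptron reproduces every target on the four inputs."""
    return all(perceptron(x, w, b) == t for x, t in zip(INPUTS, targets))

# Hand-picked (illustrative) weights that realize AND.
and_ok = solves((1.0, 1.0), -1.5, AND)

# Coarse grid over (w1, w2, b); no combination reproduces XOR.
grid = [i * 0.25 for i in range(-8, 9)]
xor_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if solves((w1, w2), b, XOR)
]

print(and_ok)              # True
print(len(xor_solutions))  # 0
```

The empty result matches the algebra: XOR would need $b \le 0$, $w_1 + b > 0$, $w_2 + b > 0$, and $w_1 + w_2 + b \le 0$ at once, which is contradictory.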
Activation Functions
Activation functions introduce nonlinearity, enabling networks to approximate arbitrary functions. Without them, a multi-layer network collapses to a single linear transformation.
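The collapse can be checked numerically. The sketch below uses small hand-picked matrices (illustrative values, biases omitted for brevity) to show that two stacked linear layers equal a single layer with weights $W_2 W_1$, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # first layer, 2 -> 3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # second layer, 3 -> 2
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))     # no activation in between
one_layer = matvec(matmul(W2, W1), x)      # single equivalent linear map
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer)  # [-4.5, 5.5] [-4.5, 5.5]
print(with_relu)              # [0.0, 2.0]
```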
Activation Function Derivatives
For backpropagation, we need derivatives:
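- ReLU: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, else $0$
- Sigmoid: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
- Tanh: $\tanh'(z) = 1 - \tanh^2(z)$
- GELU: $\mathrm{GELU}'(z) = \Phi(z) + z\,\phi(z)$, where $\Phi$ is the CDF and $\phi$ is the PDF of $\mathcal{N}(0, 1)$
The vanishing gradient problem occurs when $|\sigma'(z)| < 1$ is multiplied across many layers, causing gradients to shrink exponentially. For the sigmoid, $\sigma'(z) \le 1/4$.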
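One way to sanity-check these derivatives is to compare each closed form against a central finite difference. The sketch below does that with the standard library only; the test points and step size `h` are arbitrary choices, and the GELU here is the exact form $z\,\Phi(z)$ rather than the tanh approximation some libraries use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_gelu(z):
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

def numeric(f, z, h=1e-5):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

pairs = [(relu, d_relu), (sigmoid, d_sigmoid), (math.tanh, d_tanh), (gelu, d_gelu)]
points = (-2.0, -0.5, 0.7, 3.0)  # avoids z = 0, where ReLU is not differentiable
max_err = max(abs(numeric(f, z) - df(z)) for f, df in pairs for z in points)

# Upper bound on the product of 20 sigmoid derivatives (each at most 1/4).
bound = 0.25 ** 20
print(max_err < 1e-6, bound < 1e-12)  # True True
```

The second number illustrates the vanishing gradient problem: even in the best case, twenty sigmoid layers scale a gradient by less than $10^{-12}$ before the weight matrices are taken into account.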
Multi-Layer Perceptron (MLP)
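An MLP stacks layers of neurons to form a deep network. Each layer computes an affine transformation followed by a nonlinear activation:
$$a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right), \qquad l = 1, \dots, L$$
where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$, $b^{(l)} \in \mathbb{R}^{n_l}$, and $a^{(0)} = x$ is the input.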
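The layer recursion can be written as a short loop. This is a minimal pure-Python sketch, not a library API: the `(W, b, activation)` triple format and the weight values are assumptions made for illustration.

```python
def relu(v):
    return [max(0.0, vi) for vi in v]

def identity(v):
    return v

def affine(W, b, a):
    """Compute W a + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    """Apply each (W, b, activation) triple in order, starting from the input x."""
    a = x
    for W, b, activation in layers:
        a = activation(affine(W, b, a))
    return a

# Illustrative 2 -> 3 -> 1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -1.0, 0.5], relu),
    ([[1.0, 2.0, -1.0]], [0.25], identity),
]

y = mlp_forward([2.0, 1.0], layers)
print(y)  # [1.75]
```

Tracing by hand: the hidden layer gives `[1.0, 0.5, 0.5]` (all positive, so ReLU leaves it unchanged), and the output is `1.0 + 1.0 - 0.5 + 0.25 = 1.75`.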
Universal Approximation Theorem
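Theorem (Cybenko, 1989; Hornik, 1991): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, given appropriate weights and a suitable activation function (sigmoidal in Cybenko's proof; non-constant, bounded, and continuous in Hornik's).
Formally: For any continuous $f: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $v_i, b_i \in \mathbb{R}$, and $w_i \in \mathbb{R}^n$ such that:
$$F(x) = \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i)$$
satisfies $|F(x) - f(x)| < \varepsilon$ for all $x \in K$.
Caveat: This theorem is non-constructive: it guarantees existence but not efficiency. The required $N$ may be exponential in the input dimension $n$ (curse of dimensionality).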
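In one dimension the approximation can be built by hand, which gives some intuition for why more hidden units help. The sketch below constructs a single-hidden-layer ReLU network that linearly interpolates a target function at evenly spaced knots. Two assumptions to note: ReLU is unbounded, so it is covered by later extensions of the theorem rather than the classical statements, and the function names here are hypothetical helpers, not part of any library.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, n_hidden):
    """Return (bias, knots, coeffs) for F(x) = bias + sum_i coeffs[i] * relu(x - knots[i]).

    F interpolates f linearly at n_hidden + 1 evenly spaced points on [lo, hi].
    """
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden)]
    values = [f(lo + i * step) for i in range(n_hidden + 1)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(n_hidden)]
    # Each hidden unit adds the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return values[0], knots, coeffs

def evaluate(net, x):
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

def max_error(f, net, lo, hi, samples=1000):
    """Largest absolute error on an evenly spaced sample of [lo, hi]."""
    xs = [lo + (hi - lo) * j / samples for j in range(samples + 1)]
    return max(abs(f(x) - evaluate(net, x)) for x in xs)

lo, hi = 0.0, 2.0 * math.pi
errors = {
    n: max_error(math.sin, build_relu_net(math.sin, lo, hi, n), lo, hi)
    for n in (4, 16, 64)
}
for n, err in errors.items():
    print(n, err)
```

The error shrinks as the hidden layer widens. In higher input dimensions, a grid-based construction like this one needs a number of units that grows exponentially with the dimension, which is the efficiency gap the caveat refers to.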
Backpropagation
Backpropagation computes the gradient of the loss with respect to every parameter via the chain rule applied recursively from output to input.
Backpropagation Algorithm
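Given a network with $L$ layers, the forward pass computes, for $l = 1, \dots, L$:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)}), \qquad a^{(0)} = x$$
The loss $\mathcal{L}$ is computed at the output $a^{(L)}$. Backward pass:
- Output error: $\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)})$
- Propagate: $\delta^{(l)} = \left( (W^{(l+1)})^\top \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$
- Gradients: $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top$, $\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$
Computational complexity: Forward $O\left(\sum_{l} n_l \, n_{l-1}\right)$ per example, where $n_l$ is the width of layer $l$; backward is the same order. Total: a small constant multiple (roughly 2 to 3 times) of the forward cost.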
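The three backward-pass steps can be sketched in plain Python and verified against a finite difference. The assumptions here are illustrative: sigmoid activations on every layer, a squared-error loss $\frac{1}{2}\lVert a^{(L)} - y \rVert^2$, and hand-picked weights for a 2 -> 2 -> 1 network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the list of activations; acts[0] is the input."""
    acts = [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi for row, bi in zip(W, b)]
        acts.append([sigmoid(v) for v in z])
    return acts

def loss(params, x, y):
    out = forward(params, x)[-1]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(out, y))

def backward(params, x, y):
    """Return [(dW, db), ...] per layer, using sigma'(z) = a * (1 - a)."""
    acts = forward(params, x)
    # Output error: (a - y) * sigma'(z) for the squared-error loss.
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads = []
    for l in range(len(params) - 1, -1, -1):
        # Gradients: dW = delta * (input activations)^T, db = delta.
        dW = [[d * a for a in acts[l]] for d in delta]
        grads.append((dW, list(delta)))
        if l > 0:
            W, prev = params[l][0], acts[l]
            # Propagate: (W^T delta) * sigma'(z) of the previous layer.
            delta = [
                sum(W[j][i] * delta[j] for j in range(len(delta))) * prev[i] * (1.0 - prev[i])
                for i in range(len(prev))
            ]
    return grads[::-1]

params = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),  # layer 1: 2 -> 2
    ([[0.5, -0.6]], [0.2]),                   # layer 2: 2 -> 1
]
x, y = [1.0, 2.0], [1.0]
analytic = backward(params, x, y)[0][0][1][0]  # dL/dW for layer 1, row 1, column 0

# Central finite difference on the same weight.
eps = 1e-6
params[0][0][1][0] += eps
up = loss(params, x, y)
params[0][0][1][0] -= 2 * eps
down = loss(params, x, y)
params[0][0][1][0] += eps
numeric = (up - down) / (2 * eps)

print(abs(numeric - analytic) < 1e-7)  # True
```

The finite-difference check needs two forward passes per parameter, while the backward pass produces every gradient in one sweep. That difference is why backpropagation is the practical choice.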
Gradient Descent Variants
SGD with Momentum
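Vanilla SGD:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$
Momentum (Polyak, 1964): Accelerates convergence by accumulating velocity:
$$v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$
where $\eta$ is the learning rate and $\beta \in [0, 1)$ controls the momentum. This dampens oscillations in narrow valleys of the loss landscape.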
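The effect is easiest to see on an ill-conditioned quadratic, where one direction is much steeper than the other. The sketch below compares plain gradient descent ($\beta = 0$) with momentum ($\beta = 0.9$) at the same learning rate. It uses the velocity form $v \leftarrow \beta v + g$, $\theta \leftarrow \theta - \eta v$; the loss, learning rate, and step count are illustrative assumptions, and the gradients are exact rather than stochastic.

```python
def grad(theta):
    # Gradient of L(theta) = 0.5 * (theta[0]**2 + 50 * theta[1]**2).
    return [theta[0], 50.0 * theta[1]]

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)

def run(lr, beta, steps=100):
    """Run gradient descent with momentum coefficient beta; return the final loss."""
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]      # accumulate velocity
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]  # parameter update
    return loss(theta)

loss_sgd = run(lr=0.02, beta=0.0)
loss_momentum = run(lr=0.02, beta=0.9)

print(loss_momentum < loss_sgd)  # True
```

Without momentum, the learning rate is capped by the steep direction, so progress along the shallow direction is slow (a factor of 0.98 per step here). With $\beta = 0.9$, both directions contract at roughly $\sqrt{\beta} \approx 0.95$ per step.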
Weight Initialization
Why Initialization Matters
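Poor initialization causes exploding/vanishing activations. For a network with $L$ layers whose weights are independent with zero mean (ignoring the activation function):
$$\mathrm{Var}(a^{(L)}) \approx \mathrm{Var}(x) \prod_{l=1}^{L} n_{l-1} \, \mathrm{Var}(W^{(l)})$$
If $n_{l-1} \, \mathrm{Var}(W^{(l)}) \neq 1$, variance grows or shrinks exponentially with depth.
Xavier/Glorot (sigmoid/tanh): $\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$
He/Kaiming (ReLU): $\mathrm{Var}(W) = \frac{2}{n_{\text{in}}}$, which accounts for ReLU halving the variance.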
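A small simulation shows the effect of the factor of 2 for ReLU networks. The sketch below pushes one random input through 10 bias-free ReLU layers of width 64 and reports how the mean squared activation changes, once with He scaling ($\mathrm{Var}(W) = 2/n$) and once with $\mathrm{Var}(W) = 1/n$. Width, depth, and seed are arbitrary choices for illustration.

```python
import math
import random

def relu_forward_ratio(weight_std_fn, width=64, depth=10, seed=0):
    """Mean square of the final activations divided by the mean square of the input."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = sum(v * v for v in a) / width
    std = weight_std_fn(width)
    for _ in range(depth):
        # One bias-free layer: fresh Gaussian weights, then ReLU.
        a = [max(0.0, sum(rng.gauss(0.0, std) * v for v in a)) for _ in range(width)]
    return (sum(v * v for v in a) / width) / start

he = relu_forward_ratio(lambda n: math.sqrt(2.0 / n))     # Var(W) = 2 / n
naive = relu_forward_ratio(lambda n: math.sqrt(1.0 / n))  # Var(W) = 1 / n

print(he, naive, he / naive)
```

Because both runs share a seed, they draw the same underlying normal samples and differ only in scale, so the ratio `he / naive` comes out at $2^{10} = 1024$ up to rounding: each ReLU layer halves the signal under the $1/n$ scaling, while He scaling keeps it at a roughly constant magnitude.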
PyTorch Implementation
Example: Building a Neural Network
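import torch
import torch.nn as nn
import torch.optim as optim

class NeuralNet(nn.Module):
    """MLP for binary classification: Linear -> BatchNorm -> ReLU -> Dropout blocks, then a sigmoid output."""

    def __init__(self, input_dim=10, hidden_dims=(64, 32), output_dim=1):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        layers.append(nn.Sigmoid())  # outputs probabilities, as nn.BCELoss expects
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Placeholder data (an assumption, not part of the original example):
# 256 random samples with 10 features and random binary labels.
X_train = torch.randn(256, 10)
y_train = torch.randint(0, 2, (256, 1)).float()

model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Full-batch training loop.
for epoch in range(100):
    model.train()  # enable dropout and batch-norm batch statistics
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to guard against exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()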
Key Takeaways
Summary: Neural Networks
- Neural networks are universal function approximators (theorem guarantees existence, not efficiency)
- ReLU is the default activation for hidden layers; GELU for Transformers
- Backpropagation computes all gradients via the chain rule in time proportional to the forward pass
- Xavier/He initialization prevents vanishing/exploding activations
- Learning rate is usually the most important hyperparameter to tune
- SGD with momentum often generalizes better; Adam typically converges faster
- Overfitting is controlled by dropout, weight decay, data augmentation, early stopping
- GPU acceleration is essential for training large networks
What to Learn Next
-> Convolutional Neural Networks: Learn how CNNs process visual data with parameter sharing.
-> RNNs and LSTMs: Explore networks designed for sequential data.
-> Training Deep Networks: Master optimizers, batch norm, and regularization.
-> Transformers: Learn the architecture that replaced RNNs.
-> Weight Initialization: Understand Xavier, He, and modern initialization.
-> Optimizers for Deep Learning: SGD, Adam, AdamW, and beyond.