
Neural Networks Fundamentals — Perceptrons to Deep Learning

Neural Networks — The Foundation of Modern AI

Discover how neural networks form the backbone of modern AI systems, enabling machines to learn complex patterns from data.

  • Universal function approximation — learn any mapping from inputs to outputs
  • Backpropagation — efficient gradient computation for training
  • Deep architectures — stack layers for hierarchical feature learning

The brain is a computer made of meat, and it is very good at being a brain.

Neural Networks Fundamentals

Neural networks learn complex patterns by stacking simple computational units (neurons) in layers. At the mathematical core, a neural network is a parameterized nonlinear function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$ that is optimized via gradient-based methods.


The Perceptron

The perceptron is the atomic unit of neural computation. Given input vector $\mathbf{x} \in \mathbb{R}^n$, weights $\mathbf{w} \in \mathbb{R}^n$, and bias $b \in \mathbb{R}$:

$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$
$$\hat{y} = \sigma(z)$$

[Figure: perceptron with inputs x₁, x₂, x₃ and weights w₁, w₂, w₃ feeding a summation (Σ + b), then the activation σ(·), producing the output ŷ.]
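
The two equations above translate almost line for line into code. The following is a minimal sketch in plain Python, assuming a sigmoid for $\sigma$; the function names and the input, weight, and bias values are illustrative and not part of the lesson.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def perceptron(x, w, b):
    """Return (z, y_hat) for one neuron: z = w.x + b, y_hat = sigmoid(z)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)

# Illustrative values only (not from the lesson).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2
z, y_hat = perceptron(x, w, b)
print(z, y_hat)
```

Here $z = 0.5 - 0.5 + 0.3 + 0.2 = 0.5$, so the output is $\sigma(0.5) \approx 0.622$.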

Geometric Interpretation

A single perceptron computes a linear decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ in $\mathbb{R}^n$. The activation function introduces nonlinearity. A single perceptron can only solve linearly separable problems (XOR is impossible with one neuron; Minsky and Papert, 1969).
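
To make the separability limit concrete, here is a small sketch that trains one neuron with the classic perceptron learning rule (a step activation rather than a sigmoid, chosen here for simplicity) on AND and on XOR. The helper names and the epoch count are assumptions for illustration.

```python
def step(z):
    return 1 if z > 0 else 0

def train_perceptron(data, epochs=50, lr=1):
    """Classic perceptron learning rule on 2-D inputs with a step activation."""
    w, b = [0, 0], 0
    for _ in range(epochs):
        for (x1, x2), target in data:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            err = target - pred          # 0 when correct, +1 or -1 otherwise
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum(step(w[0] * x1 + w[1] * x2 + b) == t for (x1, x2), t in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

AND is linearly separable, so the rule settles on a separating line. For XOR no such line exists, so at least one of the four points stays misclassified no matter how long training runs.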


Activation Functions

Activation functions introduce nonlinearity, enabling networks to approximate arbitrary functions. Without them, a multi-layer network collapses to a single linear transformation.
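
The collapse is easy to verify numerically: two affine layers with no activation in between equal one affine layer with $W = W_2 W_1$ and $\mathbf{b} = W_2 \mathbf{b}_1 + \mathbf{b}_2$. A minimal sketch in plain Python, with small integer matrices chosen only for illustration:

```python
def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two "layers" with no activation in between (illustrative integer values).
W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]   # R^2 -> R^3
W2, b2 = [[2, 0, 1], [-1, 1, 0]], [0, 5]         # R^3 -> R^2
x = [4, -3]

two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map as a single affine layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)
```

Both expressions give the same vector, which is why depth adds no expressive power unless a nonlinearity sits between the layers.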

[Figure: plots of ReLU f(x) = max(0, x), Sigmoid σ(x) = 1/(1+e⁻ˣ), Tanh tanh(x), and GELU x · Φ(x).]

Properties:

• ReLU: Range [0, ∞), gradient ∈ {0, 1}, dead neurons possible

• Sigmoid: Range (0, 1), gradient ∈ (0, 0.25], vanishing gradients

• Tanh: Range (-1, 1), zero-centered, still vanishing gradients

• GELU: Smooth approximation of ReLU, used in Transformers (BERT, GPT)

• Swish: f(x) = x · σ(x), self-gated, used in EfficientNet

Activation Function Derivatives

For backpropagation, we need derivatives:

  • ReLU: $f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
  • Sigmoid: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  • Tanh: $\tanh'(x) = 1 - \tanh^2(x)$
  • GELU: $f'(x) = \Phi(x) + x \cdot \phi(x)$ where $\Phi$ is the CDF and $\phi$ is the PDF of $\mathcal{N}(0,1)$
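
One way to sanity-check these formulas is to compare each analytic derivative against a central finite difference. The sketch below does that in plain Python; the test points and step size are arbitrary choices, and GELU is taken in its exact form $x \cdot \Phi(x)$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):   # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# (function, analytic derivative) pairs, following the formulas above.
ACTIVATIONS = {
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "gelu": (lambda x: x * Phi(x), lambda x: Phi(x) + x * phi(x)),
}

def numeric_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

max_gap = 0.0
for name, (f, df) in ACTIVATIONS.items():
    for x in (-2.0, -0.5, 0.7, 3.0):   # avoid x = 0, where ReLU has a kink
        gap = abs(df(x) - numeric_derivative(f, x))
        max_gap = max(max_gap, gap)
        print(f"{name:8s} x={x:+.1f} analytic={df(x):.6f} gap={gap:.2e}")
```

The gaps should be at the level of floating-point noise; a large gap would point to a wrong formula.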

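The vanishing gradient problem follows from the sigmoid bound $\sigma'(x) \leq 0.25$: backpropagation multiplies one such factor per layer, so across $L$ sigmoid layers the activation-derivative part of the gradient is at most $0.25^L$ and shrinks exponentially with depth.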
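
The effect shows up even in a toy scalar chain $a_l = \sigma(w \cdot a_{l-1})$. The sketch below multiplies the per-layer chain-rule factors and prints them next to the $0.25^L$ bound; the starting activation and the weight $w = 1$ are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, a0=0.5, w=1.0):
    """d a_L / d a_0 for the scalar chain a_l = sigmoid(w * a_{l-1})."""
    a, grad = a0, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)   # chain rule factor: w * sigmoid'(z)
    return grad

for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(depth), 0.25 ** depth)
```

With $|w| \leq 1$ the gradient stays below the bound at every depth. Larger weights can offset the shrinkage, which is one reason initialization matters.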


Multi-Layer Perceptron (MLP)

An MLP stacks layers of neurons to form a deep network. Each layer computes an affine transformation followed by a nonlinear activation:

$$\mathbf{h}^{(l)} = \sigma\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $\mathbf{h}^{(0)} = \mathbf{x}$ is the input.
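
The layer recursion can be sketched as a short loop. This is a plain-Python illustration, assuming a sigmoid activation in every layer and small hypothetical layer sizes (3, 4, 2); the random weights are there only to give the loop something to compute.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_out, n_in, rng):
    """Return (W, b) with W of shape n_out x n_in, matching W^{(l)} above."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def forward(x, layers):
    """Apply h^{(l)} = sigma(W^{(l)} h^{(l-1)} + b^{(l)}) for each layer."""
    h = x                                   # h^{(0)} = x
    for W, b in layers:
        h = [sigmoid(sum(w * v for w, v in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
    return h

rng = random.Random(0)
sizes = [3, 4, 2]                           # n_0, n_1, n_2 (illustrative)
layers = [init_layer(n_out, n_in, rng)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output = forward([0.2, -1.0, 0.5], layers)
print(output)
```

Each weight matrix has $n_l$ rows and $n_{l-1}$ columns, so the output of one layer has exactly the length the next layer expects.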

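[Figure: MLP with an input layer, Hidden 1 (64 neurons), Hidden 2 (32 neurons), and a single output neuron with σ producing ŷ; consecutive layers are connected by (W⁽¹⁾, b⁽¹⁾), (W⁽²⁾, b⁽²⁾), and (W⁽³⁾, b⁽³⁾).]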

Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik, 1991): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, given appropriate weights and a suitable activation function (Cybenko assumed a sigmoidal activation; Hornik required it to be continuous, bounded, and non-constant).

Formally: For any continuous $f: [0,1]^n \to \mathbb{R}$ and $\epsilon > 0$, there exist $N$, $\alpha_i \in \mathbb{R}$, $\mathbf{v}_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$ such that:

$$F(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \cdot \sigma(\mathbf{v}_i^\top \mathbf{x} + b_i)$$

satisfies $|F(\mathbf{x}) - f(\mathbf{x})| < \epsilon$ for all $\mathbf{x} \in [0,1]^n$.

Caveat: This theorem is non-constructive: it guarantees that suitable weights exist but says nothing about how to find them or how many neurons are needed. The required $N$ may be exponential in $n$ (curse of dimensionality).
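
In one dimension the sum-of-sigmoids form can be built by hand, which gives some intuition for the theorem. The sketch below is not the construction used in the proofs: it places steep sigmoids between grid points to form a smoothed staircase. The target $f(x) = x^2$, the steepness, and the helper names are all assumptions for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_approximator(f, n_units, steepness=50.0):
    """Return F(x) = sum_i alpha_i * sigmoid(v_i * x + b_i) on [0, 1].

    Each unit is a steep sigmoid centered between two grid points, so F is
    a smoothed staircase that climbs by f(t_i) - f(t_{i-1}) at each step.
    Assumes f(0) = 0, since the formula has no constant term.
    """
    units = []
    for i in range(1, n_units + 1):
        t_prev, t_i = (i - 1) / n_units, i / n_units
        alpha = f(t_i) - f(t_prev)
        v = steepness * n_units                 # slope of the sigmoid
        b = -v * 0.5 * (t_prev + t_i)           # center at the midpoint
        units.append((alpha, v, b))
    return lambda x: sum(a * sigmoid(v * x + b) for a, v, b in units)

def max_error(f, F, n_grid=1000):
    return max(abs(F(k / n_grid) - f(k / n_grid)) for k in range(n_grid + 1))

def target(x):                                  # illustrative target, f(0) = 0
    return x * x

errors = {n: max_error(target, build_approximator(target, n)) for n in (5, 20, 80)}
for n, err in errors.items():
    print(f"N = {n:3d}  max |F - f| = {err:.4f}")
```

The maximum error falls as $N$ grows, matching the statement that any $\epsilon$ can be met with enough hidden units. A grid like this needs on the order of $N^n$ units in $n$ dimensions, which is where the exponential cost in the caveat comes from.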


Backpropagation

Backpropagation computes the gradient of the loss $\mathcal{L}$ with respect to every parameter via the chain rule applied recursively from output to input.

[Figure: forward pass (input x with batch size B → linear z = Wx + b → activation a = σ(z) → loss L(ŷ, y)) and backward pass (∂L/∂ŷ → ∂L/∂a · σ'(z) → ∂L/∂W = δ · xᵀ → update W ← W − α∇L), with the chain rule highlighted as the key insight.]

Backpropagation Algorithm

Given a network with $L$ layers, the forward pass computes:

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$

The loss $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ is computed at the output. Backward pass:

  1. Output error: $\delta^{(L)} = \nabla_{\mathbf{a}} \mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})$
  2. Propagate: $\delta^{(l)} = \left((W^{(l+1)})^\top \delta^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$
  3. Gradients: $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^\top$, $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$

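Computational complexity: The forward pass costs $O(\sum_{l=1}^{L} n_l \cdot n_{l-1})$ per example, and the backward pass has the same asymptotic order. In practice the backward pass needs two matrix products per layer (one for the weight gradient, one to propagate $\delta$), so it costs roughly twice the forward pass and a full training step costs roughly $3 \times$ the forward pass.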
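
The three backward-pass steps can be sketched for a tiny network and checked against finite differences. This is a plain-Python illustration, assuming sigmoid activations throughout and a squared-error loss $\frac{1}{2}\|\mathbf{a}^{(L)} - \mathbf{y}\|^2$; the layer sizes, inputs, and function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Return the list of activations [a0, a1, ..., aL]."""
    activations = [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

def loss(x, y, params):
    out = forward(x, params)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backward(x, y, params):
    """Return [(dL/dW, dL/db), ...] per layer using the three steps above."""
    acts = forward(x, params)
    # Step 1: output error. For sigmoid, sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]
    grads = []
    for l in range(len(params) - 1, -1, -1):
        a_prev = acts[l]
        # Step 3: dL/dW = delta * a_prev^T, dL/db = delta.
        grads.append(([[d * a for a in a_prev] for d in delta], list(delta)))
        if l > 0:
            W = params[l][0]
            # Step 2: delta_prev = (W^T delta) * sigma'(z_prev).
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1 - a_prev[j]) for j in range(len(a_prev))]
    return grads[::-1]

rng = random.Random(0)
sizes = [2, 3, 1]                       # illustrative layer sizes
params = [([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
           [rng.uniform(-1, 1) for _ in range(n_out)])
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, y = [0.5, -0.3], [1.0]

grads = backward(x, y, params)

# Finite-difference check on every weight.
eps, max_gap = 1e-6, 0.0
for (W, _), (dW, _) in zip(params, grads):
    for i in range(len(W)):
        for j in range(len(W[i])):
            old = W[i][j]
            W[i][j] = old + eps
            up = loss(x, y, params)
            W[i][j] = old - eps
            down = loss(x, y, params)
            W[i][j] = old
            max_gap = max(max_gap, abs((up - down) / (2 * eps) - dW[i][j]))
print("largest |analytic - numeric| gradient gap:", max_gap)
```

The finite-difference loop runs two forward passes per weight, while backpropagation gets every gradient from a single backward sweep. That difference is the practical payoff of the chain-rule recursion.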


Gradient Descent Variants

  • Batch GD: uses the entire dataset per update. Stable, slow.
  • Mini-Batch GD: uses a batch of B samples per update. Noisy, fast, generalizes.
  • SGD (B = 1): uses a single sample per update. Very noisy, can escape local minima.

[Figure: convergence comparison, loss vs. epochs for Batch, Mini-Batch, and SGD.]
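
The three variants differ only in the batch size B, so a single training loop covers all of them. The sketch below fits a one-dimensional linear model in plain Python; the synthetic noise-free data, learning rate, and epoch count are assumptions chosen to keep the example small.

```python
import random

def train(xs, ys, batch_size, epochs=50, lr=0.1, seed=0):
    """Fit y ~ w*x + b with (mini-batch) gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient of 0.5 * (w*x + b - y)^2 over the batch.
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

def mean_loss(xs, ys, w, b):
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic, noise-free data (illustrative): y = 2x + 1 on a grid in [0, 1].
xs = [i / 31 for i in range(32)]
ys = [2.0 * x + 1.0 for x in xs]

initial = mean_loss(xs, ys, 0.0, 0.0)
results = {}
for name, B in (("batch", len(xs)), ("mini-batch", 8), ("sgd", 1)):
    w, b = train(xs, ys, batch_size=B)
    results[name] = mean_loss(xs, ys, w, b)
    print(f"{name:10s} B={B:2d} final loss={results[name]:.5f} (initial {initial:.3f})")
```

For a fixed number of epochs, smaller batches perform more parameter updates, which is part of why they often make faster progress per pass over the data. On noisy real data the single-sample updates also fluctuate much more than this noise-free toy suggests.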

SGD with Momentum

Vanilla SGD: $\theta_{t+1} = \theta_t - \alpha \nabla \mathcal{L}(\theta_t)$

Momentum (Polyak, 1964): Accelerates convergence by accumulating velocity:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla \mathcal{L}(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha \mathbf{v}_t$$

where $\beta \in [0.9, 0.99]$ controls the momentum. This dampens oscillations in narrow valleys of the loss landscape.
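
A narrow valley can be imitated with an ill-conditioned quadratic, $\mathcal{L}(x, y) = \frac{1}{2}(x^2 + 100 y^2)$, which is an assumption made here purely for illustration. The sketch below applies the two update equations above with $\beta = 0$ (vanilla gradient descent) and $\beta = 0.9$, using the same learning rate and step count for both.

```python
def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)

def grad(theta):
    x, y = theta
    return (x, 100.0 * y)

def run(beta, alpha=0.01, steps=100, start=(1.0, 1.0)):
    """Apply v_t = beta*v_{t-1} + grad, theta_{t+1} = theta_t - alpha*v_t.

    beta = 0 recovers vanilla gradient descent.
    """
    theta, v = start, (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        v = tuple(beta * v_i + g_i for v_i, g_i in zip(v, g))
        theta = tuple(t_i - alpha * v_i for t_i, v_i in zip(theta, v))
    return loss(theta)

vanilla_loss = run(beta=0.0)
momentum_loss = run(beta=0.9)
print("vanilla :", vanilla_loss)
print("momentum:", momentum_loss)
```

The learning rate is capped by the steep direction, so vanilla descent crawls along the shallow one. The accumulated velocity lets momentum move faster there, and it ends at a lower loss after the same number of steps.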


Weight Initialization

Why Initialization Matters

Poor initialization causes exploding/vanishing activations. For a network with $L$ layers:

$$\text{Var}(h^{(l)}) = \text{Var}(h^{(0)}) \prod_{k=1}^{l} \text{Var}(W^{(k)}) \cdot n_{k-1}$$

If $\text{Var}(W) \cdot n_{k-1} \neq 1$, variance grows or shrinks exponentially.

Xavier/Glorot (sigmoid/tanh): $W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$

He/Kaiming (ReLU): $W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$, which accounts for ReLU halving the variance.
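
The variance product can be observed directly by pushing a random vector through a stack of purely linear layers. This is a rough sketch in plain Python: it omits activations, so $\text{Var}(W) \cdot n = 1$ is the balanced case (the factor 2 in He initialization compensates for ReLU, which is not modeled here). The width, depth, and seed are arbitrary.

```python
import math
import random

def mean_square_after(depth, width, weight_std, rng):
    """Push one random vector through `depth` linear layers (no activation)
    with i.i.d. N(0, weight_std^2) weights; return the output's mean square."""
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        h = [sum(rng.gauss(0.0, weight_std) * v for v in h) for _ in range(width)]
    return sum(v * v for v in h) / width

rng = random.Random(0)
width, depth = 64, 10
base_std = 1.0 / math.sqrt(width)                 # gives Var(W) * n = 1

small_ms = mean_square_after(depth, width, 0.5 * base_std, rng)     # Var(W)*n = 0.25
balanced_ms = mean_square_after(depth, width, base_std, rng)        # Var(W)*n = 1
large_ms = mean_square_after(depth, width, 2.0 * base_std, rng)     # Var(W)*n = 4

print(f"Var(W)*n = 0.25: {small_ms:.3e}")
print(f"Var(W)*n = 1   : {balanced_ms:.3e}")
print(f"Var(W)*n = 4   : {large_ms:.3e}")
```

After only ten layers the three settings differ by many orders of magnitude, while the balanced one stays near the input scale.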


PyTorch Implementation

Example: Building a Neural Network

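import torch
import torch.nn as nn
import torch.optim as optim

class NeuralNet(nn.Module):
    """MLP for binary classification built from Linear -> BatchNorm -> ReLU -> Dropout blocks."""

    def __init__(self, input_dim=10, hidden_dims=(64, 32), output_dim=1):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        layers.append(nn.Sigmoid())  # outputs probabilities, as nn.BCELoss expects
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Placeholder data so the example runs end to end (an assumption, not part of
# the lesson); replace with a real dataset.
torch.manual_seed(0)
X_train = torch.randn(256, 10)                              # 256 samples, 10 features
y_train = (X_train.sum(dim=1, keepdim=True) > 0).float()    # shape (256, 1), values in {0, 1}

model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(100):
    model.train()                  # enable dropout and batch-norm batch statistics
    y_pred = model(X_train)        # full-batch forward pass
    loss = criterion(y_pred, y_train)

    optimizer.zero_grad()          # clear gradients from the previous step
    loss.backward()                # backpropagation
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()               # parameter update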

Key Takeaways

Summary: Neural Networks

  • Neural networks are universal function approximators (theorem guarantees existence, not efficiency)
  • ReLU is the default activation for hidden layers; GELU for Transformers
  • Backpropagation computes all gradients via the chain rule at a cost of the same order as the forward pass, $O(\sum_{l} n_l \cdot n_{l-1})$
  • Xavier/He initialization prevents vanishing/exploding activations
  • Learning rate is usually the most important hyperparameter to tune
  • SGD with momentum often generalizes better; Adam often converges faster
  • Overfitting is controlled by dropout, weight decay, data augmentation, early stopping
  • GPU acceleration is essential for training large networks

What to Learn Next

-> Convolutional Neural Networks Learn how CNNs process visual data with parameter sharing.

-> RNNs and LSTMs Explore networks designed for sequential data.

-> Training Deep Networks Master optimizers, batch norm, and regularization.

-> Transformers Learn the architecture that replaced RNNs.

-> Weight Initialization Understand Xavier, He, and modern initialization.

-> Optimizers for Deep Learning SGD, Adam, AdamW, and beyond.
