
Activation Functions — Sigmoid, ReLU, GELU and The Dead Neuron Problem


Activation Functions — Adding Non-Linearity to Neural Networks

Without activation functions, deep networks collapse into linear models no matter how many layers are stacked. Activation functions introduce the non-linearity that enables learning of arbitrary decision boundaries.

  • ReLU is the Default: Fast, mitigates vanishing gradients, provides sparse activations for hidden layers
  • GELU for Transformers: Smooth approximation to ReLU used in BERT, GPT, and Vision Transformers
  • Dead Neuron Problem: ReLU neurons that always output zero, avoided through careful initialization and architecture choices


Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.


Why Activation Functions?

Definition: Non-Linearity

Without activation functions, a deep network is just a linear transformation:

$$f(\mathbf{x}) = \mathbf{W}_L \mathbf{W}_{L-1} \cdots \mathbf{W}_1 \mathbf{x} = \mathbf{W}'\mathbf{x}$$

No matter how many layers you stack, the result is equivalent to a single linear layer. Activation functions break this linearity, allowing networks to learn arbitrary decision boundaries.
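The collapse can be checked numerically. The sketch below uses plain Python lists and three small, made-up weight matrices (`W1`, `W2`, `W3` and the input `x` are illustrative values, not taken from the lesson) to show that a layer-by-layer pass without activations gives the same output as a single precomputed matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


# Hypothetical weights for a three-layer network with no activations.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 units -> 2 units
W3 = [[2.0, -1.0]]                           # 2 units -> 1 output
x = [0.5, -2.0]

# Forward pass one layer at a time.
layered = matvec(W3, matvec(W2, matvec(W1, x)))

# Collapse the three layers into a single matrix W' = W3 W2 W1.
W_prime = matmul(W3, matmul(W2, W1))
collapsed = matvec(W_prime, x)

print(layered, collapsed)
```

Both paths produce the same output, so the extra layers add parameters but no expressive power.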

[Figure: Without Activation Functions: Linear Collapse. An input passes through three linear layers, W₁x, W₂(W₁x), W₃(W₂W₁x), and the result W₃ · W₂ · W₁ · x = W' · x is equivalent to a single linear layer.]

Sigmoid

Definition: Sigmoid Function

The sigmoid function maps any real number to $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

  • Output range: $(0, 1)$
  • Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  • Peak derivative: $0.25$ at $x = 0$

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Here,

  • $x$ = Input (any real number)
  • $\sigma(x)$ = Output in $(0, 1)$
  • $\sigma'(x)$ = Derivative (max $0.25$ at $x = 0$)
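A minimal sketch of the sigmoid and its derivative in plain Python follows. The branch on the sign of `x` is an implementation choice added here (not part of the formula above) so that `math.exp` is never called with a large positive argument.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, arranged so exp() only sees non-positive arguments."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The derivative is largest at `x = 0` and shrinks quickly as `|x|` grows, which is the saturation behaviour behind the vanishing gradient problem.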

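Theorem: Vanishing Gradient in Sigmoid

The maximum derivative of the sigmoid function is $0.25$. Backpropagating through $L$ sigmoid layers multiplies the gradient by one such derivative per layer, so the contribution of the activations alone is bounded by $(0.25)^L$, which decays exponentially in $L$ (the weight matrices also scale the gradient and are set aside here). For $L = 10$, $(0.25)^{10} \approx 9.5 \times 10^{-7} < 10^{-6}$, making training of the early layers extremely slow.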
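The bound is easy to tabulate. The helper name `sigmoid_grad_bound` below is hypothetical, and the calculation only covers the activation derivatives, not the weight terms.

```python
def sigmoid_grad_bound(num_layers):
    """Upper bound on the product of num_layers sigmoid derivatives."""
    return 0.25 ** num_layers


for depth in (1, 5, 10, 20):
    print(f"L = {depth:2d}: bound = {sigmoid_grad_bound(depth):.3e}")
```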


Tanh

Definition: Tanh Function

The hyperbolic tangent maps inputs to $(-1, 1)$:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$

  • Output range: $(-1, 1)$ (zero-centered)
  • Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
  • Peak derivative: $1.0$ at $x = 0$
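The relationship between tanh and sigmoid can be verified numerically. This is a small check in plain Python; `tanh_via_sigmoid` and `tanh_grad` are illustrative helper names.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, arranged so exp() only sees non-positive arguments."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_via_sigmoid(x):
    """tanh(x) rewritten as a rescaled, recentred sigmoid: 2*sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```

Because the peak derivative is 1.0 rather than 0.25, tanh attenuates gradients less than sigmoid near zero, although it still saturates for large `|x|`.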

Activation Function Comparison

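[Figure: Activation Functions: Shape and Derivative Comparison. Left panel, "Function Shape": Sigmoid, Tanh, and ReLU plotted as y against x. Right panel, "Derivative Shape": σ' (max 0.25), tanh' (max 1.0), and ReLU' (0 or 1) plotted as y' against x.]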

ReLU and Variants

Definition: ReLU (Rectified Linear Unit)

$$\text{ReLU}(x) = \max(0, x)$$

  • Output range: $[0, \infty)$
  • Derivative: $\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
  • Advantages: Computationally efficient, mitigates vanishing gradients, sparse activations
  • Disadvantage: Dead neurons (output always 0)

Definition: Leaky ReLU

$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$

where $\alpha$ is a small constant (typically 0.01). The nonzero slope for negative inputs keeps a small gradient flowing, which prevents dead neurons.

Definition: ELU (Exponential Linear Unit)

$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

ELU combines benefits of ReLU (no vanishing gradient for positive inputs) with mean activations closer to zero.
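The three piecewise definitions translate almost directly into code. The sketch below is scalar, plain Python; the default `alpha=1.0` for ELU is an assumption made here (the lesson only gives a typical value for Leaky ReLU).

```python
import math


def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Identity for positive inputs, alpha * (e^x - 1) otherwise.

    alpha=1.0 is an assumed default; expm1 computes e^x - 1 accurately
    for x close to zero.
    """
    return x if x > 0 else alpha * math.expm1(x)


for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), elu(x))
```

For negative inputs ReLU returns exactly 0, Leaky ReLU returns a small negative value proportional to `x`, and ELU saturates smoothly toward `-alpha`.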


GELU and Swish

Definition: GELU (Gaussian Error Linear Unit)

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi(x)$ is the CDF of the standard Gaussian. GELU is a smooth approximation to ReLU used in BERT, GPT, and Vision Transformers.

Definition: Swish (SiLU)

$$\text{Swish}(x) = x \cdot \sigma(\beta x)$$

When $\beta \to \infty$, Swish approaches ReLU. When $\beta = 1$, it equals SiLU. Swish is self-gated and smooth everywhere.
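Both functions can be written with the standard library alone. The sketch below uses the exact erf form of GELU from the definition above (frameworks often also offer a tanh-based approximation, which is not shown), and a sigmoid arranged to avoid overflow.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, arranged so exp() only sees non-positive arguments."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gelu(x):
    """Exact GELU: x times the standard Gaussian CDF at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)


for x in (-2.0, -0.75, 0.0, 1.0, 3.0):
    print(x, gelu(x), swish(x), swish(x, beta=50.0))
```

With a large `beta` (50 in the last column) Swish is numerically indistinguishable from ReLU at these inputs, while for `beta = 1` both GELU and Swish dip slightly below zero for moderately negative inputs.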

[Figure: Modern Activation Functions: GELU and Swish. ReLU, GELU, and Swish plotted as y against x. GELU is a smooth ReLU approximation, used in BERT, GPT, and ViT.]

Softmax

Definition: Softmax Function

The softmax function converts logits to probabilities:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

Properties:

  • Output sums to 1: $\sum_i \text{softmax}(z_i) = 1$
  • Preserves order: larger logits get higher probabilities
  • Differentiable everywhere
  • Temperature parameter $\tau$: $\text{softmax}(z_i/\tau)$ controls sharpness; $\tau < 1$ sharpens the distribution and $\tau > 1$ flattens it

Softmax with Temperature

$$\text{softmax}(z_i / \tau) = \frac{e^{z_i / \tau}}{\sum_{j=1}^{C} e^{z_j / \tau}}$$

Numerical Stability

In practice, subtract the max logit before softmax to prevent overflow:

$$\text{softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$$

This is mathematically equivalent but numerically stable.
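A minimal sketch of the max-subtraction trick, combined with the temperature parameter, might look like the following. The function signature is an assumption made for illustration; it works on a plain list of logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits.

    Dividing by `temperature` rescales the logits; subtracting the
    maximum keeps every exponent <= 0, so exp() cannot overflow.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([1.0, 2.0, 3.0])
# A naive exp(1002.0) would raise OverflowError; the shifted version is fine.
large = softmax([1000.0, 1001.0, 1002.0])
sharp = softmax([1.0, 2.0, 3.0], temperature=0.5)
flat = softmax([1.0, 2.0, 3.0], temperature=5.0)

print(small, large, sharp, flat)
```

Shifting all logits by the same constant leaves the probabilities unchanged, which is why `small` and `large` agree, and the largest logit receives more probability mass as the temperature falls.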


When to Use Each Activation

Activation Function Selection Guide

  • Hidden Layers: ReLU (default), GELU (transformers), Swish (deep nets), Leaky ReLU (avoid dead neurons)
  • Output Layer: Softmax (multi-class), Sigmoid (binary), None (regression), Multi-hot (multi-label)
  • Special Cases: Mish (self-regularizing), Hardswish (mobile), PReLU (learned $\alpha$), CELU (smooth ELU)

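Definition: Activation Function Properties

| Function | Range | Zero-Centered | Dead Neurons | Computation |
|---|---|---|---|---|
| Sigmoid | (0, 1) | No | No | Expensive |
| Tanh | (-1, 1) | Yes | No | Expensive |
| ReLU | [0, ∞) | No | Yes | Very cheap |
| Leaky ReLU | (-∞, ∞) | No | No | Cheap |
| GELU | (-0.17, ∞) | Approx | No | Moderate |
| Swish | (-0.28, ∞) | Approx | No | Moderate |
| Softmax | (0, 1), sum = 1 | N/A | No | Moderate |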

The Dead Neuron Problem

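Definition: Dead Neurons

A dead neuron is a ReLU neuron that always outputs 0 because its weights and bias give a pre-activation $W\mathbf{x} + b < 0$ for all training inputs. Because the ReLU derivative is 0 in that region, a dead neuron receives zero gradient, so its weights are no longer updated and it typically never recovers.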

Causes:

  • Learning rate too high (weights become too negative)
  • Weight initialization too negative
  • Biases initialized too negatively

Solutions:

  • Use Leaky ReLU or ELU
  • Careful weight initialization (He initialization)
  • Lower learning rate
  • Batch normalization
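As one illustration of the initialization advice, He (Kaiming) normal initialization draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in). The sketch below is plain Python; the function name `he_normal`, the layer sizes, and the choice of zero biases are assumptions made for the example.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
W = he_normal(fan_in=8, fan_out=4, rng=rng)
b = [0.0] * 4               # assumed: zero biases, so no unit starts deep in the negative regime

print(len(W), len(W[0]), math.sqrt(2.0 / 8))
```

Scaling the variance by 2 / fan_in is intended to keep the magnitude of ReLU activations roughly stable from layer to layer, which reduces the chance that a unit starts out inactive for every input.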

Dead Neuron Statistics

In large networks with ReLU, typically 1-10% of neurons may be dead after training. While some sparsity is beneficial, too many dead neurons indicate training problems. Monitor the fraction of zero activations during training.
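Monitoring can be as simple as counting units whose output is zero for every input in a batch. The sketch below uses a tiny hand-written layer in plain Python; the weights, biases, and inputs are made-up values chosen so that exactly one of four units is dead, and `dead_fraction` is a hypothetical helper name.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def dead_fraction(weights, biases, inputs):
    """Fraction of ReLU units that output 0 for every input in `inputs`."""
    dead = 0
    for w, b in zip(weights, biases):
        outputs = (relu(sum(wi * xi for wi, xi in zip(w, x)) + b)
                   for x in inputs)
        if all(out == 0.0 for out in outputs):
            dead += 1
    return dead / len(weights)


# Hypothetical layer: 2 inputs -> 4 ReLU units.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0, -10.0]   # the last unit has a very negative bias
inputs = [[1.0, 1.0], [-1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]]

fraction = dead_fraction(weights, biases, inputs)
print(f"dead units: {fraction:.0%}")
```

In a real training loop the same count would be taken over the activations of each layer on a held-out batch; a unit is only dead in the sense defined above if it stays at zero across the whole training set, so a single batch gives an estimate rather than a proof.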


Summary

  • Activation functions introduce non-linearity, enabling deep networks to learn complex patterns
  • ReLU is the default for hidden layers: fast, mitigates vanishing gradients, but can cause dead neurons
  • GELU/Swish are preferred for transformers and deep networks: smooth, self-gated
  • Softmax for multi-class output, Sigmoid for binary output, None for regression
  • Dead neurons are a practical concern with ReLU; use Leaky ReLU or careful initialization to mitigate

