
Activation Functions — Adding Non-Linearity to Neural Networks

Without activation functions, deep networks collapse into linear models no matter how many layers are stacked. Activation functions introduce the non-linearity that enables learning of arbitrary decision boundaries.

ReLU is the Default: fast, mitigates vanishing gradients, and provides sparse activations for hidden layers
GELU for Transformers: smooth approximation to ReLU used in BERT, GPT, and Vision Transformers
Dead Neuron Problem: ReLU neurons that always output zero stop learning, so initialization and architecture choices matter

Activation Functions — Sigmoid, ReLU, GELU and The Dead Neuron Problem

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.

Why Activation Functions?

Definition: Non-Linearity

Without activation functions, a deep network (biases omitted for brevity) is just a composition of linear maps, which is itself a linear transformation:

f(\mathbf{x}) = \mathbf{W}_L \mathbf{W}_{L-1} \cdots \mathbf{W}_1 \mathbf{x} = \mathbf{W}'\mathbf{x}

No matter how many layers you stack, the result is equivalent to a single linear layer. Activation functions break this linearity, allowing networks to learn arbitrary decision boundaries.
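The collapse can be checked numerically. The following is a minimal sketch in plain Python; the matrices `W1` and `W2`, the input `x`, and the helper names are made-up illustrations, not values from any particular model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]

def relu_vec(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, t) for t in v]

# Hypothetical weights for a two-layer network without biases.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Layer by layer, with no activation in between.
stacked = matvec(W2, matvec(W1, x))

# The same map expressed as a single matrix W' = W2 W1.
W_prime = matmul(W2, W1)
collapsed = matvec(W_prime, x)

# With a ReLU between the layers, W' no longer reproduces the output.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, while `with_relu` differs because the ReLU zeroes the negative hidden unit before the second layer sees it.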

Sigmoid

Definition: Sigmoid Function

The sigmoid function maps any real number to $(0, 1)$:

\sigma(x) = \frac{1}{1 + e^{-x}}

Output range: $(0, 1)$
Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
Peak derivative: $0.25$ at $x = 0$

Sigmoid Function

\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x)(1 - \sigma(x))

Here,

$x$ = input (any real number)
$\sigma(x)$ = output in $(0, 1)$
$\sigma'(x)$ = derivative (maximum $0.25$ at $x = 0$)
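As a rough illustration, the sigmoid and its derivative can be written in a few lines of standard-library Python. The function names are illustrative, and the branch on the sign of `x` is an assumption added here to avoid overflow in `math.exp` for large negative inputs; it is not part of the definition above.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a scalar x, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (1 + e^x),
    # so the argument passed to exp is never a large positive number.
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for value in (-5.0, 0.0, 5.0):
    print(value, sigmoid(value), sigmoid_grad(value))
```

The derivative peaks at `x = 0` and shrinks quickly as `|x|` grows, which is the saturation behaviour behind the vanishing gradient problem.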

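Theorem: Vanishing Gradient in Sigmoid

The maximum derivative of the sigmoid function is $0.25$. By the chain rule, backpropagating through $L$ sigmoid layers multiplies the gradient by one activation derivative per layer, so this product is bounded by $(0.25)^L$ (ignoring the weight matrices), which decays exponentially. For $L = 10$, the bound is about $9.5 \times 10^{-7}$, less than $10^{-6}$, making training of the early layers extremely slow.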
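The bound is easy to tabulate. This short sketch assumes the best case for every layer (each derivative at its maximum of 0.25) and ignores the weight matrices; `sigmoid_gradient_bound` is a hypothetical helper name.

```python
def sigmoid_gradient_bound(num_layers):
    """Upper bound on the product of sigmoid derivatives across num_layers layers."""
    return 0.25 ** num_layers

for depth in (1, 5, 10, 20):
    print(depth, sigmoid_gradient_bound(depth))
```

For a depth of 10 the bound is `0.25 ** 10`, which equals `2 ** -20` and falls just below `1e-6`.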

Tanh

Definition: Tanh Function

The hyperbolic tangent maps inputs to $(-1, 1)$:

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1

Output range: $(-1, 1)$ (zero-centered)
Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
Peak derivative: $1.0$ at $x = 0$
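The identity $\tanh(x) = 2\sigma(2x) - 1$ can be verified numerically. The sketch below compares it against `math.tanh` from the standard library; `tanh_via_sigmoid` and `tanh_grad` are illustrative names introduced for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a scalar x, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(value, math.tanh(value), tanh_via_sigmoid(value), tanh_grad(value))
```

Because the peak derivative is 1.0 rather than 0.25, tanh shrinks gradients less than sigmoid near zero, although it still saturates for large `|x|`.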

Figure: Activation Function Comparison

ReLU and Variants

Definition: ReLU (Rectified Linear Unit)

\text{ReLU}(x) = \max(0, x)

Output range: $[0, \infty)$
Derivative: $\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
Advantages: Computationally efficient, mitigates vanishing gradients, sparse activations
Disadvantage: Dead neurons (output always 0)

Definition: Leaky ReLU

\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}

where $\alpha$ is a small constant (typically 0.01). It prevents dead neurons because negative inputs still produce a small non-zero output and a gradient of $\alpha$ instead of 0.

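Definition: ELU (Exponential Linear Unit)

\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}

where $\alpha > 0$ sets the value $-\alpha$ that the output saturates to for large negative inputs (a common choice is $\alpha = 1$). ELU combines the benefits of ReLU (no vanishing gradient for positive inputs) with mean activations closer to zero.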

GELU and Swish

Definition: GELU (Gaussian Error Linear Unit)

\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

where $\Phi(x)$ is the CDF of the standard Gaussian. GELU is a smooth approximation to ReLU used in BERT, GPT, and Vision Transformers.

Definition: Swish (SiLU)

\text{Swish}(x) = x \cdot \sigma(\beta x)

When $\beta \to \infty$, Swish approaches ReLU. When $\beta = 1$, it equals SiLU. Swish is self-gated and smooth everywhere.
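Both functions can be sketched with the standard library alone. The code below uses the exact erf form of GELU via `math.erf` (some implementations use a tanh-based approximation instead, which is not shown here); the function names and the sample inputs are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a scalar x, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, gelu(value), swish(value), swish(value, beta=50.0))
```

With a large `beta` such as 50, the Swish column is nearly indistinguishable from ReLU, which illustrates the limiting behaviour described above.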

Softmax

Definition: Softmax Function

The softmax function converts logits to probabilities:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}

Properties:

Output sums to 1: $\sum_i \text{softmax}(z_i) = 1$
Preserves order: larger logits get higher probabilities
Differentiable everywhere
Temperature parameter $\tau$: $\text{softmax}(z_i/\tau)$ controls sharpness ($\tau < 1$ sharpens the distribution, $\tau > 1$ flattens it toward uniform)

Softmax with Temperature

\text{softmax}(z_i / \tau) = \frac{e^{z_i / \tau}}{\sum_{j=1}^{C} e^{z_j / \tau}}

Numerical Stability

In practice, subtract the max logit before softmax to prevent overflow:

\text{softmax}(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}

This is mathematically equivalent but numerically stable.
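A minimal sketch of the max-subtraction trick combined with temperature scaling follows. The function name, the `temperature` argument, and the example logits are assumptions made for illustration; library implementations operate on tensors and along a chosen axis.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, stabilised by subtracting the max."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    # Every exponent is <= 0 after the shift, so exp cannot overflow.
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]   # naive exp(1000.0) would overflow
print(softmax(logits))
print(softmax(logits, temperature=0.5))  # sharper
print(softmax(logits, temperature=5.0))  # flatter
```

Shifting all logits by the same constant cancels in the ratio, so `[1000, 1001, 1002]` yields the same probabilities as `[0, 1, 2]`.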

When to Use Each Activation

Definition: Activation Function Properties

Function	Range	Zero-Centered	Dead Neurons	Computation
Sigmoid	(0, 1)	No	No	Expensive
Tanh	(-1, 1)	Yes	No	Expensive
ReLU	[0, ∞)	No	Yes	Very cheap
Leaky ReLU	(-∞, ∞)	No	No	Cheap
GELU	≈ [-0.17, ∞)	Approx	No	Moderate
Swish	≈ [-0.28, ∞)	Approx	No	Moderate
Softmax	(0, 1) sum=1	N/A	No	Moderate

The Dead Neuron Problem

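Definition: Dead Neurons

A dead neuron is a ReLU neuron that always outputs 0 because its weights and bias are such that $W\mathbf{x} + b \leq 0$ for all training inputs. Since the ReLU derivative is 0 in that region, a dead neuron receives zero gradient, its parameters stop updating, and it typically never recovers.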

Causes:

Learning rate too high (a single large update can push the pre-activation negative for every input)
Weight initialization too negative
Biases initialized too negatively

Solutions:

Use Leaky ReLU or ELU
Careful weight initialization (He initialization)
Lower learning rate
Batch normalization
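As one example of the initialization remedy listed above, He initialization draws weights from a zero-mean Gaussian with variance $2 / \text{fan\_in}$. The sketch below uses only the standard library; `he_normal`, the layer sizes, and the seed are hypothetical choices for illustration, and frameworks ship their own initializers.

```python
import math
import random

def he_normal(fan_in, fan_out, seed=None):
    """Sample a fan_out x fan_in weight matrix with std sqrt(2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = he_normal(fan_in=256, fan_out=64, seed=0)

# Check that the empirical standard deviation is close to the target.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(sample_std, math.sqrt(2.0 / 256))
```

Scaling the variance by `2 / fan_in` is intended to keep the variance of ReLU pre-activations roughly constant from layer to layer, so fewer neurons start out in the always-negative regime.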

Dead Neuron Statistics

In large ReLU networks, roughly 1-10% of neurons may be dead after training. While some sparsity is beneficial, too many dead neurons indicate training problems. Monitor the fraction of zero activations during training.
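One way to monitor this is to record post-ReLU activations for a batch and count the neurons that are zero for every example. The sketch below is a simplified illustration with made-up numbers; `dead_fraction` is a hypothetical helper, and a batch is only a sample, so a neuron flagged here may still fire on other inputs.

```python
def dead_fraction(activations):
    """Fraction of neurons whose output is zero for every example.

    activations: list of rows, one per input example, each row holding
    the post-ReLU output of every neuron in the layer.
    """
    num_neurons = len(activations[0])
    dead = sum(
        1 for j in range(num_neurons)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / num_neurons

# Made-up activations: 3 examples, 4 neurons.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.1],
    [0.0, 0.7, 0.5, 0.0],
]
print(dead_fraction(batch))
```

In this toy batch only the first neuron is zero on all three examples. The third neuron is zero on two of them but fires once, so it counts as sparse rather than dead.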

Summary

Activation functions introduce non-linearity, enabling deep networks to learn complex patterns
ReLU is the default for hidden layers: fast, mitigates vanishing gradients, but can cause dead neurons
GELU/Swish are preferred for transformers and deep networks: smooth, self-gated
Softmax for multi-class output, sigmoid for binary output, no activation (identity) for regression
Dead neurons are a practical concern with ReLU; use Leaky ReLU or careful initialization to mitigate

