Activation Functions — Adding Non-Linearity to Neural Networks
Without activation functions, deep networks collapse into linear models no matter how many layers are stacked. Activation functions introduce the non-linearity that enables learning of arbitrary decision boundaries.
- ReLU is the Default: Fast, mitigates vanishing gradients, and provides sparse activations for hidden layers
- GELU for Transformers: Smooth approximation to ReLU used in BERT, GPT, and Vision Transformers
- Dead Neuron Problem: ReLU neurons that always output zero; avoiding them requires careful initialization and architecture choices
Activation Functions — Sigmoid, ReLU, GELU and The Dead Neuron Problem
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.
Why Activation Functions?
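Definition: Non-Linearity
Without activation functions, a deep network is just a composition of linear maps, which is itself a linear transformation (biases omitted for brevity):
$$y = W_L W_{L-1} \cdots W_1 x = W x$$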
No matter how many layers you stack, the result is equivalent to a single linear layer. Activation functions break this linearity, allowing networks to learn arbitrary decision boundaries.
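The collapse described above can be checked numerically. The following is a minimal sketch in plain Python, assuming two bias-free layers with random weights; the helper names `matvec` and `matmul` and the layer sizes are illustrative choices, not part of the original text.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

random.seed(0)
# Hypothetical layer shapes: 3 inputs -> 4 hidden units -> 2 outputs.
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
x = [0.5, -1.0, 2.0]

two_layers = matvec(W2, matvec(W1, x))   # apply the layers one after another
one_layer = matvec(matmul(W2, W1), x)    # apply the single merged matrix W2 W1

# The two results agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(two_layers, one_layer))
```

Inserting any non-linear function between the two `matvec` calls would break this equality, which is the role the activation function plays.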
Sigmoid
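Definition: Sigmoid Function
The sigmoid function maps any real number to $(0, 1)$:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- Output range: $(0, 1)$
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Peak derivative: $0.25$ at $x = 0$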
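Sigmoid Function
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
Here,
- $x$ = Input (any real number)
- $\sigma(x)$ = Output in (0, 1)
- $\sigma'(x)$ = Derivative (max 0.25 at x=0)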
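As an illustration of the sigmoid and its derivative, here is a small sketch using only the standard library. The branch on the sign of `x` is an implementation choice assumed here to avoid overflow in `math.exp` for large negative inputs; it is not something the text above prescribes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # safe: x < 0, so exp(x) is in (0, 1)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)         # the derivative is largest at x = 0
saturated = sigmoid_grad(10.0)   # far from 0 the derivative is nearly 0
```

The saturated value shows why large-magnitude inputs pass almost no gradient back through a sigmoid unit.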
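Theorem: Vanishing Gradient in Sigmoid
The maximum derivative of the sigmoid function is $0.25$. In a network with $L$ layers, the contribution of the activation derivatives to the gradient through $L$ sigmoid layers is bounded by $0.25^L$, which decays exponentially. For $L = 10$, for example, $0.25^{10} \approx 9.5 \times 10^{-7}$, so this factor is less than $10^{-6}$, making training extremely slow.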
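The exponential decay of this bound is easy to tabulate. The sketch below only evaluates the worst-case factor 0.25 raised to the number of layers; it ignores the weight matrices, and the function name and the chosen depths are illustrative assumptions.

```python
def sigmoid_gradient_bound(num_layers):
    """Upper bound on the product of sigmoid derivatives across num_layers layers."""
    return 0.25 ** num_layers

# Bound for a few hypothetical network depths.
bounds = {n: sigmoid_gradient_bound(n) for n in (1, 5, 10, 20)}

for depth, bound in bounds.items():
    print(f"{depth:2d} layers: bound = {bound:.3e}")
```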
Tanh
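Definition: Tanh Function
The hyperbolic tangent maps inputs to $(-1, 1)$:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
- Output range: $(-1, 1)$ (zero-centered)
- Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
- Peak derivative: $1$ at $x = 0$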
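A short sketch of the tanh derivative, using the standard library's `math.tanh`. The function name `tanh_grad` and the sample inputs are illustrative.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

peak = tanh_grad(0.0)        # largest derivative, at x = 0
saturated = tanh_grad(5.0)   # tanh also saturates for large |x|

# Zero-centered: tanh is an odd function, so outputs are symmetric around 0.
symmetric = math.tanh(1.5) + math.tanh(-1.5)
```

The peak derivative is four times larger than the sigmoid's, but tanh still saturates, so it reduces rather than removes the vanishing gradient issue.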
Figure: Activation Function Comparison
ReLU and Variants
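Definition: ReLU (Rectified Linear Unit)
$$\mathrm{ReLU}(x) = \max(0, x)$$
- Output range: $[0, \infty)$
- Derivative: $1$ for $x > 0$, $0$ for $x < 0$ (undefined at $x = 0$; implementations typically use $0$ there)
- Advantages: Computationally efficient, mitigates vanishing gradients, sparse activations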
- Disadvantage: Dead neurons (output always 0)
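The properties listed above can be seen in a few lines of code. This is a minimal sketch with made-up pre-activation values; the convention of a zero derivative at `x == 0` is an assumption that matches common practice.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU, using the common convention of 0 at x == 0."""
    return 1.0 if x > 0 else 0.0

pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.7]   # illustrative values
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Sparse activations: fraction of outputs that are exactly zero.
sparsity = sum(1 for y in outputs if y == 0.0) / len(outputs)
```

For positive inputs the gradient is exactly 1, so it does not shrink from layer to layer; for non-positive inputs it is 0, which is the source of dead neurons.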
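Definition: Leaky ReLU
$$\mathrm{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$$
where $\alpha$ is a small constant (typically 0.01). Prevents dead neurons by allowing small negative outputs, so the gradient for negative inputs is $\alpha$ rather than zero.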
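Definition: ELU (Exponential Linear Unit)
$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases}$$
ELU combines the benefits of ReLU (no vanishing gradient for positive inputs) with mean activations closer to zero.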
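The two ReLU variants can be sketched side by side. The default `alpha=0.01` for Leaky ReLU follows the text; the default `alpha=1.0` for ELU is an assumed, commonly used value. `math.expm1(x)` computes `exp(x) - 1` accurately for small `x`.

```python
import math

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is alpha (not 0) for non-positive inputs, so the unit can recover."""
    return 1.0 if x > 0 else alpha

def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

negative_input = -2.0
leaky_out = leaky_relu(negative_input)         # small negative output
leaky_grad = leaky_relu_grad(negative_input)   # non-zero gradient
elu_out = elu(negative_input)                  # smooth, bounded below by -alpha
```

Unlike Leaky ReLU, ELU saturates toward `-alpha` for very negative inputs instead of decreasing without bound.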
GELU and Swish
Definition: GELU (Gaussian Error Linear Unit)
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
where $\Phi$ is the CDF of the standard Gaussian. GELU is a smooth approximation to ReLU used in BERT, GPT, and Vision Transformers.
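A minimal sketch of GELU in its exact form, assuming the standard identity that the Gaussian CDF can be written with the error function, Φ(x) = 0.5 (1 + erf(x / √2)). Deep learning libraries often use a faster approximation instead; that is not shown here.

```python
import math

def standard_normal_cdf(x):
    """CDF of the standard Gaussian, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return x * standard_normal_cdf(x)

large_positive = gelu(10.0)    # behaves like ReLU: close to x
large_negative = gelu(-10.0)   # behaves like ReLU: close to 0
near_minimum = gelu(-0.75)     # GELU dips slightly below zero near here
```

The small negative dip near `x = -0.75` is what separates GELU from ReLU and gives it a non-zero gradient for moderately negative inputs.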
Definition: Swish (SiLU)
$$\mathrm{Swish}(x) = x\,\sigma(\beta x)$$
When $\beta \to \infty$, Swish approaches ReLU. When $\beta = 1$, it equals SiLU. Swish is self-gated and smooth everywhere.
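The effect of the scale parameter can be sketched as follows. The parameter name `beta` and the value 50 used to stand in for a "large" scale are illustrative assumptions; the sigmoid is written in an overflow-safe form.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x, beta=1.0):
    """Swish: the input gated by its own sigmoid. beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

silu_value = swish(1.0)                  # SiLU at x = 1
relu_like_pos = swish(2.0, beta=50.0)    # large beta: close to ReLU(2) = 2
relu_like_neg = swish(-2.0, beta=50.0)   # large beta: close to ReLU(-2) = 0
```

"Self-gated" refers to the fact that the same input `x` both carries the signal and, through the sigmoid, decides how much of it passes.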
Softmax
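Definition: Softmax Function
The softmax function converts logits to probabilities:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
Properties:
- Output sums to 1: $\sum_i \mathrm{softmax}(z)_i = 1$
- Preserves order: larger logits get higher probabilities
- Differentiable everywhere
- Temperature parameter $T$: replacing $z_i$ with $z_i / T$ controls sharpness (smaller $T$ gives a sharper distribution)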
Numerical Stability
In practice, subtract the max logit before softmax to prevent overflow:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i - m}}{\sum_{j} e^{z_j - m}}, \qquad m = \max_k z_k$$
This is mathematically equivalent but numerically stable.
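The max-subtraction trick, together with the temperature parameter from the properties list, can be sketched in plain Python. The function signature and the example logits are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A naive exp(1000.0) would overflow; the shifted version does not.
large_logits = [1000.0, 1001.0, 1002.0]
probs = softmax(large_logits)

# Shifting all logits by a constant leaves the result unchanged.
shifted_probs = softmax([0.0, 1.0, 2.0])

# A higher temperature flattens the distribution.
flat_probs = softmax(large_logits, temperature=5.0)
```

The largest exponent after the shift is always 0, so every term lies in (0, 1] and the denominator is at least 1.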
When to Use Each Activation
Definition: Activation Function Properties
| Function | Range | Zero-Centered | Dead Neurons | Computation |
|---|---|---|---|---|
| Sigmoid | (0, 1) | No | No | Expensive |
| Tanh | (-1, 1) | Yes | No | Expensive |
| ReLU | [0, ∞) | No | Yes | Very cheap |
| Leaky ReLU | (-∞, ∞) | No | No | Cheap |
| GELU | (-0.17, ∞) | Approx | No | Moderate |
| Swish | (-0.28, ∞) | Approx | No | Moderate |
| Softmax | (0, 1) sum=1 | N/A | No | Moderate |
The Dead Neuron Problem
Definition: Dead Neurons
A dead neuron is a ReLU neuron that always outputs 0 because its weights and bias are such that $w^\top x + b \le 0$ for all training inputs. Dead neurons receive zero gradient and never recover.
Causes:
- Learning rate too high (weights become too negative)
- Weight initialization too negative
- Biases initialized too negatively
Solutions:
- Use Leaky ReLU or ELU
- Careful weight initialization (He initialization)
- Lower learning rate
- Batch normalization
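Of the solutions above, He initialization is the most direct to sketch. The version below assumes the normal-distribution variant, which draws weights with mean 0 and variance 2 / fan_in; the function name `he_normal` and the layer sizes are hypothetical.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)                       # fixed seed for reproducibility
W = he_normal(fan_in=200, fan_out=100, rng=rng)

# Empirical check: the sample variance should be close to 2 / 200 = 0.01.
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
```

Because the weights are centered at zero, roughly half of the pre-activations start out positive, which makes it less likely that a unit begins training already dead.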
Dead Neuron Statistics
In large networks with ReLU, typically 1-10% of neurons may be dead after training. While some sparsity is beneficial, too many dead neurons indicate training problems. Monitor the fraction of zero activations during training.
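Monitoring the fraction of dead units can be as simple as checking which neurons output zero for every example in a batch. This is a minimal sketch over a hand-written batch of post-ReLU outputs; the function name and the data are made up for illustration, and a real check would use a large sample of training inputs.

```python
def dead_neuron_fraction(activations):
    """Fraction of neurons that output exactly 0 for every example.

    activations: list of rows (one per example), each a list of post-ReLU
    outputs (one per neuron).
    """
    num_neurons = len(activations[0])
    dead = sum(
        1 for j in range(num_neurons)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / num_neurons

# Illustrative batch: 3 examples, 4 neurons. Only neuron 0 is zero everywhere.
batch = [
    [0.0, 1.2, 0.0, 0.4],
    [0.0, 0.0, 0.3, 0.9],
    [0.0, 0.7, 0.0, 0.0],
]
fraction = dead_neuron_fraction(batch)
```

A neuron that is zero on one batch is not necessarily dead, so the estimate becomes more reliable as more examples are included.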
Summary
- Activation functions introduce non-linearity, enabling deep networks to learn complex patterns
- ReLU is the default for hidden layers: fast, mitigates vanishing gradients, but can cause dead neurons
- GELU/Swish are preferred for transformers and deep networks: smooth, self-gated
- Softmax for multi-class output, sigmoid for binary output, no activation (identity) for regression
- Dead neurons are a practical concern with ReLU; use Leaky ReLU or careful initialization to mitigate