Transformer Architecture
The transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), reshaped NLP by dropping recurrence and convolution entirely and modeling dependencies between positions with attention alone.
Encoder Layer
Each encoder layer contains two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.
Feed-Forward Network
Position-wise Feed-Forward Network:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
The inner dimension d_ff is typically 4× the model dimension d_model (e.g., 2048 for d_model=512).
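As a rough check on these sizes, the parameter count of one feed-forward sub-layer can be worked out directly. The sketch below assumes two linear layers with biases; `ffn_param_count` is an illustrative helper name, not part of any library.

```python
def ffn_param_count(d_model: int, d_ff: int) -> int:
    """Parameters in Linear(d_model, d_ff) followed by Linear(d_ff, d_model), biases included."""
    first = d_model * d_ff + d_ff      # W1 and b1
    second = d_ff * d_model + d_model  # W2 and b2
    return first + second


d_model = 512
d_ff = 4 * d_model  # the usual 4x expansion
print(d_ff)                            # 2048
print(ffn_param_count(d_model, d_ff))  # 2099712
```

At these sizes the feed-forward sub-layer holds about 2.1M parameters per layer, most of them in the two weight matrices.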
Positional Encoding
Since transformers have no recurrence, positional encodings inject sequence order information.
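Sinusoidal Positional Encoding (even dimensions):
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Sinusoidal Positional Encoding (odd dimensions):
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$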
```python
import math

import torch


def sinusoidal_encoding(max_len, d_model):
    """Return sinusoidal positional encodings of shape (1, max_len, d_model).

    Assumes d_model is even.
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for each even index 2i, computed in log space
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe.unsqueeze(0)  # (1, max_len, d_model)


# Generate encodings
pe = sinusoidal_encoding(5000, 512)
print(f"Positional encoding shape: {pe.shape}")  # torch.Size([1, 5000, 512])
```
Sinusoidal encodings have the property that, for any fixed offset k, PE(pos+k) can be written as a linear function of PE(pos). The original paper hypothesized that this makes it easier for the model to learn to attend by relative position.
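The linear relationship can be checked numerically. For a single frequency, the (sin, cos) pair at position pos+k is a 2×2 rotation of the pair at position pos, and the rotation depends only on the offset k. The following is a minimal sketch using only the standard library; `pe_pair` and `shift` are illustrative names, and the chosen values of `i`, `pos`, and `k` are arbitrary.

```python
import math


def pe_pair(pos: float, freq: float) -> tuple[float, float]:
    """One (sin, cos) pair of the sinusoidal encoding at a given frequency."""
    return math.sin(pos * freq), math.cos(pos * freq)


def shift(pair: tuple[float, float], k: float, freq: float) -> tuple[float, float]:
    """Apply the 2x2 rotation that maps PE(pos) to PE(pos + k); it depends on k, not on pos."""
    s, c = pair
    sin_k, cos_k = math.sin(k * freq), math.cos(k * freq)
    return s * cos_k + c * sin_k, c * cos_k - s * sin_k


d_model, i = 512, 3  # arbitrary dimension pair (2i, 2i + 1)
freq = 1.0 / (10000 ** (2 * i / d_model))
pos, k = 17, 5

direct = pe_pair(pos + k, freq)
rotated = shift(pe_pair(pos, freq), k, freq)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direct, rotated)))  # True
```

The rotation follows from the angle-addition identities for sine and cosine, so the same matrix works for every starting position.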
Decoder Layer
The decoder has three sub-layers:
- Masked multi-head self-attention: prevents attending to future tokens
- Multi-head cross-attention: attends to encoder output
- Position-wise feed-forward network
```python
import torch.nn as nn

# MultiHeadAttention is assumed to be defined earlier, taking
# (query, key, value, mask) and returning (output, attention_weights).


class TransformerBlock(nn.Module):
    """Decoder layer: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        # One LayerNorm per sub-layer, applied after the residual addition
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Self-attention with causal mask
        attn1, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn1))
        # Cross-attention: queries from the decoder, keys and values from the encoder output
        attn2, _ = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn2))
        # Feed-forward
        ff_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ff_out))
        return x
```
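The causal mask used by the masked self-attention sub-layer can be illustrated without any framework. The sketch below assumes a "True means visible" convention; the `MultiHeadAttention` module used in this section may expect a different mask encoding, so treat `causal_mask` and `masked_softmax` as illustrative helpers only.

```python
import math


def causal_mask(size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over one row of attention scores, giving zero weight to masked positions."""
    top = max(s for s, ok in zip(scores, allowed) if ok)  # subtract the max for stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print("".join("x" if ok else "." for ok in row))

# Uniform scores: each position spreads its attention over itself and earlier tokens only.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
print(weights[1])  # [0.5, 0.5, 0.0, 0.0]
```

The printed pattern is lower-triangular: position 0 sees only itself, while the last position sees the whole prefix.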
Full Transformer Model
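```python
# Relies on math, nn, sinusoidal_encoding, TransformerBlock, and MultiHeadAttention defined earlier.


class EncoderLayer(nn.Module):
    """Encoder layer with the two sub-layers described earlier: self-attention, then feed-forward."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        attn, _ = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn))
        return self.norm2(x + self.dropout(self.ffn(x)))


class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        # Embeddings
        self.src_embedding = nn.Embedding(src_vocab, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab, d_model)
        # Registered as a buffer so it moves with the module across devices
        self.register_buffer("positional_encoding", sinusoidal_encoding(max_len, d_model))
        # Encoder and decoder stacks
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Output projection
        self.output_proj = nn.Linear(d_model, tgt_vocab)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask=None):
        # Scale embeddings by sqrt(d_model) before adding positional encodings
        x = self.src_embedding(src) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :src.size(1), :]
        x = self.dropout(x)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, enc_output, src_mask=None, tgt_mask=None):
        x = self.tgt_embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :tgt.size(1), :]
        x = self.dropout(x)
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        # Logits over the target vocabulary, shape (batch, tgt_len, tgt_vocab)
        return self.output_proj(dec_output)
```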
Hyperparameter Comparison
| Model | d_model | Heads | Layers | d_ff | Parameters |
|---|---|---|---|---|---|
| Transformer (base) | 512 | 8 | 6 | 2048 | 65M |
| Transformer (big) | 1024 | 16 | 6 | 4096 | 213M |
| BERT-Base | 768 | 12 | 12 | 3072 | 110M |
| BERT-Large | 1024 | 16 | 24 | 4096 | 340M |
| GPT-2 | 1600 | 25 | 48 | 6400 | 1.5B |
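The parameter column can be roughly reproduced from the other columns. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only the four attention projection matrices and the two feed-forward matrices per layer, and ignores embeddings, biases, and layer-norm parameters. It covers only the single-stack models, since the encoder-decoder rows also carry a cross-attention sub-layer in each decoder layer. `approx_layer_params` is an illustrative helper name.

```python
def approx_layer_params(d_model: int, d_ff: int) -> int:
    """Weight count for one layer: four d_model x d_model attention projections plus two FFN matrices."""
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff
    return attention + ffn


# (d_model, heads, layers, d_ff) copied from the table above
configs = {
    "BERT-Base": (768, 12, 12, 3072),
    "BERT-Large": (1024, 16, 24, 4096),
    "GPT-2": (1600, 25, 48, 6400),
}

for name, (d_model, heads, layers, d_ff) in configs.items():
    total = layers * approx_layer_params(d_model, d_ff)
    print(f"{name}: head_dim={d_model // heads}, layer weights ~{total / 1e6:.0f}M")

# BERT-Base: head_dim=64, layer weights ~85M
# BERT-Large: head_dim=64, layer weights ~302M
# GPT-2: head_dim=64, layer weights ~1475M
```

The gap between these estimates and the published totals is mostly the token and position embedding tables. Every configuration in the table, including the two encoder-decoder rows, uses a per-head dimension of 64.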
Layer Normalization
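Layer Normalization, with a small constant ε added for numerical stability:
$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$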
Where μ and σ² are computed over the feature dimension, and γ, β are learnable parameters.
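To make the computation concrete, here is a minimal pure-Python sketch of layer normalization for a single feature vector. It assumes the biased variance (dividing by the number of features) and a stability constant `eps` of 1e-5, which matches the default of PyTorch's `nn.LayerNorm`; the function name `layer_norm` is illustrative.

```python
import math


def layer_norm(x: list[float], gamma: list[float], beta: list[float], eps: float = 1e-5) -> list[float]:
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, divides by N
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])  # [-1.342, -0.447, 0.447, 1.342]
```

With γ set to ones and β to zeros, the output has mean zero and variance close to one. During training the model adjusts γ and β to rescale each feature as needed.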
Key Design Principles
| Principle | Implementation | Benefit |
|---|---|---|
| Parallelization | Self-attention over all positions | Faster training than RNNs |
| Multi-head attention | h parallel attention heads | Captures diverse relationships |
| Residual connections | Add input to sub-layer output | Enables deep networks |
| Layer normalization | Normalize activations | Stable training |
| Positional encoding | Sinusoidal functions | Sequence order awareness |
| Feed-forward networks | Two linear layers with ReLU | Non-linear transformations |