Transformer Architecture

The transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized NLP by eliminating recurrence entirely and relying solely on attention mechanisms.

Encoder Layer

Each encoder layer contains two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.

Feed-Forward Network

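Definition: Position-wise Feed-Forward Network

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$

Here W_1 has shape d_model × d_ff and W_2 has shape d_ff × d_model; the same weights are applied independently at every position.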

The inner dimension d_ff is typically 4× the model dimension d_model (e.g., 2048 for d_model=512).
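To make "position-wise" concrete, the sketch below applies one two-layer ReLU network to each position of a short sequence independently, in plain Python so the arithmetic stays visible. The helper names `linear` and `ffn`, the toy sizes, and all weight values are illustrative assumptions, not taken from any trained model.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x (weight is indexed [in][out])."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)


# Toy sizes (assumed for illustration): d_model = 2, d_ff = 4 * d_model = 8.
d_model, d_ff = 2, 8
w1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d_model)]
b1 = [-0.2] * d_ff
w2 = [[0.05 * (i - j) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model

sequence = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]  # three positions
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]  # same weights at every position

# Weights plus biases of both linear layers.
num_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
```

Positions 0 and 2 carry the same input vector, so they produce the same output regardless of where they sit in the sequence. With d_model=512 and d_ff=2048, the same count gives 2,099,712 parameters per feed-forward sub-layer.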

Positional Encoding

Since transformers have no recurrence, positional encodings inject sequence order information.

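Definition: Sinusoidal Positional Encoding

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Even dimensions (2i) use sine and odd dimensions (2i+1) use cosine, with i = 0, ..., d_model/2 - 1.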

import torch
import math

def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, max_len, d_model)

# Generate encodings
pe = sinusoidal_encoding(5000, 512)
print(f"Positional encoding shape: {pe.shape}")  # (1, 5000, 512)

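Sinusoidal encodings have the property that, for any fixed offset k, PE(pos+k) can be written as a linear function of PE(pos): each sine/cosine pair is simply rotated by an angle that depends on k. The original authors hypothesized that this would make it easy for the model to learn to attend by relative position.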
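The linear relationship can be checked numerically. The sketch below, in plain Python with no framework, treats each (sin, cos) pair as a 2-D vector and shifts it by k using a rotation matrix that depends only on k and the frequency. The helper names `pe_pair` and `shift_pair` and the chosen values of `pos` and `k` are illustrative assumptions.

```python
import math


def pe_pair(pos, i, d_model):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) for one sinusoid frequency."""
    omega = 1.0 / (10000.0 ** (2 * i / d_model))
    return math.sin(pos * omega), math.cos(pos * omega)


def shift_pair(pair, k, i, d_model):
    """Map the pair at position pos to the pair at pos + k with a 2x2 rotation.

    The matrix entries depend only on the offset k and the frequency index i,
    not on pos, which is what makes the map linear in PE(pos).
    """
    omega = 1.0 / (10000.0 ** (2 * i / d_model))
    s, c = pair
    cos_k, sin_k = math.cos(k * omega), math.sin(k * omega)
    # Angle-addition identities written as a matrix-vector product.
    return (s * cos_k + c * sin_k, -s * sin_k + c * cos_k)


d_model, pos, k = 512, 37, 5
max_error = max(
    abs(a - b)
    for i in range(d_model // 2)
    for a, b in zip(
        shift_pair(pe_pair(pos, i, d_model), k, i, d_model),
        pe_pair(pos + k, i, d_model),
    )
)
```

`max_error` stays at floating-point noise level, which is consistent with the shifted encoding being an exact linear transform of the original one.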

Decoder Layer

The decoder has three sub-layers:

1. Masked multi-head self-attention: prevents attending to future tokens
2. Multi-head cross-attention: attends to encoder output
3. Position-wise feed-forward network
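The masking in the first sub-layer can be sketched as follows. This plain-Python example builds a lower-triangular mask and shows that masked scores receive zero weight after softmax. The convention that True means "may attend" is an assumption made here; attention implementations differ on this, and the helper names are hypothetical.

```python
import math


def causal_mask(size):
    """Return a size x size mask where mask[i][j] is True if position i may attend to j."""
    return [[j <= i for j in range(size)] for i in range(size)]


def apply_mask(scores, mask, fill=float("-inf")):
    """Replace disallowed attention scores so that softmax gives them zero weight."""
    return [
        [s if allowed else fill for s, allowed in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
scores = [[0.5] * 4 for _ in range(4)]  # dummy, uniform attention scores
weights = [softmax(row) for row in apply_mask(scores, mask)]
```

Each row of `weights` spreads its probability mass only over the current and earlier positions, so the first token attends only to itself and the last token attends to all four.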

import torch.nn as nn

# MultiHeadAttention is assumed to be defined elsewhere; it is called as
# attn(query, key, value, mask) and returns (output, attention_weights).

class TransformerBlock(nn.Module):
    """Decoder layer: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention: tgt_mask hides future target positions
        attn1, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn1))  # residual, then layer norm

        # Cross-attention: queries from the decoder, keys/values from the encoder
        attn2, _ = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn2))

        # Position-wise feed-forward
        ff_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ff_out))
        return x

Full Transformer Model

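import math
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder layer: self-attention and feed-forward only (no cross-attention)."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        attn, _ = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn))
        return self.norm2(x + self.dropout(self.ffn(x)))

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # Embeddings
        self.src_embedding = nn.Embedding(src_vocab, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab, d_model)
        # Registered as a buffer so it follows the model across devices
        # without being treated as a trainable parameter
        self.register_buffer("positional_encoding",
                             sinusoidal_encoding(max_len, d_model))

        # Encoder and Decoder stacks
        self.encoder_layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Output projection
        self.output_proj = nn.Linear(d_model, tgt_vocab)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask=None):
        # Scale embeddings by sqrt(d_model) before adding positions
        x = self.src_embedding(src) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :src.size(1), :]
        x = self.dropout(x)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, enc_output, src_mask=None, tgt_mask=None):
        x = self.tgt_embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :tgt.size(1), :]
        x = self.dropout(x)
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        return self.output_proj(dec_output)  # logits over the target vocabulary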

Hyperparameter Comparison

Model	d_model	Heads	Layers	d_ff	Parameters
Transformer (base)	512	8	6	2048	65M
Transformer (big)	1024	16	6	4096	213M
BERT-Base	768	12	12	3072	110M
BERT-Large	1024	16	24	4096	340M
GPT-2 (XL)	1600	25	48	6400	1.5B
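As a rough cross-check on the parameter column, the sketch below counts the weights in a stack of layers that each hold one self-attention sub-layer, one feed-forward sub-layer, and two layer norms. It is an approximation under stated assumptions: biases on every linear layer, no cross-attention, and no embeddings or output head, so it fits the encoder-only and decoder-only rows (BERT, GPT-2) better than the encoder-decoder rows. The function names are hypothetical.

```python
def layer_params(d_model, d_ff):
    """Approximate parameter count of one self-attention + feed-forward layer."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each of two norms
    return attention + feed_forward + layer_norms


def stack_params(d_model, d_ff, num_layers):
    """Parameters in the layer stack only (embeddings and heads excluded)."""
    return num_layers * layer_params(d_model, d_ff)


bert_base_stack = stack_params(d_model=768, d_ff=3072, num_layers=12)
gpt2_xl_stack = stack_params(d_model=1600, d_ff=6400, num_layers=48)
```

For BERT-Base this gives about 85M parameters in the layer stack; the remainder of the reported 110M comes mostly from the embedding tables. The number of heads does not appear in the count because the heads split d_model rather than add to it.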

Layer Normalization

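Definition: Layer Normalization

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here μ and σ² are the mean and variance computed over the feature dimension, γ and β are learnable scale and shift parameters, and ε is a small constant added for numerical stability.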
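A minimal plain-Python sketch of this computation for a single feature vector follows. The function name `layer_norm`, the sample input, and the stability constant `eps=1e-5` are assumptions chosen for illustration; the variance is the biased (population) variance over the features.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector x, then apply the learnable scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased (population) variance
    return [
        g * (v - mean) / math.sqrt(var + eps) + b
        for v, g, b in zip(x, gamma, beta)
    ]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# With gamma = 1 and beta = 0, the output has roughly zero mean and unit variance.
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

Because the statistics are taken over the features of one position, the result does not depend on the batch size or on other positions in the sequence.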

Key Design Principles

Principle	Implementation	Benefit
Parallelization	Self-attention over all positions	Faster training than RNNs
Multi-head attention	h parallel attention heads	Captures diverse relationships
Residual connections	Add input to sub-layer output	Enables deep networks
Layer normalization	Normalize activations	Stable training
Positional encoding	Sinusoidal functions	Sequence order awareness
Feed-forward networks	Two linear layers with ReLU	Non-linear transformations
