Transformer Architecture

The transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized NLP by eliminating recurrence entirely and relying solely on attention mechanisms.

Encoder Layer

Each encoder layer contains two sub-layers: multi-head self-attention and a position-wise feed-forward network.

Feed-Forward Network

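Definition: Position-wise Feed-Forward Network

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$

Here $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$; the same weights are applied independently at every position.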

The inner dimension d_ff is typically 4× the model dimension d_model (e.g., 2048 for d_model=512).
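
As a rough check on the sizes involved, the sketch below counts the parameters of one feed-forward sub-layer, assuming two linear maps with biases. The helper name `ffn_param_count` is made up for illustration, and the code is plain Python so it runs without PyTorch.

```python
def ffn_param_count(d_model: int, d_ff: int) -> int:
    """Parameters of Linear(d_model, d_ff) followed by Linear(d_ff, d_model)."""
    first = d_model * d_ff + d_ff      # weights + bias of the expanding layer
    second = d_ff * d_model + d_model  # weights + bias of the projecting layer
    return first + second


base = ffn_param_count(512, 2048)
print(f"FFN parameters for d_model=512, d_ff=2048: {base:,}")
```

For d_model=512 and d_ff=2048 this comes to roughly 2.1M parameters per layer. Because the same weights are reused at every position, the count does not depend on sequence length.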

Positional Encoding

Since transformers have no recurrence, positional encodings inject sequence order information.

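Definition: Sinusoidal Positional Encoding

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here pos is the token position and i indexes the sine/cosine pairs, so even dimensions use sine and odd dimensions use cosine.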

import torch
import math

def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, max_len, d_model)

# Generate encodings
pe = sinusoidal_encoding(5000, 512)
print(f"Positional encoding shape: {pe.shape}")  # (1, 5000, 512)

Sinusoidal encodings have the property that, for any fixed offset k, PE(pos+k) is a linear function of PE(pos): each sine/cosine pair is rotated by an angle that depends only on k. This makes it easier for the model to learn to attend by relative position.
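
The linear relationship can be checked numerically. The sketch below is a plain-Python illustration (the helper names `sin_cos_pair` and `shift_pair` are hypothetical): it rotates each sine/cosine pair of PE(pos) by an angle determined only by the offset k and compares the result with PE(pos+k) computed directly.

```python
import math


def sin_cos_pair(pos: float, freq: float) -> tuple:
    """One (sin, cos) pair of the encoding at angular frequency `freq`."""
    return math.sin(pos * freq), math.cos(pos * freq)


def shift_pair(pair: tuple, k: float, freq: float) -> tuple:
    """Apply the 2x2 rotation that maps the pair at `pos` to the pair at `pos + k`."""
    s, c = pair
    cos_k, sin_k = math.cos(k * freq), math.sin(k * freq)
    # Angle-addition identities written as a matrix-vector product
    return s * cos_k + c * sin_k, c * cos_k - s * sin_k


d_model, pos, k = 512, 37, 5
max_error = 0.0
for i in range(0, d_model, 2):
    freq = 10000.0 ** (-i / d_model)  # same value as div_term in sinusoidal_encoding
    shifted = shift_pair(sin_cos_pair(pos, freq), k, freq)
    direct = sin_cos_pair(pos + k, freq)
    max_error = max(max_error, abs(shifted[0] - direct[0]), abs(shifted[1] - direct[1]))

print(f"max difference between rotated and direct encoding: {max_error:.2e}")
```

The rotation matrix depends on k and the frequency but not on pos, which is what makes the mapping from PE(pos) to PE(pos+k) linear.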

Decoder Layer

The decoder has three sub-layers:

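  1. Masked multi-head self-attention: prevents attending to future tokens
  2. Multi-head cross-attention: attends to encoder output
  3. Position-wise feed-forward network

import torch.nn as nn

# MultiHeadAttention is not defined in this lesson. It is assumed to take
# (query, key, value, mask) and return (output, attention_weights).
# This block implements the three-sub-layer decoder layer described above.
class TransformerBlock(nn.Module):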
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Self-attention with causal mask
        attn1, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn1))

        # Cross-attention with encoder output
        attn2, _ = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn2))

        # Feed-forward
        ff_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ff_out))
        return x
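
The causal mask passed as `tgt_mask` is what keeps the first sub-layer from looking ahead. The sketch below shows the idea in plain Python on a toy 3x3 score matrix. It assumes a convention where `True` means "may attend"; the `MultiHeadAttention` module used above is not shown in this lesson and may expect a different convention. The helper names are illustrative only.

```python
import math


def causal_mask(size: int) -> list:
    """mask[i][j] is True when query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores: list, mask: list) -> list:
    """Row-wise softmax where disallowed positions receive zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Disallowed scores become -inf, so exp() turns them into exactly 0.
        masked = [s if keep else -math.inf for s, keep in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy attention scores
attn = masked_softmax(scores, causal_mask(3))
for row in attn:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but position i spreads its weight only over positions 0 through i, so the first token attends only to itself.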

Full Transformer Model

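class EncoderBlock(nn.Module):
    """Encoder layer: self-attention and feed-forward only (no cross-attention)."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        attn, _ = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()
        self.d_model = d_model

        # Embeddings
        self.src_embedding = nn.Embedding(src_vocab, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab, d_model)
        # Registered as a buffer so it moves with the model across devices but is not trained
        self.register_buffer("positional_encoding", sinusoidal_encoding(max_len, d_model))

        # Encoder stack: two sub-layers per layer
        self.encoder_layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Decoder stack: three sub-layers per layer (TransformerBlock above)
        self.decoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Output projection
        self.output_proj = nn.Linear(d_model, tgt_vocab)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask=None):
        # Scale embeddings by sqrt(d_model) before adding positional encodings
        x = self.src_embedding(src) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :src.size(1), :]
        x = self.dropout(x)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x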

    def decode(self, tgt, enc_output, src_mask=None, tgt_mask=None):
        x = self.tgt_embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.positional_encoding[:, :tgt.size(1), :]
        x = self.dropout(x)
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        return self.output_proj(dec_output)

Hyperparameter Comparison

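| Model | d_model | Heads | Layers | d_ff | Parameters |
|---|---|---|---|---|---|
| Transformer (base) | 512 | 8 | 6 | 2048 | 65M |
| Transformer (big) | 1024 | 16 | 6 | 4096 | 213M |
| BERT-Base | 768 | 12 | 12 | 3072 | 110M |
| BERT-Large | 1024 | 16 | 24 | 4096 | 340M |
| GPT-2 | 1600 | 25 | 48 | 6400 | 1.5B |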
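
To see where most of these parameters come from, the sketch below estimates the weight matrices in the attention and feed-forward sub-layers from d_model, d_ff, and the layer count. It is a back-of-the-envelope calculation that deliberately ignores embeddings, biases, and layer-norm parameters, and the function names are hypothetical.

```python
def attention_weights(d_model: int) -> int:
    """Q, K, V and output projections: four d_model x d_model matrices."""
    return 4 * d_model * d_model


def ffn_weights(d_model: int, d_ff: int) -> int:
    """Two linear maps, d_model -> d_ff -> d_model."""
    return 2 * d_model * d_ff


def stack_weights(d_model: int, d_ff: int, layers: int, attention_blocks: int = 1) -> int:
    """Weights in a stack; use attention_blocks=2 for decoder layers with cross-attention."""
    per_layer = attention_blocks * attention_weights(d_model) + ffn_weights(d_model, d_ff)
    return layers * per_layer


# Values of d_model, d_ff and layer count taken from the table above
bert_base = stack_weights(768, 3072, 12)
gpt2 = stack_weights(1600, 6400, 48)
transformer_base = (stack_weights(512, 2048, 6)
                    + stack_weights(512, 2048, 6, attention_blocks=2))

print(f"BERT-Base layer weights:          {bert_base / 1e6:.1f}M")
print(f"GPT-2 layer weights:              {gpt2 / 1e6:.1f}M")
print(f"Transformer (base) layer weights: {transformer_base / 1e6:.1f}M")
print(f"Per-head dimension (512 / 8):     {512 // 8}")
```

The estimates (about 85M, 1.47B, and 44M) fall below the totals in the table; the gap comes mainly from the embedding tables, plus the biases and layer-norm parameters left out here. Dividing d_model by the head count gives 64 dimensions per head for every row.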

Layer Normalization

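Definition: Layer Normalization

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where μ and σ² are the mean and variance computed over the feature dimension, γ and β are learnable parameters, and ε is a small constant added for numerical stability.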
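
A minimal plain-Python sketch of this normalization for a single feature vector follows. It assumes the biased variance and an epsilon of 1e-5, which matches the default of `nn.LayerNorm`; the function name `layer_norm` is chosen here for illustration.

```python
import math


def layer_norm(x: list, gamma: list, beta: list, eps: float = 1e-5) -> list:
    """Normalize one feature vector, then apply the learnable scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


features = [2.0, 4.0, 6.0, 8.0]
out = layer_norm(features, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 4) for v in out])
```

With γ set to ones and β to zeros, the output has mean 0 and variance very close to 1. During training, γ and β let the network rescale and shift each feature after normalization.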

Key Design Principles

| Principle | Implementation | Benefit |
|---|---|---|
| Parallelization | Self-attention over all positions | Faster training than RNNs |
| Multi-head attention | h parallel attention heads | Captures diverse relationships |
| Residual connections | Add input to sub-layer output | Enables deep networks |
| Layer normalization | Normalize activations | Stable training |
| Positional encoding | Sinusoidal functions | Sequence order awareness |
| Feed-forward networks | Two linear layers with ReLU | Non-linear transformations |
