Transformers and BERT: Attention Is All You Need

From RNNs to Transformers

The Limitations of Recurrent Architectures

Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) processed sequences step-by-step, maintaining a hidden state that accumulated information over time. While effective for many tasks, they suffered from fundamental limitations:

Sequential Bottleneck: RNNs process tokens one at a time, so computation cannot be parallelized across time steps. A sequence of length $n$ requires $O(n)$ sequential operations, which leaves much of a GPU's parallel capacity unused during training.

Long-Range Dependencies: Despite gating mechanisms, RNNs struggled to maintain information over long distances. The gradient signal must propagate through every intermediate step, leading to vanishing or exploding gradients.

Information Bottleneck: The fixed-size hidden state $h_t \in \mathbb{R}^d$ must compress all relevant information from the entire sequence, creating a capacity bottleneck.

The Attention Revolution

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminated recurrence entirely. The key insight: use self-attention to model relationships between all positions simultaneously, enabling full parallelization and direct connections across arbitrary distances.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The Transformer needs only $O(1)$ sequential operations per layer, because all positions are processed in parallel, at a cost of $O(n^2 \cdot d)$ compute per layer. It trades sequential depth for quadratic attention computation, which is a favorable tradeoff on modern hardware as long as $n$ is not too large.

RNNs	Transformers
Sequential: $O(n)$ steps	Parallel: $O(1)$ steps
Hidden state bottleneck	Full pairwise attention
$O(n \cdot d^2)$ compute	$O(n^2 \cdot d)$ compute
Struggles with long range	Direct connections across all distances

Self-Attention Mechanism

Query, Key, Value Framework

Self-attention computes a weighted sum of all positions in a sequence, where the weights are determined by the compatibility (dot product) between positions. Each input token $x_i$ is projected into three vectors:

q_i = x_i W_Q, \quad k_i = x_i W_K, \quad v_i = x_i W_V

where $x_i \in \mathbb{R}^{1 \times d}$ is treated as a row vector, and $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.

Scaled Dot-Product Attention

Given a sequence of $n$ tokens with embeddings $X \in \mathbb{R}^{n \times d}$ , we compute the full attention operation as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

where:

$Q = XW_Q \in \mathbb{R}^{n \times d_k}$ — Query matrix
$K = XW_K \in \mathbb{R}^{n \times d_k}$ — Key matrix
$V = XW_V \in \mathbb{R}^{n \times d_v}$ — Value matrix
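The operation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy sizes, the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (n, d_k), V: (n, d_v) -> output (n, d_v), weights (n, n)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 8, 8              # toy sizes chosen for illustration
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, m)) for m in (d_k, d_k, d_v))

out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, attn.shape)             # (4, 8) (4, 4)
```

Row $i$ of `attn` holds the weights that position $i$ places on every position in the sequence.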

Why scale by $\sqrt{d_k}$ ? When $d_k$ is large, the dot products $q_i^T k_j$ tend to have large magnitudes, pushing the softmax into regions with extremely small gradients. If the components of $q_i$ and $k_j$ are independent with zero mean and unit variance, the dot product has variance $d_k$ , so dividing by its standard deviation $\sqrt{d_k}$ restores unit variance and keeps the softmax in a regime with useful gradients:

\text{Var}(q_i^T k_j) = \sum_{m=1}^{d_k} \text{Var}(q_{im}) \cdot \text{Var}(k_{jm}) = d_k \quad \Rightarrow \quad \text{Var}\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) = 1
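The scaling argument can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization of random initialization; the sample count and $d_k = 256$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
num_samples = 10000

# Assumption: components are i.i.d. with zero mean and unit variance
q = rng.normal(size=(num_samples, d_k))
k = rng.normal(size=(num_samples, d_k))

dots = np.sum(q * k, axis=1)                 # one dot product per sample
raw_var = dots.var()                         # expected to be near d_k
scaled_var = (dots / np.sqrt(d_k)).var()     # expected to be near 1

print(raw_var / d_k, scaled_var)
```

Both printed values should be close to 1, which is the sense in which the scaling keeps the softmax inputs at a stable magnitude as $d_k$ grows.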

Attention as Soft Retrieval

Self-attention can be interpreted as a soft content-based retrieval system:

Queries represent what each token is "looking for"
Keys represent what each token "offers"
Values represent the actual information carried by each token
The attention weights $\alpha_{ij}$ determine how much information to retrieve from position $j$ for position $i$

Computational Complexity

The full attention computation requires:

Matrix multiplication $QK^T$ : $O(n^2 \cdot d_k)$
Softmax: $O(n^2)$
Weighted sum: $O(n^2 \cdot d_v)$
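Total: $O(n^2 \cdot d)$ for the attention itself, taking $d_k, d_v \leq d$ (the $Q$ , $K$ , $V$ projections add a further $O(n \cdot d^2)$ )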

This quadratic complexity in sequence length is the primary limitation of standard Transformers for very long sequences.

Multi-Head Attention

Parallel Attention Heads

Multi-Head Attention runs multiple attention operations in parallel, allowing the model to attend to different types of relationships simultaneously:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O

\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)

where $W_Q^i, W_K^i \in \mathbb{R}^{d \times d_k}$ , $W_V^i \in \mathbb{R}^{d \times d_v}$ , and $W_O \in \mathbb{R}^{h d_v \times d}$ , typically with $d_k = d_v = d/h$ .
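A minimal NumPy sketch of multi-head self-attention follows. It assumes each head projects from the model dimension $d$ down to $d/h$ and that the output projection maps the concatenated heads back to $d$; the function name, the stacked weight layout, and the toy sizes are choices made for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d); W_Q/W_K/W_V: (h, d, d_head); W_O: (h * d_head, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (n, d_head)
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                     # (n, d_head)
    # Concatenate the heads along the feature axis, then mix them with W_O
    return np.concatenate(heads, axis=-1) @ W_O       # (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_head = d // h
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d))
X = rng.normal(size=(n, d))

Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(Y.shape)  # (5, 16)
```

Production implementations usually fuse the per-head projections into one large matrix multiplication and reshape; the explicit loop here is only for readability.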

Why Multiple Heads Work

Different heads often learn to specialize in different linguistic phenomena. The assignments below are illustrative, not fixed roles:

Head 1 may attend to syntactic dependencies (subject-verb agreement)
Head 2 may capture semantic relationships (modifier-modified)
Head 3 may track positional patterns (adjacent tokens)
Head 4 may resolve coreference (pronoun-antecedent)

With $h$ heads each using $d_k = d/h$ dimensions, the total compute is roughly the same as for a single head with full dimensionality, while each head is free to learn its own attention pattern.

Attention Head Visualization

In practice, attention patterns can be visualized as heatmaps in which each row represents a query token and each column represents a key token.

Positional Encoding

The Need for Position Information

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, "The cat sat" and "sat cat The" would produce the same representation for each word, only in a different order, so word order could not influence the result. Positional encodings inject sequence order information.
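Permutation equivariance is easy to verify numerically. The sketch below uses random vectors as stand-ins for the embeddings of "The cat sat" and random projection matrices; all names and sizes are assumptions for the illustration.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 3, 8
X = rng.normal(size=(n, d))            # stand-in embeddings for "The cat sat"
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])             # reorder to "sat cat The"
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Permuting the input rows permutes the output rows in the same way
print(np.allclose(out_perm, out[perm]))  # True
```

Each word receives exactly the same vector in both orderings, which is why a separate position signal is needed.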

Sinusoidal Positional Encoding

The original Transformer uses fixed sinusoidal functions:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

where $pos$ is the position index, $d$ is the model dimension, and $i \in \{0, \ldots, d/2 - 1\}$ indexes the sine/cosine dimension pair.

Key Properties:

Bounded values: $PE_{(pos, i)} \in [-1, 1]$ for all positions and dimensions
Relative positions learnable: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ (rotation matrix in each frequency pair)
Unique encoding per position: Each position has a distinct encoding vector
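Generalization to unseen lengths: The encoding is defined for any position, including positions beyond those seen in training. The original authors hypothesized that this would help the model extrapolate to longer sequences, although such extrapolation is often limited in practice.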
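The sinusoidal encoding and two of the properties listed above (bounded values, and the rotation relating $PE_{pos+k}$ to $PE_{pos}$) can be sketched as follows. The function name and the sizes are illustrative assumptions, and the model dimension is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """Return a (max_len, d) matrix; d is assumed to be even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2) pair index
    angles = pos / np.power(10000.0, 2 * i / d)  # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # [0. 1. 0. 1.]

# For pair i = 0 the angle equals pos, so a shift by k is a rotation by k radians
k = 3
R = np.array([[np.cos(k), np.sin(k)],
              [-np.sin(k), np.cos(k)]])
print(np.allclose(pe[10 + k, :2], R @ pe[10, :2]))  # True
```

The rotation matrix depends only on the offset $k$, not on the absolute position, which is what makes relative offsets easy for a linear layer to pick up.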

Alternative Positional Encodings

Learned Positional Embeddings: BERT and GPT use learned position embeddings $E_{pos} \in \mathbb{R}^{d}$ stored in a lookup table. This is simpler but limits generalization to unseen sequence lengths.

Rotary Position Embeddings (RoPE): Encodes positions by rotating query and key vectors in 2D planes, enabling relative position awareness through dot products. Used in modern LLMs like LLaMA.

ALiBi (Attention with Linear Biases): Adds linear bias terms to attention scores based on relative distance, without explicit positional encoding.
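The relative-position property of RoPE can be illustrated with a small sketch. It assumes the interleaved pairing convention (dimensions $2i$ and $2i+1$ form one plane) and a base of 10000; some implementations pair dimensions differently, so treat the function `rope` below as one possible formulation rather than any particular library's code.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (shape (d,), d even) by position-dependent angles."""
    d = x.shape[-1]
    theta = pos / base ** (np.arange(0, d, 2) / d)   # one angle per 2D plane
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset (3) at different absolute positions gives the same score
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 105) @ rope(k, 102)
print(np.isclose(s1, s2))  # True
```

Because the attention score depends only on the offset between the query and key positions, no separate positional vector has to be added to the embeddings.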

Transformer Encoder Architecture

Encoder Block

Each Transformer encoder block consists of two sub-layers, each wrapped in a residual connection followed by layer normalization.

Mathematical Formulation

Sub-layer 1: Multi-Head Self-Attention

\text{MHA}(x) = \text{MultiHead}(x, x, x)

\text{z} = \text{LayerNorm}(x + \text{MHA}(x))

Sub-layer 2: Position-wise Feed-Forward Network

\text{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2

\text{output} = \text{LayerNorm}(z + \text{FFN}(z))

The FFN is applied independently to each position but shares the same parameters across positions — hence "position-wise."

Layer Normalization

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

where $\mu$ and $\sigma^2$ are computed across the feature dimension for each token, and $\gamma, \beta$ are learned scale and shift parameters.
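Putting the two sub-layers together, an encoder block can be sketched in NumPy as below. To keep the example short, `toy_self_attention` is a hypothetical single-head stand-in for MHA without learned projections; the parameter names, the weight scale, and the sizes are likewise assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)      # per-token mean over features
    var = x.var(axis=-1, keepdims=True)      # per-token variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ffn(z, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied to every position
    return np.maximum(0.0, z @ W1 + b1) @ W2 + b2

def toy_self_attention(x):
    # Hypothetical stand-in for MHA: one head, no learned projections
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def encoder_block(x, p):
    z = layer_norm(x + toy_self_attention(x), p["gamma1"], p["beta1"])  # sub-layer 1
    return layer_norm(z + ffn(z, p["W1"], p["b1"], p["W2"], p["b2"]),
                      p["gamma2"], p["beta2"])                          # sub-layer 2

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
p = {
    "W1": rng.normal(size=(d, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d)) * 0.1, "b2": np.zeros(d),
    "gamma1": np.ones(d), "beta1": np.zeros(d),
    "gamma2": np.ones(d), "beta2": np.zeros(d),
}
y = encoder_block(rng.normal(size=(n, d)), p)
print(y.shape)  # (4, 8)
```

This follows the post-normalization ordering of the formulas above (normalize after adding the residual); many later models normalize before each sub-layer instead.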

Transformer Decoder Architecture

Decoder Block

The decoder block has three sub-layers: masked self-attention, encoder-decoder cross-attention (queries come from the decoder, keys and values from the encoder output), and the position-wise feed-forward network.

Causal Masking

For autoregressive generation, the decoder uses a causal mask to prevent attending to future positions:

\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

Applied before softmax: with $S = QK^T / \sqrt{d_k}$ , $\text{softmax}(S + \text{Mask})$ assigns zero weight to future positions, so position $i$ only attends to positions $\leq i$ .
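A short NumPy sketch of the causal mask is shown below. The score matrix `S` is filled with random numbers as a stand-in for $QK^T / \sqrt{d_k}$; the sequence length of 4 is arbitrary.

```python
import numpy as np

n = 4
# 0 on and below the diagonal, -inf strictly above it
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)

rng = np.random.default_rng(0)
S = rng.normal(size=(n, n))                  # stand-in attention scores
masked = S + mask

# Row-wise softmax; exp(-inf) evaluates to 0, so future positions get no weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

The printed matrix is lower triangular: the first row puts all of its weight on position 0, and each later row spreads its weight over the current and earlier positions only.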

BERT: Bidirectional Encoder Representations from Transformers

Architecture Overview

BERT (Devlin et al., 2019) uses only the Transformer encoder stack, enabling bidirectional context understanding — unlike GPT which is left-to-right only.

Bidirectional here means that each token attends to all other tokens, both to its left and to its right.

BERT Model Variants

Model	Layers ( $L$ )	Hidden ( $d$ )	Heads ( $h$ )	Parameters
BERT-base	12	768	12	110M
BERT-large	24	1024	16	340M

Pre-Training Objectives

BERT is pre-trained on two self-supervised tasks:

1. Masked Language Modeling (MLM)

Randomly select 15% of input tokens and train the model to predict the originals:

\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}})

Of the tokens selected for prediction:

80% are replaced with [MASK]
10% are replaced with a random token
10% are left unchanged

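Always substituting [MASK] would create a mismatch between pre-training and fine-tuning, because [MASK] never appears in downstream inputs. The 80/10/10 strategy mitigates this by forcing the model to produce useful representations for unmasked tokens as well.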
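The corruption procedure can be sketched with the standard library alone. This is a simplification: each token is selected independently with probability 0.15 rather than sampling a fixed 15% of positions, special tokens are not excluded, and the tiny vocabulary and the function name `mask_tokens` are made up for the example.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # token not selected for prediction
        labels[i] = tok                       # the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 200
corrupted, labels = mask_tokens(tokens, vocab)

selected = sum(label is not None for label in labels)
print(selected / len(tokens))   # fraction selected, near 0.15
```

Only positions with a non-`None` label contribute to $\mathcal{L}_{\text{MLM}}$; all other positions are left untouched and ignored by the loss.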

2. Next Sentence Prediction (NSP)

Given sentence pairs $(A, B)$ , predict whether $B$ actually follows $A$ in the corpus:

\mathcal{L}_{\text{NSP}} = -\left[ y \log P(\text{IsNext} | [CLS], A, B) + (1-y) \log P(\text{NotNext} | [CLS], A, B) \right]

BERT Input Representation

The input to BERT is the sum of three embeddings:

\text{Input} = \text{TokenEmbed}(x) + \text{PosEmbed}(pos) + \text{SegEmbed}(seg)

where:

Token Embedding: WordPiece tokenization (max 512 tokens)
Position Embedding: Learned (max 512 positions)
Segment Embedding: Distinguishes sentence A vs B (for NSP)

Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (masked), [PAD] (padding)
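The sum of the three embeddings amounts to three table lookups followed by an addition. In the sketch below the table sizes, the random weights, and the token ids are invented for illustration and do not correspond to BERT's actual vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, n_segments, d = 100, 512, 2, 16   # toy sizes, not BERT's

token_emb = rng.normal(size=(vocab_size, d))
pos_emb = rng.normal(size=(max_pos, d))
seg_emb = rng.normal(size=(n_segments, d))

# Layout: [CLS] a1 a2 [SEP] b1 b2 [SEP], using made-up token ids
token_ids = np.array([1, 17, 42, 2, 8, 23, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])   # sentence A vs sentence B
positions = np.arange(len(token_ids))

inputs = token_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
print(inputs.shape)  # (7, 16)
```

The two [SEP] tokens share a token embedding but differ in position and segment, so their input vectors are distinct.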

GPT: Autoregressive Language Modeling

GPT Architecture

GPT (Radford et al., 2018) uses only the Transformer decoder stack, training autoregressively to predict the next token:

\mathcal{L}_{\text{GPT}} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}; \theta)
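The autoregressive loss is a sum of per-position cross-entropies with targets shifted by one. The sketch below assumes the model emits one row of logits per input position and, unlike the formula above, omits the term for the first token because no start-of-sequence symbol is modeled; the function name and the toy vocabulary size are assumptions.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """logits: (T, V), row t scores the token at position t + 1; token_ids: (T + 1,)."""
    targets = token_ids[1:]                                  # shift by one
    logits = logits - logits.max(axis=-1, keepdims=True)     # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

V = 10
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids) - 1, V))   # a model that knows nothing

loss = next_token_loss(uniform_logits, token_ids)
print(loss)   # equals 4 * ln(10), roughly 9.21
```

A uniform predictor pays $\ln V$ per token, which gives a useful baseline when reading training curves.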

BERT vs GPT: Key Differences

Aspect	BERT	GPT
Architecture	Encoder only	Decoder only
Attention	Bidirectional	Unidirectional (causal)
Pre-training	MLM + NSP	Next token prediction
Fine-tuning	Classification, QA, NER	Generation, classification
Best for	Understanding tasks	Generation tasks

BERT Variants and Evolution

RoBERTa (Liu et al., 2019): Removed NSP, more data, larger batches
ALBERT (Lan et al., 2020): Parameter sharing, factorized embeddings
DistilBERT (Sanh et al., 2019): Knowledge distillation, about 60% of BERT-base's parameters, about 97% of its performance
DeBERTa (He et al., 2021): Disentangled attention, enhanced mask decoder

Fine-Tuning Pretrained Models

Transfer Learning Paradigm

The modern NLP paradigm follows a two-stage approach:

Pre-training: Learn general language representations on large corpora (self-supervised)
Fine-tuning: Adapt to specific downstream tasks (supervised)

Task-Specific Heads

Different tasks require different output layers:

Task	Input	Output	Head
Sentiment	`[CLS]` token	Binary/multi-class	Linear + softmax
NER	All tokens	Per-token labels	Linear + softmax (optionally a CRF)
QA	Passage + Question	Start/end span	Linear layer producing start and end logits
Similarity	Two `[CLS]`	Similarity score	Cosine / MLP
Generation	All tokens	Next token distribution	Language head

Hugging Face Transformers Implementation

Installation

pip install transformers datasets accelerate

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load pretrained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize input
text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_dim)
# outputs.pooler_output: (batch, hidden_dim) - [CLS] hidden state passed through a dense + tanh layer

Fine-Tuning for Text Classification

from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Tokenize
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

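# Accuracy metric for evaluation; the Trainer passes (logits, labels)
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)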

# Fine-tune
trainer.train()

Fine-Tuning for Named Entity Recognition

from transformers import AutoModelForTokenClassification
import numpy as np
import evaluate

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=9,  # CoNLL-2003 NER tags
)

# Tokenize with alignment
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word_id = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens
            elif word_id != prev_word_id:
                label_ids.append(label[word_id])
            else:
                label_ids.append(-100)  # Subword tokens
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

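# Load CoNLL-2003; the IMDB `dataset` above has no "tokens" or "ner_tags" columns.
# (The exact Hub identifier may vary with the installed `datasets` version.)
ner_dataset = load_dataset("conll2003")
label_names = ner_dataset["train"].features["ner_tags"].feature.names
tokenized_ner = ner_dataset.map(tokenize_and_align_labels, batched=True)

# seqeval computes entity-level precision/recall/F1 from string tags
metric = evaluate.load("seqeval")

# Compute metrics
def compute_metrics_ner(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)
    # Drop positions labelled -100 (special tokens and subword continuations)
    true_labels = [
        [label_names[l] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    true_preds = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_preds, references=true_labels)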

Using Pipelines for Inference

from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is amazing!")
# [{'label': 'POSITIVE', 'score': 0.9998}]  (a list with one dict per input)

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
result = ner("Apple was founded by Steve Jobs in California")
# [{'entity_group': 'ORG', 'word': 'Apple', ...},
#  {'entity_group': 'PER', 'word': 'Steve Jobs', ...},
#  {'entity_group': 'LOC', 'word': 'California', ...}]

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="When was BERT published?",
    context="BERT was published by Google in October 2018.",
)
# {'answer': 'October 2018', 'score': 0.95, ...}

# Text Generation (GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)

Advanced: Custom Model Architecture

import torch
import torch.nn as nn
from transformers import AutoModel

class CustomTransformerModel(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        pooled = outputs.pooler_output  # pooled [CLS] representation (dense + tanh); not every encoder provides it
        logits = self.classifier(pooled)
        return logits

# Usage
model = CustomTransformerModel("bert-base-uncased", num_classes=5)

Key Takeaways

Self-attention enables parallel processing of sequences with direct connections between all positions, solving the sequential bottleneck of RNNs.
Multi-head attention allows the model to capture diverse relationships (syntactic, semantic, positional) simultaneously.
Positional encoding injects sequence order information into the permutation-equivariant self-attention mechanism.
BERT (encoder-only) excels at understanding tasks through bidirectional pre-training with MLM and NSP objectives.
GPT (decoder-only) excels at generation tasks through autoregressive next-token prediction.
Fine-tuning pretrained models on task-specific data achieves strong performance with far less labeled data and compute than training from scratch.
Hugging Face Transformers provides a unified API for accessing, fine-tuning, and deploying transformer models.

References

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP.