Transformers and BERT: Attention Is All You Need

From RNNs to Transformers

The Limitations of Recurrent Architectures

Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) processed sequences step-by-step, maintaining a hidden state that accumulated information over time. While effective for many tasks, they suffered from fundamental limitations:

Sequential Bottleneck: RNNs process tokens one at a time, so computation cannot be parallelized across time steps. A sequence of length $n$ requires $O(n)$ sequential operations, which leaves much of a GPU's parallel hardware idle.

Long-Range Dependencies: Despite gating mechanisms, RNNs struggled to maintain information over long distances. The gradient signal must propagate through every intermediate step, leading to vanishing or exploding gradients.

Information Bottleneck: The fixed-size hidden state $h_t \in \mathbb{R}^d$ must compress all relevant information from the entire sequence, creating a capacity bottleneck.

The Attention Revolution

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminated recurrence entirely. The key insight: use self-attention to model relationships between all positions simultaneously, enabling full parallelization and direct connections across arbitrary distances.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The Transformer achieves $O(1)$ sequential operations (fully parallelizable) and $O(n^2 \cdot d)$ compute complexity, trading sequential depth for quadratic attention computation, a favorable tradeoff on modern hardware.

| RNNs | Transformers |
|---|---|
| Sequential: $O(n)$ steps | Parallel: $O(1)$ steps |
| Hidden state bottleneck | Full pairwise attention |
| $O(n \cdot d^2)$ compute | $O(n^2 \cdot d)$ compute |
| Struggles with long range | Direct connections across all distances |
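
To make the compute row of the comparison concrete, the sketch below plugs numbers into the two asymptotic per-layer costs. Constant factors are ignored and the function names are illustrative, so the results are only orders of magnitude.

```python
def rnn_layer_ops(n: int, d: int) -> int:
    """Approximate per-layer cost of a recurrent layer: O(n * d^2)."""
    return n * d * d


def attention_layer_ops(n: int, d: int) -> int:
    """Approximate per-layer cost of self-attention: O(n^2 * d)."""
    return n * n * d


d = 512  # model width, chosen only for illustration
for n in (128, 512, 4096):
    ratio = attention_layer_ops(n, d) / rnn_layer_ops(n, d)
    print(f"n={n:5d}  attention/recurrent cost ratio = {ratio:.2f}")
```

The ratio reduces to n/d, so under these assumptions self-attention is the cheaper of the two whenever the sequence is shorter than the model width, and the more expensive one beyond that.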

Self-Attention Mechanism

Query, Key, Value Framework

Self-attention computes a weighted sum of all positions in a sequence, where the weights are determined by the compatibility (dot product) between positions. Each input token $x_i$ is projected into three vectors:

$$q_i = W_Q x_i, \quad k_i = W_K x_i, \quad v_i = W_V x_i$$

where $W_Q, W_K \in \mathbb{R}^{d_k \times d}$ and $W_V \in \mathbb{R}^{d_v \times d}$ are learned projection matrices.

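[Figure: Scaled dot-product attention. Each input $x_i$ is projected by $W_Q$, $W_K$, $W_V$ into $q_i$, $k_i$, $v_i$; then (1) compute scores $S = QK^T / \sqrt{d_k}$, (2) apply softmax, $\alpha = \text{softmax}(S)$, (3) take the weighted sum $z_i = \sum_j \alpha_{ij} v_j$. Each position attends to all positions simultaneously.]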

Scaled Dot-Product Attention

Given a sequence of $n$ tokens with embeddings $X \in \mathbb{R}^{n \times d}$, we compute the full attention operation as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where, with $W_Q$, $W_K$, $W_V$ shaped as above (so the row-wise form uses their transposes):

  • $Q = XW_Q^\top \in \mathbb{R}^{n \times d_k}$: Query matrix
  • $K = XW_K^\top \in \mathbb{R}^{n \times d_k}$: Key matrix
  • $V = XW_V^\top \in \mathbb{R}^{n \times d_v}$: Value matrix
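
The formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the toy sizes, and the random weights are all assumptions, and the projection matrices are stored with shape (d_k, d) as in the per-token definition, so the row-wise form multiplies by their transposes.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 8, 8              # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))
W_V = rng.normal(size=(d_v, d))

Z, A = scaled_dot_product_attention(X @ W_Q.T, X @ W_K.T, X @ W_V.T)
print(Z.shape)          # one d_v-dimensional output per token
print(A.sum(axis=-1))   # attention weights in each row sum to 1
```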

Why scale by $\sqrt{d_k}$? When $d_k$ is large, the dot products $q_i^T k_j$ tend to have large magnitudes, pushing the softmax into regions with extremely small gradients. If the components of $q_i$ and $k_j$ are independent with zero mean and unit variance, the dot product is a sum of $d_k$ independent terms, each with variance 1, so its standard deviation is $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ restores unit variance and keeps the softmax in a regime with useful gradients:

$$\text{Var}(q_i^T k_j) = \sum_{l=1}^{d_k} \text{Var}(q_{il})\,\text{Var}(k_{jl}) = d_k \quad\Longrightarrow\quad \text{Var}\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) = 1$$
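
A quick empirical check of the scaling argument, assuming query and key components are drawn independently from a standard normal distribution (an idealization of random initialization). The helper name and sample size are illustrative.

```python
import numpy as np


def dot_product_variances(d_k: int, num_pairs: int = 20_000, seed: int = 0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) over random pairs."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((num_pairs, d_k))
    k = rng.standard_normal((num_pairs, d_k))
    dots = np.sum(q * k, axis=1)                 # one dot product per pair
    return dots.var(), (dots / np.sqrt(d_k)).var()


for d_k in (16, 64, 256):
    raw, scaled = dot_product_variances(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={raw:7.1f}  var(q.k/sqrt(d_k))={scaled:.3f}")
```

The unscaled variance should grow roughly in proportion to d_k, while the scaled variance should stay close to 1.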

Attention as Soft Retrieval

Self-attention can be interpreted as a soft content-based retrieval system:

  • Queries represent what each token is "looking for"
  • Keys represent what each token "offers"
  • Values represent the actual information carried by each token
  • The attention weights $\alpha_{ij}$ determine how much information to retrieve from position $j$ for position $i$

Computational Complexity

The full attention computation requires:

  • Matrix multiplication $QK^T$: $O(n^2 \cdot d_k)$
  • Softmax: $O(n^2)$
  • Weighted sum: $O(n^2 \cdot d_v)$
  • Total: $O(n^2 \cdot d)$

This quadratic complexity in sequence length is the primary limitation of standard Transformers for very long sequences.
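
The quadratic term also shows up in memory. The arithmetic below estimates the size of the attention weight matrices for one layer, assuming they are materialized in full, stored as 32-bit floats, and computed for 12 heads. All three are assumptions made for illustration.

```python
def attention_matrix_bytes(n: int, num_heads: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed to store num_heads attention matrices of shape (n, n)."""
    return num_heads * n * n * bytes_per_value


for n in (512, 4096, 32768):
    mib = attention_matrix_bytes(n, num_heads=12) / 2**20
    print(f"n={n:6d}  attention weights ~ {mib:,.0f} MiB per layer")
```

Doubling the sequence length quadruples this figure, which is why very long inputs become impractical under these assumptions.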


Multi-Head Attention

Parallel Attention Heads

Multi-Head Attention runs multiple attention operations in parallel, allowing the model to attend to different types of relationships simultaneously:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O$$
$$\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)$$

where $W_Q^i, W_K^i \in \mathbb{R}^{d \times d_k}$, $W_V^i \in \mathbb{R}^{d \times d_v}$, and $W_O \in \mathbb{R}^{h d_v \times d}$, typically with $d_k = d_v = d/h$ (so that $W_O$ is $d \times d$).
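
A minimal NumPy sketch of multi-head self-attention follows. It assumes $d_k = d_v = d/h$ and packs the per-head projections side by side into single (d, d) matrices, a common implementation convenience rather than something the formulas require. Function names and sizes are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Self-attention with `num_heads` heads.

    X: (n, d). W_Q, W_K, W_V: (d, d), holding all per-head projections
    side by side. W_O: (d, d). Assumes d is divisible by num_heads.
    """
    n, d = X.shape
    d_head = d // num_heads

    def split(M):
        # (n, d) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # Concat(head_1..head_h)
    return concat @ W_O, weights


rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
Z, A = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=h)
print(Z.shape, A.shape)   # output is (n, d); weights are (h, n, n)
```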

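[Figure: Multi-head attention. The input $X$ feeds $h$ parallel heads (illustrated as syntactic, semantic, positional, and coreference heads); their outputs are concatenated as $[\text{head}_1; \text{head}_2; \ldots]$ and passed through the linear projection $W_O$ to give $Z \in \mathbb{R}^{n \times d}$. Each head learns distinct attention patterns.]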

Why Multiple Heads Work

Different heads can specialize in different linguistic phenomena. For example:

  • Head 1 may attend to syntactic dependencies (subject-verb agreement)
  • Head 2 may capture semantic relationships (modifier-modified)
  • Head 3 may track positional patterns (adjacent tokens)
  • Head 4 may resolve coreference (pronoun-antecedent)

With $h$ heads each using $d_k = d/h$ dimensions, the total compute is roughly the same as for a single head with full dimensionality, while the model can represent several distinct attention patterns at once.

Attention Head Visualization

In practice, attention patterns can be visualized as heatmaps where each row represents a query token and each column represents a key token:

[Figure: Attention weight heatmap for Head 3 on "The cat sat on the mat", with query tokens as rows and key tokens as columns; color scale from low (~0.0) through medium (~0.3) to high (~0.7). In this illustration Head 3 focuses on predicate-argument structure: "cat" -> "sat" (subject-verb), "sat" -> "mat" (verb-object), and prepositions -> their objects.]

Positional Encoding

The Need for Position Information

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, "The cat sat" and "sat cat The" would yield the same representation for each token, merely in a different order. Positional encodings inject sequence order information.
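
The equivariance claim can be checked numerically. The sketch below uses random stand-in embeddings for the three tokens and a single attention head without positional encodings. All names and sizes are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


rng = np.random.default_rng(0)
n, d = 3, 8
X = rng.normal(size=(n, d))            # stand-in embeddings for "The cat sat"
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 1, 0])             # reorder to "sat cat The"

out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Permuting the input permutes the output rows in exactly the same way.
print(np.allclose(out_perm, out[perm]))
```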

Sinusoidal Positional Encoding

The original Transformer uses fixed sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position index and $i \in \{0, \ldots, d/2 - 1\}$ indexes the sine/cosine pair occupying dimensions $2i$ and $2i+1$.

Key Properties:

  1. Bounded values: $PE_{(pos, i)} \in [-1, 1]$ for all positions and dimensions
  2. Relative positions learnable: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ (a rotation within each sine/cosine pair)
  3. Unique encoding per position: Each position has a distinct encoding vector
  4. Possible generalization to unseen lengths: The encoding is defined for any position, so the model may extrapolate to sequences longer than those seen in training, although this is not guaranteed in practice

[Figure: Sinusoidal positional encoding. Encoding value (-1 to +1) versus position (0 to 30) for dimensions 0, 1, and 8.]
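
The encoding and its second property can be sketched as follows. The function name and the sizes are illustrative, and the check at the end verifies the rotation relation for one sine/cosine pair only.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len: int, d: int) -> np.ndarray:
    """Return a (max_len, d) matrix of sinusoidal encodings; d must be even."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d // 2)[None, :]                   # (1, d/2) pair index
    angles = pos / np.power(10000.0, 2 * i / d)      # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe


d = 16
pe = sinusoidal_positional_encoding(max_len=50, d=d)

# A shift by k positions is a fixed rotation within each (sin, cos) pair.
k, pair = 5, 3
omega = 1.0 / 10000.0 ** (2 * pair / d)
rotation = np.array([[np.cos(k * omega), np.sin(k * omega)],
                     [-np.sin(k * omega), np.cos(k * omega)]])
cols = slice(2 * pair, 2 * pair + 2)
shifted = pe[:-k, cols] @ rotation.T
print(np.allclose(shifted, pe[k:, cols]))
```

The rotation matrix depends on the offset k but not on the position itself, which is what makes relative offsets expressible as a linear map.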

Alternative Positional Encodings

Learned Positional Embeddings: BERT and GPT use learned position embeddings $E_{pos} \in \mathbb{R}^{d}$ stored in a lookup table. This is simpler, but the table has no entries for positions beyond the maximum length used in training.

Rotary Position Embeddings (RoPE): Encodes positions by rotating query and key vectors in 2D planes, enabling relative position awareness through dot products. Used in modern LLMs like LLaMA.

ALiBi (Attention with Linear Biases): Adds linear bias terms to attention scores based on relative distance, without explicit positional encoding.
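
The relative-position property of RoPE can be illustrated with a small sketch. It pairs adjacent dimensions and reuses the base of 10000 from the sinusoidal encoding. Both choices are assumptions, since implementations differ in how they form the 2D planes.

```python
import numpy as np


def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of a 1-D vector `x` (even length) by position-dependent angles."""
    d = x.shape[-1]
    theta = pos / base ** (np.arange(0, d, 2) / d)   # one angle per 2D plane
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same relative offset (3) at two different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))
```

Because each plane is rotated by an angle proportional to the position, the dot product of a rotated query and key depends only on the difference of their positions.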


Transformer Encoder Architecture

Encoder Block

Each Transformer encoder block consists of two sub-layers with residual connections and layer normalization:

[Figure: Transformer encoder block, repeated N times. Input embeddings plus positional encoding ($X + PE$) -> multi-head self-attention ($Q = K = V = X$) -> add and LayerNorm -> feed-forward network, $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ -> add and LayerNorm -> encoder output $Z \in \mathbb{R}^{n \times d}$.]

Mathematical Formulation

Sub-layer 1: Multi-Head Self-Attention

$$\text{MHA}(x) = \text{MultiHead}(x, x, x)$$
$$z = \text{LayerNorm}(x + \text{MHA}(x))$$

Sub-layer 2: Position-wise Feed-Forward Network

$$\text{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2$$
$$\text{output} = \text{LayerNorm}(z + \text{FFN}(z))$$

The FFN is applied independently to each position but shares the same parameters across positions, hence "position-wise."

Layer Normalization

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are computed across the feature dimension for each token, and $\gamma, \beta$ are learned scale and shift parameters.
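
The second sub-layer can be sketched with NumPy as below. The sizes, the random weights, and the choice of a hidden width four times the model width are assumptions for illustration. The final check shows the "position-wise" property, namely that a token processed alone gets the same output as when processed with the full sequence.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


def ffn(z, W1, b1, W2, b2):
    """Position-wise feed-forward network with ReLU."""
    return np.maximum(0.0, z @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32                     # toy sizes; d_ff = 4 * d is an assumption
z = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)
gamma, beta = np.ones(d), np.zeros(d)

# output = LayerNorm(z + FFN(z))
out = layer_norm(z + ffn(z, W1, b1, W2, b2), gamma, beta)

# Processing the first token on its own gives the same row.
row0 = layer_norm(z[:1] + ffn(z[:1], W1, b1, W2, b2), gamma, beta)
print(np.allclose(out[0], row0[0]))
print(out.mean(axis=-1).round(6))         # per-token mean is ~0 after LayerNorm
```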


Transformer Decoder Architecture

Decoder Block

The decoder block has three sub-layers: the encoder's self-attention becomes masked self-attention, an encoder-decoder cross-attention sub-layer is added, and the feed-forward network is unchanged:

[Figure: Transformer decoder block, repeated N times. Decoder input (shifted right), $Y + PE$ -> masked multi-head self-attention (prevents attending to future tokens; the causal mask is a lower triangular matrix) -> add and LayerNorm -> multi-head cross-attention (Q from decoder, K/V from the encoder output, $K_{enc}$, $V_{enc}$) -> add and LayerNorm -> feed-forward network (same as encoder FFN) -> add and LayerNorm; after the final layer, linear + softmax -> $P(y_t)$.]

Causal Masking

For autoregressive generation, the decoder uses a causal mask to prevent attending to future positions:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

Applied before softmax: $\text{softmax}(S + \text{Mask})$ ensures position $i$ only attends to positions $\leq i$.
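
A small sketch of the mask in action, using random numbers as stand-in attention scores. The function names are illustrative.

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """(n, n) additive mask: 0 where j <= i, -inf where j > i."""
    i, j = np.indices((n, n))
    return np.where(j <= i, 0.0, -np.inf)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)                      # exp(-inf) evaluates to 0
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n = 4
S = rng.normal(size=(n, n))            # stand-in attention scores
weights = softmax(S + causal_mask(n))
print(np.round(weights, 2))            # entries above the diagonal are 0
```

Every row still sums to 1, but the probability mass is spread only over the current and earlier positions, so the first token attends entirely to itself.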


BERT: Bidirectional Encoder Representations from Transformers

Architecture Overview

BERT (Devlin et al., 2019) uses only the Transformer encoder stack, so every token can draw on context from both directions, unlike GPT, which attends left-to-right only.

[Figure: BERT architecture (bidirectional). Input tokens [CLS] The cat [MASK] sat on [SEP] pass through Transformer encoder layers 1 to L (each with self-attention, FFN, and LayerNorm, using full bidirectional attention; BERT-base: L=12, BERT-large: L=24), producing hidden states $h_{cls}, h_1, \ldots, h_5, h_{sep}$.]

Key: Bidirectional. Each token attends to ALL other tokens (left and right).

BERT Model Variants

| Model | Layers ($L$) | Hidden ($d$) | Heads ($h$) | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
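
The parameter counts in the table can be approximated from $L$ and $d$ alone. The sketch below assumes a WordPiece vocabulary of 30,522 entries (the size used by the uncased English checkpoints), a feed-forward width of $4d$, and a pooler layer on top. None of these figures appear in the table, so they are stated here as assumptions.

```python
def bert_param_count(num_layers, d, vocab_size=30522, max_positions=512,
                     num_segments=2, d_ff=None):
    """Count encoder parameters (embeddings, layers, pooler), excluding task heads."""
    d_ff = 4 * d if d_ff is None else d_ff
    # Token, position, and segment tables plus the embedding LayerNorm
    embeddings = (vocab_size + max_positions + num_segments) * d + 2 * d
    attention = 4 * (d * d + d)               # W_Q, W_K, W_V, W_O with biases
    feed_forward = (d * d_ff + d_ff) + (d_ff * d + d)
    layer_norms = 2 * (2 * d)                 # two LayerNorms per layer
    per_layer = attention + feed_forward + layer_norms
    pooler = d * d + d                        # dense layer on the [CLS] state
    return embeddings + num_layers * per_layer + pooler


for name, L, d in (("BERT-base", 12, 768), ("BERT-large", 24, 1024)):
    print(f"{name}: {bert_param_count(L, d) / 1e6:.1f}M parameters")
```

Under these assumptions the totals come to about 109.5M and 335.1M, in line with the rounded figures usually quoted. The number of heads does not enter the count, because the heads split the same $d \times d$ projections.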

Pre-Training Objectives

BERT is pre-trained on two self-supervised tasks:

1. Masked Language Modeling (MLM)

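Randomly select 15% of input tokens and train the model to predict the originals:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash \mathcal{M}})$$

where $\mathcal{M}$ is the set of selected positions and $x_{\backslash \mathcal{M}}$ is the corrupted input. Of the selected tokens: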

  • 80% are replaced with [MASK]
  • 10% are replaced with a random token
  • 10% are left unchanged

Because [MASK] never appears during fine-tuning or inference, replacing every selected token with [MASK] would create a mismatch with pre-training; the 80/10/10 strategy mitigates this.
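
The corruption procedure can be sketched in plain Python. This version selects each token independently with probability 0.15, which only approximates selecting exactly 15% of a sequence. The function name, the toy vocabulary, and the use of None for unselected labels are illustrative choices.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None,
                special=("[CLS]", "[SEP]", "[PAD]")):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in special or rng.random() >= mask_prob:
            continue                          # special or unselected token
        labels[i] = tok                       # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
print(corrupted)
print(labels)
```

The loss is computed only at positions whose label is not None, including the positions that were left unchanged.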

2. Next Sentence Prediction (NSP)

Given sentence pairs $(A, B)$, predict whether $B$ actually follows $A$ in the corpus:

$$\mathcal{L}_{\text{NSP}} = -\left[ y \log P(\text{IsNext} \mid [\text{CLS}], A, B) + (1-y) \log P(\text{NotNext} \mid [\text{CLS}], A, B) \right]$$

where $y = 1$ if $B$ follows $A$ and $y = 0$ otherwise.

[Figure: BERT pre-training tasks. Task 1, masked language modeling: the input "The [M] sat on the [M]" passes through the bidirectional BERT encoder layers, which predict "cat" (P = 0.72) and "mat" (P = 0.65). Task 2, next sentence prediction: sentence A "The cat sat" and sentence B "on the mat" are packed as "[CLS] The cat sat [SEP] on the mat [SEP]" and classified as IsNext or NotNext from the [CLS] token.]

BERT Input Representation

The input to BERT is the sum of three embeddings:

$$\text{Input} = \text{TokenEmbed}(x) + \text{PosEmbed}(pos) + \text{SegEmbed}(seg)$$

where:

  • Token Embedding: WordPiece tokenization (max 512 tokens)
  • Position Embedding: Learned (max 512 positions)
  • Segment Embedding: Distinguishes sentence A vs B (for NSP)

Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (masked), [PAD] (padding)
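
How a sentence pair is laid out before the three embeddings are summed can be sketched as follows. The helper name is hypothetical, and the word-level tokens stand in for real WordPiece output.

```python
def pack_pair(tokens_a, tokens_b, max_len=512):
    """Build tokens, segment IDs, and position IDs for a sentence pair."""
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    if len(tokens) > max_len:
        raise ValueError(f"{len(tokens)} tokens exceed max_len={max_len}")
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(
    ["the", "cat", "sat"], ["on", "the", "mat"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"pos={pos}  seg={seg}  {tok}")
```

Each of the three ID sequences indexes its own embedding table, and the rows looked up at the same position are added to form the input vector for that token.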


GPT: Autoregressive Language Modeling

GPT Architecture

GPT (Radford et al., 2018) uses only the Transformer decoder stack, training autoregressively to predict the next token:

$$\mathcal{L}_{\text{GPT}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

[Figure: GPT autoregressive generation. Given "The cat sat on the ?", causal attention lets each token see only previous tokens; N layers of masked multi-head self-attention and feed-forward networks (GPT-3: 96 layers, 175B params) produce P(next), with illustrative top predictions mat (0.42), floor (0.18), bed (0.12).]
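
The next-token objective can be sketched from raw logits as below. This version assumes there is no beginning-of-sequence token, so the first token has no context and the sum starts at the second token. The function name and the toy inputs are illustrative.

```python
import numpy as np


def next_token_nll(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Sum of -log P(x_t | x_<t) over all tokens that have a preceding context.

    logits[t] holds the scores for the token that follows position t,
    so logits[:-1] are compared against token_ids[1:].
    """
    logits, targets = logits[:-1], token_ids[1:]
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].sum())


vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids), vocab_size))   # a model that knows nothing
print(next_token_nll(uniform_logits, token_ids))          # equals 4 * log(10)
```

A model that spreads probability uniformly over the vocabulary pays log(vocab_size) per predicted token, which gives a useful baseline when reading training curves.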

BERT vs GPT: Key Differences

| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Attention | Bidirectional | Unidirectional (causal) |
| Pre-training | MLM + NSP | Next token prediction |
| Fine-tuning | Classification, QA, NER | Generation, classification |
| Best for | Understanding tasks | Generation tasks |

BERT Variants and Evolution

  • RoBERTa (Liu et al., 2019): Removed NSP, more data, larger batches
  • ALBERT (Lan et al., 2020): Parameter sharing, factorized embeddings
  • DistilBERT (Sanh et al., 2019): Knowledge distillation, about 60% of BERT-base's parameters, about 97% of its performance
  • DeBERTa (He et al., 2021): Disentangled attention, enhanced mask decoder

Fine-Tuning Pretrained Models

Transfer Learning Paradigm

The modern NLP paradigm follows a two-stage approach:

  1. Pre-training: Learn general language representations on large corpora (self-supervised)
  2. Fine-tuning: Adapt to specific downstream tasks (supervised)

[Figure: Fine-tuning workflow. Pre-training: large corpus (books, Wikipedia), self-supervised objectives, days to weeks on TPU pods (BERT: 3.3B words; GPT-3: 300B tokens); output: pretrained model. Fine-tuning: task-specific labeled data, task head plus full model update, minutes to hours on GPU (SST-2: 67K sentences; MNLI: 393K pairs); output: task-specific model, which is then deployed as the production model.]

Task-Specific Heads

Different tasks require different output layers:

| Task | Input | Output | Head |
|---|---|---|---|
| Sentiment | [CLS] token | Binary/multi-class | Linear + softmax |
| NER | All tokens | Per-token labels | Linear + CRF |
| QA | Passage + Question | Start/end span | Two linear layers |
| Similarity | Two [CLS] | Similarity score | Cosine / MLP |
| Generation | All tokens | Next token distribution | Language head |

Hugging Face Transformers Implementation

Installation

pip install transformers datasets accelerate

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load pretrained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize input
text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_dim)
# outputs.pooler_output: (batch, hidden_dim), the [CLS] state after a dense + tanh layer

Fine-Tuning for Text Classification

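import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load the tokenizer and a model with a classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Accuracy metric passed to the Trainer below
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}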

# Tokenize
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # spelled evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

# Fine-tune
trainer.train()

Fine-Tuning for Named Entity Recognition

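from transformers import AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset
import numpy as np
import evaluate

# Load data and tokenizer
# (assumes CoNLL-2003 is available under this dataset identifier)
dataset = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=9,  # CoNLL-2003 NER tags
)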

# Tokenize with alignment
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word_id = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens
            elif word_id != prev_word_id:
                label_ids.append(label[word_id])
            else:
                label_ids.append(-100)  # Subword tokens
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_ner = dataset.map(tokenize_and_align_labels, batched=True)

# Compute metrics (seqeval expects tag strings, not integer IDs)
metric = evaluate.load("seqeval")  # requires the seqeval package
label_list = dataset["train"].features["ner_tags"].feature.names

def compute_metrics_ner(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=2)
    # Drop positions labeled -100 (special tokens and non-initial subwords)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_preds, references=true_labels)
    return results

Using Pipelines for Inference

from transformers import pipeline

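# Sentiment Analysis (uses the pipeline's default model; the score shown is illustrative)
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is amazing!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True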
result = ner("Apple was founded by Steve Jobs in California")
# [{'entity_group': 'ORG', 'word': 'Apple', ...},
#  {'entity_group': 'PER', 'word': 'Steve Jobs', ...},
#  {'entity_group': 'LOC', 'word': 'California', ...}]

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="When was BERT published?",
    context="BERT was published by Google in October 2018.",
)
# {'answer': 'October 2018', 'score': 0.95, ...}

# Text Generation (GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)

Advanced: Custom Model Architecture

import torch
import torch.nn as nn
from transformers import AutoModel

class CustomTransformerModel(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        pooled = outputs.pooler_output  # pooled [CLS] state; assumes the encoder has a pooler (BERT does)
        logits = self.classifier(pooled)
        return logits

# Usage
model = CustomTransformerModel("bert-base-uncased", num_classes=5)

Key Takeaways

  1. Self-attention enables parallel processing of sequences with direct connections between all positions, solving the sequential bottleneck of RNNs.

  2. Multi-head attention allows the model to capture diverse relationships (syntactic, semantic, positional) simultaneously.

  3. Positional encoding injects sequence order information into the permutation-equivariant self-attention mechanism.

  4. BERT (encoder-only) excels at understanding tasks through bidirectional pre-training with MLM and NSP objectives.

  5. GPT (decoder-only) excels at generation tasks through autoregressive next-token prediction.

  6. Fine-tuning pretrained models on task-specific data achieves strong performance with minimal labeled data and compute.

  7. Hugging Face Transformers provides a unified API for accessing, fine-tuning, and deploying transformer models.


References

  1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
  2. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
  3. Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
  4. Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
  5. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
  6. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  7. Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP.
