Attention Mechanisms Deep Dive — The Key to Modern AI

Attention mechanisms changed deep learning by letting models focus dynamically on the relevant parts of the input. From machine translation to image generation, attention is the core innovation behind the Transformer and the models built on it, such as BERT and GPT.

Key point 1: Self-attention connects any two positions in $O(1)$ sequential operations, so a whole sequence can be processed in parallel
Key point 2: Multi-head attention captures different types of relationships simultaneously
Key point 3: Scaled dot-product attention is the building block of Transformer-based architectures

"Attention is all you need — and it changed everything."

Why Attention?

Classic encoder-decoder (seq2seq) models without attention compress the entire input into a single fixed-length vector, creating an information bottleneck. Attention solves this by allowing the decoder to "look at" all encoder states at each decoding step.

Definition: Attention Mechanism

Attention computes a weighted sum of values (encoder states) where the weights are determined by the compatibility between a query (decoder state) and keys (encoder states):

\text{Attention}(q, K, V) = \sum_{i=1}^{T} \alpha_i v_i

where $\alpha_i = \text{softmax}_i(\text{score}(q, k_i))$ are the attention weights, which are non-negative and sum to 1.
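The weighted sum above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a plain dot product as the compatibility score, and the toy vectors and the helper names `softmax` and `attend` are made up for the example.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def attend(query, keys, values):
    """Weighted sum of values; weights come from query-key dot products."""
    scores = keys @ query        # (T,) one compatibility score per position
    weights = softmax(scores)    # (T,) non-negative, sums to 1
    context = weights @ values   # (d_v,) weighted sum of the value vectors
    return context, weights

# Toy example: T = 3 encoder states of dimension 2 (values chosen for illustration)
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = keys.copy()             # encoder states serve as both keys and values
query = np.array([2.0, 0.0])

context, weights = attend(query, keys, values)
print("weights:", weights, "sum:", weights.sum())
print("context:", context)
```

The first and third keys have the same dot product with the query, so they receive equal weight, while the second key (orthogonal to the query) receives the least.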

Bahdanau Attention

Definition: Bahdanau (Additive) Attention

Proposed by Bahdanau et al. (2015), this uses a learned feedforward network to compute alignment scores:

e_{ij} = v^T \tanh(W_s s_{i-1} + W_h h_j)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

Bahdanau Attention Score

e_{ij} = v^T \tanh(W_s s_{i-1} + W_h h_j)

Here,

$e_{ij}$ = Alignment score between decoder step $i$ and encoder position $j$
$v$ = Learnable weight vector
$W_s$ = Decoder state projection
$W_h$ = Encoder state projection
$s_{i-1}$ = Previous decoder hidden state
$h_j$ = Encoder hidden state at position $j$
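As a rough sketch of the three Bahdanau equations (score, softmax, context vector), the NumPy snippet below computes them for one decoding step. The dimensions, the random weights, the alignment size `d_a`, and the function name `bahdanau_context` are illustrative assumptions; in a real model the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6      # d_a (alignment layer size) is an assumed hyperparameter

W_s = rng.normal(size=(d_a, d_s))  # decoder state projection
W_h = rng.normal(size=(d_a, d_h))  # encoder state projection
v = rng.normal(size=d_a)           # weight vector (learned in practice, random here)

H = rng.normal(size=(T, d_h))      # encoder hidden states h_1..h_T
s_prev = rng.normal(size=d_s)      # previous decoder state s_{i-1}

def bahdanau_context(s_prev, H):
    # e_ij = v^T tanh(W_s s_{i-1} + W_h h_j), computed for all j at once
    e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v   # (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax over encoder positions
    c = alpha @ H                                # context vector c_i, shape (d_h,)
    return c, alpha, e

c, alpha, e = bahdanau_context(s_prev, H)
print("alpha:", alpha)
print("context shape:", c.shape)
```

The term `W_h h_j` does not depend on the decoder state, so an implementation could compute it once per input sequence and reuse it at every decoding step.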

Luong Attention

Definition: Luong (Multiplicative) Attention

Proposed by Luong et al. (2015), this uses simpler dot-product or general scoring:

Dot: $e_{ij} = s_i^T h_j$

General: $e_{ij} = s_i^T W_a h_j$

Concat: $e_{ij} = v^T \tanh(W_a [s_i; h_j])$

Luong General Score

e_{ij} = s_i^T W_a h_j

Here,

$s_i$ = Decoder hidden state at step $i$
$W_a$ = Learnable alignment matrix
$h_j$ = Encoder hidden state at position $j$

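Type	Score Function	Cost per Score	Parameters
Dot	$s^T h$	$O(d)$	None
General	$s^T W h$	$O(d^2)$	$W \in \mathbb{R}^{d \times d}$
Concat	$v^T \tanh(W[s;h])$	$O(d \cdot d_a)$	$W \in \mathbb{R}^{d_a \times 2d}$, $v \in \mathbb{R}^{d_a}$
Additive	$v^T \tanh(W_1 s + W_2 h)$	$O(d \cdot d_a)$	$W_1, W_2 \in \mathbb{R}^{d_a \times d}$, $v \in \mathbb{R}^{d_a}$

Here $d$ is the hidden size and $d_a$ is the size of the alignment layer. With $d_a \approx d$, the concat and additive scores cost $O(d^2)$ per query-key pair, the same order as the general score; only the dot score is linear in $d$.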
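The three Luong score functions can be compared side by side with a short NumPy sketch. The weights here are random placeholders, and the name `W_c` for the concat projection is a hypothetical label used only to keep it apart from the alignment matrix `W_a`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
s = rng.normal(size=d)              # decoder state s_i
H = rng.normal(size=(6, d))         # encoder states h_1..h_6, one per row
W_a = rng.normal(size=(d, d))       # alignment matrix for the "general" score
W_c = rng.normal(size=(d, 2 * d))   # concat projection (hypothetical name)
v = rng.normal(size=d)              # weight vector for the concat score

def score_dot(s, H):
    return H @ s                    # s^T h_j for every j

def score_general(s, H):
    return H @ (W_a.T @ s)          # s^T W_a h_j for every j

def score_concat(s, H):
    sh = np.concatenate([np.tile(s, (len(H), 1)), H], axis=1)  # rows are [s; h_j]
    return np.tanh(sh @ W_c.T) @ v  # v^T tanh(W [s; h_j]) for every j

for name, fn in [("dot", score_dot), ("general", score_general), ("concat", score_concat)]:
    print(name, fn(s, H).shape)
```

Each function returns one score per encoder position; applying a softmax to any of them gives the attention weights $\alpha_{ij}$.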

Self-Attention

Definition: Self-Attention

Self-attention computes attention within a single sequence: queries, keys, and values are all projections of the same input $X$, so each position can attend to every position, including itself:

\text{Self-Attention}(X) = \text{softmax}\left(\frac{XW_Q (XW_K)^T}{\sqrt{d_k}}\right) XW_V

Each token can directly attend to every other token, so any two positions are connected in $O(1)$ sequential operations. The price is a score matrix with $n^2$ entries for a sequence of length $n$.
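A single-head, unbatched version of the self-attention formula can be sketched in NumPy as follows. The projection matrices are random stand-ins for learned weights, and the sizes are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """softmax(X W_Q (X W_K)^T / sqrt(d_k)) X W_V for one sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every position scored against every position
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8             # illustrative sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)       # (4, 8) (4, 4)
```

The `(n, n)` weight matrix is computed with matrix products only, with no loop over positions, which is what makes the operation parallel across the sequence.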

Scaled Dot-Product Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Why Scale by sqrt(d_k)?

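Without scaling, dot products grow in magnitude with the dimension $d_k$: if the components of $q$ and $k$ are independent with mean 0 and variance 1, then $q \cdot k$ has variance $d_k$. Large scores push the softmax into regions with extremely small gradients. Dividing by $\sqrt{d_k}$ brings the variance back to 1 and keeps training stable. This scaling was introduced in "Attention Is All You Need" (Vaswani et al., 2017).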
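The variance argument is easy to check numerically. The sketch below draws random queries and keys with unit-variance components (an assumption about the inputs, not something guaranteed in a trained model) and compares the empirical variance of the raw and scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 2000
results = {}

for d_k in (4, 64, 256):
    q = rng.normal(size=(num_samples, d_k))   # components ~ N(0, 1)
    k = rng.normal(size=(num_samples, d_k))
    dots = np.sum(q * k, axis=1)              # one dot product per sample
    raw_var = dots.var()                      # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var()  # expected to be close to 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.2f}")
```

The raw variance should track $d_k$ while the scaled variance stays near 1, up to sampling noise.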

Multi-Head Attention

Definition: Multi-Head Attention

Instead of a single attention function, project queries, keys, and values into $h$ different subspaces, compute attention in parallel, and concatenate:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O

where each head:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Why Multiple Heads?

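Different heads can learn to attend to different types of relationships: syntactic structure, semantic similarity, positional patterns, etc. Typically $h = 8$ or $16$ with $d_k = d_{\text{model}} / h$. For example, the base Transformer of Vaswani et al. (2017) uses $h = 8$ and $d_{\text{model}} = 512$, so $d_k = 64$.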
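One consequence of choosing $d_k = d_{model} / h$ is that the number of projection weights does not depend on the number of heads. The small sketch below counts them; it ignores biases and assumes the value dimension equals $d_k$, and the function name is made up for the example.

```python
def mha_projection_params(d_model, h):
    """Count weights (biases ignored) in the Q, K, V and output projections."""
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h
    per_head = 3 * d_model * d_k     # W_i^Q, W_i^K, W_i^V, each of shape (d_model, d_k)
    output = (h * d_k) * d_model     # W^O maps the concatenated heads back to d_model
    return h * per_head + output

for h in (1, 8, 16):
    print(h, mha_projection_params(512, h))   # 1048576 for every h
```

The total is always $4 d_{model}^2$: adding heads splits the same capacity into more, smaller subspaces rather than adding parameters.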

Theorem: Attention as Differentiable Lookup

Theorem: Attention as Soft Lookup

Hard lookup (dictionary retrieval) returns $v_i$ for a specific key $k_i$. Attention generalizes this to a differentiable soft lookup that returns a weighted average of all values, with weights determined by key-query similarity. As the temperature $\tau \to 0$, the weights concentrate on the position with the highest score (assuming it is unique), so attention converges to a hard lookup of the best-matching key.

Temperature-Dependent Attention

\alpha_i = \frac{\exp(e_i / \tau)}{\sum_j \exp(e_j / \tau)}

Here,

$\tau$ = Temperature parameter
$e_i$ = Attention score for position $i$
$\alpha_i$ = Attention weight
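The effect of the temperature can be seen on a small example. The scores below are arbitrary values chosen for illustration, and `softmax_with_temperature` is a helper written for this sketch.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """alpha_i = exp(e_i / tau) / sum_j exp(e_j / tau), computed stably."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()                  # subtract the max to avoid overflow for small tau
    w = np.exp(z)
    return w / w.sum()

scores = [2.0, 1.0, 0.5]             # illustrative attention scores
for tau in (10.0, 1.0, 0.1, 0.01):
    print(tau, np.round(softmax_with_temperature(scores, tau), 3))
```

At high temperature the weights are close to uniform; as $\tau$ shrinks, nearly all the weight moves to the first position, which has the highest score.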

Full PyTorch Implementation

Example: Multi-Head Attention from Scratch

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, dropout=0.1):
        super().__init__()
        self.scale = math.sqrt(d_k)
        self.dropout = nn.Dropout(dropout)

    def forward(self, Q, K, V, mask=None):
        # Q, K, V: (batch, num_heads, seq_len, d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

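        if mask is not None:
            # mask must broadcast to (batch, num_heads, seq_len_q, seq_len_k);
            # positions where mask == 0 are excluded from the softmax
            scores = scores.masked_fill(mask == 0, float('-inf'))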

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        output = torch.matmul(attn_weights, V)
        return output, attn_weights


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention(self.d_k, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        Q = self.W_q(query)
        K = self.W_k(key)
        V = self.W_v(value)

        # Reshape: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention
        output, attn_weights = self.attention(Q, K, V, mask)

        # Reshape: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        # Final linear projection
        output = self.W_o(output)
        return output, attn_weights


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


# Test multi-head attention
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq_len=10, d_model=512
output, weights = mha(x, x, x)  # self-attention
print(f"Output shape: {output.shape}")   # (2, 10, 512)
print(f"Weights shape: {weights.shape}") # (2, 8, 10, 10)

Attention Patterns

Understanding Attention Patterns

Local attention: Tokens attend mainly to nearby positions — common in early layers
Global attention: Some tokens attend to all positions — e.g., [CLS] token in BERT
Distributed attention: Weights spread across many positions — captures semantic similarity
Block attention: Attention confined to windows — used in efficient Transformers (Longformer, BigBird)
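The windowed pattern in the last item can be expressed as a mask. The sketch below builds a simple sliding-window mask in NumPy, using the convention that 1 means "may attend" and 0 means "masked out"; the function name and window size are illustrative. It covers only the local window: Longformer and BigBird combine such windows with additional global (and, in BigBird, random) connections.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Return a (seq_len, seq_len) mask with 1 where |i - j| <= window, else 0."""
    idx = np.arange(seq_len)
    distance = np.abs(idx[:, None] - idx[None, :])   # pairwise position distance
    return (distance <= window).astype(int)

mask = sliding_window_mask(seq_len=6, window=1)
print(mask)
```

With `window=1` each token sees itself and its immediate neighbours, so the mask is tridiagonal and the number of allowed pairs grows linearly with sequence length instead of quadratically.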

Practice Exercises

Implement attention variants: Code dot, general, and concat attention. Compare on a fixed sequence.
Visualize attention: Train a seq2seq model and plot attention heatmaps. What patterns emerge?
Multi-head analysis: Train with different numbers of heads (1, 4, 8, 16). How does performance change?
Efficient attention: Implement linear attention (kernel-based) and compare with standard attention on long sequences.

Key Takeaways

Summary: Attention Mechanisms

Bahdanau attention: Additive scoring, learned alignment
Luong attention: Multiplicative scoring, simpler and faster
Self-attention: Each position attends to all positions in the same sequence
Scaled dot-product: $\text{softmax}(QK^T / \sqrt{d_k}) V$ — the standard attention mechanism
Multi-head attention: Parallel attention heads capture different relationship types
Attention as soft lookup: Generalizes hard dictionary lookup to differentiable weighted average
Attention enables $O(1)$ sequential operations — fully parallelizable
Foundation of Transformers, BERT, GPT, and modern NLP
See also: Transformers for the complete architecture
