Attention Mechanisms — Deep Dive


Attention Is All You Need — The Mechanism Behind Modern AI

Master the attention mechanism that powers Transformers and enables state-of-the-art AI performance.

  • Self-attention: weigh the importance of different parts of the input
  • Multi-head attention: capture different types of relationships
  • Scalability: process all positions of a sequence in parallel, at a cost that grows quadratically with sequence length

Attention is a mechanism that lets a model dynamically focus on the relevant parts of the input when producing each element of the output. It was originally proposed for seq2seq models (Bahdanau et al., 2014) and is now the foundation of the Transformer, the architecture behind most modern language models.


Types of Attention

Figure: Attention Mechanism Types

Self-Attention
  • Input attends to itself: each token relates to all tokens
  • Used in: encoder, decoder
  • Captures internal relationships

Cross-Attention
  • Output attends to input: decoder queries (Q) attend to encoder keys and values (K, V)
  • Used in: encoder-decoder models
  • Links input and output

Causal (Masked) Attention
  • Each token attends to previous tokens only; it cannot look at future tokens
  • Used in: GPT, autoregressive models
  • Enables text generation

Attention Comparison

Type               Mask                   Complexity    Use Case
Self-Attention     None (bidirectional)   O(n²·d)       BERT, encoder
Causal Self-Attn   Lower triangular       O(n²·d)       GPT, decoder
Cross-Attention    None                   O(n·m·d)      Seq2Seq, encoder-decoder
Sparse Attention   Fixed pattern          O(n·√n·d)     Long sequences
Linear Attention   None                   O(n·d²)       Very long sequences

How this diagram works: The diagram compares the three main attention variants side by side. Self-attention (left) lets every token attend to every other token, capturing internal relationships; it is used in BERT and in encoder layers. Cross-attention (middle) lets decoder queries attend to encoder keys and values, bridging input and output in seq2seq models. Causal attention (right) masks future tokens so each position sees only itself and earlier positions, enabling autoregressive generation in GPT. The comparison table summarizes the masking patterns, computational complexity, and typical use cases of these variants, along with sparse and linear attention.
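The three variants differ only in which queries meet which keys and in the mask applied before the softmax. The NumPy sketch below illustrates this under simplifying assumptions: the raw inputs are used directly as queries and keys (no learned projections), and the names `attention_weights`, `x`, and `enc` are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_weights(q, k, mask=None):
    # q: (n, d) queries, k: (m, d) keys.
    # mask: optional boolean (n, m) array, True where attention is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked pairs get weight 0
    return softmax(scores)

rng = np.random.default_rng(0)
n, m, d = 4, 3, 8
x = rng.normal(size=(n, d))    # tokens of one sequence (hypothetical data)
enc = rng.normal(size=(m, d))  # encoder outputs (hypothetical data)

# Self-attention: the sequence attends to itself, no mask.
self_w = attention_weights(x, x)

# Causal attention: lower-triangular mask hides future positions.
causal_mask = np.tril(np.ones((n, n), dtype=bool))
causal_w = attention_weights(x, x, causal_mask)

# Cross-attention: queries from one sequence, keys from another.
cross_w = attention_weights(x, enc)
```

In `causal_w`, every entry above the diagonal is exactly zero, so the first token can only attend to itself, while `cross_w` has shape (n, m) because its keys come from a different sequence.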


Multi-Head Attention

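Figure: Multi-Head Attention: Why Multiple Heads?

Input X (n × d_model) is fed to several heads in parallel:
  • Head 1: Subject-Verb, Attn(Q1, K1, V1)
  • Head 2: Semantic Role, Attn(Q2, K2, V2)
  • Head 3: Coreference, Attn(Q3, K3, V3)
  • Head 4: Temporal, Attn(Q4, K4, V4)

The head outputs are combined as Concat(head₁, ..., head_h) and passed through the linear projection W^O, giving an output of size n × d_model. The figure also shows attention visualizations for Heads 1 to 4.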

Efficient Attention Mechanisms

Figure: Efficient Attention Variants

Variant            Complexity    Notes
Standard           O(n²·d)       Full attention matrix; n = 4096 gives 4096² ≈ 16.8M entries; memory bottleneck
Sparse Attention   O(n·√n·d)     Attend to k tokens only; fixed/local patterns; BigBird, Longformer
Linear Attention   O(n·d²)       Avoids the n×n matrix via kernel approximation; Performer, Linear Transformer
Flash Attention    O(n²·d)       Same complexity but 2-4x faster; IO-aware, tiling

Flash Attention (Dao et al., 2022)
  • Key insight: standard attention is memory-bandwidth bound, not compute-bound. The bottleneck is reading and writing the n×n attention matrix in GPU high-bandwidth memory (HBM).
  • Flash Attention uses tiling: it processes attention in blocks that fit in on-chip SRAM and avoids materializing the full n×n matrix. The result is the same as standard attention, with a reported 2-4x speedup and 5-20x less memory.
  • Widely adopted in open LLM implementations such as LLaMA and Mistral.

KV Cache (Autoregressive)
  • During generation, K and V for previous tokens are cached.
  • Each step computes Q, K, and V for the new token only and appends the new K and V to the cache, which avoids recomputing the earlier ones.
  • Cost: O(n·d) memory per layer for the cache.

Grouped Query Attention (GQA)
  • Share each K,V head across a group of Q heads.
  • With h query heads and g K,V groups, the KV cache shrinks by a factor of h/g.
  • Used in: LLaMA-2 70B, Mistral, Gemma
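The tiling idea can be checked numerically: attention can be accumulated one key/value block at a time with a running (online) softmax, so the full n×n score matrix is never stored. The sketch below is only an illustration of that arithmetic in NumPy, not the Flash Attention kernel itself: it tiles over keys only, runs on the CPU, and the function names and block size are assumptions made for this example.

```python
import numpy as np

def standard_attention(q, k, v):
    # Materializes the full (n, n) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    # Processes keys/values in blocks; only an (n, block) tile exists at a time.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))       # unnormalized output accumulator
    running_max = np.full((n, 1), -np.inf)  # row-wise max of scores seen so far
    running_sum = np.zeros((n, 1))          # row-wise softmax denominator so far
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new reference maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
n, d = 10, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
same = np.allclose(tiled_attention(q, k, v, block=4), standard_attention(q, k, v))
```

Here `same` confirms that the blockwise result matches standard attention up to floating-point error, which is why tiling changes speed and memory but not the output.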

Attention Computation

Definition: Full Attention Computation

Given an input sequence $\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}$:

Step 1: Linear projections (per head $i$, with $d_k = d_{\text{model}} / h$):

$$\mathbf{Q}_i = \mathbf{X} W_i^Q, \quad \mathbf{K}_i = \mathbf{X} W_i^K, \quad \mathbf{V}_i = \mathbf{X} W_i^V$$

where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$.

Step 2: Scaled dot-product attention:

$$\text{Attn}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) = \text{softmax}\left(\frac{\mathbf{Q}_i \mathbf{K}_i^\top}{\sqrt{d_k}}\right) \mathbf{V}_i$$
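Steps 1 and 2 for a single head can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized projection matrices; the sizes and the function name `scaled_dot_product_attention` are assumptions chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k: (n, d_k); v: (n, d_v). Returns the output and the weight matrix.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 4, 4  # illustrative sizes
x = rng.normal(size=(n, d_model))

# Step 1: per-head linear projections (random stand-ins for learned weights).
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_v))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Step 2: scaled dot-product attention.
out, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` is a probability distribution over the n positions, and `out` has shape (n, d_v). Dividing by $\sqrt{d_k}$ keeps the scores from growing with the head dimension, which would otherwise push the softmax toward a one-hot output.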

Step 3: Concatenate and project:

$$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

Parameter count per layer: $4 \cdot d_{\text{model}}^2$ (Q, K, V, and output projections), assuming $d_v = d_k = d_{\text{model}} / h$ and no bias terms.
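All three steps, together with the parameter count, can be sketched as follows. The example assumes $d_v = d_k = d_{\text{model}} / h$ and no biases, and it stacks the per-head matrices $W_i^Q$ side by side into one $d_{\text{model}} \times d_{\text{model}}$ matrix, which is equivalent to projecting per head. The weights are random placeholders and all names are illustrative.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, h):
    n, d_model = x.shape
    d_k = d_model // h

    def split(t):
        # (n, d_model) -> (h, n, d_k): one slice of the projection per head.
        return t.reshape(n, h, d_k).transpose(1, 0, 2)

    # Step 1: linear projections, then split into heads.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Step 2: scaled dot-product attention, batched over heads.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ v                        # (h, n, d_k)
    # Step 3: concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4  # illustrative sizes
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, h)
n_params = sum(w.size for w in (w_q, w_k, w_v, w_o))
```

With `d_model = 32`, `n_params` is 4 · 32² = 4096, matching the formula above, and the output keeps the input shape (n, d_model).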


Key Takeaways

Summary: Attention Mechanisms

  • Self-attention is the core of Transformers — each token attends to all others
  • Cross-attention links encoder and decoder in seq2seq models
  • Causal masking prevents looking ahead in autoregressive generation
  • Multi-head captures different relationship types simultaneously
  • Flash Attention is a key optimization: IO-aware tiling gives the same result without materializing the n×n matrix
  • KV cache speeds up autoregressive generation by caching K,V
  • GQA reduces KV cache by sharing heads
  • Sliding window (Mistral) enables efficient long sequences
  • Standard attention complexity is O(n²·d) — key limitation for long context
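The KV cache takeaway can be verified with a small experiment: decoding one token at a time while caching K and V gives the same outputs as recomputing full causal attention. The sketch below assumes a single head with random placeholder weights, and the names `attend`, `k_cache`, and `v_cache` are illustrative.

```python
import numpy as np

def attend(q, k, v):
    # q: (1, d) for the new token; k, v: (t, d) for all tokens seen so far.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
n, d = 6, 8  # illustrative sizes
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Reference: full causal attention computed over the whole sequence at once.
q_all, k_all, v_all = x @ w_q, x @ w_k, x @ w_v
scores = q_all @ k_all.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
full = (weights / weights.sum(axis=-1, keepdims=True)) @ v_all

# Incremental decoding: project only the new token, reuse cached K and V.
k_cache, v_cache, outputs = [], [], []
for t in range(n):
    x_t = x[t:t + 1]             # the newest token, shape (1, d)
    k_cache.append(x_t @ w_k)    # one new K row per step
    v_cache.append(x_t @ w_v)    # one new V row per step
    outputs.append(attend(x_t @ w_q, np.vstack(k_cache), np.vstack(v_cache)))
cached = np.vstack(outputs)

matches = np.allclose(cached, full)
```

The cache grows by one K row and one V row per generated token, which is the O(n·d) memory per layer that GQA reduces by sharing K,V heads.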

What to Learn Next

-> Transformers Apply attention in complete architectures.

-> BERT See attention in bidirectional models.

-> GPT Architecture Explore autoregressive attention.

-> RNNs and LSTMs Compare with recurrent approaches.

-> Training Deep Networks Master optimizers and regularization.

-> CNNs Understand local vs global attention.
