LLM Architecture Deep Dive — How Transformers Power Language Models

Modern LLMs are built on the decoder-only Transformer architecture. This guide dives deep into self-attention mechanisms, positional encoding, and the critical KV cache optimization that powers efficient inference.

  • Transformers: Decoder-only architecture is the modern standard
  • Self-Attention: The core mechanism enabling context understanding
  • KV Cache: Reduces the attention cost of each generation step from O(T²) to O(T)

Architecture is destiny—understand the model to unlock its potential.

The Transformer Architecture

The original Transformer (Vaswani et al., 2017) uses an encoder-decoder structure. However, modern LLMs predominantly use a decoder-only architecture, which simplifies training and inference while maintaining strong performance.

Definition: Decoder-Only Transformer

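A decoder-only Transformer applies causal (masked) self-attention so that each position can attend only to itself and earlier positions, which prevents information leaking from future tokens. During training all positions are processed in parallel; during generation tokens are produced one at a time. Each layer consists of: (1) masked multi-head self-attention, (2) layer normalization, (3) a position-wise feed-forward network, and (4) residual connections around the attention and feed-forward sublayers.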

Self-Attention Mechanism

The core computation in each Transformer layer is self-attention, which allows each token to attend to all previous tokens in the sequence.

Scaled Dot-Product Self-Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here,

  • Q = Query matrix (n × d_k)
  • K = Key matrix (n × d_k)
  • V = Value matrix (n × d_v)
  • d_k = Dimension of keys/queries
  • n = Sequence length

In decoder-only models, we apply causal masking to prevent attention to future tokens:

Causal Self-Attention

$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V$$

Here,

  • M = Causal mask matrix where M_{ij} = 0 if i ≥ j, else -∞
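To make the masking concrete, here is a minimal sketch of causal scaled dot-product attention in plain Python. It follows the formula above term by term; the helper names (`softmax`, `causal_attention`) and the toy Q, K, V values are illustrative assumptions, and a real implementation would use batched tensor operations instead of nested loops.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Compute softmax((Q K^T + M) / sqrt(d_k)) V with M_ij = 0 if i >= j else -inf."""
    n, d_k = len(Q), len(Q[0])
    out, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            dot = sum(q * k for q, k in zip(Q[i], K[j]))
            mask = 0.0 if i >= j else float("-inf")  # block attention to future tokens
            scores.append((dot + mask) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))])
    return out, weights

# Toy inputs (hypothetical values): 3 tokens, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # entries above the diagonal are 0.0
```

The first token can only attend to itself, so its output equals its own value vector; every later row spreads its weight over the current and earlier positions only.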

Multi-Head Attention

Multi-head attention allows the model to attend to information from different representation subspaces:

Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

Here,

  • h = Number of attention heads
  • head_i = Attention head i = Attention(QW_i^Q, KW_i^K, VW_i^V)
  • W^O = Output projection matrix
  • W_i^Q, W_i^K, W_i^V = Learned projection matrices for head i

The per-head dimensions are typically:

  • d_k = d_model / h
  • d_v = d_model / h

For a model with d_model = 4096 and h = 32 heads, each head has dimension d_k = d_v = 128. This is the configuration of LLaMA 2 7B; LLaMA 2 70B uses d_model = 8192 with 64 query heads, which gives the same per-head dimension of 128.
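The per-head arithmetic can be checked with a short sketch. The helpers below (`head_dim`, `split_heads`, `concat_heads`) are hypothetical names introduced for illustration; they show how a d_model-length vector is cut into h contiguous slices and joined again, which is the layout the Concat step assumes. In a real model the split is applied to the projected queries, keys, and values.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head dimension d_k = d_v = d_model / h."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads

def split_heads(x, n_heads):
    """Split one d_model-length vector into n_heads contiguous slices."""
    d = head_dim(len(x), n_heads)
    return [x[i * d:(i + 1) * d] for i in range(n_heads)]

def concat_heads(heads):
    """Inverse of split_heads: join per-head slices back into one vector."""
    return [v for head in heads for v in head]

print(head_dim(4096, 32))  # 128
print(head_dim(8192, 64))  # 128

x = list(range(8))         # toy vector with d_model = 8
heads = split_heads(x, 4)  # 4 heads of dimension 2
print(heads)               # [[0, 1], [2, 3], [4, 5], [6, 7]]
```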

Positional Encoding

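Self-attention has no built-in notion of token order (without a mask it is permutation-equivariant: permuting the inputs simply permutes the outputs), so positional information must be injected explicitly. Modern LLMs use Rotary Position Embeddings (RoPE) or ALiBi.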

Rotary Position Embeddings (RoPE)

RoPE encodes position by rotating the query and key vectors in the attention computation:

RoPE Rotation

$$f(x, m) = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{d-2} \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2-1}) \\ \cos(m\theta_{d/2-1}) \end{pmatrix} + \begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ \vdots \\ -x_{d-1} \\ x_{d-2} \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2-1}) \\ \sin(m\theta_{d/2-1}) \end{pmatrix}$$

Here,

  • x = Input vector (a query or key vector)
  • m = Position index
  • θ_i = Frequency parameter, θ_i = 10000^{-2i/d} for i = 0, …, d/2 − 1
  • d = Dimension of the vector being rotated (the per-head dimension d_k, which must be even)
  • ⊗ = Element-wise multiplication

A key property of RoPE is that the attention score between tokens at positions m and n depends only on the relative distance (m - n):

RoPE Relative Attention

$$\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$$

Here,

  • q, k = Query and key vectors
  • m, n = Absolute positions
  • g = Function of relative position (m - n)
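The relative-position property is easy to verify numerically. The sketch below implements the rotation formula above for a single vector in plain Python and checks that shifting both positions by the same amount leaves the dot product unchanged. The function names (`rope`, `dot`) and the test vectors are assumptions made for this example.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs (x_{2i}, x_{2i+1}) by the angle m * theta_i."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s      # x_{2i} cos - x_{2i+1} sin
        out[2 * i + 1] = x1 * c + x0 * s  # x_{2i+1} cos + x_{2i} sin
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
far = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, offset 3
print(abs(near - far) < 1e-9)           # True: the score depends only on m - n
```

Because each pair is rotated, the vector norm is preserved, and position 0 leaves the vector unchanged.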

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance, without any learned parameters:

ALiBi Bias

$$\text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d_k}} - m \cdot |i - j|\right)$$

Here,

  • m = Head-specific slope (geometric sequence)
  • i, j = Token positions
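As a rough illustration, the bias term can be built as follows. The slope schedule 2^(-8/h), 2^(-16/h), … is the geometric sequence used in the ALiBi paper when the number of heads h is a power of two; the function names are hypothetical, and the sketch builds the full symmetric |i - j| matrix, leaving causal masking to be applied separately.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8*1/h), 2^(-8*2/h), ..., assuming h is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the attention scores: -slope * |i - j|."""
    return [[slope * -abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])  # 0.5 0.00390625

bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Heads with a large slope penalize distant tokens heavily and so focus on local context, while heads with a small slope can look much further back.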

RoPE is the dominant positional encoding in modern LLMs (LLaMA, Mistral, Qwen). ALiBi was popularized by BLOOM and is used in some models for its simplicity and extrapolation capabilities.

The KV Cache

In autoregressive generation, we compute one token at a time. Without optimization, this requires re-computing attention for all previous tokens at each step, which is O(T²) per generation step.

Definition: KV Cache

The KV Cache stores the key and value tensors from previous tokens, avoiding redundant computation. At each generation step, we only compute Q, K, V for the new token and attend to the cached K, V from all previous tokens.
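The following single-layer, single-head sketch shows the idea under simplifying assumptions: the projection matrices `Wq`, `Wk`, `Wv`, the toy token vectors, and the `KVCache` class are all made up for illustration. With the cache, each step projects only the new token; without it, the keys and values of the whole prefix are re-projected at every step, yet both paths produce the same output.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Attention of a single query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values, self.kv_projections = [], [], 0

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token, append its K and V, then attend over the cache.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        self.kv_projections += 1
        return attend(matvec(Wq, x), self.keys, self.values)

def no_cache_step(xs, Wq, Wk, Wv):
    """Re-project K and V for the whole prefix; return the last position's output."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values), len(xs)

Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [-0.2, 0.3]]
Wv = [[1.0, 1.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
cached_out, uncached_out, uncached_projections = [], [], 0
for t, x in enumerate(tokens):
    cached_out.append(cache.step(x, Wq, Wk, Wv))
    out, n_proj = no_cache_step(tokens[: t + 1], Wq, Wk, Wv)
    uncached_out.append(out)
    uncached_projections += n_proj

print(cache.kv_projections, uncached_projections)  # 4 10
```

In a multi-layer model the same pattern applies per layer, with one pair of cached K and V tensors for each layer.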

KV Cache Memory

$$\text{Memory} = 2 \times L \times n_{\text{layers}} \times d_{\text{model}} \times \text{batch\_size} \times \text{precision\_bytes}$$

Here,

  • L = Sequence length
  • n_layers = Number of transformer layers
  • d_model = Model dimension
  • 2 = Factor for storing both K and V

KV Cache Example

For a 70B parameter model with 80 layers, d_model = 8192, and sequence length 4096:

  • KV cache size per token: 2 × 80 × 8192 × 2 bytes (FP16) = 2,621,440 bytes ≈ 2.6 MB (2.5 MiB)
  • For batch size 1 and sequence length 4096: 4096 × 2,621,440 bytes ≈ 10.7 GB (10 GiB)
  • This is why KV cache management is critical for LLM serving

Modern techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing each key-value head across a group of query heads. LLaMA 2 70B uses GQA with 8 KV heads (vs 64 query heads), which makes its KV cache 8 times smaller than the full multi-head figure computed above.
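The arithmetic can be reproduced with a small helper. This is a sketch that generalizes the memory formula by replacing d_model with n_kv_heads × head_dim, so that it covers both full multi-head attention and GQA; the function name and parameter names are assumptions, and the head dimension of 128 comes from 8192 / 64.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """2 (K and V) x sequence length x layers x KV width x batch x bytes per value."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * batch_size * bytes_per_value

GIB = 2 ** 30

per_token = kv_cache_bytes(1, 80, 64, 128)     # full multi-head attention, one token
mha_total = kv_cache_bytes(4096, 80, 64, 128)  # 64 KV heads (d_model = 8192)
gqa_total = kv_cache_bytes(4096, 80, 8, 128)   # 8 KV heads, as in GQA

print(per_token)        # 2621440
print(mha_total / GIB)  # 10.0
print(gqa_total / GIB)  # 1.25
```

The cache grows linearly with both sequence length and batch size, so serving many long requests at once multiplies these figures accordingly.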

Comparison: Encoder vs Decoder vs Encoder-Decoder

Encoder-Only (BERT)

  • Bidirectional attention (no masking)
  • Pre-trained with masked language modeling (MLM)
  • Best for: classification, NER, sentence similarity
  • NOT suitable for text generation

Decoder-Only (GPT)

  • Causal (masked) attention
  • Pre-trained with next-token prediction (CLM)
  • Best for: text generation, in-context learning, instruction following
  • Dominant architecture for LLMs

Encoder-Decoder (T5, BART)

  • Encoder processes input, decoder generates output
  • Cross-attention between encoder and decoder
  • Best for: translation, summarization, question answering
  • More parameters for same input/output capacity

Architecture Comparison

$$\text{BERT: } P(z|x) \quad \text{GPT: } P(x) = \prod_t P(x_t|x_{<t}) \quad \text{T5: } P(y|x) = \prod_t P(y_t|x, y_{<t})$$

Here,

  • x = Input sequence
  • y = Output sequence
  • z = Latent representation

Decoder-only models are preferred for LLMs because: (1) they use a single unified architecture for all tasks, (2) they scale more efficiently, and (3) in-context learning emerges naturally from the autoregressive objective.

Feed-Forward Network

Each Transformer layer includes a position-wise feed-forward network (FFN):

SwiGLU Feed-Forward

$$\text{FFN}(x) = \text{SwiGLU}(xW_1, W_3)W_2 = (\text{SiLU}(xW_1) \odot xW_3)W_2$$

Here,

  • W_1 = Gate projection (d_model -> d_ff)
  • W_2 = Down projection (d_ff -> d_model)
  • W_3 = Up projection (d_model -> d_ff)
  • SiLU = Swish activation: x · σ(x)
  • ⊙ = Element-wise multiplication

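Many modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU/GELU, with d_ff ≈ (8/3) × d_model, typically rounded up to a multiple of 128 or 256 for hardware efficiency. The 8/3 factor keeps the parameter count close to that of a standard two-matrix FFN with d_ff = 4 × d_model, since SwiGLU uses three weight matrices instead of two.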
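A short sketch of the sizing rule and the gating step, using only the standard library. The function names and the default rounding multiple of 128 are assumptions for illustration; individual models pick their own multiple.

```python
import math

def swiglu_hidden_dim(d_model, multiple_of=128):
    """Round (8/3) * d_model up to the next multiple of `multiple_of`."""
    raw = 8 * d_model / 3
    return multiple_of * math.ceil(raw / multiple_of)

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_gate(gate, up):
    """Element-wise SiLU(gate) * up, i.e. SiLU(xW_1) ⊙ xW_3 before the down projection."""
    return [silu(g) * u for g, u in zip(gate, up)]

print(swiglu_hidden_dim(4096))  # 11008
print(swiglu_hidden_dim(512))   # 1408

gated = swiglu_gate([0.0, 1.0], [5.0, 2.0])
print(gated[0])  # 0.0, because SiLU(0) = 0 closes the gate
```

With d_model = 4096 the rule gives 11008, and with d_model = 512 it gives 1408.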

Transformer Forward Pass

The complete forward pass through a decoder-only Transformer:

Transformer Forward Pass
$$h_0 = \text{Embed}(x) + \text{PosEnc}(x)$$
$$\text{for } l = 1 \ldots L: \quad a_l = h_{l-1} + \text{MHA}(\text{LN}(h_{l-1})), \quad h_l = a_l + \text{FFN}(\text{LN}(a_l))$$
$$\text{logits} = \text{LM\_Head}(\text{LN}(h_L))$$

Here,

  • x = Input token sequence
  • h_l = Hidden state at layer l
  • a_l = Hidden state after the attention sublayer and its residual connection
  • L = Number of layers
  • MHA = Multi-head attention
  • FFN = Feed-forward network
  • LN = Layer normalization (RMSNorm)
  • LM_Head = Output projection to vocabulary
  • PosEnc = Positional encoding (omitted at this stage in RoPE and ALiBi models, which inject position inside the attention computation)

Practical Example: Building a Minimal Transformer

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attention_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = SwiGLU(dim, hidden_dim)
    
    def forward(self, x, mask=None):
        # Self-attention with residual
        h = self.attention_norm(x)
        h, _ = self.attention(h, h, h, attn_mask=mask)
        x = x + h
        
        # FFN with residual
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h
        return x

# Example usage
dim, n_heads, hidden_dim = 512, 8, 1408  # ~3.2M parameters in this single block
block = TransformerBlock(dim, n_heads, hidden_dim)
x = torch.randn(2, 128, dim)  # (batch, seq_len, dim)
mask = torch.triu(torch.ones(128, 128) * float('-inf'), diagonal=1)
output = block(x, mask)
print(f"Output shape: {output.shape}")  # (2, 128, 512)

For a comprehensive treatment of attention mechanisms, see our module on Attention Mechanisms Deep Dive.

Practice Exercises

  1. Architecture: Draw the block diagram of a single Transformer layer in a decoder-only model. Label all components and show the flow of information.

  2. Mathematical: For a model with 32 layers, d_model = 4096, and 32 attention heads, calculate the total number of parameters in the attention blocks (Q, K, V, O projections) and the FFN layers (assuming SwiGLU with d_ff = 11008).

  3. Implementation: Implement a simplified version of the KV cache for autoregressive generation. Show how it reduces computation from O(T²) to O(T) per generation step.

  4. Analysis: Compare the memory requirements of a 7B parameter model in FP16 vs INT4 quantization. How does this affect the maximum sequence length you can use with a given GPU memory?

Key Takeaways:

  • Modern LLMs use decoder-only Transformers with causal self-attention
  • Self-attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • RoPE encodes position via rotation, enabling relative position awareness
  • The KV cache reduces the attention cost of each generation step from O(T²) to O(T)
  • SwiGLU FFN layers with RMSNorm are the modern standard
  • GQA reduces KV cache size while maintaining performance

What to Learn Next

-> Tokenization for LLMs: How LLMs break text into manageable pieces using BPE, WordPiece, and more.

-> Pretraining Language Models: Learning language from the internet with CLM, scaling laws, and data curation.

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> Prompt Engineering: Getting the most out of language models through effective input design.
