LLM Architecture Deep Dive — How Transformers Power Language Models

Modern LLMs are built on the decoder-only Transformer architecture. This guide dives deep into self-attention mechanisms, positional encoding, and the critical KV cache optimization that powers efficient inference.

Transformers: Decoder-only architecture is the modern standard
Self-Attention: The core mechanism enabling context understanding
KV Cache: Reduces the attention cost of each generation step from O(T²) to O(T)

Architecture is destiny—understand the model to unlock its potential.

The Transformer Architecture

The original Transformer (Vaswani et al., 2017) uses an encoder-decoder structure. However, modern LLMs predominantly use a decoder-only architecture, which simplifies training and inference while maintaining strong performance.

Definition: Decoder-Only Transformer

A decoder-only Transformer processes a token sequence with causal (masked) self-attention, so each position can attend only to itself and earlier positions and no information leaks from future tokens. Each layer consists of: (1) masked multi-head self-attention, (2) layer normalization, (3) a position-wise feed-forward network, and (4) residual connections around both sub-layers.

Self-Attention Mechanism

The core computation in each Transformer layer is self-attention, which in a decoder-only model allows each token to attend to itself and all previous tokens in the sequence.

Scaled Dot-Product Self-Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Here,

$Q$ =Query matrix (n × d_k)
$K$ =Key matrix (n × d_k)
$V$ =Value matrix (n × d_v)
$d_k$ =Dimension of keys/queries
$n$ =Sequence length
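
The formula can be traced by hand on a tiny input. The following is a minimal sketch in plain Python (standard library only, matrices as lists of rows); the helper names `softmax` and `attention` and the toy values are illustrative assumptions, not part of any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return out

# Toy example: n = 3 tokens, d_k = d_v = 2 (arbitrary values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
for row in result:
    print([round(v, 3) for v in row])
```

Because the softmax weights are non-negative and sum to 1, every output row is a convex combination of the rows of V.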

In decoder-only models, we apply causal masking to prevent attention to future tokens:

Causal Self-Attention

\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V

Here,

$M$ =Causal mask matrix where M_{ij} = 0 if i ≥ j, else -∞
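
To see the mask at work, here is a small sketch in plain Python that builds M and computes only the attention weights. The function names are hypothetical. The mask is added to the already scaled scores; since M contains only 0 and -∞, the result is the same whether it is added before or after the division by √d_k.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if i >= j (key j is not in the future of query i), else -inf."""
    return [[0.0 if i >= j else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over scores + mask; masked entries receive weight exactly 0."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    peak = max(shifted)  # always finite: the diagonal entry is never masked
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(Q, K):
    """Row i holds the attention weights of query i over keys 0..n-1."""
    d_k = len(K[0])
    M = causal_mask(len(Q))
    weights = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(masked_softmax(scores, M[i]))
    return weights

# Toy queries and keys for n = 3 tokens (arbitrary values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(Q, K)
for row in W:
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first token can only attend to itself, and no row places weight on a later position.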

Multi-Head Attention

Multi-head attention allows the model to attend to information from different representation subspaces:

Multi-Head Attention

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O

Here,

$h$ =Number of attention heads
$\text{head}_i$ =Attention head i = Attention(QW_i^Q, KW_i^K, VW_i^V)
$W^O$ =Output projection matrix
$W_i^Q, W_i^K, W_i^V$ =Learned projection matrices for head i

The per-head dimensions are typically:

d_k = d_model / h
d_v = d_model / h

For a model with d_model = 4096 and h = 32 heads, each head has dimension d_k = d_v = 128. This is the configuration used by LLaMA 2 7B; LLaMA 2 70B reaches the same head dimension with d_model = 8192 and 64 heads.
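
The per-head arithmetic can be checked with a few lines of plain Python. This is only a sketch: `head_dim` and `split_heads` are hypothetical helpers, and the split assumes the common layout in which the per-head projections are stored as one concatenated d_model-wide projection, so slicing a projected vector into h chunks yields the per-head vectors.

```python
def head_dim(d_model, n_heads):
    """Per-head dimension d_k = d_v = d_model / h (must divide evenly)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // n_heads

def split_heads(vec, n_heads):
    """Slice one token's projected d_model-vector into n_heads chunks of size d_k."""
    d_k = head_dim(len(vec), n_heads)
    return [vec[h * d_k:(h + 1) * d_k] for h in range(n_heads)]

print(head_dim(4096, 32))                 # 128
print(split_heads(list(range(8)), 4))     # [[0, 1], [2, 3], [4, 5], [6, 7]]
```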

Positional Encoding

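Since self-attention is permutation-equivariant (permuting the input tokens simply permutes the outputs, so the mechanism has no built-in notion of token order), positional information must be injected explicitly. Modern LLMs typically use Rotary Position Embeddings (RoPE) or ALiBi.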

Rotary Position Embeddings (RoPE)

RoPE encodes position by rotating the query and key vectors in the attention computation:

RoPE Rotation

f(x, m) = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{d-2} \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2-1}) \\ \cos(m\theta_{d/2-1}) \end{pmatrix} + \begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ \vdots \\ -x_{d-1} \\ x_{d-2} \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2-1}) \\ \sin(m\theta_{d/2-1}) \end{pmatrix}

Here,

$x$ =Input vector
$m$ =Position index
$\theta_i$ =Frequency parameter = 10000^{-2i/d}
$d$ =Dimension of the vector x being rotated (the per-head dimension d_k when applied to queries and keys)

A key property of RoPE is that the attention score between tokens at positions m and n depends only on the relative distance (m - n):

RoPE Relative Attention

\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)

Here,

$q, k$ =Query and key vectors
$m, n$ =Absolute positions
$g$ =Function of relative position (m - n)
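
Both the rotation and the relative-position property can be checked numerically. The sketch below is plain Python with hypothetical helper names (`rope`, `dot`) and arbitrary toy vectors; it rotates consecutive pairs (x_{2i}, x_{2i+1}) by the angle mθ_i, matching the rotation formula, and then compares two pairs of positions with the same offset.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by angle m * theta_i, theta_i = base^(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even vector dimension")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        x_even, x_odd = x[2 * i], x[2 * i + 1]
        out.append(x_even * c - x_odd * s)
        out.append(x_odd * c + x_even * s)
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Arbitrary toy query and key with d = 4.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both position pairs have the same offset m - n = 3.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 103), rope(k, 100))
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Since each pair is only rotated, the vector norm is unchanged; only the angle between query and key, and therefore the score, carries the positional information.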

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance, without any learned parameters:

ALiBi Bias

\text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d_k}} - m \cdot |i - j|\right)

Here,

$m$ =Head-specific slope (geometric sequence)
$i, j$ =Token positions
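
A short sketch of the bias term in plain Python follows. The slope schedule used here, 2^(-8/h), 2^(-16/h), and so on, is the geometric sequence described in the ALiBi paper for head counts that are powers of two; the function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Geometric sequence 2^(-8/n), 2^(-16/n), ... (n_heads assumed a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(n_tokens, slope):
    """bias[i][j] = -slope * |i - j|, added to the attention scores before softmax."""
    return [[slope * -abs(i - j) for j in range(n_tokens)] for i in range(n_tokens)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
print(slopes[:3])   # [0.5, 0.25, 0.125]
print(bias[3])      # [-1.5, -1.0, -0.5, 0.0]
```

Heads with a large slope penalize distance heavily and focus on nearby tokens, while heads with a small slope can still reach far back in the context.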

RoPE is the dominant positional encoding in modern LLMs (LLaMA, Mistral, Qwen). ALiBi was popularized by BLOOM and is used in some models for its simplicity and extrapolation capabilities.

The KV Cache

In autoregressive generation, we compute one token at a time. Without optimization, each step re-runs attention over the entire sequence generated so far, which costs O(T²) per generation step for a sequence of length T.

Definition: KV Cache

The KV Cache stores the key and value tensors from previous tokens, avoiding redundant computation. At each generation step, we only compute Q, K, V for the new token and attend to the cached K, V from all previous tokens.
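
The idea can be sketched for a single head and a single layer in plain Python. Everything here is a simplified assumption: `KVCache`, `step`, and `attend` are hypothetical names, and the per-token query, key, and value vectors are supplied directly instead of being computed from hidden states by learned projections.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

class KVCache:
    """Single-head, single-layer cache: two lists that grow by one entry per token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the new token's key and value are added; earlier entries are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy per-token vectors standing in for the projected q, k, v of each new token.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
ks = [[0.2, 0.1], [0.4, -0.3], [-0.1, 0.8], [0.7, 0.7]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]

# Reference: attention over the full prefix at every step, without a cache.
reference = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]
print(all(math.isclose(a, b) for out, ref in zip(cached_outputs, reference)
          for a, b in zip(out, ref)))
```

Each call to `step` does work proportional to the number of cached tokens, and the causal mask is implicit because the cache never contains future tokens. In a real model the uncached path would additionally have to re-project every earlier token at each step.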

KV Cache Memory

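\text{Memory} = 2 \times T \times n_{\text{layers}} \times d_{\text{model}} \times \text{batch\_size} \times \text{precision\_bytes}

Here,

$T$ =Sequence length
$n_{\text{layers}}$ =Number of transformer layers
$d_{\text{model}}$ =Model dimension (this assumes standard multi-head attention, where every query head has its own key and value head)
$2$ =For both K and V
$\text{batch\_size}$ =Number of sequences processed together
$\text{precision\_bytes}$ =Bytes per stored value (2 for FP16)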

KV Cache Example

For a 70B parameter model with 80 layers, d_model = 8192, and sequence length 4096 (assuming standard multi-head attention, so K and V each hold d_model values per layer):

KV Cache size per token: 2 × 80 × 8192 × 2 bytes (FP16) = 2,621,440 bytes ≈ 2.6 MB (2.5 MiB)
For batch size 1 and sequence length 4096: ≈ 10.7 GB (10 GiB)
This is why KV cache management is critical for LLM serving

Modern techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing key-value heads across groups of query heads. LLaMA 2 70B uses GQA with 8 KV heads (vs 64 query heads), which shrinks the KV cache by a factor of 8 relative to full multi-head attention.
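
The cache arithmetic, including the effect of GQA, can be reproduced with a small calculator. This is a sketch with a hypothetical function name; it writes d_model as n_kv_heads × head_dim so that the same formula covers multi-head attention (64 KV heads of dimension 128, i.e. d_model = 8192) and GQA (8 KV heads).

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """2 (K and V) x tokens x layers x KV width x batch x bytes per value."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * batch_size * bytes_per_value

GIB = 2 ** 30

per_token = kv_cache_bytes(1, 80, 64, 128)      # full multi-head attention
mha_total = kv_cache_bytes(4096, 80, 64, 128)
gqa_total = kv_cache_bytes(4096, 80, 8, 128)    # 8 shared KV heads

print(per_token)            # 2621440 bytes per token
print(mha_total / GIB)      # 10.0
print(gqa_total / GIB)      # 1.25
```

The cache grows linearly in both sequence length and batch size, so serving many long sequences at once multiplies these numbers accordingly.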

Comparison: Encoder vs Decoder vs Encoder-Decoder

Encoder-Only (BERT)

Bidirectional attention (no masking)
Pre-trained with masked language modeling (MLM)
Best for: classification, NER, sentence similarity
NOT suitable for text generation

Decoder-Only (GPT)

Causal (masked) attention
Pre-trained with next-token prediction (CLM)
Best for: text generation, in-context learning, instruction following
Dominant architecture for LLMs

Encoder-Decoder (T5, BART)

Encoder processes input, decoder generates output
Cross-attention between encoder and decoder
Best for: translation, summarization, question answering
More parameters for same input/output capacity

Architecture Comparison

\text{BERT: } P(z|x) \quad \text{GPT: } P(x) = \prod_t P(x_t|x_{<t}) \quad \text{T5: } P(y|x) = \prod_t P(y_t|x, y_{<t})

Here,

$x$ =Input sequence
$y$ =Output sequence
$z$ =Latent representation
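
The autoregressive factorization used by decoder-only models is easy to evaluate once the per-step conditional probabilities are known. The sketch below uses made-up probabilities for a four-token sequence and a hypothetical helper name; summing logs rather than multiplying probabilities avoids underflow on long sequences.

```python
import math

def sequence_log_prob(cond_probs):
    """log P(x) = sum over t of log P(x_t | x_<t)."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical conditionals P(x_t | x_<t) for a 4-token sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]
log_p = sequence_log_prob(cond_probs)
print(round(log_p, 4))
print(round(math.exp(log_p), 6))   # equals the product of the four probabilities
```

The negative of this quantity, averaged over tokens, is the next-token prediction loss that decoder-only models are trained to minimize.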

Decoder-only models are preferred for LLMs because: (1) they use a single unified architecture for all tasks, (2) they scale more efficiently, and (3) in-context learning emerges naturally from the autoregressive objective.

Feed-Forward Network

Each Transformer layer includes a position-wise feed-forward network (FFN):

SwiGLU Feed-Forward

\text{FFN}(x) = \text{SwiGLU}(xW_1, W_3)W_2 = (\text{SiLU}(xW_1) \odot xW_3)W_2

Here,

$W_1$ =Gate projection (d_model -> d_ff)
$W_2$ =Down projection (d_ff -> d_model)
$W_3$ =Up projection (d_model -> d_ff)
$\text{SiLU}$ =Swish activation: x · σ(x)
$\odot$ =Element-wise multiplication

Modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU/GELU, with d_ff = (8/3) × d_model (typically rounded to a multiple of 128 for hardware efficiency).
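
The sizing rule can be worked through numerically. The sketch below is one plausible reading of it (floor of 8/3 × d_model, then round up to the next multiple); the function names and the exact rounding order are assumptions, and individual models may round differently.

```python
def swiglu_hidden_dim(d_model, multiple_of=128):
    """Floor of (8/3) * d_model, rounded up to the next multiple of `multiple_of`."""
    raw = (8 * d_model) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

def swiglu_param_count(d_model, d_ff):
    """Three bias-free matrices: gate and up (d_model x d_ff), down (d_ff x d_model)."""
    return 3 * d_model * d_ff

d_ff = swiglu_hidden_dim(4096)
print(d_ff)                             # 11008
print(swiglu_param_count(4096, d_ff))   # 135266304
print(swiglu_hidden_dim(512))           # 1408
```

The factor 8/3 keeps the parameter count close to that of a classic two-matrix FFN with d_ff = 4 × d_model: 3 × d_model × (8/3) d_model = 8 × d_model², the same as 2 × d_model × 4 d_model.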

Transformer Forward Pass

The complete forward pass through a decoder-only Transformer:

Transformer Forward Pass

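h_0 = \text{Embed}(x) \quad \text{for } l = 1 \ldots L: \quad \tilde{h}_l = h_{l-1} + \text{MHA}(\text{LN}(h_{l-1})), \quad h_l = \tilde{h}_l + \text{FFN}(\text{LN}(\tilde{h}_l)) \quad \text{logits} = \text{LM\_Head}(\text{LN}(h_L))

Here,

$x$ =Input token sequence
$h_l$ =Hidden state at layer l
$\tilde{h}_l$ =Hidden state after the attention sub-layer of layer l (input to the FFN sub-layer)
$L$ =Number of layers
$MHA$ =Causal multi-head attention (with RoPE or ALiBi, positional information is applied here rather than added to the embeddings)
$FFN$ =Feed-forward network
$LN$ =Layer normalization (RMSNorm)
$LM\_Head$ =Output projection to vocabulary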

Practical Example: Building a Minimal Transformer

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-feature scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        # Normalize by the RMS over the feature dimension (no mean subtraction)
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward network: (SiLU(x W1) * x W3) W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attention_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = SwiGLU(dim, hidden_dim)
    
    def forward(self, x, mask=None):
        # Self-attention with residual
        h = self.attention_norm(x)
        h, _ = self.attention(h, h, h, attn_mask=mask)
        x = x + h
        
        # FFN with residual
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h
        return x

# Example usage
dim, n_heads, hidden_dim = 512, 8, 1408  # a single block of this size has ~3.2M parameters
block = TransformerBlock(dim, n_heads, hidden_dim)
x = torch.randn(2, 128, dim)  # (batch, seq_len, dim)
# Additive causal mask: 0 on and below the diagonal, -inf above it (future positions)
mask = torch.triu(torch.full((128, 128), float('-inf')), diagonal=1)
output = block(x, mask)
print(f"Output shape: {output.shape}")  # torch.Size([2, 128, 512])
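
The size of one such block can be estimated with simple arithmetic. The sketch below uses a hypothetical helper and assumes the block keeps the default bias terms of `nn.MultiheadAttention` (one bias per Q, K, V, and output projection) while the SwiGLU layers are bias-free, as in the code above.

```python
def block_param_count(dim, hidden_dim, attention_bias=True):
    """Parameter count of one pre-norm block: attention + SwiGLU FFN + two RMSNorms."""
    attn = 4 * dim * dim              # Q, K, V, and output projections
    if attention_bias:
        attn += 4 * dim               # one bias vector per projection
    ffn = 3 * dim * hidden_dim        # gate, up, and down projections (no biases)
    norms = 2 * dim                   # one scale vector per RMSNorm
    return attn + ffn + norms

total = block_param_count(512, 1408)
print(f"{total:,}")   # 3,214,336
```

The FFN accounts for roughly two thirds of the block's parameters, which is typical when d_ff is about (8/3) × d_model.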

For a comprehensive treatment of attention mechanisms, see our module on Attention Mechanisms Deep Dive.

Practice Exercises

Architecture: Draw the block diagram of a single Transformer layer in a decoder-only model. Label all components and show the flow of information.
Mathematical: For a model with 32 layers, d_model = 4096, and 32 attention heads, calculate the total number of parameters in the attention blocks (Q, K, V, O projections) and the FFN layers (assuming SwiGLU with d_ff = 11008).
Implementation: Implement a simplified version of the KV cache for autoregressive generation. Show how it reduces computation from O(T²) to O(T) per generation step.
Analysis: Compare the memory requirements of a 7B parameter model in FP16 vs INT4 quantization. How does this affect the maximum sequence length you can use with a given GPU memory?

Key Takeaways:

Modern LLMs use decoder-only Transformers with causal self-attention
Self-attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V
RoPE encodes position via rotation, enabling relative position awareness
The KV cache reduces the attention cost of each generation step from O(T²) to O(T)
SwiGLU FFN layers with RMSNorm are the modern standard
GQA reduces KV cache size while maintaining performance

What to Learn Next

-> Tokenization for LLMs: How LLMs break text into manageable pieces using BPE, WordPiece, and more.

-> Pretraining Language Models: Learning language from the internet with CLM, scaling laws, and data curation.

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> Prompt Engineering: Getting the most out of language models through effective input design.
