Why Inference Optimization Matters
A single LLM inference call can consume anywhere from milliseconds to seconds of GPU time. At scale, inference often dominates total operational expenditure, so improving inference throughput and latency is typically one of the most impactful LLMOps activities for reducing cost.
Quantization
Quantization reduces model weight precision from FP16/FP32 to lower-bit representations, reducing memory usage and increasing throughput.
Quantization Methods
| Method | Bits | Speedup | Quality Impact | Memory Reduction |
|---|---|---|---|---|
| FP16 (baseline) | 16 | 1.0x | None | 1.0x |
| INT8 | 8 | ~1.5x | Minimal | 2.0x |
| INT4 (GPTQ) | 4 | ~2.5x | Moderate | 4.0x |
| INT4 (AWQ) | 4 | ~2.5x | Low-Moderate | 4.0x |
| GGUF (llama.cpp) | 2-8 | Variable | Variable | Variable |
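The memory column in the table follows from simple arithmetic: weight storage scales linearly with bits per weight. The sketch below is a rough estimate only. It assumes a hypothetical model with exactly 7 billion parameters and ignores quantization metadata (scales, zero points), activations, and the KV cache; `weight_memory_gb` is an illustrative helper, not a library function.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given bit width."""
    return num_params * bits / 8 / 1e9

# Hypothetical 7-billion-parameter model at the bit widths from the table
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

In practice, real checkpoints land somewhat above these figures because per-group scales and any unquantized layers add overhead.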
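Definition: Quantization Error
For uniform round-to-nearest quantization over the range [w_{min}, w_{max}] with step size \Delta = (w_{max} - w_{min}) / (2^b - 1), the error for a weight w quantized to b bits is bounded by half a step: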
\epsilon = |w - \text{quantize}(w, b)| \leq \frac{\Delta}{2} = \frac{w_{max} - w_{min}}{2^{b+1} - 2}
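The bound above can be checked numerically. The following is a minimal sketch that assumes plain uniform round-to-nearest quantization over the observed weight range; the helper names (`quantize`, `error_bound`, `worst_error`) and the random test weights are illustrative, and real methods such as GPTQ or AWQ use more elaborate per-group schemes.

```python
import random

def quantize(w, w_min, w_max, bits):
    """Round w to the nearest of 2**bits evenly spaced levels in [w_min, w_max]."""
    steps = 2 ** bits - 1
    delta = (w_max - w_min) / steps        # step size (Delta)
    index = round((w - w_min) / delta)     # nearest level index
    index = min(max(index, 0), steps)      # clamp to the valid range
    return w_min + index * delta

def error_bound(w_min, w_max, bits):
    """Delta / 2, written in the same form as the formula above."""
    return (w_max - w_min) / (2 ** (bits + 1) - 2)

def worst_error(weights, bits):
    """Largest observed quantization error over a list of weights."""
    w_min, w_max = min(weights), max(weights)
    return max(abs(w - quantize(w, w_min, w_max, bits)) for w in weights)

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(10_000)]  # synthetic weights
for bits in (8, 4, 2):
    bound = error_bound(min(weights), max(weights), bits)
    print(f"{bits}-bit: worst error {worst_error(weights, bits):.5f}, bound {bound:.5f}")
```

Each halving of the bit width roughly squares the number of lost levels, which is why the bound grows quickly below 4 bits.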
Post-Training Quantization with GPTQ
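from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ is calibration-based: quantizing an unquantized checkpoint needs a
# calibration dataset and the tokenizer used to encode it.
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,      # One set of quantization parameters per 128 weights
    desc_act=True,       # Quantize columns in order of decreasing activation size
    damp_percent=0.01,   # Hessian dampening factor
    sym=True,            # Symmetric quantization
    dataset="c4",        # Calibration data
    tokenizer=tokenizer,
)

# Quantizes while loading (requires optimum and a GPTQ backend such as auto-gptq)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

# Memory usage: ~4GB instead of ~14GB for FP16
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")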
Activation-Aware Weight Quantization (AWQ)
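AWQ uses activation statistics from a small calibration set to identify the salient weight channels, then protects them with per-channel scaling before quantization rather than keeping them in higher precision. This typically yields better quality than naive round-to-nearest quantization at the same bit width.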
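from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    safetensors=True
)
# The tokenizer is needed to encode the calibration data
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# AWQ quantization: calibration activations determine which channels to protect
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="pileval"
)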
KV Cache Management
The Key-Value cache stores previously computed attention keys and values, avoiding redundant computation during autoregressive generation.
\text{KV Cache Size} = 2 \times L \times n_{heads} \times d_{head} \times s \times \text{sizeof(dtype)}
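Where L = number of layers, n_{heads} = number of key/value heads (equal to the number of attention heads unless the model uses grouped-query or multi-query attention), d_{head} = head dimension, s = sequence length, and sizeof(dtype) = bytes per element (2 for FP16). The leading factor of 2 accounts for storing both keys and values.
For a 70B parameter model with 80 layers, 64 heads, a head dimension of 128, and a sequence length of 4096, stored in FP16:
KV Cache = 2 × 80 × 64 × 128 × 4096 × 2 bytes ≈ 10.7 GB per sequence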
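The arithmetic above is easy to reproduce and vary. This sketch simply evaluates the KV cache formula; `kv_cache_bytes` is an illustrative helper, and the second call uses a hypothetical configuration with 8 key/value heads to show how sharing KV heads across attention heads shrinks the cache.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# The example from the text: 80 layers, 64 heads, head dimension 128, 4096 tokens, FP16
full = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
print(f"64 KV heads: {full / 1e9:.1f} GB per sequence")

# Hypothetical variant with 8 KV heads (an assumption, not a figure from the text)
shared = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f" 8 KV heads: {shared / 1e9:.1f} GB per sequence")
```

Because the size is linear in sequence length and in the number of concurrent sequences, the KV cache rather than the weights often limits batch size at long context.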
PagedAttention (vLLM)
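vLLM introduces PagedAttention, which manages the KV cache in fixed-size blocks (pages) instead of one contiguous region per sequence. This largely eliminates external fragmentation and limits wasted space to the last, partially filled block of each sequence.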
from vllm import LLM, SamplingParams

# vLLM automatically manages KV cache with PagedAttention
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,   # Fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,  # Upper bound on tokens processed per iteration
    max_num_seqs=256,             # Upper bound on concurrent sequences
    block_size=16                 # KV cache block size in tokens
)

# Efficient batching: requests share KV cache blocks when possible
prompts = ["Explain quantum computing", "What is machine learning?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
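To see why block-based allocation helps, compare how many token slots are reserved under each scheme. This is a toy accounting sketch, not vLLM's allocator: the sequence lengths and the 2048-token reservation are made-up values, and `contiguous_slots` and `paged_slots` are hypothetical helpers.

```python
import math

BLOCK_SIZE = 16  # tokens per block, matching block_size in the vLLM example

def contiguous_slots(seq_lens, max_len):
    """Slots reserved when every request pre-allocates max_len contiguous slots."""
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each request holds only the blocks it has started filling."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [37, 512, 120, 1900]   # hypothetical current lengths of four requests
used = sum(seq_lens)

for name, reserved in [("contiguous", contiguous_slots(seq_lens, max_len=2048)),
                       ("paged", paged_slots(seq_lens))]:
    waste = 1 - used / reserved
    print(f"{name:>10}: {reserved} slots reserved, {waste:.1%} unused")
```

With paging, the unused space per request is bounded by one block, so the freed memory can hold additional sequences and raise the achievable batch size.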
KV Cache Compression
For long-context models, KV cache can be compressed through eviction or pooling strategies.
# Sliding window attention: limit KV cache to recent tokens
class SlidingWindowKVCache:
    def __init__(self, window_size: int = 2048):
        self.window_size = window_size

    def update(self, key_states, value_states):
        # Keep only the last window_size tokens (sequence axis is dim -2)
        if key_states.shape[-2] > self.window_size:
            key_states = key_states[..., -self.window_size:, :]
            value_states = value_states[..., -self.window_size:, :]
        return key_states, value_states
Continuous Batching
Traditional static batching waits for all requests to complete before processing the next batch. Continuous batching (also called iteration-level scheduling) adds and removes requests from the batch dynamically.
Definition: Throughput Improvement
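With static batching, a batch of N requests occupies the GPU until its longest sequence finishes, so throughput is:
\text{Throughput}_{static} = \frac{N \cdot \bar{T}_{output}}{\max(L_i) \cdot \bar{T}_{token}}
Where N is batch size, \bar{T}_{output} is average output length, L_i is the output length of the i-th sequence, and \bar{T}_{token} is per-token latency. Continuous batching refills each slot as soon as its request finishes, so with enough queued requests throughput approaches:
\text{Throughput}_{continuous} \approx \frac{N}{\bar{T}_{token}}
This is an improvement of up to \max(L_i) / \bar{T}_{output} over static batching.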
# Comparison: static vs continuous batching
# Static batching: every slot is held until the longest sequence finishes
# Request A: 10 tokens [DONE-IDLE-------------------IDLE]
# Request B: 50 tokens [PROCESSING------------------DONE]
# Request C: 20 tokens [PROCESSING--DONE-IDLE-------IDLE]
# Continuous batching: freed slots are reused immediately
# Request A: 10 tokens [DONE][Request D: 30 tokens------DONE]
# Request B: 50 tokens [PROCESSING----------------------------DONE]
# Request C: 20 tokens [PROCESSING--DONE][Req E: 15 tokens--DONE]
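The timeline above can be turned into a small simulation. This sketch counts decode iterations for the five requests in the diagram (A through E) with three batch slots. It assumes one token per request per iteration and ignores prefill cost and scheduling overhead; the function names are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    active = []   # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Fill free slots from the waiting queue before each iteration
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One iteration emits one token per active request; drop finished ones
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps

lengths = [10, 50, 20, 30, 15]   # output lengths of requests A, B, C, D, E
total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=3)
    print(f"{name:>10}: {steps} iterations, {total_tokens / steps:.2f} tokens/iteration")
```

The gap widens as output lengths become more uneven, since static batching idles every slot that finishes before the longest request in its batch.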
Speculative Decoding
Speculative decoding uses a smaller "draft" model to generate candidate tokens that are then verified by the larger target model in a single forward pass.
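\text{Speedup}_{speculative} = \frac{1 - \alpha^{k+1}}{(1 - \alpha) \cdot \left(k \cdot \frac{T_{draft}}{T_{target}} + 1\right)}
Where k is the draft length, \alpha is the per-token acceptance rate (assumed independent across positions), and T_{target} and T_{draft} are per-token costs. The term (1 - \alpha^{k+1}) / (1 - \alpha) is the expected number of tokens produced per verification step, and the second factor in the denominator is the cost of one step (k draft passes plus one target pass) in units of T_{target}.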
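One common way to estimate the benefit is to assume each draft token is accepted independently with probability α. The expected number of tokens per verification step is then (1 - α^(k+1)) / (1 - α), and one step costs k draft passes plus one target pass. The sketch below evaluates that estimate; the acceptance rates and the 0.1 cost ratio are illustrative assumptions, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step (accepted drafts plus one target token)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speculative_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding; cost_ratio = T_draft / T_target."""
    step_cost = k * cost_ratio + 1   # in units of one target forward pass
    return expected_tokens_per_step(alpha, k) / step_cost

# Illustrative settings: draft model assumed to cost 10% of the target per token
for alpha in (0.5, 0.8, 0.9):
    for k in (3, 5, 8):
        print(f"alpha={alpha}, k={k}: {speculative_speedup(alpha, k, 0.1):.2f}x")
```

Under this model a longer draft only pays off when the acceptance rate is high; with a low acceptance rate the extra draft passes are mostly wasted, and the estimate can drop below 1x.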
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class SpeculativeDecoder:
    """Greedy speculative decoding for a single sequence (batch size 1).

    With greedy acceptance, the output matches the target model's greedy decoding.
    Both models are assumed to share a tokenizer and a device.
    """

    def __init__(self, target_model, draft_model, tokenizer, k=5):
        self.target = target_model
        self.draft = draft_model
        self.tokenizer = tokenizer
        self.k = k  # Draft tokens per verification step

    @torch.no_grad()
    def generate(self, prompt: str, max_tokens: int = 256):
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        generated = input_ids.to(self.target.device)
        max_length = generated.shape[-1] + max_tokens  # Prompt plus new tokens

        while generated.shape[-1] < max_length:
            # Step 1: Draft model greedily proposes k candidate tokens
            draft_tokens = []
            draft_input = generated.clone()
            for _ in range(self.k):
                draft_logits = self.draft(draft_input).logits[:, -1, :]
                draft_token = torch.argmax(draft_logits, dim=-1)
                draft_tokens.append(draft_token)
                draft_input = torch.cat([draft_input, draft_token.unsqueeze(0)], dim=-1)
            draft_tensor = torch.stack(draft_tokens, dim=-1)

            # Step 2: Target model scores all draft tokens in one pass
            candidate = torch.cat([generated, draft_tensor], dim=-1)
            target_logits = self.target(candidate).logits

            # Step 3: Accept draft tokens until the first mismatch.
            # The logits at position -(k + 1 - i) predict the i-th draft token.
            accepted = 0
            for i in range(self.k):
                target_token = torch.argmax(target_logits[:, -(self.k + 1 - i), :], dim=-1)
                if target_token.item() != draft_tokens[i].item():
                    break
                accepted += 1

            # Keep the accepted prefix of the draft
            generated = torch.cat([generated, draft_tensor[:, :accepted]], dim=-1)

            # Step 4: Append one token from the target model: its correction at the
            # first mismatch, or a bonus token if every draft token was accepted
            next_token = torch.argmax(target_logits[:, -(self.k + 1 - accepted), :], dim=-1)
            generated = torch.cat([generated, next_token.unsqueeze(0)], dim=-1)

        # The last step can overshoot, so trim to the requested length
        return self.tokenizer.decode(generated[0, :max_length])
Flash Attention
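Flash Attention computes exact attention in a memory-efficient manner. It processes queries, keys, and values in tiles that fit in on-chip SRAM and never writes the full n × n attention matrix to GPU HBM, which reduces memory reads and writes between HBM and SRAM.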
\text{Memory}_{standard} = O(n^2) \quad \text{vs} \quad \text{Memory}_{flash} = O(n)
from torch.nn.functional import scaled_dot_product_attention

# PyTorch 2.0+: a fused attention kernel is selected automatically
def attention_forward(q, k, v, is_causal=True):
    # Uses the FlashAttention kernel when the hardware, dtype, and input shapes
    # support it; otherwise PyTorch falls back to another implementation
    return scaled_dot_product_attention(q, k, v, is_causal=is_causal)

# Memory savings: O(n^2) → O(n) in sequence length n
# Speed improvement: roughly 2-4x for typical sequence lengths (512-4096)
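The reason tiling can stay exact is the online softmax trick: keys and values are consumed block by block while a running maximum, a running normalizer, and an unnormalized output are rescaled as needed. The sketch below illustrates that idea in plain Python for a single query vector. It is not the FlashAttention kernel (no GPU tiling, no causal mask, no batching), and the function names and block size are illustrative.

```python
import math
import random

def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size=4):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])           # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum before adding this block
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(0)
dim, n = 8, 37   # arbitrary sizes; n is deliberately not a multiple of block_size
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

diff = max(abs(a - b) for a, b in zip(naive_attention(q, keys, values),
                                      tiled_attention(q, keys, values)))
print(f"max difference between naive and tiled outputs: {diff:.2e}")
```

The tiled version keeps only O(block_size) scores live at once, which is the property that lets the real kernel avoid storing the n × n matrix.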
Optimization Summary
| Technique | Speedup | Memory Reduction | Implementation Effort |
|---|---|---|---|
| INT8 Quantization | ~1.5x | 2x (weights) | Low |
| INT4 Quantization | ~2.5x | 4x (weights) | Medium |
| PagedAttention | None directly (raises throughput via larger batches) | 60-80% of KV cache | Low (use vLLM) |
| Continuous Batching | 2-5x throughput | Minimal | Low (use vLLM) |
| Speculative Decoding | 2-3x | None (the draft model adds memory) | High |
| Flash Attention | 2-4x (attention) | 5-20x (attention memory) | Low (use PyTorch 2.0+) |
Effective LLMOps combines multiple optimization techniques. A production system might use AWQ-quantized models served via vLLM with continuous batching and Flash Attention, which can yield on the order of 5-10x higher throughput than naive inference, depending on the model, hardware, and workload.