LLM Inference Optimization

Why Inference Optimization Matters

A single LLM inference call can consume anywhere from milliseconds to seconds of GPU time. At scale, inference often dominates operating costs, so improving inference throughput and latency is among the most impactful LLMOps activities for reducing spend.

Quantization

Quantization reduces model weight precision from FP16/FP32 to lower-bit representations, reducing memory usage and increasing throughput.

Quantization Methods

Method	Bits	Speedup	Quality Impact	Memory Reduction
FP16 (baseline)	16	1.0x	None	1.0x
INT8	8	~1.5x	Minimal	2.0x
INT4 (GPTQ)	4	~2.5x	Moderate	4.0x
INT4 (AWQ)	4	~2.5x	Low-Moderate	4.0x
GGUF (llama.cpp)	2-8	Variable	Variable	Variable
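
The memory column follows directly from bits per weight. The sketch below estimates weight storage for a 7B-parameter model; the helper name `weight_memory_gb` is hypothetical, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits / 8 / 1e9


# Assumed example: a 7B-parameter model at the bit widths from the table
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized checkpoints are usually somewhat larger than this estimate because per-group scales and zero points are stored alongside the weights.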

Quantization Error

For uniform round-to-nearest quantization over the range [w_{min}, w_{max}] with step size \Delta, the quantization error for a weight w quantized to b bits is bounded by:

\epsilon = |w - \text{quantize}(w, b)| \leq \frac{\Delta}{2} = \frac{w_{max} - w_{min}}{2^{b+1} - 2}
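
The bound can be checked numerically. The following is a minimal sketch, assuming uniform round-to-nearest quantization onto 2^b evenly spaced levels; the function names are illustrative and not taken from any library.

```python
import random


def quantize(w: float, b: int, w_min: float, w_max: float) -> float:
    """Round w to the nearest of 2**b evenly spaced levels in [w_min, w_max]."""
    step = (w_max - w_min) / (2 ** b - 1)
    level = round((w - w_min) / step)
    return w_min + level * step


def max_error_bound(b: int, w_min: float, w_max: float) -> float:
    """Half the step size: (w_max - w_min) / (2**(b+1) - 2)."""
    return (w_max - w_min) / (2 ** (b + 1) - 2)


def worst_error(weights, b: int, w_min: float, w_max: float) -> float:
    """Largest observed absolute quantization error over a set of weights."""
    return max(abs(w - quantize(w, b, w_min, w_max)) for w in weights)


random.seed(0)
w_min, w_max = -1.0, 1.0
weights = [random.uniform(w_min, w_max) for _ in range(10_000)]

for b in (8, 4, 2):
    observed = worst_error(weights, b, w_min, w_max)
    bound = max_error_bound(b, w_min, w_max)
    print(f"b={b}: worst observed error {observed:.5f}, bound {bound:.5f}")
```

Each bit removed roughly doubles the bound, which is why 4-bit methods rely on grouping and calibration to keep the quality loss acceptable.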

Post-Training Quantization with GPTQ

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantizing an unquantized checkpoint needs calibration data and a tokenizer
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=True,      # Quantize columns in order of decreasing activation size
    damp_percent=0.01,
    sym=True,           # Symmetric quantization
    dataset="c4",       # Calibration dataset
    tokenizer=tokenizer
)

# Load the model and quantize it with GPTQ during loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config
)

# Memory usage: ~4GB instead of ~14GB for FP16
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

Activation-Aware Weight Quantization (AWQ)

AWQ uses activation statistics to identify salient weight channels and protects them with per-channel scaling before quantization, achieving better quality than naive round-to-nearest quantization.

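from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    safetensors=True
)
# The tokenizer is needed to tokenize the calibration data
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# AWQ quantization: calibrates on sample text to protect salient weight channels
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="pileval"
)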

KV Cache Management

The Key-Value cache stores previously computed attention keys and values, avoiding redundant computation during autoregressive generation.

\text{KV Cache Size} = 2 \times L \times n_{heads} \times d_{head} \times s \times \text{sizeof(dtype)}

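Where L = layers, n_{heads} = attention heads (the number of key-value heads when grouped-query attention is used), d_{head} = head dimension, s = sequence length. The leading factor of 2 accounts for storing both keys and values, and the result is per sequence.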

For a 70B parameter model with 80 layers, 64 heads, 128 head dimension, and 4096 sequence length:

KV Cache = 2 × 80 × 64 × 128 × 4096 × 2 bytes = 10.7 GB per sequence
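
The same arithmetic as a small helper, which is a sketch rather than a library function. The name `kv_cache_bytes` and the 40 GB cache budget are assumptions made for illustration.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and head."""
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes


per_sequence = kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=4096)
per_token = kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=1)

print(f"Per sequence: {per_sequence / 1e9:.1f} GB")
print(f"Per token:    {per_token / 1e6:.1f} MB")

# Hypothetical budget: how many full-length sequences fit in 40 GB of cache?
budget_bytes = 40e9
print(f"Full-length sequences in 40 GB: {int(budget_bytes // per_sequence)}")
```

At this size only a handful of full-length sequences fit on one accelerator, which is why the cache management techniques below matter for batch size.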

PagedAttention (vLLM)

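vLLM introduces PagedAttention, which manages the KV cache in fixed-size blocks (pages) instead of one contiguous region per sequence. This removes the need to reserve memory for the maximum sequence length up front and nearly eliminates fragmentation: only the last block of each sequence can be partially empty.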

from vllm import LLM, SamplingParams

# vLLM automatically manages KV cache with PagedAttention
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,  # Fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,
    max_num_seqs=256,
    block_size=16  # KV cache block size in tokens
)

# Efficient batching: requests can share KV cache blocks (e.g. parallel samples of one prompt)
prompts = ["Explain quantum computing", "What is machine learning?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
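
To illustrate the paging idea itself, here is a toy block allocator. It is a simplified sketch, not vLLM's implementation: the class `BlockAllocator` and its methods are hypothetical, and it tracks only block ownership, not the actual key and value tensors.

```python
class BlockAllocator:
    """Toy paged KV cache: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token, allocating a block on demand."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("a")  # 40 tokens need 3 blocks
for _ in range(10):
    alloc.append_token("b")  # 10 tokens need 1 block

print(alloc.block_tables)
print("free blocks:", len(alloc.free_blocks))
alloc.free("a")
print("free blocks after freeing 'a':", len(alloc.free_blocks))
```

Blocks are allocated only as tokens arrive and need not be contiguous, so the waste per sequence is bounded by one partially filled block.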

KV Cache Compression

For long-context models, KV cache can be compressed through eviction or pooling strategies.

# Sliding window attention — limit KV cache to recent tokens
class SlidingWindowKVCache:
    def __init__(self, window_size: int = 2048):
        self.window_size = window_size

    def update(self, key_states, value_states):
        # Keep only the last window_size tokens
        if key_states.shape[-2] > self.window_size:
            key_states = key_states[..., -self.window_size:, :]
            value_states = value_states[..., -self.window_size:, :]
        return key_states, value_states

Continuous Batching

Traditional static batching waits for all requests to complete before processing the next batch. Continuous batching (also called iteration-level scheduling) adds and removes requests from the batch dynamically.

Throughput Improvement

Continuous batching achieves throughput of:

\text{Throughput}_{continuous} = \frac{N \cdot \bar{T}_{output}}{\max(L_i) \cdot \bar{T}_{token}}

Where N is batch size, \bar{T}_{output} is average output length, L_i is the length of the i-th sequence, and \bar{T}_{token} is per-token latency.

# Comparison: static vs continuous batching
# Static batching: batch waits for longest sequence
# Request A: 10 tokens  [PROCESSING-DONE-IDLE-IDLE-IDLE]
# Request B: 50 tokens  [PROCESSING------------------DONE]
# Request C: 20 tokens  [PROCESSING------DONE-IDLE---IDLE]

# Continuous batching: slots are immediately reused
# Request A: 10 tokens  [DONE] [Request D: 30 tokens--DONE]
# Request B: 50 tokens  [PROCESSING------------------------DONE]
# Request C: 20 tokens  [PROCESSING----DONE] [Req E: 15---DONE]
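
The difference can be quantified with a small simulation. This is a simplified sketch: it counts decode steps only, ignores prefill cost, assumes every step takes the same time regardless of batch occupancy, and uses the request lengths from the diagram (A to E) with three batch slots.

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    pending = list(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while pending or active:
        # Fill free slots from the queue before each decode step
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that just finished
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [10, 50, 20, 30, 15]  # output tokens for requests A, B, C, D, E
total_tokens = sum(lengths)

static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)

print(f"static:     {static_steps} steps, {total_tokens / static_steps:.2f} tokens/step")
print(f"continuous: {continuous_steps} steps, {total_tokens / continuous_steps:.2f} tokens/step")
```

In this toy workload the gain comes entirely from refilling the slots that requests A and C free up while request B is still running.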

Speculative Decoding

Speculative decoding uses a smaller "draft" model to generate candidate tokens that are then verified by the larger target model in a single forward pass.

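\text{Speedup}_{speculative} = \frac{T_{target} \cdot \frac{1 - \alpha^{k+1}}{1 - \alpha}}{k \cdot T_{draft} + T_{target}}

Where k is the draft length, \alpha is the per-token acceptance rate (assumed independent across positions), and T_{target} and T_{draft} are per-token costs. The factor (1 - \alpha^{k+1}) / (1 - \alpha) is the expected number of tokens produced per verification step, and the denominator is the cost of one step: k draft passes plus one target pass.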
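
A worked estimate helps when choosing k. The sketch below uses a commonly cited approximation that assumes each draft token is accepted independently with probability alpha, so one verification step yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average at a cost of k draft passes plus one target pass. The function names and the example costs are assumptions for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification step, assuming i.i.d. acceptance."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, k: int, t_draft: float, t_target: float) -> float:
    """Estimated speedup over decoding one token per target forward pass."""
    step_cost = k * t_draft + t_target
    return expected_tokens_per_step(alpha, k) * t_target / step_cost


# Assumed example: the draft model costs 10% of the target model per token
for k in (2, 5, 10):
    estimate = speculative_speedup(alpha=0.8, k=k, t_draft=0.1, t_target=1.0)
    print(f"k={k}: estimated speedup {estimate:.2f}x")
```

With a low acceptance rate the estimate drops below 1, meaning the draft passes cost more than they save.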

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpeculativeDecoder:
    """Greedy speculative decoding for a single sequence (batch size 1)."""

    def __init__(self, target_model, draft_model, tokenizer, k=5):
        self.target = target_model
        self.draft = draft_model
        self.tokenizer = tokenizer
        self.k = k  # Draft tokens per verification step

    @torch.no_grad()
    def generate(self, prompt: str, max_tokens: int = 256):
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        generated = input_ids.clone()
        # max_tokens counts newly generated tokens, not the prompt
        max_len = input_ids.shape[-1] + max_tokens

        while generated.shape[-1] < max_len:
            # Step 1: Draft model greedily proposes k candidate tokens
            draft_tokens = []
            draft_input = generated
            for _ in range(self.k):
                draft_logits = self.draft(draft_input).logits[:, -1, :]
                draft_token = torch.argmax(draft_logits, dim=-1)
                draft_tokens.append(draft_token)
                draft_input = torch.cat([draft_input, draft_token.unsqueeze(0)], dim=-1)

            draft_tensor = torch.stack(draft_tokens, dim=-1)  # Shape: [1, k]

            # Step 2: Target model scores all draft tokens in one forward pass
            candidate = torch.cat([generated, draft_tensor], dim=-1)
            target_logits = self.target(candidate).logits

            # Step 3: Accept the longest prefix of draft tokens that matches the
            # target's greedy choice. Logits at position -(k + 1 - i) predict draft token i.
            accepted = 0
            for i in range(self.k):
                target_token = torch.argmax(target_logits[:, -(self.k + 1 - i), :], dim=-1)
                if target_token.item() != draft_tokens[i].item():
                    break
                accepted += 1

            # Step 4: Append the accepted tokens plus one token from the target:
            # the correction at the first mismatch, or a bonus token if all k matched.
            # The index uses the logits conditioned only on accepted tokens.
            next_token = torch.argmax(target_logits[:, -(self.k + 1 - accepted), :], dim=-1)
            generated = torch.cat(
                [generated, draft_tensor[:, :accepted], next_token.unsqueeze(0)], dim=-1
            )

        # The last step may overshoot; trim to the requested length
        return self.tokenizer.decode(generated[0, :max_len])

Flash Attention

Flash Attention computes exact attention in a memory-efficient manner by reducing memory reads and writes between GPU HBM and SRAM.

\text{Memory}_{standard} = O(n^2) \quad \text{vs} \quad \text{Memory}_{flash} = O(n)

from torch.nn.functional import scaled_dot_product_attention

# PyTorch 2.0+ selects an attention backend automatically
def attention_forward(q, k, v, is_causal=True):
    # Dispatches to a FlashAttention kernel when the device, dtype, and shapes
    # support it (typically CUDA with FP16/BF16 inputs); otherwise falls back
    # to another backend
    return scaled_dot_product_attention(q, k, v, is_causal=is_causal)

# Memory savings: O(n^2) → O(n) in sequence length n
# Speed improvement: 2-4x for typical sequence lengths (512-4096)
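
The O(n) memory figure comes from processing keys and values block by block with an online softmax, so the full row of attention scores is never stored. The sketch below shows that accumulation for a single query vector in plain Python. It illustrates the idea only and is not the GPU kernel; the function names and sample data are made up for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materializes the full row of scores before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Online softmax: one block at a time, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block)
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

Both functions return the same result up to floating-point rounding, which is why Flash Attention counts as exact attention rather than an approximation.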

Optimization Summary

Technique	Speedup	Memory Reduction	Implementation Effort
INT8 Quantization	~1.5x	2x (weights)	Low
INT4 Quantization	~2.5x	4x (weights)	Medium
PagedAttention	None for latency (raises throughput)	60-80% KV cache	Low (use vLLM)
Continuous Batching	2-5x throughput	Minimal	Low (use vLLM)
Speculative Decoding	2-3x	None (draft model adds memory)	High
Flash Attention	2-4x (attention)	5-20x (attention)	Low (use PyTorch 2.0+)

Effective LLMOps combines multiple optimization techniques. A production system might use AWQ-quantized models served via vLLM with continuous batching and Flash Attention, potentially achieving a 5-10x throughput improvement over naive inference, depending on model, hardware, and workload.