LLM Serving Architecture

LLM Serving Landscape

Serving large language models requires specialized infrastructure that handles GPU memory management, request batching, and high-throughput inference. Unlike traditional ML model serving, LLM serving must manage autoregressive generation across multiple concurrent requests.
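To make the batching problem concrete, the toy simulation below compares static batching (a batch runs until its longest request finishes) with continuous batching (a freed slot is refilled before the next decode step). It is a sketch under simplifying assumptions: one token per request per step, no prefill cost, and a made-up workload. The function names are illustrative and do not come from any serving framework.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# Output tokens per request (hypothetical workload mixing short and long answers).
lengths = [4, 32, 4, 32, 4, 32, 4, 32]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

For this workload the static scheduler needs 64 decode steps and the continuous one 44, because short requests no longer hold a slot while the long ones finish.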

Serving Frameworks Comparison

Framework	Developer	Key Feature	Best For
vLLM	UC Berkeley	PagedAttention, continuous batching	High-throughput serving
TGI	Hugging Face	Optimized for HF models, streaming	HF ecosystem integration
TensorRT-LLM	NVIDIA	Max throughput on NVIDIA GPUs	NVIDIA-optimized deployments
Triton	NVIDIA	Multi-model serving, ensemble	Complex inference pipelines
Ollama	Ollama	Local development, easy setup	Prototyping, edge deployment

vLLM Architecture

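vLLM is one of the most widely adopted open-source LLM serving engines. Its key innovation is PagedAttention, which stores the KV cache in fixed-size blocks that need not be contiguous in GPU memory, much like pages in virtual memory. This removes most of the fragmentation that comes from reserving one large contiguous region per request.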
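The block-based idea can be sketched with a toy allocator. This is a simplified illustration, not vLLM's implementation: the class and method names are hypothetical, and it tracks only which physical blocks a sequence owns, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator; names are illustrative, not vLLM internals."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return every block of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots: only the tail of each last block."""
        return sum(len(table) * self.block_size - self.seq_lens.get(seq_id, 0)
                   for seq_id, table in self.block_tables.items())

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens -> 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens -> 1 block
print(cache.block_tables, cache.wasted_slots())
```

Waste is bounded by one partially filled block per sequence, whatever the sequence length, and the blocks of a finished sequence are immediately reusable by any other request.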

from vllm import LLM, SamplingParams

# Basic vLLM deployment (offline inference API)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,       # Shard the model across 4 GPUs
    gpu_memory_utilization=0.9,   # Fraction of each GPU's memory for weights, activations, and KV cache
    max_model_len=4096,           # Maximum context length (prompt + generated tokens)
    dtype="auto",                 # Use the dtype from the model config (FP16/BF16)
    trust_remote_code=True,       # Executes code from the model repo; not required for Llama-2
    enforce_eager=True            # Disables CUDA graphs: easier debugging, lower throughput
)

# Generate with controlled parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=512,
    presence_penalty=0.1,
    frequency_penalty=0.1,
    stop=["</s>", "Human:"]
)

outputs = llm.generate(["What is LLMOps?"], params)
print(outputs[0].outputs[0].text)

vLLM Production Configuration

# vllm-config.yaml
model: "meta-llama/Llama-2-70b-hf"
served-model-name: "llama-2-70b"
tensor-parallel-size: 4
pipeline-parallel-size: 1
gpu-memory-utilization: 0.92
max-num-batched-tokens: 8192
max-num-seqs: 256
block-size: 16
swap-space: 4                    # GiB of CPU swap space per GPU
max-logprobs: 5
disable-log-requests: false
host: "0.0.0.0"
port: 8000
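How gpu-memory-utilization, block-size, and max-num-seqs interact is easier to see with a rough capacity estimate. The sketch below uses the published Llama-2-70B shape (80 layers, 8 KV heads, head dimension 128); the GPU size, per-GPU weight footprint, and runtime overhead are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token: a key and a value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_capacity(gpu_mem_gib, utilization, weight_gib, overhead_gib,
                      per_token_bytes, block_size=16):
    """Tokens that fit in one GPU's KV cache, rounded down to whole blocks."""
    budget_bytes = (gpu_mem_gib * utilization - weight_gib - overhead_gib) * 1024**3
    blocks = int(budget_bytes // (per_token_bytes * block_size))
    return max(blocks, 0) * block_size

# Llama-2-70B in FP16: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(80, 8, 128)   # bytes per token across all GPUs
per_token_per_gpu = per_token // 4           # KV heads are split across 4 tensor-parallel ranks

# Assumed: 80 GiB GPUs, ~33 GiB of weights per GPU, 4 GiB of activations and runtime overhead.
capacity = kv_token_capacity(80, 0.92, 33.0, 4.0, per_token_per_gpu)
print(per_token, capacity, capacity // 4096)
```

Under these assumptions the cache holds roughly 480,000 tokens, or about 117 sequences at the full 4096-token length. A max-num-seqs of 256 therefore only fits when most sequences are much shorter than the maximum; otherwise the scheduler has to queue or preempt requests.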

Text Generation Inference (TGI)

TGI is Hugging Face's production-ready serving solution, tightly integrated with the Hugging Face ecosystem.

from huggingface_hub import InferenceClient

# Point the client at a TGI endpoint
client = InferenceClient(
    model="http://localhost:8080",  # URL of the self-hosted TGI server
    token="hf_xxx"                 # Only needed for hosted Hugging Face endpoints; omit for local TGI
)

# Streaming generation
stream = client.text_generation(
    "Explain quantum computing in simple terms",
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stream=True,
    details=True
)

for chunk in stream:
    print(chunk.token.text, end="", flush=True)

TGI Docker Deployment

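# Dockerfile for TGI
FROM ghcr.io/huggingface/text-generation-inference:latest

ENV MODEL_ID=meta-llama/Llama-2-7b-chat-hf
# gptq expects a checkpoint that already contains GPTQ-quantized weights
ENV QUANTIZE=gptq
# Llama-2 has a 4096-token context window, so input + generated tokens must fit in 4096
ENV MAX_INPUT_LENGTH=3072
ENV MAX_TOTAL_TOKENS=4096
ENV MAX_BATCH_PREFILL_TOKENS=8192
ENV MAX_CONCURRENT_REQUESTS=128

# No CMD override: the image's entrypoint starts text-generation-launcher, which reads
# the variables above. An exec-form CMD would also pass "$MODEL_ID" through unexpanded.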

# Deploy with Docker
docker run --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --quantize gptq

TensorRT-LLM

TensorRT-LLM compiles a model into a TensorRT engine with fused, hardware-specific kernels, and is typically among the highest-throughput options on NVIDIA GPUs.

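# TensorRT-LLM engine build via the high-level LLM API.
# Class and field names vary between TensorRT-LLM releases; check them against
# the installed version before relying on this.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=2048,
    max_seq_len=4096,   # Prompt + generated tokens; replaces max_output_len from older releases
)
build_config.plugin_config.gemm_plugin = "float16"
build_config.plugin_config.use_paged_context_fmha = True

# Build the optimized engine (model path assumed; any Llama checkpoint works)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    pipeline_parallel_size=1,
    build_config=build_config,
)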

Triton Inference Server

Triton supports serving multiple models with different backends simultaneously, making it suitable for complex inference pipelines.

# model_repository/llama_pipeline/config.pbtxt for LLM serving
# (Triton expects each model's config in a directory named after the model)
"""
name: "llama_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  {
    name: "INPUT_TEXT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "OUTPUT_TEXT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "INPUT_TEXT" value: "INPUT_TEXT" }
      output_map { key: "TOKEN_IDS" value: "TOKEN_IDS" }
    },
    {
      model_name: "llama_engine"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "TOKEN_IDS" }
      output_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
    },
    {
      model_name: "detokenizer"
      model_version: -1
      input_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
      output_map { key: "OUTPUT_TEXT" value: "OUTPUT_TEXT" }
    }
  ]
}
"""

Auto-Scaling for LLM Workloads

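LLM serving requires GPU-aware auto-scaling that considers GPU utilization or VRAM usage and request queue depth rather than CPU utilization. The Pods metrics used below are custom metrics, so the cluster needs a metrics adapter (for example, Prometheus Adapter) that exposes them to the HPA.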

# Kubernetes HPA configuration for LLM serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "85"
  - type: Pods
    pods:
      metric:
        name: request_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
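The replica count the HPA aims for follows the documented rule desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), evaluated per metric with the largest result winning. The sketch below reproduces that rule for the two metrics in the manifest; the observed values are made up, and the function names are illustrative. It deliberately ignores the behavior policies, which only limit how fast the target is approached.

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """HPA rule: ceil(current * currentMetric / targetMetric), skipping small deviations."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:  # within the default 10% tolerance: no change
        return current_replicas
    return math.ceil(current_replicas * ratio)

def hpa_decision(current_replicas, metrics, min_replicas=2, max_replicas=16):
    """With several metrics, take the largest proposal, then clamp to the bounds."""
    proposals = [desired_replicas(current_replicas, current, target)
                 for current, target in metrics.values()]
    return max(min_replicas, min(max_replicas, max(proposals)))

# (observed per-pod average, target) for each metric; observations are hypothetical.
metrics = {
    "gpu_utilization": (70.0, 85.0),
    "request_queue_depth": (25.0, 10.0),
}
print(hpa_decision(4, metrics))
```

With 4 replicas and an average queue depth of 25 against a target of 10, the rule asks for 10 replicas. The scaleUp policy above then caps the change at 2 pods per 60 seconds, so the deployment reaches that target over several periods.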

Cost Comparison

Setup	GPUs	Throughput (tokens/s)	Cost/hr	Cost/1M tokens
vLLM + A100 80GB	4	~8,000	$12.00	~$1.50
TGI + A100 40GB	2	~4,000	$6.00	~$1.50
TensorRT-LLM + H100	4	~15,000	$24.00	~$1.60
OpenAI API	N/A	~1,000	N/A	$10.00-$30.00

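Self-hosting becomes cost-effective at sustained high volume. An always-on deployment is billed for every hour regardless of traffic, so at the hourly rates above the break-even against API pricing is on the order of tens of millions of tokens per day, not ~1M. API-based serving is more economical for variable or low-volume workloads.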
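The relationship between hourly cost, throughput, and cost per million tokens is simple arithmetic, sketched below with the first row of the table. The utilization parameter is an assumption added here: at peak throughput the vLLM row works out to about $0.42 per million tokens, and the table's ~$1.50 matches an average utilization near 28%, which is one possible reading of the gap rather than something the table states.

```python
def cost_per_million_tokens(cost_per_hour, tokens_per_second, utilization=1.0):
    """Dollars per 1M generated tokens at a given average utilization (0 to 1)."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return cost_per_hour / tokens_per_hour * 1_000_000

def breakeven_tokens_per_day(cost_per_hour, api_price_per_million):
    """Daily token volume at which an always-on deployment matches the API bill."""
    return cost_per_hour * 24 / api_price_per_million * 1_000_000

# vLLM + 4x A100 80GB row: $12.00/hr at ~8,000 tokens/s.
print(round(cost_per_million_tokens(12.00, 8000), 2))
print(round(cost_per_million_tokens(12.00, 8000, utilization=0.28), 2))
# Break-even against the upper and lower API prices in the table.
print(round(breakeven_tokens_per_day(12.00, 30.00)))
print(round(breakeven_tokens_per_day(12.00, 10.00)))
```

For that row the break-even lands between roughly 9.6M and 28.8M tokens per day, which is why sustained volume matters more than peak throughput when comparing against an API.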