LLM Serving Landscape
Serving large language models requires specialized infrastructure that handles GPU memory management, request batching, and high-throughput inference. Unlike traditional ML model serving, LLM serving must manage autoregressive generation across multiple concurrent requests.
Serving Frameworks Comparison
| Framework | Developer | Key Feature | Best For |
|---|---|---|---|
| vLLM | UC Berkeley | PagedAttention, continuous batching | High-throughput serving |
| TGI | Hugging Face | Optimized for HF models, streaming | HF ecosystem integration |
| TensorRT-LLM | NVIDIA | Max throughput on NVIDIA GPUs | NVIDIA-optimized deployments |
| Triton | NVIDIA | Multi-model serving, ensemble | Complex inference pipelines |
| Ollama | Ollama | Local development, easy setup | Prototyping, edge deployment |
vLLM Architecture
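vLLM is one of the most widely adopted open-source LLM serving engines. Its key innovation is PagedAttention, which stores the KV cache in fixed-size blocks instead of one contiguous region per request. This removes external fragmentation and limits wasted memory to the partially filled last block of each sequence.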
from vllm import LLM, SamplingParams

# Basic vLLM deployment (offline inference API)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,        # Shard the model across 4 GPUs
    gpu_memory_utilization=0.9,    # Fraction of each GPU's memory vLLM may use (weights, activations, KV cache)
    max_model_len=4096,            # Maximum context length
    dtype="auto",                  # Auto-detect FP16/BF16 from the model config
    trust_remote_code=True,        # Only required for models that ship custom code; enable for trusted repos only
    enforce_eager=True,            # Disable CUDA graphs for debugging (costs throughput)
)

# Generate with controlled parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=512,
    presence_penalty=0.1,
    frequency_penalty=0.1,
    stop=["</s>", "Human:"],
)
outputs = llm.generate(["What is LLMOps?"], params)
print(outputs[0].outputs[0].text)
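The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class `ToyBlockAllocator` and its methods are hypothetical names, and the code only tracks which fixed-size blocks each sequence owns, without storing any tensors.

```python
from math import ceil


class ToyBlockAllocator:
    """Toy model of paged KV-cache bookkeeping (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, allocating new blocks only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV-cache blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self, seq_id: str) -> int:
        """Unused token slots; always less than one block per sequence."""
        return len(self.block_tables[seq_id]) * self.block_size - self.seq_lens[seq_id]


allocator = ToyBlockAllocator(num_blocks=8, block_size=16)
allocator.append_tokens("a", 20)  # 20-token prompt needs 2 blocks
allocator.append_tokens("b", 5)   # 5-token prompt needs 1 block
allocator.append_tokens("a", 1)   # 21 tokens still fit in the same 2 blocks
allocator.free("b")               # the block of "b" is reusable immediately
```

Because blocks are handed out on demand and need not be adjacent, a finished request frees memory that any other sequence can pick up, which is the property the block-size setting in a vLLM configuration controls.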
vLLM Production Configuration
# vllm-config.yaml
model: "meta-llama/Llama-2-70b-hf"
served-model-name: "llama-2-70b"
tensor-parallel-size: 4
pipeline-parallel-size: 1
gpu-memory-utilization: 0.92
max-num-batched-tokens: 8192
max-num-seqs: 256
block-size: 16
swap-space: 4 # GiB of CPU swap space per GPU
max-logprobs: 5
disable-log-requests: false
host: "0.0.0.0"
port: 8000
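Once a server is running with a configuration like the one above, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the standard library; it assumes the server is reachable at `http://localhost:8000` and serves the model under the name `llama-2-70b`. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             model: str = "llama-2-70b",
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,            # must match served-model-name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Building the request does not touch the network; complete() does.
request = build_completion_request("What is LLMOps?")
```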
Text Generation Inference (TGI)
TGI is Hugging Face's production-ready serving solution, tightly integrated with the Hugging Face ecosystem.
from huggingface_hub import InferenceClient

# Client for a self-hosted TGI server
client = InferenceClient(
    model="http://localhost:8080",  # URL of the self-hosted TGI endpoint
    # token="hf_xxx",               # Only needed for Hugging Face hosted inference
)

# Streaming generation
stream = client.text_generation(
    "Explain quantum computing in simple terms",
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stream=True,
    details=True,  # Yield token-level objects instead of plain strings
)
for chunk in stream:
    print(chunk.token.text, end="", flush=True)
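When the streamed output needs to be collected rather than printed, the loop above can be wrapped in a small helper. This is a sketch under the assumption that each chunk exposes `token.text`, `token.special`, and an optional `details.finish_reason`, as the detailed streaming objects from `huggingface_hub` do. The names `consume_stream` and `fake_chunk` are hypothetical, and the stand-in objects only exist so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def consume_stream(stream: Iterable) -> Tuple[str, Optional[str]]:
    """Join streamed token texts, skipping special tokens.

    Returns (joined_text, finish_reason). The finish reason is taken from
    the chunk that carries details, which is normally the last one.
    """
    pieces = []
    finish_reason = None
    for chunk in stream:
        if not chunk.token.special:
            pieces.append(chunk.token.text)
        if chunk.details is not None:
            finish_reason = chunk.details.finish_reason
    return "".join(pieces), finish_reason


def fake_chunk(text: str, special: bool = False, finish_reason: Optional[str] = None):
    """Stand-in for a streamed chunk, used only for this offline demo."""
    details = SimpleNamespace(finish_reason=finish_reason) if finish_reason else None
    return SimpleNamespace(token=SimpleNamespace(text=text, special=special), details=details)


fake_stream = [
    fake_chunk("Quantum"),
    fake_chunk(" bits"),
    fake_chunk("</s>", special=True, finish_reason="eos_token"),
]
text, reason = consume_stream(fake_stream)
```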
TGI Docker Deployment
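# Dockerfile for TGI
# Pin a specific version tag instead of latest for reproducible production builds
FROM ghcr.io/huggingface/text-generation-inference:latest
# The launcher reads its settings from these environment variables
ENV MODEL_ID=meta-llama/Llama-2-7b-chat-hf
# gptq expects a checkpoint that already contains GPTQ-quantized weights
ENV QUANTIZE=gptq
ENV MAX_INPUT_LENGTH=4096
ENV MAX_TOTAL_TOKENS=8192
ENV MAX_BATCH_PREFILL_TOKENS=8192
ENV MAX_CONCURRENT_REQUESTS=128
# No CMD is needed: the base image's entrypoint already starts text-generation-launcher,
# and an exec-form CMD would pass "$MODEL_ID" literally without expanding it.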
# Deploy with Docker
# Note: --quantize gptq expects a checkpoint that already contains GPTQ-quantized weights
docker run --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --quantize gptq
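The token limits in the deployment above are related to each other, and a quick check before launch can catch inconsistent values. This sketch assumes two constraints: the input limit must be strictly smaller than the total token limit, and the prefill batch budget must be at least as large as the input limit. The function names are hypothetical, and the authoritative rules are whatever the launcher of your TGI version enforces.

```python
from typing import List


def validate_tgi_limits(max_input_length: int,
                        max_total_tokens: int,
                        max_batch_prefill_tokens: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_prefill_tokens < max_input_length:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_LENGTH")
    return problems


def max_new_tokens_for(prompt_tokens: int, max_total_tokens: int) -> int:
    """Generation budget left for a prompt: total limit minus prompt length."""
    return max(0, max_total_tokens - prompt_tokens)


# Values from the Dockerfile above
problems = validate_tgi_limits(4096, 8192, 8192)
budget = max_new_tokens_for(prompt_tokens=4096, max_total_tokens=8192)
```

With these values a maximum-length prompt still leaves 4096 tokens for generation.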
TensorRT-LLM
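TensorRT-LLM typically delivers some of the highest throughput available on NVIDIA GPUs through kernel-level optimizations and graph compilation. Models are compiled ahead of time into engines that are specific to the GPU type and the chosen limits.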
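from tensorrt_llm import BuildConfig, Mapping, build
from tensorrt_llm.models import LLaMAForCausalLM

# Sketch of an engine build with the TensorRT-LLM Python build API.
# Class and field names have changed between releases; verify them against your installed version.
HF_MODEL_DIR = "/path/to/llama-hf"  # placeholder: local Hugging Face checkpoint
ENGINE_DIR = "/path/to/engines"     # placeholder: output directory
TP_SIZE, PP_SIZE = 4, 1

build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=2048,
    max_seq_len=4096,  # prompt plus generated tokens; supersedes the older max_output_len
)
build_config.plugin_config.gemm_plugin = "float16"
build_config.plugin_config.context_fmha = True            # fused attention for the context phase
build_config.plugin_config.use_paged_context_fmha = True

# Build and save one optimized engine per parallel rank
for rank in range(TP_SIZE * PP_SIZE):
    mapping = Mapping(world_size=TP_SIZE * PP_SIZE, rank=rank, tp_size=TP_SIZE, pp_size=PP_SIZE)
    model = LLaMAForCausalLM.from_hugging_face(HF_MODEL_DIR, dtype="float16", mapping=mapping)
    engine = build(model, build_config)
    engine.save(ENGINE_DIR)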
Triton Inference Server
Triton supports serving multiple models with different backends simultaneously, making it suitable for complex inference pipelines.
# model_repository/llama_pipeline/config.pbtxt for LLM serving
name: "llama_pipeline"
platform: "ensemble"
max_batch_size: 32
input [
  {
    name: "INPUT_TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT_TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "INPUT_TEXT" value: "INPUT_TEXT" }
      output_map { key: "TOKEN_IDS" value: "TOKEN_IDS" }
    },
    {
      model_name: "llama_engine"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "TOKEN_IDS" }
      output_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
    },
    {
      model_name: "detokenizer"
      model_version: -1
      input_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
      output_map { key: "OUTPUT_TEXT" value: "OUTPUT_TEXT" }
    }
  ]
}
Auto-Scaling for LLM Workloads
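LLM serving calls for GPU-aware auto-scaling driven by signals such as GPU utilization, VRAM usage, and request queue depth rather than CPU utilization. The Pods-type metrics in the configuration below are custom metrics, so the cluster needs a metrics adapter (for example, Prometheus Adapter) that exposes them through the custom metrics API.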
# Kubernetes HPA configuration for LLM serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "85"
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
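How the autoscaler turns these targets into a replica count can be sketched with the documented HPA rule: for each metric, desired replicas = ceil(current replicas × current value / target value), and the largest proposal across metrics wins. The function below is a simplified model with a hypothetical name. It applies the default 10% tolerance and the min/max bounds, but ignores the stabilization windows and rate-limiting policies from the `behavior` section.

```python
from math import ceil
from typing import List, Tuple


def desired_replicas(current_replicas: int,
                     metrics: List[Tuple[float, float]],
                     min_replicas: int = 2,
                     max_replicas: int = 16,
                     tolerance: float = 0.1) -> int:
    """metrics is a list of (current_average_value, target_average_value) pairs."""
    proposals = []
    for current, target in metrics:
        ratio = current / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current_replicas)  # close enough to target: no change
        else:
            proposals.append(ceil(current_replicas * ratio))
    desired = max(proposals)                     # the most demanding metric wins
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas, GPU utilization 90 (target 85), queue depth 25 (target 10)
scaled_up = desired_replicas(4, [(90, 85), (25, 10)])
# 4 replicas, both metrics well below target
scaled_down = desired_replicas(4, [(40, 85), (2, 10)])
```

In the first case GPU utilization is within tolerance, but the queue depth is 2.5 times its target, so the rule proposes 10 replicas. With the scale-up policy above, which adds at most 2 pods per 60 seconds, reaching that count would take several minutes.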
Cost Comparison
| Setup | GPUs | Throughput (tokens/s) | Cost/hr (USD) | Cost/1M tokens (USD) |
|---|---|---|---|---|
| vLLM + A100 80GB | 4 | ~8,000 | 1.50 | |
| TGI + A100 40GB | 2 | ~4,000 | 1.50 | |
| TensorRT-LLM + H100 | 4 | ~15,000 | 1.60 | |
| OpenAI API | N/A | ~1,000 | N/A | 30.00 |
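Self-hosting becomes cost-effective at sustained scale (roughly more than 1M tokens per day), because the hourly GPU cost is fixed regardless of traffic. API-based serving, which is billed per token, is more economical for variable or low-volume workloads.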
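The arithmetic behind this comparison can be sketched as follows. The inputs are taken at face value from the first table row (1.50 per hour, about 8,000 tokens/s) and the API row (30.00 per 1M tokens); they are illustrative figures rather than measurements. The function names and the `utilization` parameter are assumptions added for the example.

```python
def cost_per_million_tokens(cost_per_hour: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Self-hosted cost per 1M tokens at a given average utilization (0 to 1)."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return cost_per_hour / tokens_per_hour * 1_000_000


def break_even_tokens_per_day(cost_per_hour: float,
                              api_cost_per_million: float) -> float:
    """Daily token volume at which a 24/7 deployment costs the same as the API."""
    daily_cost = cost_per_hour * 24
    return daily_cost / api_cost_per_million * 1_000_000


# Fully utilized: about 0.05 per 1M tokens
full_load = cost_per_million_tokens(1.50, 8_000)
# At 10% utilization the same hardware costs ten times more per token
light_load = cost_per_million_tokens(1.50, 8_000, utilization=0.1)
# 36.00 per day divided by 30.00 per 1M tokens: about 1.2M tokens per day
break_even = break_even_tokens_per_day(1.50, 30.00)
```

Under these inputs the break-even volume is about 1.2M tokens per day, in line with the rough threshold stated above. Below that volume the fixed hourly cost dominates and the per-token API price is cheaper.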