LLM Serving Architecture

LLM Serving Landscape

Serving large language models requires specialized infrastructure that handles GPU memory management, request batching, and high-throughput inference. Unlike traditional ML model serving, LLM serving must manage autoregressive generation across multiple concurrent requests.

Serving Frameworks Comparison

Framework | Developer | Key Feature | Best For
--- | --- | --- | ---
vLLM | UC Berkeley | PagedAttention, continuous batching | High-throughput serving
TGI | Hugging Face | Optimized for HF models, streaming | HF ecosystem integration
TensorRT-LLM | NVIDIA | Max throughput on NVIDIA GPUs | NVIDIA-optimized deployments
Triton | NVIDIA | Multi-model serving, ensembles | Complex inference pipelines
Ollama | Ollama | Local development, easy setup | Prototyping, edge deployment
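
Continuous batching, listed above as a key vLLM feature, can be illustrated with a toy simulation. The sketch below assumes each decode step emits one token per active request and that batch slots are the only constraint; the function names and request lengths are made up for illustration and do not come from any framework.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when free slots are refilled before every step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free batch slots
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: each active request emits one token;
        # requests that reach zero remaining tokens leave the batch
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

# Two long requests mixed with short ones (output lengths in tokens)
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With static batching the short requests hold their slots idle until the long request in the same batch finishes. Refilling slots per step lets the second long request start earlier, so the whole workload completes in fewer decode steps.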

vLLM Architecture

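vLLM is one of the most widely adopted open-source LLM serving engines. Its key innovation is PagedAttention, which stores the KV cache in fixed-size blocks instead of one contiguous region per sequence. This removes most of the memory fragmentation and over-reservation in KV cache management.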
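
To make the block-based idea concrete, here is a minimal sketch of paged KV-cache bookkeeping. It is not vLLM's implementation: the `BlockTable` class, the helper functions, and the sequence lengths are hypothetical, and the block size of 16 tokens is only an illustrative choice.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def contiguous_slots(seq_lens, max_model_len):
    """Token slots reserved if every sequence preallocates max_model_len."""
    return len(seq_lens) * max_model_len

def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Token slots reserved if each sequence only holds the blocks it uses."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

class BlockTable:
    """Maps each sequence's logical blocks to physical block ids."""

    def __init__(self, num_physical_blocks, block_size=BLOCK_SIZE):
        self.free = list(range(num_physical_blocks))
        self.block_size = block_size
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

seq_lens = [100, 37, 512]
print("contiguous:", contiguous_slots(seq_lens, max_model_len=4096))
print("paged:     ", paged_slots(seq_lens))
```

In the paged layout, waste is limited to the unused tail of each sequence's last block, and blocks released by a finished sequence can be reused by any other sequence.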

from vllm import LLM, SamplingParams

# Basic vLLM deployment
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,       # Shard the model across 4 GPUs
    gpu_memory_utilization=0.9,   # Fraction of each GPU's memory vLLM may use (weights, activations, KV cache)
    max_model_len=4096,           # Maximum context length
    dtype="auto",                 # Follow the model config (FP16/BF16)
    trust_remote_code=True,       # Not required for Llama; enable only for repos whose code you trust
    enforce_eager=True            # Disable CUDA graphs for debugging; remove for production throughput
)

# Generate with controlled parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=512,
    presence_penalty=0.1,
    frequency_penalty=0.1,
    stop=["</s>", "Human:"]
)

outputs = llm.generate(["What is LLMOps?"], params)
print(outputs[0].outputs[0].text)

vLLM Production Configuration

# vllm-config.yaml
model: "meta-llama/Llama-2-70b-hf"
served-model-name: "llama-2-70b"
tensor-parallel-size: 4
pipeline-parallel-size: 1
gpu-memory-utilization: 0.92
max-num-batched-tokens: 8192
max-num-seqs: 256
block-size: 16
swap-space: 4                    # GB for CPU swap
max-logprobs: 5
disable-log-requests: false
host: "0.0.0.0"
port: 8000
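
A rough back-of-the-envelope check shows how `gpu-memory-utilization` and `max-num-seqs` interact. The sketch below rests on several assumptions that are not stated in the configuration: four GPUs with 80 GiB each, FP16 weights of about 70e9 parameters, the Llama-2-70B attention shape (80 layers, 8 KV heads, head dimension 128), and a 4096-token context limit. It ignores activation memory and other overhead, so treat the result as an upper bound rather than what vLLM would report.

```python
GIB = 2**30

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV-cache bytes for one token: keys and values in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_budget(num_gpus, gpu_mem_gib, utilization, weight_bytes, bytes_per_token):
    """Tokens of KV cache that fit after the weights are loaded (overhead ignored)."""
    usable = num_gpus * gpu_mem_gib * GIB * utilization
    return int((usable - weight_bytes) // bytes_per_token)

# Assumed Llama-2-70B shape with an FP16 cache
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Assumed hardware: 4 x 80 GiB GPUs at gpu-memory-utilization 0.92
budget = kv_token_budget(4, 80, 0.92, weight_bytes=70e9 * 2, bytes_per_token=per_token)

# Worst case: max-num-seqs sequences, each at an assumed 4096-token limit
worst_case = 256 * 4096

print("KV bytes per token:", per_token)
print("KV token budget:   ", budget)
print("worst-case demand: ", worst_case)
```

Under these assumptions the worst-case demand exceeds the budget, which means 256 sequences cannot all sit at full context length at once. This is where swapping or preemption comes in, and `swap-space` sets aside CPU memory for that purpose.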

Text Generation Inference (TGI)

TGI is Hugging Face's production-ready serving solution, tightly integrated with the Hugging Face ecosystem.

from huggingface_hub import InferenceClient

# Client for a TGI server (self-hosted here; a hosted endpoint URL also works)
client = InferenceClient(
    model="http://localhost:8080",  # URL of the self-hosted TGI endpoint
    token="hf_xxx"                 # Only needed for hosted Hugging Face endpoints
)

# Streaming generation
stream = client.text_generation(
    "Explain quantum computing in simple terms",
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stream=True,
    details=True
)

for chunk in stream:
    print(chunk.token.text, end="", flush=True)

TGI Docker Deployment

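# Dockerfile for TGI
FROM ghcr.io/huggingface/text-generation-inference:latest

ENV MODEL_ID=meta-llama/Llama-2-7b-chat-hf
# gptq expects a checkpoint that has already been quantized with GPTQ
ENV QUANTIZE=gptq
ENV MAX_INPUT_LENGTH=4096
ENV MAX_TOTAL_TOKENS=8192
ENV MAX_BATCH_PREFILL_TOKENS=8192
ENV MAX_CONCURRENT_REQUESTS=128

# No CMD override: the base image already starts text-generation-launcher,
# which reads the settings above from the environment. An exec-form CMD
# would also not expand $MODEL_ID.

# Deploy with Docker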
docker run --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --quantize gptq
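
The token limits in the Dockerfile constrain one another: a prompt may use at most `MAX_INPUT_LENGTH` tokens, and the prompt plus generated tokens must fit within `MAX_TOTAL_TOKENS`. A small pre-deployment check can catch inconsistent values early. The helper names below are hypothetical and only mirror those relationships; the authoritative validation is done by the TGI launcher itself.

```python
def check_tgi_limits(max_input_length, max_total_tokens, max_batch_prefill_tokens):
    """Return a list of problems with the given TGI token limits."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_prefill_tokens < max_input_length:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_LENGTH")
    return problems

def max_new_tokens_allowed(prompt_tokens, max_input_length, max_total_tokens):
    """Largest max_new_tokens a request with this prompt length can ask for."""
    if prompt_tokens > max_input_length:
        raise ValueError("prompt exceeds MAX_INPUT_LENGTH")
    return max_total_tokens - prompt_tokens

# Values from the Dockerfile above
print(check_tgi_limits(4096, 8192, 8192))
print(max_new_tokens_allowed(1000, 4096, 8192))
```

For a 1,000-token prompt, the remaining budget for generation under these settings is 7,192 tokens.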

TensorRT-LLM

TensorRT-LLM targets maximum throughput on NVIDIA GPUs through kernel-level optimizations and ahead-of-time compilation of the model into a TensorRT engine.

from tensorrt_llm import LLM, BuildConfig

# TensorRT-LLM engine build configuration.
# Names follow the TensorRT-LLM LLM API and have changed between releases,
# so check them against the installed version.
build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=2048,
    max_seq_len=4096,   # Prompt plus generated tokens; output length is capped per request
)
build_config.plugin_config.gemm_plugin = "float16"
build_config.plugin_config.use_paged_context_fmha = True

# Build an optimized engine for a Llama checkpoint
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    pipeline_parallel_size=1,
    build_config=build_config,
)

Triton Inference Server

Triton supports serving multiple models with different backends simultaneously, making it suitable for complex inference pipelines.

# model_repository/llama_pipeline/config.pbtxt for LLM serving
name: "llama_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  {
    name: "INPUT_TEXT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "OUTPUT_TEXT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "INPUT_TEXT" value: "INPUT_TEXT" }
      output_map { key: "TOKEN_IDS" value: "TOKEN_IDS" }
    },
    {
      model_name: "llama_engine"
      model_version: -1
      input_map { key: "INPUT_IDS" value: "TOKEN_IDS" }
      output_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
    },
    {
      model_name: "detokenizer"
      model_version: -1
      input_map { key: "OUTPUT_IDS" value: "OUTPUT_IDS" }
      output_map { key: "OUTPUT_TEXT" value: "OUTPUT_TEXT" }
    }
  ]
}
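
The ensemble configuration wires three models together by tensor name: in each step, `input_map` maps a model's input name to an ensemble tensor, and `output_map` maps a model's output name to an ensemble tensor. The sketch below imitates that data flow in plain Python. The three model functions are trivial stand-ins (the "engine" just reverses the ids), and `run_ensemble` is a hypothetical helper, not part of Triton.

```python
def tokenizer(INPUT_TEXT):
    # Stand-in tokenizer: one id per character
    return {"TOKEN_IDS": [ord(ch) for ch in INPUT_TEXT]}

def llama_engine(INPUT_IDS):
    # Stand-in for generation: return the ids in reverse order
    return {"OUTPUT_IDS": list(reversed(INPUT_IDS))}

def detokenizer(OUTPUT_IDS):
    return {"OUTPUT_TEXT": "".join(chr(i) for i in OUTPUT_IDS)}

MODELS = {"tokenizer": tokenizer, "llama_engine": llama_engine, "detokenizer": detokenizer}

# Same wiring as the ensemble_scheduling steps in the config
STEPS = [
    {"model_name": "tokenizer",
     "input_map": {"INPUT_TEXT": "INPUT_TEXT"},
     "output_map": {"TOKEN_IDS": "TOKEN_IDS"}},
    {"model_name": "llama_engine",
     "input_map": {"INPUT_IDS": "TOKEN_IDS"},
     "output_map": {"OUTPUT_IDS": "OUTPUT_IDS"}},
    {"model_name": "detokenizer",
     "input_map": {"OUTPUT_IDS": "OUTPUT_IDS"},
     "output_map": {"OUTPUT_TEXT": "OUTPUT_TEXT"}},
]

def run_ensemble(steps, models, inputs):
    """Run steps in order, passing tensors through a shared name -> value pool."""
    tensors = dict(inputs)
    for step in steps:
        model = models[step["model_name"]]
        # input_map: model input name -> ensemble tensor name
        model_inputs = {name: tensors[src] for name, src in step["input_map"].items()}
        model_outputs = model(**model_inputs)
        # output_map: model output name -> ensemble tensor name
        for name, dst in step["output_map"].items():
            tensors[dst] = model_outputs[name]
    return tensors

result = run_ensemble(STEPS, MODELS, {"INPUT_TEXT": "abc"})
print(result["OUTPUT_TEXT"])
```

The second step shows why the two names in a mapping can differ: the engine's input is called `INPUT_IDS`, but it is fed from the ensemble tensor `TOKEN_IDS` produced by the tokenizer.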

Auto-Scaling for LLM Workloads

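LLM serving requires GPU-aware auto-scaling that considers GPU load (utilization or VRAM usage) and request queue depth rather than CPU utilization. The Pods metrics used below are custom metrics, so the cluster needs a metrics adapter (for example, Prometheus Adapter) that exposes them to the HPA.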

# Kubernetes HPA configuration for LLM serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "85"
  - type: Pods
    pods:
      metric:
        name: request_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
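
How the HPA turns these metrics into a replica count can be sketched in a few lines. For each metric the controller computes ceil(currentReplicas × currentValue / targetValue), takes the largest proposal, and clamps it to the min/max bounds; ratios within a default tolerance of 0.1 around 1.0 cause no change. The sample metric readings below are invented, and the function names are illustrative only.

```python
import math

def desired_replicas(current_replicas, metrics, min_replicas, max_replicas, tolerance=0.1):
    """metrics is a list of (current_average_value, target_average_value) pairs."""
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current_replicas)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The metric asking for the most replicas wins, within the configured bounds
    return max(min_replicas, min(max_replicas, max(proposals)))

def apply_scale_up_policy(current_replicas, desired, max_added_pods):
    """A 'Pods' scale-up policy caps how many pods may be added per period."""
    return min(desired, current_replicas + max_added_pods)

# Invented readings: 95% average GPU utilization, 25 queued requests per pod
target = desired_replicas(4, [(95, 85), (25, 10)], min_replicas=2, max_replicas=16)
print("desired:", target)
print("after one scale-up period:", apply_scale_up_policy(4, target, max_added_pods=2))
```

In this example the queue-depth metric dominates, and the scale-up policy from the manifest (2 pods per 60 seconds) spreads the increase over several periods.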

Cost Comparison

Setup | GPUs | Throughput (tokens/s) | Cost/hr | Cost/1M tokens
--- | --- | --- | --- | ---
vLLM + A100 80GB | 4 | ~8,000 | $12.00 | ~$1.50
TGI + A100 40GB | 2 | ~4,000 | $6.00 | ~$1.50
TensorRT-LLM + H100 | 4 | ~15,000 | $24.00 | ~$1.60
OpenAI API | N/A | ~1,000 | N/A | $10.00-$30.00

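Self-hosting becomes cost-effective at sustained scale (roughly >1M tokens/day as a rule of thumb), while API-based serving is more economical for variable or low-volume workloads. The actual break-even point shifts with GPU pricing, cluster utilization, and the API price used as the baseline.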
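
The relationship between the table's columns can be worked through directly. The sketch below derives cost per 1M tokens from hourly cost and throughput, and estimates a break-even volume against an API price. The `utilization` parameter and the always-on (24 hours/day) cluster are assumptions introduced here; the table does not say what utilization its per-token figures assume.

```python
def cost_per_million_tokens(cost_per_hour, tokens_per_second, utilization=1.0):
    """Self-hosted cost per 1M tokens at a given fraction of peak throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return cost_per_hour / tokens_per_hour * 1_000_000

def break_even_tokens_per_day(cost_per_hour, api_price_per_million):
    """Daily volume at which an always-on cluster costs the same as the API."""
    daily_cost = cost_per_hour * 24
    return daily_cost / api_price_per_million * 1_000_000

# vLLM + A100 80GB row: $12.00/hr at ~8,000 tokens/s
print(round(cost_per_million_tokens(12.0, 8000), 2))                    # full utilization
print(round(cost_per_million_tokens(12.0, 8000, utilization=0.28), 2))  # partial utilization

# Break-even against the API price range from the table
print(round(break_even_tokens_per_day(12.0, 30.0)))
print(round(break_even_tokens_per_day(12.0, 10.0)))
```

At full utilization the first row works out to well under the ~$1.50 listed, so the table's per-token figures presumably assume the GPUs are busy only part of the time. With this four-GPU cluster running around the clock, the break-even volume lands in the millions of tokens per day; a smaller or cheaper deployment would lower it.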
