vLLM Multi-Model Serving on H100/H200
Production LLM inference platform spanning 7B to 72B models, delivering sub-200ms P99 TTFT at 1,000+ concurrent requests — with AWQ 4-bit quantization, tensor parallelism, and HPA on DCGM GPU metrics.
- vLLM
- NVIDIA H100/H200
- AWQ 4-bit
- Tensor Parallelism
- Continuous Batching
- Prefix Caching
- DCGM
- HPA
- Kubernetes
The problem
Internal teams needed a single, reliable inference platform capable of serving both general-purpose chat (7B–13B) and specialized reasoning workloads (70B+) with consistent latency SLOs. Off-the-shelf inference stacks either burned through VRAM unnecessarily or collapsed under real concurrent workloads.
Architecture
┌────────────────────────────────────────────────────────────┐
│ Clients: Open WebUI · n8n · LangChain · Mattermost bots │
└─────────────────────────┬──────────────────────────────────┘
▼
┌──────────────────────┐
│ LiteLLM Gateway │ OpenAI-compatible API
│ (rate limit, keys, │ load balance + fallback
│ cost per team) │
└─────────┬────────────┘
▼
┌────────────────────┴──────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ vLLM (7B/13B) │ │ vLLM (70B/72B) │
│ 1 × H100 │ │ 4 × H200 TP=4 │
│ AWQ 4-bit │ │ AWQ 4-bit │
│ Prefix caching │ │ Continuous batch │
└────────┬─────────┘ └────────┬─────────┘
│ │
└──────────────┬──────────────────────────┘
▼
Prometheus · DCGM · Grafana · Alertmanager
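The gateway layer in the diagram above can be sketched as a LiteLLM proxy config; the model names, endpoints, and budget values here are illustrative, not the production ones:

```yaml
model_list:
  - model_name: chat-small             # served by the 7B/13B pool
    litellm_params:
      model: openai/qwen2.5-7b-awq     # hypothetical backend model name
      api_base: http://vllm-7b.svc:8000/v1
  - model_name: reasoning-large        # served by the 70B/72B pool
    litellm_params:
      model: openai/qwen2.5-72b-awq
      api_base: http://vllm-72b.svc:8000/v1

router_settings:
  routing_strategy: least-busy         # load balance across replicas
  num_retries: 2                       # retry/fallback on transient failures

litellm_settings:
  max_budget: 100                      # per-key cost control (USD), illustrative
```

Consumers call the gateway with the standard OpenAI client and a `model_name` from the list; which vLLM pool actually serves the request stays an implementation detail.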
Implementation highlights
- Quantization: AWQ 4-bit (AutoAWQ) shrank a 72B checkpoint from ~144 GB to ~36 GB of weight VRAM with negligible quality loss, validated against MMLU and HumanEval benchmark gates before production promotion.
- Tensor parallelism: TP=4 across NVLink-connected H200s; careful NCCL tuning and pinned CPU memory to avoid PCIe bottlenecks.
- Continuous batching + prefix caching: vLLM's PagedAttention paired with prefix caching drove token throughput up ~3× on mixed-length conversational workloads.
- Autoscaling: Kubernetes HPA wired to DCGM GPU utilization + vLLM queue depth metrics rather than CPU — avoided the common failure mode of scaling too late.
- Health: liveness/readiness probes on /health, with a separate warm-up probe so Kubernetes doesn't route traffic before the model is fully loaded into VRAM.
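The TP=4 setup for the large pool can be expressed as a container-spec fragment; the checkpoint name and resource values are illustrative, while the vLLM flags (`--quantization awq`, `--tensor-parallel-size`) and the `NCCL_P2P_LEVEL` variable are real knobs:

```yaml
# Deployment container fragment for the 70B/72B pool (values illustrative)
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=Qwen/Qwen2.5-72B-Instruct-AWQ   # hypothetical checkpoint
      - --quantization=awq
      - --tensor-parallel-size=4                # TP=4 across NVLink-connected H200s
    env:
      - name: NCCL_P2P_LEVEL
        value: "NVL"                            # prefer NVLink for peer-to-peer transfers
    resources:
      limits:
        nvidia.com/gpu: 4
```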
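The autoscaling wiring can be sketched as an `autoscaling/v2` HPA over custom metrics surfaced through prometheus-adapter. `DCGM_FI_DEV_GPU_UTIL` is a real DCGM field; the queue-depth metric name and all thresholds here are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-7b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL       # GPU utilization via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting  # queue depth (adapter-exposed name assumed)
        target:
          type: AverageValue
          averageValue: "10"
```

Scaling on GPU utilization plus queue depth reacts before requests pile up, which CPU-based scaling on an inference pod never does.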
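The probe setup above can be sketched as pod-spec fragments; the port and thresholds are illustrative, and the `startupProbe` plays the warm-up role so traffic is held back until model load completes:

```yaml
startupProbe:                # warm-up gate: no routing until weights are in VRAM
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60       # allow up to ~10 min for a large checkpoint to load
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
```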
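The ~144 GB → ~36 GB figure above follows directly from weight-storage arithmetic. A minimal sketch, counting weights only (the real AWQ footprint is slightly higher because per-group scales and zero-points are stored alongside the 4-bit weights, and KV cache and activations come on top):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory: params * (bits / 8) bytes, reported in GB (1 GB = 1e9 B)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_vram_gb(72, 16))  # FP16 baseline -> 144.0
print(weight_vram_gb(72, 4))   # AWQ 4-bit     -> 36.0
```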
Observability
- vLLM Prometheus metrics: e2e_request_latency_seconds, num_requests_running, gpu_cache_usage_perc, generation_tokens_total.
- Alertmanager rules for P99 TTFT degradation, KV-cache exhaustion, and GPU hang detection.
- 30+ Grafana dashboards covering golden signals per model, per tenant, per region.
- Distributed tracing across LiteLLM → vLLM for end-to-end request attribution.
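The P99 TTFT alert can be expressed as a Prometheus rule over vLLM's TTFT histogram (`vllm:time_to_first_token_seconds` is the real metric; the rule name, window, and `for` duration are illustrative, and the threshold matches the 200 ms SLO):

```yaml
groups:
  - name: vllm-slo
    rules:
      - alert: P99TTFTDegraded
        expr: |
          histogram_quantile(0.99,
            sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name)
          ) > 0.2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P99 TTFT above 200 ms for {{ $labels.model_name }}"
```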
Outcomes
- P99 TTFT < 200 ms at 1,000+ concurrent requests.
- 72B deployable on 4× H200 via AWQ (would otherwise require 8× H100 at FP16).
- Zero-downtime model upgrades via Argo Rollouts + Seldon Core canary with Prometheus analysis gates.
- Unified OpenAI-compatible API via LiteLLM — consumers don't know or care which backend serves them.