vLLM Multi-Model Serving on H100/H200
Production LLM inference platform spanning 7B to 72B models, delivering sub-200ms P99 TTFT at 1,000+ concurrent requests — with AWQ 4-bit quantization, tensor parallelism, and HPA on DCGM GPU metrics.
- vLLM
- NVIDIA H100/H200
- AWQ 4-bit
- Tensor Parallelism
- Continuous Batching
- Prefix Caching
- DCGM
- HPA
- Kubernetes
The problem
Internal teams needed a single, reliable inference platform capable of serving both general-purpose chat (7B–13B) and specialized reasoning workloads (70B+) with consistent latency SLOs. Off-the-shelf inference stacks either burned through VRAM unnecessarily or collapsed under real concurrent workloads.
Architecture
┌────────────────────────────────────────────────────────────┐
│ Clients: Open WebUI · n8n · LangChain · Mattermost bots │
└─────────────────────────┬──────────────────────────────────┘
▼
┌──────────────────────┐
│ LiteLLM Gateway │ OpenAI-compatible API
│ (rate limit, keys, │ load balance + fallback
│ cost per team) │
└─────────┬────────────┘
▼
┌────────────────────┴──────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ vLLM (7B/13B) │ │ vLLM (70B/72B) │
│ 1 × H100 │ │ 4 × H200 TP=4 │
│ AWQ 4-bit │ │ AWQ 4-bit │
│ Prefix caching │ │ Continuous batch │
└────────┬─────────┘ └────────┬─────────┘
│ │
└──────────────┬──────────────────────────┘
▼
Prometheus · DCGM · Grafana · Alertmanager
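The gateway layer in the diagram above can be sketched as a LiteLLM proxy config; the model names, endpoints, and budget values here are illustrative, not the production ones:

```yaml
model_list:
  - model_name: chat-small             # served by the 7B/13B pool
    litellm_params:
      model: openai/qwen2.5-7b-awq     # hypothetical backend model name
      api_base: http://vllm-7b.svc:8000/v1
  - model_name: reasoning-large        # served by the 70B/72B pool
    litellm_params:
      model: openai/qwen2.5-72b-awq
      api_base: http://vllm-72b.svc:8000/v1

router_settings:
  routing_strategy: least-busy         # load balance across replicas
  num_retries: 2                       # retry/fallback on transient failures

litellm_settings:
  max_budget: 100                      # per-key cost control (USD), illustrative
```

Consumers call the gateway with the standard OpenAI client and a `model_name` from the list; which vLLM pool actually serves the request stays an implementation detail.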
Implementation highlights
- Quantization: AWQ 4-bit (AutoAWQ) shrank a 72B checkpoint from ~144 GB to ~36 GB of weight VRAM with negligible quality loss, validated against MMLU and HumanEval benchmark gates before production promotion.
- Tensor parallelism: TP=4 across NVLink-connected H200s; careful NCCL tuning and pinned CPU memory to avoid PCIe bottlenecks.
- Continuous batching + prefix caching: vLLM's PagedAttention paired with prefix caching drove token throughput up ~3× on mixed-length conversational workloads.
- Autoscaling: Kubernetes HPA wired to DCGM GPU utilization + vLLM queue depth metrics rather than CPU — avoided the common failure mode of scaling too late.
- Health: liveness/readiness probes on /health, with a separate warm-up probe so Kubernetes doesn't route traffic before the model is fully loaded into VRAM.
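The TP=4 setup for the large pool can be expressed as a container-spec fragment; the checkpoint name and resource values are illustrative, while the vLLM flags (`--quantization awq`, `--tensor-parallel-size`) and the `NCCL_P2P_LEVEL` variable are real knobs:

```yaml
# Deployment container fragment for the 70B/72B pool (values illustrative)
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=Qwen/Qwen2.5-72B-Instruct-AWQ   # hypothetical checkpoint
      - --quantization=awq
      - --tensor-parallel-size=4                # TP=4 across NVLink-connected H200s
    env:
      - name: NCCL_P2P_LEVEL
        value: "NVL"                            # prefer NVLink for peer-to-peer transfers
    resources:
      limits:
        nvidia.com/gpu: 4
```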
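The autoscaling wiring can be sketched as an `autoscaling/v2` HPA over custom metrics surfaced through prometheus-adapter. `DCGM_FI_DEV_GPU_UTIL` is a real DCGM field; the queue-depth metric name and all thresholds here are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-7b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-7b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL       # GPU utilization via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting  # queue depth (adapter-exposed name assumed)
        target:
          type: AverageValue
          averageValue: "10"
```

Scaling on GPU utilization plus queue depth reacts before requests pile up, which CPU-based scaling on an inference pod never does.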
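The probe setup above can be sketched as pod-spec fragments; the port and thresholds are illustrative, and the `startupProbe` plays the warm-up role so traffic is held back until model load completes:

```yaml
startupProbe:                # warm-up gate: no routing until weights are in VRAM
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60       # allow up to ~10 min for a large checkpoint to load
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
```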
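The ~144 GB → ~36 GB figure above follows directly from weight-storage arithmetic. A minimal sketch, counting weights only (the real AWQ footprint is slightly higher because per-group scales and zero-points are stored alongside the 4-bit weights, and KV cache and activations come on top):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory: params * (bits / 8) bytes, reported in GB (1 GB = 1e9 B)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_vram_gb(72, 16))  # FP16 baseline -> 144.0
print(weight_vram_gb(72, 4))   # AWQ 4-bit     -> 36.0
```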
Observability
- vLLM Prometheus metrics: e2e_request_latency_seconds, num_requests_running, gpu_cache_usage_perc, generation_tokens_total.
- Alertmanager rules for P99 TTFT degradation, KV-cache exhaustion, and GPU hang detection.
- 30+ Grafana dashboards covering golden signals per model, per tenant, per region.
- Distributed tracing across LiteLLM → vLLM for end-to-end request attribution.
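The P99 TTFT alert can be expressed as a Prometheus rule over vLLM's TTFT histogram (`vllm:time_to_first_token_seconds` is the real metric; the rule name, window, and `for` duration are illustrative, and the threshold matches the 200 ms SLO):

```yaml
groups:
  - name: vllm-slo
    rules:
      - alert: P99TTFTDegraded
        expr: |
          histogram_quantile(0.99,
            sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name)
          ) > 0.2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P99 TTFT above 200 ms for {{ $labels.model_name }}"
```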
Outcomes
- P99 TTFT < 200 ms at 1,000+ concurrent requests.
- 72B deployable on 4× H200 via AWQ (would otherwise require 8× H100 at FP16).
- Zero-downtime model upgrades via Argo Rollouts + Seldon Core canary with Prometheus analysis gates.
- Unified OpenAI-compatible API via LiteLLM — consumers don't know or care which backend serves them.