
vLLM Multi-Model Serving on H100/H200

Production LLM inference platform spanning 7B to 72B models, delivering sub-200ms P99 TTFT at 1,000+ concurrent requests — with AWQ 4-bit quantization, tensor parallelism, and HPA on DCGM GPU metrics.

  • vLLM
  • NVIDIA H100/H200
  • AWQ 4-bit
  • Tensor Parallelism
  • Continuous Batching
  • Prefix Caching
  • DCGM
  • HPA
  • Kubernetes

The problem

Internal teams needed a single, reliable inference platform capable of serving both general-purpose chat (7B–13B) and specialized reasoning workloads (70B+) with consistent latency SLOs. Off-the-shelf inference stacks either burned through VRAM unnecessarily or collapsed under real concurrent workloads.

Architecture

┌────────────────────────────────────────────────────────────┐
│  Clients: Open WebUI · n8n · LangChain · Mattermost bots   │
└─────────────────────────┬──────────────────────────────────┘
                          ▼
              ┌──────────────────────┐
              │  LiteLLM Gateway     │  OpenAI-compatible API
              │  (rate limit, keys,  │  load balance + fallback
              │   cost per team)     │
              └─────────┬────────────┘
                        ▼
   ┌────────────────────┴──────────────────────┐
   ▼                                             ▼
┌──────────────────┐                      ┌──────────────────┐
│ vLLM (7B/13B)    │                      │ vLLM (70B/72B)   │
│ 1 × H100         │                      │ 4 × H200 TP=4    │
│ AWQ 4-bit        │                      │ AWQ 4-bit        │
│ Prefix caching   │                      │ Continuous batch │
└────────┬─────────┘                      └────────┬─────────┘
         │                                          │
         └──────────────┬──────────────────────────┘
                        ▼
          Prometheus · DCGM · Grafana · Alertmanager
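The gateway's load-balance-and-fallback behavior can be sketched as follows. In production this lives in LiteLLM's declarative router configuration, not application code; the pool names, aliases, and hostnames below are hypothetical.

```python
# Illustrative sketch of the gateway's routing + fallback decision.
# Names and endpoints are hypothetical; LiteLLM implements this declaratively.
from typing import Callable

POOLS = {
    "chat-small": ["vllm-7b-0:8000", "vllm-13b-0:8000"],  # 1x H100 pods
    "reason-72b": ["vllm-72b-0:8000"],                    # 4x H200, TP=4
}
FALLBACKS = {"reason-72b": "chat-small"}  # degrade rather than fail hard

def route(alias: str, healthy: Callable[[str], bool]) -> str:
    """Pick the first healthy backend for a model alias, falling back if none."""
    for backend in POOLS.get(alias, []):
        if healthy(backend):
            return backend
    fallback = FALLBACKS.get(alias)
    if fallback is not None:
        return route(fallback, healthy)
    raise RuntimeError(f"no healthy backend for {alias!r}")
```

A saturated 72B pool degrades to the small-model pool instead of returning errors, which is the fallback behavior the diagram labels on the gateway box.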

Implementation highlights

  • Quantization: AWQ 4-bit (AutoAWQ) shrank a 72B checkpoint from ~144 GB to ~36 GB of VRAM with negligible quality loss, validated against MMLU and HumanEval benchmark gates before promotion to production.
  • Tensor parallelism: TP=4 across NVLink-connected H200s; careful NCCL tuning and pinned CPU memory to avoid PCIe bottlenecks.
  • Continuous batching + prefix caching: vLLM's PagedAttention paired with prefix caching drove token throughput up ~3× on mixed-length conversational workloads.
  • Autoscaling: Kubernetes HPA wired to DCGM GPU utilization and vLLM queue-depth metrics rather than CPU, avoiding the common failure mode of scaling too late.
  • Health: liveness/readiness probes on /health, plus a separate startup probe for warm-up so Kubernetes doesn't route traffic before the model is fully loaded into VRAM.
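The highlights above come together in the launch command for the 72B pool. This is a sketch: the flag names match recent vLLM releases, but the model path and limits are placeholders to tune for your checkpoint and cluster.

```shell
# Illustrative vLLM launch for the 72B pool.
# --tensor-parallel-size 4    : shard across the four NVLink-connected H200s
# --enable-prefix-caching     : reuse KV cache for shared prompt prefixes
# --gpu-memory-utilization    : leave headroom for activations and NCCL buffers
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```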

Observability

  • vLLM Prometheus metrics: e2e_request_latency_seconds, num_requests_running, gpu_cache_usage_perc, generation_tokens_total.
  • Alertmanager rules for P99 TTFT degradation, KV-cache exhaustion, and GPU hang detection.
  • 30+ Grafana dashboards covering golden signals per model, per tenant, per region.
  • Distributed tracing across LiteLLM → vLLM for end-to-end request attribution.
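The P99 TTFT alert can be expressed as a Prometheus rule along these lines. This is a sketch: recent vLLM versions expose a `vllm:time_to_first_token_seconds` histogram and a `vllm:gpu_cache_usage_perc` gauge, but verify metric names, labels, and thresholds against your deployment.

```yaml
groups:
  - name: vllm-slo
    rules:
      - alert: VLLMTTFTP99Degraded
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name)
          ) > 0.200
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P99 TTFT above 200 ms for {{ $labels.model_name }}"
      - alert: VLLMKVCacheNearExhaustion
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "KV cache nearly full; new requests will queue or preempt"
```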

Outcomes

  • P99 TTFT < 200 ms at 1,000+ concurrent requests.
  • 72B deployable on 4× H200 via AWQ (would otherwise require 8× H100 at FP16).
  • Zero-downtime model upgrades via Argo Rollouts + Seldon Core canary with Prometheus analysis gates.
  • Unified OpenAI-compatible API via LiteLLM — consumers don't know or care which backend serves them.
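The VRAM arithmetic behind the AWQ outcome is a quick sanity check. Weights only: KV cache, activations, and CUDA/NCCL overhead come on top, which is why FP16 serving of a 72B model needs the larger 8-GPU footprint in practice.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_vram_gb(72, 16)  # -> 144.0 GB, matching the pre-quantization figure
awq4 = weight_vram_gb(72, 4)   # -> 36.0 GB after AWQ 4-bit
per_gpu_tp4 = awq4 / 4         # -> 9.0 GB of weights per H200 at TP=4
```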
