Air-Gapped Sovereign LLM Ops

End-to-end pipeline to deploy modern LLMs into fully isolated, sovereign environments — from an internet-connected bastion, across a one-way data diode, to an air-gapped Kubernetes cluster with its own model registry. Deployed for GOVINT clients.

  • Air-Gapped
  • Offline HF Mirror
  • AWQ
  • Data Diode
  • Harbor
  • MinIO Model Registry
  • Kyverno
  • Sovereign

The problem

Sovereign and GOVINT customers cannot reach the public internet — no direct HuggingFace pulls, no docker.io, no ghcr.io, no GitHub Releases. They still need the latest open-weight models, reproducible deploys, and the ability to audit what's running in production. The entire supply chain has to be transferred across an air gap with integrity guarantees.

Pipeline

  ┌──────────────────────────────────────────────┐
  │  BASTION (internet-connected, staging)       │
  │  • HF snapshot_download (HF_HUB_OFFLINE=0)   │
  │  • AutoAWQ 4-bit quantization                │
  │  • MMLU / HumanEval validation gates         │
  │  • Package: weights + tokenizer + configs    │
  │  • Sign + checksum                           │
  └────────────────────┬─────────────────────────┘
                       │  (one-way)
                       ▼  data diode
  ┌──────────────────────────────────────────────┐
  │  AIR-GAPPED SIDE                             │
  │  • Harbor registry (nvcr.io/docker.io mirror)│
  │  • MinIO model registry (versioned artefacts)│
  │  • MLflow Registry (internal)                │
  │  • Kyverno admission policy:                 │
  │      reject Deployments referencing          │
  │      unregistered model paths                │
  │  • vLLM / Triton on isolated K8s cluster     │
  └──────────────────────────────────────────────┘
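The bastion-side "sign + checksum" step can be sketched as a per-file SHA-256 manifest over the staged package. This is a minimal sketch: the directory and demo file here are hypothetical stand-ins for the real snapshot_download + AutoAWQ output, and signing of the manifest itself is out of scope.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def build_manifest(package_dir: Path) -> dict:
    """SHA-256 every file in the staged package (weights, tokenizer,
    configs) so the air-gapped side can verify integrity byte for byte
    after the one-way diode transfer."""
    manifest = {}
    for path in sorted(package_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as f:
                # Stream in 1 MiB chunks; weight shards are far too
                # large to read into memory at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[str(path.relative_to(package_dir))] = digest.hexdigest()
    return manifest

# Demo on a throwaway directory standing in for the real package:
pkg = Path(tempfile.mkdtemp())
(pkg / "config.json").write_text('{"model_type": "llama"}')
manifest = build_manifest(pkg)
(pkg / "MANIFEST.sha256.json").write_text(json.dumps(manifest, indent=2))
```

The manifest would then be signed with an offline key before the package crosses the diode, so the receiving side can verify both integrity and origin.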

Implementation highlights

  • Offline HuggingFace mode: on the production side, HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 are enforced — every artefact (model, tokenizer, config, chat template) must be present locally or the pod refuses to start.
  • AWQ before transfer: quantizing on the bastion reduces the bytes that have to cross the diode (144GB → 36GB for a 72B model) and removes the need for GPU quantization tooling on the air-gapped side.
  • Quality gates: quantized checkpoints run through MMLU and HumanEval benchmarks on the bastion; a quality regression above the configured threshold blocks transfer.
  • Registry mirrors: Harbor mirrors nvcr.io, docker.io, and ghcr.io — internal deployments pull only from Harbor.
  • MinIO as model registry: versioned, checksummed artefact storage with MLflow Registry pointing at MinIO. Provenance is auditable.
  • Kyverno enforcement: admission policy rejects any vLLM Deployment referencing a model path that isn't registered — closing the "someone sideloaded a model" audit gap.
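The offline-mode and Kyverno bullets above reduce to a handful of preflight checks before a serving pod admits traffic. A minimal sketch, with caveats: the artefact list and registered-path set are hypothetical, and in production the path check is enforced declaratively by Kyverno at admission time rather than inside the pod.

```python
import os
from pathlib import Path

# Hypothetical minimal artefact set; a real deployment also requires
# weight shards, the chat template, generation config, etc.
REQUIRED_ARTEFACTS = ["config.json", "tokenizer_config.json"]

# Stand-in for the MinIO/MLflow-backed registry of approved model paths.
REGISTERED_MODELS = {"/models/llama-72b-awq"}

def preflight(model_dir: str) -> None:
    """Refuse to serve unless offline mode is enforced, the model path
    is registered, and every required artefact is present locally --
    no silent fallback to the public hub."""
    if (os.environ.get("HF_HUB_OFFLINE") != "1"
            or os.environ.get("TRANSFORMERS_OFFLINE") != "1"):
        raise RuntimeError("offline mode not enforced; refusing to start")
    if model_dir not in REGISTERED_MODELS:
        raise RuntimeError(f"unregistered model path: {model_dir}")
    missing = [f for f in REQUIRED_ARTEFACTS
               if not (Path(model_dir) / f).is_file()]
    if missing:
        raise RuntimeError(f"missing local artefacts: {missing}")
```

Failing closed at startup means a misconfigured pod never reaches the Ready state, which is easier to audit than a pod that quietly attempted an egress call.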

Threat & compliance model

  • Full provenance chain: model → quantization → validation → transfer → registry → deploy, every step logged to an immutable audit trail.
  • Zero-trust NetworkPolicies inside the cluster; egress denied by default.
  • Keycloak OIDC for human access; short-lived tokens; no shared kubeconfigs.
  • Falco + Trivy + OPA Gatekeeper enforce runtime and build-time controls.
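The receive-side half of the provenance chain can be sketched as a manifest verification pass before a package is promoted into the registries. Assumptions: MANIFEST.sha256.json is a hypothetical per-file SHA-256 manifest produced on the bastion, and verification of the manifest's own signature is elided.

```python
import hashlib
import json
from pathlib import Path

def verify_package(package_dir: Path) -> None:
    """Recompute SHA-256 for every file named in the manifest and fail
    loudly on any missing file or mismatch before the package is
    promoted into the Harbor/MinIO registries."""
    manifest = json.loads((package_dir / "MANIFEST.sha256.json").read_text())
    for rel_path, expected in manifest.items():
        target = package_dir / rel_path
        if not target.is_file():
            raise RuntimeError(f"missing file: {rel_path}")
        digest = hashlib.sha256()
        with target.open("rb") as f:
            # Stream in 1 MiB chunks to keep memory flat on large shards.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            raise RuntimeError(f"checksum mismatch: {rel_path}")
```

Any failure here quarantines the package and generates an audit event rather than letting a partial or tampered transfer reach production.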

Outcomes

  • Sovereign clients able to run current open-weight LLMs without exposure to the public internet.
  • Reproducible, auditable pipeline — what's in production matches what was validated, byte for byte.
  • Same developer experience on the air-gapped side as in cloud environments.
  • Pattern reused across GOVINT, OSINT, and Smart Nation verticals at G42.
