
LangGraph Multi-Agent SRE Automation

Stateful multi-agent workflow with a supervisor topology — routes Prometheus alerts to specialist agents for retrieval, execution, and validation. Human-in-the-loop on destructive actions, full audit logging, and a 40% reduction in overnight on-call interventions.

LangGraph
Supervisor Topology
Qdrant
RAG
MCP
n8n
HITL Gates
Audit Logging

The problem

Overnight on-call pages for recurring, well-documented incidents (disk pressure, runaway pods, noisy neighbors) created alert fatigue and degraded team availability. Traditional runbooks required manual execution: the playbooks already described safe, repeatable actions, but there was no system to execute them with guardrails.

Architecture

  Alertmanager webhook
           │
           ▼
  ┌─────────────────────┐
  │  Supervisor Agent   │  classifies alert → routes
  └───┬────────┬────────┘
      │        │
      ▼        ▼
  ┌────────┐ ┌──────────┐
  │ Retr.  │ │ Exec.    │──► kubectl / Ansible (RBAC-scoped SA)
  │ Agent  │ │ Agent    │    via MCP tool servers
  │ Qdrant │ └──┬───────┘
  │  RAG   │    │
  └────────┘    ▼
              ┌──────────┐
              │ Valid.   │──► Prometheus query verification
              │ Agent    │
              └──┬───────┘
                 │
                 ▼
         HITL approval gate on destructive ops
                 │
                 ▼
         Audit log (MinIO + Loki) + Mattermost post
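
The routing in the diagram can be approximated as a small state machine. The sketch below is plain Python, not the LangGraph API, and it leaves out the HITL gate and audit sink. The state fields, the alert-to-class mapping, and the node bodies are hypothetical stand-ins for the real Qdrant lookup, MCP tool call, and Prometheus re-query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class IncidentState:
    """State handed from node to node (field names are assumptions)."""
    alert_name: str
    labels: Dict[str, str]
    runbook: Optional[str] = None
    action: Optional[str] = None
    resolved: bool = False
    trail: List[str] = field(default_factory=list)  # names of visited nodes


# Illustrative mapping from alert name to a covered incident class.
ALERT_CLASSES = {
    "KubePersistentVolumeFillingUp": "disk_pressure",
    "KubePodCrashLooping": "runaway_pod",
}


def supervisor(state: IncidentState) -> str:
    """Classify the alert; unknown alerts go straight to a human."""
    state.trail.append("supervisor")
    return "retrieval" if state.alert_name in ALERT_CLASSES else "human"


def retrieval(state: IncidentState) -> str:
    state.trail.append("retrieval")
    # Stand-in for a vector-store lookup of the matching runbook.
    state.runbook = "runbook:" + ALERT_CLASSES[state.alert_name]
    return "execution"


def execution(state: IncidentState) -> str:
    state.trail.append("execution")
    # Stand-in for a scoped tool call; nothing is executed here.
    state.action = "apply " + str(state.runbook)
    return "validation"


def validation(state: IncidentState) -> str:
    state.trail.append("validation")
    # Stand-in for re-querying metrics to confirm the symptom cleared.
    state.resolved = True
    return "end"


def human(state: IncidentState) -> str:
    state.trail.append("human")
    return "end"


NODES: Dict[str, Callable[[IncidentState], str]] = {
    "supervisor": supervisor,
    "retrieval": retrieval,
    "execution": execution,
    "validation": validation,
    "human": human,
}


def run(state: IncidentState, max_steps: int = 10) -> IncidentState:
    """Walk the graph from the supervisor until a node returns 'end'."""
    node = "supervisor"
    for _ in range(max_steps):
        if node == "end":
            return state
        node = NODES[node](state)
    raise RuntimeError("graph did not terminate")
```

Each node returns the name of the next node, which keeps the routing decision next to the logic that makes it; a graph framework would express the same thing as conditional edges.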

Implementation highlights

LangGraph supervisor: stateful graph with typed edges between agents; failed transitions retry with exponential backoff and a dead-letter path to a human operator.
Retrieval agent: Qdrant vector store (Kubernetes StatefulSet on Longhorn PVCs) indexing runbooks, post-mortems, and architectural docs — sub-50 ms retrieval.
Execution agent: invokes MCP tool servers that wrap the Kubernetes API, GitLab, and Prometheus. The ServiceAccount is RBAC-scoped per namespace, so even if the LLM misbehaves, the blast radius is bounded.
Validation agent: re-queries Prometheus after remediation to confirm the symptom resolved before closing the incident.
HITL: any action classified as destructive (pod deletion, node drain, scaling down) posts to Mattermost for human approval; timeout escalates to PagerDuty.
Audit: every state transition, prompt, and tool call persisted to MinIO + Loki — reviewable post-incident.
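
The retry behaviour on failed transitions might look roughly like the following. This is a minimal sketch, assuming a fixed attempt count, a doubling delay with a cap, and a callback that hands the final error to a human operator; the function names, defaults, and the `dead_letter` callback are hypothetical and not taken from the project.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 30.0) -> List[float]:
    """Delay before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def run_with_retry(step: Callable[[], T],
                   dead_letter: Callable[[Exception], None],
                   attempts: int = 3,
                   base: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Run `step`, retrying on any exception with exponential backoff.

    After the last failed attempt the error goes to `dead_letter`
    (for example, a notification to a human) and None is returned.
    """
    last_error: Optional[Exception] = None
    for i, delay in enumerate(backoff_delays(attempts, base)):
        try:
            return step()
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc
            if i < attempts - 1:  # no point sleeping after the final attempt
                sleep(delay)
    assert last_error is not None
    dead_letter(last_error)
    return None
```

Passing `sleep` as a parameter lets tests substitute a recorder instead of waiting in real time.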

Safety model

Tool exposure is explicit via MCP — no arbitrary shell, no broad kubeconfig.
Agents run as a dedicated ServiceAccount with NetworkPolicy isolation; egress locked to the MCP endpoints and internal vLLM.
Every destructive action is behind a human gate. Every read action is logged.
Kill-switch: a single ConfigMap flag disables the execution agent globally.
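
One way to picture how these rules combine is a single gate that every tool call passes through. The sketch below is an assumption about the shape of that check, not the project's code: the tool names, the `execution_enabled` flag (standing in for the ConfigMap value), the `approve` callback (standing in for the Mattermost approval), and the in-memory audit list are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tool names for the destructive classes named above:
# pod deletion, node drain, scaling down.
DESTRUCTIVE = {"delete_pod", "drain_node", "scale_down"}


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate(tool: str,
         execution_enabled: bool,
         approve: Callable[[str], bool],
         audit: List[Dict[str, object]]) -> Decision:
    """Decide whether a tool call may proceed, and record the decision.

    Order matters: the kill-switch wins over everything, then destructive
    tools need human approval, and everything else is allowed.
    """
    if not execution_enabled:
        decision = Decision(False, "kill-switch active")
    elif tool in DESTRUCTIVE:
        ok = approve(tool)  # blocks until a human answers or a timeout fires
        decision = Decision(ok, "human approved" if ok
                            else "human denied or timed out")
    else:
        decision = Decision(True, "non-destructive")
    # Every decision is logged, including allowed read-only calls.
    audit.append({"tool": tool, "allowed": decision.allowed,
                  "reason": decision.reason})
    return decision
```

Checking the kill-switch first means a disabled execution agent never prompts a human for approval on an action that would not run anyway.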

Outcomes

−40% overnight on-call interventions in the first 90 days.
−50% operational toil overall (measured by tickets resolved without human touch).
Mean time to triage on covered alert classes dropped from minutes to seconds.
Runbook quality improved: the system exposes gaps in documentation, which prompts fixes to the runbooks.
