LangGraph Multi-Agent SRE Automation
Stateful multi-agent workflow with a supervisor topology — routes Prometheus alerts to specialist agents for retrieval, execution, and validation. Human-in-the-loop on destructive actions, full audit logging, and a 40% reduction in overnight on-call interventions.
- LangGraph
- Supervisor Topology
- Qdrant
- RAG
- MCP
- n8n
- HITL Gates
- Audit Logging
The problem
Overnight on-call pages for recurring, well-documented incidents (disk pressure, runaway pods, noisy neighbors) created alert fatigue and degraded team availability. The runbooks already described safe, repeatable actions, but they had to be executed by hand: no system existed to run them with guardrails.
Architecture
   Alertmanager webhook
             │
             ▼
  ┌─────────────────────┐
  │  Supervisor Agent   │  classifies alert → routes
  └───┬────────┬────────┘
      │        │
      ▼        ▼
  ┌────────┐ ┌──────────┐
  │ Retr.  │ │ Exec.    │──► kubectl / Ansible (RBAC-scoped SA)
  │ Agent  │ │ Agent    │    via MCP tool servers
  │ Qdrant │ └──┬───────┘
  │ RAG    │    │
  └────────┘    ▼
             ┌──────────┐
             │ Valid.   │──► Prometheus query verification
             │ Agent    │
             └──┬───────┘
                │
                ▼
  HITL approval gate on destructive ops
                │
                ▼
  Audit log (MinIO + Loki) + Mattermost post
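The routing in the diagram can be sketched as a small state machine. This is a library-independent illustration in plain Python, not the LangGraph code used in the project: the state fields, the alert names in `COVERED_CLASSES`, the `human` fallback node, and the linear retrieval → execution → validation ordering are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class IncidentState:
    """Shared state handed from node to node (field names are illustrative)."""
    alert_name: str
    labels: Dict[str, str]
    runbook: Optional[str] = None
    action: Optional[str] = None
    resolved: bool = False
    trail: List[str] = field(default_factory=list)  # ordered record of visited nodes


# Hypothetical mapping from alert name to a covered alert class.
COVERED_CLASSES = {"NodeDiskPressure": "disk_pressure", "PodCPUThrottling": "runaway_pod"}


def supervisor(state: IncidentState) -> str:
    state.trail.append("supervisor")
    # Covered alerts go to the specialists; everything else goes to a person.
    return "retrieval" if state.alert_name in COVERED_CLASSES else "human"


def retrieval(state: IncidentState) -> str:
    state.trail.append("retrieval")
    state.runbook = f"runbook:{COVERED_CLASSES[state.alert_name]}"  # stand-in for a vector search
    return "execution"


def execution(state: IncidentState) -> str:
    state.trail.append("execution")
    state.action = f"apply {state.runbook}"  # stand-in for an MCP tool call
    return "validation"


def validation(state: IncidentState) -> str:
    state.trail.append("validation")
    state.resolved = True  # stand-in for re-querying Prometheus
    return "end"


def human(state: IncidentState) -> str:
    state.trail.append("human")
    return "end"


NODES: Dict[str, Callable[[IncidentState], str]] = {
    "supervisor": supervisor,
    "retrieval": retrieval,
    "execution": execution,
    "validation": validation,
    "human": human,
}


def run(state: IncidentState, max_steps: int = 10) -> IncidentState:
    """Follow the edge each node returns until a node returns 'end'."""
    node = "supervisor"
    for _ in range(max_steps):
        node = NODES[node](state)
        if node == "end":
            return state
    raise RuntimeError("graph did not terminate")
```

Each node returns the name of the next node, which keeps the routing decision explicit and makes the `trail` list a natural input for an audit log.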
Implementation highlights
- LangGraph supervisor: stateful graph with typed edges between agents; failed transitions retry with exponential backoff and a dead-letter path to a human operator.
- Retrieval agent: Qdrant vector store (Kubernetes StatefulSet on Longhorn PVCs) indexing runbooks, post-mortems, and architectural docs — sub-50 ms retrieval.
- Execution agent: invokes MCP tool servers that wrap the Kubernetes API, GitLab, and Prometheus. The ServiceAccount is RBAC-scoped per namespace, so even if the LLM misbehaves, the blast radius stays bounded.
- Validation agent: re-queries Prometheus after remediation to confirm the symptom resolved before closing the incident.
- HITL: any action classified as destructive (pod deletion, node drain, scaling down) posts to Mattermost for human approval; if no one responds before the timeout, the request escalates to PagerDuty.
- Audit: every state transition, prompt, and tool call persisted to MinIO + Loki — reviewable post-incident.
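The retry behaviour described for failed transitions could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: `run_with_retry`, its parameters, and the default values (4 attempts, 0.5 s base delay, factor 2) are hypothetical, and the dead-letter handler is modelled as a plain callback that would hand the failure to a human operator.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    dead_letter: Callable[[Exception], None],
    attempts: int = 4,
    base: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run `step`, retrying with exponential backoff; dead-letter on final failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return step()
        except Exception as exc:  # a real system would catch narrower error types
            last_error = exc
            # Wait base * factor**attempt seconds, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base * factor ** attempt)
    # Retries exhausted: hand the failure to the dead-letter path.
    dead_letter(last_error)
    return None
```

Passing `sleep` as a parameter keeps the function testable without real delays; in production the default `time.sleep` (or an async equivalent) would apply.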
Safety model
- Tool exposure is explicit via MCP — no arbitrary shell, no broad kubeconfig.
- Agents run as a dedicated ServiceAccount with NetworkPolicy isolation; egress locked to the MCP endpoints and internal vLLM.
- Every destructive action is behind a human gate. Every read action is logged.
- Kill-switch: a single ConfigMap flag disables the execution agent globally.
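As a rough illustration of how the kill-switch and the human gate might combine, the sketch below evaluates both before any action runs. The names are assumptions for the example: `DESTRUCTIVE_ACTIONS` uses made-up action identifiers, `execution_enabled` stands in for the value read from the ConfigMap flag, and `request_approval` is a hypothetical callback that returns `True`, `False`, or `None` when the approval request times out.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    EXECUTE = "execute"      # safe to run the action
    BLOCKED = "blocked"      # kill-switch disabled the execution agent
    REJECTED = "rejected"    # a human declined the action
    ESCALATED = "escalated"  # nobody answered before the timeout


# Hypothetical identifiers for the destructive actions named in the text.
DESTRUCTIVE_ACTIONS = frozenset({"delete_pod", "drain_node", "scale_down"})


def gate(
    action: str,
    execution_enabled: bool,
    request_approval: Callable[[str], Optional[bool]],
) -> Decision:
    """Decide whether an action may run, checking the kill-switch first."""
    if not execution_enabled:
        return Decision.BLOCKED
    if action not in DESTRUCTIVE_ACTIONS:
        return Decision.EXECUTE  # non-destructive actions run without approval
    answer = request_approval(action)  # None means the request timed out
    if answer is None:
        return Decision.ESCALATED
    return Decision.EXECUTE if answer else Decision.REJECTED
```

Checking the kill-switch before anything else means a disabled execution agent never even posts an approval request.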
Outcomes
- −40% overnight on-call interventions in the first 90 days.
- −50% operational toil overall (measured by tickets resolved without human touch).
- Mean time to triage on covered alert classes dropped from minutes to seconds.
- Runbook quality improved — the system exposes gaps in documentation, driving continuous improvement.