Senior SRE · DevOps Lead · MLOps / LLMOps · AI Infrastructure

Building & running the infrastructure that makes AI work at scale.

Senior Site Reliability Engineer leading a 6-person SRE team at Group 42 (G42). 11+ years across bare-metal GPU clusters, production Kubernetes (AKS, RKE2, air-gapped, MIG-partitioned H100/H200), vLLM/Triton serving, MLOps pipelines, and sovereign LLM Ops delivered across multiple countries for government, intelligence, and smart-city workloads.

About

I build and run the infrastructure that makes AI work at scale — from bare-metal GPU clusters to production Kubernetes platforms serving government, intelligence, and smart-city workloads. Based in Abu Dhabi, UAE.

As Senior Site Reliability Engineer and DevOps Lead at Group 42 (G42), I lead a 6-person SRE team responsible for cloud and on-premises platforms across Azure, AWS, and air-gapped environments — delivering to AI, GOVINT, OSINT, RND, and Smart Nation verticals, with cross-border hypercare across multiple countries.

Over the past year I've expanded deep into MLOps and LLM infrastructure — designing production-grade LLM serving stacks on NVIDIA H100/H200 GPU clusters, including full-stack deployments of 72B-parameter models with vLLM tensor parallelism, observability via Prometheus + Grafana + DCGM Exporter, and enterprise API gateways using LiteLLM.

Core focus areas: Kubernetes platform engineering (AKS, RKE2, air-gapped, MIG-partitioned GPU nodes) · LLM / MLOps (vLLM, Triton, Ray Serve, MLflow, Kubeflow) · Cloud platforms (Azure, AWS) with FinOps and cost optimization · GitOps & CI/CD (ArgoCD, Flux, GitLab CI, Azure DevOps) · SRE practices (SLO/SLA, incident management, chaos engineering) · AI agent platforms (OpenClaw, n8n, LangGraph, Qdrant RAG).

Technical Skills

LLM Serving & Inference

  • vLLM
  • PagedAttention
  • Tensor Parallelism
  • AWQ / GPTQ
  • NVIDIA Triton
  • TensorRT
  • ONNX Runtime
  • LiteLLM Gateway
  • Ollama
  • Ray Serve
  • BentoML

Agentic AI & RAG

  • LangGraph
  • LangChain
  • AutoGen
  • Model Context Protocol (MCP)
  • n8n
  • OpenClaw
  • NemoClaw
  • Qdrant
  • Weaviate
  • Open WebUI
  • Tool-use pipelines
  • Human-in-the-loop

MLOps Toolchain

  • MLflow
  • Kubeflow Pipelines
  • Argo Workflows
  • Seldon Core
  • DVC
  • Weights & Biases
  • Feast
  • MinIO
  • HuggingFace Datasets
  • Kyverno Model Governance

Kubernetes & Cloud

  • AKS
  • EKS
  • RKE2
  • Rancher
  • Helm
  • ArgoCD
  • Argo Rollouts
  • Kustomize
  • Azure
  • AWS
  • G42 Cloud
  • Harbor
  • Docker / Containerd

GPU & HPC

  • NVIDIA H100 / H200 / A100
  • MIG (GPU Operator)
  • DCGM Exporter
  • KubeRay
  • CUDA / cuDNN
  • InfiniBand / NVLink
  • NCCL Tuning

Observability & Security

  • Prometheus
  • Grafana (30+ dashboards)
  • Loki
  • Fluent Bit
  • Alertmanager
  • Distributed Tracing
  • Falco
  • Trivy
  • OPA Gatekeeper
  • Kyverno
  • Keycloak OIDC
  • Zero-Trust NetworkPolicies

Air-Gapped LLM Ops

  • HF Hub Mirroring (offline)
  • Harbor Registry Mirrors
  • Data Diode Transfer
  • AWQ Size Reduction
  • Internal MinIO Model Registry
  • Sovereign Clusters

IaC & Scripting

  • Terraform
  • Ansible
  • GitLab CI/CD
  • GitOps
  • Python (LangGraph / MLflow / Kubeflow SDKs)
  • Bash

Experience

  1. 03/2023 – Present Abu Dhabi, UAE · International

    Senior Site Reliability Engineer · DevOps Lead

    Group 42 (G42)

    • Lead a 6-person SRE team delivering cloud and on-premises platform engineering across AI, GOVINT, OSINT, RND, and Smart Nation verticals.
    • Designed and deployed production LLM serving on NVIDIA H100/H200 GPU clusters — vLLM with tensor parallelism, 7B to 72B models, AWQ 4-bit quantization (144GB → 36GB), continuous batching, prefix caching; P99 TTFT < 200 ms at 1,000+ concurrent requests.
    • Built a full GPU observability stack — Prometheus, Grafana, DCGM Exporter — with real-time visibility into GPU utilization, memory pressure, and inference throughput; 30+ dashboards.
    • Architected and managed multi-cluster Kubernetes (AKS, RKE2) including air-gapped on-premises deployments with MIG-partitioned H100/H200 GPU nodes for multi-tenant ML workloads.
    • Deployed enterprise AI platform components — LiteLLM API gateway, Open WebUI, OpenClaw AI agent platform, Mattermost, shared PostgreSQL backend — containerized on isolated bridge networks.
    • Designed LangGraph multi-agent SRE automation (supervisor + retrieval + execution + validation agents) with human-in-the-loop gates — cut overnight on-call interventions by 40%; MCP tool servers reduced operational toil by 50%.
    • Implemented GitOps-driven platform lifecycle with ArgoCD and Flux across dev/staging/prod; led SRE practice adoption — SLO/SLA definition, error budgets, incident runbooks, chaos engineering.
    • Managed Azure Entra ID RBAC, service-principal governance, and least-privilege access for enterprise applications and CI/CD pipelines.
    • Delivered cross-border infrastructure projects with international client teams from architecture through hypercare.
  2. 07/2019 – 03/2023 United Arab Emirates

    Site Reliability Engineer

    Group 42 (G42)

    • Designed and implemented CI/CD pipelines on Kubernetes with Terraform and Docker — significantly reduced deployment times and improved release quality.
    • Architected scalable, cost-effective cloud infrastructures on AWS and Azure, deploying AI/ML services with high availability and fault tolerance.
    • Automated infrastructure provisioning and configuration with Ansible and Terraform; designed Azure security solutions aligned with compliance requirements.
    • Deployed and managed the ELK stack (Elasticsearch, Logstash, Kibana) for centralized logging and real-time monitoring.
    • Configured and managed AWS services — EC2, S3, RDS, Lambda, ECS, EKS, Load Balancers — with CloudWatch, CloudTrail, Prometheus, and Grafana for observability.
    • Handled Linux system administration and database administration (backup/recovery, performance tuning, security) across production environments.
  3. 01/2018 – 07/2019 Abu Dhabi, UAE

    Cloud Engineer

    First Abu Dhabi Bank (FAB)

    • Delivered DevOps and cloud engineering for the FGB–NBAD banks migration and integration — one of the UAE's largest banking mergers — supporting mission-critical financial infrastructure under strict regulatory and availability requirements.
    • Managed enterprise Azure cloud infrastructure with Terraform and Ansible for repeatable, audit-compliant environments.
    • Designed and maintained CI/CD pipelines for multiple development teams; implemented Docker + Kubernetes containerization strategies.
    • Contributed to Linux hardening, security patching, and compliance alignment for regulated banking workloads; coordinated on RBAC, network policy, and secrets management.
  4. 05/2017 – 01/2018 Abu Dhabi, UAE

    Technical Specialist

    HCL Infosystems Ltd.

    • Delivered a software-defined data centre transformation for Daman — Abu Dhabi's government-affiliated health insurance provider — including migration of existing Linux, Windows, and virtualization workloads to the new cloud-enabled data centre.
    • Implemented, administered, and managed Oracle Virtual Manager (OVM); owned installation, operational management, and capacity expansion.
    • Authored service reports covering executed tasks, findings, and solutions for the client.
  5. 08/2015 – 04/2017 Bengaluru, India

    Linux Engineer

    HCL Infosystems Ltd.

    • Operated India's largest Biometric Data Centre for UIDAI (Unique Identification Authority of India) — 3,000+ physical servers (IBM, HP, Dell blade and rack) and 800+ virtual machines.
    • Administered Linux across CentOS, Ubuntu, and Red Hat; fine-tuned system parameters for workload-specific performance; automated routine operations with Bash and Python.
    • Hardened Linux servers — access controls, firewalls, kernel patching — and diagnosed network issues with Wireshark, tcpdump, and ncat.
    • Managed VMware vSphere (vCenter, ESXi, clusters) — VM lifecycle, resource optimization, datastores, storage policies, and storage vMotion.
  6. 10/2012 – 06/2013 Hyderabad, India

    Technical Support Engineer

    Polaris

    • Provided L2 technical support for Videocon d2h — an Indian DTH pay-TV operator — troubleshooting Linux-related incidents to keep services running smoothly.
    • Administered Linux servers (CentOS, Ubuntu, Red Hat) — installation, configuration, user/access management, and application troubleshooting.
    • Wrote shell scripts to automate routine tasks and maintained technical documentation, SOPs, and troubleshooting guides.

AI & Agents — Production Focus

Selected production AI/agentic systems I've built & operated. All run on Kubernetes with GPU scheduling, OIDC, NetworkPolicy isolation, and full observability. Three have full case studies.

vLLM Multi-Model Serving Platform

Production vLLM on H100/H200 clusters with AWQ 4-bit quantization, tensor parallelism, continuous batching, prefix caching. HPA on DCGM GPU metrics; P99 TTFT < 200 ms @ 1k+ concurrent.

  • vLLM
  • H100/H200
  • AWQ
  • HPA
  • DCGM

Read case study →
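
The 144 GB → 36 GB figure is straightforward to reproduce. A minimal sketch of the weight-memory arithmetic (illustrative only: it ignores KV cache, activations, and serving-stack overhead):

```python
def model_weight_gb(n_params_b: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB for `n_params_b` billion
    parameters stored at `bits_per_weight` bits each."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def per_gpu_weight_gb(n_params_b: float, bits_per_weight: int, tp: int) -> float:
    """Weights per GPU under tensor parallelism of degree `tp`
    (weights are sharded roughly evenly across the TP group)."""
    return model_weight_gb(n_params_b, bits_per_weight) / tp

print(model_weight_gb(72, 16))      # FP16 72B → 144.0 GB
print(model_weight_gb(72, 4))       # AWQ 4-bit → 36.0 GB
print(per_gpu_weight_gb(72, 4, 4))  # TP=4 → 9.0 GB per GPU
```

At TP=4 the AWQ weights drop to roughly 9 GB per GPU, which is what leaves HBM headroom for the KV cache that continuous batching and prefix caching depend on.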

LangGraph SRE Automation Agents

Stateful multi-agent workflow: supervisor routes Prometheus alerts to retrieval (Qdrant RAG), execution (kubectl/Ansible RBAC), and validation agents. Human-in-the-loop on destructive ops. −40% overnight on-call.

  • LangGraph
  • Qdrant
  • RAG
  • HITL

Read case study →
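
The supervisor's routing decision can be sketched in plain Python. This is a toy stand-in, not the production graph (which is a LangGraph StateGraph); the field names and the DESTRUCTIVE set are illustrative:

```python
DESTRUCTIVE = {"restart", "delete", "scale-down"}  # illustrative action names

def route_alert(alert: dict) -> str:
    """Toy supervisor: pick the next agent for an incoming alert.
    Mirrors the real graph's ordering: retrieve context first, gate
    destructive remediation behind a human, otherwise execute."""
    if not alert.get("runbook"):
        return "retrieval"        # no context yet → RAG lookup over runbooks
    if alert.get("proposed_action") in DESTRUCTIVE:
        return "human_approval"   # HITL gate before anything destructive
    return "execution"
```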

MCP Tool Servers for Internal APIs

Model Context Protocol servers on Kubernetes exposing Kubernetes API, GitLab, Prometheus, and Jira as structured tool endpoints for LLM agents. Enables agent-driven triage & remediation.

  • MCP
  • Kubernetes
  • OAuth2
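
At its core, an MCP-style tool server is a registry of named, structured endpoints. A stdlib-only sketch of the dispatch idea (a real server uses the MCP SDK and calls the actual Kubernetes API; `k8s.get_pod_status` is a hypothetical tool name):

```python
import json

TOOLS: dict = {}

def tool(name: str):
    """Register a function under a structured tool name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("k8s.get_pod_status")          # hypothetical tool name
def get_pod_status(namespace: str, pod: str) -> dict:
    # Stubbed: the real server would query the Kubernetes API here.
    return {"namespace": namespace, "pod": pod, "phase": "Running"}

def handle(request_json: str) -> str:
    """Dispatch one structured tool call: JSON in, JSON out."""
    req = json.loads(request_json)
    result = TOOLS[req["tool"]](**req["arguments"])
    return json.dumps({"result": result})
```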

LiteLLM OpenAI-Compatible Gateway

Unified OpenAI-compatible API fronting multiple vLLM backends with load balancing, per-team rate limits, cost tracking, API keys, and fallback routing. Consumed by Open WebUI, n8n, LangChain, and Mattermost bots.

  • LiteLLM
  • Cost Tracking
  • Fallback
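
Fallback routing reduces to ordered retry across backends. A hedged sketch of the idea (not LiteLLM's actual implementation; `send` stands in for the HTTP call to a backend):

```python
def route_with_fallback(backends, prompt, send):
    """Try backends in priority order; return (backend, reply) from the
    first that succeeds. `send(backend, prompt)` stands in for the HTTP
    call and raises on timeout / 5xx / rate limit."""
    last_err = None
    for backend in backends:
        try:
            return backend, send(backend, prompt)
        except Exception as err:   # sketch only: a catch-all is acceptable here
            last_err = err
    raise RuntimeError(f"all backends failed, last error: {last_err}")
```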

Air-Gapped Sovereign LLM Ops

Complete offline pipeline: HF snapshot download + AWQ quantize on bastion → data diode transfer → Harbor + internal MinIO registry → isolated K8s deploy. Delivered for sovereign/GOVINT clients.

  • Offline HF
  • Harbor
  • MinIO
  • Data Diode

Read case study →
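
Integrity across the diode boundary reduces to a checksum manifest: hash on the sending side, re-hash and compare on the receiving side. A minimal sketch, assuming artifacts are addressed by name (the real pipeline hashes files on disk, not in-memory bytes):

```python
import hashlib

def manifest(artifacts: dict) -> dict:
    """Sending side: name -> sha256 digest, computed before the one-way
    transfer. `artifacts` maps name -> bytes for illustration."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in artifacts.items()}

def verify(received: dict, expected: dict) -> list:
    """Receiving side: return names that arrived missing or corrupted."""
    return [name for name, digest in expected.items()
            if hashlib.sha256(received.get(name, b"")).hexdigest() != digest]
```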

End-to-End MLOps Pipeline

Kubeflow + Argo Workflows DAGs: ingestion → KubeRay fine-tuning → MMLU/HumanEval gates → MLflow registration → ArgoCD vLLM rollout. Canary via Argo Rollouts with Prometheus analysis gates. Weeks → hours.

  • Kubeflow
  • Argo
  • MLflow
  • ArgoCD
  • Seldon
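
The canary's analysis gate boils down to a predicate over observed metrics versus SLO bounds. An illustrative sketch (the metric names are hypothetical, and Argo Rollouts evaluates PromQL queries rather than a Python dict):

```python
def promote_canary(observed: dict, slo: dict) -> bool:
    """Promote only if every SLO-bounded metric was observed and sits
    within its bound; anything missing or over budget blocks rollout."""
    return all(metric in observed and observed[metric] <= bound
               for metric, bound in slo.items())
```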

Triton Embedding & Encoder Serving

NVIDIA Triton with ONNX Runtime + TensorRT (2–3× speedup), dynamic batching, ensembles for pre/post-processing. Sub-50 ms gRPC retrieval powering the SRE RAG knowledge base.

  • Triton
  • TensorRT
  • ONNX
  • gRPC
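
Dynamic batching amounts to two operations: grouping queued requests and padding them to a common shape. A toy sketch of both (Triton does this server-side; this only shows the idea):

```python
def make_batches(pending: list, max_batch: int = 32) -> list:
    """Group queued requests into batches of at most `max_batch`."""
    return [pending[i:i + max_batch] for i in range(0, len(pending), max_batch)]

def pad_batch(token_ids: list, pad_id: int = 0) -> list:
    """Right-pad token sequences to a common length so the encoder can
    run the whole batch as a single tensor."""
    width = max(len(seq) for seq in token_ids)
    return [seq + [pad_id] * (width - len(seq)) for seq in token_ids]
```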

LLM Observability Stack

vLLM Prometheus metrics (TTFT, num_requests_running, KV-cache) + DCGM GPU metrics → 30+ Grafana dashboards. Alertmanager rules for TTFT degradation, KV-cache exhaustion, GPU hang. Distributed tracing across agents.

  • Prometheus
  • Grafana
  • DCGM
  • Loki
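
The TTFT alert logic reduces to a percentile check over latency samples. A minimal sketch using nearest-rank P99 (in production this is a PromQL histogram_quantile rule, not Python):

```python
import math

def percentile(samples, q: float):
    """Nearest-rank percentile (q in (0, 1], e.g. 0.99 for P99)."""
    ordered = sorted(samples)
    return ordered[math.ceil(q * len(ordered)) - 1]

def ttft_breach(samples_ms, threshold_ms: float = 200, q: float = 0.99) -> bool:
    """True when P99 time-to-first-token exceeds the SLO threshold."""
    return percentile(samples_ms, q) > threshold_ms
```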

Certifications

  • Azure Administrator Associate (AZ-104) · Valid 02/2026 – 07/2027
  • Azure Solutions Architect Expert (AZ-305) · Valid 07/2025 – 07/2027
  • AWS Solutions Architect – Associate · ID PFR7HTQ25EFQQS9T
  • Certified Kubernetes Administrator (CKA) · Renewal in Progress
  • G42 Cloud Certified Engineer · ID G42C/SVD/CRT/0475
  • Red Hat Certified Engineer (RHCE) · ID 150-012-904
  • Red Hat Certified System Administrator (RHCSA)
  • Cloudera Hadoop Administrator
  • GitOps Certified Fundamentals
  • Cloud Architecture: Design Decisions

Education

  • Master's — Business Administration & Management (MBA) · Jawaharlal Nehru Technological University · 10/2012 – 04/2015
  • Bachelor's — Computer Software Engineering · Jawaharlal Nehru Technological University · 08/2008 – 06/2012

Languages

English (Full Professional) · Hindi (Full Professional) · Telugu (Native / Bilingual)

Let's talk

Open to Staff / Principal AI Platform & LLMOps, Senior SRE, Senior DevOps, and Senior Infrastructure Engineer roles — globally.