# Deepak Inugala — Senior SRE · MLOps / LLMOps · AI Infrastructure

> Senior Site Reliability Engineer leading a 6-person SRE team at G42.
> AI Platform Engineering · Agentic AI · GPU Inference Infrastructure · International Delivery.
> 11+ years of infrastructure, Linux, cloud, Kubernetes, and AI platform
> engineering — production LLM serving, MLOps pipelines, agentic AI, and
> sovereign air-gapped LLM Ops.

## Contact

- Name: Deepak Inugala
- Location: Abu Dhabi, UAE
- Email: deepak.1990@hotmail.com
- Phone: +971 50 494 5921
- LinkedIn: https://www.linkedin.com/in/deepak-inugala
- Portfolio: https://deepakinugala.github.io/portfolio/
- Open to: Staff / Principal AI Platform & LLMOps roles, Senior SRE Engineer, Senior DevOps Engineer, Senior Infrastructure Engineer — globally

## Current role

**Group 42 (G42)** — Senior Site Reliability Engineer (03/2023 – Present). Leads a 6-person SRE team covering GPU platform, LLM/MLOps tooling, and international client delivery. Sole technical authority for GPU infrastructure at client sites across Kazakhstan, Angola, and Bahrain; supports Maldives and Ethiopia remotely. Delivers across GOVINT (classified air-gapped), OSINT, and Smart Nation programmes.

## Core expertise

- **LLM Serving**: vLLM (PagedAttention, continuous batching, tensor parallelism, prefix caching, AWQ/GPTQ), NVIDIA Triton (TensorRT, ONNX, dynamic batching, ensembles), LiteLLM (OpenAI-compatible gateway).
- **Agentic AI**: LangGraph (stateful supervisor topology), LangChain, AutoGen, MCP (Model Context Protocol) tool servers, n8n workflow automation, OpenClaw, NemoClaw.
- **RAG**: Qdrant, Weaviate, Open WebUI, HuggingFace embeddings, sub-50 ms retrieval.
- **MLOps**: MLflow, Kubeflow Pipelines, Argo Workflows, DVC, Feast, Seldon Core, Weights & Biases, Kyverno model admission.
- **GPU / HPC**: NVIDIA H100 / H200 / A100, MIG, DCGM Exporter, KubeRay, CUDA/cuDNN, InfiniBand, NVLink, NCCL tuning.
- **Kubernetes & Cloud**: AKS, EKS, RKE2, Rancher, Helm, ArgoCD, Argo Rollouts, Kustomize, Azure, AWS, G42 Cloud, Harbor.
- **Air-Gapped LLM Ops**: HF snapshot_download offline, Harbor mirrors, data diode transfer, AWQ size reduction, internal MinIO model registry.
- **Observability & Security**: Prometheus, Grafana (30+ dashboards), Loki, Fluent Bit, Falco, Trivy, OPA Gatekeeper, Kyverno, Keycloak OIDC, zero-trust NetworkPolicies.
- **IaC & SRE**: Terraform, Ansible, GitLab CI/CD, GitOps, Python (LangGraph/LangChain, MLflow SDK, Kubeflow SDK), Bash; incident command, on-call leadership, SLO design.

## Highlighted outcomes

- Deployed LLMs from 7B to 72B parameters; AWQ 4-bit quantization reduced the 72B model's footprint from 144 GB to 36 GB of VRAM.
- Achieved P99 TTFT < 200 ms at 1,000+ concurrent requests on vLLM / H100 / H200.
- Cut MLOps release cycles from weeks to hours via Kubeflow + Argo + MLflow + ArgoCD.
- Reduced overnight SRE on-call interventions by 40% using LangGraph multi-agent automation with human-in-the-loop gates.
- Cut operational toil by 50% via MCP tool servers exposing internal APIs to LLM agents.
- Led FGB–NBAD post-merger IT integration at First Abu Dhabi Bank with zero major incidents; 35% reduction in repeat incidents via improved incident response.
- Managed 3,000+ physical Linux servers at HCL (UIDAI programme, Bengaluru); delivered the on-prem-to-cloud migration with under 2 hours of downtime and 0% data loss.

## Experience timeline

- 03/2023 – Present: **Group 42 (G42)**, Abu Dhabi — Senior Site Reliability Engineer (leads 6-person SRE team; GPU platform, LLM/MLOps, international delivery).
- 07/2019 – 03/2023: **Group 42 (G42)**, Abu Dhabi — Site Reliability Engineer (built the AI platform foundations — Kubernetes, GPU infrastructure, observability).
- 01/2018 – 07/2019: **First Abu Dhabi Bank (FAB)**, Abu Dhabi — Cloud Engineer (FGB–NBAD post-merger IT integration, AWS architectures, Cloudera CDH administration).
- 05/2017 – 01/2018: **HCL Technologies**, Abu Dhabi — Technical Specialist (Daman health insurance account, Linux + virtualization + incident response).
- 08/2015 – 04/2017: **HCL Infosystems**, Bengaluru, India — Linux Engineer (UIDAI / Aadhaar programme, 3,000+ physical RHEL/CentOS servers, on-prem-to-cloud migration).
- 10/2012 – 06/2013: **Polaris Consulting & Services**, Hyderabad, India — Technical Support Engineer (Videocon d2h account, Linux and network operations).

## Education

- MBA, Jawaharlal Nehru Technological University (10/2012 – 04/2015).
- B.Tech, Jawaharlal Nehru Technological University (08/2008 – 06/2012).

## Certifications

- Microsoft Azure Administrator Associate (AZ-104) — valid through 07/2027.
- Microsoft Azure Solutions Architect Expert (AZ-305) — valid through 07/2027.
- AWS Solutions Architect – Associate.
- Certified Kubernetes Administrator (CKA) — renewal in progress.
- G42 Cloud Certified Engineer.
- Red Hat Certified Engineer (RHCE).
- Red Hat Certified System Administrator (RHCSA).
- Cloudera Certified Administrator for Apache Hadoop.
- GitOps Certified Fundamentals (Codefresh / Argo).
- Cloud Architecture: Design Decisions (LinkedIn Learning).

## Resumes (role-tailored PDFs)

- LLMOps / MLOps Engineer: /resumes/LLMOps_MLOps_Engineer.pdf
- Senior SRE: /resumes/Senior_SRE_Engineer.pdf
- Team Lead SRE: /resumes/TeamLead_SRE_Engineer.pdf
- Senior DevOps: /resumes/Senior_DevOps_Engineer.pdf
- Senior DevSecOps: /resumes/Senior_DevSecOps_Engineer.pdf
- Senior Cloud: /resumes/Senior_Cloud_Engineer.pdf
- Senior Kubernetes Administrator: /resumes/Senior_Kubernetes_Administrator.pdf
- Senior HPC Engineer: /resumes/Senior_HPC_Engineer.pdf

## Languages

English & Hindi (Full Professional) · Telugu (Native).