We are looking for an experienced Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of Azure-based services in a large-scale enterprise environment. This role involves managing cloud infrastructure, enhancing observability, implementing disaster recovery strategies, and driving reliability improvements through SLOs/SLIs and automation.
Key Responsibilities
- Define and manage SLOs, SLIs, and error budgets for Azure-hosted services, and report SLA compliance to stakeholders.
- Lead architectural reviews, ensuring reliability targets (availability, RTO/RPO) are met from design to production.
- Implement chaos engineering practices and conduct disaster recovery drills across Azure regions.
- Serve as Incident Commander for P1/P2 incidents, owning the incident lifecycle and post-mortem actions.
- Design and operate enterprise observability using Azure Monitor, Log Analytics, Application Insights, and Grafana.
- Develop alerting frameworks and automate self-healing operations with Azure Automation and scripting (Python/PowerShell).
- Embed reliability gates in CI/CD pipelines and manage AKS cluster reliability (scaling, upgrades, security).
- Enforce infrastructure-as-code best practices with Terraform/Bicep for Azure Landing Zones.
Required Qualifications
- 7+ years of experience in SRE, platform engineering, or cloud infrastructure roles in large-scale environments.
- 4+ years of hands-on Azure cloud engineering experience, including AKS.
- Expertise in Terraform (required), plus experience with Bicep and with managing Azure Landing Zones.
- Proficiency in scripting or programming with Python, Go, or PowerShell.
- Experience with Azure observability tools (Monitor, Log Analytics, Application Insights).
- Proven track record of owning SLOs/SLIs and improving production reliability.