We are looking for a highly skilled Senior SRE / AWS Engineer with strong experience in cloud infrastructure, automation, and reliability engineering. The ideal candidate will have deep expertise in AWS, Kubernetes, Infrastructure as Code, and CI/CD, along with a solid background in networking and system reliability. You will be responsible for designing, operating, and improving scalable, highly available cloud platforms.
Roles and Responsibilities
- Design, build, and maintain highly available and scalable AWS infrastructure
- Implement and manage SRE best practices, including monitoring, alerting, incident management, and reliability improvements
- Manage containerized workloads using Docker and Kubernetes
- Build and maintain CI/CD pipelines using tools such as GitLab CI and Jenkins
- Automate infrastructure provisioning using Terraform and Ansible
- Monitor infrastructure performance and availability; proactively identify and resolve issues
- Collaborate with development teams to improve deployment reliability and system performance
- Implement security best practices and ensure compliance across cloud infrastructure
- Troubleshoot complex infrastructure, networking, and application issues
- Create and maintain technical documentation and operational runbooks
Required Skills & Qualifications
- 8–10 years of overall IT experience, with 7+ years in AWS DevOps / SRE
- Strong hands-on experience with AWS services (EC2, VPC, IAM, S3, RDS, EKS, CloudWatch, etc.)
- Deep expertise in Kubernetes and container orchestration
- Strong experience with Infrastructure as Code (IaC) using Terraform
- Expertise in configuration management tools such as Ansible
- Solid understanding of CI/CD pipelines using GitLab CI and Jenkins
- Hands-on experience with Docker
- Strong infrastructure and networking fundamentals (DNS, TCP/IP, Load Balancers, Firewalls)
- Proficiency in Python and/or Shell scripting
- Experience with monitoring and logging tools for cloud infrastructure
Preferred Qualifications
- Experience with SRE practices such as SLIs, SLOs, SLAs, and error budgets
- Experience with multi-region or large-scale production environments
- Exposure to security and compliance in cloud environments