We are seeking an experienced AWS Cloud Ops / SRE engineer to build and operate secure, scalable, and highly available cloud platforms. This role focuses on AWS infrastructure operations, EKS management, AMI lifecycle, Terraform-based IaC, patching, and end-to-end production observability. Experience with Harness is a plus.
Primary Responsibilities
AWS Platform Operations
- Own AWS platform releases across environments, including validation, regression testing, and readiness checks.
- Manage core services and controls: VPC and networking, IAM, KMS, Route 53, and account guardrails.
Infrastructure as Code (Terraform)
- Design and manage scalable infrastructure using Terraform.
- Build reusable modules for compute, networking, storage, EKS, and security.
- Enforce best practices: versioning, immutability, peer review, and CI/CD integration.
EKS (Kubernetes) Operations
- Deploy and operate production-grade EKS clusters and node groups.
- Define standards for security, RBAC, namespaces, and secrets.
- Optimize scaling, performance, and workload reliability.
AMI Lifecycle & Patch Management
- Manage AMI lifecycle: build, harden (CIS), scan, publish, and deprecate.
- Automate image pipelines (e.g., Packer).
- Lead OS patching via AWS SSM and maintain compliance dashboards.
Observability & Reliability
- Implement monitoring using CloudWatch, X-Ray, and OpenTelemetry.
- Build dashboards for golden signals (latency, traffic, errors, saturation).
- Lead incident response, root cause analysis (RCA), and reliability improvements.
CI/CD & Automation
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
- Integrate Terraform, AMI, EKS, and patching workflows into these pipelines.
- Use Harness (preferred) to implement deployment strategies such as canary and blue/green.
Documentation & Governance
- Produce design docs, runbooks, DR plans, and architecture diagrams.
- Conduct readiness reviews, capacity planning, and cost optimization.
Required Qualifications
- 10+ years of experience in SRE, Cloud Ops, or DevOps, with strong AWS expertise.
- Hands-on experience with:
  - Compute: EC2, ASG, EKS/ECS, Lambda
  - Networking: VPC, Route 53, ALB/NLB, SG/NACL
  - Storage: S3, EBS, EFS
  - Databases: RDS, Aurora, DynamoDB
- Strong experience with Terraform or CloudFormation.
- Expertise in AMI pipelines, image hardening, and OS-level patching.
- Proven ability to troubleshoot infrastructure and application issues end-to-end.