We are seeking an experienced AI Infrastructure Site Reliability Engineer (SRE) to manage and optimize high-performance computing systems, with a focus on NVIDIA DGX and Cisco UCS clusters. You will ensure the availability, scalability, and reliability of our AI infrastructure while driving automation and continuous improvement.
Responsibilities
- Manage and optimize high-performance compute environments (NVIDIA DGX, Cisco UCS).
- Ensure the reliability, scalability, and efficiency of the infrastructure through fault-tolerant design.
- Automate operational tasks using tools such as Python, Ansible, Terraform, and Go.
- Build and maintain CI/CD pipelines using tools such as GitLab, GitHub Actions, and Jenkins.
- Implement metrics-driven processes to monitor and meet service quality targets.
- Collaborate with development and operations teams to ensure seamless delivery of AI workloads.
Qualifications
- 5+ years in AI infrastructure or systems engineering.
- Strong reliability engineering and performance tuning skills.
- 5+ years of experience with HPC systems (NVIDIA DGX, Cisco UCS).
- Proficiency in Docker and containerized environments.
- Strong automation skills with Python, Terraform, Ansible, or Go.
Preferred Skills
- CI/CD experience (GitLab, GitHub Actions, Jenkins).
- Familiarity with Kubernetes and Kubernetes-based platforms (OpenShift, Google Anthos).
- Experience across the software development lifecycle, with Go or C/C++.