AI Infrastructure SRE Engineer - DGX

Techvilla Solutions

Posted On: Sep 15, 2025

Job Overview

Job Type

Full-time

Experience

6 - 10 Years

Salary

Depends on Experience

Work Arrangement

Remote

Travel Requirement

Required Skills

AI Infrastructure
SRE
Python
automation
HPC systems

Job Description

We are seeking an experienced AI Infrastructure SRE Engineer to manage and optimize high-performance computing systems, focusing on NVIDIA DGX and Cisco UCS clusters. You will ensure the availability, scalability, and reliability of our AI infrastructure while driving automation and continuous improvement.

Responsibilities

Manage and optimize high-performance compute environments (NVIDIA DGX, Cisco UCS).
Ensure infrastructure reliability, scalability, and efficiency through fault-tolerant design.
Automate operational tasks with tools such as Python, Ansible, Terraform, and Go.
Build and maintain CI/CD pipelines using GitLab, GitHub Actions, or Jenkins.
Implement metrics-driven processes to monitor and meet service quality targets.
Collaborate with development and operations teams to ensure seamless delivery of AI workloads.

Qualifications

5+ years in AI infrastructure or systems engineering.
Strong reliability engineering and performance tuning skills.
5+ years of experience with HPC systems (NVIDIA DGX, Cisco UCS).
Proficiency in Docker and containerized environments.
Strong automation skills with Python, Terraform, Ansible, or Go.

Preferred Skills

CI/CD experience (GitLab, GitHub Actions, Jenkins).
Familiarity with Kubernetes platforms (OpenShift, Google Anthos).
Software development experience across the full development lifecycle (Golang, C/C++).

Job ID: TS250255

Posted By

Vivek

Information Technology Recruiter