AI Infrastructure SRE Engineer - DGX

Techvilla Solutions

Posted On: Sep 15, 2025

Job Overview

Job Type

Full-time

Experience

6 - 10 Years

Salary

Depends on Experience

Work Arrangement

Remote

Travel Requirement

0%

Required Skills

  • AI Infrastructure
  • SRE
  • Python
  • Automation
  • HPC systems

Job Description

We are seeking an experienced AI Infrastructure SRE Engineer to manage and optimize high-performance computing systems, focusing on NVIDIA DGX and Cisco UCS clusters. You will ensure the availability, scalability, and reliability of our AI infrastructure while driving automation and continuous improvement.

Responsibilities
  • Manage and optimize high-performance compute environments (NVIDIA DGX, Cisco UCS).
  • Ensure reliability, scalability, and efficiency of infrastructure using fault-tolerant approaches.
  • Automate operational tasks with Python, Ansible, Terraform, Go, etc.
  • Build and maintain CI/CD pipelines using GitLab, GitHub Actions, or Jenkins.
  • Implement metrics-driven processes to monitor and meet service quality targets.
  • Collaborate with development and operations teams to ensure seamless delivery of AI workloads.

Qualifications
  • 5+ years in AI infrastructure or systems engineering.
  • Strong reliability engineering and performance tuning skills.
  • 5+ years of experience with HPC systems (NVIDIA DGX, Cisco UCS).
  • Proficiency in Docker and containerized environments.
  • Strong automation skills with Python, Terraform, Ansible, or Go.

Preferred Skills
  • CI/CD experience (GitLab, GitHub Actions, Jenkins).
  • Familiarity with Kubernetes (OpenShift, Google Anthos).
  • Experience across the software development lifecycle, with Golang or C/C++.

Job ID: TS250255


Posted By

Vivek

Information Technology Recruiter