Site Reliability Engineer (SRE) – ML Platform

PB Consulting

Austin, TX/Sunnyvale, CA

Posted On: Aug 11, 2025

Job Overview

Job Type

Full-time

Experience

9 - 15 Years

Salary

$100,000 - $140,000 Per Year

Work Arrangement

On-Site

Travel Requirement

0%

Required Skills

MLOps
Python
MongoDB
Cloud
Linux
SRE

Job Description

Roles and Responsibilities

Manage and enhance CI/CD pipelines using GitHub Actions, Flux, and Kustomize for continuous deployment.
Design and implement cloud-native MLOps solutions on AWS.
Support containerization and deployment of ML/LLM models using Docker, Kubernetes, and vLLM.
Collaborate with cross-functional teams (Data Scientists, Engineers, Architects) to understand requirements and improve infrastructure reliability.
Build and maintain scalable tools and services to support ML model training, evaluation, and inference.
Document processes, deployment procedures, and best practices clearly and concisely.
Ensure the platform’s reliability, scalability, and performance for ML workloads.
Support the development and automation of MLOps pipelines and workflows using tools like Kubeflow, MLflow, Airflow, and Argo.
Implement and monitor security, compliance, and operational standards for ML services in production.

Required Qualifications

8+ years of experience in SRE, DevOps, or MLOps roles with a strong focus on machine learning infrastructure.
Strong experience with Kubernetes, Docker, and cloud platforms (AWS preferred).
Proficient in Python, with experience in scripting and automation.
Experience working with MongoDB and Apache Solr.
Strong background in Linux system administration.
Familiarity with ML and LLM concepts, training, and deployment.
Hands-on experience with CI/CD pipelines, configuration management, and Infrastructure-as-Code.
Experience with cloud-native development, container orchestration, and cloud networking.
Strong understanding of API integration, cloud-based services, and microservices architecture.
Familiarity with MLOps tools and frameworks such as Kubeflow, MLflow, Airflow, DataRobot, or Argo.
Excellent communication skills and the ability to work effectively in a collaborative team environment.

Preferred Qualifications

Prior experience supporting large-scale ML/LLM systems in production environments.
Deep understanding of observability and monitoring tools (e.g., Prometheus, Grafana).
Experience with model versioning, reproducibility, and A/B testing in ML systems.
Knowledge of security best practices for ML/AI workloads in the cloud.
Exposure to benchmarking, testing, and performance tuning for ML platforms.

Job ID: PC250220

Posted By

Naincy

IT Recruiter
