Site Reliability Engineer (SRE) – ML Platform

PB Consulting

Austin, TX / Sunnyvale, CA

Posted On: Aug 11, 2025

Job Overview

Job Type

Full-time

Experience

9 - 15 Years

Salary

$100,000 - $140,000 Per Year

Work Arrangement

On-Site

Travel Requirement

0%

Required Skills

  • MLOps
  • Python
  • MongoDB
  • Cloud
  • Linux
  • SRE
Job Description
Roles and Responsibilities
  • Manage and enhance CI/CD pipelines using GitHub Actions, Flux, and Kustomize for continuous deployment.
  • Design and implement cloud-native MLOps solutions on AWS.
  • Support containerization and deployment of ML/LLM models using Docker, Kubernetes, and vLLM.
  • Collaborate with cross-functional teams (Data Scientists, Engineers, Architects) to understand requirements and improve infrastructure reliability.
  • Build and maintain scalable tools and services to support ML model training, evaluation, and inference.
  • Document processes, deployment procedures, and best practices clearly and concisely.
  • Ensure the platform’s reliability, scalability, and performance for ML workloads.
  • Support the development and automation of MLOps pipelines and workflows using tools like Kubeflow, MLflow, Airflow, and Argo.
  • Implement and monitor security, compliance, and operational standards for ML services in production.

Required Qualifications
  • 8+ years of experience in SRE, DevOps, or MLOps roles with a strong focus on machine learning infrastructure.
  • Strong experience with Kubernetes, Docker, and cloud platforms (AWS preferred).
  • Proficient in Python, with experience in scripting and automation.
  • Experience working with MongoDB and Apache Solr.
  • Strong background in Linux system administration.
  • Familiarity with ML and LLM concepts, training, and deployment.
  • Hands-on experience with CI/CD pipelines, configuration management, and Infrastructure-as-Code.
  • Experience with cloud-native development, container orchestration, and cloud networking.
  • Strong understanding of API integration, cloud-based services, and microservices architecture.
  • Familiarity with MLOps tools and frameworks, including open-source options such as Kubeflow, MLflow, Airflow, and Argo, or commercial platforms such as DataRobot.
  • Excellent communication skills and the ability to work effectively in a collaborative team environment.

 

Preferred Qualifications
  • Prior experience supporting large-scale ML/LLM systems in production environments.
  • Deep understanding of observability and monitoring tools (e.g., Prometheus, Grafana).
  • Experience with model versioning, reproducibility, and A/B testing in ML systems.
  • Knowledge of security best practices for ML/AI workloads in the cloud.
  • Exposure to benchmarking, testing, and performance tuning for ML platforms.

Job ID: PC250220


Posted By

Naincy

Recruiter

