Senior Reliability Architect

Neshent Tech

New York, NY

Posted On: Aug 05, 2025

Job Overview

Job Type

Contract - Independent, Contract - W2, Contract - Corp-to-Corp

Experience

8 - 18 Years

Salary

$80 - $85 Per Hour

Work Arrangement

Hybrid

Travel Requirement

0%

Required Skills

  • Site reliability
  • AI Ops
  • OpenTelemetry
  • Data Fabric

Job Description
Roles and Responsibilities
  • Architect scalable reliability solutions leveraging best-in-class practices in AI Ops and historical analytics.
  • Design and implement Historical Analytics Architecture for post-incident analysis, system health insights, and long-term trend discovery.
  • Lead the strategy and implementation of Data Fabric Architecture to unify data access and management across hybrid and multi-cloud environments.
  • Drive AI Ops initiatives, integrating machine learning and automation into incident detection, root cause analysis, and system optimization.
  • Develop and execute an AI Observability Strategy to enable intelligent insights, predictive alerts, and self-healing systems.
  • Champion OpenTelemetry (OTel) adoption and architecture, ensuring standardized and vendor-neutral observability instrumentation across the organization.
  • Provide architectural oversight on high-availability systems, reliability KPIs (e.g., SLOs, SLIs), and risk assessments.
  • Collaborate cross-functionally with development, operations, product, and executive teams to align reliability architecture with business goals.
  • Mentor engineering teams on best practices in reliability engineering and observability.

Required Qualifications
  • 8+ years of experience in site reliability, systems architecture, or a related field.
  • Proven experience architecting Historical Analytics systems and data pipelines at scale.
  • Deep understanding of Data Fabric concepts, including data virtualization, governance, and real-time access layers.
  • Hands-on experience developing or leading AI Ops and AI Observability strategies within production environments.
  • Strong knowledge of OpenTelemetry (OTel) standards, best practices, and implementation patterns.
  • Proficiency in cloud-native architectures (AWS, Azure, GCP), containerization (Kubernetes, Docker), and infrastructure as code (Terraform, etc.).
  • Solid understanding of monitoring tools (e.g., Prometheus, Grafana, Splunk, Datadog, New Relic) and logging platforms.
  • Excellent communication and stakeholder management skills.

Preferred Qualifications
  • Master’s degree in Computer Science, Engineering, or a related field.
  • Experience with MLOps, event-driven architecture, and real-time data processing.
  • Familiarity with security and compliance in large-scale distributed systems.
  • Contributions to open-source observability or reliability projects.

Job ID: NT250247


Posted By

Abhishek

Resource Manager