Neshent Tech

New York, NY

Posted On: Aug 05, 2025

Job Type

Contract - Independent, Contract - W2, Contract - Corp-to-Corp

Experience

8 - 18 Years

Salary

$80 - $85 Per Hour

Work Arrangement

Hybrid

Travel Requirement

Required Skills

Architect scalable reliability solutions leveraging best-in-class practices in AI Ops and historical analytics.
Design and implement Historical Analytics Architecture for post-incident analysis, system health insights, and long-term trend discovery.
Lead the strategy and implementation of Data Fabric Architecture to unify data access and management across hybrid and multi-cloud environments.
Drive AI Ops initiatives, integrating machine learning and automation into incident detection, root cause analysis, and system optimization.
Develop and execute an AI Observability Strategy to enable intelligent insights, predictive alerts, and self-healing systems.
Champion OpenTelemetry (OTel) adoption and architecture, ensuring standardized and vendor-neutral observability instrumentation across the organization.
Provide architectural oversight of high-availability systems, reliability KPIs (e.g., SLOs, SLIs), and risk assessments.
Collaborate cross-functionally with development, operations, product, and executive teams to align reliability architecture with business goals.
Mentor engineering teams on best practices in reliability engineering and observability.

8+ years of experience in site reliability, systems architecture, or a related field.
Proven experience architecting Historical Analytics systems and data pipelines at scale.
Deep understanding of Data Fabric concepts, including data virtualization, governance, and real-time access layers.
Hands-on experience developing or leading AI Ops and AI Observability strategies within production environments.
Strong knowledge of OpenTelemetry (OTel) standards, best practices, and implementation patterns.
Proficiency in cloud-native architectures (AWS, Azure, GCP), containerization (Kubernetes, Docker), and infrastructure as code (e.g., Terraform).
Solid understanding of monitoring tools (e.g., Prometheus, Grafana, Splunk, Datadog, New Relic) and logging platforms.
Excellent communication and stakeholder management skills.

Master’s degree in Computer Science, Engineering, or a related field.
Experience with MLOps, event-driven architecture, and real-time data processing.
Familiarity with security and compliance in large-scale distributed systems.
Contributions to open-source observability or reliability projects.

Job ID: NT250247

Posted By

Abhishek

Resource Manager