Senior Observability / Site Reliability Engineer Jobs in Phoenix, AZ

PB Consulting

Phoenix, AZ

Posted On: Nov 19, 2024

Job Type

Contract - W2, Contract - Independent, Contract - Corp-to-Corp

Experience

10 - 20 Years

Salary

$70 - $75 Per Hour

Work Arrangement

Hybrid

Travel Requirement

Required Skills

Lead the design, development, and implementation of observability solutions across cloud-native infrastructure and applications.
Ensure the reliability, scalability, and performance of critical services by leveraging observability tools and platforms.
Develop and maintain performance monitoring systems to proactively identify and resolve potential issues.
Troubleshoot and resolve complex problems across distributed systems using observability data.
Implement best practices for logging, metrics, and tracing to ensure full-stack visibility.
Work closely with cross-functional teams to drive reliability improvements and foster a culture of "SRE-driven" automation and proactive monitoring.
Automate and optimize observability pipelines, ensuring data quality, completeness, and performance.
Manage and mentor junior engineers, providing technical leadership and guidance in observability and site reliability practices.
Ensure systems are continuously monitored, with appropriate alerting thresholds and response strategies in place.
Collaborate with engineering teams to identify reliability risks and develop strategies to mitigate them.

Strong experience in Site Reliability Engineering (SRE) principles and practices.
Expertise in observability development, including instrumentation, monitoring, and alerting strategies.
Experience in leading technical teams and acting as a technical lead on major observability projects.
Deep understanding of performance monitoring tools and techniques, including latency, throughput, error rate, and system health monitoring.
Experience with Grafana, Prometheus, Cortex, Loki, Tempo, and Mimir for metrics, logging, tracing, and performance monitoring.
Proven problem-solving skills and ability to troubleshoot issues in complex distributed systems.
GCP (Google Cloud Platform) experience is required; previous experience with AKS/Azure is a plus, but GCP will be the focus for this role.
Strong scripting and automation skills (e.g., Python, Go, Bash) for building observability tooling and improving infrastructure reliability.
Experience working in cloud-native environments (e.g., Kubernetes, microservices, containers).

Job ID: PC240462

Posted By

Naincy

Recruiter