Senior Observability / Site Reliability Engineer

PB Consulting

Phoenix, AZ

Posted On: Nov 19, 2024

Job Overview

Job Type

Contract - W2, Contract - Independent, Contract - Corp-to-Corp

Experience

10 - 20 Years

Salary

$70 - $75 Per Hour

Work Arrangement

Hybrid

Travel Requirement

0%

Required Skills

  • SRE
  • Observability
  • Grafana
  • GCP
Job Description
Roles and Responsibilities
  • Lead the design, development, and implementation of observability solutions across cloud-native infrastructure and applications.
  • Ensure the reliability, scalability, and performance of critical services by leveraging observability tools and platforms.
  • Develop and maintain performance monitoring systems to proactively identify and resolve potential issues.
  • Troubleshoot and resolve complex problems across distributed systems using observability data.
  • Implement best practices for logging, metrics, and tracing to ensure full-stack visibility.
  • Work closely with cross-functional teams to drive reliability improvements and foster a culture of "SRE-driven" automation and proactive monitoring.
  • Automate and optimize observability pipelines, ensuring data quality, completeness, and performance.
  • Manage and mentor junior engineers, providing technical leadership and guidance in observability and site reliability practices.
  • Ensure systems are continuously monitored, with appropriate alerting thresholds and response strategies in place.
  • Collaborate with engineering teams to identify reliability risks and develop strategies to mitigate them.

Qualifications
  • Strong experience in Site Reliability Engineering (SRE) principles and practices.
  • Expertise in observability development, including instrumentation, monitoring, and alerting strategies.
  • Experience in leading technical teams and acting as a technical lead on major observability projects.
  • Deep understanding of performance monitoring tools and techniques, including latency, throughput, error rate, and system health monitoring.
  • Experience with Grafana, Prometheus, Cortex, Loki, Tempo, and Mimir for metrics, logging, tracing, and performance monitoring.
  • Proven problem-solving skills and ability to troubleshoot issues in complex distributed systems.
  • Google Cloud Platform (GCP) experience is required. Previous experience with AKS/Azure is a plus, but GCP will be the focus for this role.
  • Strong scripting and automation skills (e.g., Python, Go, Bash) for building observability tooling and improving infrastructure reliability.
  • Experience working in cloud-native environments (e.g., Kubernetes, microservices, containers).

Job ID: PC240462


Posted By

Naincy

Recruiter