We are seeking an experienced Grafana & Prometheus Developer to design, implement, and maintain real-time monitoring systems that provide insight into system health and performance. The ideal candidate will have hands-on experience creating and managing Grafana dashboards, writing Prometheus queries in PromQL, and using LogQL to analyze log data. Strong Unix/Linux shell and Python scripting skills are essential; experience with Thanos is a plus.
Primary Responsibilities
- Create and manage interactive Grafana dashboards to visualize KPIs and system health metrics in real time.
- Write and optimize PromQL queries for efficient data retrieval and troubleshooting. Use PromQL and LogQL to analyze and aggregate time-series metrics and log data.
- Integrate Prometheus exporters with existing systems and applications to gather key metrics from various data sources.
- Design and configure alerting rules, thresholds, and escalation workflows to identify issues before they impact system performance.
- Analyze performance trends to identify bottlenecks and inefficiencies. Provide insights and reports to stakeholders for system optimization.
- Ensure the reliability and accuracy of metrics collection. Conduct testing and validation across services to maintain optimal monitoring performance.
- Work closely with DevOps and infrastructure teams to ensure seamless integration, scalability, and continuous improvement of monitoring systems.
- Troubleshoot system issues, optimize query performance, and implement strategies for data retention and archiving.
- Continuously enhance monitoring coverage to support the evolving needs of the system and business.
Required Skills & Experience
- Proficiency in building and managing Grafana dashboards.
- Strong hands-on experience with Prometheus and PromQL for metric collection and analysis.
- Familiarity with LogQL for querying logs and log-derived metrics.
- Strong scripting skills in Unix/Linux shell and Python.
- Experience with system and infrastructure monitoring at scale.
- Knowledge of Prometheus exporters and their integration into monitoring systems.
Preferred Skills
- Experience with Thanos for scaling Prometheus and providing long-term metric storage.
- Knowledge of high-availability and fault-tolerant monitoring setups.