Oversee the migration from Dynatrace to Sentry and Grafana, ensuring seamless continuity of monitoring, tracing, and alerting capabilities across applications.
Design, build, and maintain Grafana dashboards to visualize metrics, KPIs, and application health, while configuring alerts to proactively track performance and issues.
Implement and manage Sentry for comprehensive error tracking, performance monitoring, and diagnostics across multiple applications and services.
Set up and configure Prometheus for metrics aggregation and Loki for log aggregation, providing real-time insights into application performance and behavior.
Collaborate with development, DevOps, and infrastructure teams to integrate monitoring tools into the CI/CD pipeline for continuous monitoring and automated testing.
Define and implement SLA-based alerts and notifications to track and measure application performance, reliability, and user experience.
Conduct in-depth root cause analysis (RCA) for critical incidents, leveraging distributed tracing and monitoring data to identify and resolve issues.
Automate monitoring and alerting tasks using Python, Bash, or similar scripting languages to streamline operations and improve efficiency.
Ensure secure and compliant access to monitoring tools by configuring roles and permissions in line with security best practices.
Document the migration process, create knowledge base articles, and provide training to internal teams to ensure smooth transitions and long-term efficiency.
Required Skills & Experience
Proven experience with application monitoring, tracing, and observability tools such as Dynatrace, Grafana, Sentry, Prometheus, and Loki.
Strong understanding of Application Performance Management (APM) concepts, distributed tracing, and error tracking practices.
Hands-on experience building custom Grafana dashboards and configuring alerting for application health monitoring.
Expertise in setting up and integrating Sentry for error tracking and performance monitoring across multiple applications.
Familiarity with Prometheus for metrics aggregation and Loki for log aggregation to provide full-stack observability.
Experience integrating monitoring tools into CI/CD pipelines, aligning with DevOps best practices.
Proficiency in Python, Bash, or similar scripting languages to automate monitoring tasks, such as alerting and incident response.
Solid understanding of incident management, root cause analysis (RCA), and SLA tracking to ensure high availability and minimal downtime.
Experience with API integration and data transformation between observability platforms to streamline monitoring workflows.
Knowledge of security and compliance principles, particularly regarding access management and data governance for monitoring tools.
Preferred Qualifications
Prior experience migrating from Dynatrace or similar observability platforms to other monitoring and telemetry tools.
Familiarity with microservices and cloud-native monitoring solutions, including Kubernetes and containerized environments.
Experience working in an Agile environment with cross-functional teams to deliver iterative improvements and features.
Certifications in Grafana, Prometheus, or other relevant observability platforms.