We are seeking a highly skilled Kafka Operations Administrator to manage and maintain production-grade Apache Kafka clusters. The ideal candidate will have deep experience in Kafka operations, monitoring, automation, and production support within enterprise environments. This role includes 24x7 on-call responsibilities, incident management, performance tuning, and ensuring high availability and disaster recovery.
Roles and Responsibilities
- Deploy, configure, and manage Kafka clusters and related services to meet SLA requirements
- Participate in a 24x7 on-call rotation, responding to incidents, alerts, and escalations
- Triage, diagnose, and remediate production incidents; coordinate with stakeholders, developers, and infrastructure teams
- Implement automation for provisioning, scaling, backups, and disaster recovery
- Maintain monitoring, alerting thresholds, and dashboards that track Kafka ecosystem health
- Harden Kafka deployments by configuring TLS, ACLs, RBAC, and encryption, and by remediating vulnerabilities
- Perform routine maintenance, including rolling restarts and Kafka ecosystem upgrades (controllers, brokers, Kafka Connect, and Schema Registry)
- Create and maintain runbooks, automation scripts, and post-incident reports
- Optimize performance and resource utilization through benchmarking and tuning
- Support Kafka Connect and Schema Registry services; troubleshoot connector issues
- Contribute to CI/CD pipeline improvements for infrastructure and deployment automation
Required Technical / Functional Skills
- Production-grade Apache Kafka operations experience, including managing, maintaining, and upgrading Kafka clusters
- Strong experience with high availability, disaster recovery, failover, and overall reliability
- Proficient in monitoring and observability tools, including:
  - Grafana (dashboards)
  - Prometheus
  - Splunk
  - JMX metrics
- Automation and orchestration expertise using:
  - Terraform
  - Ansible
  - Helm
  - Kubernetes (EKS/AKS/GKE)
- Strong Linux system administration, including troubleshooting and scripting for infrastructure management
- Production support experience following ITIL processes
- Experience in 24x7 on-call rotations, incident documentation, and postmortems
- Experience with JVM tuning, GC analysis, and network/disk I/O diagnostics
- Strong understanding of TCP/IP, routing, switching, and firewall configurations relevant to Kafka operations
Required Skills
- Deep Kafka performance tuning and capacity planning experience
- Knowledge of message delivery semantics and guarantees (at-least-once, exactly-once)
- Cloud-native security/compliance experience (IAM, VPC, KMS, Security Groups)
- Relevant certifications: Confluent Certified Administrator for Apache Kafka, AWS/Azure/GCP cloud certifications
- Experience with Apache Kafka in KRaft mode
- Containerization and orchestration experience (Docker, Kubernetes)
- Experience with CI/CD pipelines and Git-based workflows
- Experience building custom Kafka Connect libraries and knowledge of serialization formats (Avro, JSON)
- Strong understanding of networking across on-prem and cloud environments
- Knowledge of best practices for topic management and streaming security (TLS, ACLs, RBAC, encryption)
- Kafka ecosystem tooling experience (Kafka Connect, Schema Registry)
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field (preferred)
- 7+ years of experience in Kafka operations or platform engineering
- Proven experience in production support and infrastructure automation