We are looking for an experienced engineer with strong Linux and system-level expertise who can operate autonomously in complex production environments. You must be able to independently troubleshoot incidents, lead and support post-incident service recovery, and drive improvements to overall system stability, performance, and observability. We are looking for a hands-on Site Reliability Engineer (SRE) with a strong background in Linux infrastructure and third-party system operations.
This role focuses on managing and optimizing large-scale environments (5,000+ hosts) running technologies like Kafka, Redis, and Kubernetes.
The position does not involve application development but requires deep operational expertise and solid troubleshooting skills.