About HighLevel:
HighLevel is an AI-powered business operating system that gives agencies, entrepreneurs and SMBs the infrastructure to build, automate and scale. Today, HighLevel supports SMBs across 150+ countries, fueling community-driven growth rooted in real customer outcomes.
To date, businesses operating on HighLevel have generated over $7 billion in ecosystem value, demonstrating the impact of shared infrastructure at scale. By centralizing conversations, automation and intelligence into one system, we help businesses move faster, reduce complexity and execute efficiently.
Behind the platform, HighLevel powers more than 4 billion API hits and 2.5 billion message events daily. With 250 terabytes of distributed data, 250+ microservices and over 1 million domain names supported, our architecture is built for performance, resilience and long-term scalability.
Our People:
With over 2,000 team members across 10+ countries, HighLevel operates as a global, remote-first organization built for speed and ownership. We value initiative, clarity and execution, creating space for ambitious people to build systems that support millions of businesses worldwide. Here, innovation thrives, ideas are celebrated and people come first, no matter where they call home.
Our Impact:
Every month, HighLevel enables more than 1.5 billion messages, 200 million leads and 20 million conversations for the more than 1 million businesses we support. Behind those numbers are real people building independence, expanding opportunity and creating measurable impact. We're proud to be a part of that.
Learn more about us on our YouTube Channel or Blog Posts.
About the Role:
We are looking for a Site Reliability Engineer (SRE) to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.
Responsibilities:
- Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
- Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues.
- Collaborate with developers to enhance application reliability, scalability, and performance.
- Drive cost optimization efforts in cloud environments.
- Work with and support multiple data stores, including MongoDB, Redis, Elasticsearch (ES), and queue-based systems.
Requirements:
- Experience: 5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- Cloud Expertise: Hands-on experience with GCP and AWS.
- Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools.
- Containerization & Orchestration: Docker, Kubernetes (GKE).
- Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools.
- Programming/Scripting: Proficiency in Python, Bash, or Shell scripting. Basic understanding of API parsing and JSON manipulation.
- CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar tools.
- Incident Management: Experience with on-call rotations, SLOs, SLIs, SLAs, Escalation Policies, and incident resolution.
- Databases: Experience monitoring MongoDB, Redis, Elasticsearch (ES), queue-based systems, and similar data stores.
EEO Statement:
The company is an Equal Opportunity Employer. As an employer subject to affirmative action regulations, we invite you to voluntarily provide the following demographic information. This information is used solely for compliance with government record-keeping, reporting, and other legal requirements. Providing this information is voluntary and refusal to do so will not affect your application status. This data will be kept separate from your application and will not be used in the hiring decision.
We encourage you to review our Privacy Policy before submitting your application.