Who You’ll Work With
SREs at Arista combine strong software and systems engineering with a passion for operating production systems at scale. As an SRE, you’ll be part of the team responsible for our global service fleet.
What You’ll Do:
CloudVision is deployed on Kubernetes across global regions using Spinnaker for our CI/CD pipeline. Our tech stack runs on GKE, using HBase/Hadoop as the main distributed database and storage layer, Elasticsearch for powering search, ClickHouse for fast real-time queries of flow data, our own Kafka-based distributed real-time stream processing layer for analytics, and TensorFlow for ML analysis. Our monitoring system is built on top of Prometheus, Grafana, Loki, and other OSS tools.
As a Senior SRE, you’ll be responsible for our global CloudVision service fleet. This includes:
- Build, safely and incrementally deploy, and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
- Monitor, support, and enhance the product deployment experience across services.
- Build automation to remove toil and efficiently operate production systems.
- Proactively monitor, respond to, and improve alerts, and set up automated alert handling.
- Create and maintain the incident response runbooks.
- Build and deploy new systems with scalability, reliability, and observability as primary requirements.
- Triage platform and infrastructure issues, and help Arista software engineers with their own triage. Engage with third-party vendor support.
- Deploy new systems in a staged manner.
- Write postmortem documents and build solutions to prevent incidents from recurring.
- Plan and communicate maintenance windows on production systems.
- Work with Arista’s product development teams to identify infrastructural issues that are causing bottlenecks and limitations in their workflows. Design and implement solutions to resolve them.
- Survey and adopt best practices around infrastructure/platform to maintain secure, scalable and fault-tolerant systems.
- Implement solutions to scale the systems.
- Implement fault-tolerance and performance improvements to increase the availability of the systems.
- Study the design of OSS systems, and enough of their implementation details, to triage and resolve issues more effectively.
Qualifications
- Bachelor's degree in Computer Science or Engineering + 5 years’ experience, MS in Computer Science or Engineering + 5 years’ experience, or equivalent work experience.
- Knowledge of one or more of Go, Python, or Bash shell scripting, sufficient to implement medium-complexity automation workflows.
- Knowledge of Linux (or UNIX) from an administration and debugging perspective.
- Hands-on experience operating software systems (infrastructure, complex applications, etc.) at scale.
- Experience in server provisioning (especially from a storage and networking perspective).
- Strong problem-solving and software troubleshooting skills.
- Experience with infrastructure-as-code.
- One or more of the following skills is desirable:
- Experience managing databases, e.g. PostgreSQL or an equivalent RDBMS.
- Experience with Docker and virtualization technologies.
- Experience managing a monitoring stack such as Prometheus and Grafana.
- Experience managing Artifactory, Docker registries, etc.
- Experience managing CI/CD systems such as GitLab tools and Spinnaker.
- Experience with infrastructure-as-code frameworks like Terraform.
- Experience with container orchestration via Kubernetes.