We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments.
This role focuses on deep infrastructure expertise: ensuring the performance, scalability, and reliability of the platform layer that powers AI workloads, without being responsible for the workloads themselves.
You will play a key role in bare metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.
Key responsibilities:
- Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link issues, and topology validation.
- Act as the escalation point from L1 for complex infrastructure and hardware issues.
- Own and maintain accurate infrastructure modeling, IPAM, and source-of-truth data in NetBox.
- Own InfiniBand fabric management and advanced troubleshooting, using UFM for configuration, monitoring, and optimization of high-performance interconnects.
- Diagnose and resolve issues across GPU servers, networking, storage, and Kubernetes platforms.
- Perform deep hardware and system-level diagnostics (GPUs, PCIe, NICs, firmware, etc.).
- Support Kubernetes platform stability (node health, networking, scheduling issues).
- Contribute to automation of provisioning and operational workflows.
- Lead incident response, root cause analysis (RCA), and post-incident improvements.
- Collaborate with vendors and internal engineering teams on complex issues.
- Support infrastructure upgrades, firmware management, and capacity expansion.
Required Skills & Experience:
- 3–6+ years of experience in infrastructure operations, datacenter engineering, or cloud platforms.
- Strong Linux systems expertise.
- Hands-on experience with bare metal provisioning systems and lifecycle management.
- Strong experience with InfiniBand networking (troubleshooting, performance, fabric management using UFM).
- Experience with IPAM/DCIM tools such as NetBox, and with Ethernet network configuration and validation using Verity.
- Solid understanding of datacenter networking, storage, and hardware architecture.
- Working knowledge of Kubernetes in production environments.
- Strong troubleshooting skills across hardware and distributed systems.
Preferred qualifications:
- Experience with NVIDIA GPU platforms and accelerated computing infrastructure.
- Familiarity with automation tools (Terraform, Ansible, etc.).
- Exposure to OpenStack.
- Experience with observability stacks (Prometheus, Grafana, ELK).
Success in this role:
- Rapid resolution of complex infrastructure and networking issues.
- High reliability and performance of InfiniBand and GPU infrastructure.
- Scalable and efficient bare metal provisioning processes.
- Strong contribution to automation and operational excellence.
- Trusted escalation point and technical leader within the team.
We offer:
- Work with an established Silicon Valley leader in the cloud infrastructure industry;
- Work with exceptionally passionate, talented, and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies;
- Be a part of cutting-edge, open-source innovation;
- Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued;
- Professional development and training;
- Attend conferences and working groups;
- Company outings, happy hours, hackathons, and tech talks;
- Receive a competitive compensation package with a strong benefits plan.
We are a Leader for Container Management in G2 (#2 after AWS)!