
We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments.

This role focuses on deep infrastructure expertise: ensuring the performance, scalability, and reliability of the platform layer that powers AI workloads, without responsibility for the workloads themselves.

You will play a key role in bare metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.

Key responsibilities:

Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link issues, and topology validation.
Act as the escalation point for L1 engineers on complex infrastructure and hardware issues.
Own and maintain accurate infrastructure modeling, IPAM, and source-of-truth data in NetBox.
Own InfiniBand fabric management and advanced troubleshooting, utilizing Verity for configuration, monitoring, and optimization of high-performance interconnects.
Diagnose and resolve issues across GPU servers, networking, storage, and Kubernetes platforms.
Perform deep hardware and system-level diagnostics (GPUs, PCIe, NICs, firmware, etc.).
Support Kubernetes platform stability (node health, networking, scheduling issues).
Contribute to automation of provisioning and operational workflows.
Lead incident response, root cause analysis (RCA), and post-incident improvements.
Collaborate with vendors and internal engineering teams on complex issues.
Support infrastructure upgrades, firmware management, and capacity expansion.

Required Skills & Experience:

3–6+ years of experience in infrastructure operations, datacenter engineering, or cloud platforms.
Strong Linux systems expertise.
Hands-on experience with bare metal provisioning systems and lifecycle management.
Strong experience with InfiniBand networking (troubleshooting, performance, fabric management using UFM).
Experience with IPAM/DCIM tools such as NetBox, and with Ethernet network configuration and validation using Verity.
Solid understanding of datacenter networking, storage, and hardware architecture.
Working knowledge of Kubernetes in production environments.
Strong troubleshooting skills across hardware and distributed systems.

Preferred qualifications:

Experience with NVIDIA GPU platforms and accelerated computing infrastructure.
Familiarity with automation tools (Terraform, Ansible, etc.).
Exposure to OpenStack.
Experience with observability stacks (Prometheus, Grafana, ELK).

Success in this role:

Rapid resolution of complex infrastructure and networking issues.
High reliability and performance of InfiniBand and GPU infrastructure.
Scalable and efficient bare metal provisioning processes.
Strong contribution to automation and operational excellence.
Trusted escalation point and technical leader within the team.

We offer:

Work with an established Silicon Valley leader in the cloud infrastructure industry;
Work with exceptionally passionate, talented and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies;
Be a part of cutting-edge, open-source innovation;
Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued;
Professional development and training;
Attend conferences and working groups;
Company outings, happy hours, hackathons, and tech talks;
Receive a competitive compensation package with a strong benefits plan.

We are a Leader in Container Management on G2 (#2 after AWS)!

L2 Datacenter Support Engineer
