Key Skills and Role Responsibilities:
This role is for a strategic and technical leader to define, build, and operate the infrastructure orchestration systems that power our organization's cutting-edge Artificial Intelligence (AI) initiatives. The Senior Director will lead a team responsible for ensuring a robust, scalable, cost-efficient, and high-performance platform for all stages of the AI lifecycle, from experimentation and training to deployment and inference.
Define and execute the long-term vision and roadmap for the company’s AI infrastructure Network Services, aligning it with overall business and AI Services goals.
Lead, mentor, and grow a high-performing engineering and operations team focused on AI infrastructure and platform engineering.
Manage budget and resource allocation for AI infrastructure Network Services deliverables.
Act as a key liaison between AI infrastructure and other service owners and consumers, core engineering, Cloud infrastructure, and executive leadership.
Oversee the design, implementation, and maintenance of the core network orchestration platforms for large-scale AI model training (e.g., distributed training, hyperparameter tuning) and deployment (e.g., containerization, serverless functions, edge deployment).
Ensure reliability, security, and compliance of the AI infrastructure, meeting strict standards for data governance and model integrity.
Establish Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) for the AI platform services and lead efforts for continuous optimization and performance tuning.
Evaluate, select, and integrate the core technologies required for the AI stack (e.g., cloud overlay/underlay networking, InfiniBand, load balancers, DNS, core networking, Kubernetes, Ray, GPU/accelerator management, distributed file systems).
Champion infrastructure-as-code (IaC) principles to manage and provision AI resources consistently and at scale.
Education: Bachelor's or Master’s degree in Computer Science, Engineering, or a related technical field.
Experience:
15+ years of progressive experience in software engineering, infrastructure, or platform operations.
5+ years of experience leading and managing technical teams, ideally at the Director or Sr. Director level or in an equivalent capacity.
Deep, hands-on experience designing and operating large-scale distributed systems and cloud-native network architectures.
Proven experience specifically with AI infrastructure orchestration (e.g., using Kubernetes) and managing accelerated compute resources (GPUs, TPUs, etc.).
15+ years of experience in cloud backend engineering, cloud design, deployment, and DevOps.
15+ years of experience leading system design and architecture leveraging Private Clouds and AWS and/or Azure/GCP.
10+ years of demonstrable experience building and operating infrastructure as code and infrastructure automation, plus comfort with various flavors of Linux.
15+ years of experience in building high-performance, highly available, and scalable distributed systems in the cloud.
15+ years of experience in building and managing high-performance, highly available, and scalable Hybrid Cloud environments.
Excellent cross-group collaboration and outstanding verbal and written communication skills.
Skills:
Expert-level knowledge of containerization and orchestration (Docker, Kubernetes).
Experience with software-defined cloud networking.
Strong background in DevOps and MLOps principles and tooling.
Proficiency in at least one modern programming language (e.g., Python, Go).
Exceptional strategic planning, organizational, and written/verbal communication skills.
Prior experience managing infrastructure for training and inference of large language models (LLMs) or foundation models.
Experience in a regulated industry with strict compliance requirements.
Experience building and operating an AI private cloud.
A successful Senior Director, AI Infrastructure Orchestration, will be measured by:
The time-to-market for AI infrastructure build, scale, and operation.
The resource utilization rate and cost efficiency of the AI compute infrastructure.
The reliability and uptime of the core AI platform services.
The talent retention and development within the AI Infrastructure team.
Coupang is one of the largest and fastest growing e-commerce platforms on the planet. Our vision is to create a world in which Customers ask "How did I ever live without Coupang?" We are looking for passionate builders to help us get there. Powered by world-class technology and operations, we have set out to transform the end-to-end Customer experience -- from revolutionizing last-mile delivery to rethinking how Customers search and discover on a truly mobile-first platform. We have been named one of the "50 Smartest Companies in the World" by MIT Technology Review and "30 Global Game Changers" by Forbes. Coupang is a global company with offices in Beijing, Los Angeles, Seattle, Seoul, Shanghai, and Silicon Valley.