As a trusted global transformation partner, Welocalize accelerates the global business journey by enabling brands and companies to reach, engage, and grow international audiences. Welocalize delivers multilingual content transformation services in translation, localization, and adaptation for over 250 languages with a growing network of over 400,000 in-country linguistic resources. Driving innovation in language services, Welocalize delivers high-quality training data transformation solutions for NLP-enabled machine learning by blending technology and human intelligence to collect, annotate, and evaluate all content types. Our team works across locations in North America, Europe, and Asia serving our global clients in the markets that matter to them. www.welocalize.com
To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed below are representative of the knowledge, skill, and/or ability required. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
Job Reference: #LI-JC1
Role Summary:
We are seeking a Senior Scalability Engineer to design and optimize platforms capable of supporting the significant growth of AI/ML workloads. This role is focused on ensuring the scalability, reliability, and efficiency of AI/ML infrastructure while contributing to the development of robust, high-performance systems. The ideal candidate will collaborate with cross-functional teams to build resilient infrastructure and implement solutions that ensure seamless model deployment, monitoring, and lifecycle management at scale.
Key Responsibilities:
Platform Scalability: Design and implement scalable solutions for AI/ML infrastructure, enabling horizontal scaling, efficient resource utilization, and fault tolerance under high-demand scenarios.
Stability & Reliability: Apply best practices for platform stability, high availability, and disaster recovery, ensuring uninterrupted operations during peak workloads.
Observability & Monitoring: Build and maintain advanced observability frameworks, including monitoring, logging, and tracing solutions, leveraging tools like Datadog.
Automation & Efficiency: Develop automation pipelines for infrastructure provisioning, deployment, and operational workflows to minimize manual intervention and maximize efficiency.
Cross-Functional Collaboration: Work closely with data science, product, and engineering teams to align infrastructure capabilities with organizational goals and ensure seamless model deployment, testing, and lifecycle management.
Cost Optimization: Implement strategies to optimize cloud resource usage and manage platform costs effectively while maintaining performance and reliability.
Incident Response: Participate in incident response efforts, including post-mortems and root cause analyses, to improve platform resilience and prevent recurring issues.
Continuous Improvement: Stay current with industry trends in cloud infrastructure, distributed systems, and observability, applying innovative solutions to enhance platform scalability and performance.
Qualifications:
Educational Background: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Experience: 5+ years of experience in AI/ML platform engineering, infrastructure, or operations.
Proven track record of designing, scaling, and maintaining large-scale distributed systems with a focus on scalability, stability, and performance.
Technical Expertise:
Expertise in cloud infrastructure (AWS, GCP, Azure) and infrastructure-as-code tools (Terraform, CloudFormation, etc.).
Strong programming skills in Python and Node.js, with experience building scalable, maintainable systems.
Deep understanding of observability practices, including distributed tracing, log aggregation, and real-time monitoring.
Scalability & Reliability:
Proven ability to design scalable architectures and implement solutions for automated failover and disaster recovery.
Experience in optimizing performance and resource utilization for high-demand environments.
Communication & Collaboration:
Strong communication skills, capable of articulating technical concepts to both technical and non-technical stakeholders.
Ability to collaborate effectively with cross-functional teams to deliver integrated solutions.
Problem-Solving Skills:
Excellent problem-solving skills and the ability to address complex technical challenges in a fast-paced environment.
Cost Optimization: Experience with cost management strategies for cloud-based platforms, with a focus on maintaining an optimal balance between performance and cost.