Senior Backend Engineer - AML Engine Orchestration

Singapore·R&D·engineering

Team Introduction
The mission of our AML team is to push next-generation machine learning algorithms and platforms for the recommendation system, ads ranking and search ranking in our company. We also drive substantial impact on core businesses of the company.

Responsibilities:

1. Resource Efficiency Optimization in Distributed Orchestration and Scheduling:
- Develop and extend distributed orchestration frameworks within the Kubernetes/Godel ecosystem. Select appropriate frameworks for each business scenario, and optimize cluster utilization and load-balancing strategies according to the specific characteristics of that scenario;
- Integrate and expand AutoScaling and automatic parallelization capabilities for various models and tasks. Apply load modeling and analytic methods to different models to automatically optimize resource requests, achieving large-scale improvements in resource usage efficiency and global optimality;
- Own preemption and re-scheduling mechanisms for services with different priorities, and manage automatic resource multiplexing across different clusters and resource types; handle scheduling and load adaptation across multi-datacenter, multi-region, and multi-cloud environments.

2. Building Training System Architecture for Next-Generation Ultra-Large and Ultra-Deep Recommendation Models:
- Develop a flexible, elastic and robust distributed training runtime focused on hyper-scaled embeddings and large-scale GPU training;
- Design and optimize distributed computing APIs and runtimes geared towards future recommendation and ads model paradigms (e.g., reinforcement learning, fine-tuning and/or distillation);
- Collaborate with platform teams to enhance the diagnosability and usability of distributed training systems.

3. Constructing Online Orchestration Architecture for Next-Generation Recommendation Systems:
- Build a robust distributed model inference architecture for online learning scenarios involving hyper-scaled embeddings;
- Optimize the usability of online recommendation and ads model architectures and MLOps workflows.
