Production Engineer - Applied Machine Learning

San Jose·R&D·engineering
Apply on ByteDance (TikTok) →

The mission of our AML team is to push next-generation recommendation-based algorithms and platforms for the company. We also drive substantial impact for the company's core businesses. We are currently looking for Production Engineers to join our team to support and advance that mission.

Responsibilities:
- System Stability & Production Management: Responsible for the production management and stability assurance of AML (Applied Machine Learning) training, inference, and storage systems. This covers core pipelines including scheduling and orchestration, K8s/GPU clusters, distributed training, online inference serving, and ParameterServer/NoSQL storage.
- Reliability Engineering: Build and maintain mechanisms for SLO/SLA, observability, alerting, on-call processes, fault diagnosis, auto-healing, disaster recovery, and incident reviews (post-mortems).
- Engineering Excellence: Drive engineering capabilities such as CI/CD, canary releases, auto-rollback, automated inspections, pre-flight checks, capacity forecasting, and elastic auto-scaling.
- Resource & Cost Management: Oversee resource governance across GPU/CPU/storage/network, including quota management, cost attribution, and performance tuning, to improve system availability, resource utilization, and overall R&D efficiency.