Production Engineer - Applied Machine Learning Engine (Singapore)
About The Team
Backed by ByteDance’s world-leading core algorithm businesses in recommendation, advertising, and search, the Data-AML team is dedicated to building high-performance, highly available machine learning storage systems that support trillion-parameter models. We tackle the extreme challenges of globalized, ultra-large-scale clusters while playing a key role in the development and evolution of machine learning infrastructure. In this team, you'll have the opportunity to sharpen your expertise in multiple subdirections, including model serving, model training, and scheduling and orchestration. You will work in a team that runs machine learning services central to ByteDance at the highest level of availability, and that builds highly automated systems and pipelines.

Responsibilities
- Own production operations management and stability assurance for AML training, inference, and storage systems, covering core pipelines such as scheduling and orchestration, Kubernetes (K8s)/GPU clusters, distributed training, online inference serving, and Parameter Server/NoSQL storage.
- Build and maintain SLO/SLA frameworks, observability, alerting, on-call processes, incident diagnosis, self-healing mechanisms, disaster recovery, and post-incident review (postmortem) practices.
- Drive engineering capabilities including CI/CD, canary/gradual deployments, automated rollback, system health inspections, pre-flight checks, capacity forecasting, and elastic scaling.
- Lead resource governance and optimization across GPU, CPU, storage, and network infrastructure, including quota management, cost attribution, and performance tuning, to improve system availability, resource utilization, and engineering productivity.