Applied Scientist - LLM Training System as a Service - Global Frontier Tech Recruitment Program - 2027 Start (PhD)

San Jose·R&D·data

We are looking for talented individuals to join our team in 2027. As a graduate, you will get opportunities to pursue bold ideas, tackle complex challenges, and unlock limitless growth. Launch your career where inspiration is infinite at our Company.

Successful candidates must be able to commit to an onboarding date by the end of 2027. Please state your availability and graduation date clearly in your resume.

Team Introduction:
AML-MLsys combines system engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI.

Topic Content:
With the evolution from large language models (LLMs) to AI Agents, the training paradigm is undergoing a fundamental shift. Traditional distributed training frameworks like Megatron-LM are designed around relatively static parallelism strategies, whereas Agent training introduces more dynamic patterns, including external tool interactions, multi-step reasoning, and iterative self-improvement. In this context, a tightly coupled system design can limit flexibility and efficiency. To better support these emerging workloads, we aim to build a robust architecture that cleanly separates “logical control” from “compute execution,” enabling more scalable and adaptable training workflows.

Responsibilities:
- Develop and optimize frameworks for LLM training, inference, and reinforcement learning (RL).
- Work closely with model researchers to scale LLM training and reinforcement learning to the next level.
- Optimize GPU and CUDA performance to build an industry-leading, high-performance engine for LLM training, inference, and RL.
