Research Engineer – Multimodal Training Infrastructure (Seed Infra)

San Jose · R&D · Engineering
Apply on ByteDance (TikTok) →

About the team

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities

- Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
- Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
- Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world infrastructure solutions
