Senior Software Engineer, Machine Learning Systems (Multiple Positions)
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

Why Join Us
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect, and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life, a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.

About the Team
Our team plays a crucial role in ensuring the company’s success. We seek people who are willing to learn and put in the effort to solve problems. Our challenges are not your regular day-to-day problems: you’ll be part of a team that’s developing new solutions to new challenges. We work fast, at scale, and we’re making a difference. We are looking for talented people to join us on this exciting journey!

Responsibilities
Design, develop, and optimize machine learning systems, focusing on heterogeneous computing architectures, resource management, and system monitoring.
Deploy and maintain scalable ML infrastructure, including distributed task scheduling and large-scale training pipelines.
Facilitate cross-layer optimization across hardware, systems, and AI algorithms to improve performance and efficiency of ML workloads.
Implement and enhance training framework features, supporting general-purpose functionality and model-specific optimizations, including large language models and diffusion models.
Improve reliability, efficiency, and throughput of massive-scale distributed training jobs across diverse computing environments.
Mentor interns and junior engineers.