Backend Engineer - AML Framework Development (Search, Ads, and Recommendation Direction)
About The Team
The mission of our AML team is to advance the next-generation AI infrastructure and recommendation platform for ads ranking, search ranking, and live and e-commerce ranking at our company. We also drive substantial impact on the company's core businesses.

Responsibilities
- Iterate on the underlying architecture of the large model inference engine and own end-to-end GPU performance optimization. Use techniques such as operator fusion and compilation optimization to deeply optimize GPU memory access, the compute pipeline, and asynchronous stream scheduling, eliminating inference compute bottlenecks, improving single-card inference throughput, and reducing inference latency.
- Adapt the inference engine to the full range of GPU/NPU hardware architectures, improve its generality and hardware adaptability, and build a high-performance, low-loss foundation for large model inference.
- Lead the design, development, and optimization of distributed parallel solutions for large model inference, with a focus on implementing multi-dimensional parallel strategies such as tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism, and MoE expert parallelism. These address core problems such as splitting and deploying ultra-large models across multiple cards, high cross-card communication overhead, load imbalance, and low parallel efficiency.
- Track cutting-edge developments worldwide in large model inference, GPU high-performance computing, distributed parallelism, and cache optimization. Benchmark against mainstream inference frameworks such as vLLM and TensorRT-LLM, carry solutions through to implementation, continuously improve the performance and cost advantages of the inference system, and build the team's core technical strengths.