Senior Systems Engineer – Server Provisioning & Deployment, DCS
The Data Center Service team supports the company's fast growth by building and operating hyperscale data centers. The team manages the end-to-end lifecycle of the server fleet and provides cloud solutions and various infrastructure services, ensuring that they are scalable and reliable.

Responsibilities:

1. Large-Scale Server OS Deployment
- Own operating system deployment and delivery across large-scale IDC environments.
- Perform OS image installation, system initialization, and customized OS provisioning for servers.

2. Provisioning Platform Architecture Evolution
- Design, develop, and continuously enhance the core architecture of hyperscale automated server provisioning platforms.
- Drive improvements in platform scalability, reliability, and operational efficiency.

3. Low-Level Services & Hardware Enablement
- Develop and maintain core backend components of the provisioning system, including PXE services, OS image management, and related infrastructure.
- Support hardware enablement and compatibility for new server platforms and components.

4. Complex Troubleshooting & AIOps Innovation
- Investigate and resolve complex issues across the end-to-end server delivery lifecycle.
- Explore and implement Large Language Models (LLMs) and AI Agent technologies for intelligent log analysis, root cause identification, automated troubleshooting, and self-healing systems.

5. Engineering Efficiency & Security
- Build and optimize CI/CD pipelines for infrastructure changes.
- Strengthen lifecycle security compliance, risk mitigation, and disaster recovery capabilities.

6. Hardware Validation & Delivery Assurance
- Coordinate end-to-end server hardware validation activities to ensure that delivery quality and compliance requirements are met.

7. Performance Testing & Optimization
- Lead validation and testing of critical server components, including CPUs, memory, storage devices, and GPUs.
- Conduct single-node and cluster-level GPU performance benchmarking, stress testing, and performance tuning.

8. Test Automation
- Develop automated benchmarking and stress-testing frameworks using scripting languages to improve testing efficiency and coverage.

9. Quality Analytics & Continuous Improvement
- Perform quality analysis on large-scale server shipments.
- Drive quality control initiatives and manage closed-loop resolution of hardware and delivery issues.