vLLM Beijing Meetup: Advancing Large-scale LLM Deployment

On August 2, 2025, Tencent’s Beijing headquarters hosted a major event in large model inference: the vLLM Beijing Meetup. A total of 260 developers, engineers, and industry experts gathered to witness the rapid growth of the vLLM ecosystem and its powerful capabilities in real-world applications.

The meetup was packed with valuable content. Experts from the core vLLM team, along with leading tech companies including Tencent, Huawei, Ant Group, ByteDance, Moonshot AI, and Xiaomi, shared cutting-edge practices and groundbreaking advancements. Their talks provided clear and insightful demonstrations of vLLM’s core strengths: efficiency, flexibility, and scalability.

Highlights from the Meetup

1. Overview of vLLM and Latest Developments

Kaichao You, a core maintainer of vLLM, gave a comprehensive overview of the project’s development journey, highlighting its core technologies and the latest advancements. He showcased vLLM’s breakthroughs in large-scale distributed inference, multimodal support, more refined scheduling strategies, and extensibility. He also outlined the future roadmap, which focuses on extreme performance optimization, broader hardware support, and a richer ecosystem toolchain, kicking off the event with a deep technical dive.

2. vLLM’s PD Disaggregation: Practice and Exploration in Tencent’s Inference Framework

Chao Zhang, an expert from Tencent, shared a deeply customized PD (Prefill-Decode) disaggregation framework built on top of vLLM. By decoupling the compute-critical path, this solution significantly improves inference efficiency. It has already been deployed at scale across multiple Tencent business scenarios, providing a reusable, enterprise-grade inference framework for high-concurrency large model services.

3. vLLM Ascend: Ascend’s Practice in Large-Scale Distributed Inference and Reinforcement Learning

Xiyuan Wang and Jie Wen, experts from the vLLM Ascend project team, shared their in-depth work on adapting vLLM to the Ascend AI hardware platform. They first presented the project’s achievements over the past few months, including major improvements in feature support, version releases, software quality, and inference performance.

They then demonstrated how to leverage the unique capabilities of the Ascend chips to optimize vLLM for large-scale distributed inference, using the DeepSeek large-scale EP scenario as a case study. Thanks to vLLM’s strong cross-platform adaptability, vLLM Ascend offers an efficient solution for deploying large models on Ascend hardware.

4. A 10x Performance Leap: Key Optimization Paths for DeepSeek Inference

Wengang Chen and Shoujian Zheng, engineers from Ant Group’s infrastructure team, delved into the key optimization strategies that boosted DeepSeek’s inference performance by 10x, breaking their approach down step by step: from GPU memory optimization to latency reduction techniques, and from single-node multi-model deployment to the PD (Prefill-Decode) disaggregation architecture. The talk served as a highly practical performance tuning guide, offering valuable insights for the community.

5. AIBrix v0.4.0 Preview: A More Efficient and Cost-Effective Control Plane for Large-Scale Inference

Jiannan Tan, GPU Infra Engineer at ByteDance, drew on ByteDance’s extensive online workload practices to offer a deep dive into how AIBrix addresses the core challenge of balancing efficiency and cost in large-scale model inference. He highlighted the tight integration between AIBrix and the high-performance vLLM inference engine, which not only improves inference efficiency but also significantly reduces resource costs, providing the industry with an innovative and practical approach to deploying large model services efficiently.

6. Kimi K2 Training and Inference Best Practices

Weiran He from Moonshot AI shared hands-on experience with the Kimi K2 model operating under strict SLO requirements, balancing high-concurrency online inference with reinforcement learning (RL) training demands. He focused on the coordinated architecture and key deployment strategies optimized for different hardware resources and workload constraints.

7. Native PD Disaggregation in vLLM via Point-to-Point NCCL

Zhonghua Deng, AI Infra Engineer at Xiaomi, gave an in-depth presentation on a native PD (Prefill-Decode) disaggregation solution implemented using point-to-point NCCL communication. He thoroughly explained the design principles and key breakthroughs of this architecture within vLLM. Backed by real-world deployment cases, he detailed the significant performance improvements achieved, offering valuable insights for collaboration within the vLLM open-source ecosystem.
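
For readers who want to try PD disaggregation themselves, the sketch below shows the general shape of such a setup in vLLM: one instance acts as the KV-cache producer (prefill) and another as the consumer (decode), with the cache transferred over a NCCL-based connector. The connector name (`PyNcclConnector`), the configuration fields, and the model are assumptions based on vLLM’s earlier disaggregated-prefill examples; the point-to-point NCCL connector presented in the talk may use a different name and options depending on your vLLM version.

```python
# disagg_prefill_sketch.py -- minimal sketch of a prefill/decode pair that
# exchanges KV cache over a NCCL-based connector. Run one process per role
# (e.g. on two GPUs or nodes); names and fields may differ across vLLM versions.
import sys

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Pick the role for this process: the prefill instance fills the KV cache and
# pushes it; the decode instance receives it and generates the response.
role = sys.argv[1] if len(sys.argv) > 1 else "kv_producer"  # or "kv_consumer"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",   # assumed NCCL-based connector name
        kv_role=role,
        kv_rank=0 if role == "kv_producer" else 1,
        kv_parallel_size=2,
    ),
)

prompts = ["Explain prefill-decode disaggregation in one sentence."]
# The prefill side only needs to generate a single token to populate the KV
# cache; the decode side produces the full completion.
max_tokens = 1 if role == "kv_producer" else 64
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=max_tokens))
for out in outputs:
    print(out.outputs[0].text)
```

In practice the two roles run as separate services, typically behind a proxy that forwards each request to the prefill instance first and then to the decode instance.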

With core functionality continuing to strengthen, the hardware ecosystem expanding, and control-plane and deployment solutions maturing, vLLM is becoming a solid foundation for bringing large models into production across industries. We look forward to the next gathering and to the continued growth of the vLLM ecosystem!

Read More