SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as tensor parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that…
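
For context, the sketch below illustrates where this synchronization point sits in a tensor-parallel attention block and what dropping it would mean. It is a minimal PyTorch illustration under our own assumptions (the module name, the `drop_sync` flag, and the helper are hypothetical), not the paper's actual block design.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


def tp_world_size() -> int:
    """Tensor-parallel world size (1 when running without torch.distributed)."""
    return dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1


class ParallelAttentionOutput(nn.Module):
    """Row-parallel output projection of an attention block (illustrative only).

    Each rank holds a slice of the attention heads, so its input has
    hidden_size // world_size features; the per-rank partial outputs are
    normally summed with an all-reduce, which is the sync point in question.
    """

    def __init__(self, hidden_size: int, drop_sync: bool = False):
        super().__init__()
        self.out_proj = nn.Linear(hidden_size // tp_world_size(), hidden_size, bias=False)
        self.drop_sync = drop_sync  # hypothetical switch illustrating SPD-style skipping

    def forward(self, local_attn: torch.Tensor) -> torch.Tensor:
        out = self.out_proj(local_attn)  # rank-local partial result
        if not self.drop_sync and tp_world_size() > 1:
            # Standard tensor parallelism: one all-reduce per attention block
            # sums the partial outputs across ranks.
            dist.all_reduce(out, op=dist.ReduceOp.SUM)
        # When drop_sync is True the all-reduce is skipped, removing one
        # communication round at the cost of each rank keeping only its
        # local partial sum for this block.
        return out


if __name__ == "__main__":
    block = ParallelAttentionOutput(hidden_size=64, drop_sync=True)
    x = torch.randn(2, 8, 64 // tp_world_size())
    print(block(x).shape)  # torch.Size([2, 8, 64])
```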