PyTorch/XLA 2.7 Release: Usability, vLLM Boosts, JAX Bridge, GPU Build

PyTorch/XLA is a Python package that uses the XLA deep learning compiler to enable PyTorch deep learning workloads on various hardware backends, including Google Cloud TPUs, GPUs, and AWS Inferentia/Trainium. The PyTorch/XLA team has been working hard to bring new capabilities to researchers and developers using TPUs/GPUs and XLA backends. In this update, we’ve made many additions and improvements to the framework. Some of the notable highlights are: 

  • Usability improvements
  • Experimental bridge with JAX operations
  • A new Pallas-based kernel for ragged paged attention, enabling further optimizations on vLLM TPU

These features, bug fixes, and other details are outlined in the release notes. Let’s now delve into the highlights in detail!

Usability Improvements

Developers can now target the exact regions of code they want to profile by marking them explicitly, making it easier to measure the performance of specific code paths. For example:

import torch_xla.debug.profiler as xp

# Start the profiler server once at process startup.
server = xp.start_server(8001)

# Capture a trace only around the region of interest.
xp.start_trace(profiling_dir)
# Run some computation
...
xp.stop_trace()

PyTorch/XLA 2.7 also introduces an API to query the number of cached compilation graphs, aiding in the detection of unexpected compilations during production inference or training. An additional enhancement optimizes host-to-device transfers by avoiding unnecessary tensor copying, thus improving performance.

JAX Bridge in PyTorch/XLA (Prototype)

We’re experimenting with integrating JAX operations directly into PyTorch/XLA graphs as a way to enable a bridge between the frameworks — this method enables users to call JAX functions inside PyTorch models running with XLA.

As a use case, we’ve explored calling `jax.experimental.shard_alike` from PyTorch/XLA. This function improves sharding propagation in certain code patterns like scan, and we’ve integrated it as part of the GSPMD (Generalized SPMD) workflow in the compiler. This tool is utilized in torchprime to enable support for the SplashAttention Pallas kernel.

import torch_xla.core.xla_builder as xb

# Native function written in JAX
def jax_function(...):
  import jax
  ...
  return ...

res = xb.call_jax(...)
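
For a more concrete, hedged illustration: assuming the `xb.call_jax(fn, args)` calling convention sketched above (the exact released signature may differ), calling a small JAX function on XLA tensors could look roughly like this:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.core.xla_builder as xb

def jax_add_one(x):
  # Written against jax.numpy; JAX traces this and stages it into the XLA graph.
  import jax.numpy as jnp
  return x + jnp.ones_like(x)

t = torch.randn(4, 4, device=xm.xla_device())
res = xb.call_jax(jax_add_one, (t,))  # assumed convention: call_jax(fn, args_tuple)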

Ragged Paged Attention Pallas Kernel

Efficient attention for variable-length sequences is critical for scaling large language models, and the new Pallas kernel for ragged paged attention brings a major performance and usability upgrade to vLLM TPU.

This update introduces a custom kernel implemented using the Pallas custom kernel language and is lowered to Mosaic for TPU. It supports ragged (variable-length) input sequences and implements a paged attention pattern. Below are the key features:

  • Supports mixed prefill and decode operations to increase inference throughput (e.g., up to a 5x speedup compared to the padded Multi-Queries Paged Attention implementation for llama-3-8b).
  • No GMM (Grouped Matmul) Metadata required! We calculate the metadata on the fly in the kernel. This can increase performance by 10%.
  • Provides a CUDA Flash Attention equivalent with Paged Attention support and a similar interface.

We are continuously collaborating with the vLLM community to further optimize performance, expand kernel coverage, and streamline TPU inference at scale.

GPU Build is Back

The GPU build was paused in the PyTorch/XLA 2.6 release, but we’ve now re-enabled GPU Continuous Integration (CI) in version 2.7. The current release includes GPU builds with CUDA 12.6, marking an important step forward for GPU support.

While CUDA support is still considered experimental in this release, we plan to expand coverage to additional CUDA versions in upcoming releases.

Get Involved

Please check out the latest changes on GitHub. As always, we’re actively seeking feedback and contributions from the community.

Read More

MetaShuffling: Accelerating Llama 4 MoE Inference

Mixture-of-Experts (MoE) is a popular model architecture for large language models (LLMs). Although it reduces computation in training and inference by activating fewer parameters per token, it imposes additional challenges in achieving optimal computation efficiency due to high memory and communication pressure, as well as the complexity of handling the dynamic and sparse nature of the model. Here we introduce a new MoE inference solution, MetaShuffling, which enables us to efficiently deploy Llama 4 models for production inference.

Llama 4 Architecture

Llama 4 Scout and Maverick models are officially released. Scout has a shared expert and 16 routed experts, and Maverick has a shared expert and 128 routed experts, both with dropless token-choice routing and Top-1 selection in each MoE layer. In addition, both shared and routed experts use SwiGLU activation with 3 linear layers. Please refer to The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation for more information about the model.

Key Concept

There are multiple common solutions to handle dynamism and sparsity problems introduced in MoE layers. Here we demonstrate different solutions of token-choice routing with Top-1 selection.

The above diagram shows the padding design. Each box represents a token; the yellow / green colors represent valid tokens assigned to different routed experts, and the grey color represents padded tokens. Each row of boxes in the second step represents a different routed expert. Ti represents the i-th token from the current rank of the data parallel group.

  • Padding: In this approach, we pad activation to maximum sequence length for each expert and run a single batched matrix multiplication (BMM). It incurs:
    • Increased memory on holding paddings.
    • Increased latency on processing paddings. Note that it is possible to avoid processing padding through jagged kernels, but jagged kernels may also incur high overhead when the number of experts is large. 
  • Slicing: In this approach, we slice activation to exact sequence length for each expert and run multiple matrix multiplications (MM). It avoids the problems in padding, but it incurs:
    • Reduced kernel efficiency, caused by repeated kernel launches on small shapes.
    • Reduced device utilization, caused by frequent host and device synchronizations on dynamic shapes, plus extra kernel launch overheads, as it is incompatible with graph capturing mechanisms (e.g. CUDAGraph and torch.compile).

  • Concatenation: In this approach, we further concatenate the activations after slicing and run a single grouped matrix multiplication (GMM). It avoids the kernel efficiency problem in slicing, but still incurs:
    • Reduced device utilization, as it still requires host and device synchronization and remains incompatible with graph capturing mechanisms.

To further improve the solution, we propose a shuffling-based mechanism:

  • Shuffling: In this approach, we directly sort the tokens so that routed tokens are ordered by routed expert’s ID. By doing so, no padding or splitting is introduced, and tokens assigned to the same experts are stored together and can be processed together inside GroupedGEMM. It provides a dense model interface and avoids all the problems mentioned above.
    • No paddings as the activation remains a dense tensor.
    • No host and device synchronization, as the activation remains a static-shaped tensor.

We built an end-to-end MoE inference solution, MetaShuffling, based on this design.
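
To make the shuffling idea concrete, here is a minimal, illustrative PyTorch sketch for Top-1 token-choice routing (the `shuffle_by_expert` helper is hypothetical; MetaShuffling implements these steps with the fused kernels described below):

import torch

def shuffle_by_expert(tokens: torch.Tensor, routing_scores: torch.Tensor):
    # tokens: [T, D]; routing_scores: [T, E]
    expert_ids = routing_scores.argmax(dim=-1)        # chosen expert per token (Top-1)
    order = torch.argsort(expert_ids, stable=True)    # tokens of the same expert become contiguous
    shuffled_tokens = tokens[order]                   # dense tensor, no padding introduced
    token_counts = torch.bincount(expert_ids, minlength=routing_scores.shape[1])
    return shuffled_tokens, order, token_counts       # counts = per-expert group sizes for GroupedGEMM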

Runtime Design

No Parallelism for Single-GPU Inference

Above is the overall runtime design for single-GPU inference without model parallelism. Note that, to optimize performance, the first and third linear layers of SwiGLU activation are merged together as GroupedGEMM13 / GEMM13.

  • Solid dark blue/orange boxes represent tensor core heavy kernels on routed/shared expert streams.
  • Solid light blue/orange boxes represent CUDA core or memory traffic-heavy kernels on routed/shared expert streams.
  • Red arrows represent data flows of activation tensors.
  • Green arrows represent data flows of metadata tensors.

All metadata tensors are placed on the device. There is no blocking device-to-host synchronization. All kernels are launched back to back without bubbles. The diagram shows data flows only, not actual profiling traces.

Kernel Interfaces And Data Flows

  • RoutingScores: A function or fused kernel that handles routing scores calculation.
    • Input: input_tokens: [T, D] (T: number of tokens; D: feature dimension); router_weights: [D, E] (E: number of experts); router_biases: [E];
    • Output: routing_scores: [T, E]; scaling_factors: [T, E];
  •  IndexShuffling: A fused kernel that handles shuffling and sorting of indices. We will introduce an optimized implementation in the Kernel Design section.
    • Input: routing_scores: [T, E]; K (threshold for top-k routing);
    • Output: routed_token_indices: [K * T]; routed_expert_indices: [K * T]; routed_token_counts_per_expert: [E];
  • GatherMul: A fused kernel that shuffles tokens based on sorted indices and scales them.
    • Input: input_tokens: [T, D]; routed_token_indices: [K * T]; routed_expert_indices: [K * T]; scaling_factors: [T, E];
    • Output: scaled_routed_tokens: [K * T, D]
  • GroupedGEMM: An optimized GroupedGEMM kernel that handles on-device shape information about batches along M dimension without restrictions. We will introduce an optimized implementation in the Kernel Design section.
    • Input: tokens: [K * T, D]; weights: [E, D, HD] (HD: hidden dimension); routed_token_counts_per_expert: [E];
    • Output: tokens: [K * T, HD]
  • GEMM: An optimized GEMM kernel. Similar interface to dense model.
  • NonLinearity: A fused kernel that handles non-linearity. Similar interface to dense model.
  • ScatterAdd: An optimized kernel that reverses token shuffling based on sorted indices and directly performs scatter add to shared expert output without materializing an unshuffled tensor.
    • Input: shared_output_tokens: [T, D]; routed_output_tokens: [K * T, D]; routed_token_indices: [K * T]; 
    • Output: combined_output_tokens: [T, D]
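
As a hedged, eager-mode illustration of the GatherMul / ScatterAdd pair described above (plain PyTorch indexing rather than the fused FBGEMM kernels):

import torch

def gather_mul(tokens, token_indices, expert_indices, scaling_factors):
    # tokens: [T, D]; token_indices, expert_indices: [K * T]; scaling_factors: [T, E]
    scales = scaling_factors[token_indices, expert_indices].unsqueeze(-1)  # [K * T, 1]
    return tokens[token_indices] * scales                                  # [K * T, D]

def scatter_add_combine(shared_output_tokens, routed_output_tokens, token_indices):
    # Reverses the shuffle implicitly: each routed token is added back at its original
    # position on top of the shared expert output, without materializing an unshuffled tensor.
    return shared_output_tokens.index_add(0, token_indices, routed_output_tokens)  # [T, D]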

Note that if quantization is applied, then activation quantization kernels are fused into the preceding non-GEMM kernels, which means fusing into GatherMul for GroupedGEMM13 and fusing into NonLinearity for GroupedGEMM2, etc.

Note that with a large K * T, the GatherMul and ScatterAdd operations could be further fused into the following / preceding GroupedGEMM operations as the global-memory-to-shared-memory/register (prologue) or shared-memory-to-global-memory (epilogue) steps; however, this adds additional challenges for overlapping with tensor core execution at the kernel design level. Besides, fusing ScatterAdd requires shared experts to complete before routed experts, which might not be a good design choice if these kernels can be used to hide AlltoAll latency.

Tensor Parallelism for Single-Host Inference

Above is the overall runtime design for single-host inference with tensor parallelism (TP). Compared to single-GPU inference, the additional step is:

  • Solid light mint boxes represent network traffic-heavy communication kernels.

Still, all metadata tensors are placed on the device, and there is no device-to-host synchronization. All kernels are launched back to back without bubbles. The diagram shows data flows only, not actual profiling traces.

Workload Sharding and Additional Kernels

No additional custom kernel is introduced compared to the single-GPU inference use case. For GEMM, GroupedGEMM, and non-linearity kernels, the activations and weights are both sharded to 1/TP along different dimensions, and the computation/memory overhead is likewise reduced to 1/TP.

The final step is an AllReduce if only tensor parallelism is applied, or a ReduceScatter if tensor parallelism is combined with sequence parallelism.

Expert Parallelism for Multi-Host Inference

To enable expert parallelism (EP), we swap the data parallelism dimension outside the routed experts for an expert parallelism dimension inside the routed experts. Note that tensor parallelism could be further swapped with expert parallelism for better GEMM efficiency at the cost of increased routing imbalance risk, but we won't cover this design in this blog.

If expert parallelism is enabled with token-choice routing, then we must decide between using dense tensors or using static shapes, because the number of routed tokens to different expert groups is dynamic. 

  • We use dense tensors and dynamic shapes when eager mode is preferred: running unpadded AlltoAlls avoids wasting network traffic and memory space on padding.
  • We use sparse (padded) tensors and static shapes when graph mode is preferred: running with CUDAGraph avoids GPU bubbles caused by CPU launch overheads and device-to-host synchronization.

Note that wasted network traffic with padded activations can also be avoided using a custom AlltoAll implementation, but we won’t cover any topics on custom communication or communication and computation fusion kernels in this blog.

Above is the overall runtime design for multi-host inference with tensor parallelism and expert parallelism. Compared to single-host inference with tensor parallelism, the additional elements are:

  • Solid red arrows represent intra-node communication.
  • Solid purple arrows represent inter-node communication.

Kernel Interfaces And Data Flows

For added expert parallelism-based communication, we use 3-shot All2All communication to exchange shapes and tokens:

  • 1st A2A: Exchange on-device metadata tensor about number of tokens routed to each expert, which is `routed_token_counts_per_expert: [E]`, the output generated from IndexShuffling kernel.
  • 2nd A2A: Exchange tokens from the data-parallel layout to the expert-parallel layout, dispatching them to different EP ranks based on routing.
  • 3rd A2A: Exchange tokens from the expert-parallel layout back to the data-parallel layout, combining them from different EP ranks based on routing.

Besides, we added 2 additional shuffling kernels and 1 special scatter kernel:

  • CombineShuffling (Dense or Padded): Reshuffles received tokens from rank first order to expert first order. Following T* indicates the number of total tokens received from all peers, which can be further interpreted as a jagged dimension with shape information from routed_token_counts_per_rank_per_expert tensor.
    • Input: received_tokens: [T*, D] (first ordered by dp ranks, then ordered by expert indices); routed_token_counts_per_rank_per_expert: [EP, E // EP];
    • Output: reshuffled_tokens: [T*, D] (first ordered by expert indices, then ordered by dp ranks); routed_token_counts_per_expert: [E // EP];
  • SplitShuffling (Dense or Padded): Reverse process of CombineShuffling. Reshuffles to-send tokens from expert first order to rank first order.
    • Input: reshuffled_tokens: [T*, D] (first ordered by expert indices, then ordered by dp ranks); routed_token_counts_per_rank_per_expert: [EP, E // EP];
    • Output: to_send_tokens: [T*, D] (first ordered by dp ranks, then ordered by expert indices);
  • ScatterAdd (Padded): Scatter-adds valid tokens from padded tensors.
    • Input: shared_output_tokens: [T, D]; received_padded_routed_output_tokens: [EP, K*T, D];  routed_token_indices: [K * T];  routed_token_counts_per_expert: [E]; 
    • Output: combined_output_tokens: [T, D]

We will provide a better demonstration of the above kernels in detail in the `Padded Communication with Static Shapes In Graph Mode` section.

Unpadded Communication with Dynamic Shapes In Eager Mode

High-level diagram on runtime behavior. The actual runtime of different components might vary based on software and hardware.

Minimize Usage of Dynamic Shapes

As the routing is dynamic per MoE layer, the minimal amount of device/host synchronization required is once per layer. To achieve this, we delay the D2H copy of `send_sizes` and concatenate it with `recv_sizes`, transferring both with a single D2H copy. This reduces device/host synchronization to once per layer.
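
A hedged eager-mode sketch of this pattern with `torch.distributed` (variable names and metadata handling are simplified relative to the production path):

import torch
import torch.distributed as dist

def dispatch_tokens(tokens, send_sizes, group=None):
    # tokens: [T_send, D], already ordered by destination EP rank.
    # send_sizes: [EP] on device; number of tokens to send to each EP rank.
    recv_sizes = torch.empty_like(send_sizes)
    dist.all_to_all_single(recv_sizes, send_sizes, group=group)  # exchange counts (1st A2A)
    # Single D2H copy covering both send and receive sizes, i.e. one sync per layer.
    sizes = torch.cat([send_sizes, recv_sizes]).tolist()
    ep = send_sizes.numel()
    send_sizes_host, recv_sizes_host = sizes[:ep], sizes[ep:]
    out = tokens.new_empty(sum(recv_sizes_host), tokens.shape[1])
    dist.all_to_all_single(out, tokens, output_split_sizes=recv_sizes_host,
                           input_split_sizes=send_sizes_host, group=group)  # dispatch (2nd A2A)
    return out, recv_sizes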

Minimize Negative Impact on Dynamic Shapes

To further hide the device/host synchronization overhead, we further split the shared experts into 2 parts.

  • We dispatch the first part right after routing, but before the dispatch A2As. Then, when the device/host synchronization happens, the device is kept busy running shared experts.
  • We dispatch the second part right after the MoE, but before the combine A2A. This further helps overlap the second A2A.

Padded Communication with Static Shapes In Graph Mode

Minimize Usage of Padding

With a dropless token choice design, the maximum possible number of tokens routed to any single expert is T. However, if we group multiple experts together and place them on a single GPU through expert parallelism sharding, for TopK routing,

  • The maximum number of tokens routed to 1 expert is T.
  • The maximum number of tokens routed to 2 experts is 2 * T.
  • The maximum number of tokens routed to K experts is K * T.
  • The maximum number of tokens routed to K + 1 experts is, still, K * T. 

So the maximum number of tokens routed to an expert group of N experts will be capped at min(N, K) * T tokens. 

For Top1 routing, the number of tokens routed to an expert group of any size will always be capped at T tokens, and the minimal required memory to allocate and hold for dynamic tokens is EP * T tokens, as there are EP expert groups. 
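
As a small illustrative helper (hypothetical, not part of the released implementation), the per-EP-rank activation allowance implied by this argument is:

def padded_tokens_per_ep_rank(T: int, K: int, E: int, EP: int) -> int:
    # With dropless Top-K token-choice routing, an expert group of E // EP experts
    # can receive at most min(E // EP, K) * T tokens from one data parallel rank.
    return min(E // EP, K) * T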

To achieve the minimal required padding, we directly use AllGather to gather all active tokens from different EP ranks and then split and reshuffle the routed tokens locally through custom kernels. The activation size is compressed to 1 / (E // EP), with corresponding reductions in memory and network traffic.

The above diagram shows the padding design. Each box represents a token; the blue / green colors represent valid tokens with expert assignments, and the grey color represents padded tokens. RiTj represents the j-th token from the i-th rank of the expert parallelism group.

Minimize Negative Impact on Padding

Even though padding is reduced to the minimum allowance, we also ensure that it only costs memory space (allocation) and network traffic (communication), without causing redundant computation (GroupedGEMM / NonLinearity) or redundant memory bandwidth (CombineShuffling / SplitShuffling / ScatterAdd), by passing the on-device shape information `routed_token_counts_per_expert` or `routed_token_counts_per_rank_per_expert` to the kernels.

Activation Conceptual Interpretation

Most importantly,

  • When the total number of active tokens across all EP ranks is small, touching only the valid tokens is important to avoid activating redundant experts in GroupedGEMM and causing extra memory traffic.
  • When the total number of active tokens across all EP ranks is large, it is also important to avoid turning GroupedGEMM from memory bound into compute bound.

CombineShuffling: The tokens assigned to the current EP rank are reshuffled from expert first order to rank first order right after AllGather. The tokens not assigned are not copied, and the remaining allocated memory space at the end of the tensor remains untouched.

SplitShuffling: The tokens assigned to the current EP rank are reshuffled from rank-first order to expert-first order right before AlltoAll. The tokens not assigned are not copied, and the reshuffled tensors have paddings stored in an interleaved fashion.

ScatterAdd (Padded): Each EP rank finally receives activations computed by all other ranks; it knows where the valid and padded tokens are, and only reads the valid tokens for the scatter_add.

Communication Deduplication

Different tensor parallelism ranks have the same activation before 1st GroupedGEMM and after 2nd GroupedGEMM, so the same tokens are exchanged across nodes repeatedly. 

We enabled communication deduplication to evenly distribute the inter-node communication workload to different ranks with extra intra-node communication introduced. Example of DP2/TP8/EP2:

  • For first AlltoAll in eager mode, split T*D inter-node AlltoAll to T*D/8 inter-node AlltoAll and T*D intra-node AllGather.

  • For second AlltoAll in eager / graph mode, split T*D inter-node AlltoAll to T*D/8 intra-node ReduceScatter and T*D/8 inter-node AlltoAll.

  • For first AllGather in graph mode, split 2*T*D inter-node AlltoAll to 2*T*D/8 inter-node AllGather and 2*T*D intra-node AllGather.

Kernel Design

We implemented more than 10 custom kernels to support the MetaShuffling MoE inference design in different use cases running on both Nvidia H100 GPUs and AMD MI300X GPUs. We open sourced all computation kernels as PyTorch operators in the FBGEMM Generative AI Kernel Library. We hope they can help users efficiently serve Llama 4 models in their preferred framework and on their preferred accelerators, for example, vLLM / SGLang. In this blog, we focus on the 2 most interesting kernel designs that are key to improving inference performance: GroupedGEMM and IndexShuffling.

GroupedGEMM

We implemented Triton-based GroupedGEMM kernels for BF16 / FP16 / FP8 Rowwise.

Interface

def grouped_gemm_fp8_rowwise(
	x: torch.Tensor, 		# shape: [M, K]
	w: torch.Tensor, 		# shape: [G*N, K]
	m_sizes: torch.Tensor, 	# shape: [G]
	x_scales: torch.Tensor,	# shape: [M]
	w_scales: torch.Tensor, 	# shape: [G*N]
) -> torch.Tensor:               # shape: [M, N]
	...

The interface is quite similar to single GEMM in that it takes a single LHS, a single RHS tensor, and produces a single output. There is no dynamism or sparsity from the runtime point of view.

However, the kernel dynamically splits the M dimension of the LHS tensor using the data of `m_sizes` and statically splits the N dimension of the RHS tensor using the shape of `m_sizes`. This design has several advantages:

  • No additional padding or alignment requirement within different batches of Ms. So `m_sizes` can store any non-negative values as long as its total does not exceed `M`.
  • The `m_sizes` can be zero values to skip loading weights of unactivated experts.
  • The `m_sizes` can have a total sum less than `M` to skip computation on padded tokens at the end without extra overhead.
  • The `m_sizes`, or the splitting of the LHS activation, is known to the device but unknown to the host. So it supports dynamic routing information without incurring device-to-host synchronization. 
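
To pin down these semantics, here is a hedged eager-mode reference of the BF16 variant (a plain PyTorch loop, not the Triton kernel; unlike the real kernel, reading `m_sizes` on the host introduces a device-to-host sync):

import torch

def grouped_gemm_reference(x, w, m_sizes):
    # x: [M, K]; w: [G * N, K]; m_sizes: [G]; returns [M, N].
    G = m_sizes.numel()
    N = w.shape[0] // G
    out = x.new_zeros(x.shape[0], N)
    start = 0
    for g in range(G):
        m = int(m_sizes[g])            # dynamic split along M (device-side data in the real kernel)
        if m > 0:                      # zero-sized groups skip loading that expert's weights
            out[start:start + m] = x[start:start + m] @ w[g * N:(g + 1) * N].T
        start += m
    return out                         # rows past sum(m_sizes) stay zero (padded tokens)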

Workload Partition

We adopt the persistent kernel design to launch 1 CTA per SM and have all the CTAs running through all the partitioned tiles in an interleaved fashion. Conceptually, the workload partition happens as follows.


from typing import List

# BLOCK_M, BLOCK_N and NUM_SMS are compile-time constants of the kernel.
def partition_workload(G: int, Ms: List[int], N: int):
	partitions = []
	for g in range(G):
		for n in range(0, N, BLOCK_N):
			for m in range(0, Ms[g], BLOCK_M):
				partitions.append((g, m, n))
	# Round-robin the tiles across CTAs (one persistent CTA per SM).
	partitions_per_cta = [[] for _ in range(NUM_SMS)]
	for i, part in enumerate(partitions):
		partitions_per_cta[i % NUM_SMS].append(part)
	return partitions_per_cta

The partitions are dynamically calculated on the device side at runtime with a small overhead. However, by doing so, we can achieve:

  • Balanced workload across different SMs.
  • Small launching overhead as each SM will only launch 1 CTA.
  • High L2 cache hit rate. The order of workload partition makes sure the weights/activations will most likely be loaded once from HBM and cached on L2. Because usages of the same weight/activation tile will almost always happen concurrently / consecutively from different SMs.

Persistent Kernel with Warp Specialization

We adopted host-side tensor map-based loading of activations and weights, and optional device-side tensor map-based storing of outputs, to reduce memory transfer overhead on Hopper GPUs. With a contiguous storage format of activations, we can use a single host-side TMA (Tensor Memory Accelerator) descriptor to load activations and mask out the tokens that belong to other experts. However, we need to create multiple device-side TMA descriptors to store outputs without dynamic masking support.

We adopted a warp specialization-based kernel design so the kernel runs in a truly persistent fashion, with each SM switching between 3 warp groups (1 producer and 2 consumers). This design keeps the TMA engine, Tensor cores, and CUDA cores overlapping with each other, utilizing asynchronous TMA instructions and WGMMA (Asynchronous Warpgroup Level Matrix Multiply-Accumulate) instructions with memory barriers on shared memory. We received tremendous help from Meta's Triton compiler team to enable it. Hiding the prologue and epilogue is only possible with warp specialization, as the traditional software pipelining approach cannot handle complicated control flows with pointer chasing.

IndexShuffling

We implemented CUDA / HIP-based index shuffling kernels.

Interface

def index_shuffling(
	scores: torch.Tensor,			        # shape: [T, E]
):
	token_counts: torch.Tensor = ...		# shape: [E]
	expert_indices: torch.Tensor = ...	        # shape: [T]
	token_indices: torch.Tensor = ...		# shape: [T]
	return token_counts, expert_indices, token_indices

The kernel takes routing scores of all tokens on all experts, figures out the specific expert each token is routed to, reorders the token indices such that all the tokens routed to the same expert are placed contiguously, and returns:

  • `token_counts`: The number of tokens routed to each expert. It is fed into the GroupedGEMM kernel discussed above.
  • `expert_indices`: The expert index each shuffled token belongs to. It is fed into the GatherMul kernel discussed above.
  • `token_indices`: The original token index each shuffled token belongs to. It is fed into the GatherMul and ScatterAdd kernels discussed above.

Cooperative Kernel

We adopted the cooperative kernel design, and split the kernel into 2 major phases, top-k reduction phase and bucket sort phase, with a global synchronization in the middle.

  • 1. Load scores
    • It loads a tile of routing scores from global memory (HBM) to shared memory (SMEM) and stores associated expert indices along with it on SMEM.
  • 2. Reduction
    • Performs TopK reduction on SMEM across E dimension. For Llama 4 use cases, it performs ArgMax sorting as Top1 reduction, which includes a 2D parallel tree reduction on the scores and associated expert indices on SMEM. Between different tree reduction phases,
      • All threads will concurrently work on reductions of multiple tokens on SMEM.
      • Each thread will sequentially work on reductions of multiple tokens on SMEM.
  • 3. Counting & Store Buffers:  
    • It iterates all the tokens on the tile, getting the selected expert index from SMEM, storing it to the buffer (`buf_expert_index`) on HBM, and performs an `atomicAdd` operation on the output counter (`token_counts`) on HBM. 
    • The interesting part is, the `atomicAdd` operation will return the value previously on the memory location, which indicates the place of the token within the group, and we will store this value inside a buffer (`buf_local_token_index`) and use it to determine the global order among all the tokens.
  • Repeat 1-3 iteratively until all the tokens assigned to the CTA are processed.
  • 4. Global Synchronization: 
    • It performs an `atomicAdd` operation on the global counter on HBM. Afterwards, all CTAs will wait until the global counter reaches the number of total tokens, with a `st.release` + `ld.acquire` barrier guarding preceding store operations and following load operations to ensure correctness.
  • 5. Scan
    • It performs a simple load and prefix sum of `token_counts` and transforms it into `token_counts_cumsums` on SMEM.
  • 6. Load Buffer & Store Output
    • It iterates over all the tokens assigned to this CTA. For each token, it loads the expert index the token is assigned to from `buf_expert_index`, and then figures out the new token index after shuffling as a sum of 
      • The number of tokens before it that belong to previous experts, using the SMEM tensor `token_counts_cumsums`.
      • The number of tokens before it that belong to the same expert, using the HBM tensor `buf_local_token_index`.
    • Afterwards, it simply does a direct store on `expert_indices` and `token_indices` output at the new token index after shuffling.
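
The counting-plus-prefix-sum scheme can be emulated serially in plain PyTorch as follows (illustrative only; the real kernel parallelizes the counting across CTAs with `atomicAdd` and a global barrier):

import torch

def index_shuffling_reference(scores: torch.Tensor):
    T, E = scores.shape
    expert_index = scores.argmax(dim=-1)                    # steps 1-2: Top-1 reduction
    token_counts = torch.zeros(E, dtype=torch.int64)
    local_index = torch.empty(T, dtype=torch.int64)
    for t in range(T):                                      # step 3: counting; the running count
        e = int(expert_index[t])                            # plays the role of the value returned
        local_index[t] = token_counts[e]                    # by atomicAdd (place within the group)
        token_counts[e] += 1
    cumsums = torch.cumsum(token_counts, 0) - token_counts  # step 5: exclusive prefix sum
    expert_indices = torch.empty(T, dtype=torch.int64)
    token_indices = torch.empty(T, dtype=torch.int64)
    for t in range(T):                                      # step 6: store at the shuffled position
        pos = int(cumsums[expert_index[t]] + local_index[t])
        expert_indices[pos] = expert_index[t]
        token_indices[pos] = t
    return token_counts, expert_indices, token_indices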

Performance

Example Kernel Performance

Our setup used H100 80GB SXM5 HBM3 700W SKUs, Python 3.12, and CUDA 12.8. The theoretical peak HBM memory bandwidth of a single H100 is 3.35 TB/s.

GroupedGEMM

Prefill Performance

The following table shows the prefill performance of the kernel for Llama 4 Scout and Maverick single-host serving. The experiment setup assumes a total of 16,384 tokens and tensor parallelism sharding.

Precision G M N K Time (us) Compute (TFlops) Memory (GB/s)
BF16 16 1,024 2,048 5,120 523.85 655.90 1,088.90
BF16 16 1,024 5,120 1,024 294.95 582.46 1,251.39
BF16 128 128 2,048 5,120 975.41 352.26 2,992.82
BF16 128 128 5,120 1,024 510.78 336.35 3,021.86
FP8 16 1,024 2,048 5,120 286.89 1,197.64 1,111.10
FP8 16 1,024 5,120 1,024 182.41 941.83 1,471.62
FP8 128 128 2,048 5,120 517.16 664.40 2,887.28
FP8 128 128 5,120 1,024 290.25 591.90 2,947.93

Note: G indicates the number of groups. M indicates the number of tokens per group. N indicates the output feature dimension per group. K indicates the input feature dimension per group. FP8 indicates FP8 rowwise scaling (per-token scaling on activation and per-channel scaling on weight) with fast accumulation. Quantization kernels are not included in benchmarking. Scales are not included in memory bandwidth calculation. Benchmarked with rotating buffers and CUDAGraphs.

Decode Performance

The following table shows the decode performance of the kernel for Llama 4 Scout and Maverick single-host serving. The experiment setup assumes a total of 128 tokens and tensor parallelism sharding.

Precision G M N K Time (us) Compute (TFlops) Memory (GB/s)
BF16 16 8 2,048 5,120 112.54 23.85 2,997.82
BF16 16 8 5,120 1,024 60.00 22.37 2,822.50
BF16 128 1 2,048 5,120 861.22 3.12 3,119.07
BF16 128 1 5,120 1,024 433.15 3.10 3,102.26
FP8 16 8 2,048 5,120 59.81 44.88 2,824.60
FP8 16 8 5,120 1,024 34.86 38.50 2,447.64
FP8 128 1 2,048 5,120 440.53 6.09 3,049.44
FP8 128 1 5,120 1,024 225.14 5.96 2,987.15

IndexShuffling

The following table shows the performance of the kernel on Llama 4 Scout and Maverick single-host serving, comparing against native PyTorch implementations.

Num Tokens Num Experts IndexShuffling (us) Unfused Ops (us) Speedup
128 16 5.08 36.71 722.53%
128 128 8.95 34.36 384.05%
2048 16 7.06 57.10 808.51%
2048 128 13.53 69.84 516.18%
4096 16 7.42 68.96 929.98%
4096 128 18.89 87.46 463.09%
8192 16 9.26 123.94 1339.16%
8192 128 30.56 165.21 540.71%

Note: Benchmarked with rotating buffers and CUDAGraphs.

Example Trace Analysis

Llama 4 Scout BF16 Decode

Here is an example decoding trace of Llama 4 Scout BF16 with 64 tokens using our MetaShuffling MoE inference solution. 

  • The total memory traffic of MoE is (ignoring activations):
    • Router: 5120x16x2 = 163,840 Bytes
    • Shared Experts: (2048×5120 + 5120×1024)x2=31,457,280 Bytes
    • Routed Experts: 16x(2048×5120 + 5120×1024)x2=503,316,480 Bytes
    • Total combined: 163,840 + 31,457,280 + 503,316,480=534,937,600 Bytes
  • The total execution time of MoE is 197.456 us.
  • The memory bandwidth achieved is 534,937,600 / (197.456 * 10^-6) = 2,709,148,367,231 Bytes/s ≈ 2.71 TB/s, which is 80.90% of the theoretical peak HBM memory bandwidth of the H100 80GB SXM5 HBM3 (3.35 TB/s).

Here is a breakdown of different components of the trace.

First is the breakdown of routing and shared experts. Both components are running concurrently on 2 different streams to achieve better resource utilization.

For the router stream (marked with red boxes):

  • 1. Router GEMM: CuBLAS-based GEMM with a split-k design. It launches 2 kernels with the second kernel being the reduction kernel.
  • 2. Sigmoid (Router Activation): PyTorch native sigmoid.
  • 3. IndexShuffling: FBGEMM-based index shuffling with a cooperative kernel design. It can be viewed as a fusion of 3 operations, topk, bincount, and sort. It launches 2 kernels with the first kernel being the setup kernel.
  • 4. GatherMul: FBGEMM-based gather scaling. It can be viewed as a fusion of 3 operations: gather (tokens), gather (scores), and mul operations.

For the shared expert stream (marked with orange boxes):

  • 5. SharedExpert GEMM13: CuBLAS-based GEMM with a split-k design. It launches 2 kernels, with the second kernel being the reduction kernel.
  • 6. SwiGLU: Fused SwiGLU. It can be viewed as a fusion of 2 operations, sigmoid and mul.
  • 7. SharedExpert GEMM2: CuBLAS based GEMM.

Second is the breakdown of routed experts. This component is running exclusively on 1 stream to let the GroupedGEMM kernels take full ownership of all SMs.

For the routed expert stream (marked with red boxes):

  • 8. RoutedExperts GroupedGEMM13: FBGEMM-based GroupedGEMM with a persistent kernel design. 
  • 9. SwiGLU: Fused SwiGLU. As mentioned in 6.
  • 10. RoutedExperts GroupedGEMM2: FBGEMM-based GroupedGEMM with a persistent kernel design, fused with scatter add in the epilogue.

The decoding step is running on dense tensors with static shapes using CUDAGraph.

Llama 4 Maverick FP8 Prefill

Here is an example prefill trace of Llama 4 Maverick FP8 with 5000 tokens using our MetaShuffling MoE inference solution. Note that FP8 rowwise scaling is used for the routed experts, and BF16 for the router and shared experts.

Compared to the decode trace:

  • It uses a single stream to avoid interactions of kernels between router and shared experts. As the kernels are working on a large enough problem size that can saturate compute resources, having additional overlapping simply causes contentions, especially on L2 cache.
  • It runs on dense tensors with static shapes, but in eager mode. As the kernel execution time is large enough and there is no device/host synchronization, the kernels can be launched back to back without bubbles.

Here we highlight the kernel differences between these two traces, aside from execution time.

  • Router GEMM and SharedExpertGEMM13: CuBLAS-based GEMM without using split-k design. So it launches 1 kernel instead of 2.

  • 4. GatherMul (FP8 Rowwise Quantize): FBGEMM-based gather scaling and quantization. It can be viewed as a fusion of 8 operations: gather (tokens), gather (scores), mul, max, divide, mul, clamp, and typecast.
  • 9. SwiGLU (FP8 Rowwise Quantize): Fused SwiGLU and quantization. It can be viewed as a fusion of 7 operations: sigmoid and mul, max, divide, mul, clamp, and typecast.

Takeaway

We take the following steps progressively to optimize the inference performance of our MoE solution:

    • Improve device-level utilization by avoiding host and device synchronization.
    • Reduce wasted resources by removing paddings or avoiding processing paddings.
    • Reduce kernel launch and I/O overhead by aggressive kernel fusion.
    • Improve computation and memory efficiency by various kernel optimizations, pushing performance towards hardware limits.
    • Improve hardware component level utilization by concurrent execution of computation, memory traffic, or network traffic heavy kernels, but avoiding undesirable contention at the same time.

Single Host Serving

We benchmarked the single-host serving performance of Llama 4 Maverick and Llama 4 Scout with our internal MetaShuffling-based MoE inference stack using 1000 requests with random prompts. To compare against openly available data from vLLM and SGLang, we adopted the same experiment setup (i.e., Maverick with FP8, Scout with BF16, on an 8xH100 host with a maximum batch size of 64). Our setup used H100 80GB SXM5 HBM3 700W SKUs, Python 3.12, and CUDA 12.8. We open sourced all computation kernels used in the MetaShuffling MoE inference stack on FBGEMM, along with an example implementation of MetaShuffling as a reference.

To preserve the best accuracy, we benchmarked Llama 4 Maverick with FP8 precision on routed experts and BF16 precision on attention linear, attention, shared experts, router, and KV cache.

To match external benchmark numbers, we benchmarked Llama 4 Scout with BF16 precision on all linear layers (attention linear, shared experts, router, and routed experts), attention, and KV cache.

Disclaimer: Here we use datapoints released from official channels as a reference. However, as all inference frameworks are rapidly evolving, they might already be outdated at the time of publication. We hope the community can continuously break records in improving the efficiency of serving Llama 4 models.

Acknowledgement

We would like to thank Jing Zhang, Ying Zhang, and Manman Ren for providing technical review and guidance.

We would also like to thank Bradley Davis, Yan Cui, Rengan Xu, Josh Fromm, Jiawen Liu, Sarunya Pumma, Jie Wang, Xinfeng Xie, Benson Ma, Michael Shu, Bingzhe Liu, Jingyi Yang, Min Si, Pavan Balaji, and Dhruva Kaushal for their contributions to this project.

Read More

PyTorch Foundation at MLSys 2025

PyTorch Foundation at MLSys 2025: Supporting the Future of Machine Learning Systems

The PyTorch Foundation is proud to support MLSys 2025 as a Gold Sponsor. Held May 12–15 in Santa Clara, CA, this premier conference sits at the intersection of machine learning and systems, bringing together researchers, engineers, and practitioners pushing the boundaries of scalable AI infrastructure.

📍Visit the PyTorch Booth
Stop by to connect with the PyTorch Foundation team, including Executive Director Matt White, and contributors from across the ecosystem. Learn more about PyTorch Foundation’s recent expansion into an umbrella foundation and the acceptance of two leading open source AI projects—vLLM and DeepSpeed.

🎤Featured Sessions from the PyTorch Ecosystem

Extreme PyTorch: Inside the Most Demanding ML Workloads—and the Open Challenges in Building AI Agents to Democratize Them
Speaker: Soumith Chintala
Monday, May 12 | 9:30–10:30 a.m. PT | Mission City Ballroom

In this talk, Soumith Chintala will explore how cutting-edge users are pushing PyTorch to its limits, from planetary-scale training on interactive supercomputers to ultra-efficient, real-time inference on exotic hardware. These power users offer a unique window into today’s most demanding ML systems challenges. Chintala will also examine a bold idea that’s top of mind at this conference: using AI agents to automate a large portion of the work these users currently perform. He will outline the open challenges in building such agents and share concrete opportunities for open collaboration toward making SysML AI agents a reality.

An AI Stack: From Scaling AI Workloads to Evaluating LLMs
Speaker: Ion Stoica
Tuesday, May 13 | 10:30–11:30 a.m. PT | Mission City Ballroom

Ion Stoica will discuss how large language models (LLMs) are enabling new applications, intensifying GPU shortages, and raising concerns about output accuracy. He will present several projects developed to address these challenges, focusing on: (i) Ray, a distributed framework for scaling AI workloads; (ii) vLLM and SGLang, two high-throughput inference engines for LLMs; and (iii) Chatbot Arena, a platform for accurate LLM benchmarking. The session will conclude with key lessons learned and directions for future research.

⚡Additional Highlight
PyTorch Foundation Executive Director Matt White will also deliver a lightning talk during a PhD-focused session at the conference on the value of open source AI and the mission and value of the PyTorch Foundation.

We look forward to an engaging week of learning, collaboration, and technical exchange with the systems and ML research communities.

🔗 Learn more and register at mlsys.org

Read More

Introducing the PyTorch Ambassador Program: A Global Network of Community Leaders

PyTorch Ambassador Program Launch

The PyTorch Foundation is proud to launch the PyTorch Ambassador Program, an initiative that recognizes and supports individuals who are passionate about building, educating, and advocating for PyTorch in impactful ways.

From organizing local events to mentoring new users, creating technical tutorials, and speaking at global conferences, PyTorch Ambassadors play a critical role in growing and supporting the global PyTorch ecosystem. The first official cohort of Ambassadors will launch in June 2025, with nominations open from May 7 to June 7, 2025.

About the Program

The PyTorch Ambassador Program highlights independent, trusted voices in the PyTorch community. These leaders help others get started with PyTorch, contribute to the ecosystem, and promote its use across industry, academia, and research.

The program is designed to:

  • Support local and regional PyTorch communities
  • Recognize technical contributions and thought leadership
  • Enable global knowledge sharing and collaboration

What PyTorch Ambassadors Do

Ambassadors are active contributors who:

  • Organize PyTorch-focused events, both virtual and in-person
  • Create technical tutorials, blog posts, and videos
  • Mentor new users and encourage inclusive participation
  • Represent PyTorch at conferences, meetups, and academic institutions

Ambassadors are expected to participate in at least one of these focus areas on a regular basis and commit to a one-year term.

Program Benefits

The Ambassador Program provides a range of resources and opportunities to help community leaders make a lasting impact:

Recognition and Visibility

  • Official designation as a PyTorch Ambassador
  • Featured profile on the PyTorch Foundation website
  • Promotion through PyTorch social media and communications channels

Exclusive Access

  • Private collaboration channels with fellow Ambassadors and Foundation staff
  • Invitations to briefings, workshops, and leadership training
  • Event planning toolkits and templates

Community and Event Support

  • Reimbursement for approved community activities and travel
  • Complimentary admission to PyTorch Conference
  • PyTorch-branded materials and Ambassador kits

Professional Development

  • Opportunities to speak at industry and Foundation events
  • Recognition for top contributors
  • Networking with machine learning leaders across the globe

Nomination Process

Nominations are open now through June 7, 2025. Individuals can nominate themselves or someone else. All applications will be reviewed by the PyTorch Foundation team, and selected Ambassadors will be invited to participate in onboarding and training sessions beginning in June.

To apply, visit the PyTorch Ambassador Program Application Page and click on the button that says Learn More and Apply.

Eligibility and Selection

To be eligible, nominees must:

  • Be at least 18 years old
  • Sign the PyTorch Ambassador Agreement and NDA
  • Follow the PyTorch Foundation Code of Conduct and Linux Foundation Antitrust Policy
  • Demonstrate technical knowledge of PyTorch through open source contributions, published content, or community leadership
  • Commit to participating for a one-year term

Ambassador nominations will be evaluated on the following criteria:

  • Community impact and engagement
  • Technical expertise and thought leadership
  • Consistent activity within the PyTorch ecosystem
  • Commitment to openness, inclusion, and collaboration

A Global Community

The PyTorch Foundation is seeking Ambassadors from all regions to build a globally representative program. Nominees will be asked to share their location to help identify opportunities for regional engagement and support. 

The inaugural cohort of PyTorch Ambassadors will be announced in June 2025. Their stories, events, and contributions will be featured on the PyTorch Foundation website and shared across community channels.

The PyTorch Ambassador Program is an exciting new chapter in our community’s growth. We invite you to join us in building an even more connected, inclusive, and global ecosystem.

Read More

PyTorch Foundation Launches New Website

Welcome to the new PyTorch Foundation website! We’re delighted to introduce you to our new PyTorch Foundation website, which serves as a fresh and centralized hub for information on open source AI and the PyTorch Foundation. As the community and ecosystem around PyTorch continue to grow, this new platform is designed to serve as the cornerstone for our global community to converge around PyTorch, fostering open collaboration and innovation in AI.

Why a New Website?

To better support the dynamic and diverse PyTorch community, we recognized the need for a dedicated space to:

  • Highlight the Foundation’s Vision: Share our mission and initiatives as a neutral steward of the PyTorch ecosystem.
  • Provide Centralized Resources: Offer easy access to news, events, governance updates, and community resources.
  • Engage the Community: Create pathways for collaboration and contribution, including updates on technical topics, ecosystem tools, events, etc.
  • Celebrate Impact: Showcase the incredible innovations powered by PyTorch across research, industry, and beyond.
  • Improve Navigability: Enable you to find what you need quickly and efficiently.
  • Refresh our Look: Update our site to reflect a more modern and sleek design.

What You’ll Find on the New Website

The new site is designed to be both intuitive and comprehensive. Here’s a breakdown of the top-level navigation:

About

Learn about the organization’s mission, members, and impact, meet the leaders guiding the PyTorch Foundation’s direction, and reach out with inquiries or collaboration opportunities.

Learn

Find step-by-step guides to help you begin your journey with PyTorch, get hands-on tutorials to build practical skills and understanding, and get up to speed on how to get started with the PyTorch Foundation.

Community

Contribute to and collaborate within the growing PyTorch ecosystem. Explore tools and libraries that extend PyTorch functionality and connect with fellow developers, researchers, and users.

Projects

The PyTorch Foundation will host a range of diverse, high-quality AI and ML projects beyond PyTorch Core and these projects will be listed here. 

Docs

Find core documentation for the PyTorch framework and explore domain-specific applications and resources. Note: We moved pytorch.org/docs to docs.pytorch.org and implemented redirects.

Blog & News

Get the latest news from the PyTorch Foundation, including technical deep dives on the blog. Stay up-to-date with information about upcoming conferences, meetups, and webinars, and subscribe to the PyTorch Foundation newsletter.

Join PyTorch Foundation

For organizations interested in becoming a member of the PyTorch Foundation, find out how to get involved and support the growth of the PyTorch Foundation.

This intuitive navigation ensures you can quickly find the resources and information you need on our website.

Designed with You in Mind

Whether you’re a researcher, developer, educator, or industry professional, the website is built to meet your needs. We’ve focused on simplicity, accessibility, and discoverability to make it easier than ever to:

  • Find the latest updates about the PyTorch Foundation’s work.
  • Navigate key resources to enhance your AI and ML projects.
  • Connect with a vibrant and collaborative community.

We’re eager to hear your feedback—what works, what doesn’t, and what you’d love to see next. Contact us to share your feedback.

Check It Out Today

We invite you to explore the new site at pytorch.org and use it as your go-to resource for all things PyTorch Foundation.

Read More

PyTorch Foundation Welcomes DeepSpeed as a Hosted Project

PyTorch Foundation Welcomes DeepSpeed

The PyTorch Foundation is excited to welcome DeepSpeed, a deep learning optimization library, as a PyTorch Foundation-hosted project. Contributed by Microsoft, DeepSpeed empowers developers to streamline distributed training and inference, making it easier to scale AI models efficiently while minimizing costs and operational complexity. Since inception, DeepSpeed has leveraged core PyTorch functionalities as the foundation for building deep learning features and optimizations.

The PyTorch Foundation recently announced its expansion to an umbrella foundation to accelerate AI innovation and is pleased to welcome DeepSpeed as one of the first new projects. Foundation-Hosted Projects are projects that fall under the umbrella; they are officially governed and administered under the PyTorch Foundation’s neutral and transparent governance model.

What is DeepSpeed?

DeepSpeed is designed to optimize deep learning workflows, providing a robust set of features that enhance the performance, scalability, and cost-efficiency of AI model training and deployment. It enables seamless scaling across thousands of GPUs while also optimizing resource utilization for constrained systems, addressing key technical challenges in AI development.

Key features of DeepSpeed include:

  • Scalable Model Training: Supports dense and sparse Mixture-of-Experts (MoE) models with billions or trillions of parameters, scaling seamlessly across thousands of GPUs.
  • Heterogeneous Hardware Support: Offers compatibility with diverse hardware platforms, including Nvidia, AMD, and Intel GPUs, Huawei Ascend NPU, and Intel Gaudi, ensuring flexibility and adaptability in deployment.
  • Optimized Resource Use: Facilitates training and inference on systems with limited GPU capacity, maximizing hardware efficiency and increasing accessibility.
  • Low-Latency Inference: Achieves minimal latency and high throughput for real-time model inference.
  • Compression Capabilities: Reduces model size and inference latency, lowering costs for large-scale deployments without sacrificing performance.

Accelerating Open Source AI Together

DeepSpeed’s addition as a PyTorch Foundation-hosted project strengthens the foundation’s mission to accelerate open source AI. By joining the PyTorch Foundation, DeepSpeed gains access to a thriving ecosystem of open source projects, a global network of contributors, and robust technical and operational resources. This collaboration enables the DeepSpeed community to scale its efforts, enhance interoperability with other projects, and drive broader adoption of its optimization library. Additionally, the PyTorch Foundation’s focus on open governance and community-driven development ensures that DeepSpeed’s growth aligns with the shared goals of transparency, inclusivity, and innovation in the AI space.

Learn more about DeepSpeed and how to get involved by visiting the DeepSpeed website.

Read More

PyTorch: The Open Language of AI

Key takeaways:

  • PyTorch today powers the generative AI world with major AI players like Meta, OpenAI, Microsoft, Amazon, Apple and many others building cutting edge AI systems.
  • PyTorch has evolved from a framework focused on AI research to supporting production, deep AI compilation and has become foundational to thousands of projects and companies in the AI ecosystem.
  • The PyTorch Foundation is expanding to be an umbrella organization and will now house some of the most popular and highly complementary projects making it easier for users to build AI at scale.
  • Overall, the PyTorch Foundation is uniquely positioned to support the AI transformation throughout the stack, from accelerating compute to supporting the next wave of agentic systems, from research to production.

When we look back at the early days of PyTorch, our main focus was initially on accelerated training and developer experience for AI researchers. We wanted to empower researchers to easily express their ideas (no matter how crazy they were) and accelerate training, enabling them to quickly validate those ideas. This evolved to be broader when we established PyTorch 1.0, brought in Caffe2, and expanded the mission to become ‘research to production’. With PyTorch 2.0, the scope and vision yet again expanded to include a major focus on performance, including an expansion of our compiler investments and heterogeneous hardware support, which has led to torch.compile, TorchInductor and investment in the Triton project. Throughout all of this, we maintained a design philosophy that values: (1) Usability over performance; (2) Simple over easy; and (3) Python first with a focus on language interoperability.

Moreover, when we first put the 3 year vision together for PyTorch back in 2020, the goals we set were: 

  1. Industry leading: Winning across research and production with a partner ecosystem that is strategically and commercially aligned and collaborating with us toward this vision; 
  2. Diverse: A global community from academia and industry contributing to an ecosystem made up of projects, platforms and research built on or around PyTorch and that continually pushes the field forward; and 
  3. Sustainable: Maintains its diversity of major contributors and productivity over a long period (3 years+) and can survive inherent changes such as new technologies or new products (e.g., from competitors) that can change the population (the community of users, developers etc).

With the PyTorch Foundation joining the Linux Foundation in 2022, this set the stage for the next phase of growth for the project. If we fast forward to today, the foundation is rapidly growing with 13 Premier members and a total of 30 member organizations, more diverse contributors than ever and a growing ecosystem. The PyTorch Foundation is well-positioned to continue to play a leadership role in the rapidly evolving field of AI. 

All of that said, we yet again have a major opportunity to evolve PyTorch to play a much more integral role in setting the direction of open source AI and the industry at large.

Challenges in the AI Space Today

Over the past two years, the AI landscape has undergone a remarkable transformation. Large Language Models (LLMs) have moved to the forefront, powering applications like ChatGPT and driving an open revolution of models spearheaded by Llama. Now, we’re witnessing agentic systems entering the mainstream. Despite these advances, significant challenges persist as we transition into a generative AI and agent-first world. To better understand the nature of these challenges, we should consider several key questions:

  1. How do we optimize and maximize the creation of intelligence per a given amount of power?
  2. How can we democratize the continual improvement, customization, and adaptation of intelligence?
  3. How can additional capabilities outside of models, such as tools and environments, be accelerated in the way that we’ve optimized other systems?
  4. And lastly, how do we effectively measure intelligence such that it aligns with what we want as end users?

These challenging questions require a collective community working towards common goals to address them effectively. By bringing together diverse perspectives, we can build a comprehensive framework that integrates all layers of AI development—from hardware acceleration primitives to sophisticated agentic development, evaluation, and deployment practices. What we’re envisioning transcends typical technological initiatives; it’s more akin to developing a new language or operating system—a foundational infrastructure that enables entirely new possibilities.

A Broader Vision

One way to frame a broader vision for PyTorch is for it to be “The Open Language of AI”. 

Modulo some wordsmithing, it feels like we should consider adding an additional goal for PyTorch:

Viewed as a foundational operating system: PyTorch powers the foundation of AI systems throughout the industry

As the depth of the AI stack expands, the reach of the PyTorch Foundation is also set to grow and expand. The PyTorch Foundation has thus just evolved into an umbrella foundation that expands the scope and impact of PyTorch well beyond its traditional roots as an AI framework, and allows the foundation to host high value projects in the AI landscape.

In welcoming new projects to the PyTorch Foundation, we will look to uphold the same design principles that have guided PyTorch to this day.

Starting with vLLM and DeepSpeed, we are going to bring together some of the most innovative communities in AI development and, in the process, build a broader and more comprehensive stack for AI developers.

Look for more project announcements coming very soon!

Near-Term Focus

As we move forward here in 2025, the core PyTorch project continues to make progress across a number of areas. Some of the high-level themes include:

  1. Inference: We are investing in cleaning up and clarifying our APIs/runtimes across non-LLMs, LLMs, and edge/local deployment:
    1. vLLM, as well as SGLang (part of the PyTorch Ecosystem), for server LLMs
    2. ExecuTorch will be the umbrella API for non-server deployment 
    3. Non-LLM serving – we will deprecate TorchServe and promote the ecosystem solutions around us 
  2. Post training: We will be unifying our post training efforts to support end-to-end PyTorch native post training with online/async RL. 
  3. Large scale training: We are working on large scale disaggregated semi/async fault tolerant training with a focus on enabling high levels of parallelism and incorporating online RL. 
  4. Compiler: We will be doubling down on improvements to torch.compile and its integrations with vLLM and SGLang, improving the developer experience of Triton, and deepening integration with training frameworks (e.g., torchtitan). 
  5. Model compression: Low-precision model support via TorchAO and integration with vLLM and SGLang (a quantization sketch follows this list). 
  6. Edge/local deployment: Increasing our investment in ExecuTorch, bringing it closer to core and expanding its scope to support AI PCs, MacBooks, and edge devices, as well as supporting projects like Ollama. 
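
To make the model-compression item above a bit more concrete, here is a minimal sketch of weight-only int8 quantization with TorchAO. It is illustrative only: the toy model is a placeholder, and API names and hardware support may vary across torchao releases.

import torch
from torchao.quantization import quantize_, int8_weight_only  # assumes a recent torchao release

# Toy placeholder model; in practice this would be a full PyTorch model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).eval()

# Swap Linear weights to int8 in place; activations stay in the original dtype.
quantize_(model, int8_weight_only())

with torch.no_grad():
    out = model(torch.randn(1, 512))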

You can dig into the published roadmaps here for more details. 

Where We Go From Here…

With the PyTorch Foundation now an umbrella foundation, we are focused on bringing in high-quality and complementary projects that extend the scope and vision of PyTorch. You will see projects announced over the weeks and months joining the foundation that will strengthen the PyTorch vision and create a rich portfolio of AI projects that integrate well and create a frictionless user experience. Our focus is on creating a trusted landscape of projects that are open source, have demonstrable value to the community and solve problems in the AI lifecycle. 

Additionally, our community continues to grow rapidly, with three PyTorch Days announced for 2025 in Asia, Europe, and India, and our keystone event, the PyTorch Conference, coming October 22 and 23 during Open Source AI Week 2025. The conference also adds an extra day this year, with the Startup Showcase, Measuring Intelligence Summit, AI Infra Summit, hackathons, co-located open source AI events, and networking opportunities. We’re excited about this awesome celebration of open source AI innovation and hope you’ll join us!

Cheers,
Joe & Luca

PyTorch Foundation Expands to Umbrella Foundation and Welcomes vLLM and DeepSpeed Projects

PyTorch Foundation Expands to Accelerate AI Innovation & Welcomes vLLM and DeepSpeed Projects

Expanded Foundation will Provide a Trusted and Vendor-Neutral Home for High-Impact and Innovative Open Source AI Projects

PyTorch Day France, Paris, France – May 7, 2025 – The PyTorch Foundation, a community-driven hub for open source AI, today announced its expansion into an umbrella foundation. As part of this milestone, two leading open source AI projects—vLLM and DeepSpeed—have been accepted into the foundation by the Technical Advisory Council. This expansion positions the PyTorch Foundation as the trusted home for a broad range of community-driven AI projects spanning the entire AI lifecycle—from training and inference to domain-specific applications and agentic frameworks.

As artificial intelligence becomes a critical driver of global innovation and competitive advantage, enterprises are under pressure to adopt scalable, secure, and future-focused AI solutions. With global GenAI spending forecasted to hit $644 billion this year, and demand growing for open source AI alternatives, the PyTorch Foundation is positioning itself as a vendor-neutral home for trusted and innovative new AI projects. The foundation will support the development of the next generation of open source AI tooling, ensuring interoperability, reducing vendor lock-in, and enabling faster integration of trusted, production-grade technologies. With transparent governance and broad industry collaboration, the PyTorch Foundation is playing a crucial role in shaping the infrastructure enterprises rely on to build and deploy responsible AI at scale.

“This is an exciting new chapter for the PyTorch Foundation and the broader open source AI ecosystem,” said Matt White, Executive Director of the PyTorch Foundation. “By transitioning to an umbrella foundation, we’re not only formalizing the momentum we’ve built across the PyTorch ecosystem—we’re creating space for new projects and innovators to thrive within a vendor-neutral, open governance environment.”

The decision to expand to an umbrella foundation is a natural evolution of the PyTorch Foundation’s rapid growth and global momentum. In just two and a half years, the organization has grown to include over 30 member companies and 120 vibrant ecosystem projects, and PyTorch itself has become the preferred framework for AI research and deployment. The new umbrella structure will support a broader portfolio of high-impact projects, foster deeper collaboration across domains, and help scale innovation throughout the AI lifecycle.

The PyTorch Foundation’s expanded scope allows it to host two new categories of projects:

  • Platform Projects – Solutions that support multiple stages of the AI lifecycle, including training, inference, model optimization, deployment, and agentic systems.
  • Vertical Projects – Tools tailored for specific industries and applications, such as bioinformatics, geospatial intelligence, and protein folding.

Projects accepted under the PyTorch Foundation benefit from neutral IP governance, strategic support, increased visibility, and a global community of contributors. The PyTorch Foundation distinguishes between ecosystem projects, which remain independently governed, and foundation-hosted projects, which adopt the foundation’s open governance model and receive comprehensive operational support.

The first two projects accepted into the PyTorch Foundation as hosted projects are:

  • vLLM – An open and efficient inference engine for large language models. vLLM enables high-throughput, low-latency LLM serving through optimized memory management and scheduling techniques, including PagedAttention. It supports popular model architectures and is designed to maximize hardware utilization, making LLM inference more scalable and cost-effective across a range of deployments. Learn more about the contribution of vLLM to the PyTorch Foundation here.
  • DeepSpeed – A distributed training library that simplifies scaling AI workloads. DeepSpeed provides a suite of optimization techniques—such as ZeRO (Zero Redundancy Optimizer), 3D parallelism, and inference acceleration—to enable efficient training of extremely large models. It is used extensively in both academic research and production environments to push the limits of model size, speed, and efficiency (a minimal usage sketch follows this list). Learn more about the contribution of DeepSpeed to the PyTorch Foundation here.
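
For readers unfamiliar with DeepSpeed, the following minimal ZeRO stage-2 training sketch illustrates the kind of workload it optimizes. It is only a sketch: the model and configuration values are placeholders rather than recommended settings, and real jobs are typically launched with the deepspeed launcher across multiple GPUs.

import torch
import deepspeed  # assumes the deepspeed package is installed

# Placeholder model and config; values are illustrative only.
model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
}

# deepspeed.initialize wraps the model in an engine that handles ZeRO
# partitioning, gradient accumulation, and optimizer stepping.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

inputs = torch.randn(8, 1024).to(model_engine.device)
loss = model_engine(inputs).pow(2).mean()
model_engine.backward(loss)  # ZeRO-aware backward pass
model_engine.step()          # optimizer step and gradient clearing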

The PyTorch Foundation is committed to fostering the growth and adoption of open source AI solutions and tooling. Communities interested in joining the PyTorch Foundation’s expanding ecosystem can learn more about the process for becoming a project here.

Supporting Quotes

“AMD has been a consistent supporter of open source software and the community of open source AI projects. We are excited about this expansion of the PyTorch Foundation, which provides a great opportunity for important AI projects to mature in an open and vendor-neutral ecosystem.”

– Ramine Roane, Corporate Vice President of AI Product Management, AMD

“At Arm, we believe collaboration is essential to empower developers and accelerate AI innovation from cloud to edge. The expansion of the PyTorch Foundation is a major milestone for the open source AI community—by providing a trusted home for projects like vLLM and DeepSpeed, the PyTorch Foundation is helping to unlock scalable, efficient AI and we’re proud to support this important evolution.”

– Alex Spinelli, Senior Vice President, AI and Developer Platforms and Services, Arm

“Open source frameworks are essential to advancing AI development, which is why AWS has been committed to the long-term success of the PyTorch ecosystem since its early days and through our continued support of the PyTorch Foundation. Expanding to an umbrella foundation highlights the rapid growth of this community and will make it easier to support a broader portfolio of high-impact projects, foster deeper collaboration across domains, and help scale innovation throughout the AI lifecycle.”

– Brian Granger, Senior Principal Technologist of AI Platforms, Amazon Web Services

“DeepSpeed is delighted to become a hosted Platform project in the PyTorch Foundation. From inception, DeepSpeed has built on PyTorch, with critical dependencies on features such as Module, Tensor, Distributed, and Compiler. We are eager to leverage this closer integration with the PyTorch ecosystem to achieve our goal of providing open and democratized access to state-of-the-art AI technologies for all.”

– Olatunji Ruwase, Project Lead, DeepSpeed

“Google congratulates the PyTorch Foundation on its expansion into an umbrella foundation. This evolution is poised to not only champion important open-source AI projects like vLLM and DeepSpeed, but is also a significant step forward in cultivating deeper collaboration and driving innovation within the AI community. We look forward to continuing to collaborate with the foundation and contributing to the expanded ecosystem.”

– Joe Pamer, Senior Director, ML Frameworks and Compilers, Google

“As a significant contributor to vLLM, DeepSpeed and PyTorch, Huawei welcomes their move to the foundation. We believe the professional services offered under the umbrella model will foster continued growth and value for users and developers.”

– Li Yongle, General Manager of Open Source Development, Huawei’s Computing Product Line

“Super excited for vLLM and DeepSpeed to join the PyTorch Foundation as it becomes an umbrella foundation. These packages are essential tools in the deep learning stack and integrate seamlessly with PyTorch. This is a strategic move that ensures future growth and maintenance for them.”

– Lysandre Debut, Chief Open-Source Officer, Hugging Face

“As a pivotal member of the PyTorch community for years, IBM applauds the expansion of the PyTorch Foundation to an umbrella foundation. This shift provides opportunities to support projects such as vLLM and others across the entire AI model lifecycle, from training to tuning to inference. An umbrella organization structure will support new workstreams underpinned by essential AI governance principles, accelerating performance in a new era of open, responsible AI.”

– Sriram Raghavan, VP, IBM Research AI

“As a premier member of the PyTorch Foundation, Intel is excited about the foundation’s expansion into an umbrella model. This strategy empowers developers with essential resources and support, enabling them to create innovative, community-driven AI projects that tackle real-world challenges.”

– Kismat Singh, VP, Engineering for AI Frameworks, Intel Corporation

“PyTorch sits at the very core of AI today. Meanwhile, the depth of the AI stack has grown dramatically—evolving from enabling accelerated compute to powering fully autonomous systems. Broadening the PyTorch Foundation is a key step in keeping the AI revolution open and accessible to all, across the stack and aligned with the principles PyTorch was built on.”

– Luca Antiga, CTO, Lightning AI

“Today PyTorch plays such a fundamental role in the AI space underpinning Llama, ChatGPT and so many other influential projects. This move to create an umbrella foundation enables PyTorch to significantly expand its ecosystem both horizontally and vertically in this era of agentic systems. I really believe this will usher in a new wave of innovation, and I’m especially excited about vLLM and DeepSpeed joining. These projects have a strong history of being critical to AI’s advances and it’s exciting that we are joining forces to grow this amazing community!”

– Joe Spisak, Product Director for PyTorch, Meta

“The PyTorch Foundation plays a vital role in advancing the PyTorch ecosystem by driving innovation, supporting education, and fostering community collaboration. Its expansion to an umbrella foundation helps ensure the long-term success of open source tools and creates the conditions necessary to welcome new projects that are essential to the future of open source AI.”

– Ujval Kapasi, VP of Deep Learning Software, NVIDIA

“At Snowflake, we believe that empowering the AI community is fundamental, and strengthening the open, vendor-neutral foundation around pivotal projects like these is crucial for progress. It’s truly exciting to witness the PyTorch Foundation evolve into an umbrella organization and welcome essential projects like DeepSpeed and vLLM. Having been part of the PyTorch ecosystem, I deeply appreciate the significance of this strategic move. We eagerly anticipate the accelerated innovation this closer collaboration within the PyTorch Foundation will bring.”

– Dwarak Rajagopal, VP of AI Engineering and Research, Snowflake

“We’re excited that vLLM is one of the first Platform Projects joining the PyTorch Foundation. vLLM is built on top of PyTorch with deep integration such as Torch Compile and multi-hardware support. We look forward to further collaborating with the ecosystem that powers innovations in open source and vendor-neutral technologies for AI.”

– Simon Mo, Project Co-Lead, vLLM

###

About the PyTorch Foundation

The PyTorch Foundation is a community-driven hub supporting the open source PyTorch framework and a broader portfolio of innovative open source AI projects. Hosted by the Linux Foundation, the PyTorch Foundation provides a vendor-neutral, trusted home for collaboration across the AI lifecycle—from model training and inference, to domain-specific applications. Through open governance, strategic support, and a global contributor community, the PyTorch Foundation empowers developers, researchers, and enterprises to build and deploy responsible AI at scale. Learn more at https://pytorch.org/foundation

The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see our trademark usage page. Linux is a registered trademark of Linus Torvalds.

Media Contact
Natasha Woods
The Linux Foundation
nwoods@linuxfoundation.org 

PyTorch Foundation Welcomes vLLM as a Hosted Project

PyTorch Foundation Welcomes vLLM

The PyTorch Foundation is excited to welcome vLLM as a PyTorch Foundation-hosted project. Contributed by the University of California, Berkeley, vLLM is a high-throughput, memory-efficient inference and serving engine designed for LLMs. vLLM has always had a strong connection with the PyTorch project. It is deeply integrated into PyTorch, leveraging it as a unified interface to support a wide array of hardware backends. These include NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms. 

The PyTorch Foundation recently announced its expansion to an umbrella foundation to accelerate AI innovation and is pleased to welcome vLLM as one of the first new projects. Foundation-hosted projects fall under the umbrella and are officially governed and administered under the PyTorch Foundation’s neutral and transparent governance model. 

What is vLLM?

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in. Originally built around the innovative PagedAttention algorithm, vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.

Since its release, vLLM has garnered significant attention, achieving over 46,500 GitHub stars and over 1000 contributors—a testament to its popularity and thriving community. This milestone marks an exciting chapter for vLLM as we continue to empower developers and researchers with cutting-edge tools for efficient and scalable AI deployment. Welcome to the next era of LLM inference!
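
For readers new to vLLM, a minimal offline-inference sketch looks like the following; the model name here is just a small placeholder, and the example assumes the vllm package and model weights are available.

from vllm import LLM, SamplingParams  # assumes vllm is installed

# Small placeholder model; any supported checkpoint works here.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The PyTorch Foundation is"], params)
for output in outputs:
    print(output.outputs[0].text)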

Key features of vLLM include:

  • Extensive Model Support: Powers 100+ LLM architectures with multi-modal capabilities for image and video, while supporting specialized architectures like sparse attention, Mamba, BERT, Whisper, embedding, and classification models.
  • Comprehensive Hardware Compatibility: Runs on NVIDIA GPUs through Blackwell, with official support for AMD, Google TPU, AWS Neuron, Intel CPU/XPU/HPU, and ARM. Third-party accelerators like IBM Spyre and Huawei Ascend easily integrate via our plugin system.
  • Highly Extensible: Enables custom model implementations, hardware plugins, torch.compile optimizations, and configurable scheduling policies to match your specific needs.
  • Optimized for Response Speed: Delivers minimal latency through speculative decoding, quantization, prefix caching, and CUDA graph acceleration.
  • Engineered for Maximum Throughput: Achieves peak performance with tensor/pipeline parallelism and specialized kernels.
  • Seamless RLHF Integration: Provides first-class support for reinforcement learning from human feedback and common post training frameworks.
  • Enterprise-Scale Distributed Inference: Enables cluster-wide scaling through KV cache offloading, intelligent routing, and prefill-decode disaggregation.
  • Production-Hardened: Delivers enterprise-grade security, comprehensive observability, and battle-tested operational reliability.

Accelerating Open Source AI Together

By becoming a PyTorch Foundation project, vLLM will collaborate with the PyTorch team closely on feature development. For example: 

  • vLLM will make sure its code runs on PyTorch nightly builds, and the PyTorch team will monitor the results to ensure all tests pass. 
  • The PyTorch team is enhancing torch.compile and FlexAttention support for vLLM (see the sketch after this list).
  • Close collaboration with and support for native libraries such as TorchTune, TorchAO, and FBGEMM. 
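
As a rough illustration of the torch.compile integration point mentioned above, the snippet below compiles a toy module. vLLM’s real integration compiles its model forward passes inside the serving engine, so this is only a sketch of the mechanism, not the actual integration code.

import torch

# Toy module standing in for a served model (illustrative only).
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, 128),
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP()
compiled_model = torch.compile(model)      # captures and optimizes the graph
out = compiled_model(torch.randn(4, 128))  # first call triggers compilation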

The partnership creates significant mutual advantages for both vLLM and PyTorch core. vLLM gains a committed steward in the Foundation, ensuring long-term codebase maintenance, production stability, and transparent community governance. Meanwhile, PyTorch benefits from vLLM’s ability to dramatically expand PyTorch adoption across diverse accelerator platforms while driving innovation in cutting-edge features that enhance the entire ecosystem.

Recap of the PyTorch Korea User Group Meetup: A Technical Conference with a PyTorch Core Maintainer

At the end of March, the PyTorch Korea User Group hosted a special meetup that brought together prominent speakers for deep discussions on the PyTorch core and its broader ecosystem. With the event more than doubling in size compared to past gatherings, we were able to connect with even more developers and share insights. Huge thanks to goorm for sponsoring the fantastic venue! 😄

This recap is for those who couldn’t attend in person, as well as for participants who want to revisit the energy and insights of the day. The event featured experts in core PyTorch, AI accelerators, inference optimization, and large language model development. Below is a quick overview of the key sessions that anchored the conference.

1⃣ Jerry Lee | PyTorch Foundation

Representing the PyTorch Foundation, part of the Linux Foundation, Jaeung (Jerry) Lee provided an overview of how PyTorch is driving core open source technologies forward. He shared PyTorch’s growth story, the many global projects currently in motion, and the ecosystem’s impressive 20%+ annual growth. The session also covered how the foundation operates, how member organizations are involved, and upcoming plans that are particularly useful for practitioners.

2⃣ Alban Desmaison | PyTorch Roadmap

Alban shared the design philosophy behind PyTorch and Meta’s official contribution roadmap (link). He provided a deep technical dive into the differences between Eager and Compiled modes, especially breaking down the backend architecture of device Eager execution. Practical tools and improvements were also introduced—such as memory profilers, enhanced custom operator support, and pinned memory optimizations.

3⃣ Hongseok Kim | PyTorch on Rebellions AI Accelerators: Status

Rebellions is building runtime integration for their proprietary NPU architecture, fully aligned with the structural changes in PyTorch 2.0. This talk introduced the performance and scalability of their upcoming chip, their integration strategy with the PyTorch runtime, and challenges in supporting Eager Mode. Hongseok also previewed their roadmap toward releasing these features within the year.

4⃣ Kyujin Cho | Backend.AI: A Unified Platform for All AI Accelerators

Backend.AI abstracts and integrates various AI accelerators into a unified workflow. As the diversity of accelerator architectures grows, the need for portability and infrastructure unification becomes even more important. This session showcased features across development and operations—from NPU scheduling and resource allocation to monitoring. Backend.AI currently supports accelerators from NVIDIA, Intel, Tenstorrent, Rebellions, and more.

5⃣ Taeho Kim | Optimizing & Deploying Models Across Multiple Chipsets Using NetsPresso

This talk focused on the challenges of inference in real-world industrial applications of AI models. As new state-of-the-art models emerge rapidly, there’s a growing need for environments that can quickly validate device compatibility—ideally with one-click ease. NetsPresso is actively working on a static graph representation compatible with PyTorch, offering efficient support for model development, optimization, and testing.

6⃣ Jungyeop Lee | The Journey to Reproduce Deepseek-R1

Jungyeop took us through his journey of reproducing Deepseek, a large language model—an effort that involved 201 experiments. He shared real-world lessons from training with Korean data, tokenizer modifications, and fine-tuning strategies. His practical insights and next steps were especially valuable for those building or re-implementing large models from scratch.

7⃣ Sol Kim | A journey from TCP architecture to production-level LLMs

Sol presented an integrated optimization approach to deploying large models using the TCP (Tensor Contraction Processor) architecture, which supports tensor contraction at the hardware level. The talk highlighted optimization techniques built on hardware abstraction layers (HALs) and bottom-up integration strategies with PyTorch—offering a hybrid hardware-software perspective.

💡 Panel Talk & Q&A 💡

The event wrapped up with an engaging panel discussion. Attendees asked sharp questions, and the speakers offered insightful answers. It was a powerful moment that captured the community’s enthusiasm for PyTorch and their hunger for deeper technical understanding.

Final Thoughts

Since our first offline meetup in October 2022, the PyTorch Korea User Group has held five major technical conferences. Each event deepens our appreciation for the scale and depth of the PyTorch ecosystem. With perspectives from users, contributors, and ecosystem builders, the stories we share are only growing—and we’re committed to continuing this journey together.

See you at the next conference—with even more exciting talks to come! 🙌
