PyTorch 2.8 Release Blog

We are excited to announce the release of PyTorch® 2.8 (release notes)! This release features: 

  • A limited stable libtorch ABI for third-party C++/CUDA extensions
  • High-performance quantized LLM inference on Intel CPUs with native PyTorch
  • Wheel Variants, a mechanism for publishing platform-dependent wheels and selecting the most suitable package variant for a given platform. Note: this feature is experimental, with the aim of being upstreamed into the Python packaging ecosystem, but there are no compatibility guarantees. Please do share feedback.
  • Functional support for the new gfx950 architecture on ROCm 7, specifically max-autotune support with matmul, addmm, conv2d, bmm, and _scaled_mm templates for the TorchInductor and AOTInductor Composable Kernel backends.
  • Control flow operators (`cond`, `while_loop`, `scan`, `associative_scan`, and `map`) for compiling and exporting models with data-dependent control flow.
  • Inductor CUTLASS backend support for both torch.compile and AOTInductor, covering GEMMs such as mm, fp8 mm, addmm, and bmm.

and more! See below.

This release is composed of 4164 commits from 585 contributors since PyTorch 2.7. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.8. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

API-UNSTABLE FEATURES

[API-Unstable] torch::stable::Tensor 

Do you use or maintain third-party C++/CUDA extensions with torch? Whenever there’s a new release of PyTorch, like this one, you likely find yourself having to rebuild all of your wheels. If only there were a set of limited APIs these extensions could depend on so that you wouldn’t have to anymore! We’ve started building out a limited stable libtorch ABI, and now in 2.8, we have APIs for library registration (STABLE_TORCH_LIBRARY) and for the Tensor object (torch::stable::Tensor). An extension that relies on this stable subset of APIs is ABI-stable across libtorch versions, meaning you can build the extension with one torch version and run it with another. We are continuing to expand this subset of stable limited APIs, but you can check out a toy libtorch stable extension here.
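For illustration, here is a hedged Python-side sketch of how such an extension is consumed; the shared-library name and operator name are hypothetical placeholders, and the C++ side (the STABLE_TORCH_LIBRARY registration) lives in the toy extension linked above.

```python
import torch

# Hypothetical extension built against the stable ABI with an older torch;
# both the library file name and the op name are placeholders.
torch.ops.load_library("libmy_stable_ext.so")
out = torch.ops.my_ext.my_op(torch.randn(3))  # still callable on this torch version
```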

[API-Unstable] High-performance quantized LLM inference on Intel CPUs with native PyTorch

Quantization of LLMs saves storage and memory and reduces inference latency, so it is a popular technique for deploying LLMs. This feature provides high-performance quantized inference of LLMs on the latest Intel CPU platforms with native PyTorch. Supported configurations include A16W8, DA8W8, and A16W4.

When torch.compile’ing the quantized model, we lower the quantized GEMM patterns to template-based high-performance GEMM kernels with max-autotune in Inductor. With this feature, performance with the native PyTorch stack can come close to roofline performance on a single Intel CPU device, enabling PyTorch users to run low-precision LLM inference with a native experience and good performance. More details can be found in the RFC.
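As a rough illustration of the intended flow, the sketch below quantizes a small stand-in model and compiles it with max-autotune; the torchao front end (`quantize_` with `int8_weight_only`) is assumed here, so consult the RFC for the exact recommended recipe.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only  # torchao API assumed

# Tiny stand-in for an LLM block; any module built from nn.Linear layers works.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.SiLU(), nn.Linear(4096, 4096)
).to(torch.bfloat16).eval()

quantize_(model, int8_weight_only())                  # A16W8: bf16 activations, int8 weights
compiled = torch.compile(model, mode="max-autotune")  # Inductor lowers quantized GEMM patterns

with torch.no_grad():
    y = compiled(torch.randn(1, 4096, dtype=torch.bfloat16))
```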

[API-Unstable] Experimental Wheel Variant Support

Current mechanisms for distinguishing Python Wheels (i.e., Python ABI version, OS, CPU architecture, and Build ID) are insufficient for modern hardware diversity, particularly for environments requiring specialized dependencies such as high-performance computing and hardware-accelerated software (GPU, FPGA, ASIC, etc.).

This release introduces Wheel Variants, a mechanism for publishing platform-dependent wheels and selecting the most suitable package variant for a given platform. They include:

  • a system that enables multiple wheels for the same Python package version, distinguished by hardware-specific attributes.
  • a Provider Plugin system that dynamically detects platform attributes and recommends the most suitable wheel.

This experimental release comes with automatic and transparent NVIDIA CUDA platform detection, covering both the GPU and the CUDA driver, and installs the best-fitting package for the machine.

Note: This feature is experimental (see the RFC here), with the aim of being upstreamed into the Python packaging ecosystem, but there are no guarantees for compatibility. Please keep this in mind and do share feedback. More details will be provided in an upcoming blog about the future of PyTorch’s packaging, as well as in the 2.8 release live Q&A on August 14th! (see link)

Give it a try today:

Linux x86 and aarch64, macOS:

curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh

Windows x86:

powershell -ExecutionPolicy Bypass -c "$env:INSTALLER_DOWNLOAD_URL='https://wheelnext.astral.sh'; irm https://astral.sh/uv/install.ps1 | iex"

[API-Unstable] Inductor CUTLASS backend support 

CUTLASS is an NVIDIA header-only library that generates high-performance GEMMs with flexibility for fusions. It includes GEMM templates capable of instantiating thousands of kernels, which can be compiled independently of problem shapes and exhibit varying levels of performance across different shapes.

TorchInductor automates the autotuning process for all GEMMs in a model by precompiling kernels, caching them locally, and benchmarking them to select the optimal kernel for a given problem shape during model compilation. The generated kernels are performant and achieve state-of-the-art performance for some shapes; for bmm and fp8 kernels this amounted to gains of up to 10% and 16%, respectively, over Triton/cuBLAS on production workloads. The CUTLASS backend supports both torch.compile and AOTInductor, covering GEMMs such as mm, fp8 mm, addmm, and bmm. For more information, see this video in the PyTorch Compiler YouTube series.
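A hedged sketch of opting GEMM autotuning into the CUTLASS backend is shown below; the `max_autotune_gemm_backends` knob reflects current Inductor config naming and should be checked against the linked video and docs.

```python
import torch
from torch._inductor import config as inductor_config

# Ask max-autotune to consider CUTLASS-generated GEMMs alongside ATen and Triton.
# (Config name assumed from current Inductor naming.)
inductor_config.max_autotune_gemm_backends = "ATEN,TRITON,CUTLASS"

def mm(a, b):
    return a @ b

compiled = torch.compile(mm, mode="max-autotune")
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
out = compiled(a, b)  # kernels are precompiled, cached, and benchmarked per shape
```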

[API-Unstable] Inductor Graph Partition for CUDAGraph

For functions with only CUDA kernels, CUDAGraph mitigates CPU launch overhead and usually leads to good performance. However, complexities in a function may prevent the use of CUDAGraph, since it cannot support some popular ops (e.g., CPU ops, device copies, and cudagraph-unsafe custom ops). Graph partition is a compiler solution that automatically splits off these ops, reorders ops to reduce the number of partitions, and cudagraphifies the individual partitions.
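The sketch below illustrates the idea on a function that mixes CUDA and CPU work; the `graph_partition` flag name is an assumption about the Inductor config, so verify it against the Inductor configuration docs.

```python
import torch
from torch._inductor import config as inductor_config

inductor_config.graph_partition = True  # flag name assumed; see Inductor config docs

def f(x):
    y = x.relu() + 1           # CUDA-only ops: safe to capture in a CUDA graph
    z = y.cpu() * 2            # CPU op + device copy: cudagraph-unsafe, split off
    return y + z.to(x.device)  # back on CUDA: captured in a second partition

compiled = torch.compile(f, mode="reduce-overhead")  # reduce-overhead enables CUDAGraphs
out = compiled(torch.randn(1024, device="cuda"))
```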

[API-Unstable] `torch.compile` Hierarchical Compilation

The `nested_compile_region` API instructs torch.compile that the marked set of operations forms a nested compile region (often repeated in the full model) whose code can be compiled once and safely reused. During torch.compile tracing, the compiler applies hierarchical compilation: it emits optimized code for the marked region the first time it is encountered and re-emits the previously compiled code on every subsequent invocation. This can substantially reduce overall compile time for deeply stacked, structurally identical components such as the transformer layers of a large language model (LLM).
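Below is a minimal sketch, assuming `torch.compiler.nested_compile_region` as the entry point, marking the forward of a repeated block so it is compiled once and reused across layers.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    @torch.compiler.nested_compile_region  # compile this region once, reuse per layer
    def forward(self, x):
        return torch.relu(self.lin(x))

model = nn.Sequential(*[Block(256) for _ in range(12)])  # 12 structurally identical layers
compiled = torch.compile(model)
out = compiled(torch.randn(8, 256))
```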

[API-Unstable] Control Flow Operator Library

Users can use five control flow operators (`cond`, `while_loop`, `scan`, `associative_scan`, and `map`) to express complex control flow in models; a minimal `cond` sketch appears below. They provide the ability to:

  1. Compile or export models with data-dependent control flow, where execution paths depend on tensor values available only at runtime. 
  2. Avoid recompilation caused by dynamic shape-dependent control flow, where loop counts or conditions change with tensor sizes.
  3. Optimize large computational graphs by preventing their size from growing linearly due to loop unrolling, thereby reducing compilation time.

The library is primarily focused on inference and export. Training is also supported, with the exception of `while_loop`, which will gain training support in the 2.9 release.
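A minimal `torch.cond` sketch of point 1, data-dependent control flow that compiles without graph breaks:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x):
    # The predicate depends on runtime tensor values, yet both branches are
    # traced, so the compiled graph stays valid whichever path is taken.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

compiled = torch.compile(f, fullgraph=True)
print(compiled(torch.randn(4)))
```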

[API-Unstable] HuggingFace SafeTensors support in PyTorch Distributed Checkpointing

PyTorch Distributed Checkpointing (DCP) is addressing interoperability blockers to ensure that popular formats, like HuggingFace safetensors, work well with PyTorch’s ecosystem. Since safetensors has become a leading format for inference and fine-tuning, DCP added support for saving, loading, and re-sharding checkpoints in the SafeTensors format.
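A hedged sketch of the load path is below; `dcp.load` with a storage reader is the standard DCP pattern, but the `HuggingFaceStorageReader` class name and its constructor are assumptions, so check the DCP documentation for the exact API.

```python
import torch.nn as nn
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import HuggingFaceStorageReader  # class name assumed

model = nn.Linear(16, 16)  # stand-in for a fine-tuned model
state_dict = model.state_dict()

# Load (and re-shard as needed) a safetensors checkpoint directory into the state dict.
dcp.load(state_dict, storage_reader=HuggingFaceStorageReader(path="path/to/hf_checkpoint"))
model.load_state_dict(state_dict)
```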

[API-Unstable] SYCL support in PyTorch CPP Extension API

This feature allows users to implement new custom operators for Intel GPU platforms as SYCL kernels accessible via the PyTorch XPU device backend. SYCL is an open standard developed by the Khronos Group that allows developers to program heterogeneous architectures in standard C++. At the moment, this feature is available for Linux users. See the tutorial here.

[API-Unstable] A16W4 on XPU Device

This feature allows users to leverage A16W4 weight-only quantization to run LLM inference on Intel GPUs with TorchAO, reducing memory consumption and boosting inference speed. It supports both BF16 and FP16 activations and additionally allows users to choose between RTN (Rounding-to-Nearest) and AWQ (Activation-aware Weight Quantization) methods based on the accuracy requirements of specific scenarios.
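A rough sketch under the TorchAO API is shown below; `quantize_` and `int4_weight_only` are TorchAO's generic entry points, while the `Int4XPULayout` argument and group size are assumptions about the XPU recipe, so consult the TorchAO docs for the exact configuration.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only  # TorchAO API assumed
from torchao.dtypes import Int4XPULayout                      # XPU layout name assumed

# Stand-in module on an Intel GPU with bf16 activations.
model = nn.Sequential(nn.Linear(4096, 4096), nn.SiLU()).to("xpu", dtype=torch.bfloat16).eval()

# A16W4: int4 weights (RTN by default), bf16 activations.
quantize_(model, int4_weight_only(group_size=128, layout=Int4XPULayout()))
out = model(torch.randn(1, 4096, device="xpu", dtype=torch.bfloat16))
```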

[API-Unstable] Intel GPU distributed backend (XCCL)

XCCL is a distributed backend for Intel GPUs that allows users to enable various distributed training paradigms, such as DDP (DistributedDataParallel), FSDP (FullyShardedDataParallel), PP (pipeline parallelism), and TP (tensor parallelism), on XPU devices. The XCCL backend provides all communication ops defined in PyTorch, such as all_reduce, all_gather, and reduce_scatter. It can be applied transparently on XPU devices or explicitly specified with the "xccl" name during PyTorch process group initialization. See the tutorial here.
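A minimal DDP sketch on XPU devices, assuming a standard torchrun launch (which sets LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="xccl")          # explicitly select the XCCL backend
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("xpu", local_rank)
torch.xpu.set_device(device)

model = DDP(torch.nn.Linear(16, 16).to(device))  # gradient sync runs over XCCL
loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()

t = torch.ones(1, device=device)
dist.all_reduce(t)                               # explicit collective over XCCL
dist.destroy_process_group()
```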

Read More