Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s

Meta: Less Wright, Hamid Shojanazeri, Vasiliy Kuznetsov, Daniel Vega-Myhre, Gokul Nadathur, Will Constable, Tianyu Liu, Tristan Rice, Driss Guessous, Josh Fromm, Luca Wehrstedt, Jiecao Yu, Sandeep Parab
Crusoe: Ethan Petersen, Martin Cala, Chip Smith

Working with Crusoe.AI, we were provided access to one of their new 2K H200 clusters in Iceland. This enabled us to showcase training accelerations of 34-43% at scale by leveraging TorchTitan’s HSDP2 and TorchAO’s new Float8 rowwise, with comparable convergence and stability vs. BF16.

[Bar chart: Float8 rowwise training speedups over BF16 at various GPU scales]

In this post we detail how the H200’s capabilities, PyTorch’s new Float8 rowwise training, and TorchTitan’s FSDP2/HSDP2 and Context Parallel (CP) combine to accelerate training at scale.

Background – what is an H200?

The H200 is an ‘enhanced’ H100, offering the same compute as an H100 but with two additional improvements:

  • Larger global memory, 141GiB HBM3e vs the standard 80GiB HBM3
  • Memory bandwidth is ~43% higher, at 4.8 TB/s vs 3.35 TB/s. The faster memory transfer has an outsized effect on training speed, especially for PyTorch’s AsyncTP.

What is PyTorch Float8 rowwise?

Float8 rowwise is a finer-grained scaling scheme for Float8 than the previous ‘tensorwise’ Float8. It is designed to provide finer-grained accuracy for larger workloads, which tend to become more sensitive to quantization at scale and as training progresses.

There are two key improvements with Float8 rowwise:

  • Each row now maintains its own scaling factor, versus a single scaling factor for the entire tensor, thus improving quantization precision. Finer-grained per-row scaling reduces the effect of outliers (extreme values that force the quantization scaling factor to stretch and degrade the precision of the normally distributed values) and thus ensures better precision.
  • The scaling factor itself is now rounded down to the nearest power of 2. This has been shown to help reduce quantization error when multiplying/dividing by the scaling factor, and ensures that large values remain scaled to the same value in both the forward and backward passes. A minimal sketch of both ideas follows below.
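
To make the two ideas above concrete, here is a minimal, self-contained sketch of rowwise Float8 quantization with power-of-2 scale rounding. This is an illustrative approximation only, not the TorchAO kernel; it assumes the float8_e4m3fn dtype (max representable value 448) available in recent PyTorch builds.

import torch

def quantize_rowwise_fp8(x: torch.Tensor):
    # One scale per row: map each row's absmax to the float8 e4m3 max (448).
    row_absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = 448.0 / row_absmax
    # Round the scale down to the nearest power of 2 so multiplying/dividing by it
    # is exact and the forward/backward passes see identical scaling.
    scale = torch.exp2(torch.floor(torch.log2(scale)))
    x_fp8 = (x * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8)
x_fp8, scale = quantize_rowwise_fp8(x)
x_dequant = x_fp8.to(torch.float32) / scale  # approximate reconstruction of x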

Note that other large scale models have been trained using Float8 at 2K GPU scale with a combination of 1×128 groupwise and 128×128 blockwise scaling, also with power-of-2 scaling factors. They had the same goal of improving Float8’s precision to support large scale training.

Thus, Float8 rowwise offers a similar promise of enabling Float8 for very large scale training, but we wanted to provide proof of stability and convergence at scale; training on the Crusoe H200 2K cluster provided that initial verification.

Showcasing Float8 Rowwise Loss convergence vs BF16 at 1600 and 1920 GPU Scale:

In order to verify comparable loss convergence, we ran two separate runs at both 1920 and 1600 (1.6K) GPU scale using TorchTitan and Llama 3 70B. The 1.6K GPU runs were set for 2.5K iterations, using TorchTitan’s HSDP2 and Context Parallel to enable 2D parallelism.

The loss convergence tests were run using TorchTitan’s deterministic mode. This mode effectively freezes most potential sources of variation from run to run, and thus helps ensure that the only substantial change is what we want to test, namely the loss convergence and loss curves of BF16 vs Float8 rowwise.

Note that deterministic mode also slows down training speed because various kernels will not be autotuned to maximize throughput (otherwise we risk using different kernels between runs and introducing variance).

Two runs were completed, one with BF16 and the other with Float8 Rowwise.

Both runs completed their assigned 2.5K iterations without issue, showcasing the Crusoe cluster’s stability, with Float8 completing at exactly 24 hours and BF16 finishing after 31 hours, 19 minutes.

DType            Time / Iters                   Loss
BF16             24 hours                       3.15453
Float8 Rowwise   24 hours                       2.86386
BF16             31 hours, 19 minutes / 2.5K    2.88109
Float8 Rowwise   24 hours / 2.5K                2.86386

At the 24 hour mark, Float8 had already completed its 2.5K iterations, showcasing the comparative speedup (even in deterministic mode) of Float8 training. For the same 24 hours of large scale training time, Float8 delivered a 9.21% relative improvement in loss compared to BF16.

After 31 hours, 19 minutes, the BF16 run finally completed its 2.5K iterations.

The final loss numbers:
BF16 = 2.88109
Float8 = 2.86386

The loss curves were very similar over the first and last thirds of training, with a turbulent zone in the middle where both runs showed similar spikes, albeit with a slight skew in the relative timing of the spikes.

[Line chart: BF16 vs Float8 rowwise loss curves over the 2.5K-iteration runs]

From this we can see that PyTorch’s Float8 rowwise offers comparable convergence while finishing the same 2.5K iterations in 24 hours vs 31 hours, 19 minutes, an over 30% effective speedup even in deterministic mode.

Long Term Training stability with Float8 Rowwise

Beyond showcasing comparable convergence, we also wanted to show longer term training stability with Float8, so we launched a 4-day, 15K-iteration run at 256 GPU scale.

[Line chart: Float8 rowwise loss over the 15K-iteration, 100+ hour run at 256 GPU scale]

As shown above, Float8 training ran for over 100 hours with no issues, highlighting the long term stability of Float8 Rowwise.

Determinism in TorchTitan

To verify determinism and to see whether the spikiness in the longer runs was caused by scale, we also ran a smaller experiment comprising two BF16 runs and one Float8 run at 256 GPU scale, with HSDP2 only (i.e. without Context Parallel / 2D parallelism).

In this case both BF16 runs had identical curves and final loss, and we saw a similar spikiness zone for all three runs.

At the 2K iteration mark, both Float8 and BF16 ended at nearly identical points:
BF16 *2 = 3.28538
Float8 rowwise = 3.28203

[Line chart: loss curves for the two BF16 runs and the Float8 rowwise run at 256 GPU scale]

The above result confirms that neither CP nor scale (2K) is responsible for the spikiness in the loss, as we saw a similar effect at 256 GPU scale as well. The most likely explanation for the loss spikes is the content distribution of the dataset.

For the sake of determinism, the experiments were run with a serialized C4 dataset (not shuffled), meaning the spikes could be from encountering new content within the dataset.

Net speedups at various Scales with Float8 rowwise:

We performed shorter runs at various GPU scales to understand how Float8 rowwise scales in terms of training acceleration as cluster sizes expand. Doubling in scale from 960 to 1920 GPUs, Float8 continued to deliver impressive training speedups, with gains of 34-43% compared to BF16. We also want to note that when scaling from 1K to 2K GPUs, communication overhead likely kicked in, and we observed a 4% hit on BF16 throughput.

[Bar chart: Float8 rowwise vs BF16 throughput speedups at 960 and 1920 GPU scale]

As shown in the longer training runs at scale above, Float8 rowwise delivered substantial speedups with equal or even slightly improved loss endpoints, including a 34% speedup at 1920 GPU (DeepSeek) scale.

How can I use Float8 Rowwise in my training?

Float8 rowwise is available now for you to use in your large scale training. It is packaged in TorchAO’s latest builds (0.9 and higher) and integrated natively into TorchTitan if you want to get up and running quickly.

To activate Float8 Rowwise in TorchTitan:

First, enable the model converter to hot-swap the nn.Linear layers into Float8 linear layers in your model’s .toml config file:

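As a rough sketch only (the section and key names below follow recent TorchTitan Llama 3 train configs and may differ in your TorchTitan version), the relevant portion of the .toml looks like:

[model]
name = "llama3"
flavor = "70B"
# Swap eligible nn.Linear layers for float8 linears at model build time.
converters = "float8"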

Second, specify the ‘rowwise’ Float8 recipe:

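Again as a rough sketch, with the same caveat that key names may vary across TorchTitan versions, the recipe is selected via the recipe_name option in the float8 section:

[float8]
# One of "rowwise" (recommended default), "tensorwise", or "rowwise_with_gw_hp".
recipe_name = "rowwise"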

Note that you have three choices for the ‘recipe_name’:

  • rowwise which is the recommended default,
  • tensorwise (the older style float8) and
  • rowwise_with_gw_hp.

The rowwise_with_gw_hp option keeps the gradients with respect to the weights in BF16 precision during the backward pass, and this can further enhance Float8 precision for extremely sensitive workloads. Somewhat counterintuitively, it can also be a bit more performant than generic rowwise if the majority of the matmul sizes in your model are smaller (with an estimated tipping point at roughly 13-16K dimensions on H100).

Thus, while we recommend rowwise as the default, it may be worth comparing it with rowwise_with_gw_hp on your model to verify which provides the best performance, with the upside of potentially even greater precision.

By commenting the model converter line in and out (with a #), you can directly compare training acceleration between BF16 and Float8 rowwise to understand the potential speedups for your own training.

Future Updates:

We’ll have an additional update coming that showcases multiple improvements for Pipeline Parallel and Async Distributed Checkpointing, so please stay tuned.

Accelerate PyTorch 2.7 on Intel® GPUs

PyTorch 2.7 continues to deliver significant functionality and performance enhancements on Intel® GPU architectures to streamline AI workflows. Application developers and researchers seeking to fine-tune, run inference on, and develop PyTorch models on Intel GPUs will now have a consistent user experience across various operating systems, including Windows, Linux and Windows Subsystem for Linux (WSL2). This is made possible through improved installation, eager mode script debugging, a performance profiler, and graph mode (torch.compile) deployment. As a result, developers have greater options with a unified GPU programming paradigm for both front-end and back-end development.

Incremental improvements of Intel GPU support in PyTorch

Since PyTorch 2.4, we’ve made steady improvements to Intel GPU support with each release. With PyTorch 2.7, we are excited to share that we have established a solid foundation for Intel GPUs to work in both graph mode (torch.compile) and eager mode on Windows and Linux. This covers a wide range of Intel GPU products, many of which you may already have access to. We hope these enhancements will unlock more ubiquitous hardware for your AI research and development.

Check out the detailed advancements in these related release blogs: PyTorch 2.4, PyTorch 2.5, and PyTorch 2.6.

What’s New in PyTorch 2.7

These are the features in PyTorch 2.7 that were added to help accelerate performance on Intel GPUs.

  • Improve scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
    With the new SDPA optimization for Intel GPUs on PyTorch 2.7, Stable Diffusion float16 inference achieved up to 3x gain over PyTorch 2.6 release on Intel® Arc™ B580 Graphics and Intel® Core™ Ultra 7 Processor 258V with Intel® Arc™ Graphics 140V on eager mode. See Figure 1 below.

Figure 1. PyTorch 2.7 Stable Diffusion Performance Gains Over PyTorch 2.6

  • Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. With this, Intel GPUs became the first accelerator to support torch.compile on Windows. Refer to Windows tutorial for details.
    Graph mode (torch.compile) is enabled on Windows 11 for Intel GPUs for the first time in PyTorch 2.7, delivering the same performance advantages over eager mode as on Linux. The latest performance data, measured with the PyTorch Dynamo Benchmarking Suite on Intel® Arc™ B580 Graphics on Windows, showcases the torch.compile speedup ratio over eager mode, as shown in Figure 2. Both training and inference achieved similar significant improvements (a brief usage sketch follows this list).

Figure 2. Torch.compile Performance Gains Over Eager Mode on Windows

  • Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency. Refer to PT2E tutorial for details.
  • Enable AOTInductor and torch.export on Linux to simplify deployment workflows. Refer to AOTInductor tutorial for details.
  • Enable profiler on both Windows and Linux to facilitate model performance analysis. Refer to the PyTorch profiler tutorial for details.
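
As a quick illustration of the SDPA and torch.compile items above, here is a minimal sketch, assuming a PyTorch 2.7 build with Intel GPU support installed (which exposes the "xpu" device), of running scaled dot-product attention on an Intel GPU under torch.compile; the shapes are illustrative only.

import torch
import torch.nn.functional as F

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device == "xpu" else torch.float32

# Batch of 1, 8 heads, sequence length 1024, head dim 64.
q, k, v = (torch.randn(1, 8, 1024, 64, dtype=dtype, device=device) for _ in range(3))

def attention(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)

# Graph mode works on both Windows 11 and Linux for Intel GPUs in PyTorch 2.7.
compiled_attention = torch.compile(attention)
out = compiled_attention(q, k, v)
print(out.shape, out.dtype)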

Review the Getting Started on Intel GPU Guide for a tour of the environment setup and a quick start on Intel GPUs.

Future Work

Looking ahead, we will continue the Intel GPU upstream efforts in future PyTorch releases to:

  • Attain state-of-the-art PyTorch-native performance by showcasing competitive GEMM computational efficiency for torch.compile, and enhance performance for LLMs through FlexAttention and lower precision data types.
  • Broaden feature compatibility by delivering distributed XCCL backend support for Intel® Data Center GPU Max Series.
  • Expand accelerator support across core PyTorch ecosystem components including torchao, torchtune, and torchtitan.

Follow along in the PyTorch Dev Discussion to learn more about Intel GPU & CPU enabling status and features. As we get further along, we will create tickets on GitHub to document our progress.

Summary

In this blog, we reviewed the Intel GPU upstream progress starting in PyTorch 2.4 and highlighted the new features of PyTorch 2.7 that accelerate AI workload performance across various Intel GPUs. These new features, especially SDPA on Windows, achieved up to 3x inference (Stable Diffusion, float16) gain over PyTorch 2.6 release on Intel Arc B580 Graphics and Intel Core Ultra 7 Processor 258V with Intel Arc Graphics 140V. Also, torch.compile on Windows delivers similar performance advantages over eager mode on Dynamo benchmarks as on Linux.

Acknowledgments

We want to thank the following PyTorch maintainers for their technical discussions and insights: Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao.

We also thank collaborators from PyTorch for their professional support and guidance.

Product and Performance Information

Measurement on Intel Core Ultra 7 258V: 2200 MHz, 8 Core(s), 8 Logical Processor(s) with Intel Arc 140V GPU (16GB), GPU memory 18.0 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Pro – 24H2. And Intel Core Ultra 5 245KF: 4200 MHz, 14 Core(s), 14 Logical Processor(s), Intel Arc B580 Graphics, dedicated GPU memory 12.0 GB, shared GPU memory 15.8 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Enterprise LTSC – 24H2. Test by Intel on Apr 8th, 2025.

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AI Disclaimer

AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC. Results may vary.

PyTorch 2.7 Release

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

  • support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures;
  • torch.compile support for Torch Function Modes, which enables users to override any torch.* operation to implement custom user-defined behavior;
  • Mega Cache, which allows users to have end-to-end portable caching for torch;
  • new features for FlexAttention: LLM first token processing, LLM throughput mode optimization, and Flex Attention for inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta:
  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype:
  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on x86 CPUs
  • FlexAttention LLM throughput mode optimization on x86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

To see a full list of public feature submissions, click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any torch.* operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to re-write indexing ops.

See the tutorial for more information.
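
As a hypothetical illustration of the idea (this example is not from the tutorial), the sketch below defines a TorchFunctionMode that rewrites torch.add inside a torch.compile'd function; whether a given mode traces cleanly can depend on how it is written.

import torch
from torch.overrides import TorchFunctionMode

class RewriteAdd(TorchFunctionMode):
    # Rewrite torch.add(a, b) as a + 2 * b, purely to show the override is
    # honored inside the compiled region.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.add:
            a, b = args
            return a + 2 * b
        return func(*args, **kwargs)

@torch.compile
def f(x, y):
    return torch.add(x, y)

with RewriteAdd():
    print(f(torch.ones(2), torch.ones(2)))  # tensor([3., 3.]) under the mode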

[Beta] Mega Cache

Mega Cache allows users to have end-to-end portable caching for torch. The intended use case is that, after compiling and executing a model, the user calls torch.compiler.save_cache_artifacts(), which returns the compiler artifacts in a portable form. Later, potentially on a different machine, the user may call torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches in order to jump-start their cache.

See the tutorial for more information.
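
A minimal sketch of that workflow follows, assuming the save call returns a (bytes, cache-info) pair as in the 2.7 tutorial; the file name is purely illustrative.

import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))  # compile and run once so the caches are populated

artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    artifact_bytes, cache_info = artifacts
    with open("compile_cache.bin", "wb") as fh:  # illustrative path
        fh.write(artifact_bytes)

# ... later, potentially on a different machine ...
with open("compile_cache.bin", "rb") as fh:
    torch.compiler.load_cache_artifacts(fh.read())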

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

  • Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
  • PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
  • To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

The PyTorch Context Parallel API allows users to create a Python context so that every torch.nn.functional.scaled_dot_product_attention() call within it will run with context parallelism. Currently, PyTorch Context Parallel supports three attention backends: 1. Flash attention; 2. Efficient attention; and 3. cuDNN attention.

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements:

  • Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux.
  • Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipelines with enhanced computational efficiency.
  • Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
  • Enable AOTInductor and torch.export on Linux to simplify deployment workflows.
  • Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode.
  • Enable profiler on both Windows and Linux to facilitate model performance analysis.
  • Expand the Intel GPUs support to Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, and Intel® Arc™ B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

See also the tutorials here and here.

[Prototype] FlexAttention LLM first token processing on x86 CPUs

FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations (such as PageAttention, which is critical for LLM inference) via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
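
For orientation, here is a minimal sketch of the unified FlexAttention API with a causal score_mod, compiled with torch.compile; on an x86 CPU build this lowers through the TorchInductor C++ backend. The shapes and the score_mod are illustrative only.

import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, batch, head, q_idx, kv_idx):
    # Mask out future positions, as in standard causal attention.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

compiled_flex_attention = torch.compile(flex_attention)

# Batch 1, 4 heads, sequence length 128, head dim 64.
q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))
out = compiled_flex_attention(q, k, v, score_mod=causal)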

[Prototype] FlexAttention LLM throughput mode optimization

The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template ability. This addresses the performance bottlenecks for large batch size scenarios present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using FlexAttention APIs and torch.compile for LLM throughput serving on x86 CPUs.

[Prototype] Foreach Map

This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing torch._foreach_* ops. The main advantage over the existing torch._foreach_* ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined Python functions can be lifted to apply to lists of tensors. torch.compile will automatically generate a horizontally fused kernel for optimal performance.

See tutorial here.

[Prototype] Flex Attention for Inference

In release 2.5.0, FlexAttention (torch.nn.attention.flex_attention) was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support.

[Prototype] Prologue Fusion Support in Inductor

Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory bandwidth.

Read More

PyTorch 2.7 Release

PyTorch 2.7 Release

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

  • support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
  • torch.compile support for Torch Function Modes which enables users to override any *torch.** operation to implement custom user-defined behavior.
  • Mega Cache which allows users to have end-to-end portable caching for torch;
  • new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta Prototype
Torch.Compile support for Torch Function Modes NVIDIA Blackwell Architecture Support
Mega Cache PyTorch Native Context Parallel
Enhancing Intel GPU Acceleration
FlexAttention LLM first token processing on x86 CPUs
FlexAttention LLM throughput mode optimization on x86 CPUs
Foreach Map
Flex Attention for Inference
Prologue Fusion Support in Inductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any *torch.** operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to re-write indexing ops.

See the tutorial for more information.

[Beta] Mega Cache


PyTorch 2.7 Release

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

  • support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
  • torch.compile support for Torch Function Modes, which enables users to override any torch.* operation to implement custom user-defined behavior.
  • Mega Cache, which allows users to have end-to-end portable caching for torch.compile.
  • new features for FlexAttention: LLM first token processing, LLM throughput mode optimization, and Flex Attention for Inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta

  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype

  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on x86 CPUs
  • FlexAttention LLM throughput mode optimization on x86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

To see a full list of public feature submissions, click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any torch.* operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to rewrite indexing ops.
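To make the override mechanism concrete, here is a minimal sketch (our own toy example, not taken from the tutorial) that uses torch.overrides.TorchFunctionMode to swap torch.add for torch.mul inside a compiled function:

```python
import torch
from torch.overrides import TorchFunctionMode

class AddToMul(TorchFunctionMode):
    """Toy mode: rewrite every torch.add call into torch.mul."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.add:
            return torch.mul(*args, **kwargs)
        return func(*args, **kwargs)

@torch.compile
def f(x, y):
    return torch.add(x, y)

x, y = torch.randn(8), torch.randn(8)
with AddToMul():
    out = f(x, y)          # the compiled graph respects the mode, so out == x * y
assert torch.allclose(out, x * y)
```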

See the tutorial for more information.

[Beta] Mega Cache

Mega Cache gives users end-to-end portable caching for torch.compile. The intended use case: after compiling and executing a model, the user calls torch.compiler.save_cache_artifacts(), which returns the compiler artifacts in a portable form. Later, potentially on a different machine, the user can call torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches and jump-start compilation.
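As a rough sketch of the intended workflow (the exact return type of save_cache_artifacts() may differ between builds; see the tutorial for details):

```python
import torch

@torch.compile
def step(x):
    return torch.relu(x) @ x

step(torch.randn(64, 64))  # compile once so there are artifacts to save

# Serialize the accumulated compiler artifacts into a portable form.
# In current builds this returns the raw bytes plus a summary object.
result = torch.compiler.save_cache_artifacts()
if result is not None:
    artifact_bytes, cache_info = result
    with open("compile_cache.bin", "wb") as f:
        f.write(artifact_bytes)

# Later, possibly on a different (but compatible) machine:
with open("compile_cache.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())
# Subsequent torch.compile runs now start from the pre-populated caches.
```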

See the tutorial for more information.

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

  • Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
  • PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
  • To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

More context can also be found here.
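A quick way to check that a CUDA 12.8 wheel sees a Blackwell GPU (Blackwell parts report compute capability 10.x for data center SKUs and 12.x for consumer SKUs):

```python
import torch

print(torch.__version__, torch.version.cuda)       # expect 2.7.x and 12.8
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))      # e.g. (10, 0) on B200
```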

[Prototype] PyTorch Native Context Parallel

The PyTorch Context Parallel API allows users to create a Python context so that every torch.nn.functional.scaled_dot_product_attention() call within it will run with context parallelism. Currently, PyTorch Context Parallel supports three attention backends: Flash attention, Efficient attention, and cuDNN attention.

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.
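As a minimal sketch (assuming a multi-GPU distributed launch, e.g. via torchrun; the import path and keyword names follow the current experimental API and may change while this is a prototype):

```python
import os
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world_size = int(os.environ["WORLD_SIZE"])
cp_mesh = init_device_mesh("cuda", (world_size,), mesh_dim_names=("cp",))

# q, k, v have shape (batch, heads, seq, head_dim); seq is dimension 2.
q = torch.randn(2, 8, 4096, 64, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Every SDPA call inside the context shards the sequence dimension across ranks.
with context_parallel(cp_mesh, buffers=[q, k, v], buffer_seq_dims=[2, 2, 2]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```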

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements:

  • Enable torch.compile on Windows 11 for Intel GPUs, delivering the same performance advantages over eager mode as on Linux (see the sketch at the end of this section).
  • Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPUs to provide full graph-mode quantization pipelines with enhanced computational efficiency.
  • Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
  • Enable AOTInductor and torch.export on Linux to simplify deployment workflows.
  • Implement more ATen operators to improve operator coverage, execution continuity, and eager-mode performance on Intel GPUs.
  • Enable the profiler on both Windows and Linux to facilitate model performance analysis.
  • Expand Intel GPU support to Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, and Intel® Arc™ B-Series graphics, on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

See also the tutorials here and here.
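To illustrate the first bullet above, a minimal sketch of torch.compile on an Intel GPU (requires a PyTorch build with XPU support; the toy model is our own):

```python
import torch

if torch.xpu.is_available():
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.GELU(),
        torch.nn.Linear(1024, 1024),
    ).to("xpu")
    compiled = torch.compile(model)
    x = torch.randn(32, 1024, device="xpu")
    with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
        y = compiled(x)   # works on both Windows 11 and Linux in 2.7
```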

[Prototype] FlexAttention LLM first token processing on x86 CPUs

FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations such as PagedAttention (critical for LLM inference) via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing (prefill) of LLMs are supported. With this feature, users get a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API and benefiting from general support and good performance when using torch.compile.
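For illustration, a minimal FlexAttention sketch on CPU with a prefill-shaped (first token) input; the causal score_mod and the shapes are our own choices:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Standard causal masking expressed as a score modification.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# CPU support is driven through torch.compile (TorchInductor C++ backend).
compiled_flex = torch.compile(flex_attention)

B, H, S, D = 1, 8, 1024, 64            # prefill-shaped (first token) input
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = compiled_flex(q, k, v, score_mod=causal)
```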

[Prototype] FlexAttention LLM throughput mode optimization

The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template capability. This addresses the performance bottlenecks for large-batch-size scenarios that were present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using the FlexAttention API and torch.compile for LLM throughput serving on x86 CPUs.

[Prototype] Foreach Map

This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing torch._foreach_* ops. The main advantage over the existing torch._foreach_* ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined Python functions can be lifted to apply to lists of tensors. torch.compile will automatically generate a horizontally fused kernel for optimal performance.
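A rough sketch of the idea; the import path below matches the linked tutorial at the time of writing, but since this is a prototype it may change:

```python
import torch
# Prototype entry point; path taken from the foreach_map tutorial and may move.
from torch._higher_order_ops.foreach_map import foreach_map

xs = [torch.randn(1024) for _ in range(10)]
ys = [torch.randn(1024) for _ in range(10)]

def scaled_add(x, y):                     # user-defined pointwise function
    return x + 0.1 * y

# torch.compile lifts the function over the lists and emits one
# horizontally fused kernel instead of ten separate launches.
fused = torch.compile(lambda a, b: foreach_map(scaled_add, a, b))
out = fused(xs, ys)                       # list of 10 result tensors
```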

See tutorial here.

[Prototype] Flex Attention for Inference

In release 2.5.0, FlexAttention (torch.nn.attention.flex_attention) was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides, and trainable biases support.

[Prototype] Prologue Fusion Support in Inductor

Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory traffic.
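As an illustration (our own example, assuming a CUDA device), a compiled function where a cast and scale precede a matmul; when Inductor selects one of its generated matmul templates, e.g. under mode="max-autotune", these prologue ops can be fused into the matmul kernel rather than materialized separately:

```python
import torch

@torch.compile(mode="max-autotune")
def scaled_matmul(a, b, scale: float):
    a16 = a.to(torch.bfloat16) * scale    # prologue: cast + scale
    return a16 @ b                        # candidate for prologue fusion

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
out = scaled_matmul(a, b, 0.5)
```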

Read More

Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers

Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation.

A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers.

Why Run Whisper on Arm?

Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive.

What You’ll Learn

The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you’ll cover:

1. Set Up Your Environment

Before running Whisper, you need a working development environment. The learning path walks you through launching an Arm-based cloud instance and installing the required dependencies, such as PyTorch, Transformers, and ffmpeg.

2. Run Whisper with PyTorch and Hugging Face Transformers

Once the environment is ready, you will use the Hugging Face Transformers library with PyTorch to load and run Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach to processing audio files and generating transcripts.
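As a minimal sketch of that step (the model id and audio file below are illustrative, not prescribed by the learning path):

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",     # illustrative checkpoint
    torch_dtype=torch.float32,
    device="cpu",                     # Arm CPU instance
)
result = asr("sample.wav", return_timestamps=True)   # requires ffmpeg for decoding
print(result["text"])
```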

3. Measure and Evaluate Performance

To ensure efficient execution, you’ll learn how to measure transcription speed and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions about your deployment.
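For example, a simple latency and real-time-factor measurement (file name and clip length are illustrative; asr is the pipeline from the sketch above):

```python
import time

start = time.perf_counter()
result = asr("sample.wav")                # pipeline from the previous sketch
elapsed = time.perf_counter() - start

audio_seconds = 30.0                      # length of the clip, illustrative
print(f"latency: {elapsed:.2f}s  real-time factor: {elapsed / audio_seconds:.2f}")
```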

Try it Yourself

Upon completing this tutorial, you’ll know how to:

  • Deploy Whisper on an Arm-based cloud instance.
  • Implement performance optimizations for efficient execution.
  • Evaluate transcription speeds and optimize further based on results.

Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.

Read More