PyTorch 1.8 Release, including Compiler and Distributed Training updates, and New Mobile Tutorials

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training, including pipeline and model parallelism and gradient compression. A few of the highlights include:

  1. Support for Python-to-Python functional transformations via torch.fx;
  2. Added or stabilized APIs for FFTs (torch.fft) and linear algebra functions (torch.linalg), added autograd support for complex tensors, and updates to improve performance when calculating Hessians and Jacobians; and
  3. Significant updates and improvements to distributed training including: improved NCCL reliability; pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.

See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

New and Updated APIs

The PyTorch 1.8 release brings a host of new and updated API surfaces, ranging from additional NumPy-compatibility APIs to tools for improving and scaling your code for performance at both inference and training time. Here is a brief summary of the major features coming in this release:

[Stable] Torch.fft support for high performance NumPy style FFTs

As part of PyTorch’s goal to support scientific computing, we have invested in improving our FFT support and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for hardware acceleration and autograd.
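
As a minimal illustrative sketch (the tensor values here are arbitrary), the interface mirrors np.fft while remaining a regular autograd-capable tensor operation:

import torch

signal = torch.arange(8, dtype=torch.float32)

# Forward FFT; the result is a complex tensor
freq = torch.fft.fft(signal)

# Inverse FFT recovers the original signal (up to floating point error)
recovered = torch.fft.ifft(freq).real

The same calls work on GPU tensors and can be backpropagated through.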

[Beta] Support for NumPy style linear algebra functions via torch.linalg

The torch.linalg module, modeled after NumPy’s np.linalg module, brings NumPy-style support for common linear algebra operations including Cholesky decompositions, determinants, eigenvalues and many others.
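
As a short sketch of the NumPy-style surface (the matrix below is constructed only so the factorizations are well-defined):

import torch

a = torch.randn(3, 3)
spd = a @ a.T + 3 * torch.eye(3)       # a symmetric positive-definite matrix

l = torch.linalg.cholesky(spd)         # Cholesky factor, mirrors np.linalg.cholesky
det = torch.linalg.det(spd)            # determinant
eigvals = torch.linalg.eigvalsh(spd)   # eigenvalues of a symmetric/Hermitian matrix
norm = torch.linalg.norm(spd)          # Frobenius norm by default, like np.linalg.norm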

[Beta] Python code Transformations with FX

FX allows you to write transformations of the form transform(input_module : nn.Module) -> nn.Module, where you can feed in a Module instance and get a transformed Module instance out of it.

This kind of functionality is applicable in many scenarios. For example, the FX-based Graph Mode Quantization product is releasing as a prototype contemporaneously with FX. Graph Mode Quantization automates the process of quantizing a neural net and does so by leveraging FX’s program capture, analysis and transformation facilities. We are also developing many other transformation products with FX and we are excited to share this powerful toolkit with the community.

Because FX transforms consume and produce nn.Module instances, they can be used within many existing PyTorch workflows. This includes workflows that, for example, train in Python then deploy via TorchScript.

Below is an FX transform example:

import torch
import torch.fx
from torch import nn

def transform(m: nn.Module,
              tracer_class: type = torch.fx.Tracer) -> nn.Module:
    # Step 1: Acquire a Graph representing the code in `m`

    # NOTE: torch.fx.symbolic_trace is a wrapper around a call to
    # fx.Tracer.trace and constructing a GraphModule. We'll
    # split that out in our transform to allow the caller to
    # customize tracing behavior.
    graph: torch.fx.Graph = tracer_class().trace(m)

    # Step 2: Modify this Graph or create a new one
    graph = ...

    # Step 3: Construct a Module to return
    return torch.fx.GraphModule(m, graph)
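
For context, here is a small sketch of the tracing step on a toy module; torch.fx.symbolic_trace bundles together the trace and GraphModule construction that the transform above splits apart:

import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

# symbolic_trace captures forward() as a Graph inside a GraphModule
traced = torch.fx.symbolic_trace(M())
print(traced.graph)   # the captured IR
print(traced.code)    # Python source regenerated from the Graph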

You can read more about FX in the official documentation. You can also find several examples of program transformations implemented using torch.fx here. We are constantly improving FX and invite you to share any feedback you have about the toolkit on the forums or issue tracker.

Distributed Training

The PyTorch 1.8 release adds a number of new features as well as improvements to reliability and usability. Concretely, it adds stable support for async error/timeout handling to improve NCCL reliability, and stable support for RPC-based profiling. Additionally, we have added support for pipeline parallelism as well as gradient compression through the use of communication hooks in DDP. Details are below:

[Beta] Pipeline Parallelism

As machine learning models continue to grow in size, traditional DistributedDataParallel (DDP) training no longer scales because these models do not fit on a single GPU device. The new pipeline parallelism feature provides an easy-to-use PyTorch API to leverage pipeline parallelism as part of your training loop.
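
As an illustration only (assuming a machine with two GPUs; names and exact arguments here are a sketch, not a definitive recipe), wrapping an nn.Sequential model with the Pipe API looks roughly like this:

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe is built on the RPC framework, so RPC must be initialized first,
# even in a single-process setup.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Place each stage of the Sequential model on a different GPU; Pipe splits each
# mini-batch into micro-batches ("chunks") and overlaps work across the stages.
stage1 = nn.Linear(16, 8).cuda(0)
stage2 = nn.Linear(8, 4).cuda(1)
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)

# forward returns an RRef to the output; fetch it locally
out = model(torch.randn(32, 16).cuda(0)).local_value()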

[Beta] DDP Communication Hook

The DDP communication hook is a generic interface to control how to communicate gradients across workers by overriding the vanilla allreduce in DistributedDataParallel. A few built-in communication hooks are provided including PowerSGD, and users can easily apply any of these hooks to optimize communication. Additionally, the communication hook interface can also support user-defined communication strategies for more advanced use cases.
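
As a rough sketch of registering the built-in PowerSGD hook (the single-process setup and the PowerSGDState arguments are illustrative; in practice this runs once per worker under a launcher such as torch.distributed.launch):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Minimal single-process setup so the sketch is self-contained.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = DDP(torch.nn.Linear(1024, 1024).cuda(0), device_ids=[0])

# Register the PowerSGD hook: gradients are compressed with a low-rank
# approximation before the allreduce. Only one hook can be registered per model.
state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
model.register_comm_hook(state, powerSGD.powerSGD_hook)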

Additional Prototype Features for Distributed Training

In addition to the major stable and beta distributed training features in this release, we also have a number of prototype features available in our nightlies to try out and provide feedback. We have linked in the draft docs below for reference:

  • (Prototype) ZeroRedundancyOptimizer – Based on and in partnership with the Microsoft DeepSpeed team, this feature helps reduce per-process memory footprint by sharding optimizer states across all participating processes in the ProcessGroup gang. Refer to this documentation for more details.
  • (Prototype) Process Group NCCL Send/Recv – The NCCL send/recv API was introduced in v2.7 and this feature adds support for it in NCCL process groups. This gives users the option to implement collective operations at the Python layer instead of the C++ layer (see the sketch after this list). Refer to this documentation and code examples to learn more.
  • (Prototype) CUDA-support in RPC using TensorPipe – This feature should bring consequent speed improvements for users of PyTorch RPC with multiple-GPU machines, as TensorPipe will automatically leverage NVLink when available, and avoid costly copies to and from host memory when exchanging GPU tensors between processes. When not on the same machine, TensorPipe will fall back to copying the tensor to host memory and sending it as a regular CPU tensor. This will also improve the user experience as users will be able to treat GPU tensors like regular CPU tensors in their code. Refer to this documentation for more details.
  • (Prototype) Remote Module – This feature allows users to operate a module on a remote worker like using a local module, where the RPCs are transparent to the user. In the past, this functionality was implemented in an ad-hoc way and overall this feature will improve the usability of model parallelism on PyTorch. Refer to this documentation for more details.
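
For the NCCL send/recv item above, a minimal sketch of a two-rank point-to-point exchange (assumed to be launched with one process per GPU and the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables set, e.g. via torch.distributed.launch):

import torch
import torch.distributed as dist

# Each process reads its rank and world size from the environment.
dist.init_process_group("nccl", init_method="env://")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.zeros(4, device="cuda")
if rank == 0:
    tensor += 1
    dist.send(tensor, dst=1)   # point-to-point send over the NCCL process group
elif rank == 1:
    dist.recv(tensor, src=0)   # blocks until the tensor arrives from rank 0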

PyTorch Mobile

Support for PyTorch Mobile is expanding with a new set of tutorials to help new users launch models on-device more quickly and give existing users tools to get more out of the framework.

Our new demo apps also include examples of image segmentation, object detection, neural machine translation, question answering, and vision transformers. They are available on both iOS and Android.

In addition to performance improvements on CPU for MobileNetV3 and other models, we also revamped our Android GPU backend prototype for broader model coverage and faster inference.

Lastly, we are launching the PyTorch Mobile Lite Interpreter as a prototype feature in this release. The Lite Interpreter allows users to reduce the runtime binary size. Please try these out and send us your feedback on the PyTorch Forums. All of our latest updates can be found on the PyTorch Mobile page.

[Prototype] PyTorch Mobile Lite Interpreter

PyTorch Lite Interpreter is a streamlined version of the PyTorch runtime that can execute PyTorch programs on resource-constrained devices with a reduced binary size footprint. This prototype feature reduces binary sizes by up to 70% compared to the current on-device runtime.

Performance Optimization

In 1.8, we are releasing support for benchmark utils to enable users to better monitor performance. We are also opening up a new automated quantization API. See the details below:

(Beta) Benchmark utils

Benchmark utils allows users to take accurate performance measurements and provides composable tools to help with both benchmark formulation and post-processing. This is expected to help PyTorch contributors quickly understand how their contributions impact PyTorch performance.

Example:

from torch.utils.benchmark import Timer

results = []
for num_threads in [1, 2, 4]:
    timer = Timer(
        stmt="torch.add(x, y, out=out)",
        setup="""
            n = 1024
            x = torch.ones((n, n))
            y = torch.ones((n, 1))
            out = torch.empty((n, n))
        """,
        num_threads=num_threads,
    )
    results.append(timer.blocked_autorange(min_run_time=5))
    print(
        f"{num_threads} thread{'s' if num_threads > 1 else ' ':<4}"
        f"{results[-1].median * 1e6:>4.0f} us   " +
        (f"({results[0].median / results[-1].median:.1f}x)" if num_threads > 1 else '')
    )

1 thread     376 us   
2 threads    189 us   (2.0x)
4 threads     99 us   (3.8x)

(Prototype) FX Graph Mode Quantization

FX Graph Mode Quantization is the new automated quantization API in PyTorch. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although you may need to refactor your model to make it compatible with FX Graph Mode Quantization (i.e., symbolically traceable with torch.fx).
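
As a hedged sketch of the post-training workflow (module paths follow the 1.8 prototype and may change; the toy model and calibration loop are illustrative):

import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}

# prepare_fx symbolically traces the model and inserts observers
prepared = prepare_fx(float_model, qconfig_dict)

# Calibrate with representative data so the observers record activation ranges
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(4, 16))

# convert_fx swaps observed ops for their quantized counterparts
quantized = convert_fx(prepared)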

Hardware Support

[Beta] Ability to Extend the PyTorch Dispatcher for a new backend in C++

In PyTorch 1.8, you can now create new out-of-tree devices that live outside the pytorch/pytorch repo. The tutorial linked below shows how to register your device and keep it in sync with native PyTorch devices.

[Beta] AMD GPU Binaries Now Available

Starting in PyTorch 1.8, we have added support for ROCm wheels, providing easy onboarding for using AMD GPUs. Simply go to the standard PyTorch installation selector, choose ROCm as the installation option, and execute the provided command.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch
