Introducing PyTorch Profiler – the new and improved performance tool

March 25, 2021

by Maxim Lukiyanov - Principal PM at Microsoft, Guoliang Hua - Principal Engineering Manager at Microsoft, Geeta Chauhan - Partner Engineering Lead at Facebook, Gisle Dankel - Tech Lead at Facebook PyTorch

Along with PyTorch 1.8.1 release, we are excited to announce PyTorch Profiler – the new and improved performance debugging profiler for PyTorch. Developed as part of a collaboration between Microsoft and Facebook, the PyTorch Profiler is an open-source tool that enables accurate and efficient performance analysis and troubleshooting for large-scale deep learning models.

Analyzing and improving large-scale deep learning model performance is an ongoing challenge that grows in importance as the model sizes increase. For a long time, PyTorch users had a hard time solving this challenge due to the lack of available tools. There were standard performance debugging tools that provide GPU hardware level information but missed PyTorch-specific context of operations. In order to recover missed information, users needed to combine multiple tools together or manually add minimum correlation information to make sense of the data. There was also the autograd profiler (torch.autograd.profiler) which can capture information about PyTorch operations but does not capture detailed GPU hardware-level information and cannot provide support for visualization.

The new PyTorch Profiler (torch.profiler) is a tool that brings both types of information together and then builds experience that realizes the full potential of that information. This new profiler collects both GPU hardware and PyTorch related information, correlates them, performs automatic detection of bottlenecks in the model, and generates recommendations on how to resolve these bottlenecks. All of this information from the profiler is visualized for the user in TensorBoard. The new Profiler API is natively supported in PyTorch and delivers the simplest experience available to date where users can profile their models without installing any additional packages and see results immediately in TensorBoard with the new PyTorch Profiler plugin. Below is the screenshot of PyTorch Profiler – automatic bottleneck detection.

Getting started

PyTorch Profiler is the next version of the PyTorch autograd profiler. It has a new module namespace torch.profiler but maintains compatibility with autograd profiler APIs. The Profiler uses a new GPU profiling engine, built using Nvidia CUPTI APIs, and is able to capture GPU kernel events with high fidelity. To profile your model training loop, wrap the code in the profiler context manager as shown below.

 with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=2,
        warmup=2,
        active=6,
        repeat=1),
    on_trace_ready=tensorboard_trace_handler,
    with_trace=True
) as profiler:
    for step, data in enumerate(trainloader, 0):
        print("step:{}".format(step))
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        profiler.step()

The schedule parameter allows you to limit the number of training steps included in the profile to reduce the amount of data collected and simplify visual analysis by focusing on what’s important. The tensorboard_trace_handler automatically saves profiling results to disk for analysis in TensorBoard.

To view results of the profiling session in TensorBoard, install PyTorch Profiler TensorBoard Plugin package.

pip install torch_tb_profiler

Visual Studio Code Integration

Microsoft Visual Studio Code is one of the most popular code editors for Python developers and data scientists. The Python extension for VS Code recently added the integration of TensorBoard into the code editor, including support for the PyTorch Profiler. Once you have VS Code and the Python extension installed, you can quickly open the TensorBoard Profiler plugin by launching the Command Palette using the keyboard shortcut CTRL + SHIFT + P (CMD + SHIFT + P on a Mac) and typing the “Launch TensorBoard” command.

This integration comes with a built-in lifecycle management feature. VS Code will install the TensorBoard package and the PyTorch Profiler plugin package (coming in mid-April) automatically if you don’t have them on your system. VS Code will also launch TensorBoard process for you and automatically look for any TensorBoard log files within your current directory. When you’re done, just close the tab and VS Code will automatically close the process. No more Terminal windows running on your system to provide a backend for the TensorBoard UI! Below is PyTorch Profiler Trace View running in TensorBoard.

Learn more about TensorBoard support in VS Code in this blog.

Feedback

Review PyTorch Profiler documentation, give Profiler a try and let us know about your experience. Provide your feedback on PyTorch Discussion Forum or file issues on PyTorch GitHub.

PyTorch for AMD ROCm™ Platform now available as Python package

March 24, 2021

by Niles Burbank – Director PM at AMD, Mayank Daga – Director, Deep Learning Software at AMD PyTorch

With the PyTorch 1.8 release, we are delighted to announce a new installation option for users of
PyTorch on the ROCm™ open software platform. An installable Python package is now hosted on
pytorch.org, along with instructions for local installation in the same simple, selectable format as
PyTorch packages for CPU-only configurations and other GPU platforms. PyTorch on ROCm includes full
capability for mixed-precision and large-scale training using AMD’s MIOpen & RCCL libraries. This
provides a new option for data scientists, researchers, students, and others in the community to get
started with accelerated PyTorch using AMD GPUs.

The ROCm Ecosystem

ROCm is AMD’s open source software platform for GPU-accelerated high performance computing and
machine learning. Since the original ROCm release in 2016, the ROCm platform has evolved to support
additional libraries and tools, a wider set of Linux® distributions, and a range of new GPUs. This includes
the AMD Instinct™ MI100, the first GPU based on AMD CDNA™ architecture.

The ROCm ecosystem has an established history of support for PyTorch, which was initially implemented
as a fork of the PyTorch project, and more recently through ROCm support in the upstream PyTorch
code. PyTorch users can install PyTorch for ROCm using AMD’s public PyTorch docker image, and can of
course build PyTorch for ROCm from source. With PyTorch 1.8, these existing installation options are
now complemented by the availability of an installable Python package.

The primary focus of ROCm has always been high performance computing at scale. The combined
capabilities of ROCm and AMD’s Instinct family of data center GPUs are particularly suited to the
challenges of HPC at data center scale. PyTorch is a natural fit for this environment, as HPC and ML
workflows become more intertwined.

Getting started with PyTorch for ROCm

The scope for this build of PyTorch is AMD GPUs with ROCm support, running on Linux. The GPUs
supported by ROCm include all of AMD’s Instinct family of compute-focused data center GPUs, along
with some other select GPUs. A current list of supported GPUs can be found in the ROCm Github
repository. After confirming that the target system includes supported GPUs and the current 4.0.1
release of ROCm, installation of PyTorch follows the same simple Pip-based installation as any other
Python package. As with PyTorch builds for other platforms, the configurator at https://pytorch.org/getstarted/locally/ provides the specific command line to be run.

PyTorch for ROCm is built from the upstream PyTorch repository, and is a full featured implementation.
Notably, it includes support for distributed training across multiple GPUs and supports accelerated
mixed precision training.

More information

A list of ROCm supported GPUs and operating systems can be found at
https://github.com/RadeonOpenCompute/ROCm
General documentation on the ROCm platform is available at https://rocmdocs.amd.com/en/latest/
ROCm Learning Center at https://developer.amd.com/resources/rocm-resources/rocm-learning-center/ General information on AMD’s offerings for HPC and ML can be found at https://amd.com/hpc

Feedback

An engaged user base is a tremendously important part of the PyTorch ecosystem. We would be deeply
appreciative of feedback on the PyTorch for ROCm experience in the PyTorch discussion forum and, where appropriate, reporting any issues via Github.

Announcing PyTorch Ecosystem Day

March 9, 2021

by Team PyTorch PyTorch

We’re proud to announce our first PyTorch Ecosystem Day. The virtual, one-day event will focus completely on our Ecosystem and Industry PyTorch communities!

PyTorch is a deep learning framework of choice for academics and companies, all thanks to its rich ecosystem of tools and strong community. As with our developers, our ecosystem partners play a pivotal role in the development and growth of the community.

We will be hosting our first PyTorch Ecosystem Day, a virtual event designed for our ecosystem and industry communities to showcase their work and discover new opportunities to collaborate.

PyTorch Ecosystem Day will be held on April 21, with both a morning and evening session, to ensure we reach our global community. Join us virtually for a day filled with discussions on new developments, trends, challenges, and best practices through keynotes, breakout sessions, and a unique networking opportunity hosted through Gather.Town .

Event Details

April 21, 2021 (Pacific Time)
Fully digital experience

Morning Session: (EMEA)
Opening Talks – 8:00 am-9:00 am PT
Poster Exhibition & Breakout Sessions – 9:00 am-12:00 pm PT
Evening Session (APAC/US)
Opening Talks – 3:00 pm-4:00 pm PT
Poster Exhibition & Breakout Sessions – 3:00 pm-6:00 pm PT
Networking – 9:00 am-7:00 pm PT

There are two ways to participate in PyTorch Ecosystem Day:

Poster Exhibition from the PyTorch ecosystem and industry communities covering a variety of topics. Posters are available for viewing throughout the duration of the event. To be part of the poster exhibition, please see below for submission details. If your poster is accepted, we highly recommend tending your poster during one of the morning or evening sessions or both!
Breakout Sessions are 40-min sessions freely designed by the community. The breakouts can be talks, demos, tutorials or discussions. Note: you must have an accepted poster to apply for the breakout sessions.

Call for posters now open! Submit your proposal today! Please send us the title and summary of your projects, tools, and libraries that could benefit PyTorch researchers in academia and industry, application developers, and ML engineers for consideration. The focus must be on academic papers, machine learning research, or open-source projects. Please no sales pitches. Deadline for submission is March 18, 2021.

Visit pytorchecosystemday.fbreg.com for more information and we look forward to welcoming you to PyTorch Ecosystem Day on April 21st!

New PyTorch library releases including TorchVision Mobile, TorchAudio I/O, and more

March 4, 2021

by Team PyTorch PyTorch

Today, we are announcing updates to a number of PyTorch libraries, alongside the PyTorch 1.8 release. The updates include new releases for the domain libraries including TorchVision, TorchText and TorchAudio as well as new version of TorchCSPRNG. These releases include a number of new features and improvements and, along with the PyTorch 1.8 release, provide a broad set of updates for the PyTorch community to build on and leverage.

Some highlights include:

TorchVision – Added support for PyTorch Mobile including Detectron2Go (D2Go), auto-augmentation of data during training, on the fly type conversion, and AMP autocasting.
TorchAudio – Major improvements to I/O, including defaulting to sox_io backend and file-like object support. Added Kaldi Pitch feature and support for CMake based build allowing TorchAudio to better support no-Python environments.
TorchText – Updated the dataset loading API to be compatible with standard PyTorch data loading utilities.
TorchCSPRNG – Support for cryptographically secure pseudorandom number generators for PyTorch is now stable with new APIs for AES128 ECB/CTR and CUDA support on Windows.

Please note that, starting in PyTorch 1.6, features are classified as Stable, Beta, and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can see the detailed announcement here.

TorchVision 0.9.0

[Stable] TorchVision Mobile: Operators, Android Binaries, and Tutorial

We are excited to announce the first on-device support and binaries for a PyTorch domain library. We have seen significant appetite in both research and industry for on-device vision support to allow low latency, privacy friendly, and resource efficient mobile vision experiences. You can follow this new tutorial to build your own Android object detection app using TorchVision operators, D2Go, or your own customer operators and model.

[Stable] New Mobile models for Classification, Object Detection and Semantic Segmentation

We have added support for the MobileNetV3 architecture and provided pre-trained weights for Classification, Object Detection and Segmentation. It is easy to get up and running with these models, just import and load them as you would any torchvision model:

import torch
import torchvision

# Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Quantized Classification
x = torch.rand(1, 3, 224, 224)
m_classifier = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
m_classifier.eval()
predictions = m_classifier(x)

# Object Detection: Highly Accurate High Resolution Mobile Model
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
m_detector = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
m_detector.eval()
predictions = m_detector(x)

# Semantic Segmentation: Highly Accurate Mobile Model
x = torch.rand(1, 3, 520, 520)
m_segmenter = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
m_segmenter.eval()
predictions = m_segmenter(x)

These models are highly competitive with TorchVision’s existing models on resource efficiency, speed, and accuracy. See our release notes for detailed performance metrics.

[Stable] AutoAugment

AutoAugment is a common Data Augmentation technique that can increase the accuracy of Scene Classification models. Though the data augmentation policies are directly linked to their trained dataset, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. We’ve implemented 3 policies learned on the following datasets: ImageNet, CIFA10 and SVHN. These can be used standalone or mixed-and-matched with existing transforms:

from torchvision import transforms

t = transforms.AutoAugment()
transformed = t(image)


transform=transforms.Compose([
   transforms.Resize(256),
   transforms.AutoAugment(),
   transforms.ToTensor()])

Other New Features for TorchVision

[Stable] All read and decode methods in the io.image package now support:
- Palette, Grayscale Alpha and RBG Alpha image types during PNG decoding
- On-the-fly conversion of image from one type to the other during read
[Stable] WiderFace dataset
[Stable] Improved FasterRCNN speed and accuracy by introducing a score threshold on RPN
[Stable] Modulation input for DeformConv2D
[Stable] Option to write audio to a video file
[Stable] Utility to draw bounding boxes
[Beta] Autocast support in all Operators
Find the full TorchVision release notes here.

TorchAudio 0.8.0

I/O Improvements

We have continued our work from the previous release to improve TorchAudio’s I/O support, including:

[Stable] Changing the default backend to “sox_io” (for Linux/macOS), and updating the “soundfile” backend’s interface to align with that of “sox_io”. The legacy backend and interface are still accessible, though it is strongly discouraged to use them.
[Stable] File-like object support in both “sox_io” backend, “soundfile” backend and sox_effects.
[Stable] New options to change the format, encoding, and bits_per_sample when saving.
[Stable] Added GSM, HTK, AMB, AMR-NB and AMR-WB format support to the “sox_io” backend.
[Beta] A new functional.apply_codec function which can degrade audio data by applying audio codecs supported by “sox_io” backend in an in-memory fashion.
Here are some examples of features landed in this release:

# Load audio over HTTP
with requests.get(URL, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)
 
# Saving to Bytes buffer as 32-bit floating-point PCM
buffer_ = io.BytesIO()
torchaudio.save(
    buffer_, waveform, sample_rate,
    format="wav", encoding="PCM_S", bits_per_sample=16)
 
# Apply effects while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effect_file(
    response['Body'],
    [["lowpass", "-1", "300"], ["rate", "8000"]])
 
# Apply GSM codec to Tensor
encoded = torchaudio.functional.apply_codec(
    waveform, sample_rate, format="gsm")

Check out the revamped audio preprocessing tutorial, Audio Manipulation with TorchAudio.

[Stable] Switch to CMake-based build

In the previous version of TorchAudio, it was utilizing CMake to build third party dependencies. Starting in 0.8.0, TorchaAudio uses CMake to build its C++ extension. This will open the door to integrate TorchAudio in non-Python environments (such as C++ applications and mobile). We will continue working on adding example applications and mobile integrations.

[Beta] Improved and New Audio Transforms

We have added two widely requested operators in this release: the SpectralCentroid transform and the Kaldi Pitch feature extraction (detailed in “A pitch extraction algorithm tuned for automatic speech recognition”). We’ve also exposed a normalization method to Mel transforms, and additional STFT arguments to Spectrogram. We would like to ask our community to continue to raise feature requests for core audio processing features like these!

Community Contributions

We had more contributions from the open source community in this release than ever before, including several completely new features. We would like to extend our sincere thanks to the community. Please check out the newly added CONTRIBUTING.md for ways to contribute code, and remember that reporting bugs and requesting features are just as valuable. We will continue posting well-scoped work items as issues labeled “help-wanted” and “contributions-welcome” for anyone who would like to contribute code, and are happy to coach new contributors through the contribution process.

Find the full TorchAudio release notes here.

TorchText 0.9.0

[Beta] Dataset API Updates

In this release, we are updating TorchText’s dataset API to be compatible with PyTorch data utilities, such as DataLoader, and are deprecating TorchText’s custom data abstractions such as Field. The updated datasets are simple string-by-string iterators over the data. For guidance about migrating from the legacy abstractions to use modern PyTorch data utilities, please refer to our migration guide.

The text datasets listed below have been updated as part of this work. For examples of how to use these datasets, please refer to our end-to-end text classification tutorial.

Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
Sequence tagging: UDPOS, CoNLL2000Chunking
Translation: IWSLT2016, IWSLT2017
Question answer: SQuAD1, SQuAD2

Find the full TorchText release notes here.

[Stable] TorchCSPRNG 0.2.0

We released TorchCSPRNG in August 2020, a PyTorch C++/CUDA extension that provides cryptographically secure pseudorandom number generators for PyTorch. Today, we are releasing the 0.2.0 version and designating the library as stable. This release includes a new API for encrypt/decrypt with AES128 ECB/CTR as well as CUDA 11 and Windows CUDA support.

Find the full TorchCSPRNG release notes here.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

PyTorch 1.8 Release, including Compiler and Distributed Training updates, and New Mobile Tutorials

March 4, 2021

by Team PyTorch PyTorch

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

Support for doing python to python functional transformations via torch.fx;
Added or stabilized APIs to support FFTs (torch.fft), Linear Algebra functions (torch.linalg), added support for autograd for complex tensors and updates to improve performance for calculating hessians and jacobians; and
Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.
See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

New and Updated APIs

The PyTorch 1.8 release brings a host of new and updated API surfaces ranging from additional APIs for NumPy compatibility, also support for ways to improve and scale your code for performance at both inference and training time. Here is a brief summary of the major features coming in this release:

[Stable] `Torch.fft` support for high performance NumPy style FFTs

As part of PyTorch’s goal to support scientific computing, we have invested in improving our FFT support and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for hardware acceleration and autograd.

See this blog post for more details
Documentation

[Beta] Support for NumPy style linear algebra functions via `torch.linalg`

The torch.linalg module, modeled after NumPy’s np.linalg module, brings NumPy-style support for common linear algebra operations including Cholesky decompositions, determinants, eigenvalues and many others.

Documentation

[Beta] Python code Transformations with FX

FX allows you to write transformations of the form transform(input_module : nn.Module) -> nn.Module, where you can feed in a Module instance and get a transformed Module instance out of it.

This kind of functionality is applicable in many scenarios. For example, the FX-based Graph Mode Quantization product is releasing as a prototype contemporaneously with FX. Graph Mode Quantization automates the process of quantizing a neural net and does so by leveraging FX’s program capture, analysis and transformation facilities. We are also developing many other transformation products with FX and we are excited to share this powerful toolkit with the community.

Because FX transforms consume and produce nn.Module instances, they can be used within many existing PyTorch workflows. This includes workflows that, for example, train in Python then deploy via TorchScript.

Below is an FX transform example:

import torch
import torch.fx

def transform(m: nn.Module,
             tracer_class : type = torch.fx.Tracer) -> torch.nn.Module:
   # Step 1: Acquire a Graph representing the code in `m`
  
   # NOTE: torch.fx.symbolic_trace is a wrapper around a call to
   # fx.Tracer.trace and constructing a GraphModule. We'll
   # split that out in our transform to allow the caller to
   # customize tracing behavior.
   graph : torch.fx.Graph = tracer_class().trace(m)
  
   # Step 2: Modify this Graph or create a new one
   graph = ...
  
   # Step 3: Construct a Module to return
   return torch.fx.GraphModule(m, graph)

You can read more about FX in the official documentation. You can also find several examples of program transformations implemented using torch.fx here. We are constantly improving FX and invite you to share any feedback you have about the toolkit on the forums or issue tracker.

Distributed Training

The PyTorch 1.8 release added a number of new features as well as improvements to reliability and usability. Concretely, support for: Stable level async error/timeout handling was added to improve NCCL reliability; and stable support for RPC based profiling. Additionally, we have added support for pipeline parallelism as well as gradient compression through the use of communication hooks in DDP. Details are below:

[Beta] Pipeline Parallelism

As machine learning models continue to grow in size, traditional Distributed DataParallel (DDP) training no longer scales as these models don’t fit on a single GPU device. The new pipeline parallelism feature provides an easy to use PyTorch API to leverage pipeline parallelism as part of your training loop.

[Beta] DDP Communication Hook

The DDP communication hook is a generic interface to control how to communicate gradients across workers by overriding the vanilla allreduce in DistributedDataParallel. A few built-in communication hooks are provided including PowerSGD, and users can easily apply any of these hooks to optimize communication. Additionally, the communication hook interface can also support user-defined communication strategies for more advanced use cases.

Additional Prototype Features for Distributed Training

In addition to the major stable and beta distributed training features in this release, we also have a number of prototype features available in our nightlies to try out and provide feedback. We have linked in the draft docs below for reference:

(Prototype) ZeroRedundancyOptimizer – Based on and in partnership with the Microsoft DeepSpeed team, this feature helps reduce per-process memory footprint by sharding optimizer states across all participating processes in the ProcessGroup gang. Refer to this documentation for more details.
(Prototype) Process Group NCCL Send/Recv – The NCCL send/recv API was introduced in v2.7 and this feature adds support for it in NCCL process groups. This feature will provide an option for users to implement collective operations at Python layer instead of C++ layer. Refer to this documentation and code examples to learn more.
(Prototype) CUDA-support in RPC using TensorPipe – This feature should bring consequent speed improvements for users of PyTorch RPC with multiple-GPU machines, as TensorPipe will automatically leverage NVLink when available, and avoid costly copies to and from host memory when exchanging GPU tensors between processes. When not on the same machine, TensorPipe will fall back to copying the tensor to host memory and sending it as a regular CPU tensor. This will also improve the user experience as users will be able to treat GPU tensors like regular CPU tensors in their code. Refer to this documentation for more details.
(Prototype) Remote Module – This feature allows users to operate a module on a remote worker like using a local module, where the RPCs are transparent to the user. In the past, this functionality was implemented in an ad-hoc way and overall this feature will improve the usability of model parallelism on PyTorch. Refer to this documentation for more details.

PyTorch Mobile

Support for PyTorch Mobile is expanding with a new set of tutorials to help new users launch models on-device quicker and give existing users a tool to get more out of our framework. These include:

Our new demo apps also include examples of image segmentation, object detection, neural machine translation, question answering, and vision transformers. They are available on both iOS and Android:

In addition to performance improvements on CPU for MobileNetV3 and other models, we also revamped our Android GPU backend prototype for broader models coverage and faster inferencing:

Android tutorial

Lastly, we are launching the PyTorch Mobile Lite Interpreter as a prototype feature in this release. The Lite Interpreter allows users to reduce the runtime binary size. Please try these out and send us your feedback on the PyTorch Forums. All our latest updates can be found on the PyTorch Mobile page

[Prototype] PyTorch Mobile Lite Interpreter

PyTorch Lite Interpreter is a streamlined version of the PyTorch runtime that can execute PyTorch programs in resource constrained devices, with reduced binary size footprint. This prototype feature reduces binary sizes by up to 70% compared to the current on-device runtime in the current release.

iOS/Android Tutorial

Performance Optimization

In 1.8, we are releasing the support for benchmark utils to enable users to better monitor performance. We are also opening up a new automated quantization API. See the details below:

(Beta) Benchmark utils

Benchmark utils allows users to take accurate performance measurements, and provides composable tools to help with both benchmark formulation and post processing. This expected to be helpful for contributors to PyTorch to quickly understand how their contributions are impacting PyTorch performance.

Example:

from torch.utils.benchmark import Timer

results = []
for num_threads in [1, 2, 4]:
    timer = Timer(
        stmt="torch.add(x, y, out=out)",
        setup="""
            n = 1024
            x = torch.ones((n, n))
            y = torch.ones((n, 1))
            out = torch.empty((n, n))
        """,
        num_threads=num_threads,
    )
    results.append(timer.blocked_autorange(min_run_time=5))
    print(
        f"{num_threads} thread{'s' if num_threads > 1 else ' ':<4}"
        f"{results[-1].median * 1e6:>4.0f} us   " +
        (f"({results[0].median / results[-1].median:.1f}x)" if num_threads > 1 else '')
    )

1 thread     376 us   
2 threads    189 us   (2.0x)
4 threads     99 us   (3.8x)

(Prototype) FX Graph Mode Quantization

FX Graph Mode Quantization is the new automated quantization API in PyTorch. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although people might need to refactor the model to make the model compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx).

Hardware Support

[Beta] Ability to Extend the PyTorch Dispatcher for a new backend in C++

In PyTorch 1.8, you can now create new out-of-tree devices that live outside the pytorch/pytorch repo. The tutorial linked below shows how to register your device and keep it in sync with native PyTorch devices.

Tutorial

[Beta] AMD GPU Binaries Now Available

Starting in PyTorch 1.8, we have added support for ROCm wheels providing an easy onboarding to using AMD GPUs. You can simply go to the standard PyTorch installation selector and choose ROCm as an installation option and execute the provided command.

Thanks for reading, and if you are excited about these updates and want to participate in the future of PyTorch, we encourage you to join the discussion forums and open GitHub issues.

Cheers!

Team PyTorch

The torch.fft module: Accelerated Fast Fourier Transforms with Autograd in PyTorch

March 4, 2021

by Mike Ruberry, Peter Bell, and Joe Spisak PyTorch

The Fast Fourier Transform (FFT) calculates the Discrete Fourier Transform in O(n log n) time. It is foundational to a wide variety of numerical algorithms and signal processing techniques since it makes working in signals’ “frequency domains” as tractable as working in their spatial or temporal domains.

As part of PyTorch’s goal to support hardware-accelerated deep learning and scientific computing, we have invested in improving our FFT support, and with PyTorch 1.8, we are releasing the torch.fft module. This module implements the same functions as NumPy’s np.fft module, but with support for accelerators, like GPUs, and autograd.

Getting started

Getting started with the new torch.fft module is easy whether you are familiar with NumPy’s np.fft module or not. While complete documentation for each function in the module can be found here, a breakdown of what it offers is:

fft, which computes a complex FFT over a single dimension, and ifft, its inverse
the more general fftn and ifftn, which support multiple dimensions
The “real” FFT functions, rfft, irfft, rfftn, irfftn, designed to work with signals that are real-valued in their time domains
The “Hermitian” FFT functions, hfft and ihfft, designed to work with signals that are real-valued in their frequency domains
Helper functions, like fftfreq, rfftfreq, fftshift, ifftshift, that make it easier to manipulate signals

We think these functions provide a straightforward interface for FFT functionality, as vetted by the NumPy community, although we are always interested in feedback and suggestions!

To better illustrate how easy it is to move from NumPy’s np.fft module to PyTorch’s torch.fft module, let’s look at a NumPy implementation of a simple low-pass filter that removes high-frequency variance from a 2-dimensional image, a form of noise reduction or blurring:

import numpy as np
import numpy.fft as fft

def lowpass_np(input, limit):
    pass1 = np.abs(fft.rfftfreq(input.shape[-1])) < limit
    pass2 = np.abs(fft.fftfreq(input.shape[-2])) < limit
    kernel = np.outer(pass2, pass1)
    
    fft_input = fft.rfft2(input)
    return fft.irfft2(fft_input * kernel, s=input.shape[-2:])

Now let’s see the same filter implemented in PyTorch:

import torch
import torch.fft as fft

def lowpass_torch(input, limit):
    pass1 = torch.abs(fft.rfftfreq(input.shape[-1])) < limit
    pass2 = torch.abs(fft.fftfreq(input.shape[-2])) < limit
    kernel = torch.outer(pass2, pass1)
    
    fft_input = fft.rfft2(input)
    return fft.irfft2(fft_input * kernel, s=input.shape[-2:])

Not only do current uses of NumPy’s np.fft module translate directly to torch.fft, the torch.fft operations also support tensors on accelerators, like GPUs and autograd. This makes it possible to (among other things) develop new neural network modules using the FFT.

Performance

The torch.fft module is not only easy to use — it is also fast! PyTorch natively supports Intel’s MKL-FFT library on Intel CPUs, and NVIDIA’s cuFFT library on CUDA devices, and we have carefully optimized how we use those libraries to maximize performance. While your own results will depend on your CPU and CUDA hardware, computing Fast Fourier Transforms on CUDA devices can be many times faster than computing it on the CPU, especially for larger signals.

In the future, we may add support for additional math libraries to support more hardware. See below for where you can request additional hardware support.

Updating from older PyTorch versions

Some PyTorch users might know that older versions of PyTorch also offered FFT functionality with the torch.fft() function. Unfortunately, this function had to be removed because its name conflicted with the new module’s name, and we think the new functionality is the best way to use the Fast Fourier Transform in PyTorch. In particular, torch.fft() was developed before PyTorch supported complex tensors, while the torch.fft module was designed to work with them.

PyTorch also has a “Short Time Fourier Transform”, torch.stft, and its inverse torch.istft. These functions are being kept but updated to support complex tensors.

Future

As mentioned, PyTorch 1.8 offers the torch.fft module, which makes it easy to use the Fast Fourier Transform (FFT) on accelerators and with support for autograd. We encourage you to try it out!

While this module has been modeled after NumPy’s np.fft module so far, we are not stopping there. We are eager to hear from you, our community, on what FFT-related functionality you need, and we encourage you to create posts on our forums at https://discuss.pytorch.org/, or file issues on our Github with your feedback and requests. Early adopters have already started asking about Discrete Cosine Transforms and support for more hardware platforms, for example, and we are investigating those features now.

We look forward to hearing from you and seeing what the community does with PyTorch’s new FFT functionality!

Prototype Features Now Available – APIs for Hardware Accelerated Mobile and ARM64 Builds

November 12, 2020

by Team PyTorch PyTorch

Today, we are announcing four PyTorch prototype features. The first three of these will enable Mobile machine-learning developers to execute models on the full set of hardware (HW) engines making up a system-on-chip (SOC). This gives developers options to optimize their model execution for unique performance, power, and system-level concurrency.

These features include enabling execution on the following on-device HW engines:

DSP and NPUs using the Android Neural Networks API (NNAPI), developed in collaboration with Google
GPU execution on Android via Vulkan
GPU execution on iOS via Metal

This release also includes developer efficiency benefits with newly introduced support for ARM64 builds for Linux.

Below, you’ll find brief descriptions of each feature with the links to get you started. These features are available through our nightly builds. Reach out to us on the PyTorch Forums for any comment or feedback. We would love to get your feedback on those and hear how you are using them!

NNAPI Support with Google Android

The Google Android and PyTorch teams collaborated to enable support for Android’s Neural Networks API (NNAPI) via PyTorch Mobile. Developers can now unlock high-performance execution on Android phones as their machine-learning models will be able to access additional hardware blocks on the phone’s system-on-chip. NNAPI allows Android apps to run computationally intensive neural networks on the most powerful and efficient parts of the chips that power mobile phones, including DSPs (Digital Signal Processors) and NPUs (specialized Neural Processing Units). The API was introduced in Android 8 (Oreo) and significantly expanded in Android 10 and 11 to support a richer set of AI models. With this integration, developers can now seamlessly access NNAPI directly from PyTorch Mobile. This initial release includes fully-functional support for a core set of features and operators, and Google and Facebook will be working to expand capabilities in the coming months.

Links

PyTorch Mobile GPU support

Inferencing on GPU can provide great performance on many models types, especially those utilizing high-precision floating-point math. Leveraging the GPU for ML model execution as those found in SOCs from Qualcomm, Mediatek, and Apple allows for CPU-offload, freeing up the Mobile CPU for non-ML use cases. This initial prototype level support provided for on device GPUs is via the Metal API specification for iOS, and the Vulkan API specification for Android. As this feature is in an early stage: performance is not optimized and model coverage is limited. We expect this to improve significantly over the course of 2021 and would like to hear from you which models and devices you would like to see performance improvements on.

Links

Prototype source workflows

ARM64 Builds for Linux

We will now provide prototype level PyTorch builds for ARM64 devices on Linux. As we see more ARM usage in our community with platforms such as Raspberry Pis and Graviton(2) instances spanning both at the edge and on servers respectively. This feature is available through our nightly builds.

We value your feedback on these features and look forward to collaborating with you to continuously improve them further!

Thank you,

Team PyTorch

Announcing PyTorch Developer Day 2020

November 1, 2020

by Team PyTorch PyTorch

Starting this year, we plan to host two separate events for PyTorch: one for developers and users to discuss core technical development, ideas and roadmaps called “Developer Day”, and another for the PyTorch ecosystem and industry communities to showcase their work and discover opportunities to collaborate called “Ecosystem Day” (scheduled for early 2021).

The PyTorch Developer Day (#PTD2) is kicking off on November 12, 2020, 8AM PST with a full day of technical talks on a variety of topics, including updates to the core framework, new tools and libraries to support development across a variety of domains. You’ll also see talks covering the latest research around systems and tooling in ML.

For Developer Day, we have an online networking event limited to people composed of PyTorch maintainers and contributors, long-time stakeholders and experts in areas relevant to PyTorch’s future. Conversations from the networking event will strongly shape the future of PyTorch. Hence, invitations are required to attend the networking event.

All talks will be livestreamed and available to the public.

Visit https://pytorchdeveloperday.fbreg.com/ to learn more. We look forward to welcoming you to PyTorch Developer Day on November 12th!

Thank you,

The PyTorch team

Adding a Contributor License Agreement for PyTorch

October 28, 2020

by Team PyTorch PyTorch

To ensure the ongoing growth and success of the framework, we’re introducing the use of the Apache Contributor License Agreement (CLA) for PyTorch. We care deeply about the broad community of contributors who make PyTorch such a great framework, so we want to take a moment to explain why we are adding a CLA.

Why Does PyTorch Need a CLA?

CLAs help clarify that users and maintainers have the relevant rights to use and maintain code contributed to an open source project, while allowing contributors to retain ownership rights to their code.

PyTorch has grown from a small group of enthusiasts to a now global community with over 1,600 contributors from dozens of countries, each bringing their own diverse perspectives, values and approaches to collaboration. Looking forward, clarity about how this collaboration is happening is an important milestone for the framework as we continue to build a stronger, safer and more scalable community around PyTorch.

The text of the Apache CLA can be found here, together with an accompanying FAQ. The language in the PyTorch CLA is identical to the Apache template. Although CLAs have been the subject of significant discussion in the open source community, we are seeing that using a CLA, and particularly the Apache CLA, is now standard practice when projects and communities reach a certain scale. Popular projects that have adopted some type of CLA include: Visual Studio Code, Flutter, TensorFlow, kubernetes, Ubuntu, Django, Python, Go, Android and many others.

What is Not Changing

PyTorch’s BSD license is not changing. There is no impact to PyTorch users. CLAs will only be required for new contributions to the project. For past contributions, no action is necessary. Everything else stays the same, whether it’s IP ownership, workflows, contributor roles or anything else that you’ve come to expect from PyTorch.

How the New CLA will Work

Moving forward, all contributors to projects under the PyTorch GitHub organization will need to sign a CLA to merge their contributions.

If you’ve contributed to other Facebook Open Source projects, you may have already signed the CLA, and no action is required. If you have not signed the CLA, a GitHub check will prompt you to sign it before your pull requests can be merged. You can reach the CLA from this link.

If you’re contributing as an individual, meaning the code is not something you worked on as part of your job, you should sign the individual contributor agreement. This agreement associates your GitHub username with future contributions and only needs to be signed once.

If you’re contributing as part of your employment, you may need to sign the corporate contributor agreement. Check with your legal team on filling this out. Also you will include a list of github ids from your company.

As always, we continue to be humbled and grateful for all your support, and we look forward to scaling PyTorch together to even greater heights in the years to come.

Thank you!

Team PyTorch

PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more

October 27, 2020

by Team PyTorch PyTorch

Today, we’re announcing the availability of PyTorch 1.7, along with updated domain libraries. The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to stable including custom C++ Classes, the memory profiler, extensions via custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper.

A few of the highlights include:

CUDA 11 is now officially supported with binaries available at PyTorch.org
Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler
(Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft
(Prototype) Support for Nvidia A100 generation GPUs and native TF32 format
(Prototype) Distributed training on Windows now supported
torchvision
- (Stable) Transforms now support Tensor inputs, batch computation, GPU, and TorchScript
- (Stable) Native image I/O for JPEG and PNG formats
- (Beta) New Video Reader API
torchaudio
- (Stable) Added support for speech rec (wav2letter), text to speech (WaveRNN) and source separation (ConvTasNet)

To reiterate, starting PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement here. Note that the prototype features listed in this blog are available as part of this release.

Find the full release notes here.

Front End APIs

[Beta] NumPy Compatible torch.fft module

FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy.

This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function.

Example usage:

>>> import torch.fft
>>> t = torch.arange(4)
>>> t
tensor([0, 1, 2, 3])

>>> torch.fft.fft(t)
tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j])

>>> t = tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j])
>>> torch.fft.fft(t)
tensor([12.+16.j, -8.+0.j, -4.-4.j,  0.-8.j])

Documentation

[Beta] C++ Support for Transformer NN Modules

Since PyTorch 1.5, we’ve continued to maintain parity between the python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ Frontend. And moreover, developers no longer need to save a module from python/JIT and load into C++ as it can now be used it in C++ directly.

Documentation

[Beta] torch.set_deterministic

Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the torch.set_deterministic(bool) function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically by default.

More precisely, when this flag is true:

Operations known to not have a deterministic implementation throw a runtime error;
Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and
torch.backends.cudnn.deterministic = True is set.

Note that this is necessary, but not sufficient, for determinism within a single run of a PyTorch program. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior.

See the documentation for torch.set_deterministic(bool) for the list of affected operations.

Performance & Profiling

[Beta] Stack traces added to profiler

Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the autograd profiler as before but with optional new parameters: with_stack and group_by_stack_n. Caution: regular profiling runs should not use this feature as it adds significant overhead.

Distributed Training & RPC

[Stable] TorchElastic now bundled into PyTorch docker image

Torchelastic offers a strict superset of the current torch.distributed.launch CLI with the added features for fault-tolerance and elasticity. If the user is not be interested in fault-tolerance, they can get the exact functionality/behavior parity by setting max_restarts=0 with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).

By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with TorchElastic right-away without having to separately install torchelastic. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow’s distributed PyTorch operators.

Usage examples and how to get started

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset size across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different process. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.

[Beta] NCCL Reliability – Async Error/Timeout Handling

In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before).

[Beta] TorchScript `rpc_remote` and `rpc_sync`

torch.distributed.rpc.rpc_async has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality will be extended the remaining two core RPC APIs, torch.distributed.rpc.rpc_sync and torch.distributed.rpc.remote. This will complete the major RPC APIs targeted for support in TorchScript, it allows users to use the existing python RPC APIs within TorchScript (in a script function or script method, which releases the python Global Interpreter Lock) and could possibly improve application performance in multithreaded environment.

[Beta] Distributed optimizer with TorchScript support

PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the python API. However, users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. Distributed Model Parallel) or any RPC-based training application). Users couldn’t do this with with distributed optimizer before because we need to get rid of the python Global Interpreter Lock (GIL) limitation to achieve this.

In PyTorch 1.7, we are enabling the TorchScript support in distributed optimizer to remove the GIL, and make it possible to run optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before but it automatically converts optimizers within each worker into TorchScript to make each GIL free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading.

Currently, the only optimizer that supports automatic conversion with TorchScript is Adagrad and all other optimizers will still work as before without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers and expect more to come in future releases. The usage to enable TorchScript support is automatic and exactly the same with existing python APIs, here is an example of how to use this:

import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
  # Forward pass.
  rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
  rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
  loss = rref1.to_here() + rref2.to_here()

  # Backward pass.
  dist_autograd.backward(context_id, [loss.sum()])

  # Optimizer, pass in optim.Adagrad, DistributedOptimizer will
  # automatically convert/compile it to TorchScript (GIL-free)
  dist_optim = DistributedOptimizer(
     optim.Adagrad,
     [rref1, rref2],
     lr=0.05,
  )
  dist_optim.step(context_id)

[Beta] Enhancements to RPC-based Profiling

Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6. In PyTorch 1.7, the following enhancements have been made:

Implemented better support for profiling TorchScript functions over RPC
Achieved parity in terms of profiler features that work with RPC
Added support for asynchronous RPC functions on the server-side (functions decorated with rpc.functions.async_execution).

Users are now able to use familiar profiling tools such as with torch.autograd.profiler.profile() and with torch.autograd.profiler.record_function, and this works transparently with the RPC framework with full feature support, profiles asynchronous functions, and TorchScript functions.

[Prototype] Windows support for Distributed Training

PyTorch 1.7 brings prototype support for DistributedDataParallel and collective communications on the Windows platform. In this release, the support only covers Gloo-based ProcessGroup and FileStore.

To use this feature across multiple machines, please provide a file from a shared file system in init_process_group.

# initialize the process group
dist.init_process_group(
    "gloo",
    # multi-machine example:
    # init_method = "file://////{machine}/{share_folder}/file"
    init_method="file:///{your local file path}",
    rank=rank,
    world_size=world_size
)

model = DistributedDataParallel(local_model, device_ids=[rank])

Mobile

PyTorch Mobile supports both iOS and Android with binary packages available in Cocoapods and JCenter respectively. You can learn more about PyTorch Mobile here.

[Beta] PyTorch Mobile Caching allocator for performance improvements

On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults as PyTorch being a functional framework does not maintain state for the operators. Thus outputs are allocated dynamically on each execution of the op, for the most ops. To ameliorate performance penalties due to this, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor sizes and, is currently, available only via the PyTorch C++ API. The caching allocator itself is owned by client and thus the lifetime of the allocator is also maintained by client code. Such a client owned caching allocator can then be used with scoped guard, c10::WithCPUCachingAllocatorGuard, to enable the use of cached allocation within that scope.
Example usage:

#include <c10/mobile/CPUCachingAllocator.h>
.....
c10::CPUCachingAllocator caching_allocator;
  // Owned by client code. Can be a member of some client class so as to tie the
  // the lifetime of caching allocator to that of the class.
.....
{
  c10::optional<c10::WithCPUCachingAllocatorGuard> caching_allocator_guard;
  if (FLAGS_use_caching_allocator) {
    caching_allocator_guard.emplace(&caching_allocator);
  }
  ....
  model.forward(..);
}
...

NOTE: Caching allocator is only available on mobile builds, thus the use of caching allocator outside of mobile builds won’t be effective.

torchvision

[Stable] Transforms now support Tensor inputs, batch computation, GPU, and TorchScript

torchvision transforms are now inherited from nn.Module and can be torchscripted and applied on torch Tensor inputs as well as on PIL images. They also support Tensors with batch dimensions and work seamlessly on CPU/GPU devices:

import torch
import torchvision.transforms as T

# to fix random seed, use torch.manual_seed
# instead of random.seed
torch.manual_seed(12)

transforms = torch.nn.Sequential(
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.3),
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
)
scripted_transforms = torch.jit.script(transforms)
# Note: we can similarly use T.Compose to define transforms
# transforms = T.Compose([...]) and 
# scripted_transforms = torch.jit.script(torch.nn.Sequential(*transforms.transforms))

tensor_image = torch.randint(0, 256, size=(3, 256, 256), dtype=torch.uint8)
# works directly on Tensors
out_image1 = transforms(tensor_image)
# on the GPU
out_image1_cuda = transforms(tensor_image.cuda())
# with batches
batched_image = torch.randint(0, 256, size=(4, 3, 256, 256), dtype=torch.uint8)
out_image_batched = transforms(batched_image)
# and has torchscript support
out_image2 = scripted_transforms(tensor_image)

These improvements enable the following new features:

support for GPU acceleration
batched transformations e.g. as needed for videos
transform multi-band torch tensor images (with more than 3-4 channels)
torchscript transforms together with your model for deployment
Note: Exceptions for TorchScript support includes Compose, RandomChoice, RandomOrder, Lambda and those applied on PIL images, such as ToPILImage.

[Stable] Native image IO for JPEG and PNG formats

torchvision 0.8.0 introduces native image reading and writing operations for JPEG and PNG formats. Those operators support TorchScript and return CxHxW tensors in uint8 format, and can thus be now part of your model for deployment in C++ environments.

from torchvision.io import read_image

# tensor_image is a CxHxW uint8 Tensor
tensor_image = read_image('path_to_image.jpeg')

# or equivalently
from torchvision.io import read_file, decode_image
# raw_data is a 1d uint8 Tensor with the raw bytes
raw_data = read_file('path_to_image.jpeg')
tensor_image = decode_image(raw_data)

# all operators are torchscriptable and can be
# serialized together with your model torchscript code
scripted_read_image = torch.jit.script(read_image)

[Stable] RetinaNet detection model

This release adds pretrained models for RetinaNet with a ResNet50 backbone from Focal Loss for Dense Object Detection.

[Beta] New Video Reader API

This release introduces a new video reading abstraction, which gives more fine-grained control of iteration over videos. It supports image and audio, and implements an iterator interface so that it is interoperable with other the python libraries such as itertools.

from torchvision.io import VideoReader

# stream indicates if reading from audio or video
reader = VideoReader('path_to_video.mp4', stream='video')
# can change the stream after construction
# via reader.set_current_stream

# to read all frames in a video starting at 2 seconds
for frame in reader.seek(2):
    # frame is a dict with "data" and "pts" metadata
    print(frame["data"], frame["pts"])

# because reader is an iterator you can combine it with
# itertools
from itertools import takewhile, islice
# read 10 frames starting from 2 seconds
for frame in islice(reader.seek(2), 10):
    pass
    
# or to return all frames between 2 and 5 seconds
for frame in takewhile(lambda x: x["pts"] < 5, reader):
    pass

Notes:

In order to use the Video Reader API beta, you must compile torchvision from source and have ffmpeg installed in your system.
The VideoReader API is currently released as beta and its API may change following user feedback.

torchaudio

With this release, torchaudio is expanding its support for models and end-to-end applications, adding a wav2letter training pipeline and end-to-end text-to-speech and source separation pipelines. Please file an issue on github to provide feedback on them.

[Stable] Speech Recognition

Building on the addition of the wav2letter model for speech recognition in the last release, we’ve now added an example wav2letter training pipeline with the LibriSpeech dataset.

[Stable] Text-to-speech

With the goal of supporting text-to-speech applications, we added a vocoder based on the WaveRNN model, based on the implementation from this repository. The original implementation was introduced in “Efficient Neural Audio Synthesis”. We also provide an example WaveRNN training pipeline that uses the LibriTTS dataset added to torchaudio in this release.

[Stable] Source Separation

With the addition of the ConvTasNet model, based on the paper “Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” torchaudio now also supports source separation. An example ConvTasNet training pipeline is provided with the wsj-mix dataset.

Cheers!

Team PyTorch

Getting started

Visual Studio Code Integration

Feedback

The ROCm Ecosystem

Getting started with PyTorch for ROCm

More information

Feedback

Event Details

There are two ways to participate in PyTorch Ecosystem Day:

TorchVision 0.9.0

[Stable] TorchVision Mobile: Operators, Android Binaries, and Tutorial

[Stable] New Mobile models for Classification, Object Detection and Semantic Segmentation

[Stable] AutoAugment

Other New Features for TorchVision

TorchAudio 0.8.0

I/O Improvements

[Stable] Switch to CMake-based build

[Beta] Improved and New Audio Transforms

Community Contributions

TorchText 0.9.0

[Beta] Dataset API Updates

[Stable] TorchCSPRNG 0.2.0

New and Updated APIs

[Stable] Torch.fft support for high performance NumPy style FFTs

[Beta] Support for NumPy style linear algebra functions via torch.linalg

[Beta] Python code Transformations with FX

Distributed Training

[Beta] Pipeline Parallelism

[Beta] DDP Communication Hook

Additional Prototype Features for Distributed Training

PyTorch Mobile

[Prototype] PyTorch Mobile Lite Interpreter

Performance Optimization

(Beta) Benchmark utils

(Prototype) FX Graph Mode Quantization

Hardware Support

[Beta] Ability to Extend the PyTorch Dispatcher for a new backend in C++

[Beta] AMD GPU Binaries Now Available

Getting started

Performance

Updating from older PyTorch versions

Future

NNAPI Support with Google Android

PyTorch Mobile GPU support

ARM64 Builds for Linux

Why Does PyTorch Need a CLA?

What is Not Changing

How the New CLA will Work

Front End APIs

[Beta] NumPy Compatible torch.fft module

[Beta] C++ Support for Transformer NN Modules

[Beta] torch.set_deterministic

Performance & Profiling

[Beta] Stack traces added to profiler

Distributed Training & RPC

[Stable] TorchElastic now bundled into PyTorch docker image

[Beta] Support for uneven dataset inputs in DDP

[Beta] NCCL Reliability – Async Error/Timeout Handling

[Beta] TorchScript rpc_remote and rpc_sync

[Beta] Distributed optimizer with TorchScript support

[Beta] Enhancements to RPC-based Profiling

[Prototype] Windows support for Distributed Training

Mobile

[Beta] PyTorch Mobile Caching allocator for performance improvements

torchvision

[Stable] Transforms now support Tensor inputs, batch computation, GPU, and TorchScript

[Stable] Native image IO for JPEG and PNG formats

[Stable] RetinaNet detection model

[Beta] New Video Reader API

torchaudio

[Stable] Speech Recognition

[Stable] Text-to-speech

[Stable] Source Separation

Navigation

Computer Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2023 Vedere AI. All Rights Reserved.

[Stable] `Torch.fft` support for high performance NumPy style FFTs

[Beta] Support for NumPy style linear algebra functions via `torch.linalg`

[Beta] TorchScript `rpc_remote` and `rpc_sync`