Performance experiments with Stable Diffusion

This is a companion to the main blog “Accelerated Stable Diffusion with PyTorch 2”, containing detailed information on the benchmarking setup and results of individual experiments. It is mainly aimed at a hands-on reader who wants to reproduce or build on the work described in the main text. Please see the main text for all the context and the summary of results.

Appendix 1: benchmarked versions definition

Here we define precisely what we mean by “original code” and “optimized code” in the main text.

Original code

Lives in https://github.com/sgrigory/stablediffusion2 on the original-benchmark branch, specifically in this commit. This is almost the same code as in https://github.com/Stability-AI/stablediffusion, with minimal modifications necessary for benchmarking. In particular, the code is able to turn off xFormers attention when the environment variable USE_XFORMERS is set to False.

This code uses PyTorch 1.12 and the original custom implementation of attention.

Optimized code

The optimized version is the code living here. It has all the optimizations we mentioned in the main text:

  • nn.MultiheadAttention in CrossAttention instead of custom attention implementation
  • Compilation with torch.compile
  • Other minor optimizations in PyTorch-related code.

The first optimization (using nn.MultiheadAttention in CrossAttention) schematically boils down to the following pseudocode:

class CrossAttention(nn.Module):
    def __init__(self, ...):
        # Create projection matrices: Q, K, V, out_proj
        ...
    def forward(self, x, context=None, mask=None):
        # Compute out = SoftMax(Q*K^T/sqrt(d))*V
        # Return out_proj(out)
        ...

gets replaced with

class CrossAttention(nn.Module):
    def __init__(self, ...):
        self.mha = nn.MultiheadAttention(...)
    def forward(self, x, context):
        return self.mha(x, context, context)

See the full diff here.
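For a more concrete picture, below is a minimal runnable sketch of such a replacement module. It is illustrative only: the class name CrossAttentionSketch is our own, it assumes the query and context share the same embedding dimension, and it omits the projection-dimension handling, masking, and configuration present in the actual repository code.

import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        # QKV and output projections all live inside nn.MultiheadAttention
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x, context=None):
        context = x if context is None else context  # fall back to self-attention
        out, _ = self.mha(x, context, context, need_weights=False)
        return out

# Toy usage with (batch, sequence, embedding) tensors
attn = CrossAttentionSketch(embed_dim=64, num_heads=8)
x, ctx = torch.randn(2, 16, 64), torch.randn(2, 77, 64)
y = attn(x, ctx)  # shape: (2, 16, 64)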

We have also introduced the following CLI flags (a sketch of how they might map to PyTorch APIs is shown after the list):

  • --disable_math, --disable_mem_efficient, --disable_flash to allow turning specific attention backends off
  • --compile to turn on PyTorch compilation
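A hedged sketch of such wiring is shown below. The function name apply_optimizations and the args namespace are illustrative assumptions, not code from the repository; the torch.backends.cuda.enable_*_sdp toggles and torch.compile are the PyTorch 2 APIs involved.

import torch

def apply_optimizations(model, args):
    # Globally enable/disable individual scaled-dot-product attention backends
    torch.backends.cuda.enable_math_sdp(not args.disable_math)
    torch.backends.cuda.enable_mem_efficient_sdp(not args.disable_mem_efficient)
    torch.backends.cuda.enable_flash_sdp(not args.disable_flash)

    # --compile: wrap the model with the PyTorch 2 compiler
    if args.compile:
        model = torch.compile(model)
    return model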

The optimized version uses PyTorch 2.0.0.dev20230111+cu117.

Flags added to both code versions

In both code versions we have added the following CLI options to txt2img.py.

  • --skip_first to use a “warm-up” iteration before starting to measure time. See the end of section “Benchmarking setup and results summary” in the main text for why this was necessary
  • --time_file <FILENAME> to write runtime in seconds in text format to the specified file

Commands

Now it should already be clear how to run the 5 configurations mentioned in the main text. For completeness, we provide the commands which can be used to run each of them. This assumes you have

  • installed dependencies from the original version into conda environment ldm-original
  • installed dependencies from the optimized version into conda environment ldm
  • downloaded model weights into /tmp/model.ckpt
  • converted model weights to the new architecture and saved them into /tmp/model_native_mha.ckpt

(see Colab for a bash script which does that)

Commands for the 5 configurations:

# Run optimized with memory-efficient attention and compilation
conda activate ldm
git checkout optmize-w-compile
python scripts/txt2img.py --prompt "A photo" --seed 1 --plms --config configs/stable-diffusion/v2-inference_native_mha.yaml --ckpt /tmp/model_native_mha.ckpt --n_iter 2 --n_samples 1 --compile --skip_first

# Run optimized with memory-efficient attention
conda activate ldm
git checkout optmize-w-compile
python stable-diffusion/scripts/txt2img.py --prompt "A photo" --seed 1 --plms --config stable-diffusion/configs/stable-diffusion/v2-inference_native_mha.yaml --ckpt /tmp/model_native_mha.ckpt --n_iter 2 --n_samples 1 --skip_first

# Run optimized without memory-efficient or flash attention
conda activate ldm
git checkout optmize-w-compile
python stable-diffusion/scripts/txt2img.py --prompt "A photo" --seed 1 --plms --config stable-diffusion/configs/stable-diffusion/v2-inference_native_mha.yaml --ckpt /tmp/model_native_mha.ckpt --n_iter 2 --n_samples 1 --disable_mem_efficient --disable_flash --skip_first 

# Run original code with xFormers
conda activate ldm-original
git checkout original-benchmark
python stable-diffusion-original/scripts/txt2img.py --prompt "A photo" --seed 1 --plms --config stable-diffusion-original/configs/stable-diffusion/v2-inference.yaml --ckpt /tmp/model.ckpt --n_iter 2 --n_samples 1 --skip_first

# Run original code without xFormers
conda activate ldm-original
git checkout original-benchmark
USE_XFORMERS=False python stable-diffusion-original/scripts/txt2img.py --prompt "A photo" --seed 1 --plms --config stable-diffusion-original/configs/stable-diffusion/v2-inference.yaml --ckpt /tmp/model.ckpt --n_iter 2 --n_samples 1 --skip_first

Appendix 2: per-run data

Plots with per-run benchmark data can be found here. Each plot shows all the runs for a particular GPU (P100, V100, T4, A10, A100) and batch size (1, 2, or 4). The bar charts in the main text are obtained from this data by averaging. The file names are self-explanatory, for example “original_vs_optimized_A10_n_samples_2_n_iter_2_sd2.png” contains runs for A10 GPU, batch size 2 and number of iterations 2.

Appendix 3: Accelerated Stable Diffusion 1

Before the work on Stable Diffusion 2 described in the main text, we applied similar optimizations to Stable Diffusion 1 by CompVis. The original implementation of SD1 does not integrate with xFormers, so the speedup from just using the PyTorch optimized attention instead of the custom implementation is significant. It should be noted that the HuggingFace Diffusers port of SD1 does allow integration with xFormers, so an interesting open question, which we didn’t explore, is how the performance of SD1 with PyTorch optimized attention compares to HuggingFace SD1 with xFormers.

We benchmarked two versions of SD1, original and optimized:

  • As the original version we took the first SD release and placed it here with minimal modifications to simplify benchmarking. It uses PyTorch 1.11 and a custom implementation of attention.
  • The optimized version is the code living here. It uses nn.MultiheadAttention in CrossAttention and PyTorch 2.0.0.dev20221220+cu117.

Here are the results for different GPU architectures and batch size 2:

Version                        T4             P100           V100           A100
Original SD1 (runtime in s)    70.9           71.5           20.3           14.4
Optimized SD1 (runtime in s)   52.7 (-25.6%)  57.5 (-19.5%)  14.3 (-29.3%)  10.4 (-27.9%)

As for SD2, we used Meta hardware for the P100, V100, and A100 benchmarks. The T4 benchmark was done in Google Colab here.

We didn’t apply compilation to SD1, and so, unlike for SD2, didn’t include a “warm-up” iteration in these benchmarks.

Both applying torch.compile to SD1 and benchmarking the HuggingFace version of SD1 with PyTorch 2 optimizations would be a great exercise for the reader – try it and let us know if you get interesting results.

Accelerated Stable Diffusion with PyTorch 2

TL;DR: PyTorch 2.0 nightly offers out-of-the-box performance improvement for Stable Diffusion 2.1 by using the new torch.compile() compiler and optimized implementations of Multihead Attention integrated with PyTorch 2.

Introduction

Stable Diffusion (SD) is a great example of Generative AI, producing high quality images from text prompts. However, as with other diffusion-based models, its generation is rather slow, due to the iterative nature of the sampling process by which the images are produced. This makes it important to optimize the code running inside the sampling loop.

We took SD 2.1 from Stability AI as a starting point and accelerated its text-to-image generation using two optimizations available in PyTorch 2: compilation and fast attention implementation. Together with a few minor memory processing improvements in the code, these optimizations give up to 49% inference speedup relative to the original SD implementation without xFormers, and 39% inference speedup relative to using SD with xFormers (excluding the compilation time), depending on the GPU architecture and batch size. Importantly, the speedup comes without a need to install xFormers or any other extra dependencies.

The table below shows the improvement in runtime between the original implementation with xFormers installed and our optimized version with PyTorch-integrated memory efficient attention (originally developed for and released in the xFormers library) and PyTorch compilation. The compilation time is excluded.

Runtime improvement in % compared to original+xFormers

See the absolute runtime numbers in section “Benchmarking setup and results summary”

GPU Batch size 1 Batch size 2 Batch size 4
P100 (no compilation) -3.8 0.44 5.47
T4 2.12 10.51 14.2
A10 -2.34 8.99 10.57
V100 18.63 6.39 10.43
A100 38.5 20.33 12.17

One can notice the following:

  • The improvements are significant for powerful GPUs like A100 and V100. For those GPUs the improvement is most pronounced for batch size 1
  • For less powerful GPUs we observe smaller speedups (or in two cases slight regressions). The batch size trend is reversed here: improvement is larger for larger batches

In the following sections we describe the applied optimizations and provide detailed benchmarking data, comparing SD performance with various optimization features on/off.

Specifically, we benchmark 5 configurations and the plots below compare their absolute performance for different GPUs and batch sizes. For definitions of these configurations see section “Benchmarking setup and results”.

Benchmark of Stable Diffusion 2 versions across GPU architectures, batch size 1

Benchmark of Stable Diffusion 2 versions across GPU architectures, batch size 2

Benchmark of Stable Diffusion 2 versions across GPU architectures, batch size 4

If you prefer looking directly at the code, see the Google Colab which runs the benchmark on T4.

Optimizations

Here we’ll go into more detail about the optimizations introduced into the SD code. At the moment they rely on features only available in the nightlies, so we pinned the PyTorch version to a recent nightly (see here). Once the PyTorch 2.0 release comes out, these optimizations won’t have to rely on nightlies any more.

Optimized Attention

One part of the code which we optimized was the scaled dot-product attention. Attention is known to be a heavy operation: a naive implementation materializes the attention matrix, leading to time and memory complexity quadratic in sequence length. In Stable Diffusion, attention (CrossAttention) appears as part of Transformer blocks in multiple parts of the U-Net. Since the U-Net runs at every sampling step, this becomes a critical point to optimize. In PyTorch 2, an optimized attention implementation is integrated into torch.nn.MultiheadAttention, so we used it to replace the custom attention implementation in CrossAttention.

The optimized implementation of attention was already available in PyTorch 1.13 (see here) and widely adopted (see e.g. the HuggingFace transformers library example). In particular, it integrates memory-efficient attention from the xFormers library and flash attention from https://arxiv.org/abs/2205.14135. PyTorch 2.0 expands this to additional attention functions such as cross attention and custom kernels for further acceleration, making it applicable to SD.

Flash attention is available on GPUs with compute capability SM 7.5 or SM 8.x – for example, on T4, A10, and A100, which are included in our benchmark (you can check compute capability of each NVIDIA GPU here). However, in our tests on A100 the memory efficient attention performed better than flash attention for the particular case of SD, due to the small number of attention heads and small batch size. PyTorch understands this and chooses memory efficient attention over flash attention for SD when both are available (see the logic here). For full control over the attention backends (memory-efficient attention, flash attention, “vanilla math”, or any future ones), power users can enable and disable them manually with the help of the context manager torch.backends.cuda.sdp_kernel.
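As an illustration of the last point, here is a hedged sketch of restricting the backend choice with that context manager, using the functional attention API from the PyTorch 2.0 nightlies; the tensor shapes are arbitrary and a CUDA device is assumed.

import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim), half precision on GPU
q = k = v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)

# Allow only the memory-efficient backend; flash and math are disabled
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v)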

Compilation

Compilation is a new feature of PyTorch 2.0, enabling significant speedups with a very simple user experience. To invoke the default behavior, simply wrap a PyTorch module or a function into torch.compile:

model = torch.compile(model)

The PyTorch compiler then turns the Python code into a set of instructions which can be executed efficiently without Python overhead. The compilation happens dynamically the first time the code is executed. With the default behavior, PyTorch uses TorchDynamo under the hood to compile the code and TorchInductor to further optimize it. See this tutorial for more details.

Although the one-liner above is enough for compilation, certain modifications in the code can squeeze out a larger speedup. In particular, one should avoid so-called graph breaks – places in the code which PyTorch can’t compile. As opposed to previous PyTorch compilation approaches (like TorchScript), the PyTorch 2 compiler doesn’t fail in this case. Instead it falls back on eager execution – so the code runs, but with reduced performance. We introduced a few minor changes to the SD code to eliminate graph breaks (here and here). See this doc to learn more about graph breaks and how to eliminate them.
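As a toy illustration (not taken from the SD code), the snippet below shows one common source of a graph break: pulling a value out of a tensor with .item() forces control back into Python, so the compiler has to split the graph around it.

import torch

def scale_with_break(x):
    m = x.max().item()   # .item() leaves the tensor world: graph break here
    return x / m

def scale_without_break(x):
    return x / x.max()   # stays entirely inside the compiled graph

compiled = torch.compile(scale_without_break)
y = compiled(torch.randn(8, 8))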

Note that compilation requires GPU compute capability >= SM 7.0 to run in non-eager mode. This covers all GPUs in our benchmarks – T4, V100, A10, A100 – except for P100 (see the full list).

Other optimizations

In addition, we have improved the efficiency of some memory operations – e.g. creating a tensor on the GPU directly rather than creating it on the CPU and later moving it to the GPU (see here and here). The places where such optimizations were necessary were determined by line-profiling and looking at CPU/GPU traces and Flame Graphs.
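The following minimal sketch (illustrative, not the exact SD code) shows the kind of change meant here:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Slower pattern: allocate on the CPU, then copy to the GPU
noise = torch.randn(1, 4, 64, 64).to(device)

# Faster pattern: allocate directly on the target device
noise = torch.randn(1, 4, 64, 64, device=device)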

Benchmarking setup and results summary

We have two versions of SD code to compare: original and optimized. On top of this, several optimization features (xFormers, PyTorch memory efficient attention, compilation) can be turned on/off. Overall, as mentioned in the introduction, we will be benchmarking 5 configurations:

  • Original code without xFormers
  • Original code with xFormers
  • Optimized code with vanilla math attention backend and no compilation
  • Optimized code with memory-efficient attention backend and no compilation
  • Optimized code with memory-efficient attention backend and compilation

As the original version we took the SD 2.1 release, and placed it here with minimal modifications necessary for benchmarking. It uses PyTorch 1.12 and a custom implementation of attention.

The optimized version is the code living here. It uses nn.MultiheadAttention in CrossAttention and PyTorch 2.0.0.dev20230111+cu117. It also has a few other minor optimizations in PyTorch-related code.

Please see the appendix “Benchmarked versions definition” in the companion page for the precise definition of the 5 configurations and prompts triggering each of them.

The table below shows runtime of each version of the code in seconds, and the percentage improvement compared to the original with xFormers. The compilation time is excluded.

Runtimes for batch size 1. In parentheses – relative improvement with respect to the “Original with xFormers” row

Configuration P100 T4 A10 V100 A100
Original without xFormers 30.4s (-19.3%) 29.8s (-77.3%) 13.0s (-83.9%) 10.9s (-33.1%) 8.0s (-19.3%)
Original with xFormers 25.5s (0.0%) 16.8s (0.0%) 7.1s (0.0%) 8.2s (0.0%) 6.7s (0.0%)
Optimized with vanilla math attention, no compilation 27.3s (-7.0%) 19.9s (-18.7%) 13.2s (-87.2%) 7.5s (8.7%) 5.7s (15.1%)
Optimized with mem. efficient attention, no compilation 26.5s (-3.8%) 16.8s (0.2%) 7.1s (-0.8%) 6.9s (16.0%) 5.3s (20.6%)
Optimized with mem. efficient attention and compilation n/a 16.4s (2.1%) 7.2s (-2.3%) 6.6s (18.6%) 4.1s (38.5%)

Runtimes for batch size 2

Configuration P100 T4 A10 V100 A100
Original without xFormers 58.0s (-21.6%) 57.6s (-84.0%) 24.4s (-95.2%) 18.6s (-63.0%) 12.0s (-50.6%)
Original with xFormers 47.7s (0.0%) 31.3s (0.0%) 12.5s (0.0%) 11.4s (0.0%) 8.0s (0.0%)
Optimized with vanilla math attention, no compilation 49.3s (-3.5%) 37.9s (-21.0%) 17.8s (-42.2%) 12.7s (-10.7%) 7.8s (1.8%)
Optimized with mem. efficient attention, no compilation 47.5s (0.4%) 31.2s (0.5%) 12.2s (2.6%) 11.5s (-0.7%) 7.0s (12.6%)
Optimized with mem. efficient attention and compilation n/a 28.0s (10.5%) 11.4s (9.0%) 10.7s (6.4%) 6.4s (20.3%)

Runtimes for batch size 4

Configuration P100 T4 A10 V100 A100
Original without xFormers 117.9s (-20.0%) 112.4s (-81.8%) 47.2s (-101.7%) 35.8s (-71.9%) 22.8s (-78.9%)
Original with xFormers 98.3s (0.0%) 61.8s (0.0%) 23.4s (0.0%) 20.8s (0.0%) 12.7s (0.0%)
Optimized with vanilla math attention, no compilation 101.1s (-2.9%) 73.0s (-18.0%) 28.3s (-21.0%) 23.3s (-11.9%) 14.5s (-13.9%)
Optimized with mem. efficient attention, no compilation 92.9s (5.5%) 61.1s (1.2%) 23.9s (-1.9%) 20.8s (-0.1%) 12.8s (-0.9%)
Optimized with mem. efficient attention and compilation n/a 53.1s (14.2%) 20.9s (10.6%) 18.6s (10.4%) 11.2s (12.2%)

To minimize fluctuations and external influence on the performance of the benchmarked code, we ran each version of the code one after another, and then repeated this sequence 10 times: A, B, C, D, E, A, B, … So the results of a typical run would look like the one in the picture below. For results of all runs please see appendix “Per-run data” in the companion page. Note that one shouldn’t rely on comparison of absolute run times between different graphs, but comparison of run times inside one graph is pretty reliable, thanks to our benchmarking setup.

Stable Diffusion 2.1 benchmarks

Each run of txt2img.py generates several batches, which is regulated by the CLI parameter --n_iter. In the benchmarks we used n_iter = 2, but introduced an additional “warm-up” iteration, which doesn’t contribute to the run time. This was necessary for the runs with compilation, because compilation happens the first time the code runs, and so the first iteration is much longer than all subsequent ones. To make the comparison fair, we also introduced this additional “warm-up” iteration in all other runs; it is turned on by the CLI option --skip_first provided to the modified txt2img.py.
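Schematically, the measurement pattern looks like the sketch below. This is illustrative only, not the actual txt2img.py code; sample_fn stands for one full sampling pass.

import time

def benchmark(sample_fn, n_iter=2, skip_first=True):
    """Time n_iter calls to sample_fn, optionally discarding a warm-up call."""
    times = []
    for i in range(n_iter + int(skip_first)):
        start = time.perf_counter()
        sample_fn()  # one full sampling pass producing a batch of images
        elapsed = time.perf_counter() - start
        if skip_first and i == 0:
            continue  # warm-up iteration: compilation happens here
        times.append(elapsed)
    return times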

The numbers in the table above are for number of iterations 2 (plus a “warm-up” one), prompt “A photo”, seed 1, PLMS sampler, and autocast turned on. See the companion page for precise CLI commands in appendix “Benchmarked versions definition” and detailed results of individual runs in appendix “Per-run data”.

The P100, V100, and A100 benchmarks were done on Meta internal infrastructure. The T4 benchmarks were done in Google Colab Pro (see the Google Colab notebook). The A10 benchmarks were done on g5.4xlarge AWS instances with 1 GPU.

Conclusions and next steps

We have shown that new features of PyTorch 2 – the compiler and optimized attention implementation – give performance improvements exceeding or comparable with what previously required installation of an external dependency (xFormers). PyTorch achieved this, in particular, by integrating memory efficient attention from xFormers into its codebase. This is a significant improvement for user experience, given that xFormers, being a state-of-the-art library, in many scenarios requires a custom installation process and long builds.

There are a few natural directions in which this work can be continued:

  • There are new implementations of SD, including a port to the HuggingFace diffusers library. It would be interesting to benchmark against them. Note that diffusers also requires installing xFormers in order to use memory efficient attention
  • The optimizations we implemented and described here are only benchmarked for text-to-image inference so far. It would be interesting to see how they affect training. PyTorch compilation can be directly applied to training; enabling training with PyTorch optimized attention is on the roadmap
  • We intentionally minimized changes to the original SD code. Further profiling and optimization can probably bring more improvements
  • At the moment compilation is applied only to the U-Net model inside the sampler. Since there is a lot happening outside of U-Net (e.g. operations directly in the sampling loop), it would be beneficial to compile the whole sampler. However, this would require analysis of the compilation process to avoid recompilation at every sampling step
  • Current code only applies compilation within the PLMS sampler, but it should be trivial to extend it to other samplers
  • Besides text-to-image generation, SD 2.1 has other pipelines – image-to-image and inpainting. It would be interesting to measure how their performance improves from PyTorch 2 optimizations

Try some of this in the Colab or on a GPU of your choice. See if you can further increase the performance of SD, and share the results! This is your chance to get a preview of PyTorch 2.0 and experience the features coming in the next release.

As a note, if you want access to new PyTorch features which come after this post is published, just tweak the PyTorch and TorchVision versions in environment.yaml.

Acknowledgements

We would like to thank Geeta Chauhan, Natalia Gimelshein, Patrick Labatut, Bert Maher, Mark Saroufim, Michael Voznesensky and Francisco Massa for their valuable advice and early feedback on the text.

Special thanks to Yudong Tao for creating the first version of Stable Diffusion with PyTorch native attention.

For more information, visit this page with additional resources.

PyTorch Trace Analysis for the Masses

Introduction

We are excited to announce the public release of Holistic Trace Analysis (HTA), an open source performance analysis and visualization Python library for PyTorch users. HTA takes as input Kineto traces collected by the PyTorch profiler, which are complex and challenging to interpret, and up-levels the performance information contained in these traces. It was initially developed internally at Meta to understand and debug performance problems for large-scale distributed training jobs on GPUs. The multidisciplinary team has made a number of enhancements to HTA’s features and scaled them to support state-of-the-art ML workloads.

ML researchers and systems engineers often struggle to computationally scale up their models because they are not aware of the performance bottlenecks in their workloads. The resources requested for a job (e.g. GPUs, memory) are often misaligned with the resources actually required due to lack of visibility “under the hood”. To achieve the best performance from the hardware stack, it is imperative to understand the resource utilization and bottlenecks for distributed training workloads.

The initial HTA implementation was specifically targeted at Deep Learning Based Recommendation Models (DLRM). To make the features in HTA generic and applicable to use cases such as analyzing Vision and NLP models, we decided to refactor the HTA codebase and make the library available to the larger community. This new codebase has implemented several important ideas which led to significant efficiency and performance improvements.

In this blog, we present several features implemented in the open source version of HTA, which can be used as a Python script as well as interactively in a Jupyter notebook. HTA provides the following features:

  1. Breakdown by Dimensions
    1. Temporal: Breakdown of GPU time in terms of time spent in computation, communication, memory events, and idle time on a single node and across all ranks.
    2. Idle Time: Breakdown of GPU idle time into waiting for the host, waiting for another kernel or attributed to an unknown cause.
    3. Kernel: Find kernels with the longest duration on each rank.
    4. Communication Computation Overlap: Calculate the percentage of time when communication overlaps computation.
  2. Statistical Analysis
    1. Kernel Duration Distribution: Distribution of average time taken by longest kernels across different ranks.
    2. CUDA Kernel Launch: Distributions of GPU kernels with very small duration, large duration, and excessive launch time.
    3. Augmented Counters (Memory bandwidth, Queue length): Augmented trace files which provide insights into memory copy bandwidth and number of outstanding operations on each CUDA stream.
  3. Patterns
    1. Frequent CUDA Kernels: Find the CUDA kernels most frequently launched by any given PyTorch or user defined operator.
  4. Trace Comparison
    1. Trace Diff: A trace comparison tool to identify and visualize the differences between traces.

HTA source code is available to users via Github. Users can request new features or build their own analysis using the core libraries and data structures provided in the codebase in addition to the features mentioned above.

GPU Training Performance Debugging 101

To understand the GPU performance in distributed training jobs, we consider how the model operators interact with the GPU devices and how such interactions are reflected in certain measurable metrics.

At a high level, we can break down the GPU operations in a model execution into three broad categories, henceforth referred to as kernel types:

  1. Computation (COMP) – Compute kernels execute compiled routines for matrix multiplication and similar numeric calculations. They are responsible for all of the number-crunching necessary for model execution.
  2. Communication (COMM) – Communication kernels are routines which are responsible for exchanging and synchronizing data between different GPU devices in a distributed training job. The NVIDIA Collective Communication Library (NCCL) is a widely used communication library and all its kernels have the prefix “nccl”. Example NCCL kernels include NCCL_AllGather, NCCL_ReduceScatter, NCCL_AllReduce, etc.
  3. Memory (MEM) – Memory kernels manage the memory allocations/deallocations on the GPU devices and data movement between the memory space on the host and the GPUs. The memory kernels include Memcpy_H2D, Memcpy_D2H, Memcpy_D2D, Memset, etc. Here, H represents the Host and D represents the GPU Device. Thus, H2D, D2H, D2D stands for Host to Device, Device to Host and Device to Device respectively.

Because a modern GPU device like the NVIDIA A100 GPU is a massively parallel device which is capable of running multiple kernels simultaneously, it is possible to overlap the computation, communication, and memory kernels to reduce the model execution time. One common technique to achieve the overlap is to utilize multiple CUDA streams. A CUDA stream is a sequence of operations that execute on a GPU device in the order in which they are issued by the host code. Different CUDA streams can be interleaved and even run concurrently, thus achieving the effect of kernel overlap.
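As a hedged illustration of the stream mechanism (this snippet is not part of HTA and requires a CUDA-capable machine), two streams can be used to issue a compute kernel and a memory copy that may overlap:

import torch

assert torch.cuda.is_available()
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(s1):
    c = a @ a                                # compute (COMP) kernel on stream 1

with torch.cuda.stream(s2):
    b_host = b.to("cpu", non_blocking=True)  # memory (MEM) kernel on stream 2

torch.cuda.synchronize()                     # wait for both streams to finish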

To help understand the above concepts, Figure 1 provides a timeline of the GPU kernels in a sample distributed training job on 8 GPUs for one iteration. In the figure below, each rank represents one GPU and the kernels on each GPU run on 6 CUDA streams. In the right column of the figure, you can see the names of the GPU kernels used. In the middle of the figure, you can see the overlap between compute and communication kernels. This figure is created using the plot_timeline example notebook available in HTA.

Figure 1. An example of the execution timeline of GPU Kernels across multiple ranks

The performance of multi-GPU training jobs is affected by multiple factors. Among these factors, how a model execution creates and orchestrates the GPU kernels plays a critical role. HTA provides insights on how the model execution interacts with the GPU devices and highlights the opportunities for performance improvement.

With the features we built in HTA, we aim to provide users insights into “what is happening under the hood in distributed GPU training?” We briefly describe these features in the next few paragraphs.

Features in Holistic Trace Analysis

For most users, understanding the performance of GPU training jobs is nontrivial. Thus, we built this library to simplify the task of trace analysis and provide the user useful insights by examining the model execution traces. As the first step, we developed features which are important and generic enough so that most users can benefit from this library.

Temporal Breakdown: We begin by asking whether the GPU is spending time on computation, communication, or memory events, or whether it is idle. To answer this question, the temporal breakdown feature presents a breakdown in terms of these categories. To achieve high training efficiency, the code should maximize time used by computation kernels and minimize idle time and non-compute time (time used by communication or memory kernels). This is accomplished by implementing concurrent execution of computation kernels with communication or memory kernels. Note that, during concurrent execution of computation kernels with communication/memory kernels, the time spent by communication/memory kernels is accounted for under compute time.

Figure 2: Temporal Breakdown across 8 GPUs

Kernel Breakdown: It is natural to ask which kernels are taking the most amount of time. The next feature breaks down the time spent within each kernel type (COMM, COMP, MEM) and sorts them by duration. We present this information for each kernel type and for each rank as a pie chart. See figure 3 below.

Figure 3: Pie chart of top computation and communication kernels

Kernel Duration Distribution: Subsequently, one can also ask – for any given kernel, what is the distribution of the time spent across the ranks? To answer this, HTA generates bar graphs for the average duration of a given kernel across all ranks. Additionally, the error bars in the bar graphs show the minimum and maximum amount of time taken by a given kernel on a given rank. Figure 4 below shows a discrepancy between average duration on rank 0 as compared to other ranks. This anomalous behavior on rank 0 guides the user on where to look for possible bugs.

Figure 4: Average duration of NCCL AllReduce Kernel across 8 ranks

Communication Computation Overlap: In distributed training, a significant amount of time is spent in communication and synchronization events among multiple GPU devices. To achieve high GPU efficiency (i.e. TFLOPS/GPU) it is vital to keep the GPU doing actual computation work. In other words, a GPU should not be blocked because of waiting for data from other GPUs. One way to measure the extent to which computation is blocked by data dependencies is to calculate the computation-communication overlap. Higher GPU efficiency is observed if communication events overlap computation events. Lack of communication and computation overlap will lead to the GPU being idle, thus the efficiency would be low. Thus, the communication computation overlap feature calculates the percentage of time communication and computation overlap in a job for each rank and generates a bar graph representation. See figure below. More precisely, we measure the following ratio

(time spent in computation while communicating) / (time spent in communication)

Figure 5: Communication computation overlap

Augmented Counters (Queue length, Memory bandwidth): To aid in debugging, HTA calculates the memory bandwidth statistics for D2H, H2D and D2D memory copy (memcpy) and memory set (memset) events. Additionally, HTA also computes the number of outstanding CUDA operations on each CUDA stream. We refer to this as queue length. When the queue length on a stream is 1024 or larger, new events cannot be scheduled on that stream and the CPU will stall until the GPU events have been processed. Additionally, HTA generates a new trace file containing tracks with the memory bandwidth and queue length time series. See Figure 6 below.

Figure 6: Memory Bandwidth and Queue Length

These primary features give us a peek into the system performance and help answer “what is happening in the system?”. As HTA evolves, we hope to address “why is X happening?” and also suggest possible solutions to overcome the bottlenecks.

Installation and Usage

Installation

For installing HTA please refer to the README. In brief, the user is required to clone the repo and install the necessary Python packages via pip.

Usage

This version of Holistic Trace Analysis is currently in beta and we recommend using HTA in a Jupyter notebook. A demo notebook is provided for your convenience. To get started, import the hta package in a Jupyter notebook, create a TraceAnalysis object and off we go in exactly two lines of code.

from hta.trace_analysis import TraceAnalysis
analyzer = TraceAnalysis(trace_dir="/trace/folder/path")
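Building on the two lines above, a typical session then calls individual analysis methods. The sketch below is a hedged usage example: the method names (get_temporal_breakdown, get_comm_comp_overlap) follow the HTA documentation at the time of writing, so please consult the docs of your installed version for the exact API.

from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/trace/folder/path")

# Temporal breakdown: computation / communication / memory / idle time per rank
time_spent_df = analyzer.get_temporal_breakdown()

# Percentage of communication time that overlaps with computation, per rank
overlap_df = analyzer.get_comm_comp_overlap()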

Requirements

  • All trace files for a training or inference job must be stored in a unique folder.
  • Trace files are in json or gzipped json format.

FAQ

Q. How can I install HTA?

Please see the README in the root directory of the repository.

Q. Is there any documentation on the features and API in HTA?

The documentation and detailed API is available here.

Q. Can you implement feature X?

Depending on how widely the feature is needed and the level of effort required to implement it we would consider developing the feature. Please open a Github Issue and tag it with the feature-request label.

Q. Can I modify the code?

Please do and send a PR along the way, if you think it would be useful for others.

Q. How can I collect traces in PyTorch?

Please refer to this tutorial here.

Q. Can HTA be used at production scale?

Yes, please see a use case study here.

Compromised PyTorch-nightly dependency chain between December 25th and December 30th, 2022.

If you installed PyTorch-nightly on Linux via pip between December 25, 2022 and December 30, 2022, please uninstall it and torchtriton immediately, and use the latest nightly binaries (newer than Dec 30th 2022).

$ pip3 uninstall -y torch torchvision torchaudio torchtriton
$ pip3 cache purge

PyTorch-nightly Linux packages installed via pip during that time installed a dependency, torchtriton, which was compromised on the Python Package Index (PyPI) code repository and ran a malicious binary. This is what is known as a supply chain attack and directly affects dependencies for packages that are hosted on public package indices.

NOTE: Users of the PyTorch stable packages are not affected by this issue.

How to check if your Python environment is affected

The following command searches for the malicious binary in the torchtriton package (PYTHON_SITE_PACKAGES/triton/runtime/triton) and prints out whether your current Python environment is affected or not.

python3 -c "import pathlib;import importlib.util;s=importlib.util.find_spec('triton'); affected=any(x.name == 'triton' for x in (pathlib.Path(s.submodule_search_locations[0] if s is not None else '/' ) / 'runtime').glob('*'));print('You are {}affected'.format('' if affected else 'not '))"

The malicious binary is executed when the triton package is imported, which requires explicit code to do and is not PyTorch’s default behavior.

The Background

At around 4:40pm GMT on December 30 (Friday), we learned about a malicious dependency package (torchtriton) that was uploaded to the Python Package Index (PyPI) code repository with the same package name as the one we ship on the PyTorch nightly package index. Since the PyPI index takes precedence, this malicious package was being installed instead of the version from our official repository. This design enables somebody to register a package by the same name as one that exists in a third party index, and pip will install their version by default.

This malicious package has the same name torchtriton but added in code that uploads sensitive data from the machine.

What we know

torchtriton on PyPI contains a malicious triton binary which is installed at PYTHON_SITE_PACKAGES/triton/runtime/triton. Its SHA256 hash is listed below.

SHA256(triton)= 2385b29489cd9e35f92c072780f903ae2e517ed422eae67246ae50a5cc738a0e

The binary’s main function does the following:

  • Get system information
    • nameservers from /etc/resolv.conf
    • hostname from gethostname()
    • current username from getlogin()
    • current working directory name from getcwd()
    • environment variables
  • Read the following files
    • /etc/hosts
    • /etc/passwd
    • The first 1,000 files in $HOME/*
    • $HOME/.gitconfig
    • $HOME/.ssh/*
  • Upload all of this information, including file contents, via encrypted DNS queries to the domain *.h4ck[.]cfd, using the DNS server wheezy[.]io

The binary’s file upload functionality is limited to files less than 99,999 bytes in size. It also uploads only the first 1,000 files in $HOME (but all files < 99,999 bytes in the .ssh directory).

Steps taken towards mitigation

  • torchtriton has been removed as a dependency for our nightly packages and replaced with pytorch-triton (pytorch/pytorch#91539) and a dummy package registered on PyPI (so that this issue doesn’t repeat)
  • All nightly packages that depend on torchtriton have been removed from our package indices at https://download.pytorch.org until further notice
  • We have reached out to the PyPI security team to get proper ownership of the torchtriton package on PyPI and to delete the malicious version

Torchserve Performance Tuning, Animated Drawings Case-Study

Serving models in production

In this post we discuss performance tuning of Torchserve for serving your models in production. One of the biggest challenges in the life cycle of an ML project is deploying models in production. This requires a reliable serving solution along with solutions that address the MLOps needs. A robust serving solution needs to provide support for multi-model serving, model versioning, metric logging, monitoring, and scaling to serve peak traffic. We give an overview of Torchserve and how to tune its performance for production use cases. We also discuss the Animated Drawings app from Meta, which can turn your human-figure sketches into animations, and how it could serve peak traffic with Torchserve. The Animated Drawings workflow is shown below.

https://ai.facebook.com/blog/using-ai-to-bring-childrens-drawings-to-life/

Many AI systems and tools are designed to handle realistic images of humans; children’s drawings add a level of complexity and unpredictability, as they are often constructed in abstract, fanciful ways. These types of morphological and stylistic variations can confuse even state-of-the-art AI systems that excel at spotting objects in photorealistic images and drawings.
Meta AI researchers are working to overcome this challenge so that AI systems will be better able to recognize drawings of human figures in the wildly varied ways that children create them. This great blog post provides more details about Animated Drawings and the approach taken.

Torchserve

Fig1. Overall flow of Torchserve performance tuning

Once you have trained your model, it needs to be integrated into a larger system to form a full-fledged application; we use the term “model serving” to refer to this integration. Basically, model serving makes your trained model available for running inference and for subsequent use.

Torchserve is the preferred PyTorch solution for serving models in production. It is a performant and scalable tool that wraps your model in an HTTP or HTTPS API. It has a frontend implemented in Java that handles multiple tasks, from assigning workers for serving models to handling the connection between client and server. Torchserve has a Python backend that is responsible for handling the inference service.

Torchserve supports multi-model serving, versioning for A/B testing, dynamic batching, logging, and metrics. It exposes four APIs: inference, explanations, management, and metrics.

The Inference API listens on port 8080 and is accessible through localhost by default; this can be changed in the Torchserve configuration. It enables getting predictions from the model.

The Explanation API uses Captum under the hood to provide explanations of the model that is being served, and listens on port 8080 as well.

The Management API allows registering, unregistering, and describing a model. It also enables users to scale up or down the number of workers that serve the model.

The Metrics API by default listens on port 8082 and enables us to monitor the model that is being served.

Torchserve lets you scale your model serving and handle peak traffic by supporting batch inference and multiple workers that serve your model. Scaling can be done through the Management API and settings in a configuration file. The Metrics API also helps you monitor your model serving through default and customizable metrics.

Other advanced settings, such as the length of the queue for received requests, the maximum wait time for a batch of inputs, and many other properties, are configurable through a config file that can be passed to Torchserve when it is started.

Steps to serve your model with Torchserve

  1. Install Torchserve, the model archiver, and their requirements.
  2. Choose a default handler that fits your task (e.g. image classification) or author a custom handler.
  3. Package your model artifacts (trained model checkpoint and all other necessary files for loading and running your model) and the handler into a “.mar” file using the model archiver (torch-model-archiver) and place it in the model store.
  4. Start serving your model.
  5. Run inference.

We will discuss model handlers and metrics in more detail below.

Model handlers

Torchserve uses a handler in the backend to load the models, preprocess the received data, run inference, and post-process the response. A handler in Torchserve is a Python script into which all the model initialization, preprocessing, inference, and post-processing logic goes.

Torchserve provides out-of-the-box handlers for a number of applications, like image classification, segmentation, object detection, and text classification. It also supports custom handlers, in case your use case is not covered by the default handlers.

Custom handlers provide great flexibility, which potentially makes Torchserve a multi-framework serving tool. Custom handlers let you define custom logic to initialize a model, which can also be used to load models from other frameworks such as ONNX.

A Torchserve handler is made of four main functions (initialize, preprocess, inference, and postprocess), each returning a list. The code snippet below shows an example of a custom handler. Custom handlers inherit from BaseHandler in Torchserve and can override any of the main functions. Here is an example of the handler used for loading the Detectron2 model for figure detection; this model has been exported to Torchscript and uses model.half() to run the inference with FP16. Details are explained in another section of this post.


import io
import json
import os

import cv2
import numpy as np
import torch
from ts.torch_handler.base_handler import BaseHandler


class MyModelHandler(BaseHandler):
    def initialize(self, context):
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)

        # Pick the GPU assigned to this worker, if any
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )
        # Load the Torchscript model and switch it to FP16
        self.model = torch.jit.load(model_pt_path, map_location=self.device)
        self.model = self.model.half()

    def preprocess(self, data):
        inputs = []
        for request in data:
            request_body = request.get("body")

            # Decode the image bytes into a CHW half-precision tensor on the device
            input_ = io.BytesIO(request_body)
            image = cv2.imdecode(np.frombuffer(input_.read(), np.uint8), 1)
            input = torch.Tensor(image).permute(2, 0, 1)
            input = input.to(self.device)
            input = input.half()
            inputs.append({"image": input})

        return inputs

    def inference(self, inputs):
        # The scripted model is assumed to accept a list of {"image": tensor} dicts
        predictions = self.model(inputs)
        return predictions

    def postprocess(self, output):
        responses = []
        for inference_output in output:
            responses_json = {
                'classes': inference_output['pred_classes'].tolist(),
                'scores': inference_output['scores'].tolist(),
                "boxes": inference_output['pred_boxes'].tolist(),
            }
            responses.append(json.dumps(responses_json))

        return responses

Metrics

An essential component in serving models in production is the ability to monitor them. Torchserve collects system level metrics regularly and allows adding custom metrics as well.

System-level metrics consist of CPU utilization, available and used disk space, and memory on the host machine, along with the number of requests with different response codes (e.g. 200-300, 400-500, and above 500). Custom metrics can be added to the metrics as explained here. TorchServe logs these two sets of metrics to different log files. Metrics are collected by default at:

  • System metrics – log_directory/ts_metrics.log
  • Custom metrics – log_directory/model_metrics.log

As mentioned before, Torchserve also exposes the Metrics API, which by default listens on port 8082 and enables users to query and monitor the collected metrics. The default metrics endpoint returns Prometheus-formatted metrics. You can query metrics using curl requests or point a Prometheus server to the endpoint and use Grafana for dashboards.

While serving a model you can query metrics using curl request as follows:

curl http://127.0.0.1:8082/metrics

In case you are looking into exporting the logged metrics, please refer to this example that uses mtail to export metrics to Prometheus. Tracking these metrics in a dashboard allows you to monitor performance regressions that may have been sporadic or hard to spot during an offline benchmark run.

What to consider for tuning performance of a model in production

The workflow suggested in Fig 1 is the general idea of how to approach model deployment in production with Torchserve.

In many cases, serving models in production is optimized based on throughput or latency service level agreements (SLAs). Usually, real-time applications are more concerned about latency, whereas offline applications may care more about higher throughput.

There are a number of main factors contributing to the performance of a serving model in production. In particular, we are focusing on serving Pytorch models with Torchserve here, however most of these factors generalize to all models from other frameworks as well.

  • Model optimizations: this is a pre-step for deploying models into production. This is a very broad discussion that we will get into in a series of future blogs. This includes techniques like quantization, pruning to decrease the size of the model, using Intermediate representations (IR graphs) such as Torchscript in Pytorch, fusing kernels and many others. Currently torchprep provides many of these techniques as a CLI tool.
  • Batch inference: this refers to feeding multiple inputs into a model. While it is essential during training, it can be very helpful for managing the cost at inference time as well. Hardware accelerators are optimized for parallelism, and batching helps saturate the compute capacity, often leading to higher throughput. The main difference at inference time is that you can’t wait too long for a batch to be filled by clients, something we call dynamic batching
  • Number of Workers: Torchserve uses workers to serve models. Torchserve workers are Python processes that hold a copy of the model weights for running inference. Too few workers means you’re not benefiting from enough parallelism, but too many can cause worker contention and degrade end-to-end performance.

  • Hardware: choosing the appropriate hardware based on the model, application, and latency/throughput budget. This could be any of the hardware supported by Torchserve: CPU, GPU, or AWS Inferentia. Some hardware configurations are intended for best-in-class performance, and others are better suited for cost-effective inference. From our experiments, we’ve found that GPUs shine best at larger batch sizes, whereas the right CPUs and AWS Inferentia can be far more cost effective for lower batch sizes and low latency.

Best Practices for Performance tuning on Torchserve

To get the best performance out of your model while serving it with Torchserve, we are sharing some of the best practices here. Torchserve provides a benchmark suite that gives helpful insight for making informed decisions on the different choices detailed below.

  • Optimize your model as the first step; see the PyTorch model optimization tutorials. Model optimization choices are also closely tied to the hardware of choice. We will discuss this in more detail in another blog post.
  • Deciding the hardware for model deployment can be closely related to the latency and throughput budget and cost per inference. Depending on the size of the model and the application, this can vary: for some models, like computer vision models, it has historically not been affordable to run in production on CPU. However, with optimizations such as IPEX recently added to Torchserve, this has become much more affordable and cost-beneficial, and you can learn more in this investigative case study
  • Workers in Torchserve are Python processes that provide parallelism; setting the number of workers should be done carefully. By default, Torchserve launches a number of workers equal to the number of VCPUs or available GPUs on the host, which can add a considerable amount of time to the Torchserve startup.

    Torchserve exposes a config property to set the number of workers. To provide efficient parallelism through multiple workers while avoiding them competing over resources, as a baseline we recommend the following settings on CPU and GPU:

    CPU: in the handler, call torch.set_num_threads(1), then set the number of workers to num physical cores / 2. The best threading configurations can be achieved by leveraging the Intel CPU launcher script.

    GPU: the number of available GPUs can be set through number_gpus in config.properties. Torchserve uses round robin to assign workers to GPUs. We recommend setting the number of workers as follows: Number of workers = (Number of available GPUs) / (Number of unique models). Note that GPUs that are pre-Ampere do not provide any resource isolation with Multi-Instance GPU.

  • Batch size can directly affect latency and throughput. To better utilize the compute resources, batch size needs to be increased. However, there is a tradeoff between latency and throughput: larger batch sizes can increase throughput but result in higher latency as well. Batch size can be set in Torchserve in two ways: either through the model config in config.properties or while registering the model using the Management API (see the sketch after this list).
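As an illustration of the second option, here is a hedged sketch of registering a model with batching enabled through the Management API (default port 8081). The archive name my_model.mar and the specific parameter values are placeholders, not the settings used in this post.

import requests

# Register an archived model with server-side (dynamic) batching enabled
resp = requests.post(
    "http://127.0.0.1:8081/models",
    params={
        "url": "my_model.mar",      # archive already placed in the model store
        "batch_size": 8,            # server-side batch size
        "max_batch_delay": 100,     # max milliseconds to wait to fill a batch
        "initial_workers": 1,       # worker processes to start
    },
)
print(resp.status_code, resp.text)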

In the next section, we are going to use Torchserve benchmark suite to decide the best combination of model optimization, hardware, workers, and batch size.

Animated Drawings Performance Tuning

To use the Torchserve benchmark suite, first we need to have an archived “.mar” file, as discussed above, that contains the model, handler, and all other artifacts needed to load and run inference. Animated Drawings uses Detectron2’s implementation of Mask-RCNN for its object detection model.

How to run benchmark suite

The automated benchmark suite in Torchserve lets you benchmark multiple models with different settings, including batch size and number of workers, and finally generates a report for you. To get started:

git clone https://github.com/pytorch/serve.git

cd serve/benchmarks

pip install -r requirements-ab.txt

apt-get install apache2-utils

Model level settings can be configured in a yaml file similar to the following:


Model_name:
    eager_mode:
        benchmark_engine: "ab"
        url: "Path to .mar file"
        workers:
            - 1
            - 4
        batch_delay: 100
        batch_size:
            - 1
            - 2
            - 4
            - 8
        requests: 10000
        concurrency: 10
        input: "Path to model input"
        backend_profiling: False
        exec_env: "local"
        processors:
            - "cpu"
            - "gpus": "all"

This yaml file will be referenced in the benchmark_config_template.yaml file, which includes other settings for generating reports; it can optionally work with AWS CloudWatch for logs as well.

python benchmarks/auto_benchmark.py --input benchmark_config_template.yaml

After running the benchmarks, results are written to a CSV file that can be found at /tmp/benchmark/ab_report.csv, with a full report at /tmp/ts_benchmark/report.md. The report includes items such as Torchserve average latency, model P99 latency, throughput, concurrency, number of requests, handler time, and some other metrics. Here we focus on some of the important ones that we track to tune the performance: concurrency, model P99 latency, and throughput. We look at these numbers specifically in combination with batch size, the device used, the number of workers, and whether any model optimization has been done.

The latency SLA for this model has been set to 100 ms. This is a real-time application, and as we discussed earlier, latency is more of a concern; throughput should ideally be as high as possible while not violating the latency SLA.

Searching the space over different batch sizes (1-32), numbers of workers (1-16), and devices (CPU, GPU), we ran a set of experiments; the best ones are summarized in the table below.

Device Concurrency # Requests #workers Batch size Payload/image Optimization Throughput Latency P99
CPU 10 1000 1 1 small N/A 3.45 305.3 ms
CPU 1 1000 1 1 small N/A 3.45 291.8 ms
GPU 10 1000 1 1 small N/A 41.05 25.48 ms
GPU 1 1000 1 1 small N/A 42.21 23.6 ms
GPU 10 1000 1 4 small N/A 54.78 73.62 ms
GPU 10 1000 1 4 small model.half() 78.62 50.69 ms
GPU 10 1000 1 8 small model.half() 85.29 94.4 ms

The latency of this model on CPU, with all of the settings tried in terms of batch size, concurrency, and number of workers, did not meet the SLA; in fact, it was ~13x higher.

Moving the model serving to GPU could immediately improve the latency ~13x, from 305 ms down to 23.6 ms.

One of the simplest optimizations that we could do for the model was lowering its precision to fp16; it is a one-liner (model.half()) and could reduce the model P99 latency by 32% and increase the throughput by almost the same amount.

Other optimizations, such as Torchscripting the model and using optimize_for_inference, or other tricks including ONNX or TensorRT runtime optimizations which leverage aggressive fusions, are out of the scope of this post. We will discuss model optimizations in a separate post.

We found that on both CPU and GPU, setting number of workers = 1 worked best in this case.

  • Moving the model to GPU, using number of workers = 1 and batch size = 1, increased the throughput ~12x compared to CPU and improved the latency ~13x.
  • Moving the model to GPU, using model.half(), number of workers = 1, and batch size = 8 yielded the best results in terms of throughput and tolerable latency. Throughput increased ~25x compared to CPU, with latency still meeting the SLA (94.4 ms).

Note: if you are running the benchmark suite, make sure you set a proper batch_delay and set the concurrency of the requests to a number proportional to your batch size. Concurrency here means the number of concurrent requests being sent to the server.

Conclusion

In this post, we have discussed the considerations and knobs that Torchserve exposes to tune performance in production. We have discussed the Torchserve benchmark suite as a means to tune performance and get insights on possible choices for model optimizations, hardware choice, and cost in general. We used the Animated Drawings app, which uses Detectron2’s Mask-RCNN model, as a case study to showcase performance tuning with the benchmark suite.

For more details on performance tuning in TorchServe, please refer to our documentation here.
Also feel free to open a ticket on the TorchServe repo for any further questions and feedback.

Acknowledgement

We would like to thank Somya Jain (Meta), Christopher Gustave (Meta) for their great support and guidance throughout many steps of this blog and providing insights to Sketch Animator workflow. Also, special thanks to Li Ning from AWS for the great efforts to make performance tuning much easier on Torchserve with automated benchmark suite.

Read More

Scaling Vision Model Training Platforms with PyTorch

Scaling Vision Model Training Platforms with PyTorch

TL;DR: We demonstrate the use of PyTorch with FairScale’s FullyShardedDataParallel (FSDP) API in writing large vision transformer models. We discuss our techniques for scaling and optimizing these models on a GPU cluster. The goal of this platform scaling effort is to enable research at scale. This blog does not discuss model accuracy, new model architectures, or new training recipes.

1. Introduction

Latest vision research [1, 2] demonstrates model scaling as a promising research direction. In this project, we aim to enable our platforms to train massive vision transformer (ViT) [3] models. We present our work on scaling the largest trainable ViT from 1B to 120B parameters in FAIR vision platforms. We wrote ViT in PyTorch and leveraged its support for large-scale, distributed training on a GPU cluster.

In the rest of this blog, we will first discuss the main challenges, namely scalability, optimization, and numerical stability. Then we will discuss how we tackle them with techniques including data and model parallelism, automatic mixed precision, kernel fusion, and bfloat16. Finally, we present our results and conclude.

2. Main Challenges

2.1 Scalability

The key scalability challenge is to efficiently shard a model’s operations and state across multiple GPUs. A 100B parameter model requires ~200GB of RAM just for parameters, assuming fp16 representation. So, it is impossible to fit the model on a single GPU (A100 has at most 80GB RAM). Therefore, we need some way to efficiently shard a model’s data (input, parameters, activations, and optimizer state) across multiple GPUs.

Another aspect of this problem is to scale without significantly changing the training recipe. For example, certain representation learning recipes use a global batch size of up to 4096, beyond which we start to see accuracy degradation. We cannot scale to more than 4096 GPUs without using some form of tensor or pipeline parallelism.

2.2 Optimization

The key optimization challenge is to maintain high GPU utilization even as we scale the number of model parameters and flops. When we scale models to teraflops and beyond, we start to hit major bottlenecks in our software stack that super-linearly increase training time and reduce accelerator utilization. We require hundreds or thousands of GPUs to run just a single experiment. Improvements in accelerator utilization can lead to significant reductions in cost and improve fleet utilization. It enables us to fund more projects and run more experiments in parallel.

2.3 Numerical Stability

The key stability challenge is to avoid numerical instability and divergence at large scale. We empirically observed in our experiments that the training instability gets severe and hard to deal with when we scale up model sizes, data, batch sizes, learning rate, etc. Vision Transformers particularly face training instability even at a lower parameter threshold. E.g., we find it challenging to train even ViT-H (with just 630M parameters) in mixed-precision mode without using strong data augmentation. We need to study the model properties and training recipes to make sure that the models train stably and converge.

3. Our Solutions

Figure 1 depicts our solutions to each of the challenges.

3.1 Addressing scaling challenges with data parallelism and model parallelism

We apply various forms of data and model parallelism to enable fitting very large models in GPU memory.

We use FairScale’s FullyShardedDataParallel (FSDP) API [4], based on PyTorch, to shard parameters, gradients, and optimizer state across multiple GPUs, thereby reducing the memory footprint per GPU. This process consists of the following three steps:

  • Step 1: We wrapped the entire model in a single FSDP instance. This shards the model parameters at the end of a forward pass and gathers parameters at the beginning of a forward pass. This enabled us to scale ~3x from 1.5B to 4.5B parameters.

  • Step 2: We experimented with wrapping individual model layers in separate FSDP instances. This nested wrapping further reduced the memory footprint by sharding and gathering parameters of individual model layers instead of an entire model. The peak memory is then determined by an individually wrapped transformer block in GPU memory in this mode instead of the entire model.

  • Step 3: We used activation-checkpoint to reduce the memory consumption by activations. It saves the input tensors and discards the intermediate activation tensors during the forward pass. These are recomputed during the backward pass.

In addition, we experimented with model-parallelism techniques such as pipeline parallelism [5], which allow us to scale to more GPUs without increasing the batch size.
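As a rough sketch, nested FSDP wrapping with activation checkpointing can look like the following; the imports and the block class are illustrative (exact FairScale import paths vary by version), and a distributed process group is assumed to be initialized:

import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn import checkpoint_wrapper  # import path may differ across FairScale versions

class ShardedViT(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        # Steps 2 and 3: each transformer block becomes its own FSDP unit,
        # with activation checkpointing applied inside the wrap.
        self.blocks = nn.ModuleList(
            FSDP(checkpoint_wrapper(block)) for block in blocks
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# Step 1: the whole model is additionally wrapped in an outer FSDP instance, e.g.
# model = FSDP(ShardedViT(blocks))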

3.2 Addressing optimization challenges with advanced AMP and kernel fusion

Advanced AMP

Automatic Mixed Precision (AMP) [6] training refers to training models using a lower precision of bits than FP32 or the default but still maintaining accuracy. We experimented with three levels of AMP as described below:

  • AMP O1: This refers to training in mixed precision where weights are in FP32 and some operations are in FP16. With AMP O1, the ops that might impact accuracy remain in FP32 and are not autocasted to FP16.

  • AMP O2: This refers to training in mixed precision but with more weights and ops in FP16 than in O1. Weights do not implicitly remain in FP32 and are cast to FP16. A copy of the master weights is maintained in the FP32 precision that is used by the optimizer. If we want the normalization layer weights in FP32 then we need to explicitly use layer wrapping to ensure that.

  • Full FP16: This refers to training in full FP16 where weights and operations are in FP16. FP16 is challenging to enable for training due to convergence issues.

We found that AMP O2 with LayerNorm wrapping in FP32 leads to the best performance without sacrificing accuracy.
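For reference, O1-style mixed precision corresponds roughly to PyTorch’s native autocast plus gradient scaling; a minimal, self-contained sketch with a toy model:

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()          # toy stand-in for a ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                            # toy training loop
    x = torch.randn(8, 1024, device="cuda")
    y = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # FP16 where safe, FP32 elsewhere
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()              # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()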

Kernel Fusion

  • To reduce GPU kernel launch overhead and increase GPU work granularity, we experimented with kernel fusions, including fused dropout and fused layer-norm, using the xformers library [7].

3.3 Addressing stability challenges by studying ops numerical stability and training recipes

BFloat16 in general but with LayerNorm in FP32

The bfloat16 (BF16) [8] floating-point format provides the same dynamic range as FP32 with a memory footprint identical to FP16. We found that we could train models in the BF16 format using the same set of hyperparameters as in FP32, without special parameter tuning. Nevertheless, we found that we need to keep LayerNorm in FP32 mode in order for the training to converge.

3.4 Final training recipe

A summary of the final training recipe.

  1. Wrap the outer model in an FSDP instance. Enable parameter sharding after the forward pass.
  2. Wrap individual ViT blocks with activation checkpointing, nested FSDP wrapping, and parameter flattening.
  3. Enable mixed precision mode (AMP O2) with bfloat16 representation. Maintain the optimizer state in FP32 precision to enhance numerical stability.
  4. Wrap normalization layers like LayerNorm in FP32 for better numerical stability.
  5. Maximize Nvidia Tensor Core utilization by keeping matrix dimensions multiples of 8. For more details, check the Nvidia Tensor Core Performance Guide.

4. Results

In this section, we show the scaling results of ViT on three types of tasks: (1) image classification, (2) object detection (3) video understanding. Our key result is that we are able to train massive ViT backbones across these vision tasks after applying the discussed scaling and optimization techniques. This enables vision research at a much larger scale. We trained the models to convergence to verify that we maintain the current baselines even with all the optimizations. A common trend in Figures 2, 3, 4 is that we are able to train up to 25B-param models with an epoch time of less than 4 hours on 128 A100 GPUs. The 60B and 120B models are relatively slower to train.

Figure 2 shows the image-classification scaling result. It plots the epoch time for training ViTs on ImageNet using 128 A100-80GB GPUs with different model sizes.

Figure 2: Image-classification scaling result.

Figure 3 shows the object-detection scaling result. It plots the epoch time for training ViTDet [9] with different ViT backbones on COCO using 128 A100-80GB GPUs.

Figure 3: Object-detection scaling result.

Figure 4 shows the video-understanding scaling result. It plots the epoch time for training MViTv2 [10] models on Kinetics 400 [11] using 128 V100 (32 GB) GPUs in FP32.

Figure 4: Video-understanding scaling result.

Figure 5 shows the optimization result with the ViT-H model in Figure 2 on 8 A100-40GB GPUs.
Three versions are used: (1) the baseline uses PyTorch’s DDP [12] with AMP O1, (2) FSDP + AMP-O2 + other optimizations, and (3) FSDP + FP16 + other optimizations. These optimizations altogether speed up the training by up to 2.2x.

Figure 5: Training speedups from various optimizations.

5. Concluding Remarks

We have demonstrated the use of PyTorch with FairScale’s FullyShardedDataParallel (FSDP) API in writing large vision transformer models. We discuss our techniques for scaling and optimizing these models on a GPU cluster. We hope that this article can motivate others to develop large-scale ML models with PyTorch and its ecosystem.

References

[1] Masked Autoencoders Are Scalable Vision Learners

[2] Revisiting Weakly Supervised Pre-Training of Visual Perception Models

[3] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

[4] fairscale.nn.FullyShardedDataParallel

[5] Pipeline parallelism in PyTorch

[6] Automatic Mixed Precision (AMP) in PyTorch

[7] xformers

[8] The bfloat16 numerical format

[9] Exploring Plain Vision Transformer Backbones for Object Detection

[10] MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

[11] https://www.deepmind.com/open-source/kinetics

[12] Getting Started with Distributed Data Parallel (DDP)

Read More

Efficient Large-Scale Training with Pytorch FSDP and AWS

Efficient Large-Scale Training with Pytorch FSDP and AWS

Cutting-edge AI models are becoming extremely large. The cost and overhead of training these models is increasing rapidly, and involves large amounts of engineering and guesswork to find the right training regime. FSDP reduces these costs significantly by enabling you to train much larger models with the same amount of resources. FSDP lowers the memory footprint on your GPUs, and is usable via a lightweight configuration that requires substantially less effort, typically with just a few lines of code.

The main performance gains in FSDP come from maximizing the overlap between network communication and model computation, and eliminating the memory redundancy inherent in traditional data parallel training (DDP). PyTorch FSDP can train models approximately 4x larger on the same server resources as DDP and 20x larger if we combine activation checkpointing and activation offloading.

Since PyTorch 1.12, FSDP is now in beta status, and has added a number of new features that can be tuned to further accelerate your model training.

In this series of blog posts, we will explain multiple performance optimizations you can run with FSDP to boost your distributed training speed and model sizes within the context of your available server resources. We use the HuggingFace T5 3B, 11B and DeepVit, in fine-tuning mode, as the running examples throughout the series.

As a preview of some of the optimizations discussed in this series, we show the before and after performance scaled in Flops below (Note that these results can vary based on your server resources and model architecture).

*T5 3B Performance measured on AWS A100 and A10 servers. Original with no optimizations and Tuned with the applied optimization

*T5 11B Performance measured on A100 servers. Original with no optimizations and Tuned with the applied optimization

In this first post, we will provide a quick overview of FSDP and how it can make training large-scale AI models more efficient. We will briefly highlight the multiple performance options available and dive deeper into their details in upcoming posts. We will then conclude with an overview of how to leverage AWS ParallelCluster for large-scale training with FSDP.

Optimization                  | T5 Model | Throughput Improvement
Mixed Precision               | 3 B      | 5x
Mixed Precision               | 11 B     | 10x
Activation Checkpointing (AC) | 3 B      | 10x
Activation Checkpointing (AC) | 11 B     | 100x
Transformer Wrapping Policy   | 3 B      | 2x
Transformer Wrapping Policy   | 11 B     | Unable to run the experiment without the transformer wrapping policy
Full Shard Strategy           | 3 B      | 1.5x
Full Shard Strategy           | 11 B     | Not able to run with Zero2

Performance optimization gains on T5 models over non-optimized.

In our experiments with the T5 3B model, using the transformer wrapping policy resulted in >2x higher throughput measured in TFLOPS versus the default wrapping policy. Activation checkpointing resulted in 10x improvement by reinvesting the freed memory from the checkpoints into larger batch size. Mixed precision with BFloat16 resulted in ~5x improvement versus FP32 and finally the full sharding strategy versus zero2 (DDP) resulted in 1.5x improvement.

We ran similar experiments for a larger model, T5 11B, but the larger model size resulted in some changes to the experiment space. Specifically, we found that two optimizations, the transformer wrapping policy and activation checkpointing, were needed to enable us to run these experiments on 3 nodes (each node with 8 A100 GPUs with 80 GB of memory). With these optimizations, we could fit a batch size of 50 and get higher throughput compared to removing either one of them. Thus, rather than turning a single optimization on and off as with the 3B model, the larger model experiments were done by toggling one of the three optimizations while always keeping the other two on, so that a usable batch size was available in both test states for each item.

Based on TFLOP comparisons, with the 11B model we saw even more payoff from the optimizations. Mixed precision (~10x improvement) and activation checkpointing (~100x improvement) had a much larger impact with the 11B model than with the 3B parameter model. With mixed precision we could fit ~2x larger batch sizes, and with activation checkpointing >15x larger batch sizes (from 3 with no activation checkpointing to 50 with it), which translated into large throughput improvements.

We also observed that for these larger models (> 3B), using the Zero2 sharding strategy left minimal room in memory for the batch data, forcing very small batch sizes (e.g. 1-2); this essentially makes the full sharding strategy a necessity for fitting larger batch sizes.

Note – this tutorial assumes a basic understanding of FSDP. To learn more about basics of FSDP please refer to the getting started and advanced FSDP tutorials.

What is FSDP? How does it make Large-Scale Training More Efficient

FSDP expands upon distributed data parallel, by parallelizing not just data, but the model parameters, the optimizer states and gradients associated with the model. Specifically – each GPU only stores a subset of the entire model and the associated subset of optimizer states and gradients.

To show the evolution of distributed training, we can start from the beginning, where AI models were simply trained on a single GPU.

DDP (Distributed Data Parallel) was the initial step up from training with only a single GPU, and was an effort to address the data and model size growth, where multiple GPUs each housed their own copy of the same model. The gain here is that the data for each batch could be split and processed independently on each GPU, all at the same time, thus parallelizing the processing of the data set and increasing training speed with the number of GPUs. The tradeoff is the need to communicate the gradients between each GPU to synchronize the models after the backward pass.

FSDP expands on scaling models by removing the redundancy of optimizer calculations and state storage, as well as the gradient and memory storage of model parameters, that are present in DDP. This redundancy reduction, along with increased communication overlap, where model parameter communication takes place at the same time as model computation, is what allows FSDP to train much larger models with the same resources as DDP.

A key point is that this efficiency also allows for AI models that are larger than a single GPU to be trained. The model size available for training is now increased to the aggregate memory of all GPUs, rather than the size of a single GPU. (And as a point of note, FSDP can go beyond aggregated GPU memory by leveraging CPU memory as well, though we will not directly cover this aspect here).

As discussed in a previous blog post, with DDP the largest model that we could train on 32 A100 GPUs with 40 GB memory (4 nodes) was up to 3B parameters with a batch size of 128, with the help of activation checkpointing. By contrast, using FSDP we were able to train up to an 81B model size, combining activation checkpointing with activation and parameter offloading. In another experiment, we benchmarked a 1T parameter model with FSDP using 512 GPUs.

For intuition on the parameter level workings of FSDP, below we show an animation detailing how the model parameters are sharded and communicated assuming a two GPU scenario and a simple 8 parameter model:

Above – the animations walk through the steps involved in the initial sharding of the model amongst ranks, as we start the all_gathers and the forward pass.

We continue through the model with the forward pass. After each FSDP unit completes, non-locally owned params are dropped to free memory, and optionally activations can be checkpointed. This continues until we finish the forward pass and compute the loss.

During the backward pass, another all_gather is used to load the parameters and the gradients are computed. These gradients are then reduce_scattered so that the local owners of each param can aggregate and prepare to update the weights.

Finally, each rank passes the summed gradients through the optimizer states and updates the weights to complete the mini-batch.

With the model now distributed across the entire set of available GPUs, the logical question is how data moves through the model given this sharding of model parameters.

This is accomplished by FSDP coordinating with all GPUs to effectively share (communicate) the respective parts of the model. The model is decomposed into FSDP units and parameters within each unit are flattened and then sharded across all GPUs. Within each FSDP unit, GPU’s are assigned interleaving ownership of individual model parameters.

By interleaving, we mean the following – assuming 2 gpus with an id of 1 and 2, the FSDP unit ownership pattern would be [12121212], rather than a contiguous chunk of [111222].

During training, an all_gather is initiated and the locally owned model parameters within a FSDP unit are shared by the owner GPU with the other non-owners, when they need it, on a ‘just in time’ type basis. FSDP prefetches parameters to overlap all_gather communication with computation.

When those requested parameters arrive, the GPU uses the delivered parameters, in combination with the parameters it already owns, to create a fully populated FSDP unit. Thus there is a moment where each GPU hits peak memory usage while holding a fully populated FSDP unit.

It then processes the data through the FSDP unit and drops the parameters it received from other GPUs to free up memory for the next unit. The process continues over and over, proceeding through the entire model to complete the forward pass. The process is then repeated (in general) for the backward pass. (Note – this is a simplified version for understanding; there is additional complexity, but this should help construct a basic mental model of the FSDP process.)

This eliminates much of the memory redundancy present in DDP, but imposes the cost of higher amounts of network communication to shuttle these requested parameters back and forth amongst all the GPUs. Overlapping the communication timing with the computation taking place is the basis of many of the performance improvements we’ll discuss in this series. The key gains are frequently based on the fact that communication can often take place at the same time as computation. As you can surmise, having high communication speed is vital for FSDP performance.

How do I optimize my training with FSDP?

There are four main performance improvements we will cover – the transformer wrapper, activation checkpointing, mixed precision, and selecting the proper sharding strategy. The flowchart below will help as a checklist for tuning options that we will discuss in this post.

Wrapping policy – for transformers, use Transformer wrapping policy

The first performance optimization is leveraging the FSDP transformer wrapper for transformer models.

One of the pre-defined wrapping policies is size_based_auto_wrap_policy. With this policy, FSDP traverses the module structure from bottom to top and creates a new FSDP unit once the current unit has at least the min_num_params specified in the size policy (this defaults to 1e8, or 100M). If a module cannot be created as an FSDP unit, FSDP continues to check its parent module. This size-based wrapping policy may not be ideal for some model structures; the PyTorch distributed team is actively working on a new default wrapping policy for the next release, based on size as well as module execution order, so that users can simply tune the size and achieve optimized performance.
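A minimal sketch of the size-based policy (toy model, and assuming the process group is already initialized, e.g. via torchrun):

import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Toy model standing in for a real transformer.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(12)]).cuda()

# Start a new FSDP unit whenever the accumulated parameter count reaches min_num_params.
size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e8))

model = FSDP(
    model,
    auto_wrap_policy=size_policy,
    device_id=torch.cuda.current_device(),
)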

In the current release, you can greatly improve your performance when running Transformer models by using the ‘transformer wrapper’. You will need to provide the appropriate layer class for your model. Here, layer class is the class that houses the Multi-Head Attention and Feed Forward Network.

FSDP will then form the FSDP units around the layer class rather than arbitrary breaks based on parameter size. By sharding the model around layer classes that are uniformly repeated within the transformer, FSDP can create uniform FSDP units that better balance the overlap of computation and communication. By contrast, size-based wrapping can produce very uneven or skewed shards for models, which then have uneven matching of compute vs communication overlap. As discussed earlier, the main driver of FSDP high performance is the overlap of communication and computation, which is why the Transformer wrapper provides improved performance. Note that the Transformer wrapper can also be used for non-transformer models if these models have a list of uniform layers.

Let’s compare the performance difference on a T5, 3B parameter model when running under the default wrapper and the transformer wrapper.

For default wrapping, we don’t need to take any action – we simply pass the model to FSDP as shown:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
)

In this case FSDP will simply wrap the whole model in a single FSDP unit.

Running on an NVIDIA A100-SXM4–40GB with 8 GPUs, we are able to reach 2.3 TFlops and 95% GPU memory utilization with a batch size of 14.

However, since T5 is a transformer model, we are better served to leverage the transformer wrapper for this model.

To use that, we need to isolate the layer class for the transformer, and then pass it in to create our transformer wrapper.

from transformers.models.t5.modeling_t5 import T5Block

And now we can create our Transformer wrapper:

import functools
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

transformer_auto_wrapper_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={
        T5Block,  # < ---- Your Transformer layer class
    },
)

With our model aware wrapper ready, we can initialize FSDP:

# invoke FSDP with your transformer wrapper policy:
model = FSDP(
    model,
    auto_wrap_policy=transformer_auto_wrapper_policy,
    device_id=torch.cuda.current_device(),  # streaming init
)

Running this wrapped model, we can see some substantial performance gains. We can fit nearly double the batch size, going to 28, and with better memory and communication efficiency, we see TFlops increase from 2.3 to 5.07.

Thus, we’ve increased our training throughput to over 200% of the default-wrapper baseline (2.19x) by providing greater model info to FSDP! The transformer wrapping policy results in more fine-grained and balanced FSDP units, each holding a layer class, which leads to a more effective communication-computation overlap.

Above: Graphical comparison of TFlops based on wrapper type

If you are training a Transformer model, it pays to configure your training with FSDP using the transformer wrapper. For more information on how to isolate your layer class, please see our in depth video on Transformer wrapping here, where we walk through a number of transformers showing where the layer class can be found.

Mixed precision – use BF16 if you have an Ampere architecture GPU

FSDP supports a flexible mixed precision policy that gives you granular control over parameters, gradients and buffer data types. This lets you easily leverage BFloat16 or FP16 to increase your training speed by up to 70%.

*Note that BFloat16 is only available on Ampere type GPUs. On AWS this is available with p4dn and g5 instances.

By way of comparison, we can show a 77% speed improvement when comparing fully tuned BFloat16 vs FP32 on an 8B DeepVit model.

We have obtained even greater acceleration using BFloat16 in fine-tuning a 3B HuggingFace T5 model as shown in the figures below. We observed that because of the lower precision the validation loss of BFloat16 is slightly behind in the first few epochs, but it is able to catch up and results in the same final accuracy as FP32.

To use mixed precision, we create a policy with our desired data types, and pass it in during the FSDP initialization.

To create our policy, we need to import the MixedPrecision class, and then define our custom policy using our customized class:

from torch.distributed.fsdp import MixedPrecision

bfSixteen = MixedPrecision(
    # Parameter precision.
    param_dtype=torch.bfloat16,
    # Gradient communication precision.
    reduce_dtype=torch.bfloat16,
    # Buffer precision.
    buffer_dtype=torch.bfloat16,
)
model = FSDP(
    model,
    auto_wrap_policy=transformer_auto_wrapper_policy,
    mixed_precision=bfSixteen,
)

You can mix and match the precision for parameters, gradients and buffers as you prefer:

comboPolicy = MixedPrecision(
        # Param precision
        param_dtype=torch.bfloat16,
        # Gradient communication precision.
        reduce_dtype=torch.float32,
        # Buffer precision.
        buffer_dtype=torch.float32,
    )

For training with FP16, you will need to also use the ShardedGradScaler, which we will cover in subsequent posts. For BFloat16, it is a drop-in replacement.
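As a minimal sketch of what that looks like (the training-loop pieces here are placeholders; we will cover the details in later posts):

import torch
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

def train_step(model, batch, labels, loss_fn, optimizer, scaler):
    # Same scale/step/update pattern as torch.cuda.amp.GradScaler,
    # but aware of FSDP's sharded gradients.
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

scaler = ShardedGradScaler()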

AnyPrecision Optimizer – going beyond mixed precision with full BF16 training

Mixed precision training, both in FSDP and elsewhere, maintains the working weights in the reduced datatype (BF16 or FP16) while keeping the master weights in full FP32. The reason for the master weights in FP32 is that running in pure BF16 will result in ‘weight stagnation’, where very small weight updates are lost due to the lower precision, and the accuracy flatlines over time while FP32 weights can continue to improve from these small updates.

In order to resolve this dilemma, we can use the new AnyPrecision optimizer available in TorchDistX (Torch Distributed Experimental) that allows you to successfully train and keep the master weights in pure BF16 instead of FP32. In addition, unlike the typical storage of optimizer states in FP32, AnyPrecision is able to maintain states in pure BF16 as well.

AnyPrecision enables pure BF16 training by maintaining an extra buffer that tracks the precision lost during the weight updates and re-applies that during the next update…effectively resolving the weight stagnation issue without requiring FP32.

As a comparison of the throughput gains available with pure BF16 training using AnyPrecision, we ran experiments using FSDP with the T5 11B model with regular FP32 training, Mixed Precision training with BF16, and pure BF16 training using the AnyPrecision optimizer on 3 nodes with A100 gpus as mentioned previously.

As shown above, training with AnyPrecision and pure BF16 resulted in 2x the throughput vs Mixed Precision, and over 20x improvement vs FP32.

The potential tradeoff is the impact on final accuracy – in the cases we tested, the accuracy was equal or better than FP32 due to a regularization effect from the slightly reduced precision, but your results may vary.

The AnyPrecision optimizer is available for you to test with here, and is a drop-in replacement for the AdamW optimizer.
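As an illustration, swapping it in might look like the following; the class name AnyPrecisionAdamW and its keyword arguments are our best guess at the TorchDistX API and should be verified against the version you install:

import torch
import torch.nn as nn
# Assumed import path; verify against your installed torchdistx release.
from torchdistx.optimizers import AnyPrecisionAdamW

model = nn.Linear(1024, 1024)                 # placeholder for your FSDP-wrapped model
optimizer = AnyPrecisionAdamW(
    model.parameters(),
    lr=4e-4,
    weight_decay=0.01,
    momentum_dtype=torch.bfloat16,            # keep optimizer states in BF16
    variance_dtype=torch.bfloat16,
    use_kahan_summation=True,                 # compensation buffer for precision lost in updates
)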

Activation checkpointing – increasing throughput by trading compute for memory

FSDP supports activation checkpointing once the model has been sharded, and makes it easy to implement. The graph above shows ~4x throughput improvement using activation checkpointing.

Activation checkpointing is where the intermediate activations are freed during the forward pass, and a checkpoint is left as a placeholder. This generally increases available GPU memory by over 30%.

The tradeoff is that during the backward pass, these previously removed intermediate activations must be re-calculated again using information in the checkpoint (duplicate compute), but by leveraging the increased GPU memory, one can increase the batch size such that the net throughput can increase substantially.

# verify we have FSDP activation support ready by importing:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
   checkpoint_wrapper,
   CheckpointImpl,
   apply_activation_checkpointing_wrapper,
)

To implement activation checkpointing, we first import the FSDP checkpointing functions as shown above. We then declare a non-reentrant checkpoint wrapper and create a check function to identify which layers to wrap, as follows:

from functools import partial

non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
# Only wrap the transformer layer class (T5Block, imported earlier) with checkpointing.
check_fn = lambda submodule: isinstance(submodule, T5Block)
apply_activation_checkpointing_wrapper(
    model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
)

Important note – this must be run after the model has been initialized with FSDP.

Hopefully you’ve seen how some initial tuning with FSDP options can have a large impact on your training performance.

With that, we turn our attention from how to scale within FSDP, to how to scale your server hardware for FSDP using AWS.

Large Scale Training with FSDP on AWS – For multi-node, prioritize a high-speed network

AWS provides several services that can be used to run distributed training with FSDP: Amazon EC2 Accelerated Computing instances, AWS ParallelCluster, and Amazon SageMaker.

In this series of blog posts, we used Amazon EC2 p4d instances in a single-instance multi-GPU configuration and in a multi-instance configuration using AWS ParallelCluster and SageMaker in order to run our training jobs.

Here, we’ll focus specifically on AWS ParallelCluster and provide an overview of how to utilize it for training purposes.

AWS ParallelCluster Setup

AWS ParallelCluster is an open source, cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS. AWS ParallelCluster uses yaml configuration files to provision all the necessary resources. It also supports multiple instance types, job submission queues, shared file systems like Amazon EFS (NFS) or Amazon FSx for Lustre, and job schedulers like AWS Batch and Slurm.

Workflow on Clusters

The high level idea is to have a cluster that has a head node which controls the compute nodes. The actual training job runs on the compute nodes. Overall steps to run a training job on a cluster are as follows:

  1. Set up an AWS ParallelCluster (we discuss this below)
  2. Connect to the head node, import the training code, and set up the environment.
  3. Pull the data and place it in a shared folder that compute nodes can access (FSx Lustre drive).
  4. Run the training job using a job scheduler (in this case Slurm).

Setup AWS ParallelCluster

To set up AWS ParallelCluster:

  1. Deploy a network stack. This step is optional since you could use your account default VPC and let AWS ParallelCluster create your subnets and security groups. However, we prefer to compartmentalize our desired network infrastructure and do this deployment via a CloudFormation stack.

    Since we deploy a public and a private subnet, we want to create them into an Availability Zone that contains our target instances, in this case p4d. We consult their availability in the region we use (us-east-1) through the following AWS CLI command:

    aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p4d.24xlarge --region us-east-1 --output table

    We see three availability zones containing p4d instances, and we pick one of them (us-east-1c, yours may be different) when deploying our network stack. This can be done with the AWS Console or the AWS CLI; in our case we use the latter, as follows:

    aws cloudformation create-stack --stack-name VPC-Large-Scale --capabilities CAPABILITY_IAM --template-body file://VPC-Large-Scale.yaml --parameters ParameterKey=SubnetsAZ,ParameterValue=us-east-1c

    CloudFormation will deploy our new VPC, subnets, security groups and endpoints on our behalf. Once done, you can retrieve the IDs of the public and private subnets by querying the stack outputs and the values PublicSubnet and PrivateSubnet.

    For example, using the AWS CLI for the private subnet:

    aws cloudformation describe-stacks --stack-name VPC-Large-Scale --query "Stacks[0].Outputs[?OutputKey=='PrivateSubnet'].OutputValue" --output text

  2. Create the ParallelCluster. The cluster configuration file specifies the resources for our cluster. These resources include the instance type for the head node, the compute nodes, access to S3 buckets, and the shared storage where our data will be located. We will use Amazon FSx for Lustre, which offers a fully managed shared storage service with Lustre.

    Here is an example of a cluster configuration file. We can use the AWS ParallelCluster CLI to create the cluster. Please note that the private and public subnet IDs will need to be replaced by the ones you retrieved earlier. You will be able to control the cluster using the AWS ParallelCluster CLI to start, stop, pause, etc.

    pcluster create-cluster --cluster-name my-hpc-cluster --cluster-configuration cluster.yaml
    
  3. SSH to the head node – once the cluster is ready, we can connect to the head node using SSH, pull our training code, and place the data in the shared storage specified in the cluster configuration file.

    pcluster ssh --cluster-name cluster -i your-key_pair
    
  4. Launch the training job – now that we have the data and training code, we can launch the Slurm job for training. Here is an example of a Slurm script that launches the job using torchrun.

More details on how to set up the cluster are out of the scope of this post; we will have a separate post on it.

What’s next?

With this post we provided a high level overview of FSDP and how it efficiently scales distributed AI training. The flowchart included will help provide a checklist for you to review tuning options discussed such as the transformer wrapper and activation checkpointing.

In the next posts, we will continue with the T5 model and go deeper into each of the topics above, specifically with sharding strategy and other optimizations to provide more insight and details. For now, a good reference for the sharding strategy is in our video tutorial here:

If you have questions or find an issue, please find the authors Less, Hamid and Geeta or open an issue on PyTorch github.

Special thanks to:

Pytorch Distributed team, Shen Li, Rohan Varma, Yanli Zhao, Andrew Gu, Anjali Sridhar, Ana Simoes, Pierre-Yves Aquilanti, Sundar Ranganathan, and the broader AWS team for supporting us with providing infrastructure and technical support for running the large scale experiments.

Resources:

FSDP video series

Getting started with FSDP

Advanced tutorial on FSDP

API documentation

Read More

Scaling PyTorch FSDP for Training Foundation Models on IBM Cloud

Scaling PyTorch FSDP for Training Foundation Models on IBM Cloud

Large model training using a cloud native approach is of growing interest for many enterprises given the emergence and success of foundation models. Some AI practitioners may assume that the only way they can achieve high GPU utilization for distributed training jobs is to run them on HPC systems, such as those inter-connected with InfiniBand, and may not consider Ethernet-connected systems. We demonstrate how the latest distributed training technique, Fully Sharded Data Parallel (FSDP) from PyTorch, successfully scales to models of size 10B+ parameters using commodity Ethernet networking in IBM Cloud.

PyTorch FSDP Scaling

As models get larger, the standard techniques for data parallel training work only if the GPU can hold a full replica of the model, along with its training state (optimizer, activations, etc.). However, GPU memory increases have not kept up with the model size increases and new techniques for training such models have emerged (e.g., Fully Sharded Data Parallel, DeepSpeed), which allow us to efficiently distribute the model and data over multiple GPUs during training. In this blog post, we demonstrate a path to achieve remarkable scaling of model training to 64 nodes (512 GPUs) using PyTorch native FSDP APIs as we increase model sizes to 11B.

What is Fully Sharded Data Parallel?

FSDP extends the distributed data parallel training (DDP) approach by sharding model parameters, gradient and optimizer states into K FSDP units, determined by using a wrapping policy. FSDP achieves large model training efficiency in terms of resources and performance by significantly reducing the memory footprint on each GPU and overlapping computation and communication.

Resource efficiency is achieved with memory footprint reduction by having all GPUs own a portion of each FSDP unit. To process a given FSDP unit, all GPUs share their locally owned portion via all_gather communication calls.

Performance efficiency is accomplished by overlapping all_gather communication calls for upcoming FSDP units with computation of the current FSDP unit. Once the current FSDP unit has been processed, the non-locally owned parameters are dropped, freeing memory for the upcoming FSDP units. This process achieves training efficiency by the overlap of computation and communication, while also reducing the peak memory needed by each GPU.

In what follows, we demonstrate how FSDP allows us to keep hundreds of GPUs highly utilized throughout a distributed training job, while running over standard Ethernet networking (system description towards the end of the blog). We chose the T5 architecture for our experiments and leveraged the code from the FSDP workshop. In each of our experiments, we start with a single node experiment to create a baseline and report the metric seconds/iteration normalized by the batch size as well as compute the teraflops based on the Megatron-LM paper (see Appendix for details of teraflop computation for T5). Our experiments aim to maximize the batch size (while avoiding cudaMalloc retries) to take full advantage of overlap in computation and communications, as discussed below. Scaling is defined as the ratio of the seconds/iteration normalized by batch size for N nodes versus a single node, representing how well we can utilize the additional GPUs as more nodes are added.
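One plausible way to compute the scaling metric just described (treating the exact normalization convention as our reading of the definition, and using purely illustrative numbers):

def scaling_efficiency(sec_per_iter_1node, batch_1node, sec_per_iter_n, batch_n):
    """Batch-normalized seconds/iteration at 1 node divided by the same at N nodes.
    1.0 means perfect scaling; lower values mean added GPUs are underutilized."""
    return (sec_per_iter_1node / batch_1node) / (sec_per_iter_n / batch_n)

# Illustrative numbers only (not measurements from this blog):
print(scaling_efficiency(sec_per_iter_1node=2.0, batch_1node=20,
                         sec_per_iter_n=2.1, batch_n=20))  # ~0.95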

Experimental Results

Our first set of experiments using the T5-3B configuration (mixed precision with BF16, activation checkpointing, and transformer wrapping policy) demonstrated scaling efficiency of 95% as we increased the number of GPUs from 8 to 512 (1 to 64 nodes, respectively). We achieved these results without any modifications to the existing FSDP APIs. We observed that, for this scale, over Ethernet based network, there is sufficient bandwidth to enable continuous overlap of communication and computation.

However, when we increased the T5 model size to 11B, the scaling efficiency declined substantially to 20%. The PyTorch profiler shows that overlap of communication and computation was very limited. Further investigation into the network bandwidth usage revealed that the poor overlap is being caused by latency in the communication of individual packets and not the bandwidth required (in fact, our peak bandwidth utilization is 1/4th of that available). This led us to hypothesize that if we can increase the compute time by increasing the batch size, we can better overlap communication and computation. However, given we are already at maximum GPU memory allocation, we must identify opportunities to rebalance the memory allocation to allow for increase in batch size. We identified that the model state was being allocated a lot more memory than was needed. The primary function of these reservations is to have pre-reserved memory ready to aggressively send/receive tensors during the communication periods and too few buffers can result in increased wait times, whereas too many buffers result in smaller batch sizes.

To achieve better efficiency, the PyTorch distributed team introduced a new control knob, the rate_limiter which controls how much memory is allocated for send/receive of tensors, alleviating the memory pressure and providing room for higher batch sizes. In our case, the rate_limiter could increase the batch size from 20 to 50, thus increasing compute time by 2.5x and allowing for much greater overlap of communication and computation. With this fix, we increased the scaling efficiency to >75% (at 32 nodes)!
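In recent PyTorch releases this rate-limiting behavior is surfaced, to the best of our knowledge, through the limit_all_gathers flag on the FSDP constructor; a minimal sketch, treating the flag name as an assumption to verify against your PyTorch version (process group assumed initialized, toy model):

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()  # toy stand-in for T5

model = FSDP(
    model,
    limit_all_gathers=True,   # assumed name of the rate-limiter knob; check your release
    device_id=torch.cuda.current_device(),
)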

Continued investigation into the factors limiting scaling efficiency uncovered that the rate limiter was creating a recurring pipeline bubble of GPU idle time. This was due to the rate limiter using a block and flush approach for the allocation and release of each set of memory buffers. By waiting for the entire block to complete before initiating a new all_gather, the GPU was idling at the start of each block, while waiting for the new set of all_gather parameters to arrive. This bubble was alleviated by moving to a sliding window approach. Upon the completion of a single all_gather step and its computation (rather than a block of them), the memory is freed and the next all_gather is immediately issued in a much more uniform manner. This improvement eliminated the pipeline bubble and boosted the scaling efficiencies to >90% (at 32 nodes).

Figure 1: Scaling of T5-XL (3B) and T5-XXL (11B) from 1 node to 64 nodes

Figure 2: TFLOPs/sec usage for T5-XL(3B) and T5-XXL (11B) as we increase number of nodes

IBM Cloud AI System and Middleware

The AI infrastructure used for this work is a large-scale AI system on IBM Cloud consisting of nearly 200 nodes, each node with 8 NVIDIA A100 80GB cards, 96 vCPUs, and 1.2TB CPU RAM. The GPU cards within a node are connected via NVLink with a card-to-card bandwidth of 600GBps. Nodes are connected by 2 x 100Gbps Ethernet links with SRIOV based TCP/IP stack, providing a usable bandwidth of 120Gbps.

The IBM Cloud AI System has been production-ready since May of 2022 and is configured with the OpenShift container platform to run AI workloads. We also built a software stack for production AI workloads that provide end-to-end tools for training workloads. The middleware leverages Ray for pre and post processing workloads and PyTorch for training of models. We also integrate a Kubernetes native scheduler, MCAD, that manages multiple jobs with job queuing, gang scheduling, prioritization, and quota management. A multi-NIC CNI discovers all available network interfaces and handles them as a single NIC pool enabling optimized use of the network interfaces in Kubernetes. Finally, CodeFlare CLI supports a single pane for observability of the full stack using a desktop CLI (e.g., GPU utilization, application metrics like loss, gradient norm).

Figure 3: Foundation Model Middleware Stack

Conclusion and Future Work

In conclusion, we demonstrated how we can achieve remarkable scaling of FSDP APIs over non-InfiniBand networks. We identified the bottleneck that had limited scaling to less than 20% efficiency for 11B parameter model training. After identifying the issue, we were able to correct this with a new rate limiter control to ensure a more optimal balance of reserved memory and communication overlap relative to compute time. With this improvement, we were able to achieve 90% scaling efficiency (a 4.5x improvement), at 256 GPUs and 80% at 512 GPUs for training of the 11B parameter model. In addition, the 3B parameter model scales extremely well with 95% efficiency even as we increase the number of GPUs to 512.

This is an industry first: achieving such scaling efficiencies for up to 11B parameter models using Kubernetes with vanilla Ethernet and PyTorch native FSDP APIs. This improvement enables users to train huge models on a Hybrid Cloud platform in a cost efficient and sustainable manner.

We plan on continuing to investigate scaling with decoder only models and increasing the size of these models to 100B+ parameters. From a system design perspective, we are exploring capabilities such as RoCE and GDR that can improve latencies of communications over Ethernet networks.

Acknowledgements

This blog was possible because of contributions from both PyTorch Distributed and IBM Research teams.

From the PyTorch Distributed team, we would like to thank Less Wright, Hamid Shojanazeri, Geeta Chauhan, Shen Li, Rohan Varma, Yanli Zhao, Andrew Gu, Anjali Sridhar, Chien-Chin Huang, and Bernard Nguyen.

From the IBM Research team, we would like to thank Linsong Chu, Sophia Wen, Lixiang (Eric) Luo, Marquita Ellis, Davis Wertheimer, Supriyo Chakraborty, Raghu Ganti, Mudhakar Srivatsa, Seetharami Seelam, Carlos Costa, Abhishek Malvankar, Diana Arroyo, Alaa Youssef, Nick Mitchell.

Appendix

Teraflop computation

The T5-XXL (11B) architecture has two types of T5 blocks: encoder blocks and decoder blocks. We follow the approach of Megatron-LM, in which each multiplication of an m×k matrix by a k×n matrix requires 2mkn FLOPs. The encoder block consists of self-attention and feed forward layers, whereas the decoder block consists of self-attention, cross-attention, and feed forward layers.

The attention block (both self and cross) consists of a QKV projection, which requires 6Bsh² operations, an attention matrix computation requiring 2Bs²h operations, an attention over values which needs another 2Bs²h operations, and a post-attention linear projection, which requires 2Bsh² operations. Finally, the feed forward layer requires 15Bsh² operations.

The total for an encoder block is 23Bsh² + 4Bs²h, whereas for a decoder block it comes to 31Bsh² + 8Bs²h. With a total of 24 encoder and 24 decoder blocks, 2 forward passes (as we discard the activations), and one backward pass (equivalent to two forward passes), the final FLOPs computation comes to 96 × (54Bsh² + 12Bs²h) + 6BshV. Here, B is the batch size per GPU, s is sequence length, h is hidden state size, and V is vocabulary size.
We repeat a similar computation for the T5-XL (3B) architecture, which is slightly different.
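To make the formula concrete, a small helper that evaluates it is sketched below; the example values for B, s, h, and V are placeholders rather than the exact T5-XXL configuration:

def t5_xxl_flops_per_iteration(B, s, h, V):
    """FLOPs for one training iteration of T5-XXL (24 encoder + 24 decoder blocks),
    following the formula above: 96 × (54Bsh² + 12Bs²h) + 6BshV."""
    return 96 * (54 * B * s * h**2 + 12 * B * s**2 * h) + 6 * B * s * h * V

# Placeholder values for illustration only:
flops = t5_xxl_flops_per_iteration(B=8, s=512, h=4096, V=32128)
print(f"{flops / 1e12:.1f} TFLOPs per iteration")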

Read More

Get Started with PyTorch 2.0 Summary and Overview

Introducing PyTorch 2.0, our first steps toward the next generation 2-series release of PyTorch. Over the last few years we have innovated and iterated from PyTorch 1.0 to the most recent 1.13 and moved to the newly formed PyTorch Foundation, part of the Linux Foundation.

To complement the PyTorch 2.0 announcement and conference, we have also posted a comprehensive introduction and technical overview within the Get Started menu at https://pytorch.org/get-started/pytorch-2.0.

We also wanted to ensure you had all the information to quickly leverage PyTorch 2.0 in your models so we added the technical requirements, tutorial, user experience, Hugging Face benchmarks and FAQs to get you started today!

Finally, we are launching a new “Ask the Engineers: 2.0 Live Q&A” series that allows you to go deeper on a range of topics with PyTorch subject matter experts. We hope this content is helpful for the entire community and for users and contributors at all levels.

https://pytorch.org/get-started/pytorch-2.0

Read More

Accelerating Hugging Face and TIMM models with PyTorch 2.0

torch.compile() makes it easy to experiment with different compiler backends to make PyTorch code faster with a single line decorator, torch.compile(). It works directly over an nn.Module as a drop-in replacement for torch.jit.script(), but without requiring you to make any source code changes. We expect this one line code change to provide you with 30%-2x training time speedups on the vast majority of models that you’re already running.


opt_module = torch.compile(module)

torch.compile supports arbitrary PyTorch code, control flow, mutation and comes with experimental support for dynamic shapes. We’re so excited about this development that we call it PyTorch 2.0.

What makes this announcement different for us is we’ve already benchmarked some of the most popular open source PyTorch models and gotten substantial speedups ranging from 30% to 2x https://github.com/pytorch/torchdynamo/issues/681.

There are no tricks here: we pip installed popular libraries like https://github.com/huggingface/transformers, https://github.com/huggingface/accelerate and https://github.com/rwightman/pytorch-image-models, ran torch.compile() on them, and that’s it.

It’s rare to get both performance and convenience, but this is why the core team finds PyTorch 2.0 so exciting. The Hugging Face team is also excited, in their words:

Ross Wightman the primary maintainer of TIMM: “PT 2.0 works out of the box with majority of timm models for inference and train workloads and no code changes”

Sylvain Gugger, the primary maintainer of transformers and accelerate: “With just one line of code to add, PyTorch 2.0 gives a speedup between 1.5x and 2x in training Transformers models. This is the most exciting thing since mixed precision training was introduced!”

This tutorial will show you exactly how to replicate those speedups so you can be as excited about PyTorch 2.0 as we are.

Requirements and Setup

For GPU (newer generation GPUs will see drastically better performance)

pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

For CPU

pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Optional: Verify Installation

git clone https://github.com/pytorch/pytorch
cd pytorch/tools/dynamo
python verify_dynamo.py

Optional: Docker installation

We also provide all the required dependencies in the PyTorch nightly
binaries which you can download with

docker pull ghcr.io/pytorch/pytorch-nightly

And for ad hoc experiments just make sure that your container has access
to all your GPUs

docker run --gpus all -it ghcr.io/pytorch/pytorch-nightly:latest /bin/bash

Getting started

a toy example

Let’s start with a simple example and make things more complicated step
by step. Please note that you’re likely to see more significant speedups the newer your GPU is.

import torch

def fn(x, y):
    a = torch.sin(x).cuda()
    b = torch.sin(y).cuda()
    return a + b

new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
a = new_fn(input_tensor, input_tensor)

This example won’t actually run faster, but it’s a good educational example featuring torch.sin() (and, similarly, torch.cos()), which are examples of pointwise ops: they operate element by element on a vector. A more famous pointwise op you might actually want to use would be something like torch.relu().

Pointwise ops in eager mode are suboptimal because each one would need to read a tensor from memory, make some changes and then write back those changes.

The single most important optimization that PyTorch 2.0 does for you is fusion.

So, back to our example: we can turn 2 reads and 2 writes into 1 read and 1 write, which is crucial especially for newer GPUs where the bottleneck is memory bandwidth (how quickly you can send data to a GPU) rather than compute (how quickly your GPU can crunch floating point operations).

The second most important optimization that PyTorch 2.0 does for you is CUDA graphs

CUDA graphs help eliminate the overhead from launching individual kernels from a python program.

torch.compile() supports many different backends, but one that we’re particularly excited about is Inductor, which generates Triton kernels (https://github.com/openai/triton) that are written in Python yet outperform the vast majority of handwritten CUDA kernels. Suppose our example above was called trig.py; we can inspect the generated Triton kernels by running:

TORCHINDUCTOR_TRACE=1 python trig.py

@pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
   @triton.jit
   def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
       xnumel = 10000
       xoffset = tl.program_id(0) * XBLOCK
       xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
       xmask = xindex < xnumel
       x0 = xindex
       tmp0 = tl.load(in_ptr0 + (x0), xmask)
       tmp1 = tl.sin(tmp0)
       tmp2 = tl.sin(tmp1)
       tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

And you can verify that fusing the two sins did actually occur because the two sin operations occur within a single Triton kernel and the temporary variables are held in registers with very fast access.

a real model

As a next step let’s try a real model like resnet50 from the PyTorch hub.

import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(1, 3, 64, 64))

If you actually run this, you may be surprised that the first run is slow; that’s because the model is being compiled. Subsequent runs will be faster, so it’s common practice to warm up your model before you start benchmarking it.

You may have noticed how we also passed in the name of a compiler explicitly here with “inductor” but it’s not the only available backend, you can run in a REPL torch._dynamo.list_backends() to see the full list of available backends. For fun you should try out aot_cudagraphs or nvfuser.

Hugging Face models

Let’s do something a bit more interesting now, our community frequently
uses pretrained models from transformers https://github.com/huggingface/transformers or TIMM https://github.com/rwightman/pytorch-image-models and one of our design goals for PyTorch 2.0 was that any new compiler stack needs to work out of the box with the vast majority of models people actually run.

So we’re going to directly download a pretrained model from the Hugging Face hub and optimize it


import torch
from transformers import BertTokenizer, BertModel

# Copy pasted from here https://huggingface.co/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model)  # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)

If you remove the to(device="cuda:0") from the model and encoded_input then PyTorch 2.0 will generate C++ kernels that will be optimized for running on your CPU. You can inspect both Triton or C++ kernels for BERT, they’re obviously more complex than the trigonometry example we had above but you can similarly skim it and understand if you understand PyTorch.

The same code also works just fine if used with https://github.com/huggingface/accelerate and DDP

Similarly let’s try out a TIMM example

import timm
import torch

model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64, 3, 7, 7))

Our goal with PyTorch was to build a breadth-first compiler that would speed up the vast majority of actual models people run in open source. The Hugging Face Hub ended up being an extremely valuable benchmarking tool for us, ensuring that any optimization we work on actually helps accelerate models people want to run.

So please try out PyTorch 2.0, enjoy the free perf and if you’re not seeing it then please open an issue and we will make sure your model is supported https://github.com/pytorch/torchdynamo/issues

After all, we can’t claim we’ve created a breadth-first compiler unless YOUR models actually run faster.

Read More