Startup Showcase Returns to the PyTorch Conference October 21 in San Francisco

The Startup Showcase returns to the PyTorch Conference on Tuesday, October 21, 2025, spotlighting the most promising early-stage teams building real-world AI applications. The program gives founders a high-visibility platform to connect with investors, potential customers, and engineering talent.

Why attend

  • See what’s next: Live, on-stage pitches from startups pushing the boundaries of AI.
  • Meet the builders: Direct access to technical teams and founders.
  • Expand your network: Engage with leading VCs, industry partners, and recruiters.
  • Get inspired: Discover breakthrough ideas at their earliest and most exciting stage.

Join us

PyTorch Conference Startup Showcase

Be there as innovative AI startups pitch live to a panel of top VCs. Whether you’re an engineer, an investor, or simply passionate about cutting-edge AI, this is a front-row seat to the future.

Save the date: Tuesday, October 21, 2025 – San Francisco

Next steps: Register for the PyTorch Conference and learn more about the Showcase.

Founders building the next game-changing AI tool or platform? Apply to pitch.

Calling All Startups

PyTorch startup showcase

Pitch live to leading investors, connect with PyTorch engineers, and raise your visibility across the global AI community.

Selected startups receive:

  • A live 5-minute pitch slot
  • 2 PyTorch Conference Passes
  • Promotion through PyTorch marketing channels
  • Opportunity to network during the Startup Showcase Reception
  • Ability to sponsor a booth in the Startup zone of the conference Expo

Learn more about the Startup Showcase and apply to pitch by September 14, 2025.

Startup Evaluation Criteria

  1. Mission Alignment – Evaluates the extent to which the startup’s vision and focus resonate with the foundational values of innovation and community, as well as the evolving priorities of the PyTorch ecosystem.
  2. Novelty and Differentiation – Considers the distinctiveness of the startup’s concept or technology, emphasizing original thought and the ability to challenge conventional approaches.
  3. Technical Depth and Ecosystem Integration – Assesses the level of technical rigor and how deeply the startup’s solution integrates with AI and whether it leverages projects from the PyTorch ecosystem.
  4. Strategic Viability and Growth Trajectory – Reviews the soundness of the startup’s business logic, market relevance, and potential to scale effectively.
  5. Ecosystem Enrichment – Looks at the startup’s potential to positively influence the broader PyTorch and open-source communities – through contribution, accessibility, or capability expansion.

Calling all VCs

The PyTorch Startup Showcase offers a first look at high-potential startups and industry-leading talent building the next wave of AI and ML innovations. As a sponsor, you’ll play an active role in spotlighting breakthrough technologies and connecting with founders before they scale.

Sponsor perks include:

  • A seat on the judging panel
  • Branding exposure in Startup Showcase marketing and signage
  • Direct engagement with startups during the reception
  • Startup Showcase Application contact list

Last Year’s Startup Showcase

Finalists for the 2024 PyTorch Conference Startup Showcase represented some of the most innovative AI/ML startups in the industry, including Remix Inc., Cartesia, OpenBabylon, Remyx AI, A2 Labs, Inc., QuicSnap, Iso AI, CTGT, and Creao.ai. The winner, CTGT, empowers companies to create customized models using 500x less compute and went on to raise $7M to help enterprises break through the limits of AI compute.

PyTorch Startup Showcase 2024 Winner

Last year, the Showcase was moderated by Chappy Asel of The AI Collective and judges included investors and VCs from Felicis, GitHub, Vertex Ventures, Mayfield, Gradient Ventures, and Andreessen Horowitz.

More information about sponsoring the Startup Showcase is available on the PyTorch Conference website.

We’re looking forward to seeing you at the 2025 Startup Showcase!

A Primer on LLM Post-Training

Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past year or so, we have started to see them do a lot more than just rephrase docs: LLMs can now think before they act, they can plan, they can call tools like a browser, they can write code and check that it works, and a lot more – indeed, the list is growing quickly!

What do all these skills have in common? The answer is that they are all developed in what we call the post-training phase of LLM training. Despite post-training unlocking capabilities that would have looked magical to us a few years ago, it surprisingly gets little coverage compared to the basics of Transformer architectures and pre-training. 

This tutorial was originally written for the Meta infrastructure team, with the target audience of an infra engineer without expertise in LLM modeling who wanted to learn more about post-training to be able to contribute. I believe that this encompasses a large group of engineers: with Reinforcement Learning becoming mainstream, we need new infrastructure to stay productive, so bridging this gap is critical! I now share this broadly with the hope that many more folks across the PyTorch Foundation will share a similar background and interest, and that they will also find this helpful, like our team did.

Primer on post-training

Post-training (sometimes referred to as “alignment”) is a key component of modern LLMs, and the way to “teach” models how to answer in a way that humans like, and how to reason.

Why is post-training different from pre-training, you ask? Post-training primes the model to have a conversation with a user, which follows a set of basic rules such as:

  1. In a conversation, there’s more than one speaker, and they all take turns talking
  2. You should listen before you talk to say something relevant

We find these obvious, but pre-training only does next-word prediction to teach the model about the world: the data there is completely unstructured, so the model never learns these basic rules. Indeed, a model coming out of pre-training is often bad at understanding that it should stop talking after a while and will blabber on forever, kind of like a Google autocomplete box.

Furthermore, it’s also useful to impose some ground rules on the model that take absolute precedence over everything else. This is done in post-training through a system prompt (and/or through Supervised Fine Tuning (SFT)/reward shaping, see later).

Post-training data format

Chatting with these models is possible via some plumbing that happens behind the scenes. Every time you talk in a chat window to a service like ChatGPT, you’ll see a UI like this:

What actually happens is that the post-training structure is plumbed for you, and the model will see something like this (using the data format for Llama 3):

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
… <|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris

<|start_header_id|>user<|end_header_id|>
How many people live there? Tell me just the number<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
START FILLING FROM HERE

Note that the basic interface of the LLM is unchanged: you provide some text, and it will continue it to infinity and beyond. 

What this clever plumbing does is make sure that the model receives all the metadata to know that the previous speakers have spoken (and it should not impersonate them!!), and to stop once the assistant is done talking. Again, the model will happily continue filling, but once we see an <|eot_id|> token (“end of turn”), we stop the model from continuing and send the result back to the user.

Note that the model won’t do any of this on its own: for all the cleverness we perceive, these models remain just text fillers that require this hand-holding.
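
To make that plumbing concrete, here is a toy sketch of the stopping logic. Everything here is illustrative: model.next_token is a hypothetical helper, and eot_token_id stands for whatever id the tokenizer assigns to <|eot_id|>.

def chat_turn(model, prompt_ids, eot_token_id, max_new_tokens=512):
    generated = []
    for _ in range(max_new_tokens):
        next_id = model.next_token(prompt_ids + generated)  # hypothetical: sample one more token
        if next_id == eot_token_id:
            break  # the assistant's turn is over: stop and hand control back to the user
        generated.append(next_id)
    return generated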

This format is a bit hard to read, but you can essentially conceptualize this as something like:

<system> You are a helpful assistant bla bla </system>
<user> What's the weather in Paris? </user>
<assistant> ANSWER HERE

Fun fact: the model will happily play either part! You can totally play the assistant and let it play the user by feeding it the right text structure – the model simply takes over and completes text no matter what you provide. Try it out using a local model or an API (products like ChatGPT will do this plumbing for you and you can’t override them). Obviously, a model is very sensitive to its own format so make sure you use the correct one.
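
If you want to try this, here is a minimal sketch using the Hugging Face transformers library (assuming a chat-tuned Llama 3 checkpoint you have access to); the tokenizer’s chat template inserts the special header and <|eot_id|> tokens shown above for you:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation into the raw text the model actually sees
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

Nothing stops you from appending your own “assistant” messages to the list and letting the model continue as the user.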

Post-training techniques

Post-training is a rapidly-changing field where different teams will use different techniques.

Let’s look at this pipeline as described by the OLMo 2 paper:

The following section goes through each box one by one.

SFT: Supervised Fine Tuning

The focus in SFT is imitation. It’s conceptually simple: you teach the model to forcefully learn an answer, step by step.

If this were chess, you could SFT by training on Magnus Carlsen’s games, where you would forcefully teach the model that it should follow the moves Magnus took at each and every step. You can see the limit of SFT: you can reach Magnus’s level, but he is going to be your ceiling, as you have no way of surpassing him (as opposed to Reinforcement Learning (RL), where you can keep trying until you get really good, see later).

In the case of LLMs, you learn the ideal answer word by word, so your loss function is simply cross-entropy against the output layer, with the ideal “class” being the id of the “correct” word. A common question we get is: isn’t this the same as pre-training, then? It is indeed very similar, with only one crucial difference: you only condition on the prompt; you don’t learn the prompt. Why? Because what you want is to learn how to answer that question, not to learn the question itself.

Remember: unlike pre-training, in post-training we have this structure with a system prompt and user prompt (see the post-training data format right above). These are the parts of the input that we want to condition on, but not learn from. We do this by feeding the whole sequence through (including the system prompt and user prompts as well as any special characters) without any masking, but when we compute the loss, we mask out every token that is not part of the actual response. We don’t mask the prompt in the input, as we do want its contribution to conditioning the model; we mask it in the backward step to prevent it from contributing to the loss.
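
As a minimal sketch (assumed shapes and names, not any particular framework’s API), the masking boils down to setting the label of every prompt token to an ignore value before computing cross-entropy:

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # logits: (seq_len, vocab_size), input_ids: (seq_len,)
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # mask the prompt: condition on it, don't learn it
    # standard next-token shift: position t predicts token t+1
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)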

They are so similar that in practice, SFT can piggyback on all the infra that was already built for pretraining: indeed, training platforms like Megatron use the same dataloader and trainer class as the pretraining step, and simply set an argument to mask the loss over the prompt.

That said, the scale for these is nowhere near pre-training – you should expect to do SFT within a few million samples at the maximum, so only a few B tokens, whereas pre-training will crunch trillions of tokens.

Who writes the response?

SFT learns word by word, which limits it in two important ways:

  1. Your ceiling is represented by whoever wrote your answer (see previous paragraph)
  2. Critically, you are strongly dependent on data quality. In practice, your ceiling ends up being determined by your worst answers rather than your best. When you source answers from people, you cannot expect all of them to be of the same, amazing quality. Some of them will not be great, and they have a ton of impact on the quality of your model.

So, what do we do? We can’t fight human nature, so the idea is to instead have the LLM generate the responses it will train itself on. This seems unintuitive at first, but the idea is called Rejection Sampling (see the Llama 2 paper for more info). It works because we don’t just generate one answer (then, yeah, it would not improve itself) but many (usually 10), from multiple different checkpoints, random seeds, system prompts, etc., to elicit diversity. Then we keep the best answer (as ranked by a pipeline, often including other models such as the human preference reward model) and add it to the bank. If you have a background in Machine Learning (ML), you can think of this approach as a way to kinda, sorta, distill from a kinda, sorta, ensemble into a single model (very handwavy, I know!).

You can ride this loop through many iterations. If done right, you can climb the hill and get better and better.
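
A rough sketch of one such iteration (all names here are hypothetical placeholders, not a real API): generate several candidates per prompt from a diverse set of generators, score them, and keep only the best as new SFT data.

import random

def rejection_sample(prompts, generators, reward_model, n_samples=10):
    new_sft_data = []
    for prompt in prompts:
        candidates = []
        for _ in range(n_samples):
            # vary checkpoint / seed / system prompt to elicit diversity
            model = random.choice(generators)
            candidates.append(model.generate(prompt))
        # keep the best answer according to the ranking pipeline
        best = max(candidates, key=lambda ans: reward_model.score(prompt, ans))
        new_sft_data.append((prompt, best))
    return new_sft_data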

A primer on Reinforcement Learning

RL is a wide spectrum that includes the famous Reinforcement Learning from Human Feedback (RLHF), but it’s not limited to it.

In general, the whole idea behind RL is that you are now an agent that can take actions against an environment and get to observe what happens and obtain rewards from it. You can think of these rewards as generating a “label” that you can then train on – though backprop is gonna look different (more on this later).

How do you train?

Backpropagation does happen in RL, but with key differences from the neat forward-backward loops we enjoy in supervised learning.

The key difference is that in RL we unfortunately do not have a differentiable cost function, like Cross Entropy or Mean Square Error (MSE). Rewards (and tools like browsers) are not differentiable, so you can’t just backprop through them – which is very unfortunate.

So what we do instead is a very crude approximation: we simply operate on the logprobs output by the model and make them bigger if the action was good, and smaller if it wasn’t – then we backprop into the model to tweak all previous layers to make this happen. Note that this is far less efficient than optimizing a supervised cost function like MSE or Cross Entropy: supervised cost functions return a dense vector of gradients, whereas here we only get a single scalar back from a whole episode, so we get less “learning juice” per interaction.

There are more bells and whistles around the RL learning process, with different algorithms making different choices on how to address the major problems (vanishing/exploding gradients, sample efficiency, infra optimization, and so on), but this is the general gist. The Appendix gives you a detailed step-by-step derivation of Proximal Policy Optimization (PPO).

Let’s look at this from an infra perspective: compared to “standard” supervised learning, you need to run a ton of inference on the model, which is more expensive (autoregressive, token-by-token vs feeding a whole existing sequence through in a single forward pass), requires more infra (KV cache etc), is harder to batch, and so on.

Despite the training objective being so crude, it actually works very well in the presence of sparse or long-term rewards, while SFT requires rewards to be dense (you are told what each token should be).

Following the chess analogy:

  • SFT teaches the model to copy move by move
  • In RL, the model is rewarded when it wins the match (which can be after 20 moves). The training algorithm will favor model configurations that lead to more reward, more often (given some exploration/exploitation trade-off).

While you start off much weaker, eventually, by playing enough games, you can reach Magnus’s level, and far surpass it – in other words, the ceiling you can reach is much, much higher.

In a way, you can view RL as a magical machine that closes the gap between judging and doing: that is a very big gap! I can recognize when a F1 driver messes up, but I can’t do what even the worst F1 driver can do. RL can transform any armchair driver into an actual driver 🏎 

If you want a one-liner, “if you can judge it, you can learn it!”.

Reward hacking

Note one thing: your ceiling is still going to be set by your ability to judge, which is determined by how good your rewards are. For something like games, rewards are super clean (you know when you win with 100% precision and recall), so the sky is the limit. You can apply RL to anything, but if you can’t judge very well, your agent is going to learn noise.

A corollary to RL being limited by your judging ability is that it may learn behaviors that you didn’t intend: RL will do anything in its power to maximize the reward you give it, so the model will do exactly what you asked for, not what you wanted. We call this reward hacking, even though the model isn’t responsible for it; we are, as the ones designing the incentives!

One example to drive this home: RL found out that Super Mario 1 has had a bug for over 30 years: after you jump, if you turn around, you are invulnerable for a single frame. Given that its reward is to maximize the score, and that clearing a level faster makes your score go up, it exploits the hell out of this bug to get a higher score (and thus a higher reward).

What developers wanted:

  • Maximize score
  • Still play like a human
  • Don’t exploit bugs

What developers asked:

  • Maximize score

Note: this happens with humans too! We just call these Perverse Incentives, but they are literally the same thing. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for income.

Applications to LLMs

Similar to the chess example, RL can have a higher ceiling than SFT and thus it’s the method of choice to teach the model how to converse with humans in a way we like (RLHF), how to reason, and so on.

As we have just seen, the ceiling you can reach is set by how good your rewards are: if you use a classifier to provide a reward, your ceiling will be the accuracy of that classifier.

RLHF

One such classifier is human preference. It’s hard to write a rule for how to write with a style that humans like, so what we do instead is train a classifier to score these. We monitor its performance via accuracy and PR AUC (this one is only necessary if you want a continuous score to rank on; otherwise, point estimates like F1 or Accuracy are all you need). Once you have this classifier, all you need to do is run RL against it and optimize against its feedback. 
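
A minimal sketch of how such a preference classifier is typically trained (assuming a model rm with a scalar head that scores a prompt/answer pair; a pairwise, Bradley-Terry-style loss is one common choice):

import torch.nn.functional as F

def preference_loss(rm, prompt, chosen, rejected):
    r_chosen = rm(prompt, chosen)      # scalar score for the preferred answer
    r_rejected = rm(prompt, rejected)  # scalar score for the rejected answer
    # push the preferred score above the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()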

DPO: Direct Preference Optimization

Now let’s look at the second box in our LLM post-training pipeline. DPO is an algorithm used specifically for RLHF of LLMs; it is not a general RL algorithm (unlike PPO and friends, which can be used to train robots and anything else you want).

In fact, technically, DPO is not even a proper RL algorithm; it just pretends to be one! The whole idea of DPO is that if you make some reasonable assumptions, you get to have a differentiable loss function while still training for RLHF. 

To be more accurate, DPO allows us to have a supervised learning solution to a Markov Decision Process under certain assumptions (see later), which is a big deal as normally the only general way to solve MDPs is via Reinforcement Learning (and its inefficient cost function). 

The core idea behind DPO is this: instead of having a separate reward model, you can recycle your LLM to be both your policy model and your reward model. Why? Because your LLM gives you the probability of an answer given a question. So, if you have a preference pair, you can simply say that for the same question, you want the probability of the preferred answer to be high and the probability of the rejected answer to be low. In other terms, you want to maximize the gap between the (log-)probability of the preferred answer and that of the rejected one, which leads to a very nice differentiable function!
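
As a sketch (assuming you have already summed the per-token log-probs of each answer under the policy being trained and under a frozen reference model), the DPO loss looks roughly like this:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # push the preferred answer up and the rejected one down, relative to the reference
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()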

DPO is dirt cheap to run compared to PPO and other RL algorithms that instead need to sample multiple answers by running a ton of inference. The negative is that DPO doesn’t explore, so there’s also a limit to how good you can be. Here’s a more detailed comparison to a “real” RL algo like PPO (which we are going to see in more detail next).

| Feature | DPO (Direct Preference Optimization) | PPO (Proximal Policy Optimization) |
|---|---|---|
| Optimization | Supervised Learning | Reinforcement Learning (RL) |
| Data Needed | Fixed dataset of (prompt, preferred, rejected) pairs | Rollouts + Reward Model |
| Loss Function | Binary classification-like loss | Clipped policy gradient loss |
| Exploration | ❌ No (fixed dataset, no exploration – fully Offline) | ✅ Yes (policy can explore new responses – Online algo) |
| On-Policy? | ❌ Off-Policy (learns from fixed data) | ✅ On-Policy (requires new rollouts) |
| Compute Cost | ✅ Low (single forward pass per pair) | ❌ High (rollouts + PPO training) |
| Training Stability | ✅ Stable (like fine-tuning) | ❌ Unstable (RL variance) |
| Convergence Speed | ✅ Fast | ❌ Slower (needs many rollouts) |
| Performance limited by | ❌ Data | ✅ Compute (better place to be) |
| Best for | Cheap alignment using human preferences | More flexible but expensive fine-tuning |

Online RL

The third and final box in our sample LLM post-training pipeline is Online RL. The “standard” algorithm is PPO (Proximal Policy Optimization), introduced by OpenAI in 2017. Another algorithm that’s widely adopted is GRPO (Group Relative Policy Optimization, introduced by DeepSeek).

Key concept: On-policy vs Off-policy

A policy is simply the LLM you are training. Being on-policy means that every interaction with the environment comes directly from the model being trained. This makes the most sense, as that is how we learn with a private tutor: we try something, we make some mistakes, we immediately get feedback, and we try again with that knowledge. This is in contrast to learning off-policy, where you are shown what someone else did in this situation and you learn from it – that someone else can be (and often is) a past version of you, so it’s common to keep around a memory of what you did in the past in a replay buffer for later use.

RL algos belong to one of these two families, with PPO belonging to the on-policy side and Q-learning (like DQN) being in the off-policy camp.

| Feature | On-Policy RL (e.g., PPO) | Off-Policy RL (e.g., DQN, DPO) |
|---|---|---|
| Definition | Learns from data collected by the current policy | Learns from previously collected data (even if from old policies) |
| Exploration | ✅ Yes (continuously generates new rollouts; the explore-exploit tradeoff is baked into the logits of the policy network – it naturally starts by exploring a lot and gradually moves towards exploiting) | ❌ When run offline, you never explore, as you just recycle old data you generated before (like DPO). You can run it online and explore, but the exploration strategy is left to you to define (e.g., epsilon-greedy) |
| Infra Efficiency | ❌ Low (needs to always generate data online) | ✅ Higher (reuses past generations) |
| Training Stability | ❌ Unstable (policy keeps changing) | ✅ More stable (fixed dataset or replay buffer) |
| Compute Cost | ❌ High (requires frequent rollouts; the cost is mostly due to the sync nature of the training loop: collect → train → collect → train) | ✅ Low (trains on stored data) |
| Example Algorithms | PPO, A2C, TRPO | DQN, DDPG, SAC, DPO |
| Best for | Situations where continuous exploration is needed | When you can store and reuse past experiences |

From an infra point of view: Online vs Offline RL

On-policy vs Off-policy is looking at things from the perspective of the model training dynamics. If we look at things from an infra perspective, we should think about Offline (using static data that we simply reload while we train) vs Online (we generate data live). These two concepts map well to Off-Policy and On-Policy, so are sometimes used interchangeably, though technically they are still a bit different:

  • If you are learning offline, you can only learn off-policy (as the data was generated by another model and saved). Rejection Sampling and DPO are off-policy, offline algorithms. This is the simplest thing for infra.
  • If you are learning online, being on-policy or off-policy is actually more of a spectrum once you start thinking about using multiple machines and synchronizing. If you want to be strictly on-policy, it means you are training with batch size = 1, then sampling from that model, and introducing barriers throughout so that the model is updated constantly and no new trajectories are sampled before we re-scatter all the weights to all nodes.

In code:

# Idealized PPO training loop

collector = CollectorClass(model)

for i in range(num_collection):
    collector.sync_weights_()  # align weights across all workers
    # resume collection and put trainer node on hold <- this is bad!
    data = next(collector)  # collect data
    # Put collector nodes on hold <- this is bad!
    for j in range(num_epochs):
        for batch in split_data_randomly(data):
            loss_val = loss_fn(batch)
            loss_val.backward()
            optim.step()
            optim.zero_grad()

Therefore, some degree of “off-policyness” is desirable and acceptable. This raises many questions: to what extent is that the case? How frequently should you update the collection weights? How can you overlap the weight sync, the collection process, and the model training to maximize throughput?

I don’t want you to leave this section with the idea that off-policy and offline algorithms are necessarily the pits and to be avoided in all cases. In fact, the first two pieces of our pipeline in Llama (SFT and DPO) are essentially a way of solving the alignment Markov Decision Process via a supervised cost function. These are acting as “kind of” offline, off-policy RL:

  1. Our SFT data is coming from Rejection Sampling, meaning that we generate it with the model itself. While we don’t use a proper off-policy RL algo like DQN, rejection sampling done this way is a form of offline policy optimization.
  2. Similarly, we have seen that DPO is also a form of offline policy optimization that also gets away with doing actual RL gradient updates, which are slow and unstable.

Different groups found different recipes in how to leverage all these techniques and how to combine them, but not every group publishes this information.

Beyond RLHF: a general paradigm

There is nothing limiting us to sticking to human feedback – and indeed, we are not. If you want to learn how to code well, you can instead provide a testing harness and give a reward that’s proportional to how many tests you pass. If you want to learn to solve integrals, you can use Wolfram to check if your equation is correct.

In short, you can build reward pipelines by mixing Software 1.0 and Software 2.0. The common patterns are:

1. Reward Models. A classifier that gives a continuous score from 0 to 1 (or sometimes even unbounded). Useful for ranking many answers (just sort on it), especially in domains where you can’t easily express what you want (human preference, writing style, etc).

a. Outcome Reward Models. ORMs provide feedback based on the final result and the final result only of a chain of thought. These are the most common ones – indeed, if someone just says “Reward model”, it’s one of these.

b. Process Reward Models. In the chess example above, the reward is sparse: it only arrives at the end, when the game is won or lost. Intuitively, having dense rewards could help the agent build more fine-grained knowledge of what behaviour is or is not desirable. This is what PRMs do: they don’t just judge the final output, but the whole sequence of steps. For example, a PRM would judge an entire chain of reasoning and make sure that every step is sound. These are not commonly used (at least not yet) as they tend to be very noisy.

2. Rule-based rewards. You can write down a list of rules and attach a reward to each one (or to respecting them as a whole).

a. Software pipelines. These run “normal” software 1.0 and give you a reward based on that. For example, passing test cases for coding or passing linting.

b. Judges. When you can’t check if rules were followed via a traditional software pipeline, you can instead use an LLM to verify if rules were respected for you. For example, you can write a set of safety rules that shall not be broken, eg: “No mentions of sexual stuff”, and a judge can ask the question “Were all the rules followed?”. NOTE: this is different from a continuous-score reward model, because you only get a binary answer (rules followed or not). You can’t rank many answers on this, so this is instead used to provide a negative reward if some rules were broken. In practice, you often don’t even train these and instead simply prompt a foundation model to judge for you.

These are components of what can become very sophisticated reward shaping pipelines. For example, you can imagine having a complex Directed Acyclic Graph (DAG) of these to give very granular rewards. 

Simple example for coding:
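
As a hypothetical sketch of what such a pipeline could look like for code generation (run_linter, run_tests, and llm_judge are placeholders, not a real API):

def coding_reward(code, test_suite):
    if not run_linter(code):              # software 1.0 check used as a hard gate
        return -1.0
    passed, total = run_tests(code, test_suite)
    reward = passed / total               # fraction of tests passed from the harness
    if not llm_judge("Is this code readable and idiomatic?", code):
        reward -= 0.2                     # judge-based penalty on top
    return reward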

This has infra implications:

  • Sandboxing for the test harness and any other binaries you need to run
  • Where do we host all these reward models? You can have many, and they can be big (don’t think of these as small dedicated classifiers! They are often just as big as the model you are training!).
  • We should expect more and more engineers to develop better and better reward pipelines (more granular, to shape the model’s behavior). This can become the place where engineers push their contributions. These pipelines themselves can be a useful thing to have a Hub around and to crowdsource development.

Test-time compute and reasoning

Test-time reasoning is a major trend that emerged in the last year from OpenAI, famously replicated by DeepSeek in their DeepSeek R1 paper.

Let’s dive deeper into what this is.

In short, this builds upon previous work (such as Chain of Thought and ReAct loops) that figured out that letting the model “talk aloud” with itself before committing to an answer can greatly improve the quality of its answers, particularly on certain domains like math. This ability came spontaneously, without ever training the LLM for it. So naturally, the next step was to figure out a way to train the LLM to get better at this “thinking step”. 

We don’t know what technique OpenAI employed, but the most surprising finding of the DeepSeek R1 paper was that you don’t need a super clever setup to induce this learning. In fact, they show that it is enough to simply give the model space to think, by instructing it to fill text between a <think> token and a </think> token and requiring that text not to be empty.

This is the system prompt they used:

Once this was in place, they went into reward modeling.

Annotated excerpts from the paper

I strongly recommend reading the whole DeepSeek R1 paper, as it is very well written. Here, we quote a few sections of the paper with my own comments to provide context for a reader that’s not a specialist in this field.

Section 2.2.2: Reward modeling

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

  • Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

Comment: This is all pretty standard stuff, as we have seen. Conceptually, this is simple. Doing it in practice requires skill from the ML engineers to craft good rewards that aren’t noisy and that nudge the RL process in the direction you want.

  • Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.

Comment: Smart!!! A simple way of nudging the model to start leveraging the thinking process. Otherwise, it may not consistently explore adding thinking between these <think> and </think> tags. Adding a strong negative reward when it doesn’t do it straightens it out and constrains the exploration to using them. They stop here for R1-Zero, but you can actually keep going to more intricate reward modeling. For example, R1-Zero sometimes thinks in a mix of English and Chinese, so you could fix this by adding a prompt (and then a reward) for thinking in English. You can also add intermediate negative rewards if the thinking process is judged not “good” in some way (inconsistent, etc) to further nudge the model. You can see that this is a general paradigm…

  • We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

Comment: Maybe someone in the community will eventually figure out how to make process RMs work…

Other observations

  • The “Aha” moment:

This one is very interesting to me, as RL essentially learns to backtrack on its own. That’s pretty cool, and I honestly didn’t think that this would happen. I imagine that you can improve on this via search: beam search is the simplest, and you can go into tree search algos from there, like MCTS. Now that we have a baseline that works, I think that we are going to see faster progress on all these more complex methods. This is, IMHO, a strongly misunderstood part of this work: DeepSeek hasn’t shown that you don’t need all this compute in AI, quite the opposite! They have just shown that you don’t need complex methods to get started, and that simple methods with scale are sufficient – a lesson we keep learning in AI.

Increasing thinking time (and test-time compute) on its own:

Since they never gave the model any penalty for thinking too long, RL figures out that there is simply no downside to thinking for longer and with each passing iteration, it just keeps going in the direction of longer thinking traces. This is expected. I would imagine that if they kept going, the model would also automatically learn to stop at its max sequence length, as then it wouldn’t be able to remember its whole reasoning trace, so going further wouldn’t help (and it may harm).

Once again, the model learns that test-time compute is good, and thus we should expect compute demands to go up.

Appendix A: Diving deeper into PPO

Let’s go deeper into backprop in RL.

Let’s start from the “basic training loop” in supervised ML (no RL):

loss_fn = nn.CrossEntropyLoss()
for batch in dataset:
    x, y = batch
    y_hat = model(x)  # One forward pass only
    loss = loss_fn(y_hat, y)  # CrossEntropyLoss takes (predictions, targets)
    loss.backward()  # One backward pass

Recall that in RL we do not have a nicely defined cost function, so instead we are just making an action more likely if it was good, and less likely if it was bad. How do we do that?

Our model already outputs a probability distribution over actions, so the final layer will have its Softmax over all the actions. What we want to do is to give a positive gradient to the good actions and a negative gradient to the bad actions.

So, basically, we wanna do something like this:

model.weights.grad[good_actions] += delta
model.weights.grad[bad_actions] -= delta

Autograd can do that for us: to get a constant delta added to the gradient, we need a function that, when differentiated, gives us this delta. The answer is multiplication. So, our “cost function” is simply log_probs * per_token_reward.

Let’s stay high level and develop this gradually (the PPO loss looks quite scary otherwise):

for batch in dataloader:  # Iterate over dataset (prompts)
    prompts = batch  # Input prompts: (bsz, prompt_lens). Can be ragged, or packed.

    # Autoregressive generation! MANY forward calls, need KV cache etc.
    # (If this is not clear to you, see Appendix B.)
    # Also note: we NEED to return log_probs for all the intermediate
    # generations, and we are not detaching them from the graph as we are
    # gonna need them later.
    # responses and log_probs are of size (bsz, response_lens). Also ragged/packed.
    responses, log_probs = model.generate(prompts)

    # Get rewards (e.g., human preference or heuristic).
    # Note that rewards CAN be negative; in that case this sign will be negative.
    # This is a tensor of size (bsz,).
    sequence_rewards = get_feedback(responses)

    # per_token_reward is (bsz, response_lens). A simple way to discount is to
    # multiply tokens by a factor gamma (e.g. 0.99): the last token gets a reward
    # of 1, the second-to-last gets 1*gamma, the previous gets 1*gamma*gamma, and
    # so on. You don't want to maximize only your reward at time t but the sum of
    # rewards till the end of the episode.
    per_token_reward = discount(sequence_rewards)

    optimizer.zero_grad()  # Reset gradients

    # Manually nudge log-probs based on reward signal.
    # This is a stochastic estimator for the gradient of the reward expectation
    # given your stochastic policy - in other words: on average, the gradient of
    # that thing points to where the policy is doing good!
    adjusted_log_probs = log_probs * per_token_reward

    loss = -adjusted_log_probs.sum()  # Equivalent to maximizing probability of good actions

    loss.backward()  # Still ONE backward call! PyTorch knows what to do.

    optimizer.step()  # Update model parameters

Note that while we do multiple forwards (due to autoregressive generation), we only need one backward (well, assuming we keep the graph around; if the generations are done by vLLM or something else, then we are gonna need one more forward pass here to materialize the graph on this side, which shouldn’t be too bad…).

Now let’s make this more realistic

The above is all we are doing conceptually. Except that when you go and try it, everything will diverge 😀 

Let’s make these changes:

  1. Adaptive delta. Using a static delta is suboptimal since the magnitude of the update should be proportional to how good a choice is. Instead of manually nudging logprobs, we can let backprop do the work for us. If you want to increase probabilities, you can simply maximize log-prob. To do it in gradient descent, we minimize the negative logprob. This is the Policy Gradient Loss function:
policy_gradient_loss = - (rewards * log_probs).sum(dim=-1).mean()

policy_gradient_loss.backward()
  2. Reducing variance. If we were to train with just the above, some responses would get huge rewards and others would get zero, leading to unstable training. A way to mitigate this is to introduce a baseline, which is the expected cumulative reward for an average move: not terrible, but not great either. The intuition behind this is that the goodness of a move always depends on the available alternatives: for example, getting 1M dollars seems great until you realize that you had the option to get 100M dollars. 

This “goodness” of a move with respect to the baseline is called the advantage, a core concept in RL.

How do we predict the baseline score (also known as the value of a state)? Two ways:

  1. Value Network. Train a model to do that for you: you can train a value network to estimate what this baseline should be.
  2. Monte Carlo. Simply run a bunch of generations (often 4-5 is enough) and the average cumulative reward of all your generations is your estimate of the value.

Now our code will look like this:

for batch in dataloader:
    prompts = batch
    responses, log_probs = model.generate(prompts)

    rewards = discount(get_feedback(responses))  # Already discounted

    for n in range(epochs):
        for _prompts, _log_probs, _rewards in make_minibatches(
            prompts,
            log_probs,
            rewards,
            ):
            values = value_network(_prompts)  # Predict baseline V(s)
            optimizer.zero_grad()
            # REINFORCE with baseline
            advantages = _rewards - values  # Compute advantage estimate
            loss = - (advantages * _log_probs).sum(dim=-1).mean()  
            loss += advantages.pow(2).mean()
            loss.backward()
            optimizer.step()

Believe it or not, that is still unstable! RL is just massively unstable and finicky (though it is surprisingly well-behaved for LLMs, compared to other fields). One reason is that the distribution of rewards can have a very long tail, i.e., your gradients do not behave very nicely. So, in practice we really really gotta make sure that updates are well-constrained so that training behaves nicely.

What PPO does is basically trying to enforce stability by cramming four different constraints:

1. Standardize advantages

The code above multiplies by the raw advantage which, despite our best efforts at baselining, can still lead to huge updates that will destabilize training. One way to make things well-behaved is to simply shift them to zero mean and scale them to unit variance.

advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

2. Importance Sampling

That is still not enough. To improve further, we are gonna enforce that updates stay constrained within a trust region. PPO does this by preventing updates that are too large relative to the old policy (so yes, it needs to keep around the model at time t-1):

So now we do this instead:

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

safe_advantages = importance_sampling_ratio * advantages

3. Clipping large updates

Oldest trick in the book. If you risk having too big a gradient, simply clip it.

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

clipped_sampling_ratio = torch.clamp(importance_sampling_ratio, 1 - epsilon, 1 + epsilon)

even_safer_advantages = clipped_sampling_ratio * advantages

4. Taking a min()

This one was the most surprising to me. Instead of just using the clipped advantages, you actually want to take a min() between the clipped and unclipped versions. The reason is subtle: RL will try to maximize its rewards, and if you only provide the clipped reward, it will push rewards to be as close to the clipping threshold as possible, which still destabilizes the whole process (more details here).

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

clipped_sampling_ratio = torch.clamp(importance_sampling_ratio, 1 - epsilon, 1 + epsilon)

safe_advantages_final_final = torch.min(importance_sampling_ratio * advantages,
                                        clipped_sampling_ratio * advantages)

Putting it all together:

epsilon = 0.2  # Clipping threshold

for batch in dataloader:
    prompts = batch
    # prev_log_probs are part of the loss - non-differentiable
    with torch.no_grad():
        responses, prev_log_probs = model.generate(prompts)

    rewards = discount(get_feedback(responses))
    values = value_network(prompts)

    advantages = rewards - values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # ref_log_probs are used for regularization
    with torch.no_grad():
        ref_log_probs = get_old_log_probs(prompts, responses).detach()

    for n in range(epochs):
        for batch in make_minibatches(...):
            log_probs = model(batch.prompts)[1]
            ratio = torch.exp(log_probs - batch.prev_log_probs)  # Importance ratio
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

            # The min() function prevents the model from "gaming" the clipped update
            loss = -torch.min(ratio * batch.advantages, clipped_ratio * batch.advantages).mean()
            # + add value_network loss, ref model regularization and entropy boost...

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

If you find it written as an equation, hopefully it won’t look as scary now!

In PPO, there are two more losses. 

The value loss is used to co-train the value network as you train the policy network. Simply put, you can use the actual gains you got in your various generations to keep training the value network, so that’s simply an MSE loss between them:
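
Schematically, with $V_\theta(s_t)$ the value prediction and $R_t$ the observed discounted return:

$$L_{\text{value}} = \mathbb{E}_t\left[\big(V_\theta(s_t) - R_t\big)^2\right]$$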

And finally, as yet another guardrail, we are also going to prevent RL from changing the weights of the model too much: after all, it costs millions of dollars of compute to teach the model about the world in pre-training and we don’t want RL to diverge too much from those. 

A simple way to do it is to simply add a KL divergence loss term.

The final PPO loss is this one:
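
One way to write it, putting together the clipped policy term, the value loss, and the KL regularizer discussed above (with $r_t(\theta)$ the importance ratio and $\hat{A}_t$ the standardized advantage):

$$L_{\text{PPO}}(\theta) = -\,\mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] + c_1\, L_{\text{value}} + c_2\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$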

Where c1 and c2 are hyperparameters that you set experimentally to balance these terms (normally they are small).

DeepSeek’s GRPO

Now you have all the ingredients to be able to open a paper and read scary formulas. This is the formula for DeepSeek’s GRPO (taken from their paper):

Monte Carlo-based advantage estimation

The “innovation” of GRPO (taking a Monte Carlo sample of rewards) is actually an old trick – people did that before they had value networks! Value networks are supposed to be more stable but, as you can imagine, way more expensive. There’s a bit of discussion in the community about what to do here; it’s not set in stone that critic networks won’t rise from their ashes soon!

How expensive is this thing to run?

To sum things up, in the worst-case setting, we have to run the following models sequentially (I’m just showing the boiled-down network ops):

1. Run the inference model, get tokens and log-probs_inference (1 forward)

2. Run the reference model given the generated tokens, get log-probs_reference (1 forward)

3. Run the reward model given the generated tokens, get the reward (1 forward)

4. (Run the critic to get the value → advantage estimate) (1 forward)

5. Run the in-graph model copy given a batch of tokens, get log-probs_train

6. Compute lp0 = f(log-probs_train, log-probs_inference, advantage) + c1 · L(log-probs_train, log-probs_reference)

7. Backward lp0, Adam step on the model weights

8. Compute lp1 = c2 · L(value, advantage)

9. Backward lp1, Adam step on the critic network weights

Removing the critic (as in GRPO) removes the cost of step 4 (1 forward) and steps 8-9 (1 backward + optim.step()), which spares you a considerable amount of memory and compute.

Appendix B: Why Generation is more expensive than processing a prompt

If you look at prices for LLM cloud providers, you’ll notice that they always charge a lot more for output tokens than for input tokens, e.g., GPT-5 costs $1.25 / 1M input tokens but $10.00 / 1M output tokens. Why is that?

The reason is that autoregressive generation is way more expensive than processing text that’s already been written! This is due to how Transformers work. They always need to consume the whole sequence, so to process a block of text you need just a single forward. To generate new text, you need to run a forward for each token you generate: you take the sequence with the new word you just generated, feed it in again to get the next word, and so on. This is hugely expensive and is mitigated by the KV Cache.

Note that this is different from RNNs like LSTMs where if you had a sequence of length L and you wanted to process one more token, the forward would just ingest that single token and reuse the state that it had. Unlike LSTMs, Transformers are stateless so you need to ingest the whole sequence, and do one more forward with L+1 tokens!

Luckily, a lot of computation gets recycled, so the KV cache will greatly mitigate this problem (otherwise, we honestly would not be able to serve these models in production), but it is still an issue.

Let’s look at an example.

You have a prompt:

<system>You are a nice LLM, be kind</system>

Then the user writes:

<user>What's the capital of France?</user>

The model will therefore receive this in input to start generating:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>"

All of the above can be processed with a single forward, because all the tokens are already there.

Now, you generate one token:

The

To generate the next token, you need to do another forward with the whole sequence again!!! Now, the model needs to ingest:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>The"

And it will generate capital (it will actually generate a space, but… you know… let’s speed this up). Now, again:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>The capital"

You can see how expensive this is…
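
A toy sketch makes the cost explicit (a greedy loop with no KV cache; model here is assumed to return logits of shape (batch, seq_len, vocab_size)):

import torch

def naive_generate(model, input_ids, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(input_ids)                 # forward over ALL tokens so far
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

Each iteration re-feeds the whole sequence, which is exactly the waste the KV cache exists to avoid.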

DRAMA Model Inference Efficiency Boosted by 1.7x-2.3x

TL;DR

NJTs (Nested Jagged Tensors) boost DRAMA model inference efficiency by 1.7x-2.3x, making it more production-ready in the category of LLM-based encoders, especially with variable-length sequences.

Introduction and Context

Recent advancements in Large Language Model (LLM) based encoders have shown promising results, with many models topping the evaluations leaderboard. However, the challenge lies in productionizing these complex models, which often require significant computational resources and infrastructure.

To tackle the challenge of optimizing LLaMA-based encoders, we have chosen to explore DRAMA, a dense retrieval model that leverages a pruned LLaMA backbone. The DRAMA model overall shows good performance across various versions, including base (0.1B), large (0.3B), and 1B. Specifically, DRAMA-base stands out due to its strong performance in both English and multilingual retrieval tasks, despite its compact size of 0.1B non-embedding parameters. Its quality makes it an attractive option for clients. However, the high cost associated with its implementation posed a barrier to widespread adoption. To address this challenge, we explore the use of Nested Tensors to optimize the model further to make it a viable solution for production environments. 

By leveraging Nested tensors, we have observed a substantial improvement in inference efficiency for the DRAMA model, with gains ranging from 1.7 to 2.3 times greater efficiency. This breakthrough has significant implications for the deployment of LLM-based encoders in real-world applications.

What are NJTs

Sample packing in torchtune, ragged tensors in TensorFlow, unpadding in ModernBERT, and nested tensors in PyTorch each tackle the challenge of variable-length sequence data, but with differing approaches. While all aim to streamline sequence modeling, their abstractions and performance impact vary by framework and use case.

PyTorch’s nested tensors are a tensor subclass that offers a unified interface for handling ragged-shaped data through an efficient packed internal representation.

There are two types of nested tensors in PyTorch, distinguished by their construction layout: `torch.strided` or `torch.jagged`. It is recommended to use jagged-layout nested tensors (NJTs), and that is what this blog focuses on as well. It’s worth noting that, because they are implemented fully in Python, NJTs have some amount of eager overhead, which is more visible on smaller input sizes. It is recommended to compile NJTs when possible to eliminate this overhead and also gain performance through operator fusion.

An NJT can be created by passing a list of tensors to `torch.nested.nested_tensor` with the `layout=torch.jagged` argument. This copies the inputs into a packed, contiguous memory block. NJTs currently support a single ragged dimension.
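
For example (a small illustrative snippet, assuming a recent PyTorch build with NJT support):

import torch

# three variable-length sequences sharing an embedding dim of 8
seqs = [torch.randn(3, 8), torch.randn(7, 8), torch.randn(1, 8)]
njt = torch.nested.nested_tensor(seqs, layout=torch.jagged)

print(njt.is_nested)                # True
padded = njt.to_padded_tensor(0.0)  # materialize a dense, zero-padded (3, 7, 8) copy
print(padded.shape)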

Model deployments benefit from Nested Tensors when they typically perform inference on large batches of sequences with varying lengths. Given such a query pattern, inference with regular tensors requires that all sequences in the batch be padded to the same length, which is particularly wasteful when the batch consists of many short sequences and a single long sequence. In contrast, Nested Tensors avoid wasting compute on these extra pad tokens by natively supporting operations on batches of varying sequence length.

Dense vs Jagged

As anticipated, NJT demonstrated substantially higher throughput on inputs with uneven sequence lengths compared to padded tensors. In the plot below, we evaluated QPS on synthetic data with various sequence length patterns: (1) “dense” batches where every sequence is of length 256, (2) “linear” batches where the sequence lengths in the batch increase linearly from 1 to 256, and (3) “outlier” batches where one sequence is of length 256, and the remaining sequences are of length 1. The inference cost remains constant in all three cases when using padded tensors, whereas the inference cost with NJT decreases as batch sparsity increases. On the “linear” distribution, NJT outperforms padded tensors by approximately 1.85x.

Implementation

The following code modifications are needed to apply NJTs to the LLaMA model, mainly in two key components: the transform and attention.

Transform

Convert the token IDs into jagged token IDs and set the attention mask to None, since no mask is needed when there is no padding.

jagged_input_ids = torch.nested.nested_tensor(
    tokenizer_output.input_ids, layout=torch.jagged
)
attention_mask = None

LlamaSdpaAttention

  1. Llama 3 introduces Grouped Query Attention (GQA), which is characterized by having more attention heads than key-value heads (num_attention_heads > num_key_value_heads). To ensure compatibility during the attention process, the repeat_kv function plays a key role; its main job is to efficiently replicate key-value heads across query heads. This operation reshapes tensors from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).

To better handle jagged and dense tensor formats, the original repeat_kv function has been split into two specialized functions:

        • repeat_dense_kv: Used for dense tensors, this function is the same as the original repeat_kv.
        • repeat_jagged_kv: Tailored for jagged tensors, which come with ragged_idx indices that add complexity. This method uses a sequence of transpose and flatten operations: by temporarily altering the dimension order before flattening and then transposing back, it effectively navigates the unique challenges presented by jagged tensors.
def repeat_jagged_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep).
    The hidden states go from (batch, num_key_value_heads, seqlen, head_dim)
    to (batch, num_attention_heads, seqlen, head_dim).
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    expand_shape = (batch, num_key_value_heads, -1, n_rep, head_dim)
    if n_rep == 1:
        return hidden_states
    hidden_states = (
        hidden_states.unsqueeze(3)
        .expand(expand_shape)
        .transpose(1, 2)
        .flatten(2, 3)
        .transpose(1, 2)
    )
    return hidden_states


def repeat_dense_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep).
    The hidden states go from (batch, num_key_value_heads, seqlen, head_dim)
    to (batch, num_attention_heads, seqlen, head_dim).
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_key_value_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

2. When applying Rotary Position Embedding (RoPE) to query and key tensors, we need to handle two different tensor formats: jagged and dense. To accommodate this, we implemented two separate functions, each tailored to the specific tensor type. The main function, apply_rotary_pos_emb(), acts as a router that directs the input to either _jagged_tensor_forward or _dense_tensor_forward based on whether the tensor is nested.

For jagged tensors, the process involves three key steps: first, converting the jagged tensor into a dense tensor using q.to_padded_tensor(0.0); second, applying the rotary position embedding on this dense representation; and finally, converting the dense tensor back into its original jagged format with convert_dense_to_jagged.

def apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    unsqueeze_dim: int = 1,
) -> Tuple[torch.Tensor, torch.Tensor]:
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    if q.is_nested and k.is_nested:
        if q.layout != torch.jagged:
            raise NotImplementedError(f"Unsupported layout: {q.layout}")
        if k.layout != torch.jagged:
            raise NotImplementedError(f"Unsupported layout: {k.layout}")
        return _jagged_tensor_forward(q, k, cos, sin)
    else:
        return _dense_tensor_forward(q, k, cos, sin)

def _jagged_tensor_forward(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    q_dense = q.to_padded_tensor(0.0) 
    k_dense = k.to_padded_tensor(0.0)
    q_dense_embed = (q_dense * cos) + (rotate_half(q_dense) * sin)
    k_dense_embed = (k_dense * cos) + (rotate_half(k_dense) * sin)
    q_jagged_embed = convert_dense_to_jagged(q, q_dense_embed)
    k_jagged_embed = convert_dense_to_jagged(k, k_dense_embed)
    return q_jagged_embed, k_jagged_embed

def _dense_tensor_forward(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def convert_dense_to_jagged(nested_q: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    padded_max_S = nested_q._get_max_seqlen()
    total_L = nested_q._values.shape[nested_q._ragged_idx - 1]
    if padded_max_S is None:
        # use upper bound on max seqlen if it's not present
        padded_max_S = total_L

    # convert dense tensor -> jagged
    q = q.expand(
        [
            x if i != nested_q._ragged_idx else padded_max_S
            for i, x in enumerate(q.shape)
        ]
    )
    nested_result = nested_from_padded(
        q,
        offsets=nested_q._offsets,  
        ragged_idx=nested_q._ragged_idx,
        sum_S=total_L,
        min_seqlen=nested_q._get_min_seqlen(),  
        max_seqlen=padded_max_S,
    )
    return nested_result

We added an implementation of the DRAMA model with NJTs: modeling_drama_nested.py

Acknowledgement 

We would like to thank Xilun Chen for helpful feedback during code review, and Don Husa, Jeffrey Wan, Joel Schlosser, and Fernando Hernandez for helpful feedback on the blog.

Conclusion

This optimization using NJTs significantly enhances the efficiency of DRAMA (LLaMA-based encoders), making these models more practical for real-world deployment. By reducing computational overhead, particularly for variable-length sequences, this approach paves the way for broader adoption of high-performing LLM-based encoders in production environments. Note that NJT is considered feature-complete in PyTorch: new features are not being actively added, but community contributions are welcome.

Read More

ZenFlow: Stall-Free Offloading Engine for LLM Training

ZenFlow: Stall-Free Offloading Engine for LLM Training

Introduction

ZenFlow is a new extension to DeepSpeed introduced in summer 2025, designed as a stall-free offloading engine for large language model (LLM) training. Offloading is a widely used technique to mitigate the GPU memory pressure caused by ever-growing LLM sizes. However, as the CPU–GPU performance gap has widened by several orders of magnitude in recent years, traditional offloading frameworks like DeepSpeed ZeRO-Offload often suffer from severe GPU stalls because optimizer computation is offloaded to slower CPUs.

We are excited to release ZenFlow, which decouples GPU and CPU updates with importance-aware pipelining. By fully overlapping CPU work and PCIe transfers with GPU computation, we see more than 85% stall reduction and up to a 5x speedup. This means we can enjoy the memory benefits of offloading without sacrificing training speed to slower hardware.

Figure 1: ZenFlow is DeepSpeed’s stall-free offloading engine for LLM training. It decouples GPU and CPU updates by prioritizing important gradients for immediate GPU updates and deferring the rest for asynchronous CPU-side accumulation. By fully overlapping CPU work and PCIe transfers with GPU computation, ZenFlow eliminates stalls and achieves high hardware utilization across both single-GPU and multi-GPU settings.

Figure 2: ZeRO-Offload causes repeated GPU stalls due to blocking CPU updates and PCIe transfers, leading to >60% idle time per step when training Llama 2-7B on 4× A100s.

Offloading has become a standard approach to scale fine-tuning of large language models (LLMs) beyond GPU memory limits. Frameworks like ZeRO-Offload reduce GPU memory usage by pushing gradients and optimizer states to the CPU. However, they also create a new bottleneck: expensive GPUs often sit idle, waiting on slow CPU updates and PCIe data transfers. In practice, enabling offloading when training Llama 2-7B on 4× A100 GPUs can inflate each step from 0.5s to over 7s—a 14× slowdown.

Figure 3: In ZeRO-Offload, CPU-side optimizer updates and PCIe transfers dominate iteration time, leaving the GPU idle for over 5 seconds.

ZenFlow addresses this bottleneck with a stall-free training pipeline. It prioritizes high-impact gradients for immediate GPU updates, while offloading the rest to the CPU and applying them asynchronously. These deferred CPU updates are fully overlapped with GPU compute, eliminating stalls and significantly improving throughput. Best of all, ZenFlow maintains the same model accuracy and integrates seamlessly with DeepSpeed.

ZenFlow at a Glance

  • Zero GPU stalls: Top-k important gradients are updated immediately on GPU; low-priority gradients are asynchronously processed on CPU—no GPU wait time.
  • Asynchronous and bounded: ZenFlow decouples CPU and GPU execution with a bounded-staleness strategy that preserves convergence.
  • Auto-tuned: ZenFlow adapts update intervals at runtime based on gradient dynamics—no need to tune manually.

ZenFlow Highlights

ZenFlow is the first offloading framework to offer a bounded-asynchronous update scheme that preserves convergence while delivering up to 5× end-to-end speed-up over ZeRO-Offload.

Performance

| Feature | Benefit |
|---|---|
| Up to 5× end-to-end speed-up over ZeRO-Offload and 6.3× over ZeRO-Infinity | Faster time-to-convergence |
| > 85% reduction in GPU stalls on A100 / H100 nodes | Keeps GPUs busy, higher utilization |
| ≈ 2× lower PCIe traffic (1.13× model size per step vs. 2× in ZeRO) | Less bandwidth pressure on clusters |
| Maintains or improves accuracy on GLUE (OPT-350M → Llama-13B) | No accuracy loss |
| Lightweight gradient selection (6000× cheaper than full AllGather) | Scales to multi-GPU settings without memory footprint spikes |
| Auto-tuning (Zen-auto) automatically adapts update interval on the fly | No manual knob tuning |

For more detailed performance results, please refer to our arXiv paper.

Design Motivation

Training large models with offloading can save GPU memory, but often at the cost of performance. In this section, we briefly discuss three topics. First, we explain why coupling CPU-side optimizer updates with GPU compute leads to severe GPU stalls during LLM fine-tuning. Next, we quantify how full-gradient offloading saturates the limited PCIe bandwidth on A100/H100 servers, inflating iteration time. Finally, we reveal the highly skewed importance distribution of gradients, showing that uniformly updating all parameters in GPUs at the same time is wasteful and unnecessary.

Offloading-Induced GPU Stalls

Figure 4:  CPU updates dominate step time, causing >60% GPU idle due to poor overlap with computation.

Synchronous offloading frameworks (e.g., ZeRO-Offload) keep the GPU idle while the CPU performs a full optimizer step and transfers updated parameters back to GPU. For Llama-2-7B with 4× A100, the CPU path can take longer than 4s while the backward pass takes approximately 2s, so over 60% of each iteration is pure GPU wait time. Eliminating this serialization is essential for achieving high GPU utilization.

Bandwidth Bottlenecks

A single training step moves a full copy of the model gradients from GPU to CPU and a full copy of the model parameters back, i.e., 2× model size of PCIe traffic per step. Even on PCIe 4.0 (≈ 32 GB/s), Llama-2-13B pushes ~40 GB per iteration, adding > 1s of transfer latency.

Unequal Gradient Importance

Not all gradients matter equally. Our analysis shows that the top 1% of gradient channels contribute over 90% of the ℓ²-norm energy during fine-tuning. In other words, most updates have little impact on model learning, yet still incur disproportionately high compute and I/O costs in traditional offloading pipelines.

This skew in gradient importance opens the door to a better design: update critical gradients on GPU right away, and defer the rest for asynchronously batched, lower-priority updates on CPU. ZenFlow turns this idea into a principled, efficient training engine.

Figure 5: Top 1% of gradients may contribute over 85% of gradient norms.

ZenFlow Design

ZenFlow is designed around three key ideas that separate critical and non-critical gradient updates while minimizing communication bottlenecks. Here’s how we break the tight coupling between GPU and CPU computation to create a stall-free pipeline.

Idea 1: Importance-Aware Top-k Gradient Update

Not all gradients are equally impactful for training. ZenFlow introduces an importance-aware design that prioritizes updates for the top-k most significant gradients. These gradients are updated directly on the GPU, using its high compute bandwidth. This approach allows us to reduce the size of the per-step gradient update by nearly 50%, cutting down the communication load by around 2×.

For the rest of the gradients, which contribute less to the model’s learning, ZenFlow batches them and performs asynchronous updates on the CPU. These updates are deferred until they are sufficiently accumulated, thereby reducing the impact on training speed.

Idea 2: Bounded-Asynchronous CPU Accumulation

ZenFlow’s asynchronous accumulation allows the CPU to stay busy while the GPU performs other computations. We apply an accumulation window for the non-critical gradients, allowing them to accumulate over several iterations before updating. This gives ZenFlow the ability to process multiple rounds of gradient updates concurrently, eliminating idle time typically spent waiting for the CPU optimizer.

By carefully coordinating CPU updates with GPU execution, ZenFlow fully hides CPU execution behind GPU computation—ensuring that GPUs remain actively utilized, avoiding stalls, and maximizing hardware efficiency.
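As a toy illustration of Ideas 1 and 2 together (a single-process simplification of ours, not the DeepSpeed implementation, which runs the deferred update asynchronously on the CPU), critical entries are applied immediately while the rest accumulate and are flushed every few steps:

import torch

def split_update(param, grad, critical_mask, cpu_accum, step, lr=1e-4, window=4):
    flat = grad.flatten()
    # critical entries: updated immediately on the GPU
    param.data.view(-1)[critical_mask] -= lr * flat[critical_mask]
    # non-critical entries: accumulated on the CPU, flushed every `window` steps
    cpu_accum += flat.masked_fill(critical_mask, 0.0).cpu()
    if (step + 1) % window == 0:
        param.data.view(-1) -= lr * cpu_accum.to(param.device)
        cpu_accum.zero_()

# usage sketch: mark the largest-magnitude entries as critical, e.g.
#   critical_mask = torch.zeros(p.grad.numel(), dtype=torch.bool, device=p.grad.device)
#   critical_mask[torch.topk(p.grad.abs().flatten(), k).indices] = True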

Idea 3: Lightweight Gradient Selection

A key challenge in distributed training is selecting important gradients without introducing prohibitive communication and GPU memory costs. Traditional systems rely on global synchronization (via AllGather) to gather full gradients, which can become a major bottleneck in multi-GPU settings.

ZenFlow solves this with a lightweight gradient proxy: instead of transferring full gradients, ZenFlow uses a per-column gradient norm to approximate the importance of each gradient. By computing a compact summary of per-column gradients (e.g., squared norms), ZenFlow reduces communication volume by more than 4,000×—with nearly no loss in accuracy.

This approach allows ZenFlow to scale efficiently across GPUs, without high memory or communication overhead, and it supports dynamic gradient selection as the model evolves.
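A minimal sketch of such a proxy (illustrative only, not the DeepSpeed code): rank the columns of a weight gradient by their squared norms and keep the top ~1% as the critical set.

import torch

def critical_columns(grad: torch.Tensor, frac: float = 0.01) -> torch.Tensor:
    # grad: (out_features, in_features); one squared norm per column is a compact
    # summary that avoids gathering the full gradient across GPUs
    col_scores = grad.pow(2).sum(dim=0)
    k = max(1, int(frac * col_scores.numel()))
    return torch.topk(col_scores, k).indices

g = torch.randn(4096, 11008)
idx = critical_columns(g)  # indices of the most important columns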

Putting It All Together: ZenFlow’s Zero-Stall Pipeline

Figure 6: ZenFlow’s stall-free pipeline overlaps CPU updates and transfers with multi-step GPU compute.

  1. Forward/Backward Pass on GPU: ZenFlow processes the forward and backward passes on the GPU, immediately updating the top-k gradients on the GPU without waiting for the CPU.
  2. Gradient Transfer to CPU: While the GPU is busy, gradients from the current iteration (or previous ones) are transferred to the CPU over a dedicated PCIe stream. This is done in parallel with GPU computation, without causing any GPU wait time.
  3. CPU Update: Once a batch of non-critical gradients has accumulated, the CPU performs the update asynchronously. This update typically spans multiple GPU iterations, but is hidden behind GPU work, making it virtually invisible to the overall pipeline.
  4. Double Buffering: ZenFlow uses double buffering to manage the newly updated gradients. When the CPU update is complete, the new parameters are transferred back to the GPU. The swap is as fast as a pointer flip—no need to reload the entire model or re-launch the kernel.

By constantly overlapping GPU computation with CPU-side work, ZenFlow transforms the traditional compute → wait → update cycle into a continuous, stall-free pipeline.

Getting Started: Try out DeepSpeed-ZenFlow

To try out DeepSpeed-ZenFlow, please refer to the ZenFlow example in our DeepSpeedExamples repo and the ZenFlow tutorial in DeepSpeed.

Citation

@article{lan2025zenflow,
  title   = {ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
  author  = {Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
  journal = {arXiv preprint arXiv:2505.12242},
  year    = {2025}
}

Acknowledgements

This work is the result of a close collaboration between University of Virginia (UVA), University of California, Merced (UC Merced), Argonne National Laboratory (ANL), and DeepSpeed team.

The contributors include Tingfeng Lan, Yusen Wu, Zhaoyuan Su, Rui Yang, and Yue Cheng from UVA; Bin Ma and Dong Li from UC Merced; Tekin Bicer from ANL; and Olatunji Ruwase and Masahiro Tanaka from the DeepSpeed team. We especially thank Olatunji Ruwase and Masahiro Tanaka for their early feedback and insightful discussions, and the open source community for its support.

Read More

Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel

Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel

In this post, we present an optimized Triton BF16 Grouped GEMM kernel for running training and inference on Mixture-of-Experts (MoE) models, such as DeepSeekv3.

A Grouped GEMM applies independent GEMMs to several slices (groups) of an input tensor in a single kernel call. In a baseline PyTorch implementation, these GEMMs would be carried out in a for-loop over the groups, with one kernel launch per iteration.
Our kernel achieves up to 2.62x speedup over the manual PyTorch loop implementation on NVIDIA H100 GPUs when used in DeepSeekv3 training. We discuss the Triton kernel optimization techniques we leveraged and showcase end-to-end results.

16B DeepSeekv3 TPS throughput on 8x NVIDIA H100 with FSDP2 

Triton Kernel Grouped Gemm vs PyTorch manual looping Group GEMM (1.42x-2.62x Speedup)

Background

GEMM (General Matrix Multiplication) is a fundamental primitive in LLM workloads. When an input activation matrix is multiplied by a weight matrix, a GEMM is being performed. In modern deep learning based architectures, GEMMs dominate FLOP counts, so their efficiency often defines end-to-end model speed. 

In Mixture-of-Expert (MoE) models, tokens are dynamically routed to different experts which results in many independent GEMMs. A Grouped GEMM executes multiple smaller GEMMs together in one kernel launch. Instead of treating each expert or layer as a separate GEMM, we batch them, which reduces launch overhead and improves GPU utilization.

Figure 1. Example GEMM problem with 3 experts

To illustrate this, we can imagine a toy scenario where we have 3 expert weights and a varying number of tokens being routed to each expert, so the activations are of different sizes. We can construct these 3 matrix multiplications of varying sizes into a single Grouped GEMM problem, which allows us to calculate the output matrices C1, C2, and C3 in a single kernel launch.
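For reference, the manual-loop baseline we compare against looks roughly like the sketch below (shapes are illustrative); each iteration is its own matmul and, in eager PyTorch, its own kernel launch:

import torch

def grouped_gemm_loop(activations, weights):
    # activations[i]: (tokens_i, K) tokens routed to expert i; weights[i]: (K, N)
    # one matmul (and one kernel launch in eager mode) per expert
    return [a @ w for a, w in zip(activations, weights)]

K, N = 2048, 7168  # illustrative shapes
weights = [torch.randn(K, N, dtype=torch.bfloat16) for _ in range(3)]
acts = [torch.randn(t, K, dtype=torch.bfloat16) for t in (128, 512, 64)]  # ragged token counts
C1, C2, C3 = grouped_gemm_loop(acts, weights)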

Optimization 1: Persistent Kernel Design

Nvidia GPUs have streaming multiprocessor units (SMs) that contain specialized hardware units to perform load, store, and compute operations. SM utilization is key to kernel performance. Thus, when implementing parallel algorithms such as Grouped Matrix-Multiplication using the Triton programming language, a key consideration is the work decomposition across SMs.

In a naive work division, a new threadblock (CTA) would be launched for every tile of work. In contrast, persistent kernels keep CTAs “alive” and dynamically feed them new tiles until the entire GEMM is complete. This avoids launch overhead, improves cache reuse, and reduces scheduling imbalance, which can lead to an effect known as wave quantization. Wave quantization is an inefficiency that occurs when the number of output tiles are not evenly divisible by the number of GPU SMs which leads to low utilization. This Colfax post provides a deep dive into the topic.

We build on this idea by applying the persistent kernel strategy in our Group GEMM kernel. In training and prefill workloads for MoE models, the matrix multiplication problem sizes are large. Thus, in naive work decomposition, a large number of threadblocks need to be scheduled to compute the output matrix, which would result in multiple “waves” of work being done. Instead, with our persistent kernel design, we can compute the entire matrix multiplication in a single wave of work by making two key changes in our Triton kernel, as discussed in the code snippets below.

First, we set the kernel grid to be equal to the number of SMs on the H100 GPU, 132.

grid = (NUM_SMS, 1, 1)  # host code

Next, we change the outer for-loop structure to:

for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):  # device code

We launch one Triton program per SM, so all the Triton programs fit in a single wave with none waiting in the queue. Inside the kernel, each program loops over its share of tiles, fetching new work until all tiles are computed. This design keeps Triton programs alive on the SMs, eliminating repeated launches and making the GEMM a single continuous wave of work.
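To make this concrete, here is a minimal, illustrative persistent-loop skeleton (our simplified sketch, not the full Grouped GEMM kernel; out_ptr and the tile bookkeeping are placeholders): one Triton program per SM, each striding over its share of output tiles.

import triton
import triton.language as tl

@triton.jit
def persistent_skeleton(out_ptr, num_tiles, NUM_SMS: tl.constexpr):
    start_pid = tl.program_id(0)
    tiles_done = 0
    # each program strides over tiles: start_pid, start_pid + NUM_SMS, ...
    for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):
        # in the real kernel: decode tile_id into (group, tile_m, tile_n),
        # load the A/B tiles, accumulate with tl.dot, and store the C tile
        tiles_done += 1
    # record how many tiles this program handled (placeholder for the real work)
    tl.store(out_ptr + start_pid, tiles_done)

# launched with grid = (NUM_SMS, 1, 1), as shown above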

Optimization 2: Grouped Launch Ordering

An important consideration for kernel speed is cache performance. In Triton, the programmer controls the order in which the output tiles are computed, and thus, we can optimize L2 Cache performance at the kernel level. We experimented with both linear tile ordering (row major) and grouped launch ordering schedules. To illustrate the difference between these two approaches, we can examine the following toy matrix multiplication example, where A and B are the input matrices and C is the output matrix. 

Figure 2. Row-Major Schedule

In the row-major traversal across the output C matrix, we move quickly across the columns of the B matrix and C(0,0) -> C(0,1) -> C(0,2) before moving to the next row, C(1,0). This means that B tiles will only be re-visited after cycling through an entire row of C, by which time the data may have been evicted.

Figure 3. Grouped Launch Schedule with Group Size = 2

In the grouped launch schedule, we hold a band of rows from the A matrix in cache (two rows in Figure 3) and traverse column-major across the output C matrix, computing C(0,0) -> C(1,0) -> … -> C(GROUP_SIZE_M, 0) before moving to the next column and computing C(0,1) -> C(1,1), and so on.

The net effect is that the grouped launch schedule increases cache performance for both A and B matrices. Consecutive Triton programs (CTAs) reuse the same B tile in quick succession while keeping a band of A rows in cache.

Figure 4. L2 Cache Gain for Grouped Launch Order vs Linear Launch Order

num_groups, m, k, n = 8, 4096, 2048, 7168

For the problem sizes we tested, the grouped launch ordering proved more performant in terms of data reuse and latency. From Figure 4 above, we note a 1.33x speedup and a +60% L2 cache hit rate with the optimized schedule.

The main benefit of using the grouped launch schedule in our Group GEMM kernel is that it enforces temporal locality as exemplified in the illustrations above. This is achieved by re-ordering the launch order of programs so that tiles of the GEMM problem are computed in an order that allows for better reuse of the input activation and the expert weights, improving L2 cache hit rates, increasing arithmetic intensity, and thus reducing kernel latency.
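For readers who want the index math, the following host-side sketch shows the standard grouped ("swizzled") ordering we describe; in the kernel the same arithmetic is applied to tile_id, and GROUP_SIZE_M controls the band of C rows kept hot in cache:

def grouped_tile_order(tile_id: int, num_pid_m: int, num_pid_n: int, GROUP_SIZE_M: int = 2):
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = tile_id // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + (tile_id % num_pid_in_group) % group_size_m
    pid_n = (tile_id % num_pid_in_group) // group_size_m
    return pid_m, pid_n

# on a 4x4 tile grid, the first tiles visit C(0,0), C(1,0), C(0,1), C(1,1), ...
order = [grouped_tile_order(t, 4, 4) for t in range(8)]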

Optimization 3: Tensor Memory Accelerator (TMA) utilization for Expert Weights

The TMA unit on NVIDIA Hopper GPUs is a dedicated hardware unit for load/store operations that operate on tensors. The benefit of leveraging the TMA unit in our kernel design is it can free up SM resources such as registers and CUDA cores while data is being moved from global to shared memory. To learn more about TMA usage in Triton, see our previous deep dive on this topic.

However, there is a caveat due to the special use case of this kernel. Typically, a TMA descriptor containing tensor metadata is created on the host and then passed to the kernel. 

For MoE models, a modified approach is needed since the chosen expert is not known ahead of time. Instead, it is determined at runtime, creating a data-dependent access into the expert weight matrix. This type of access is possible in Triton by dynamically creating a local TMA descriptor based on the chosen expert index. We walk through the code below on how to build a TMA 2D descriptor on the device for the chosen expert, and then how to use it to issue TMA loads.

First, we pre-allocate a chunk of GPU memory, workspace, on the host:

# host code
workspace = torch.empty(
    NUM_SMS * desc_helper.tma_size,
    device=x.device,
    dtype=torch.uint8,
)

The size of the memory we are reserving is equal to the size in bytes of a single TMA descriptor, desc_helper.tma_size, multiplied by the number of persistent Triton programs we are launching, NUM_SMS. This ensures that each Triton program has space to write its own TMA descriptor.

# device code
expert_desc_ptr_tile = workspace + start_pid * TMA_SIZE
tl.extra.cuda.experimental_device_tensormap_create2d(
    desc_ptr=expert_desc_ptr_tile,
    global_address=b_ptr + expert_idx * N * K + n_start * K,
    load_size=[BLOCK_SIZE_N, BLOCK_SIZE_K],
    global_size=[NUM_EXPERTS * N, K],
    element_ty=tl.bfloat16,
)

tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(expert_desc_ptr_tile)

expert_weight = tl._experimental_descriptor_load(
    expert_desc_ptr_tile,
    [0, k_offset],
    [BLOCK_SIZE_N, BLOCK_SIZE_K],
    tl.bfloat16,
)

In the Triton code, each Triton program first creates a private slot in workspace to hold its expert descriptor. Next, we create a 2D tensor map that points to the routed expert tile by passing the expert's metadata. Then, we explicitly call a proxy fence, which is required to synchronize memory operations between two different proxies, the SM and the TMA engine. In our kernel, every time a new expert_idx is selected, the SM writes a new TMA descriptor to global memory. The fence guarantees that the new TMA descriptor is globally visible before the TMA engine issues a load instruction, ensuring we do not read stale or incorrect data.

Now that the TMA descriptors are constructed dynamically based on the chosen expert_idx, each Triton program in the Grouped GEMM kernel can target its TMA load to the routed expert weight.

Microbenchmarks

We benchmarked our Hopper-optimized kernel against a baseline Triton Group GEMM kernel that does not contain the optimizations we discussed to isolate the gain from these techniques.

Figure 5. Triton Group GEMM Kernel TFLOPs Comparison (Higher is Better)

Figure 6. Kernel Latency Comparison with Speedup over Baseline Triton Kernel

By leveraging a persistent kernel design, grouped launch tile ordering, and the Hopper TMA unit, our kernel achieves up to 1.50x speedup over the baseline Triton kernel. 

End-to-End Benchmarks

We integrated our kernel into torchtitan to create an end-to-end test in which we train a 16B parameter flavor of DeepSeekv3 using FSDP2 across 8x H100s. The speedups for various batch sizes are below:

Figure 7. 16B DeepSeekv3 E2E Tokens/s/GPU Throughput Summary

MoE models have a much higher parameter-to-flops ratio than dense models, and this fact makes FSDP2 suboptimal for training due to the cost of communicating large weights. It is instead more beneficial to parallelize by statically placing different experts on different GPUs and communicating activations around. The number of tokens processed by each GPU changes dynamically in such Expert Parallel training, which makes the use of Triton kernels challenging, since every new token count may require kernel recompilation, depending on the details of the implementation. We leave support for such dynamic training workloads to future work.

Training (torchtitan)

Figure 8. Tokens/s/GPU for batch-size 4, 16B DeepSeekv3 on 8x NVIDIA H100 with FSDP2

Training (torchtitan)

Figure 9. Loss curve comparison Triton vs for-loop 16B DeepSeekv3 on 8x NVIDIA H100 with FSDP2

Conclusion

For future work, we plan to integrate our kernel into vLLM (in-progress PR here), as well as extend this kernel to support FP8 in the forward and backward. Our kernel can be leveraged from torchtitan here.  Further, we also plan to experiment with even lower precision datatypes such as MXFP4 that are supported by newer generation NVIDIA GPUs such as B200. 

Read More

Open Source AI Week Heads to the San Francisco Bay Area in October 2025

Mark your calendars! The inaugural Open Source AI Week is coming to the San Francisco Bay Area from October 18–26, 2025. This week-long celebration is the premier destination for the global AI community to explore cutting-edge research, groundbreaking tools, and open collaboration in artificial intelligence and machine learning.

What is Open Source AI Week?

Open Source AI Week brings together the best AI and ML conferences, hackathons, startup showcases, and networking opportunities exploring the intersection of artificial intelligence, machine learning, and open source technology. Taking place October 18–26, 2025 in the San Francisco Bay Area, this week-long celebration is dedicated to fostering innovation, collaboration, and community-driven solutions in the rapidly evolving AI landscape, featuring the PyTorch Conference as the flagship event.

Schedule at a Glance

Below is a current snapshot of the Open Source AI Week lineup. More information about each is available on the Open Source AI Week website.

Tuesday, October 21

  • Measuring Intelligence Day
  • AI Infra Summit
  • PyTorch Conference Startup Showcase
  • AI Infra & Open Source Models Meetup

Wednesday, October 22

  • PyTorch Conference (Day 1)

Thursday, October 23

  • PyTorch Conference (Day 2)
  • PyLadies SF at AutoKitteh

Friday, October 24

  • dAGI Summit

Plus, stay tuned, as we’ll be adding more events to the lineup soon. 

Add Your Event to Open Source AI Week

If you’re organizing an AI + Open Source event, we welcome your submission to be a part of Open Source AI Week. Submit your event to be added to the Open Source AI Week lineup! 

To ensure that all the events are relevant to the Open Source AI Week and foster an open and inclusive exchange, all submissions will be reviewed against the following guidelines:

  • Focus on Open Source AI: Events should center on open technologies within the AI ecosystem, including but not limited to open-source software and hardware for AI development, open standards, open data related to AI, and open benchmarks.
  • Bay Area Location & Timing: Events must take place in the Bay Area between October 18–26, 2025, during Open Source AI Week.
  • Commitment to Inclusion: Events, particularly those featuring speakers, should actively encourage diversity and be open to all attendees, regardless of race, gender, or background.

Open Source AI Week is your opportunity to get inspired, get involved, and help shape the future of AI.

Read More

PyTorch Wheel Variants, the Frontier of Python Packaging

PyTorch Wheel Variants, the Frontier of Python Packaging

Tweet from charliemarsh, creator of uv

PyTorch is the leading machine learning framework for developing and deploying some of the largest AI products from around the world. However, there is one major wart whenever you talk to most PyTorch users: packaging.

Now, this isn’t a new problem. Python packaging is notoriously difficult, and with the advent of compiled / specialized components for packages, the packaging ecosystem has needed an answer for how to make the experience better. (If you’re interested in learning more about these difficulties, I’d highly recommend reading through pypackaging native.)

With that in mind, we’ve launched experimental support within PyTorch 2.8 for wheel variants. To install them, you can use the following commands:

Linux x86 and aarch64, MacOS

curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh

uv pip install torch

Windows x86

powershell -ExecutionPolicy Bypass -c "$env:INSTALLER_DOWNLOAD_URL='https://wheelnext.astral.sh'; irm https://astral.sh/uv/install.ps1 | iex"

uv pip install torch

This particular post will focus on the problems that wheel variants are trying to solve and how they could impact the future of PyTorch’s packaging (and the overall Python packaging) ecosystem.

More details on the proposal and install instructions can be found on the following resources:

What are the problems?

Currently, the matrix for installing PyTorch is manifested as a modal on the PyTorch website, which looks like:

This modal has more than 10 buttons dedicated to installing different versions of PyTorch that are compiled for specialized hardware, and most of the pathways lead to install commands that end up looking like:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129

While the command itself doesn’t look terrible, it does take a lot of steps to get to that point, including:

  • Understanding what accelerator you are using
  • Understanding what accelerator version you are using
  • Knowing which URL maps to which accelerator + accelerator version you are using
    • Of which the naming conventions can be non-standard

This has led to frustration for PyTorch users, and even worse, churn for developers using PyTorch within their own projects who need to support multiple accelerators (see example).

The future of PyTorch Packaging

This is the future of PyTorch packaging. Yes, really, that’s it.

More seriously, we have worked alongside engineers from the WheelNext community to deliver experimental binaries that:

  • Automatically identify which accelerator and accelerator version you are using (e.g., CUDA 12.8)
  • Install the best-fitting variant of PyTorch automatically based on these software and hardware parameters

NOTE: This particular feature is experimental and based on the wheel variants proposal. (PEP pending)

The PyTorch team views wheel variants as a promising way for Python packages to ensure that they can mark specific packages for support of specialized hardware and software, and will be supporting its development/proposal as it makes its way through the PEP process.

We’d love your feedback!

As we blaze the frontier of Python packaging, we’d love to hear how you are utilizing PyTorch’s wheel variants. More importantly, if you have any feedback, we invite you to post it to our issue tracker on pytorch/pytorch.

Read More

PyTorch Day China Recap

PyTorch Day China Recap

On June 7, 2025, PyTorch Day China was held in Beijing, co-hosted by PyTorch Foundation and the Beijing Academy of Artificial Intelligence (BAAI). The one-day conference featured 16 talks and averaged 160 participants per session. Explore the full YouTube playlist to find sessions that interest you.

Matt White, Executive Director of the PyTorch Foundation, delivered key insights into the PyTorch Foundation’s commitment to accelerating open source AI. Since its establishment two years ago, the foundation has grown to 30 members and evolved into an umbrella foundation capable of hosting open source projects beyond PyTorch core. vLLM and DeepSpeed became the first projects under the Foundation umbrella, and BAAI’s open source project FlagGems also joined the PyTorch Ecosystem. The PyTorch Ambassador Program, launched to support local community development, received over 200 applications within a month. Matt also introduced the new PyTorch website, as well as the schedules for PyTorch Conference and Open Source AI Week. He mentioned the Foundation’s upcoming initiatives, including the Speaker Bureau, university collaborations, and training certifications, thanked the attendees, and expressed anticipation for the day’s speeches.

2. Running Large Models on Diverse AI Chips: PyTorch + Open Source Stack (FlagOS) for Architecture-Free Deployment

Yonghua Lin, Vice President of the Beijing Academy of Artificial Intelligence, discussed the current status of running large models on diverse AI chips. She explained the rationale behind building a unified open source system software stack: large models face challenges such as high costs, massive resource demands, and expensive training/inference, while the fragmented global AI accelerator ecosystem creates additional issues. She then introduced FlagOS, developed by BAAI in collaboration with multiple partners, including core components and essential tools, supporting various underlying chips and system deployment architectures, as well as multiple large models. It has gained support from various architectures and demonstrated outstanding performance in operator efficiency and compatibility. Finally, she called for more teams to participate in building this open source ecosystem.  

3. Diving in Hugging Face Hub; Share Your Model Weights on the #1 AI Hub, Home of 700k+ PyTorch Models

Tiezhen Wang from HuggingFace introduced the HuggingFace Hub, an open source AI community often referred to as the “GitHub of AI.” It hosts a vast number of open source models and datasets, along with diverse features: spaces for easily testing models, kernels, API provider gateways, social communication functions, and open source-related metrics. Its model library offers convenient filtering by popularity and task, with a trending models page featuring various hot models. Each model has a dedicated page displaying model cards, code, and structured data. For datasets, it supports git repositories, provides visualization and SQL query functions, and offers a powerful programming interface.  

4. verl: An Open Source Large Scale LLM RL Framework for Agentic Tasks

Yuxuan Tong from ByteDance introduced verl, an open source large-scale LLM Reinforcement Learning framework. He first emphasized the importance of large-scale RL, which significantly enhances language model performance and has wide applications in real-world tasks. However, it faces challenges such as complex data flows (involving multiple models, stages, and workloads), distributed workloads, and the need to balance data dependencies and resource constraints. Verl’s strengths lie in balancing flexibility and efficiency: it achieves programming flexibility through a single-controller paradigm, allowing core logic to be described with minimal code and supporting multiple algorithms, and it features a hybrid engine to optimize resource utilization. The framework has an active open source community, with several popular projects built on it. Finally, he shared the community’s future roadmap and welcomed new members.  

5. PyTorch in China: Community Growth, Localization, and Interaction  

Zesheng Zong from Huawei discussed the development of the PyTorch community in China. As a globally popular framework, PyTorch has a large number of contributors from China, ranking among the top globally. To address the lack of localized resources for beginners, they translated PyTorch’s official website, built a community homepage, and translated tutorials from beginner to advanced levels. They also actively engaged with users through chat channels (established late last year), published over 60 technical blogs, and gained 2,500 subscribers. Future plans include further automating translations, providing more high-quality resources and events, and inviting users to participate.

6. The Development of AI Open Source and Its Influence on the AI Ecosystem  

Jianzhong Li, Senior Vice President of CSDN and Boulon technical expert, shared insights into the development of AI open source and its impact on the AI ecosystem. He compared global and Chinese AI technology ecosystems, noting that Chinese AI open source is gaining increasing global importance, and drew parallels between AI development and the evolution of biological intelligence on Earth. He then discussed the development of reasoning models, which enable large models to “think slowly” and reduce reliance on weak reasoning signals in training corpora, with machine-synthesized data in reinforcement learning playing a key role. He analyzed open source’s impact on the ecosystem, including drastically reducing model training and inference costs, and driving the evolution of AI applications toward agents capable of planning, collaboration, and action.  

7. torch.accelerator: A Unified, Device-Agnostic Runtime API for Stream-Based Accelerators  

Guangye Yu from Intel introduced the torch.accelerator APIs launched in PyTorch 2.6, a unified, device-agnostic runtime API for stream-based accelerators. While PyTorch, a widely used machine learning framework, supports various acceleration hardware, existing runtimes are coupled with specific device modules (e.g., `torch.cuda.current_device` only works for CUDA devices), limiting code portability and creating challenges for hardware vendors integrating new backends. PyTorch 2.5 introduced the concept of accelerators, and 2.6 proposed a unified device-agnostic runtime API, with functionality mapping closely to existing device-specific APIs to minimize code migration changes. Future plans include adding memory-related APIs and universal unit tests. He concluded by thanking the community and contributors for these improvements. 
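For a concrete feel for the API described in the talk, here is a short, hedged sketch (assuming PyTorch 2.6+; the exact surface may evolve):

import torch

# device-agnostic runtime queries that replace backend-specific calls such as
# torch.cuda.current_device() or torch.cuda.synchronize()
if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g., device(type='cuda') or device(type='xpu')
    print(acc, torch.accelerator.device_count())
    torch.accelerator.synchronize()                # wait for work on the current accelerator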

8. vLLM: Easy, Fast, and Cheap LLM Serving for Everyone  

Kaichao You from Tsinghua University introduced vLLM, which aims to provide accessible, fast, and affordable language model inference services for everyone. Open-sourced in June 2023, it has gained widespread attention with nearly 48.3K GitHub stars. It is easy to use, supporting offline batch inference and an OpenAI-compatible API server, and works with various model types. As an official partner of major language model companies, it enables immediate deployment upon model release. vLLM supports a wide range of hardware, explores plugin-based integrations, and is used in daily life and enterprise applications. It prioritizes user experience with packages, Docker images, precompiled wheels, and a robust continuous integration system. Finally, he thanked the over 1,100 contributors in the vLLM community.

9. A torch.fx Based Compression Toolkit Empowered by torch_musa 

Fan Mo from Moore Threads introduced torch_musa, a PyTorch plugin enabling PyTorch to run natively on its platform with highly optimized features and operators. He then detailed the compression toolkit, explaining the choice of FX (debuggable, easy to modify graphs, easy to integrate). Its workflow involves inputting models and configuration files, capturing complete model graphs in the tracing phase, and optimizing/reducing via the backend. He also covered customized optimizations and support for multiple data types. Future work includes making large language and vision models traceable, accelerating inference, and building fault-tolerant systems.  

10. Efficient Training of Video Generation Foundation Model at ByteDance  

Heng Zhang from ByteDance shared ByteDance’s experience in large-scale, high-performance training of video generation foundation models, including applications in advertising, film, and animation. He introduced the video generation model structure (VE encoding, MMDIT diffusion, VE decoding) and training process (phased training, with VE encoding offline to optimize storage and preprocessing). He also discussed the challenges of load imbalance in video generation models and solutions. 

11. torch.compile Practice and Optimization in Different Scenarios

Yichen Yan from Alibaba Cloud shared the team’s experience with `torch.compile` practice and optimization. `torch.compile` accelerates models with one line of code through components like graph capturing, fallback handling, and optimized kernel generation, but faces challenges in production environments. To address these, the team resolved compatibility between Dynamo and DeepSpeed ZeRO/gradient checkpointing, submitting integration solutions to relevant libraries; identified and rewrote attention computation patterns via pattern matching for better fusion and performance; and optimized input alignment to reduce unnecessary recompilations. He also mentioned unresolved issues and future directions: compilation strategies for dynamic shapes, startup latency optimization, reducing overhead, and improving kernel caching mechanisms.

12. PyTorch in Production: Boosting LLM Training and Inferencing on Ascend NPU

Jiawei Li and Jing Li from Huawei introduced advancements in Ascend NPU (torch_npu) within the PyTorch ecosystem. Focusing on upstream support for device diversity in PyTorch, they explained the third-party device integration mechanism: using the CPU-based simulation backend OpenRag as a test backend to monitor interface functionality, and establishing mechanisms for downstream hardware vendors to identify risks before community PRs are merged.

Jing Li shared Ascend NPU’s performance and ecosystem support. He introduced the torch_npu architecture for high performance and reliability, which currently supports more than 20 popular libraries, including vLLM, torchtune, and torchtitan. He also explained how torch_npu works with NPUGraph and torch.compile to provide high-performance computation. Finally, he invited everyone to join the community and attend its periodic meetings.

13. Hetu-Galvatron: An Automatic Distributed System for Efficient Large-Scale Foundation Model Training

Xinyi Liu and Yujie Wang, from Peking University, detailed Hetu-Galvatron, an innovative PyTorch-based system with key features: automatic optimization, versatility, and user-friendliness. For model conversion, it builds on native PyTorch, transforming single-GPU training models into multi-parallel-supported models by replacing layers supporting tensors and synchronization comparison. For automatic optimization, it has an engine based on cost models and search algorithms. It supports diverse model architectures and hardware backends, ensuring integration with GPU and NPU via PyTorch. It demonstrates superior efficiency on different clusters and models, with verified performance and accuracy. Future plans include integrating torch FSDP2, supporting more parallelism strategies, more models and attention types, and optimizing post-training workflows.  

14. Intel’s PyTorch Journey: Promoting AI Performance and Optimizing Open-Source Software

Mingfei Ma from Intel’s PyTorch team introduced Intel’s work in PyTorch. For PyTorch optimization on Intel GPUs, Intel provides support on Linux and Windows, covering runtime, operator support, `torch.compile`, and distributed training. For CPU backend optimization in `torch.compile`, the team participated in architecture design, expanded data type support, implemented automatic tuning of GEMM templates, supported Windows, and continuously improved performance speedups. For DeepSeek 671B full-version performance optimization, the team completed CPU backend development with significant speedups (a 14x performance boost for prefill and 2.9x for decode), supporting multiple data types and meeting real-time requirements at low cost.

15. FlagTree: Unified AI Compiler for Diverse AI Chips 

Chunlei Men from the Beijing Academy of Artificial Intelligence introduced FlagTree, a unified AI compiler supporting diverse AI chips and a key component of the FlagOS open source stack. FlagOS, developed by BAAI with multiple partners, includes FlagGems (a general operator library for large models), FlagCX (multi-chip communication), and parallel training/inference frameworks, supporting large model training and inference. He also introduced FlagTree’s architecture for multi-backend integration, and features under development: annotation-based programming paradigms, refactored Triton compiler runtime, etc., with significant performance improvements via related optimizations.

16. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models  

Dr. Mingxing Zhang from Tsinghua University introduced KTransformers, which stands for Quick Transformers, a library built on HuggingFace’s Transformers, aiming to unlock CPU/GPU hybrid inference potential for MoE models via optimized operator integration and data layout strategies. Initially designed as a flexible framework for integrating various operator optimizations, it addresses rising inference costs due to larger models and longer contexts. For scenarios with low throughput and concurrency, it enables low-threshold model operation by offloading compute-intensive parts to GPUs and sparse parts to CPUs (tailored to models like DeepSeek), with flexible configuration. Future focus includes attention layer sparsification, adding local fine-tuning, and maintaining the Mooncake project for distributed inference, welcoming community exchanges.

17. SGLang: An Efficient Open Source Framework for Large-Scale LLM Serving  

Liangsheng Yin, a graduate student from Shanghai Jiao Tong University, introduced SGLang, an efficient open source framework for large-scale LLM serving. As a leading-performance open source engine with an elegant, lightweight, and customizable design, it is adopted by academia and companies like Microsoft and AMD, offering high-performance RL solutions. Its core is the PD disaggregation design, solving issues in non-decoupled modes: latency, unbalanced computation-communication, and scheduling incompatibility. It routes requests via load balancers, enabling KV cache transmission between prefetching and decoding instances. Future plans include latency optimization, longer sequence support, and integrating data-parallel attention. With over 400 contributors, it is used by multiple enterprises.

Read More

Introducing Mixed Precision Training in Opacus

Introduction

We integrate mixed and low-precision training with Opacus to unlock increased throughput and training with larger batch sizes. Our initial experiments show that one can maintain the same utility as with full precision training by using either mixed or low precision. These are early-stage results, and we encourage further research on the utility impact of low and mixed precision with DP-SGD.

Opacus is making significant progress in meeting the challenges of training large-scale models such as LLMs and bridging the gap between private and non-private training. In 2024, we introduced Fast Gradient Clipping to Opacus to reduce the memory footprint of the hooks-based implementation of DP-SGD. Recently, the added capability of fully sharded data parallelism (FSDP) scales training of large models across devices.

Mixed precision training, which combines different numerical precisions, has been effective in speeding up training and reducing memory usage while maintaining model utility. By using low-precision (e.g., BF16) operations alongside single-precision (FP32) operations, larger models can be trained with larger batch sizes and faster matrix operations. As an example, Llama 3 models were trained using a mix of FP32 and BF16, whereas Llama 4 used BF16 and FP8. We invite developers and researchers to experiment with scaling Opacus training to larger models by taking advantage of mixed precision and other recently introduced techniques.

Mixed and low precision training

Single-precision floating-point numbers are represented by 32 bits. Newer GPUs support high-throughput arithmetic operations with floating-point representations of 16 or 8 bits. These efficiency gains have been adopted in deep learning applications where, generally, lower precision does not harm model performance.

In low precision training, forward and backward passes and weight updates are all performed in a low precision data type (e.g., BF16 or FP8). However, weight updates in low precision can be numerically unstable.

As an alternative, mixed precision training maintains weight updates in high precision (FP32) and only uses low precision (e.g., BF16) for the forward and backward passes. Additionally, some layers, such as normalization layers, also perform operations in FP32 to maintain numerical stability.

To enable mixed precision training with Opacus, we add logic to handle the computation of per-sample gradients when activations and backprops have different precision types (as happens with the two green boxes in Figure 1). This logic is implemented inside the functions that calculate per-sample gradients (e.g., here).
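To illustrate the kind of handling involved (a hedged sketch of the idea, not Opacus' actual code; linear_per_sample_grad is a hypothetical helper), mismatched dtypes can be promoted to a common type before the per-sample gradient contraction of a linear layer:

import torch

def linear_per_sample_grad(activations: torch.Tensor, backprops: torch.Tensor) -> torch.Tensor:
    # activations: (batch, in_features), backprops: (batch, out_features)
    if activations.dtype != backprops.dtype:
        # promote both operands to a common dtype before the contraction
        common = torch.promote_types(activations.dtype, backprops.dtype)
        activations = activations.to(common)
        backprops = backprops.to(common)
    # one weight gradient per sample: (batch, out_features, in_features)
    return torch.einsum("bi,bj->bij", backprops, activations)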

Figure 1. Forward and backward pass with mixed precision training. LayerNorm forward pass happens in full precision. Linear layer operations (and most other layers) are in low precision. The output of one layer is the input to the next layer.

Figure 2. Weight update with mixed precision. Weights are always stored in full precision. Backprops are cast up to full precision.

How to use mixed and low precision training with Opacus

Low and mixed precision training with Opacus is achieved with just a few extra lines of code. It looks very similar to non-private training. Recall that Opacus wraps training components, such as model, optimizer, and dataloaders, into the PrivacyEngine, the main interface of the library. Thereafter, the training loop is identical to native PyTorch. 

Low precision training

Compared to full precision training with Opacus, we only need to: 

  • cast model weights to the lower precision before training starts, and 
  • cast inputs to lower precision.

from opacus import PrivacyEngine

# cast model weights to lower precision before training
model = model.to(torch.bfloat16)
model.train()

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
)

for x, y in dataloader:
    # cast input to lower precision
    # integer inputs should stay as integers
    # (if y is a float, y should also be cast)
    if x.is_floating_point():
        x = x.to(torch.bfloat16)
    # proceed with training step as usual
    output = model(x)
    optimizer.zero_grad()
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

Mixed precision training

PyTorch supports mixed-precision training through the torch.amp package, which we also leverage. The forward pass and loss computation run within the autocast context, whereas the backward pass should be outside of the context. The main change here is the addition of the torch.amp.autocast context.

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
)

for x, y in dataloader:
    # mixed precision context for the forward pass and loss computation
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        output = model(x)
        optimizer.zero_grad()
        loss = criterion(output, y)

    # backward pass is outside of the autocast context
    loss.backward()
    optimizer.step()

BERT fine-tuning task

We experiment with fine-tuning a pre-trained BERT-base model with the SNLI dataset (similar to this Opacus tutorial). We consider two common fine-tuning setups for DP-SGD: 

  • fine-tuning only the last few layers of the model, while freezing all other layers, or 
  • fine-tuning all layers with LoRA (low-rank adaptation). 

In the first case, ghost clipping improves memory usage given the large width of linear layers. In the second case, ghost clipping is not useful since the effective layer width with LoRA is very small. 

We use either FP32 only, BF16 only, or mixed-precision for training. 

We find that while non-private training has the same utility across all precision settings, DP-SGD fine-tuning of the last few layers with BF16 has a drop in performance. Mixed precision training recovers this utility loss. With LoRA fine-tuning, DP-SGD maintains the same utility across all precision settings. We hypothesize that low precision training with DP-SGD performs best when fine-tuning only linear layers, as is the case with LoRA. However, it harms utility when other types of layers are involved (e.g., normalization layers), which typically require high precision operations. Thus, LoRA is our recommended fine-tuning setting for DP-SGD with low/mixed precision. 

Memory and speed improvements

In Table 1, we compare peak memory for one forward and backward pass across the three precision settings. BF16 improves peak memory by ~2x compared to FP32, whereas mixed precision improves it by ~1.2-1.4x. Note that at small batch sizes, mixed precision can use more memory than FP32, as model weights are stored twice, in low and high precision.

In Table 2, we compare the time for one training step across the three precision settings. The BF16 speedup over FP32 increases with the batch size, ranging from ~2x to ~6x. Mixed precision training speedups range from ~1x to ~4x.

Experiments were performed on an A100 GPU with 40GB of memory.

Table 1. Peak memory for one iteration (forward+backward pass) with increasing batch size. 

Table 2. Running time of one iteration (averaged over 10 runs), with increasing batch size. 

Impact on utility

We train for one epoch and measure the maximum test set accuracy achieved during the epoch. We average the maximum accuracy over 5 runs.

When fine-tuning the last few layers, mixed precision and FP32 training achieve on par performance, while low precision training incurs a significant decrease in utility. The utility gap shrinks as the privacy budget increases. In the non-private case, low precision training seems to help performance, likely due to a regularization effect from the noisier matrix operations that counters overfitting. 

With LoRA fine-tuning, the highest accuracy is achieved with BF16, with the advantage of BF16 increasing as the privacy budget increases. Mixed and high-precision training are on par. 

We hypothesize that low precision training with DP-SGD performs best when fine-tuning only linear layers, as in LoRA. It harms utility when other types of layers are involved in fine-tuning, such as normalization layers, which typically require high-precision operations.

Table 3. Test set accuracy averaged over 5 runs, at different privacy levels. Batch size = 32. 

Industry use case

We experiment with fine-tuning a large language model with 8B parameters with DP-SGD and LoRA (~7M trainable parameters). Compared to FP32 training, BF16 achieves a 3.4x increase in samples processed per second, whereas mixed precision achieves a 1.1x increase. We achieve on-par loss and loss convergence speed between all precision settings. 

Conclusion

We have integrated a popular technique for training large-scale models into Opacus, further enhancing Opacus’ ability to meet the challenges of private training.  With mixed and low precision, Opacus achieves increased throughput and training with larger batch sizes. Our preliminary experiments show that this can be achieved without sacrificing utility. We also provide some insight into which type of precision is most suitable for different fine-tuning settings. We invite developers and the research community to experiment with this new feature and to provide further results on the utility performance of DP-SGD in mixed and low precision settings. 

To learn more about Opacus, visit opacus.ai and github.com/pytorch/opacus

Acknowledgments

We thank Ilya Mironov and Will Bullock for their valuable technical review and guidance.

Read More

Bringing Generative AI to the Masses with ExecuTorch and KleidiAI

Bringing Generative AI to the Masses with ExecuTorch and KleidiAI

Key Takeaways:

  • ExecuTorch 0.7 now enables KleidiAI by default, delivering automatic acceleration on Arm CPUs with zero integration effort.
  • GenAI is now performant on millions of existing devices—including 3–5 year-old smartphones and Raspberry Pi 5—thanks to Arm CPU features like SDOT and I8MM.
  • On-device use cases like private voice assistants, message summarization, and local code/gen AI copilots are now possible—without the cloud, and without the battery drain.

Arm’s recent SME2 announcement underscores the growing role of Arm KleidiAI as the AI acceleration layer powering the next wave of AI on Arm. By embedding into widely used Edge AI frameworks like XNNPack, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI has delivered substantial performance improvements with no code changes required by developers. That foundation leads directly to the upcoming ExecuTorch 0.7 beta, where KleidiAI will be enabled by default—bringing automatic acceleration to devices built on the latest Arm CPU architecture, as well as a vast base of existing phones built on earlier generations.

Android and cross-platform developers—whether first- or third-party—gain instant access to KleidiAI performance optimizations via ExecuTorch and XNNPack. The result? Faster model startup, lower latency, leaner memory footprints—and no integration hurdles. What previously required custom tuning is now turn-key performance, ready out of the box. This efficiency unlocks new possibilities—not just for the latest high-end devices, but for a much broader range of hardware.

When we consider running Generative AI (GenAI) on mobile devices, it’s easy to envision the latest flagship smartphones equipped with powerful CPUs, GPUs, and NPUs. But what if we told you that GenAI experiences—like running large language models (LLMs)—can also be brought to devices that are 3, 4, or even 5 years old? Or even to the Raspberry Pi 5?

Well, this is now not just a vision, but a practical reality—thanks to the Arm SDOT CPU feature, which has been available in Arm CPUs since 2015.

What is SDOT?

The SDOT (Signed Dot Product) instruction, introduced in the Armv8.2 architecture and carried forward in later CPUs, enables efficient dot product operations on vectors of signed 8-bit integers. The following image illustrates the behavior of one such SDOT instruction available on Arm CPUs:

As shown above, the instruction produces four 32-bit integer outputs, each resulting from the dot product of corresponding groups of four int8 elements from the left-hand side (LHS) and right-hand side (RHS) vector registers.

This instruction can be utilized to accelerate matrix multiplication routines—the core computational workload behind every LLM—when using Int8 or lower-bit precision formats, such as Int4. These operations typically involve numerous dot products between individual rows of the left-hand side matrix and corresponding columns of the right-hand side matrix.
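
As a concrete (non-intrinsic) reference for these semantics, the NumPy sketch below treats each 16-element int8 operand as four groups of four values and accumulates one int32 dot product per group, mirroring the per-lane behavior described above.

```python
import numpy as np

def sdot_reference(lhs: np.ndarray, rhs: np.ndarray, acc: np.ndarray) -> np.ndarray:
    """Reference semantics of one SDOT operation on 16 int8 values per operand.

    lhs, rhs: 16 signed 8-bit integers each, viewed as four groups of four.
    acc:      four 32-bit accumulators, one per group.
    Each output lane adds the dot product of the corresponding int8 quadruplet.
    """
    lhs32 = lhs.astype(np.int32).reshape(4, 4)
    rhs32 = rhs.astype(np.int32).reshape(4, 4)
    return acc + (lhs32 * rhs32).sum(axis=1)

# Four independent 4-element int8 dot products accumulated into int32 lanes.
lhs = np.arange(-8, 8, dtype=np.int8)
rhs = np.full(16, 2, dtype=np.int8)
print(sdot_reference(lhs, rhs, np.zeros(4, dtype=np.int32)))  # [-52 -20  12  44]
```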

The SDOT instruction is already widely supported across a diverse range of devices, opening the door for GenAI use cases to reach a significantly larger smartphone audience. As of today, approximately 3 billion Arm-based devices ship with CPUs that include this capability—enabling powerful on-device GenAI experiences for the majority of users. In fact, 72% of all devices now support this instruction.

Thanks to ExecuTorch, we’re now enabling models like Llama 3.2 to run efficiently on the majority of Android devices as well as edge devices like the Raspberry Pi 5.

KleidiAI + ExecuTorch: Bringing It All Together

For last year’s quantized Llama 3.2 1B announcement, the ExecuTorch and KleidiAI teams collaborated to deliver optimizations for Int4 matrix multiplication on Arm CPUs leveraging the I8MM feature, available from the Armv8.6 architecture onwards. As highlighted in a previous blog post, ExecuTorch with KleidiAI achieves over 20% higher prefill performance on the Galaxy S24+ compared to non-KleidiAI kernels. This translates to more than 350 tokens per second during the prefill phase and over 40 tokens per second during the decode phase. This level of performance is sufficient to enable on-device tasks, such as summarizing unread messages, with a smooth user experience using only Arm CPUs. For context, summarizing around 50 unread messages typically involves processing approximately 600 tokens.
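
Put differently, at more than 350 prefill tokens per second, those roughly 600 tokens are processed in under two seconds before decoding even begins.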

This year, the ExecuTorch and KleidiAI teams have focused on optimizing Int4 matrix multiplication performance by leveraging the SDOT instruction, aiming to broaden adoption.
👉 See the XNNPack PR

While LLM performance on Arm CPUs with only the SDOT extension may not match that of the latest flagship smartphones, it still enables impressive capabilities for on-device generative AI. In fact, in many scenarios, the decode phase is faster than the average human reading speed—highlighting that even older Arm CPUs can support practical and meaningful GenAI use cases.

For example, when combined with speech-to-text and text-to-speech models, a local LLM of this kind enables the creation of a fully private smart assistant that operates entirely offline, eliminating concerns about data privacy while still offering rich voice-based interactions. Such an assistant could seamlessly interact with your connected devices, giving you peace of mind about your data.

Another compelling use case for running Llama 3.2 1B is context-aware text completion in local text editors. As you type, the model provides intelligent, real-time suggestions to streamline writing or coding workflows—all without requiring an internet connection.

These are just a few examples, and they only scratch the surface of what’s possible with on-device GenAI.

Conclusion: GenAI for Everyone

With the combined power of SDOT, KleidiAI, and ExecuTorch, we’re pushing the boundaries of what’s possible—bringing Generative AI beyond high-end flagship devices and making it accessible on billions of Arm-based devices already in use.

Now it’s your turn—we’re excited to see what you’ll create. To help you get started, check out Arm’s learning path, designed to guide you through developing your own applications with LLMs using ExecuTorch and KleidiAI.

Read More