Startup Showcase Returns to the PyTorch Conference October 21 in San Francisco

The Startup Showcase returns to the PyTorch Conference on Tuesday, October 21, 2025, spotlighting the most promising early-stage teams building real-world AI applications. The program gives founders a high-visibility platform to connect with investors, potential customers, and engineering talent.

Why attend

  • See what’s next: Live, on-stage pitches from startups pushing the boundaries of AI.
  • Meet the builders: Direct access to technical teams and founders.
  • Expand your network: Engage with leading VCs, industry partners, and recruiters.
  • Get inspired: Discover breakthrough ideas at their earliest and most exciting stage.

Join us

PyTorch Conference Startup Showcase

Be there as innovative AI startups pitch live to a panel of top VCs. Whether you’re an engineer, an investor, or simply passionate about cutting-edge AI, this is a front-row seat to the future.

Save the date: Tuesday, October 21, 2025 – San Francisco

Next steps: Register for the PyTorch Conference and learn more about the Showcase.

Founders building the next game-changing AI tool or platform? Apply to pitch.

Calling All Startups

PyTorch startup showcase

Pitch live to leading investors, connect with PyTorch engineers, and raise your visibility across the global AI community.

Selected startups receive:

  • A live 5-minute pitch slot
  • 2 PyTorch Conference Passes
  • Promotion through PyTorch marketing channels
  • Opportunity to network during the Startup Showcase Reception
  • Ability to sponsor a booth in the Startup zone of the conference Expo

Learn more about the Startup Showcase and apply to pitch by September 14, 2025.

Startup Evaluation Criteria

  1. Mission Alignment – Evaluates the extent to which the startup’s vision and focus resonate with the foundational values of innovation and community, as well as the evolving priorities of the PyTorch ecosystem.
  2. Novelty and Differentiation – Considers the distinctiveness of the startup’s concept or technology, emphasizing original thought and the ability to challenge conventional approaches.
  3. Technical Depth and Ecosystem Integration – Assesses the level of technical rigor and how deeply the startup’s solution integrates with AI and whether it leverages projects from the PyTorch ecosystem.
  4. Strategic Viability and Growth Trajectory – Reviews the soundness of the startup’s business logic, market relevance, and potential to scale effectively.
  5. Ecosystem Enrichment – Looks at the startup’s potential to positively influence the broader PyTorch and open-source communities – through contribution, accessibility, or capability expansion.

Calling all VCs

The PyTorch Startup Showcase offers a first look at high-potential startups and industry-leading talent building the next wave of AI and ML innovations. As a sponsor, you’ll play an active role in spotlighting breakthrough technologies and connecting with founders before they scale.

Sponsor perks include:

  • A seat on the judging panel
  • Branding exposure in Startup Showcase marketing and signage
  • Direct engagement with startups during the reception
  • Startup Showcase Application contact list

Last Year’s Startup Showcase

Finalists for the 2024 PyTorch Conference Startup Showcase represented some of the most innovative AI/ML startups in the industry, including Remix Inc., Cartesia, OpenBabylon, Remyx AI, A2 Labs, Inc., QuicSnap, Iso AI, CTGT, and Creao.ai. The winner, CTGT, empowers companies to create customized models using 500x less compute and went on to raise $7M to help enterprises break through the limits of AI compute.

PyTorch Startup Showcase 2024 Winner

Last year, the Showcase was moderated by Chappy Asel of The AI Collective and judges included investors and VCs from Felicis, GitHub, Vertex Ventures, Mayfield, Gradient Ventures, and Andreessen Horowitz.

More information about sponsoring the Startup Showcase is available on the PyTorch Conference website.

We’re looking forward to seeing you at the 2025 Startup Showcase!

A Primer on LLM Post-Training

Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past year or so, we have started to see them do a lot more than just rephrase docs: LLMs can now think before they act, they can plan, they can call tools like a browser, they can write code and check that it works, and a lot more – indeed, the list is growing quickly!

What do all these skills have in common? The answer is that they are all developed in what we call the post-training phase of LLM training. Despite post-training unlocking capabilities that would have looked magical to us a few years ago, it surprisingly gets little coverage compared to the basics of Transformer architectures and pre-training. 

This tutorial was originally written for the Meta infrastructure team, with the target audience of an infra engineer without expertise in LLM modeling who wanted to learn more about post-training to be able to contribute. I believe that this encompasses a large group of engineers: with Reinforcement Learning becoming mainstream, we need new infrastructure to stay productive, so bridging this gap is critical! I now share this broadly with the hope that many more folks across the PyTorch Foundation will share a similar background and interest, and that they will also find this helpful, like our team did.

Primer on post-training

Post-training (sometimes referred to as “alignment”) is a key component of modern LLMs, and the way to “teach” models how to answer in a way that humans like, and how to reason.

Why is post-training different from pre-training, you ask? Post-training primes the model to have a conversation with a user, which follows a set of basic rules such as:

  1. In a conversation, there’s more than one speaker, and they all take turns talking
  2. You should listen before you talk to say something relevant

We find these obvious, but pre-training only does next-word prediction to teach the model about the world: the data there is completely unstructured, so the model never learns these basic rules. Indeed, a model coming out of pre-training is often bad at understanding that it should stop talking after a while and will blabber on forever, kind of like a Google autocomplete box.

Furthermore, it’s also useful to impose some ground rules on the model that take absolute precedence over everything else. This is done in post-training through a system prompt (and/or through Supervised Fine Tuning (SFT)/reward shaping, see later).

Post-training data format

Chatting with these models is possible via some plumbing that happens behind the scenes. Every time you talk in a chat window to a service like ChatGPT, you’ll see a UI like this:

What actually happens is that the post-training structure is plumbed for you, and the model will see something like this (using the data format for Llama 3):

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
… <|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris

<|start_header_id|>user<|end_header_id|>
How many people live there? Tell me just the number<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
START FILLING FROM HERE

Note that the basic interface of the LLM is unchanged: you provide some text, and it will continue it to infinity and beyond. 

What this clever plumbing does is make sure that the model receives all the metadata to know that the previous speakers have spoken (and it should not impersonate them!!), and to stop once the assistant is done talking. Again, the model will happily continue filling, but once we see an <|eot_id|> token (“end of turn”), we stop the model from continuing and send the result back to the user.

Note that the model won’t do any of this on its own: for all the cleverness we perceive, these models remain just text fillers that require this hand-holding.
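
To make that plumbing concrete, here is a toy sketch of the stopping logic. Everything here is illustrative: model.next_token is a hypothetical helper, and eot_token_id stands for whatever id the tokenizer assigns to <|eot_id|>.

def chat_turn(model, prompt_ids, eot_token_id, max_new_tokens=512):
    generated = []
    for _ in range(max_new_tokens):
        next_id = model.next_token(prompt_ids + generated)  # hypothetical: sample one more token
        if next_id == eot_token_id:
            break  # the assistant's turn is over: stop and hand control back to the user
        generated.append(next_id)
    return generated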

This format is a bit hard to read, but you can essentially conceptualize this as something like:

<system> You are a helpful assistant bla bla </system>
<user> What's the weather in Paris? </user>
<assistant> ANSWER HERE

Fun fact: the model will happily play either part! You can totally play the assistant and let it play the user by feeding it the right text structure – the model simply takes over and completes text no matter what you provide. Try it out using a local model or an API (products like ChatGPT will do this plumbing for you and you can’t override them). Obviously, a model is very sensitive to its own format so make sure you use the correct one.
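
If you want to try this, here is a minimal sketch using the Hugging Face transformers library (assuming a chat-tuned Llama 3 checkpoint you have access to); the tokenizer’s chat template inserts the special header and <|eot_id|> tokens shown above for you:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation into the raw text the model actually sees
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

Nothing stops you from appending your own “assistant” messages to the list and letting the model continue as the user.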

Post-training techniques

Post-training is a rapidly-changing field where different teams will use different techniques.

Let’s look at this pipeline as described by the OLMo 2 paper:

The following section goes through each box one by one.

SFT: Supervised Fine Tuning

The focus in SFT is imitation. It’s conceptually simple: you teach the model to forcefully learn an answer, step by step.

If this were chess, you could SFT by training on Magnus Carlsen’s games, where you would forcefully teach the model that it should follow the moves Magnus took at each and every step. You can see the limit of SFT: you can reach Magnus’s level, but he is going to be your ceiling, as you have no way of surpassing him (as opposed to Reinforcement Learning (RL), where you can keep trying until you get really good, see later).

In the case of LLMs, you learn the ideal answer word by word, so your loss function is simply cross-entropy against the output layer, with the ideal “class” being the id of the “correct” word. A common question we get is: isn’t this the same as pre-training, then? It is indeed very similar, with only one crucial difference: you only condition on the prompt; you don’t learn the prompt. Why? Because what you want is to learn how to answer that question, not to learn the question itself.

Remember: unlike pre-training, in post-training we have this structure with a system prompt and user prompt (see the post-training data format right above). These are the parts of the input that we want to condition on, but not learn from. We do this by feeding the whole sequence through (including the system prompt and user prompts as well as any special characters) without any masking, but when we compute the loss, we mask out every token that is not part of the actual response. We don’t mask the prompt in the input, as we do want its contribution to conditioning the model; we mask it in the backward step to prevent it from contributing to the loss.
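
As a minimal sketch (assumed shapes and names, not any particular framework’s API), the masking boils down to setting the label of every prompt token to an ignore value before computing cross-entropy:

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    # logits: (seq_len, vocab_size), input_ids: (seq_len,)
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # mask the prompt: condition on it, don't learn it
    # standard next-token shift: position t predicts token t+1
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)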

They are so similar that in practice, SFT can piggyback on all the infra that was already built for pretraining: indeed, training platforms like Megatron use the same dataloader and trainer class as the pretraining step, and simply set an argument to mask the loss over the prompt.

That said, the scale for these is nowhere near pre-training – you should expect to do SFT within a few million samples at the maximum, so only a few B tokens, whereas pre-training will crunch trillions of tokens.

Who writes the response?

SFT learns word by word, which limits it in two important ways:

  1. Your ceiling is represented by whoever wrote your answer (see previous paragraph)
  2. Critically, you are strongly dependent on data quality. In practice, your ceiling ends up being determined by your worst answers rather than your best. When you source answers from people, you cannot expect all of them to be of the same, amazing quality. Some of them will not be great, and they have a ton of impact on the quality of your model.

So, what do we do? We can’t fight human nature, so the idea is to instead have the LLM generate the responses it will train itself on. This seems unintuitive at first, but the idea is called Rejection Sampling (see the Llama 2 paper for more info). It works because we don’t just generate one answer (then, yeah, it would not improve itself) but many (usually 10), from multiple different checkpoints, random seeds, system prompts, etc., to elicit diversity. Then we keep the best answer (as ranked by a pipeline, often including other models such as the human preference reward model) and add it to the bank. If you have a background in Machine Learning (ML), you can think of this approach as a way to kinda, sorta, distill from a kinda, sorta, ensemble into a single model (very handwavy, I know!).

You can ride this loop through many iterations. If done right, you can climb the hill and get better and better.
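
A rough sketch of one such iteration (all names here are hypothetical placeholders, not a real API): generate several candidates per prompt from a diverse set of generators, score them, and keep only the best as new SFT data.

import random

def rejection_sample(prompts, generators, reward_model, n_samples=10):
    new_sft_data = []
    for prompt in prompts:
        candidates = []
        for _ in range(n_samples):
            # vary checkpoint / seed / system prompt to elicit diversity
            model = random.choice(generators)
            candidates.append(model.generate(prompt))
        # keep the best answer according to the ranking pipeline
        best = max(candidates, key=lambda ans: reward_model.score(prompt, ans))
        new_sft_data.append((prompt, best))
    return new_sft_data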

A primer on Reinforcement Learning

RL is a wide spectrum that includes the famous Reinforcement Learning from Human Feedback (RLHF), but it’s not limited to it.

In general, the whole idea behind RL is that you are now an agent that can take actions against an environment and get to observe what happens and obtain rewards from it. You can think of these rewards as generating a “label” that you can then train on – though backprop is gonna look different (more on this later).

How do you train?

Backpropagation does happen in RL, but with key differences from the neat forward-backward loops we enjoy in supervised learning.

The key difference is that in RL we unfortunately do not have a differentiable cost function, like Cross Entropy or Mean Square Error (MSE). Rewards (and tools like browsers) are not differentiable, so you can’t just backprop through them – which is very unfortunate.

So what we do instead is a very crude approximation: we simply operate on the logprobs output by the model and make them bigger if the action was good, and smaller if it wasn’t – then we backprop into the model to tweak all previous layers to make this happen. Note that this is far less efficient than optimizing a supervised cost function like MSE or Cross Entropy: supervised cost functions return a dense vector of gradients, whereas here we only get a single scalar back from a whole episode, so we get less “learning juice” per interaction.

There are more bells and whistles around the RL learning process, with different algorithms making different choices on how to address the major problems (vanishing/exploding gradients, sample efficiency, infra optimization, and so on), but this is the general gist. The Appendix gives you a detailed step-by-step derivation of Proximal Policy Optimization (PPO).

Let’s look at this from an infra perspective: compared to “standard” supervised learning, you need to run a ton of inference on the model, which is more expensive (autoregressive, token-by-token vs feeding a whole existing sequence through in a single forward pass), requires more infra (KV cache etc), is harder to batch, and so on.

Despite the training objective being so crude, it actually works very well in the presence of sparse or long-term rewards, while SFT requires rewards to be dense (you are told what each token should be).

Following the chess analogy:

  • SFT teaches the model to copy move by move
  • In RL, the model is rewarded when it wins the match (which can be after 20 moves). The training algorithm will favor model configurations that lead to more reward, more often (given some exploration/exploitation trade-off).

While you start off much weaker, eventually, by playing enough games, you can reach Magnus’s level, and far surpass it – in other words, the ceiling you can reach is much, much higher.

In a way, you can view RL as a magical machine that closes the gap between judging and doing: that is a very big gap! I can recognize when a F1 driver messes up, but I can’t do what even the worst F1 driver can do. RL can transform any armchair driver into an actual driver 🏎 

If you want a one-liner, “if you can judge it, you can learn it!”.

Reward hacking

Note one thing: your ceiling is still going to be set by your ability to judge, which is determined by how good your rewards are. For something like games, rewards are super clean (you know when you win with 100% precision and recall), so the sky is the limit. You can apply RL to anything, but if you can’t judge very well, your agent is going to learn noise.

A corollary to RL being limited by your judging ability is that it may learn behaviors that you didn’t intend: RL will do anything in its power to maximize the reward you give it, so the model will do exactly what you asked for, not what you wanted. We call this reward hacking, even though the model isn’t responsible for it; we are, as the ones designing the incentives!

One example to drive this home: RL found out that Super Mario 1 has had a bug for over 30 years: after you jump, if you turn around, you are invulnerable for a single frame. Given that its reward is to maximize the score, and that clearing a level faster makes your score go up, it exploits the hell out of this bug to get a higher score (and thus a higher reward).

What developers wanted:

  • Maximize score
  • Still play like a human
  • Don’t exploit bugs

What developers asked:

  • Maximize score

Note: this happens with humans too! We just call these Perverse Incentives, but they are literally the same thing. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for income.

Applications to LLMs

Similar to the chess example, RL can have a higher ceiling than SFT and thus it’s the method of choice to teach the model how to converse with humans in a way we like (RLHF), how to reason, and so on.

As we have just seen, the ceiling you can reach is set by how good your rewards are: if you use a classifier to provide a reward, your ceiling will be the accuracy of that classifier.

RLHF

One such classifier is human preference. It’s hard to write a rule for how to write with a style that humans like, so what we do instead is train a classifier to score these. We monitor its performance via accuracy and PR AUC (this one is only necessary if you want a continuous score to rank on; otherwise, point estimates like F1 or Accuracy are all you need). Once you have this classifier, all you need to do is run RL against it and optimize against its feedback. 
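
A minimal sketch of how such a preference classifier is typically trained (assuming a model rm with a scalar head that scores a prompt/answer pair; a pairwise, Bradley-Terry-style loss is one common choice):

import torch.nn.functional as F

def preference_loss(rm, prompt, chosen, rejected):
    r_chosen = rm(prompt, chosen)      # scalar score for the preferred answer
    r_rejected = rm(prompt, rejected)  # scalar score for the rejected answer
    # push the preferred score above the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()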

DPO: Direct Preference Optimization

Now let’s look at the second box in our LLM post-training pipeline. DPO is an algorithm used specifically for RLHF of LLMs; it is not a general RL algorithm (unlike PPO and friends, which can be used to train robots and anything else you want).

In fact, technically, DPO is not even a proper RL algorithm; it just pretends to be one! The whole idea of DPO is that if you make some reasonable assumptions, you get to have a differentiable loss function while still training for RLHF. 

To be more accurate, DPO allows us to have a supervised learning solution to a Markov Decision Process under certain assumptions (see later), which is a big deal as normally the only general way to solve MDPs is via Reinforcement Learning (and its inefficient cost function). 

The core idea behind DPO is this: instead of having a separate reward model, you can recycle your LLM to be both your policy model and your reward model. Why? Because your LLM gives you the probability of an answer given a question. So, if you have a preference pair, you can simply say that for the same question, you want the probability of the preferred answer to be high and the probability of the rejected answer to be low. In other terms, you want to maximize the gap between the (log-)probability of the preferred answer and that of the rejected one, which leads to a very nice differentiable function!
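
As a sketch (assuming you have already summed the per-token log-probs of each answer under the policy being trained and under a frozen reference model), the DPO loss looks roughly like this:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # push the preferred answer up and the rejected one down, relative to the reference
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()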

DPO is dirt cheap to run compared to PPO and other RL algorithms that instead need to sample multiple answers by running a ton of inference. The negative is that DPO doesn’t explore, so there’s also a limit to how good you can be. Here’s a more detailed comparison to a “real” RL algo like PPO (which we are going to see in more detail next).

| Feature | DPO (Direct Preference Optimization) | PPO (Proximal Policy Optimization) |
|---|---|---|
| Optimization | Supervised Learning | Reinforcement Learning (RL) |
| Data Needed | Fixed dataset of (prompt, preferred, rejected) pairs | Rollouts + Reward Model |
| Loss Function | Binary classification-like loss | Clipped policy gradient loss |
| Exploration | ❌ No (fixed dataset, no exploration – fully Offline) | ✅ Yes (policy can explore new responses – Online algo) |
| On-Policy? | ❌ Off-Policy (learns from fixed data) | ✅ On-Policy (requires new rollouts) |
| Compute Cost | ✅ Low (single forward pass per pair) | ❌ High (rollouts + PPO training) |
| Training Stability | ✅ Stable (like fine-tuning) | ❌ Unstable (RL variance) |
| Convergence Speed | ✅ Fast | ❌ Slower (needs many rollouts) |
| Performance limited by | ❌ Data | ✅ Compute (better place to be) |
| Best for | Cheap alignment using human preferences | More flexible but expensive fine-tuning |

Online RL

The third and final box in our sample LLM post-training pipeline is Online RL. The “standard” algorithm is PPO (Proximal Policy Optimization), introduced by OpenAI in 2017. Another algorithm that’s widely adopted is GRPO (Group Relative Policy Optimization, introduced by DeepSeek).

Key concept: On-policy vs Off-policy

A policy is simply the LLM you are training. Being on-policy means that every interaction with the environment comes directly from the model being trained. This makes the most sense, as that is how we learn with a private tutor: we try something, we make some mistakes, we immediately get feedback, and we try again with that knowledge. This is in contrast to learning off-policy, where you are shown what someone else did in this situation and you learn from it – that someone else can be (and often is) a past version of you, so it’s common to keep around a memory of what you did in the past in a replay buffer for later use.

RL algos belong to one of these two families, with PPO belonging to the on-policy side and Q-learning (like DQN) being in the off-policy camp.

| Feature | On-Policy RL (e.g., PPO) | Off-Policy RL (e.g., DQN, DPO) |
|---|---|---|
| Definition | Learns from data collected by the current policy | Learns from previously collected data (even if from old policies) |
| Exploration | ✅ Yes (continuously generates new rollouts; the explore-exploit tradeoff is baked into the logits of the policy network – it naturally starts by exploring a lot and gradually moves towards exploiting) | ❌ When run offline, you never explore, as you just recycle old data you generated before (like DPO). You can run it online and explore, but the exploration strategy is left to you to define (e.g., epsilon-greedy) |
| Infra Efficiency | ❌ Low (needs to always generate data online) | ✅ Higher (reuses past generations) |
| Training Stability | ❌ Unstable (policy keeps changing) | ✅ More stable (fixed dataset or replay buffer) |
| Compute Cost | ❌ High (requires frequent rollouts; the cost is mostly due to the sync nature of the training loop: collect → train → collect → train) | ✅ Low (trains on stored data) |
| Example Algorithms | PPO, A2C, TRPO | DQN, DDPG, SAC, DPO |
| Best for | Situations where continuous exploration is needed | When you can store and reuse past experiences |

From an infra point of view: Online vs Offline RL

On-policy vs Off-policy is looking at things from the perspective of the model training dynamics. If we look at things from an infra perspective, we should think about Offline (using static data that we simply reload while we train) vs Online (we generate data live). These two concepts map well to Off-Policy and On-Policy, so are sometimes used interchangeably, though technically they are still a bit different:

  • If you are learning offline, you can only learn off-policy (as the data was generated by another model and saved). Rejection Sampling and DPO are off-policy, offline algorithms. This is the simplest thing for infra.
  • If you are learning online, being on-policy or off-policy is actually more of a spectrum once you start thinking about using multiple machines and synchronizing. If you want to be strictly on-policy, it means you are training with batch size = 1, then sampling from that model, and introducing barriers throughout so that the model is updated constantly and no new trajectories are sampled before we re-scatter all the weights to all nodes.

In code:

# Idealized PPO training loop

collector = CollectorClass(model)

for i in range(num_collection):
    collector.sync_weights_()  # align weights across all workers
    # resume collection and put trainer node on hold <- this is bad!
    data = next(collector)  # collect data
    # Put collector nodes on hold <- this is bad!
    for j in range(num_epochs):
        for batch in split_data_randomly(data):
            loss_val = loss_fn(batch)
            loss_val.backward()
            optim.step()
            optim.zero_grad()

Therefore, some degree of “off-policyness” is desirable and acceptable. This raises many questions: to what extent is that the case? How frequently should you update the collection weights? How can you overlap the weight sync, the collection process, and the model training to maximize throughput?

I don’t want you to leave this section with the idea that off-policy and offline algorithms are necessarily the pits and to be avoided in all cases. In fact, the first two pieces of our pipeline in Llama (SFT and DPO) are essentially a way of solving the alignment Markov Decision Process via a supervised cost function. These are acting as “kind of” offline, off-policy RL:

  1. Our SFT data is coming from Rejection Sampling, meaning that we generate it with the model itself. While we don’t use a proper off-policy RL algo like DQN, rejection sampling done this way is a form of offline policy optimization.
  2. Similarly, we have seen that DPO is also a form of offline policy optimization that also gets away with doing actual RL gradient updates, which are slow and unstable.

Different groups found different recipes in how to leverage all these techniques and how to combine them, but not every group publishes this information.

Beyond RLHF: a general paradigm

There is nothing limiting us to sticking to human feedback – and indeed, we are not. If you want to learn how to code well, you can instead provide a testing harness and give a reward that’s proportional to how many tests you pass. If you want to learn to solve integrals, you can use Wolfram to check if your equation is correct.

In short, you can build reward pipelines by mixing Software 1.0 and Software 2.0. The common patterns are:

1. Reward Models. A classifier that gives a continuous score from 0 to 1 (or sometimes even unbounded). Useful for ranking many answers (just sort on it), especially in domains where you can’t easily express what you want (human preference, writing style, etc).

a. Outcome Reward Models. ORMs provide feedback based on the final result and the final result only of a chain of thought. These are the most common ones – indeed, if someone just says “Reward model”, it’s one of these.

b. Process Reward Models. In the chess example above, the reward is sparse: it only arrives at the end, when the game is won or lost. Intuitively, having dense rewards could help the agent build more fine-grained knowledge of what behaviour is or is not desirable. This is what PRMs do: they don’t just judge the final output, but the whole sequence of steps. For example, a PRM would judge an entire chain of reasoning and make sure that every step is sound. These are not commonly used (at least not yet) as they tend to be very noisy.

2. Rule-based rewards. You can write down a list of rules and attach a reward to each one (or to respecting them as a whole).

a. Software pipelines. These run “normal” software 1.0 and give you a reward based on that. For example, passing test cases for coding or passing linting.

b. Judges. When you can’t check if rules were followed via a traditional software pipeline, you can instead use an LLM to verify if rules were respected for you. For example, you can write a set of safety rules that shall not be broken, eg: “No mentions of sexual stuff”, and a judge can ask the question “Were all the rules followed?”. NOTE: this is different from a continuous-score reward model, because you only get a binary answer (rules followed or not). You can’t rank many answers on this, so this is instead used to provide a negative reward if some rules were broken. In practice, you often don’t even train these and instead simply prompt a foundation model to judge for you.

These are components of what can become very sophisticated reward shaping pipelines. For example, you can imagine having a complex Directed Acyclic Graph (DAG) of these to give very granular rewards. 

Simple example for coding:
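
As a hypothetical sketch of what such a pipeline could look like for code generation (run_linter, run_tests, and llm_judge are placeholders, not a real API):

def coding_reward(code, test_suite):
    if not run_linter(code):              # software 1.0 check used as a hard gate
        return -1.0
    passed, total = run_tests(code, test_suite)
    reward = passed / total               # fraction of tests passed from the harness
    if not llm_judge("Is this code readable and idiomatic?", code):
        reward -= 0.2                     # judge-based penalty on top
    return reward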

This has infra implications:

  • Sandboxing for the test harness and any other binaries you need to run
  • Where do we host all these reward models? You can have many, and they can be big (don’t think of these as small dedicated classifiers! They are often just as big as the model you are training!).
  • We should expect more and more engineers to develop better and better reward pipelines (more granular, to shape the model’s behavior). This can become the place where engineers push their contributions. These pipelines themselves can be a useful thing to have a Hub around and to crowdsource development.

Test-time compute and reasoning

Test-time reasoning is a major trend that emerged in the last year from OpenAI, famously replicated by DeepSeek in their DeepSeek R1 paper.

Let’s dive deeper into what this is.

In short, this builds upon previous work (such as Chain of Thought and ReAct loops) that figured out that letting the model “talk aloud” with itself before committing to an answer can greatly improve the quality of its answers, particularly on certain domains like math. This ability came spontaneously, without ever training the LLM for it. So naturally, the next step was to figure out a way to train the LLM to get better at this “thinking step”. 

We don’t know what technique OpenAI employed, but the most surprising finding of the DeepSeek R1 paper was that you don’t need a super clever setup to induce this learning. In fact, they show that it is enough to simply give the model space to think, by instructing it to fill text between a <think> token and a </think> token and requiring that text not to be empty.

This is the system prompt they used:

Once this was in place, they went into reward modeling.

Annotated excerpts from the paper

I strongly recommend reading the whole DeepSeek R1 paper, as it is very well written. Here, we quote a few sections of the paper with my own comments to provide context for a reader that’s not a specialist in this field.

Section 2.2.2: Reward modeling

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

  • Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

Comment: This is all pretty standard stuff, as we have seen. Conceptually, this is simple. Doing it in practice requires skill from the ML engineers to craft good rewards that aren’t noisy and that nudge the RL process in the direction you want.

  • Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.

Comment: Smart!!! A simple way of nudging the model to start leveraging the thinking process. Otherwise, it may not consistently explore adding thinking between these <think> and </think> tags. Adding a strong negative reward when it doesn’t do it straightens it out and constrains the exploration to using them. They stop here for R1-Zero, but you can actually keep going to more intricate reward modeling. For example, R1-Zero sometimes thinks in a mix of English and Chinese, so you could fix this by adding a prompt (and then a reward) for thinking in English. You can also add intermediate negative rewards if the thinking process is judged not “good” in some way (inconsistent, etc) to further nudge the model. You can see that this is a general paradigm…

  • We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

Comment: Maybe someone in the community will eventually figure out how to make process RMs work…

Other observations

  • The “Aha” moment:

This one is very interesting to me, as RL essentially learns to backtrack on its own. That’s pretty cool, and I honestly didn’t think that this would happen. I imagine that you can improve on this via search: beam search is the simplest, and you can go into tree search algos from there, like MCTS. Now that we have a baseline that works, I think that we are going to see faster progress on all these more complex methods. This is, IMHO, a strongly misunderstood part of this work: DeepSeek hasn’t shown that you don’t need all this compute in AI, quite the opposite! They have just shown that you don’t need complex methods to get started, and that simple methods with scale are sufficient – a lesson we keep learning in AI.

Increasing thinking time (and test-time compute) on its own:

Since they never gave the model any penalty for thinking too long, RL figures out that there is simply no downside to thinking for longer and with each passing iteration, it just keeps going in the direction of longer thinking traces. This is expected. I would imagine that if they kept going, the model would also automatically learn to stop at its max sequence length, as then it wouldn’t be able to remember its whole reasoning trace, so going further wouldn’t help (and it may harm).

Once again, the model learns that test-time compute is good, and thus we should expect compute demands to go up.

Appendix A: Diving deeper into PPO

Let’s go deeper into backprop in RL.

Let’s start from the “basic training loop” in supervised ML (no RL):

loss_fn = nn.CrossEntropyLoss()
for batch in dataset:
    x, y = batch
    y_hat = model(x)  # One forward pass only
    loss = loss_fn(y_hat, y)  # CrossEntropyLoss takes (predictions, targets)
    loss.backward()  # One backward pass

Recall that in RL we do not have a nicely defined cost function, so instead we are just making an action more likely if it was good, and less likely if it was bad. How do we do that?

Our model already outputs a probability distribution over actions, so the final layer will have its Softmax over all the actions. What we want to do is to give a positive gradient to the good actions and a negative gradient to the bad actions.

So, basically, we wanna do something like this:

model.weights.grad[good_actions] += delta
model.weights.grad[bad_actions] -= delta

Autograd can do that for us: to get a constant delta added to the gradient, we need a function that, when differentiated, gives us this delta. The answer is multiplication. So, our “cost function” is simply log_probs * per_token_reward.

Let’s stay high level and develop this gradually (the PPO loss looks quite scary otherwise):

for batch in dataloader:  # Iterate over dataset (prompts)
    prompts = batch  # Input prompts: (bsz, prompt_lens). Can be ragged, or packed.

    # Autoregressive generation! MANY forward calls, need KV cache etc.
    # (If this is not clear to you, see Appendix B.)
    # Also note: we NEED to return log_probs for all the intermediate
    # generations, and we are not detaching them from the graph as we are
    # gonna need them later.
    # responses and log_probs are of size (bsz, response_lens). Also ragged/packed.
    responses, log_probs = model.generate(prompts)

    # Get rewards (e.g., human preference or heuristic).
    # Note that rewards CAN be negative; in that case this sign will be negative.
    # This is a tensor of size (bsz,).
    sequence_rewards = get_feedback(responses)

    # per_token_reward is (bsz, response_lens). A simple way to discount is to
    # multiply tokens by a factor gamma (e.g. 0.99): the last token gets a reward
    # of 1, the second-to-last gets 1*gamma, the previous gets 1*gamma*gamma, and
    # so on. You don't want to maximize only your reward at time t but the sum of
    # rewards till the end of the episode.
    per_token_reward = discount(sequence_rewards)

    optimizer.zero_grad()  # Reset gradients

    # Manually nudge log-probs based on reward signal.
    # This is a stochastic estimator for the gradient of the reward expectation
    # given your stochastic policy - in other words: on average, the gradient of
    # that thing points to where the policy is doing good!
    adjusted_log_probs = log_probs * per_token_reward

    loss = -adjusted_log_probs.sum()  # Equivalent to maximizing probability of good actions

    loss.backward()  # Still ONE backward call! PyTorch knows what to do.

    optimizer.step()  # Update model parameters

Note that while we do multiple forwards (due to autoregressive generation), we only need one backward (well, assuming we keep the graph around; if the generations are done by vLLM or something else, then we are gonna need one more forward pass here to materialize the graph on this side, which shouldn’t be too bad…).

Now let’s make this more realistic

The above is all we are doing conceptually. Except that when you go and try it, everything will diverge 😀 

Let’s make these changes:

  1. Adaptive delta. Using a static delta is suboptimal since the magnitude of the update should be proportional to how good a choice is. Instead of manually nudging logprobs, we can let backprop do the work for us. If you want to increase probabilities, you can simply maximize log-prob. To do it in gradient descent, we minimize the negative logprob. This is the Policy Gradient Loss function:
policy_gradient_loss = - (rewards * log_probs).sum(dim=-1).mean()

policy_gradient_loss.backward()
  2. Reducing variance. If we were to train with just the above, some responses would get huge rewards and others would get zero, leading to unstable training. A way to mitigate this is to introduce a baseline, which is the expected cumulative reward for an average move: not terrible, but not great either. The intuition behind this is that the goodness of a move always depends on the available alternatives: for example, getting 1M dollars seems great until you realize that you had the option to get 100M dollars. 

This “goodness” of a move with respect to the baseline is called the advantage, a core concept in RL.

How do we predict the baseline score (also known as the value of a state)? Two ways:

  1. Value Network. Train a model to do that for you: you can train a value network to estimate what this baseline should be.
  2. Monte Carlo. Simply run a bunch of generations (often 4-5 is enough) and the average cumulative reward of all your generations is your estimate of the value.

Now our code will look like this:

for batch in dataloader:
    prompts = batch
    responses, log_probs = model.generate(prompts)

    rewards = discount(get_feedback(responses))  # Already discounted

    for n in range(epochs):
        for _prompts, _log_probs, _rewards in make_minibatches(
            prompts,
            log_probs,
            rewards,
            ):
            values = value_network(_prompts)  # Predict baseline V(s)
            optimizer.zero_grad()
            # REINFORCE with baseline
            advantages = _rewards - values  # Compute advantage estimate
            loss = - (advantages * _log_probs).sum(dim=-1).mean()  
            loss += advantages.pow(2).mean()
            loss.backward()
            optimizer.step()

Believe it or not, that is still unstable! RL is just massively unstable and finicky (though it is surprisingly well-behaved for LLMs, compared to other fields). One reason is that the distribution of rewards can have a very long tail, i.e., your gradients do not behave very nicely. So, in practice we really really gotta make sure that updates are well-constrained so that training behaves nicely.

What PPO does is basically trying to enforce stability by cramming four different constraints:

1. Standardize advantages

The code above multiplies by the raw advantage which, despite our best efforts at baselining, can still lead to huge updates that will destabilize training. One way to make things well-behaved is to simply shift them to zero mean and scale them to unit variance.

advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

2. Importance Sampling

That is still not enough. To improve further, we are gonna enforce that updates stay constrained within a trust region. PPO does this by preventing updates that are too large relative to the old policy (so yes, it needs to keep around the model at time t-1):

So now we do this instead:

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

safe_advantages = importance_sampling_ratio * advantages

3. Clipping large updates

Oldest trick in the book. If you risk having too big a gradient, simply clip it.

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

clipped_sampling_ratio = torch.clamp(importance_sampling_ratio, 1 - epsilon, 1 + epsilon)

even_safer_advantages = clipped_sampling_ratio * advantages

4. Taking a min()

This one was the most surprising to me. Instead of just using the clipped advantages, you actually want to take a min() between the clipped and unclipped versions. The reason is subtle: RL will try to maximize its rewards, and if you only provide the clipped reward, it will push rewards to be as close to the clipping threshold as possible, which still destabilizes the whole process (more details here).

old_log_probs = get_old_log_probs(prompts, responses).detach()  # Log probs from previous policy

importance_sampling_ratio = torch.exp(log_probs - old_log_probs)

clipped_sampling_ratio = torch.clamp(importance_sampling_ratio, 1 - epsilon, 1 + epsilon)

safe_advantages_final_final = torch.min(importance_sampling_ratio * advantages,
                                        clipped_sampling_ratio * advantages)

Putting it all together:

epsilon = 0.2  # Clipping threshold

for batch in dataloader:
    prompts = batch
    # prev_log_probs are part of the loss - non-differentiable
    with torch.no_grad():
        responses, prev_log_probs = model.generate(prompts)

    rewards = discount(get_feedback(responses))
    values = value_network(prompts)

    advantages = rewards - values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # ref_log_probs are used for regularization
    with torch.no_grad():
        ref_log_probs = get_old_log_probs(prompts, responses).detach()

    for n in range(epochs):
        for batch in make_minibatches(...):
            log_probs = model(batch.prompts)[1]
            ratio = torch.exp(log_probs - batch.prev_log_probs)  # Importance ratio
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

            # The min() function prevents the model from "gaming" the clipped update
            loss = -torch.min(ratio * batch.advantages, clipped_ratio * batch.advantages).mean()
            # + add value_network loss, ref model regularization and entropy boost...

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

If you find it written as an equation, hopefully it won’t look as scary now!

In PPO, there are two more losses. 

The value loss is used to co-train the value network as you train the policy network. Simply put, you can use the actual gains you got in your various generations to keep training the value network, so that’s simply an MSE loss between them:
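
Schematically, with $V_\theta(s_t)$ the value prediction and $R_t$ the observed discounted return:

$$L_{\text{value}} = \mathbb{E}_t\left[\big(V_\theta(s_t) - R_t\big)^2\right]$$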

And finally, as yet another guardrail, we are also going to prevent RL from changing the weights of the model too much: after all, it costs millions of dollars of compute to teach the model about the world in pre-training and we don’t want RL to diverge too much from those. 

A simple way to do it is to simply add a KL divergence loss term.

The final PPO loss is this one:
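
One way to write it, putting together the clipped policy term, the value loss, and the KL regularizer discussed above (with $r_t(\theta)$ the importance ratio and $\hat{A}_t$ the standardized advantage):

$$L_{\text{PPO}}(\theta) = -\,\mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] + c_1\, L_{\text{value}} + c_2\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$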

Where c1 and c2 are hyperparameters that you set experimentally to balance these terms (normally they are small).

DeepSeek’s GRPO

Now you have all the ingredients to be able to open a paper and read scary formulas. This is the formula for DeepSeek’s GRPO (taken from their paper):

Monte Carlo-based advantage estimation

The “innovation” of GRPO (taking a Monte Carlo sample of rewards) is actually an old trick – people did that before they had value networks! Value networks are supposed to be more stable but, as you can imagine, way more expensive. There’s a bit of discussion in the community about what to do here; it’s not set in stone that critic networks won’t rise from their ashes soon!

How expensive is this thing to run?

To sum things up, in the worst-case setting, we have to run the following models sequentially (I’m just showing the boiled-down network ops):

1. Run the inference model, get tokens and log-probs_inference (1 forward)

2. Run the reference model given the generated tokens, get log-probs_reference (1 forward)

3. Run the reward model given the generated tokens, get the reward (1 forward)

4. (Run the critic to get the value → advantage estimate) (1 forward)

5. Run the in-graph model copy given a batch of tokens, get log-probs_train

6. Compute lp0 = f(log-probs_train, log-probs_inference, advantage) + c1 · L(log-probs_train, log-probs_reference)

7. Backward lp0, Adam step on the model weights

8. Compute lp1 = c2 · L(value, advantage)

9. Backward lp1, Adam step on the critic network weights

Removing the critic (as in GRPO) removes the cost of step 4 (1 forward) and steps 8-9 (1 backward + optim.step()), which spares you a considerable amount of memory and compute.

Appendix B: Why Generation is more expensive than processing a prompt

If you look at prices for LLM cloud providers, you’ll notice that they always charge a lot more for output tokens than for input tokens, e.g., GPT-5 costs $1.25 / 1M input tokens but $10.00 / 1M output tokens. Why is that?

The reason is that autoregressive generation is way more expensive than processing text that’s already been written! This is due to how Transformers work. They always need to consume the whole sequence, so to process a block of text you need just a single forward. To generate new text, you need to run a forward for each token you generate: you take the sequence with the new word you just generated, feed it in again to get the next word, and so on. This is hugely expensive and is mitigated by the KV Cache.

Note that this is different from RNNs like LSTMs where if you had a sequence of length L and you wanted to process one more token, the forward would just ingest that single token and reuse the state that it had. Unlike LSTMs, Transformers are stateless so you need to ingest the whole sequence, and do one more forward with L+1 tokens!

Luckily, a lot of computation gets recycled, so the KV cache will greatly mitigate this problem (otherwise, we honestly would not be able to serve these models in production), but it is still an issue.

Let’s look at an example.

You have a prompt:

<system>You are a nice LLM, be kind</system>

Then the user writes:

<user>What's the capital of France?</user>

The model will therefore receive this in input to start generating:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>"

All of the above can be processed with a single forward, because all the tokens are already there.

Now, you generate one token:

The

To generate the next token, you need to do another forward with the whole sequence again!!! Now, the model needs to ingest:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>The"

And it will generate capital (it will actually generate a space, but… you know… let’s speed this up). Now, again:

"<system>You are a nice LLM, be kind</system><user>What's the capital of France?</user><agent>The capital"

You can see how expensive this is…
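
A toy sketch makes the cost explicit (a greedy loop with no KV cache; model here is assumed to return logits of shape (batch, seq_len, vocab_size)):

import torch

def naive_generate(model, input_ids, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(input_ids)                 # forward over ALL tokens so far
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

Each iteration re-feeds the whole sequence, which is exactly the waste the KV cache exists to avoid.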

DRAMA Model Inference Efficiency Boosted by 1.7x-2.3x

TL;DR

NJTs (Nested Jagged Tensors) boost DRAMA model inference efficiency by 1.7x-2.3x, making it more production-ready in the category of LLM-based encoders, especially with variable-length sequences.

Introduction and Context

Recent advancements in Large Language Model (LLM) based encoders have shown promising results, with many models topping the evaluations leaderboard. However, the challenge lies in productionizing these complex models, which often require significant computational resources and infrastructure.

To tackle the challenge of optimizing LLaMA-based encoders, we have chosen to explore DRAMA, a dense retrieval model that leverages a pruned LLaMA backbone. The DRAMA model overall shows good performance across various versions, including base (0.1B), large (0.3B), and 1B. Specifically, DRAMA-base stands out due to its strong performance in both English and multilingual retrieval tasks, despite its compact size of 0.1B non-embedding parameters. Its quality makes it an attractive option for clients. However, the high cost associated with its implementation posed a barrier to widespread adoption. To address this challenge, we explore the use of Nested Tensors to optimize the model further to make it a viable solution for production environments. 

By leveraging Nested tensors, we have observed a substantial improvement in inference efficiency for the DRAMA model, with gains ranging from 1.7 to 2.3 times greater efficiency. This breakthrough has significant implications for the deployment of LLM-based encoders in real-world applications.

What are NJTs

Sample packing in torchtune, ragged tensors in TensorFlow, unpadding in ModernBERT, and nested tensors in PyTorch each tackle the challenge of variable-length sequence data, but with differing approaches. While all aim to streamline sequence modeling, their abstractions and performance impact vary by framework and use case.

PyTorch’s nested tensors are a tensor subclass that offers a unified interface for handling ragged-shaped data through an efficient packed internal representation.

There are two types of nested tensors in PyTorch, distinguished by their construction layout: `torch.strided` or `torch.jagged`. It is recommended to use jagged-layout nested tensors (NJTs), and that is what this blog focuses on as well. It’s worth noting that, because they are implemented fully in Python, NJTs have some amount of eager overhead, which is more visible on smaller input sizes. It is recommended to compile NJTs when possible to eliminate this overhead and also gain performance through operator fusion.

An NJT can be created by passing a list of tensors to `torch.nested.nested_tensor` with the `layout=torch.jagged` argument. This copies the inputs into a packed, contiguous memory block. NJTs currently support a single ragged dimension.
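
For example (a small illustrative snippet, assuming a recent PyTorch build with NJT support):

import torch

# three variable-length sequences sharing an embedding dim of 8
seqs = [torch.randn(3, 8), torch.randn(7, 8), torch.randn(1, 8)]
njt = torch.nested.nested_tensor(seqs, layout=torch.jagged)

print(njt.is_nested)                # True
padded = njt.to_padded_tensor(0.0)  # materialize a dense, zero-padded (3, 7, 8) copy
print(padded.shape)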

Model deployments benefit from Nested Tensors when they typically perform inference on large batches of sequences with varying lengths. Given such a query pattern, inference with regular tensors requires that all sequences in the batch be padded to the same length, which is particularly wasteful when the batch consists of many short sequences and a single long sequence. In contrast, Nested Tensors avoid wasting compute on these extra pad tokens by natively supporting operations on batches of varying sequence length.

Dense vs Jagged

As anticipated, NJT demonstrated substantially higher throughput on inputs with uneven sequence lengths compared to padded tensors. In the plot below, we evaluated QPS on synthetic data with various sequence length patterns: (1) “dense” batches where every sequence is of length 256, (2) “linear” batches where the sequence lengths in the batch increase linearly from 1 to 256, and (3) “outlier” batches where one sequence is of length 256, and the remaining sequences are of length 1. The inference cost remains constant in all three cases when using padded tensors, whereas the inference cost with NJT decreases as batch sparsity increases. On the “linear” distribution, NJT outperforms padded tensors by approximately 1.85x.

Implementation

The following code modifications are needed to apply NJTs to the LLaMA model, mainly in two key components: the transform and attention.

Transform

Convert the token IDs into jagged token IDs and set the attention mask to None, since no mask is needed when there is no padding.

jagged_input_ids = torch.nested.nested_tensor(
    tokenizer_output.input_ids, layout=torch.jagged
)
attention_mask = None

LlamaSdpaAttention

  1. Llama 3 introduces Grouped Query Attention (GQA), which is characterized by having more attention heads than key-value heads (num_attention_heads > num_key_value_heads). To ensure compatibility during the attention process, the repeat_kv function plays a key role; its main job is to efficiently replicate key-value heads across query heads. This operation reshapes tensors from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).

To better handle jagged and dense tensor formats, the original repeat_kv function has been split into two specialized functions:

        • repeat_dense_kv: Used for dense tensors, this function is the same as the original repeat_kv.
        • repeat_jagged_kv: Tailored for jagged tensors, which come with ragged_idx indices that add complexity. This method uses a sequence of transpose and flatten operations: by temporarily altering the dimension order before flattening and then transposing back, it effectively navigates the unique challenges presented by jagged tensors.
def repeat_jagged_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep).
    The hidden states go from (batch, num_key_value_heads, seqlen, head_dim)
    to (batch, num_attention_heads, seqlen, head_dim).
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    expand_shape = (batch, num_key_value_heads, -1, n_rep, head_dim)
    if n_rep == 1:
        return hidden_states
    hidden_states = (
        hidden_states.unsqueeze(3)
        .expand(expand_shape)
        .transpose(1, 2)
        .flatten(2, 3)
        .transpose(1, 2)
    )
    return hidden_states


def repeat_dense_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep).
    The hidden states go from (batch, num_key_value_heads, seqlen, head_dim)
    to (batch, num_attention_heads, seqlen, head_dim).
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_key_value_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

2. When applying Rotary Position Embedding (RoPE) to query and key tensors, we need to handle two different tensor formats: jagged and dense. To accommodate this, we implemented two separate functions, each tailored to the specific tensor type. The main function, apply_rotary_pos_emb(), acts as a router that directs the input to either _jagged_tensor_forward or _dense_tensor_forward based on whether the tensor is nested.

For jagged tensors, the process involves three key steps: first, converting the jagged tensor into a dense tensor using q.to_padded_tensor(0.0); second, applying the rotary position embedding on this dense representation; and finally, converting the dense tensor back into its original jagged format with convert_dense_to_jagged.

def apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    unsqueeze_dim: int = 1,
) -> Tuple[torch.Tensor, torch.Tensor]:
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    if q.is_nested and k.is_nested:
        if q.layout != torch.jagged:
            raise NotImplementedError(f"Unsupported layout: {q.layout}")
        if k.layout != torch.jagged:
            raise NotImplementedError(f"Unsupported layout: {k.layout}")
        return _jagged_tensor_forward(q, k, cos, sin)
    else:
        return _dense_tensor_forward(q, k, cos, sin)

def _jagged_tensor_forward(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    q_dense = q.to_padded_tensor(0.0) 
    k_dense = k.to_padded_tensor(0.0)
    q_dense_embed = (q_dense * cos) + (rotate_half(q_dense) * sin)
    k_dense_embed = (k_dense * cos) + (rotate_half(k_dense) * sin)
    q_jagged_embed = convert_dense_to_jagged(q, q_dense_embed)
    k_jagged_embed = convert_dense_to_jagged(k, k_dense_embed)
    return q_jagged_embed, k_jagged_embed

def _dense_tensor_forward(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def convert_dense_to_jagged(nested_q: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    padded_max_S = nested_q._get_max_seqlen()
    total_L = nested_q._values.shape[nested_q._ragged_idx - 1]
    if padded_max_S is None:
        # use upper bound on max seqlen if it's not present
        padded_max_S = total_L

    # convert dense tensor -> jagged
    q = q.expand(
        [
            x if i != nested_q._ragged_idx else padded_max_S
            for i, x in enumerate(q.shape)
        ]
    )
    nested_result = nested_from_padded(
        q,
        offsets=nested_q._offsets,  
        ragged_idx=nested_q._ragged_idx,
        sum_S=total_L,
        min_seqlen=nested_q._get_min_seqlen(),  
        max_seqlen=padded_max_S,
    )
    return nested_result

We added an implementation of the DRAMA model with NJTs: modeling_drama_nested.py

Acknowledgement 

We would like to thank Xilun Chen for helpful feedback during code review, and Don Husa, Jeffrey Wan, Joel Schlosser, and Fernando Hernandez for helpful feedback on the blog.

Conclusion

This optimization using NJTs significantly enhances the efficiency of DRAMA (LLaMA-based encoders), making these models more practical for real-world deployment. By reducing computational overhead, particularly for variable-length sequences, this approach paves the way for broader adoption of high-performing LLM-based encoders in production environments. Note that NJT is considered feature-complete in PyTorch: new features are not being actively added, but community contributions are welcome.

Read More

ZenFlow: Stall-Free Offloading Engine for LLM Training

ZenFlow: Stall-Free Offloading Engine for LLM Training

Introduction

ZenFlow is a new extension to DeepSpeed introduced in summer 2025, designed as a stall-free offloading engine for large language model (LLM) training. Offloading is a widely used technique to mitigate the GPU memory pressure caused by ever-growing LLM sizes. However, as the CPU–GPU performance gap has widened by several orders of magnitude in recent years, traditional offloading frameworks like DeepSpeed ZeRO-Offload often suffer from severe GPU stalls because optimizer computation is offloaded to slower CPUs.

We are excited to release ZenFlow, which decouples GPU and CPU updates with importance-aware pipelining. By fully overlapping CPU work and PCIe transfers with GPU computation, we see more than 85% stall reduction and up to a 5x speedup. This means we can enjoy the memory benefits of offloading without sacrificing training speed to slower hardware.

Figure 1: ZenFlow is DeepSpeed’s stall-free offloading engine for LLM training. It decouples GPU and CPU updates by prioritizing important gradients for immediate GPU updates and deferring the rest for asynchronous CPU-side accumulation. By fully overlapping CPU work and PCIe transfers with GPU computation, ZenFlow eliminates stalls and achieves high hardware utilization across both single-GPU and multi-GPU settings.

Figure 2: ZeRO-Offload causes repeated GPU stalls due to blocking CPU updates and PCIe transfers, leading to >60% idle time per step when training Llama 2-7B on 4× A100s.

Offloading has become a standard approach to scale fine-tuning of large language models (LLMs) beyond GPU memory limits. Frameworks like ZeRO-Offload reduce GPU memory usage by pushing gradients and optimizer states to the CPU. However, they also create a new bottleneck: expensive GPUs often sit idle, waiting on slow CPU updates and PCIe data transfers. In practice, enabling offloading when training Llama 2-7B on 4× A100 GPUs can inflate each step from 0.5s to over 7s—a 14× slowdown.

Figure 3: In ZeRO-Offload, CPU-side optimizer updates and PCIe transfers dominate iteration time, leaving the GPU idle for over 5 seconds.

ZenFlow addresses this bottleneck with a stall-free training pipeline. It prioritizes high-impact gradients for immediate GPU updates, while offloading the rest to the CPU and applying them asynchronously. These deferred CPU updates are fully overlapped with GPU compute, eliminating stalls and significantly improving throughput. Best of all, ZenFlow maintains the same model accuracy and integrates seamlessly with DeepSpeed.

ZenFlow at a Glance

  • Zero GPU stalls: Top-k important gradients are updated immediately on GPU; low-priority gradients are asynchronously processed on CPU—no GPU wait time.
  • Asynchronous and bounded: ZenFlow decouples CPU and GPU execution with a bounded-staleness strategy that preserves convergence.
  • Auto-tuned: ZenFlow adapts update intervals at runtime based on gradient dynamics—no need to tune manually.

ZenFlow Highlights

ZenFlow is the first offloading framework to offer a bounded-asynchronous update scheme that preserves convergence while delivering up to 5× end-to-end speed-up over ZeRO-Offload.

Performance

| Feature | Benefit |
|---|---|
| Up to 5× end-to-end speed-up over ZeRO-Offload and 6.3× over ZeRO-Infinity | Faster time-to-convergence |
| > 85% reduction in GPU stalls on A100 / H100 nodes | Keeps GPUs busy, higher utilization |
| ≈ 2× lower PCIe traffic (1.13× model size per step vs. 2× in ZeRO) | Less bandwidth pressure on clusters |
| Maintains or improves accuracy on GLUE (OPT-350M → Llama-13B) | No accuracy loss |
| Lightweight gradient selection (6000× cheaper than full AllGather) | Scales to multi-GPU settings without memory footprint spikes |
| Auto-tuning (Zen-auto) automatically adapts update interval on the fly | No manual knob tuning |

For more detailed performance results, please refer to our arXiv paper.

Design Motivation

Training large models with offloading can save GPU memory, but often at the cost of performance. In this section, we briefly discuss three topics. First, we explain why coupling CPU-side optimizer updates with GPU compute leads to severe GPU stalls during LLM fine-tuning. Next, we quantify how full-gradient offloading saturates the limited PCIe bandwidth on A100/H100 servers, inflating iteration time. Finally, we reveal the highly skewed importance distribution of gradients, showing that uniformly updating all parameters in GPUs at the same time is wasteful and unnecessary.

Offloading-Induced GPU Stalls

Figure 4:  CPU updates dominate step time, causing >60% GPU idle due to poor overlap with computation.

Synchronous offloading frameworks (e.g., ZeRO-Offload) keep the GPU idle while the CPU performs a full optimizer step and transfers updated parameters back to GPU. For Llama-2-7B with 4× A100, the CPU path can take longer than 4s while the backward pass takes approximately 2s, so over 60% of each iteration is pure GPU wait time. Eliminating this serialization is essential for achieving high GPU utilization.

Bandwidth Bottlenecks

A single training step moves a full copy of the model gradients from GPU to CPU and a full copy of the model parameters back, i.e., 2× model size of PCIe traffic per step. Even on PCIe 4.0 (≈ 32 GB/s), Llama-2-13B pushes ~40 GB per iteration, adding > 1s of transfer latency.

Unequal Gradient Importance

Not all gradients matter equally. Our analysis shows that the top 1% of gradient channels contribute over 90% of the ℓ²-norm energy during fine-tuning. In other words, most updates have little impact on model learning, yet still incur disproportionately high compute and I/O costs in traditional offloading pipelines.

This skew in gradient importance opens the door to a better design: update critical gradients on GPU right away, and defer the rest for asynchronously batched, lower-priority updates on CPU. ZenFlow turns this idea into a principled, efficient training engine.

Figure 5: Top 1% of gradients may contribute over 85% of gradient norms.

ZenFlow Design

ZenFlow is designed around three key ideas that separate critical and non-critical gradient updates while minimizing communication bottlenecks. Here’s how we break the tight coupling between GPU and CPU computation to create a stall-free pipeline.

Idea 1: Importance-Aware Top-k Gradient Update

Not all gradients are equally impactful for training. ZenFlow introduces an importance-aware design that prioritizes updates for the top-k most significant gradients. These gradients are updated directly on the GPU, using its high compute bandwidth. This approach allows us to reduce the size of the per-step gradient update by nearly 50%, cutting down the communication load by around 2×.

For the rest of the gradients, which contribute less to the model’s learning, ZenFlow batches them and performs asynchronous updates on the CPU. These updates are deferred until they are sufficiently accumulated, thereby reducing the impact on training speed.

Idea 2: Bounded-Asynchronous CPU Accumulation

ZenFlow’s asynchronous accumulation allows the CPU to stay busy while the GPU performs other computations. We apply an accumulation window for the non-critical gradients, allowing them to accumulate over several iterations before updating. This gives ZenFlow the ability to process multiple rounds of gradient updates concurrently, eliminating idle time typically spent waiting for the CPU optimizer.

By carefully coordinating CPU updates with GPU execution, ZenFlow fully hides CPU execution behind GPU computation—ensuring that GPUs remain actively utilized, avoiding stalls, and maximizing hardware efficiency.
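As a toy illustration of Ideas 1 and 2 together (a single-process simplification of ours, not the DeepSpeed implementation, which runs the deferred update asynchronously on the CPU), critical entries are applied immediately while the rest accumulate and are flushed every few steps:

import torch

def split_update(param, grad, critical_mask, cpu_accum, step, lr=1e-4, window=4):
    flat = grad.flatten()
    # critical entries: updated immediately on the GPU
    param.data.view(-1)[critical_mask] -= lr * flat[critical_mask]
    # non-critical entries: accumulated on the CPU, flushed every `window` steps
    cpu_accum += flat.masked_fill(critical_mask, 0.0).cpu()
    if (step + 1) % window == 0:
        param.data.view(-1) -= lr * cpu_accum.to(param.device)
        cpu_accum.zero_()

# usage sketch: mark the largest-magnitude entries as critical, e.g.
#   critical_mask = torch.zeros(p.grad.numel(), dtype=torch.bool, device=p.grad.device)
#   critical_mask[torch.topk(p.grad.abs().flatten(), k).indices] = True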

Idea 3: Lightweight Gradient Selection

A key challenge in distributed training is selecting important gradients without introducing prohibitive communication and GPU memory costs. Traditional systems rely on global synchronization (via AllGather) to gather full gradients, which can become a major bottleneck in multi-GPU settings.

ZenFlow solves this with a lightweight gradient proxy: instead of transferring full gradients, ZenFlow uses a per-column gradient norm to approximate the importance of each gradient. By computing a compact summary of per-column gradients (e.g., squared norms), ZenFlow reduces communication volume by more than 4,000×—with nearly no loss in accuracy.

This approach allows ZenFlow to scale efficiently across GPUs, without high memory or communication overhead, and it supports dynamic gradient selection as the model evolves.
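A minimal sketch of such a proxy (illustrative only, not the DeepSpeed code): rank the columns of a weight gradient by their squared norms and keep the top ~1% as the critical set.

import torch

def critical_columns(grad: torch.Tensor, frac: float = 0.01) -> torch.Tensor:
    # grad: (out_features, in_features); one squared norm per column is a compact
    # summary that avoids gathering the full gradient across GPUs
    col_scores = grad.pow(2).sum(dim=0)
    k = max(1, int(frac * col_scores.numel()))
    return torch.topk(col_scores, k).indices

g = torch.randn(4096, 11008)
idx = critical_columns(g)  # indices of the most important columns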

Putting It All Together: ZenFlow’s Zero-Stall Pipeline

Figure 6: ZenFlow’s stall-free pipeline overlaps CPU updates and transfers with multi-step GPU compute.

  1. Forward/Backward Pass on GPU: ZenFlow processes the forward and backward passes on the GPU, immediately updating the top-k gradients on the GPU without waiting for the CPU.
  2. Gradient Transfer to CPU: While the GPU is busy, gradients from the current iteration (or previous ones) are transferred to the CPU over a dedicated PCIe stream. This is done in parallel with GPU computation, without causing any GPU wait time.
  3. CPU Update: Once a batch of non-critical gradients has accumulated, the CPU performs the update asynchronously. This update typically spans multiple GPU iterations, but is hidden behind GPU work, making it virtually invisible to the overall pipeline.
  4. Double Buffering: ZenFlow uses double buffering to manage the newly updated gradients. When the CPU update is complete, the new parameters are transferred back to the GPU. The swap is as fast as a pointer flip—no need to reload the entire model or re-launch the kernel.

By constantly overlapping GPU computation with CPU-side work, ZenFlow transforms the traditional compute → wait → update cycle into a continuous, stall-free pipeline.

Getting Started: Try out DeepSpeed-ZenFlow

To try out DeepSpeed-ZenFlow, please refer to the ZenFlow example in our DeepSpeedExamples repo and the ZenFlow tutorial in DeepSpeed.

Citation

@article{lan2025zenflow,
  title   = {ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
  author  = {Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
  journal = {arXiv preprint arXiv:2505.12242},
  year    = {2025}
}

Acknowledgements

This work is the result of a close collaboration between University of Virginia (UVA), University of California, Merced (UC Merced), Argonne National Laboratory (ANL), and DeepSpeed team.

The contributors include Tingfeng Lan, Yusen Wu, Zhaoyuan Su, Rui Yang, and Yue Cheng from UVA; Bin Ma and Dong Li from UC Merced; Tekin Bicer from ANL; and Olatunji Ruwase and Masahiro Tanaka from the DeepSpeed team. We especially thank Olatunji Ruwase and Masahiro Tanaka for their early feedback and insightful discussions, and the open source community for its support.

Read More

Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel

Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel

In this post, we present an optimized Triton BF16 Grouped GEMM kernel for running training and inference on Mixture-of-Experts (MoE) models, such as DeepSeekv3.

A Grouped GEMM applies independent GEMMs to several slices (groups) of an input tensor in a single kernel call. In a baseline PyTorch implementation, these GEMMs would be carried out in a for-loop over the groups, with one kernel launch per iteration.
Our kernel achieves up to 2.62x speedup over the manual PyTorch loop implementation on NVIDIA H100 GPUs when used in DeepSeekv3 training. We discuss the Triton kernel optimization techniques we leveraged and showcase end-to-end results.

16B DeepSeekv3 TPS throughput on 8x NVIDIA H100 with FSDP2 

Triton Kernel Grouped Gemm vs PyTorch manual looping Group GEMM (1.42x-2.62x Speedup)

Background

GEMM (General Matrix Multiplication) is a fundamental primitive in LLM workloads. When an input activation matrix is multiplied by a weight matrix, a GEMM is being performed. In modern deep learning based architectures, GEMMs dominate FLOP counts, so their efficiency often defines end-to-end model speed. 

In Mixture-of-Expert (MoE) models, tokens are dynamically routed to different experts which results in many independent GEMMs. A Grouped GEMM executes multiple smaller GEMMs together in one kernel launch. Instead of treating each expert or layer as a separate GEMM, we batch them, which reduces launch overhead and improves GPU utilization.

Figure 1. Example GEMM problem with 3 experts

To illustrate this, we can imagine a toy scenario where we have 3 expert weights and a varying number of tokens being routed to each expert, so the activations are of different sizes. We can construct these 3 matrix multiplications of varying sizes into a single Grouped GEMM problem, which allows us to calculate the output matrices C1, C2, and C3 in a single kernel launch.
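For reference, the manual-loop baseline we compare against looks roughly like the sketch below (shapes are illustrative); each iteration is its own matmul and, in eager PyTorch, its own kernel launch:

import torch

def grouped_gemm_loop(activations, weights):
    # activations[i]: (tokens_i, K) tokens routed to expert i; weights[i]: (K, N)
    # one matmul (and one kernel launch in eager mode) per expert
    return [a @ w for a, w in zip(activations, weights)]

K, N = 2048, 7168  # illustrative shapes
weights = [torch.randn(K, N, dtype=torch.bfloat16) for _ in range(3)]
acts = [torch.randn(t, K, dtype=torch.bfloat16) for t in (128, 512, 64)]  # ragged token counts
C1, C2, C3 = grouped_gemm_loop(acts, weights)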

Optimization 1: Persistent Kernel Design

Nvidia GPUs have streaming multiprocessor units (SMs) that contain specialized hardware units to perform load, store, and compute operations. SM utilization is key to kernel performance. Thus, when implementing parallel algorithms such as Grouped Matrix-Multiplication using the Triton programming language, a key consideration is the work decomposition across SMs.

In a naive work division, a new threadblock (CTA) would be launched for every tile of work. In contrast, persistent kernels keep CTAs “alive” and dynamically feed them new tiles until the entire GEMM is complete. This avoids launch overhead, improves cache reuse, and reduces scheduling imbalance, which can lead to an effect known as wave quantization. Wave quantization is an inefficiency that occurs when the number of output tiles are not evenly divisible by the number of GPU SMs which leads to low utilization. This Colfax post provides a deep dive into the topic.

We build on this idea by applying the persistent kernel strategy in our Group GEMM kernel. In training and prefill workloads for MoE models, the matrix multiplication problem sizes are large. Thus, in naive work decomposition, a large number of threadblocks need to be scheduled to compute the output matrix, which would result in multiple “waves” of work being done. Instead, with our persistent kernel design, we can compute the entire matrix multiplication in a single wave of work by making two key changes in our Triton kernel, as discussed in the code snippets below.

First, we set the kernel grid to be equal to the number of SMs on the H100 GPU, 132.

grid = (NUM_SMS, 1, 1)  # host code

Next, we change the outer for-loop structure to:

for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):  # device code

We launch one Triton program per SM, so all the Triton programs fit in a single wave with none waiting in the queue. Inside the kernel, each program loops over its share of tiles, fetching new work until all tiles are computed. This design keeps Triton programs alive on the SMs, eliminating repeated launches and making the GEMM a single continuous wave of work.
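To make this concrete, here is a minimal, illustrative persistent-loop skeleton (our simplified sketch, not the full Grouped GEMM kernel; out_ptr and the tile bookkeeping are placeholders): one Triton program per SM, each striding over its share of output tiles.

import triton
import triton.language as tl

@triton.jit
def persistent_skeleton(out_ptr, num_tiles, NUM_SMS: tl.constexpr):
    start_pid = tl.program_id(0)
    tiles_done = 0
    # each program strides over tiles: start_pid, start_pid + NUM_SMS, ...
    for tile_id in tl.range(start_pid, num_tiles, NUM_SMS):
        # in the real kernel: decode tile_id into (group, tile_m, tile_n),
        # load the A/B tiles, accumulate with tl.dot, and store the C tile
        tiles_done += 1
    # record how many tiles this program handled (placeholder for the real work)
    tl.store(out_ptr + start_pid, tiles_done)

# launched with grid = (NUM_SMS, 1, 1), as shown above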

Optimization 2: Grouped Launch Ordering

An important consideration for kernel speed is cache performance. In Triton, the programmer controls the order in which the output tiles are computed, and thus, we can optimize L2 Cache performance at the kernel level. We experimented with both linear tile ordering (row major) and grouped launch ordering schedules. To illustrate the difference between these two approaches, we can examine the following toy matrix multiplication example, where A and B are the input matrices and C is the output matrix. 

Figure 2. Row-Major Schedule

In the row-major traversal across the output C matrix, we move quickly across the columns of the B matrix and C(0,0) -> C(0,1) -> C(0,2) before moving to the next row, C(1,0). This means that B tiles will only be re-visited after cycling through an entire row of C, by which time the data may have been evicted.

Figure 3. Grouped Launch Schedule with Group Size = 2

In the grouped launch schedule, we hold a band of rows from the A matrix in cache (two rows in Figure 3) and traverse column-major across the output C matrix, computing C(0,0) -> C(1,0) -> … -> C(GROUP_SIZE_M, 0) before moving to the next column and computing C(0,1) -> C(1,1), and so on.

The net effect is that the grouped launch schedule increases cache performance for both A and B matrices. Consecutive Triton programs (CTAs) reuse the same B tile in quick succession while keeping a band of A rows in cache.

Figure 4. L2 Cache Gain for Grouped Launch Order vs Linear Launch Order

num_groups, m, k, n = 8, 4096, 2048, 7168

For the problem sizes we tested, the grouped launch ordering proved more performant in terms of data reuse and latency. From Figure 4 above, we note a 1.33x speedup and a +60% L2 cache hit rate with the optimized schedule.

The main benefit of using the grouped launch schedule in our Group GEMM kernel is that it enforces temporal locality as exemplified in the illustrations above. This is achieved by re-ordering the launch order of programs so that tiles of the GEMM problem are computed in an order that allows for better reuse of the input activation and the expert weights, improving L2 cache hit rates, increasing arithmetic intensity, and thus reducing kernel latency.
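For readers who want the index math, the following host-side sketch shows the standard grouped ("swizzled") ordering we describe; in the kernel the same arithmetic is applied to tile_id, and GROUP_SIZE_M controls the band of C rows kept hot in cache:

def grouped_tile_order(tile_id: int, num_pid_m: int, num_pid_n: int, GROUP_SIZE_M: int = 2):
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = tile_id // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + (tile_id % num_pid_in_group) % group_size_m
    pid_n = (tile_id % num_pid_in_group) // group_size_m
    return pid_m, pid_n

# on a 4x4 tile grid, the first tiles visit C(0,0), C(1,0), C(0,1), C(1,1), ...
order = [grouped_tile_order(t, 4, 4) for t in range(8)]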

Optimization 3: Tensor Memory Accelerator (TMA) utilization for Expert Weights

The TMA unit on NVIDIA Hopper GPUs is a dedicated hardware unit for load/store operations that operate on tensors. The benefit of leveraging the TMA unit in our kernel design is it can free up SM resources such as registers and CUDA cores while data is being moved from global to shared memory. To learn more about TMA usage in Triton, see our previous deep dive on this topic.

However, there is a caveat due to the special use case of this kernel. Typically, a TMA descriptor containing tensor metadata is created on the host and then passed to the kernel. 

For MoE models, a modified approach is needed since the chosen expert is not known ahead of time. Instead, it is determined at runtime, creating a data-dependent access into the expert weight matrix. This type of access is possible in Triton by dynamically creating a local TMA descriptor based on the chosen expert index. We walk through the code below on how to build a TMA 2D descriptor on the device for the chosen expert, and then how to use it to issue TMA loads.

First, we pre-allocate a chunk of GPU memory, workspace, on the host:

# host code
workspace = torch.empty(
    NUM_SMS * desc_helper.tma_size,
    device=x.device,
    dtype=torch.uint8,
)

The size of the memory we are reserving is equal to the size in bytes of a single TMA descriptor, desc_helper.tma_size, multiplied by the number of persistent Triton programs we are launching, NUM_SMS. This ensures that each Triton program has space to write its own TMA descriptor.

# device code
expert_desc_ptr_tile = workspace + start_pid * TMA_SIZE
tl.extra.cuda.experimental_device_tensormap_create2d(
    desc_ptr=expert_desc_ptr_tile,
    global_address=b_ptr + expert_idx * N * K + n_start * K,
    load_size=[BLOCK_SIZE_N, BLOCK_SIZE_K],
    global_size=[NUM_EXPERTS * N, K],
    element_ty=tl.bfloat16,
)

tl.extra.cuda.experimental_tensormap_fenceproxy_acquire(expert_desc_ptr_tile)

expert_weight = tl._experimental_descriptor_load(
    expert_desc_ptr_tile,
    [0, k_offset],
    [BLOCK_SIZE_N, BLOCK_SIZE_K],
    tl.bfloat16,
)

In the Triton code, each Triton program first creates a private slot in workspace to hold its expert descriptor. Next, we create a 2D tensor map that points to the routed expert tile by passing the expert's metadata. Then, we explicitly call a proxy fence, which is required to synchronize memory operations between two different proxies, the SM and the TMA engine. In our kernel, every time a new expert_idx is selected, the SM writes a new TMA descriptor to global memory. The fence guarantees that the new TMA descriptor is globally visible before the TMA engine issues a load instruction, ensuring we do not read stale or incorrect data.

Now that the TMA descriptors are constructed dynamically based on the chosen expert_idx, each Triton program in the Grouped GEMM kernel can target its TMA load to the routed expert weight.

Microbenchmarks

We benchmarked our Hopper-optimized kernel against a baseline Triton Group GEMM kernel that does not contain the optimizations we discussed to isolate the gain from these techniques.

Figure 5. Triton Group GEMM Kernel TFLOPs Comparison (Higher is Better)

Figure 6. Kernel Latency Comparison with Speedup over Baseline Triton Kernel

By leveraging a persistent kernel design, grouped launch tile ordering, and the Hopper TMA unit, our kernel achieves up to 1.50x speedup over the baseline Triton kernel. 

End-to-End Benchmarks

We integrated our kernel into torchtitan to create an end-to-end test in which we train a 16B parameter flavor of DeepSeekv3 using FSDP2 across 8x H100s. The speedups for various batch sizes are below:

Figure 7. 16B DeepSeekv3 E2E Tokens/s/GPU Throughput Summary

MoE models have a much higher parameter-to-flops ratio than dense models, and this fact makes FSDP2 suboptimal for training due to the cost of communicating large weights. It is instead more beneficial to parallelize by statically placing different experts on different GPUs and communicating activations around. The number of tokens processed by each GPU changes dynamically in such Expert Parallel training, which makes the use of Triton kernels challenging, since every new token count may require kernel recompilation, depending on the details of the implementation. We leave support for such dynamic training workloads to future work.

Training (torchtitan)

Figure 8. Tokens/s/GPU for batch-size 4, 16B DeepSeekv3 on 8x NVIDIA H100 with FSDP2

Training (torchtitan)

Figure 9. Loss curve comparison Triton vs for-loop 16B DeepSeekv3 on 8x NVIDIA H100 with FSDP2

Conclusion

For future work, we plan to integrate our kernel into vLLM (in-progress PR here), as well as extend this kernel to support FP8 in the forward and backward. Our kernel can be leveraged from torchtitan here.  Further, we also plan to experiment with even lower precision datatypes such as MXFP4 that are supported by newer generation NVIDIA GPUs such as B200. 

Read More

Open Source AI Week Heads to the San Francisco Bay Area in October 2025

Mark your calendars! The inaugural Open Source AI Week is coming to the San Francisco Bay Area from October 18–26, 2025. This week-long celebration is the premier destination for the global AI community to explore cutting-edge research, groundbreaking tools, and open collaboration in artificial intelligence and machine learning.

What is Open Source AI Week?

Open Source AI Week brings together the best AI and ML conferences, hackathons, startup showcases, and networking opportunities exploring the intersection of artificial intelligence, machine learning, and open source technology. Taking place October 18–26, 2025 in the San Francisco Bay Area, this week-long celebration is dedicated to fostering innovation, collaboration, and community-driven solutions in the rapidly evolving AI landscape, featuring the PyTorch Conference as the flagship event.

Schedule at a Glance

Below is a current snapshot of the Open Source AI Week lineup. More information about each is available on the Open Source AI Week website.

Tuesday, October 21

  • Measuring Intelligence Day
  • AI Infra Summit
  • PyTorch Conference Startup Showcase
  • AI Infra & Open Source Models Meetup

Wednesday, October 22

  • PyTorch Conference (Day 1)

Thursday, October 23

  • PyTorch Conference (Day 2)
  • PyLadies SF at AutoKitteh

Friday, October 24

  • dAGI Summit

Plus, stay tuned, as we’ll be adding more events to the lineup soon. 

Add Your Event to Open Source AI Week

If you’re organizing an AI + Open Source event, we welcome your submission to be a part of Open Source AI Week. Submit your event to be added to the Open Source AI Week lineup! 

To ensure that all the events are relevant to the Open Source AI Week and foster an open and inclusive exchange, all submissions will be reviewed against the following guidelines:

  • Focus on Open Source AI: Events should center on open technologies within the AI ecosystem, including but not limited to open-source software and hardware for AI development, open standards, open data related to AI, and open benchmarks.
  • Bay Area Location & Timing: Events must take place in the Bay Area between October 18–26, 2025, during Open Source AI Week.
  • Commitment to Inclusion: Events, particularly those featuring speakers, should actively encourage diversity and be open to all attendees, regardless of race, gender, or background.

Open Source AI Week is your opportunity to get inspired, get involved, and help shape the future of AI.

Read More

PyTorch Wheel Variants, the Frontier of Python Packaging

PyTorch Wheel Variants, the Frontier of Python Packaging

Tweet from charliemarsh, creator of uv

PyTorch is the leading machine learning framework for developing and deploying some of the largest AI products from around the world. However, there is one major wart whenever you talk to most PyTorch users: packaging.

Now, this isn’t a new problem. Python packaging is notoriously difficult, and with the advent of compiled / specialized components for packages, the packaging ecosystem has needed an answer for how to make the experience better. (If you’re interested in learning more about these difficulties, I’d highly recommend reading through pypackaging native.)

With that in mind, we’ve launched experimental support within PyTorch 2.8 for wheel variants. To install them, you can use the following commands:

Linux x86 and aarch64, MacOS

curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh

uv pip install torch

Windows x86

powershell -ExecutionPolicy Bypass -c "$env:INSTALLER_DOWNLOAD_URL='https://wheelnext.astral.sh'; irm https://astral.sh/uv/install.ps1 | iex"

uv pip install torch

This particular post will focus on the problems that wheel variants are trying to solve and how they could impact the future of PyTorch’s packaging (and the overall Python packaging) ecosystem.

More details on the proposal and install instructions can be found on the following resources:

What are the problems?

Currently, the matrix for installing PyTorch is manifested as a modal on the PyTorch website, which looks like:

This modal has more than 10 buttons dedicated to installing different versions of PyTorch that are compiled for specialized hardware, and most of the pathways lead to install commands that end up looking like:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu129

While the command itself doesn’t look terrible, it does take a lot of steps to get to that point, including:

  • Understanding what accelerator you are using
  • Understanding what accelerator version you are using
  • Knowing which URL maps to which accelerator + accelerator version you are using
    • Of which the naming conventions can be non-standard

This has led to frustration for PyTorch users, and even worse, churn for developers using PyTorch within their own projects who need to support multiple accelerators (see example).

The future of PyTorch Packaging

This is the future of PyTorch packaging. Yes, really, that’s it.

More seriously, we have worked alongside engineers from the WheelNext community to deliver experimental binaries that:

  • Automatically identify which accelerator and accelerator version you are using (e.g., CUDA 12.8)
  • Install the best-fitting variant of PyTorch automatically based on these software and hardware parameters

NOTE: This particular feature is experimental and based on the wheel variants proposal. (PEP pending)

The PyTorch team views wheel variants as a promising way for Python packages to ensure that they can mark specific packages for support of specialized hardware and software, and will be supporting its development/proposal as it makes its way through the PEP process.

We’d love your feedback!

As we blaze the frontier of Python packaging, we’d love to hear how you are utilizing PyTorch’s wheel variants. More importantly, if you have any feedback, we invite you to post it to our issue tracker on pytorch/pytorch.

Read More

PyTorch Day China Recap

PyTorch Day China Recap

On June 7, 2025, PyTorch Day China was held in Beijing, co-hosted by PyTorch Foundation and the Beijing Academy of Artificial Intelligence (BAAI). The one-day conference featured 16 talks and averaged 160 participants per session. Explore the full YouTube playlist to find sessions that interest you.

Matt White, Executive Director of the PyTorch Foundation, delivered key insights into the PyTorch Foundation’s commitment to accelerating open source AI. Since its establishment two years ago, the foundation has grown to 30 members and evolved into an umbrella foundation capable of hosting open source projects beyond PyTorch core. vLLM and DeepSpeed became the first projects under the Foundation umbrella, and BAAI’s open source project FlagGems also joined the PyTorch Ecosystem. The PyTorch Ambassador Program, launched to support local community development, received over 200 applications within a month. Matt also introduced the new PyTorch website, as well as the schedules for PyTorch Conference and Open Source AI Week. He mentioned the Foundation’s upcoming initiatives, including the Speaker Bureau, university collaborations, and training certifications, thanked the attendees, and expressed anticipation for the day’s speeches.

2. Running Large Models on Diverse AI Chips: PyTorch + Open Source Stack (FlagOS) for Architecture-Free Deployment

Yonghua Lin, Vice President of the Beijing Academy of Artificial Intelligence, discussed the current status of running large models on diverse AI chips. She explained the rationale behind building a unified open source system software stack: large models face challenges such as high costs, massive resource demands, and expensive training/inference, while the fragmented global AI accelerator ecosystem creates additional issues. She then introduced FlagOS, developed by BAAI in collaboration with multiple partners, including core components and essential tools, supporting various underlying chips and system deployment architectures, as well as multiple large models. It has gained support from various architectures and demonstrated outstanding performance in operator efficiency and compatibility. Finally, she called for more teams to participate in building this open source ecosystem.  

3. Diving in Hugging Face Hub; Share Your Model Weights on the #1 AI Hub, Home of 700k+ PyTorch Models

Tiezhen Wang from HuggingFace introduced the HuggingFace Hub, an open source AI community often referred to as the “GitHub of AI.” It hosts a vast number of open source models and datasets, along with diverse features: spaces for easily testing models, kernels, API provider gateways, social communication functions, and open source-related metrics. Its model library offers convenient filtering by popularity and task, with a trending models page featuring various hot models. Each model has a dedicated page displaying model cards, code, and structured data. For datasets, it supports git repositories, provides visualization and SQL query functions, and offers a powerful programming interface.  

4. verl: An Open Source Large Scale LLM RL Framework for Agentic Tasks

Yuxuan Tong from ByteDance introduced verl, an open source large-scale LLM Reinforcement Learning framework. He first emphasized the importance of large-scale RL, which significantly enhances language model performance and has wide applications in real-world tasks. However, it faces challenges such as complex data flows (involving multiple models, stages, and workloads), distributed workloads, and the need to balance data dependencies and resource constraints. Verl’s strengths lie in balancing flexibility and efficiency: it achieves programming flexibility through a single-controller paradigm, allowing core logic to be described with minimal code and supporting multiple algorithms, and it features a hybrid engine to optimize resource utilization. The framework has an active open source community, with several popular projects built on it. Finally, he shared the community’s future roadmap and welcomed new members.  

5. PyTorch in China: Community Growth, Localization, and Interaction  

Zesheng Zong from Huawei discussed the development of the PyTorch community in China. As a globally popular framework, PyTorch has a large number of contributors from China, ranking among the top globally. To address the lack of localized resources for beginners, they translated PyTorch’s official website, built a community homepage, and translated tutorials from beginner to advanced levels. They also actively engaged with users through chat channels (established late last year), published over 60 technical blogs, and gained 2,500 subscribers. Future plans include further automating translations, providing more high-quality resources and events, and inviting users to participate.

6. The Development of AI Open Source and Its Influence on the AI Ecosystem  

Jianzhong Li, Senior Vice President of CSDN and Boulon technical expert, shared insights into the development of AI open source and its impact on the AI ecosystem. He compared global and Chinese AI technology ecosystems, noting that Chinese AI open source is gaining increasing global importance, and drew parallels between AI development and the evolution of biological intelligence on Earth. He then discussed the development of reasoning models, which enable large models to “think slowly” and reduce reliance on weak reasoning signals in training corpora, with machine-synthesized data in reinforcement learning playing a key role. He analyzed open source’s impact on the ecosystem, including drastically reducing model training and inference costs, and driving the evolution of AI applications toward agents capable of planning, collaboration, and action.  

7. torch.accelerator: A Unified, Device-Agnostic Runtime API for Stream-Based Accelerators  

Guangye Yu from Intel introduced the torch.accelerator APIs launched in PyTorch 2.6, a unified, device-agnostic runtime API for stream-based accelerators. While PyTorch, a widely used machine learning framework, supports various acceleration hardware, existing runtimes are coupled with specific device modules (e.g., `torch.cuda.current_device` only works for CUDA devices), limiting code portability and creating challenges for hardware vendors integrating new backends. PyTorch 2.5 introduced the concept of accelerators, and 2.6 proposed a unified device-agnostic runtime API, with functionality mapping closely to existing device-specific APIs to minimize code migration changes. Future plans include adding memory-related APIs and universal unit tests. He concluded by thanking the community and contributors for these improvements. 
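For a concrete feel for the API described in the talk, here is a short, hedged sketch (assuming PyTorch 2.6+; the exact surface may evolve):

import torch

# device-agnostic runtime queries that replace backend-specific calls such as
# torch.cuda.current_device() or torch.cuda.synchronize()
if torch.accelerator.is_available():
    acc = torch.accelerator.current_accelerator()  # e.g., device(type='cuda') or device(type='xpu')
    print(acc, torch.accelerator.device_count())
    torch.accelerator.synchronize()                # wait for work on the current accelerator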

8. vLLM: Easy, Fast, and Cheap LLM Serving for Everyone  

Kaichao You from Tsinghua University introduced vLLM, which aims to provide accessible, fast, and affordable language model inference services for everyone. Open-sourced in June 2023, it has gained widespread attention with nearly 48.3K GitHub stars. It is easy to use, supporting offline batch inference and an OpenAI-compatible API server, and works with various model types. As an official partner of major language model companies, it enables immediate deployment upon model release. vLLM supports a wide range of hardware, explores plugin-based integrations, and is used in daily life and enterprise applications. It prioritizes user experience with packages, Docker images, precompiled wheels, and a robust continuous integration system. Finally, he thanked the over 1,100 contributors in the vLLM community.

9. A torch.fx Based Compression Toolkit Empowered by torch_musa 

Fan Mo from Moore Threads introduced torch_musa, a PyTorch plugin enabling PyTorch to run natively on its platform with highly optimized features and operators. He then detailed the compression toolkit, explaining the choice of FX (debuggable, easy to modify graphs, easy to integrate). Its workflow involves inputting models and configuration files, capturing complete model graphs in the tracing phase, and optimizing/reducing via the backend. He also covered customized optimizations and support for multiple data types. Future work includes making large language and vision models traceable, accelerating inference, and building fault-tolerant systems.  

10. Efficient Training of Video Generation Foundation Model at ByteDance  

Heng Zhang from ByteDance shared ByteDance’s experience in large-scale, high-performance training of video generation foundation models, including applications in advertising, film, and animation. He introduced the video generation model structure (VE encoding, MMDIT diffusion, VE decoding) and training process (phased training, with VE encoding offline to optimize storage and preprocessing). He also discussed the challenges of load imbalance in video generation models and solutions. 

11. torch.compile Practice and Optimization in Different Scenarios

Yichen Yan from Alibaba Cloud shared the team’s experience with `torch.compile` practice and optimization. `torch.compile` accelerates models with one line of code through components like graph capturing, fallback handling, and optimized kernel generation, but faces challenges in production environments. To address these, the team resolved compatibility between Dynamo and DeepSpeed ZeRO/gradient checkpointing, submitting integration solutions to relevant libraries; identified and rewrote attention computation patterns via pattern matching for better fusion and performance; and optimized input alignment to reduce unnecessary recompilations. He also mentioned unresolved issues and future directions: compilation strategies for dynamic shapes, startup latency optimization, reducing overhead, and improving kernel caching mechanisms.

12. PyTorch in Production: Boosting LLM Training and Inferencing on Ascend NPU

Jiawei Li and Jing Li from Huawei introduced advancements in Ascend NPU (torch_npu) within the PyTorch ecosystem. Focusing on upstream support for device diversity in PyTorch, they explained the third-party device integration mechanism: using the CPU-based simulation backend OpenRag as a test backend to monitor interface functionality, and establishing mechanisms for downstream hardware vendors to identify risks before community PRs are merged.

Jing Li shared Ascend NPU’s performance and ecosystem support. He introduced the torch_npu architecture for high performance and reliability, which currently supports more than 20 popular libraries, including vLLM, torchtune, and torchtitan. He also explained how torch_npu works with NPUGraph and torch.compile to provide high-performance computation. Finally, he invited everyone to join the community and attend its periodic meetings.

13. Hetu-Galvatron: An Automatic Distributed System for Efficient Large-Scale Foundation Model Training

Xinyi Liu and Yujie Wang, from Peking University, detailed Hetu-Galvatron, an innovative PyTorch-based system with key features: automatic optimization, versatility, and user-friendliness. For model conversion, it builds on native PyTorch, transforming single-GPU training models into multi-parallel-supported models by replacing layers supporting tensors and synchronization comparison. For automatic optimization, it has an engine based on cost models and search algorithms. It supports diverse model architectures and hardware backends, ensuring integration with GPU and NPU via PyTorch. It demonstrates superior efficiency on different clusters and models, with verified performance and accuracy. Future plans include integrating torch FSDP2, supporting more parallelism strategies, more models and attention types, and optimizing post-training workflows.  

14. Intel’s PyTorch Journey: Promoting AI Performance and Optimizing Open-Source Software

Mingfei Ma from Intel’s PyTorch team introduced Intel’s work in PyTorch. For PyTorch optimization on Intel GPUs, Intel provides support on Linux and Windows, covering runtime, operator support, `torch.compile`, and distributed training. For CPU backend optimization in `torch.compile`, the team participated in architecture design, expanded data type support, implemented automatic tuning of GEMM templates, supported Windows, and continuously improved performance speedups. For DeepSeek 671B full-version performance optimization, the team completed CPU backend development with significant speedups (a 14x performance boost for prefill and 2.9x for decode), supporting multiple data types and meeting real-time requirements at low cost.

15. FlagTree: Unified AI Compiler for Diverse AI Chips 

Chunlei Men from the Beijing Academy of Artificial Intelligence introduced FlagTree, a unified AI compiler supporting diverse AI chips and a key component of the FlagOS open source stack. FlagOS, developed by BAAI with multiple partners, includes FlagGems (a general operator library for large models), FlagCX (multi-chip communication), and parallel training/inference frameworks, supporting large model training and inference. He also introduced FlagTree’s architecture for multi-backend integration, and features under development: annotation-based programming paradigms, refactored Triton compiler runtime, etc., with significant performance improvements via related optimizations.

16. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models  

Dr. Mingxing Zhang from Tsinghua University introduced KTransformers, which stands for Quick Transformers, a library built on HuggingFace’s Transformers, aiming to unlock CPU/GPU hybrid inference potential for MoE models via optimized operator integration and data layout strategies. Initially designed as a flexible framework for integrating various operator optimizations, it addresses rising inference costs due to larger models and longer contexts. For scenarios with low throughput and concurrency, it enables low-threshold model operation by offloading compute-intensive parts to GPUs and sparse parts to CPUs (tailored to models like DeepSeek), with flexible configuration. Future focus includes attention layer sparsification, adding local fine-tuning, and maintaining the Mooncake project for distributed inference, welcoming community exchanges.

17. SGLang: An Efficient Open Source Framework for Large-Scale LLM Serving  

Liangsheng Yin, a graduate student from Shanghai Jiao Tong University, introduced SGLang, an efficient open source framework for large-scale LLM serving. As a leading-performance open source engine with an elegant, lightweight, and customizable design, it is adopted by academia and companies like Microsoft and AMD, offering high-performance RL solutions. Its core is the PD disaggregation design, solving issues in non-decoupled modes: latency, unbalanced computation-communication, and scheduling incompatibility. It routes requests via load balancers, enabling KV cache transmission between prefetching and decoding instances. Future plans include latency optimization, longer sequence support, and integrating data-parallel attention. With over 400 contributors, it is used by multiple enterprises.

Read More

Introducing Mixed Precision Training in Opacus

Introduction

We integrate mixed and low-precision training with Opacus to unlock increased throughput and training with larger batch sizes. Our initial experiments show that one can maintain the same utility as with full precision training by using either mixed or low precision. These are early-stage results, and we encourage further research on the utility impact of low and mixed precision with DP-SGD.

Opacus is making significant progress in meeting the challenges of training large-scale models such as LLMs and bridging the gap between private and non-private training. In 2024, we introduced Fast Gradient Clipping to Opacus to reduce the memory footprint of the hooks-based implementation of DP-SGD. Recently, the added capability of fully sharded data parallelism (FSDP) scales training of large models across devices.

Mixed precision training, which combines different numerical precisions, has been effective in speeding up training and reducing memory usage while maintaining model utility. By using low-precision (e.g., BF16) operations alongside single-precision (FP32) operations, larger models can be trained with larger batch sizes and faster matrix operations. As an example, Llama 3 models were trained using a mix of FP32 and BF16, whereas Llama 4 used BF16 and FP8. We invite developers and researchers to experiment with scaling Opacus training to larger models by taking advantage of mixed precision and other recently introduced techniques.

Mixed and low precision training

Single-precision floating-point numbers are represented by 32 bits. Newer GPUs support high-throughput arithmetic operations with floating-point representations of 16 or 8 bits. These efficiency gains have been adopted in deep learning applications where, generally, lower precision does not harm model performance.

In low precision training, forward and backward passes and weight updates are all performed in a low precision data type (e.g., BF16 or FP8). However, weight updates in low precision can be numerically unstable.

As an alternative, mixed precision training maintains weight updates in high precision (FP32) and only uses low precision (e.g., BF16) for the forward and backward passes. Additionally, some layers, such as normalization layers, also perform operations in FP32 to maintain numerical stability.

To enable mixed precision training with Opacus, we add logic to handle the computation of per-sample gradients when activations and backprops have different precision types (as happens with the two green boxes in Figure 1). This logic is implemented inside the functions that calculate per-sample gradients (e.g., here).
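To illustrate the kind of handling involved (a hedged sketch of the idea, not Opacus' actual code; linear_per_sample_grad is a hypothetical helper), mismatched dtypes can be promoted to a common type before the per-sample gradient contraction of a linear layer:

import torch

def linear_per_sample_grad(activations: torch.Tensor, backprops: torch.Tensor) -> torch.Tensor:
    # activations: (batch, in_features), backprops: (batch, out_features)
    if activations.dtype != backprops.dtype:
        # promote both operands to a common dtype before the contraction
        common = torch.promote_types(activations.dtype, backprops.dtype)
        activations = activations.to(common)
        backprops = backprops.to(common)
    # one weight gradient per sample: (batch, out_features, in_features)
    return torch.einsum("bi,bj->bij", backprops, activations)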

Figure 1. Forward and backward pass with mixed precision training. LayerNorm forward pass happens in full precision. Linear layer operations (and most other layers) are in low precision. The output of one layer is the input to the next layer.

Figure 2. Weight update with mixed precision. Weights are always stored in full precision. Backprops are cast up to full precision.

How to use mixed and low precision training with Opacus

Low and mixed precision training with Opacus is achieved with just a few extra lines of code. It looks very similar to non-private training. Recall that Opacus wraps training components, such as model, optimizer, and dataloaders, into the PrivacyEngine, the main interface of the library. Thereafter, the training loop is identical to native PyTorch. 

Low precision training

Compared to full precision training with Opacus, we only need to: 

  • cast model weights to the lower precision before training starts, and 
  • cast inputs to lower precision.

from opacus import PrivacyEngine

# cast model weights to lower precision before training
model = model.to(torch.bfloat16)
model.train()

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
)

for x, y in dataloader:
    # cast input to lower precision
    # integer inputs should stay as integers
    # (if y is a float, y should also be cast)
    if x.is_floating_point():
        x = x.to(torch.bfloat16)
    # proceed with training step as usual
    output = model(x)
    optimizer.zero_grad()
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

Mixed precision training

PyTorch supports mixed-precision training through the torch.amp package, which we also leverage. The forward pass and loss computation run within the autocast context, whereas the backward pass should be outside of the context. The main change here is the addition of the torch.amp.autocast context.

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=dataloader,
    noise_multiplier=noise_multiplier,
    max_grad_norm=max_grad_norm,
)

for x, y in dataloader:
    # mixed precision context for the forward pass and loss computation
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        output = model(x)
        optimizer.zero_grad()
        loss = criterion(output, y)

    # backward pass is outside of the autocast context
    loss.backward()
    optimizer.step()

BERT fine-tuning task

We experiment with fine-tuning a pre-trained BERT-base model with the SNLI dataset (similar to this Opacus tutorial). We consider two common fine-tuning setups for DP-SGD: 

  • fine-tuning only the last few layers of the model, while freezing all other layers, or 
  • fine-tuning all layers with LoRA (low-rank adaptation). 

In the first case, ghost clipping improves memory usage given the large width of linear layers. In the second case, ghost clipping is not useful since the effective layer width with LoRA is very small. 

We use either FP32 only, BF16 only, or mixed-precision for training. 

We find that while non-private training has the same utility across all precision settings, DP-SGD fine-tuning of the last few layers with BF16 has a drop in performance. Mixed precision training recovers this utility loss. With LoRA fine-tuning, DP-SGD maintains the same utility across all precision settings. We hypothesize that low precision training with DP-SGD performs best when fine-tuning only linear layers, as is the case with LoRA. However, it harms utility when other types of layers are involved (e.g., normalization layers), which typically require high precision operations. Thus, LoRA is our recommended fine-tuning setting for DP-SGD with low/mixed precision. 

Memory and speed improvements

In Table 1, we compare peak memory for one forward and backward pass across the three precision settings. BF16 improves peak memory by ~2x compared to FP32, whereas mixed precision improves it by ~1.2-1.4x. Note that at small batch sizes, mixed precision can use more memory than FP32, as model weights are stored twice, in low and high precision.

In Table 2, we compare the time for one training step across the three precision settings. The BF16 speedup over FP32 increases with the batch size, ranging from ~2x to ~6x. Mixed precision training speedups range from ~1x to ~4x.

Experiments were performed on an A100 GPU with 40GB of memory.

Table 1. Peak memory for one iteration (forward+backward pass) with increasing batch size. 

Table 2. Running time of one iteration (averaged over 10 runs), with increasing batch size. 

Impact on utility

We train for one epoch and measure the maximum test set accuracy achieved during the epoch. We average the maximum accuracy over 5 runs.

When fine-tuning the last few layers, mixed precision and FP32 training achieve on par performance, while low precision training incurs a significant decrease in utility. The utility gap shrinks as the privacy budget increases. In the non-private case, low precision training seems to help performance, likely due to a regularization effect from the noisier matrix operations that counters overfitting. 

With LoRA fine-tuning, the highest accuracy is achieved with BF16, with the advantage of BF16 increasing as the privacy budget increases. Mixed and high-precision training are on par. 

We hypothesize that low precision training with DP-SGD performs best when fine-tuning only linear layers, as in LoRA. It harms utility when other types of layers are involved in fine-tuning, such as normalization layers, which typically require high-precision operations.

Table 3. Test set accuracy averaged over 5 runs, at different privacy levels. Batch size = 32. 

Industry use case

We experiment with fine-tuning a large language model with 8B parameters with DP-SGD and LoRA (~7M trainable parameters). Compared to FP32 training, BF16 achieves a 3.4x increase in samples processed per second, whereas mixed precision achieves a 1.1x increase. We achieve on-par loss and loss convergence speed between all precision settings. 

Conclusion

We have integrated a popular technique for training large-scale models into Opacus, further enhancing Opacus’ ability to meet the challenges of private training.  With mixed and low precision, Opacus achieves increased throughput and training with larger batch sizes. Our preliminary experiments show that this can be achieved without sacrificing utility. We also provide some insight into which type of precision is most suitable for different fine-tuning settings. We invite developers and the research community to experiment with this new feature and to provide further results on the utility performance of DP-SGD in mixed and low precision settings. 

To learn more about Opacus, visit opacus.ai and github.com/pytorch/opacus

Acknowledgments

We thank Ilya Mironov and Will Bullock for their valuable technical review and guidance.

Read More

Bringing Generative AI to the Masses with ExecuTorch and KleidiAI

Bringing Generative AI to the Masses with ExecuTorch and KleidiAI

Key Takeaways:

  • ExecuTorch 0.7 now enables KleidiAI by default, delivering automatic acceleration on Arm CPUs with zero integration effort.
  • GenAI is now performant on millions of existing devices—including 3–5 year-old smartphones and Raspberry Pi 5—thanks to Arm CPU features like SDOT and I8MM.
  • On-device use cases like private voice assistants, message summarization, and local code/gen AI copilots are now possible—without the cloud, and without the battery drain.

Arm’s recent SME2 announcement underscores the growing role of Arm KleidiAI as the AI acceleration layer powering the next wave of AI on Arm. By embedding into widely used Edge AI frameworks like XNNPack, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI has delivered substantial performance improvements with no code changes required by developers. That foundation leads directly to the upcoming ExecuTorch 0.7 beta, where KleidiAI will be enabled by default—bringing automatic acceleration to devices built on the latest Arm CPU architecture, as well as a vast base of existing phones built on earlier generations.

Android and cross-platform developers—whether first- or third-party—gain instant access to KleidiAI performance optimizations via ExecuTorch and XNNPack. The result? Faster model startup, lower latency, leaner memory footprints—and no integration hurdles. What previously required custom tuning is now turn-key performance, ready out of the box. This efficiency unlocks new possibilities—not just for the latest high-end devices, but for a much broader range of hardware.

When we consider running Generative AI (GenAI) on mobile devices, it’s easy to envision the latest flagship smartphones equipped with powerful CPUs, GPUs, and NPUs. But what if we told you that GenAI experiences—like running large language models (LLMs)—can also be brought to devices that are 3, 4, or even 5 years old? Or even to the Raspberry Pi 5?

Well, this is now not just a vision, but a practical reality—thanks to the Arm SDOT CPU feature, which has been available in Arm CPUs since 2015.

What is SDOT?

The SDOT (Signed Dot Product) instruction, introduced in the Armv8.2 architecture and carried forward in later CPUs, enables efficient dot product operations on vectors of signed 8-bit integers. The following image illustrates the behavior of one such SDOT instruction available on Arm CPUs:

As shown above, the instruction produces four 32-bit integer outputs, each resulting from the dot product of corresponding groups of four int8 elements from the left-hand side (LHS) and right-hand side (RHS) vector registers.

This instruction can be utilized to accelerate matrix multiplication routines—the core computational workload behind every LLM—when using Int8 or lower-bit precision formats, such as Int4. These operations typically involve numerous dot products between individual rows of the left-hand side matrix and corresponding columns of the right-hand side matrix.
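
As a concrete (non-intrinsic) reference for these semantics, the NumPy sketch below treats each 16-element int8 operand as four groups of four values and accumulates one int32 dot product per group, mirroring the per-lane behavior described above.

```python
import numpy as np

def sdot_reference(lhs: np.ndarray, rhs: np.ndarray, acc: np.ndarray) -> np.ndarray:
    """Reference semantics of one SDOT operation on 16 int8 values per operand.

    lhs, rhs: 16 signed 8-bit integers each, viewed as four groups of four.
    acc:      four 32-bit accumulators, one per group.
    Each output lane adds the dot product of the corresponding int8 quadruplet.
    """
    lhs32 = lhs.astype(np.int32).reshape(4, 4)
    rhs32 = rhs.astype(np.int32).reshape(4, 4)
    return acc + (lhs32 * rhs32).sum(axis=1)

# Four independent 4-element int8 dot products accumulated into int32 lanes.
lhs = np.arange(-8, 8, dtype=np.int8)
rhs = np.full(16, 2, dtype=np.int8)
print(sdot_reference(lhs, rhs, np.zeros(4, dtype=np.int32)))  # [-52 -20  12  44]
```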

The SDOT instruction is already widely supported across a diverse range of devices, opening the door for GenAI use cases to reach a significantly larger smartphone audience. As of today, approximately 3 billion Arm-based devices ship with CPUs that include this capability—enabling powerful on-device GenAI experiences for the majority of users. In fact, 72% of all devices now support this instruction.

Thanks to ExecuTorch, we’re now enabling models like Llama 3.2 to run efficiently on the majority of Android devices as well as edge devices like the Raspberry Pi 5.

KleidiAI + ExecuTorch: Bringing It All Together

For last year’s quantized Llama 3.2 1B announcement, the ExecuTorch and KleidiAI teams collaborated to deliver optimizations for Int4 matrix multiplication on Arm CPUs leveraging the I8MM feature, available from the Armv8.6 architecture onwards. As highlighted in a previous blog post, ExecuTorch with KleidiAI achieves over 20% higher prefill performance on the Galaxy S24+ compared to non-KleidiAI kernels. This translates to more than 350 tokens per second during the prefill phase and over 40 tokens per second during the decode phase. This level of performance is sufficient to enable on-device tasks, such as summarizing unread messages, with a smooth user experience using only Arm CPUs. For context, summarizing around 50 unread messages typically involves processing approximately 600 tokens.
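
Put differently, at more than 350 prefill tokens per second, those roughly 600 tokens are processed in under two seconds before decoding even begins.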

This year, the ExecuTorch and KleidiAI teams have focused on optimizing Int4 matrix multiplication performance by leveraging the SDOT instruction, aiming to broaden adoption.
👉 See the XNNPack PR

While LLM performance on Arm CPUs with only the SDOT extension may not match that of the latest flagship smartphones, it still enables impressive capabilities for on-device generative AI. In fact, in many scenarios, the decode phase is faster than the average human reading speed—highlighting that even older Arm CPUs can support practical and meaningful GenAI use cases.

For example, when combined with speech-to-text and text-to-speech models, a local LLM of this kind enables the creation of a fully private smart assistant that operates entirely offline, eliminating concerns about data privacy while still offering rich voice-based interactions. Such an assistant could seamlessly interact with your connected devices, giving you peace of mind about your data.

Another compelling use case for running Llama 3.2 1B is context-aware text completion in local text editors. As you type, the model provides intelligent, real-time suggestions to streamline writing or coding workflows—all without requiring an internet connection.

These are just a few examples, and they only scratch the surface of what’s possible with on-device GenAI.

Conclusion: GenAI for Everyone

With the combined power of SDOT, KleidiAI, and ExecuTorch, we’re pushing the boundaries of what’s possible—bringing Generative AI beyond high-end flagship devices and making it accessible on billions of Arm-based devices already in use.

Now it’s your turn—we’re excited to see what you’ll create. To help you get started, check out Arm’s learning path, designed to guide you through developing your own applications with LLMs using ExecuTorch and KleidiAI.

Read More