PyTorch 2.8 + TorchAO: Unlock Efficient LLM Inference on Intel® AI PCs

Large Language Models (LLMs) have transformed tasks across numerous industries, including drafting emails, generating code, and summarizing meetings. In recent months, we have worked closely with the PyTorch community to optimize LLM workloads for Intel® GPUs, which are exposed as the 'xpu' device in PyTorch. This post shows how to harness these advancements for faster local LLM inference. Notably, PyTorch supports Intel® GPUs across the lineup, including Intel® Arc™ discrete graphics, built-in Intel® Arc™ GPUs, and Intel® Arc™ Pro, so developers can run PyTorch and LLMs locally on widely available laptops and desktops, making advanced AI capabilities more accessible than ever.

Background  

Running LLMs on client devices presents two core challenges: 
Memory constraints: 7B+ models exceed typical GPU VRAM. 
GEMM and SDPA kernel efficiency: Kernels need to be optimized for both compute-bound (prefill) and memory-bound (decoding) scenarios.  

PyTorch 2.8 addresses these on Intel® GPUs via: 
oneDNN backend: Optimizes GEMM/SDPA ops for Intel® GPUs. 
TorchAO: Enables INT4 quantization through tensor subclassing on Intel® GPUs. 
Windows-native torch.compile: Graph fusion for memory-bound ops on Intel® GPUs.  

LLM Inference Optimizations for Intel® GPUs  

Scaled Dot Product Attention Optimization: Unlocking Competitive Performance on Intel GPUs  

The torch.nn.functional.scaled_dot_product_attention (SDPA) operator implements the core attention mechanism from the landmark paper "Attention Is All You Need". While a naive PyTorch implementation exists, the fused backend delivers transformative performance gains. Since PyTorch 2.7, we have integrated oneDNN as the default XPU backend, optimizing SDPA for both:  
✅ Long sequences (prefill stage) 
✅ Short sequences (decoding stage)  
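
The snippet below is a minimal sketch of calling the fused SDPA kernel directly on the xpu device. The tensor shapes are illustrative, and the enable_gqa flag (available in recent PyTorch releases) is shown only as an optional variant for grouped-query models.

import torch  
import torch.nn.functional as F  

# Illustrative shapes: batch=1, heads=32, seq_len=1024, head_dim=128  
q = torch.randn(1, 32, 1024, 128, device="xpu", dtype=torch.float16)  
k = torch.randn(1, 32, 1024, 128, device="xpu", dtype=torch.float16)  
v = torch.randn(1, 32, 1024, 128, device="xpu", dtype=torch.float16)  

# Dispatches to the fused oneDNN SDPA kernel on Intel GPUs  
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  

# GQA variant: fewer key/value heads than query heads (8 vs. 32)  
k_gqa = torch.randn(1, 8, 1024, 128, device="xpu", dtype=torch.float16)  
v_gqa = torch.randn(1, 8, 1024, 128, device="xpu", dtype=torch.float16)  
out_gqa = F.scaled_dot_product_attention(q, k_gqa, v_gqa, is_causal=True, enable_gqa=True)  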

Key Advancements  

  • PyTorch 2.8+: Added Grouped Query Attention (GQA) support alongside Multi-Head Attention (MHA).  
  • Hugging Face Integration: Native support in Transformers as primary attention backend.  
  • KV Cache Compatibility: Works seamlessly with both dynamic and static caching strategies.  
Intel GPUs are now automatically integrated via Hugging Face, providing out-of-the-box hardware acceleration for Transformers models:

# Automatic integration with Hugging Face Transformers  

import torch  
from transformers import AutoModelForCausalLM, AutoTokenizer  

model_id = "microsoft/Phi-3-mini-4k-instruct"  
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16)  

torch.compile: Graph Fusion for Intel® GPUs  

torch.compile transforms eager-mode execution into optimized graph mode, leveraging TorchDynamo for Python tracing and TorchInductor for Triton kernel generation.   

For Intel® GPUs, this unlocks:  

✅ Cross-platform acceleration 
  – Linux support since PyTorch 2.5 
  – First Windows-compatible accelerator for torch.compile (PyTorch 2.7+). Refer to "How to use torch.compile on Windows CPU/XPU" 
✅ Decoding speedup via fused RoPE/RMSNorm kernels (see the sketch below) 
✅ Prefill acceleration through graph optimizations  
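
As a minimal sketch of that kernel fusion, the following compiles an RMSNorm-style function for the xpu device; the function and shapes are illustrative only, and the full end-to-end example appears in the step-by-step section later in this post.

import torch  

def rms_norm(x, weight, eps=1e-6):  
    # A chain of memory-bound elementwise ops that TorchInductor can fuse  
    # into a single generated Triton kernel  
    variance = x.pow(2).mean(-1, keepdim=True)  
    return weight * x * torch.rsqrt(variance + eps)  

compiled_rms_norm = torch.compile(rms_norm)  

x = torch.randn(1, 1024, 3072, device="xpu", dtype=torch.float16)  
weight = torch.ones(3072, device="xpu", dtype=torch.float16)  
out = compiled_rms_norm(x, weight)  # the first call triggers compilation  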

Based on our benchmarks, torch.compile delivers decoding speedups of at least 1.39× and prefill (first-token) speedups of at least 1.1× across the models below.

| Model | Prefill Speedup (First Token) | Decoding Speedup (Next Token) |
|---|---|---|
| meta-llama/Llama-3.2-3B | 2.42× | 1.58× |
| microsoft/Phi-3-mini-4k-instruct | 1.12× | 1.39× |
| Qwen/Qwen3-4B | 2.68× | 1.99× |

Tested on an Intel® Core™ i5-13400 with Intel® Arc™ B580 (12GB VRAM); more details can be found in the Product and Performance Information section. 

WOQ-INT4: Elevating LLM Efficiency on Intel® GPUs  

Weight-Only Quantization (WOQ) is a technique that compresses model weights to 4-bit integers (INT4) while keeping activations in 16-bit precision (FP16/BF16); Figure 1 shows in detail how this is applied within a Transformer model. WOQ balances efficiency and accuracy by targeting the memory-bandwidth bottleneck: compressing the weights reduces the memory footprint and accelerates the memory-bound decoding stage. 

For users of Intel® GPUs, addressing this bottleneck translates directly into significant, practical advantages: models become smaller, faster, and more power-efficient. This enables larger models to run on consumer hardware with a substantially better user experience. 
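
For intuition, here is a minimal sketch of group-wise asymmetric INT4 weight quantization and the corresponding dequantization. The group size and shapes are illustrative, and production kernels such as TorchAO's A16W4 GEMM fuse the dequantization into the matmul rather than materializing FP16 weights.

import torch  

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):  
    # w: [out_features, in_features] FP16 weight, grouped along in_features  
    out_features, in_features = w.shape  
    groups = w.reshape(out_features, in_features // group_size, group_size).float()  
    w_min = groups.amin(dim=-1, keepdim=True)  
    w_max = groups.amax(dim=-1, keepdim=True)  
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)   # 4-bit range: 0..15  
    zero_point = (-w_min / scale).round().clamp(0, 15)  
    # Stored as uint8 here for simplicity; real kernels pack two 4-bit values per byte  
    q = (groups / scale + zero_point).round().clamp(0, 15).to(torch.uint8)  
    return q, scale.half(), zero_point.half()  

def dequantize_int4_groupwise(q, scale, zero_point, shape):  
    return ((q.float() - zero_point.float()) * scale.float()).reshape(shape).half()  

w = torch.randn(4096, 4096, dtype=torch.float16)  # ~32 MB in FP16  
q, scale, zp = quantize_int4_groupwise(w)          # ~8 MB of 4-bit data plus per-group scales  
w_hat = dequantize_int4_groupwise(q, scale, zp, w.shape)  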

Our benchmarks on popular models concretely demonstrate these improvements: 

| Model | Memory Reduction | Decoding Speedup |
|---|---|---|
| meta-llama/Llama-3.2-3B | 5.98GB → 2.10GB | 1.62× |
| microsoft/Phi-3-mini-4k-instruct | 7.11GB → 2.13GB | 2.14× |
| Qwen/Qwen3-4B | 7.49GB → 2.49GB | 1.56× |

Tested on an Intel® Core™ i5-13400 with Intel® Arc™ B580 (12GB VRAM); more details can be found in the Product and Performance Information section. 

The data confirm that WOQ consistently achieves over 65% memory savings and more than 1.5x faster decoding speeds, highlighting its significant role in enhancing the accessibility and performance of LLMs. 

This optimization is enabled in PyTorch 2.8 via TorchAO's tensor subclass abstraction. The PyTorch-native framework replaces Linear layer weights with backend-agnostic INT4 representations and implements the A16W4 GEMM (FP16 activations, INT4 weights) using oneDNN. As illustrated by the Phi-3-mini-4k-instruct implementation in Figure 1, only the decoder Linear layers are quantized.  

Seamlessly integrated with Hugging Face Transformers, WOQ-INT4 supports AWQ, RTN, and GPTQ algorithms while delivering near-FP16 inference experiences through post-training quantization workflows. 

Figure 1: WOQ-INT4 applied to Linear layers in Phi-3-mini  
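
For readers who want to quantize a module directly with TorchAO rather than through Transformers, the following is a minimal sketch. It reuses the Int4WeightOnlyConfig, Int4XPULayout, and ZeroPointDomain APIs imported in the WOQ-INT4 example later in this post (the exact signature of Int4WeightOnlyConfig may vary with the TorchAO version), and the toy Linear layer stands in for a real decoder layer.

import torch  
from torchao.quantization import quantize_  
from torchao.quantization.quant_api import Int4WeightOnlyConfig  
from torchao.dtypes import Int4XPULayout  
from torchao.quantization.quant_primitives import ZeroPointDomain  

# Toy module standing in for a decoder Linear layer  
linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16, device="xpu")  

# Swap the FP16 weight for a backend-agnostic INT4 tensor-subclass representation in place  
quantize_(  
    linear,  
    Int4WeightOnlyConfig(  
        group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT  
    ),  
)  

x = torch.randn(1, 4096, dtype=torch.float16, device="xpu")  
y = linear(x)  # A16W4 GEMM: FP16 activations, INT4 weights  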

Step-by-Step: Run LLM Inference on Intel® GPUs

Installation  

Install Intel GPU Driver  

To enable Intel® GPU acceleration, begin by installing the latest graphics driver: Windows users should download the driver from the Intel Arc & Iris Xe Graphics Driver page and follow the on-screen installation instructions. Ubuntu users should refer to the Intel GPU Driver Installation guide for OS-specific setup steps.  

Install PyTorch and other required packages

To install PyTorch and its dependencies, use the following command:

# Install PyTorch + dependencies    
pip install torch torchao --index-url https://download.pytorch.org/whl/xpu    

pip install transformers accelerate    
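
As a quick sanity check, you can confirm that PyTorch sees the XPU device before running the examples below; the torch.xpu helpers used here mirror their familiar torch.cuda counterparts.

import torch  

print(torch.__version__)             # expect a 2.8+ build with XPU support  
print(torch.xpu.is_available())      # True once the Intel GPU driver and runtime are set up  
print(torch.xpu.get_device_name(0))  # e.g., the name of your Intel Arc GPU  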

Enable torch.compile on Windows  

Please refer to How to use torch.compile on Windows CPU/XPU to install MSVC and activate the environment on Windows. Linux users can skip this step.  

Run FP16 Inference (Eager Mode)

import torch  
from transformers import AutoModelForCausalLM, AutoTokenizer  

model_id = "microsoft/Phi-3-mini-4k-instruct"  
model = AutoModelForCausalLM.from_pretrained(  
    model_id, device_map="xpu", torch_dtype=torch.float16  
)  
tokenizer = AutoTokenizer.from_pretrained(model_id)  

prompt = "Hey, are you conscious? Can you talk to me?"  
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")  

generate_kwargs = dict(  
    do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static"  
)  
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)  
output_text = tokenizer.batch_decode(  
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False  
)  
print(output_text)  

The example above demonstrates LLM inference running in PyTorch eager mode simply by setting the device to "xpu".

Accelerate with torch.compile  

import torch  
from transformers import AutoModelForCausalLM, AutoTokenizer  

model_id = "microsoft/Phi-3-mini-4k-instruct"  
model = AutoModelForCausalLM.from_pretrained(  
    model_id, device_map="xpu", torch_dtype=torch.float16  
)  
# Compile the forward pass; TorchInductor generates fused Triton kernels for XPU  
model.forward = torch.compile(model.forward)  
tokenizer = AutoTokenizer.from_pretrained(model_id)  

prompt = "Hey, are you conscious? Can you talk to me?"  
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")  

generate_kwargs = dict(  
    do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static"  
)  
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)  
output_text = tokenizer.batch_decode(  
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False  
)  
print(output_text)

Accelerating Inference with WOQ-INT4  

import torch  
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig  
from torchao.quantization.quant_api import Int4WeightOnlyConfig  
from torchao.dtypes import Int4XPULayout  
from torchao.quantization.quant_primitives import ZeroPointDomain  

model_id = "microsoft/Phi-3-mini-4k-instruct"  

# Create the quantization configuration  
quantization_config = TorchAoConfig(  
    "int4_weight_only",  
    group_size=128,  
    layout=Int4XPULayout(),  
    zero_point_domain=ZeroPointDomain.INT,  
)  

# Load and automatically quantize the model  
model = AutoModelForCausalLM.from_pretrained(  
    model_id,  
    device_map="xpu",  
    torch_dtype=torch.float16,  
    quantization_config=quantization_config,  
)  

# Use unwrap_tensor_subclass_parameters to reduce the tensor-subclass host overhead  
from torch._functorch._aot_autograd.subclass_parametrization import (  
    unwrap_tensor_subclass_parameters,  
)  
unwrap_tensor_subclass_parameters(model)  

model.forward = torch.compile(model.forward)  

tokenizer = AutoTokenizer.from_pretrained(model_id)  
prompt = "Hey, are you conscious? Can you talk to me?"  
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")  

generate_kwargs = dict(  
    do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static"  
)  
# First call warms up the compiled graph; the second measures steady-state generation  
model.generate(**inputs, max_new_tokens=128, **generate_kwargs)  
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)  

output_text = tokenizer.batch_decode(  
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False  
)  
print(output_text)

To further reduce token generation latency, Weight-Only Quantization INT4 (WOQ-INT4) strategically balances accuracy and performance. For the Phi-3-mini model, WOQ-INT4 achieves over 65% model compression while delivering:   

1) Faster decoding during the memory-bound stage, resulting in reduced token latency;   

2) Prefill efficiency that matches FP16 during the compute-bound stage;   

3) Scalability to larger models, enabling LLMs that were previously too large to run in FP16 precision on a single consumer GPU to now operate smoothly on a single Intel® Arc™ B-Series GPU. This brings practical, real-time performance for complex LLMs to consumer-grade hardware. 
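
To verify the compression on your own machine, here is a minimal sketch that compares device memory after loading the FP16 and WOQ-INT4 variants; the torch.xpu memory helpers mirror their torch.cuda counterparts, and the exact numbers will vary with the model and software versions.

import torch  
from transformers import AutoModelForCausalLM, TorchAoConfig  
from torchao.dtypes import Int4XPULayout  
from torchao.quantization.quant_primitives import ZeroPointDomain  

model_id = "microsoft/Phi-3-mini-4k-instruct"  

# FP16 baseline  
fp16_model = AutoModelForCausalLM.from_pretrained(  
    model_id, device_map="xpu", torch_dtype=torch.float16  
)  
print(f"FP16 weights on XPU: {torch.xpu.memory_allocated() / 1e9:.2f} GB")  
del fp16_model  
torch.xpu.empty_cache()  

# WOQ-INT4  
quantization_config = TorchAoConfig(  
    "int4_weight_only", group_size=128, layout=Int4XPULayout(),  
    zero_point_domain=ZeroPointDomain.INT,  
)  
int4_model = AutoModelForCausalLM.from_pretrained(  
    model_id, device_map="xpu", torch_dtype=torch.float16,  
    quantization_config=quantization_config,  
)  
print(f"WOQ-INT4 weights on XPU: {torch.xpu.memory_allocated() / 1e9:.2f} GB")  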

Conclusion

PyTorch 2.8 + TorchAO unlocks: 
✅ Reduced next-token latency on consumer GPUs 
✅ Support for 10B+ parameter models on single Intel® Arc™ GPUs 
✅ Windows-native acceleration via torch.compile  

Key achievements

– over 65% reduction in VRAM usage through WOQ-INT4 
– over 1.5× decoding speedup compared to FP16 
– Seamless integration with Hugging Face 

Acknowledgement

Special thanks to PyTorch maintainers: Jerry Zhang, Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao. 

Product and Performance Information

Test Hardware
– Intel® Core™ Ultra 9 288V (Arc 140V, 16GB VRAM) 
– Intel® Core™ i5-13400 (Arc B580, 12GB VRAM) 
Software: Windows 11 Pro, Intel® Graphics Driver 32.0.101.6972, PyTorch 2.8 
Workload: Performance tested on "microsoft/Phi-3-mini-4k-instruct", comparing three configurations: float16 eager mode, float16 + torch.compile, and float16 + torch.compile + TorchAO WOQ-INT4.   

Test by Intel on July 29th, 2025.  

Explore more
torch.compile on Windows Tutorial  
TorchAO Quantization Guide  
Intel® GPU Driver Portal on Windows  

Notice and Disclaimers

Performance varies by use, configuration, and other factors. Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.  
