Large Language Models (LLMs) have transformed tasks across numerous industries, from drafting emails and generating code to summarizing meetings. In recent months, we have worked closely with the PyTorch community to optimize LLM workloads for Intel® GPUs, which are exposed in PyTorch as the ‘xpu’ device. This post illustrates how to harness these advancements for faster local LLM inference. Notably, Intel® GPUs, including Intel® Arc discrete and built-in graphics as well as Intel® Arc Pro, support PyTorch, empowering developers to run PyTorch and LLMs locally on widely available laptops and desktops and making advanced AI capabilities more accessible than ever.
Background
Running LLMs on client devices presents two core challenges:
– Memory constraints: models with 7B+ parameters exceed the VRAM of typical consumer GPUs when stored in FP16 (see the quick estimate below).
– GEMM and SDPA kernel efficiency: kernels need to be optimized for both the compute-bound prefill stage and the memory-bound decoding stage.
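As a back-of-the-envelope check on the memory constraint, here is a minimal sketch (the 7B parameter count is illustrative and covers weights only, not the KV cache):
# Rough FP16 weight footprint of a 7B-parameter model.
params = 7e9
bytes_per_param_fp16 = 2
print(f"{params * bytes_per_param_fp16 / 1024**3:.1f} GiB")  # ~13.0 GiB, already above many 8-12 GB consumer GPUs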
PyTorch 2.8 addresses these on Intel® GPUs via:
– oneDNN backend: Optimizes GEMM/SDPA ops for Intel® GPUs.
– TorchAO: Enables INT4 quantization through tensor subclassing on Intel® GPUs.
– Windows-native torch.compile: Graph fusion for memory-bound ops on Intel® GPUs.
LLM Inference Optimizations for Intel® GPUs
Scaled Dot Product Attention Optimization: Unlocking Competitive Performance on Intel GPUs
The torch.nn.functional.scaled_dot_product_attention (SDPA) operator implements the core attention mechanism from the landmark paper “Attention Is All You Need”. While a naive PyTorch implementation exists, the fused backend delivers transformative performance gains. Since PyTorch 2.7, we have integrated oneDNN as the default SDPA backend on XPU, optimizing it for both:
– Long sequences (prefill stage)
– Short sequences (decoding stage)
Key Advancements
– PyTorch 2.8+: Added Grouped Query Attention (GQA) support alongside Multi-Head Attention (MHA), as sketched below.
– Hugging Face Integration: Native support in Transformers as the primary attention backend.
– KV Cache Compatibility: Works seamlessly with both dynamic and static caching strategies.
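To show the fused SDPA path and the GQA support directly, here is a minimal sketch with arbitrary toy shapes (the enable_gqa flag is part of the standard SDPA signature; the sizes below are not from the benchmarks):
import torch
import torch.nn.functional as F

# Toy GQA shapes: 32 query heads share 8 key/value heads (sizes are illustrative).
q = torch.randn(1, 32, 1024, 64, dtype=torch.float16, device="xpu")
k = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="xpu")
v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="xpu")

# On Intel GPUs the fused oneDNN backend is selected automatically.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([1, 32, 1024, 64])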
Intel GPUs are now automatically integrated via Hugging Face, providing out-of-the-box hardware acceleration for Transformers models:
# Automatic integration with Hugging Face Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16)
torch.compile: Graph Fusion for Intel® GPUs
torch.compile transforms eager-mode execution into optimized graph mode, leveraging TorchDynamo for Python tracing and TorchInductor for Triton kernel generation.
For Intel® GPUs, this unlocks:
– Cross-platform acceleration: Linux support since PyTorch 2.5, and the first Windows-compatible accelerator for torch.compile (PyTorch 2.7+). Refer to “How to use torch.compile on Windows CPU/XPU”.
– Decoding speedup via fused RoPE/RMSNorm kernels (a minimal fusion sketch follows the table below).
– Prefill acceleration through graph optimizations.
Based on our benchmarks, torch.compile delivers a decoding speedup of over 1.39x and a prefill (first token generation) speedup of over 1.1x.
| Model | Prefill Speedup (First Token) | Decoding Speedup (Next Token) |
| meta-llama/Llama-3.2-3B | 2.42× | 1.58× |
| microsoft/Phi-3-mini-4k-instruct | 1.12× | 1.39× |
| Qwen/Qwen3-4B | 2.68× | 1.99× |
Tested on Intel® Core i5-13400 with Intel® Arc B580 (12GB VRAM); more information can be found in the Product and Performance Information section.
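As a rough illustration of the graph fusion behind these numbers, here is a minimal sketch that compiles a standalone RMSNorm for the xpu device (the rms_norm function below is a simplified stand-in, not the kernel used in the benchmarks):
import torch

def rms_norm(x, weight, eps=1e-6):
    # Memory-bound elementwise math that TorchInductor can fuse into a single kernel.
    variance = x.pow(2).mean(-1, keepdim=True)
    return weight * x * torch.rsqrt(variance + eps)

compiled_rms_norm = torch.compile(rms_norm)

x = torch.randn(1, 1024, 3072, dtype=torch.float16, device="xpu")
w = torch.ones(3072, dtype=torch.float16, device="xpu")
out = compiled_rms_norm(x, w)  # first call compiles; later calls reuse the fused kernel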
WOQ-INT4: Elevating LLM Efficiency on Intel® GPUs
Weight-Only Quantization (WOQ) is a technique that compresses model weights to 4-bit integers (INT4) while keeping activations in 16-bit precision (FP16/BF16). Figure 1 shows how this process is applied in detail within a Transformer model. It balances efficiency and accuracy: shrinking the weights reduces the memory footprint and relieves the memory-bandwidth bottleneck, which accelerates overall performance.
For users of Intel® GPUs, addressing this bottleneck translates directly into significant, practical advantages: models become smaller, faster, and more power-efficient. This enables larger models to run on consumer hardware with a substantially better user experience.
Our benchmarks on popular models concretely demonstrate these improvements:
| Model | Memory Reduction | Decoding Speedup |
| meta-llama/Llama-3.2-3B | 5.98GB → 2.10GB | 1.62× |
| microsoft/Phi-3-mini-4k-instruct | 7.11GB → 2.13GB | 2.14× |
| Qwen/Qwen3-4B | 7.49GB → 2.49GB | 1.56× |
Tested on Intel® Core i5-13400 with Intel® Arc B580 (12GB VRAM); more information can be found in the Product and Performance Information section.
The data confirm that WOQ consistently achieves over 65% memory savings and more than 1.5x faster decoding speeds, highlighting its significant role in enhancing the accessibility and performance of LLMs.
This optimization is enabled in PyTorch 2.8 through TorchAO’s tensor subclass abstraction. The PyTorch-native framework replaces Linear layer weights with backend-agnostic INT4 representations and implements the A16W4 GEMM with oneDNN. As illustrated by the Phi-3-mini-4k-instruct example in Figure 1, only the decoder Linear layers are quantized.
Seamlessly integrated with Hugging Face Transformers, WOQ-INT4 supports the AWQ, RTN, and GPTQ algorithms while delivering a near-FP16 inference experience through post-training quantization workflows.
Figure 1: WOQ-INT4 applied to Linear layers in Phi-3-mini
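As a minimal sketch of the tensor-subclass mechanism on a single Linear layer (the layer size is illustrative; the Int4WeightOnlyConfig arguments mirror those used in the full example later in this post):
import torch
from torch import nn
from torchao.quantization import quantize_
from torchao.quantization.quant_api import Int4WeightOnlyConfig
from torchao.dtypes import Int4XPULayout
from torchao.quantization.quant_primitives import ZeroPointDomain

# Toy stand-in for one decoder Linear layer.
model = nn.Sequential(nn.Linear(3072, 3072, bias=False)).to("xpu", torch.float16)

# Swap the FP16 weights for a backend-agnostic INT4 tensor subclass (A16W4).
quantize_(model, Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(),
                                      zero_point_domain=ZeroPointDomain.INT))

x = torch.randn(1, 3072, dtype=torch.float16, device="xpu")
y = model(x)  # activations stay in FP16; weights are stored as INT4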
Step-by-Step: Run LLM Inference on Intel® GPUs
Installation
Install Intel GPU Driver
To enable Intel® GPU acceleration, begin by installing the latest graphics driver: Windows users should download the driver from the Intel Arc & Iris Xe Graphics Driver page and follow the on-screen installation instructions. Ubuntu users should refer to the Intel GPU Driver Installation guide for OS-specific setup steps.
Install PyTorch and other required packages
To install PyTorch and its dependencies, use the following command:
# Install PyTorch + dependencies
pip install torch torchao --index-url https://download.pytorch.org/whl/xpu
pip install transformers accelerate
Enable torch.compile on Windows
Please refer to How to use torch.compile on Windows CPU/XPU to install MSVC and activate the environment on Windows. Linux users can skip this step.
Run FP16 Inference (Eager Mode)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

generate_kwargs = dict(do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static")
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
The example above demonstrates LLM inference running in PyTorch eager mode simply by setting the device to “xpu”.
Accelerate with torch.compile
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16)
# Compile the forward pass into an optimized graph
model.forward = torch.compile(model.forward)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

generate_kwargs = dict(do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static")
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Accelerating Inference with WOQ-INT4
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization.quant_api import Int4WeightOnlyConfig
from torchao.dtypes import Int4XPULayout
from torchao.quantization.quant_primitives import ZeroPointDomain

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Create quantization configuration
quantization_config = TorchAoConfig("int4_weight_only", group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT)

# Load and automatically quantize
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="xpu", torch_dtype=torch.float16, quantization_config=quantization_config)

# Use unwrap_tensor_subclass_parameters to reduce the subclass host overhead
from torch._functorch._aot_autograd.subclass_parametrization import (
    unwrap_tensor_subclass_parameters,
)
unwrap_tensor_subclass_parameters(model)

model.forward = torch.compile(model.forward)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

generate_kwargs = dict(do_sample=True, temperature=0.9, num_beams=1, cache_implementation="static")
# Warm-up call so the second generate runs with the compiled graph already built
model.generate(**inputs, max_new_tokens=128, **generate_kwargs)
generated_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
To further reduce token generation latency, Weight-Only Quantization INT4 (WOQ-INT4) strategically balances accuracy and performance. For the Phi-3-mini model, WOQ-INT4 achieves over 65% model compression while delivering:
1) Faster decoding during the memory-bound stage, resulting in reduced token latency;
2) Prefill efficiency that matches FP16 during the compute-bound stage;
3) Scalability for larger models, allowing LLMs that were previously too large to run in FP16 precision on a single consumer GPU to operate smoothly on a single Intel® Arc B-Series GPU. This brings practical, real-time performance for complex LLMs to consumer-grade hardware.
Conclusion
PyTorch 2.8 + TorchAO unlocks:
– Reduced next-token latency on consumer GPUs
– Support for 10B+ parameter models on a single Intel® Arc GPU
– Windows-native acceleration via torch.compile
Key achievements:
– Over 65% reduction in VRAM usage through WOQ-INT4
– Over 1.5× decoding speedup compared to FP16
– Seamless integration with Hugging Face
Acknowledgement
Special thanks to PyTorch maintainers: Jerry Zhang, Nikita Shulga, Jason Ansel, Andrey Talman, Alban Desmaison, and Bin Bao.
Product and Performance Information
Test Hardware:
– Intel® Core Ultra 9 288V (Arc 140V, 16GB VRAM)
– Intel® Core i5-13400 (Arc B580, 12GB VRAM)
Software: Windows 11 Pro, Intel® Graphics Driver 32.0.101.6972, PyTorch 2.8
Workload: Performance tested on “microsoft/Phi-3-mini-4k-instruct” with three configurations: FP16 eager mode, FP16 + torch.compile, and FP16 + torch.compile + TorchAO WOQ-INT4.
Test by Intel on July 29th, 2025.
Explore more:
– Torch.compile on Windows Tutorial
– TorchAO Quantization Guide
– Intel® GPU Driver Portal on Windows
Notice and Disclaimers
Performance varies by use, configuration, and other factors. Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.