Capital One Banks on AI for Financial Services

Financial services has long been at the forefront of adopting technological innovations. Today, generative AI and agentic systems are redefining the industry, from customer interactions to enterprise operations.

Prem Natarajan, executive vice president, chief scientist and head of AI at Capital One, joined the NVIDIA AI Podcast to discuss how his organization is building proprietary AI systems that deliver value to over 100 million customers.

“AI is at its best when it transfers cognitive burden from the human to the system,” Natarajan said. “It allows the human to have that much more fun and experience that magic.”

Capital One’s strategy centers on a “test, iterate, refine” approach that balances innovation with rigorous risk management. The company’s first agentic AI deployment is a chat concierge that helps customers navigate the car-buying process, such as by scheduling test drives.

Rather than simply integrating third-party solutions, Capital One builds proprietary AI technologies that tap into its vast data repositories.

“Your data advantage is your AI advantage,” Natarajan emphasized. “Proprietary data allows you to build proprietary AI that provides enduring differentiated services for your customers.”

Capital One’s AI architecture combines open-weight foundation models with deep customizations using proprietary data. This approach, Natarajan explained, supports the creation of specialized models that excel at financial services tasks and integrate into multi-agent workflows that can take actions.

Natarajan stressed that responsible AI is fundamental to Capital One’s design process. His teams take a “responsibility through design” approach, implementing robust guardrails — both technological and human-in-the-loop — to ensure safe deployment.

The concept of an AI factory — where raw data is processed and refined to produce actionable intelligence — aligns naturally with Capital One’s cloud-native technology stack. AI factories incorporate all the components required for financial institutions to generate intelligence, combining hardware, software, networking and development tools for AI applications in financial services.

Time Stamps

1:10 – Natarajan’s background and journey to Capital One.

4:50 – Capital One’s approach to generative AI and agentic systems.

15:56 – Challenges in implementing responsible AI in financial services.

28:46 – AI factories and Capital One’s cloud-native advantage.

You Might Also Like… 

NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises

Agentic AI enables developers to create intelligent multi-agent systems that reason, act and execute complex tasks with a degree of autonomy. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications.

Telenor Builds Norway’s First AI Factory, Offering Sustainable and Sovereign Data Processing

Telenor opened Norway’s first AI factory in November 2024, enabling organizations to process sensitive data securely on Norwegian soil while prioritizing environmental responsibility. Telenor’s Chief Innovation Officer and Head of the AI Factory Kaaren Hilsen discusses the AI factory’s rapid development, going from concept to reality in under a year.

Imbue CEO Kanjun Qiu on Transforming AI Agents Into Personal Collaborators

Kanjun Qiu, CEO of Imbue, explores the emerging era where individuals can create and use their own AI agents. Drawing a parallel to the PC revolution of the late 1970s and ’80s, Qiu discusses how modern AI systems are evolving to work collaboratively with users, enhancing their capabilities rather than just automating tasks.

Research Focus: Week of April 21, 2025

In this issue:

Catch a preview of our presentations and papers at CHI 2025 and ICLR 2025. We also introduce new research on causal reasoning and LLMs; enhancing LLM jailbreak capabilities to bolster safety and robustness; understanding how people perform when using AI compared to AI alone; and Distill-MOS, a compact and efficient model that delivers state-of-the-art speech quality assessment. You’ll also find a replay of a podcast discussion on rural healthcare innovation with Senior Vice President of Microsoft Health Jim Weinstein.

Research Focus: April 23, 2025

Microsoft at CHI 2025

Microsoft Research is proud to be a sponsor of the ACM CHI 2025 Conference on Human Factors in Computing Systems. CHI brings together researchers and practitioners from all over the world and from diverse cultures, backgrounds, and positionalities, who share an overarching goal to make the world a better place with interactive digital technologies.

Our researchers will host more than 30 sessions and workshops at this year’s conference in Yokohama, Japan. We invite you to preview our presentations and our two dozen accepted papers.


Microsoft at ICLR 2025

Microsoft is proud to be a sponsor of the thirteenth International Conference on Learning Representations (ICLR). This gathering is dedicated to the advancement of representation learning, which is a branch of AI. We are pleased to share that Microsoft has more than 30 accepted papers at this year’s conference, which we invite you to preview.

ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, text understanding, gaming, and robotics.


Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Figure: Tackling real-world causal tasks by strategically alternating between logical and covariance-based causal reasoning to formulate sub-questions, iterate, and verify premises and implications.

What kinds of causal arguments can large language models (LLMs) generate, how valid are these arguments, and what causal reasoning workflows can this generation support or automate? This paper, which was selected for ICLR 2025, addresses these questions. It advances our understanding of LLMs and their causal implications, and proposes a framework for future research at the intersection of LLMs and causality.

This discussion has critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. In capturing common sense and domain knowledge about causal mechanisms and supporting translation between natural language and formal methods, LLMs open new frontiers for advancing the research, practice, and adoption of causality.


The Future of AI in Knowledge Work: Tools for Thought at CHI 2025


Can AI tools do more than streamline workflows—can they actually help us think better? That’s the driving question behind the Microsoft Research Tools for Thought initiative. At this year’s CHI conference, this group is presenting four new research papers and cohosting a workshop that dives deep into this intersection of AI and human cognition.

The team provides an overview of their latest research, starting with a study on how AI is changing the way people think and work. They introduce three prototype systems designed to support different cognitive tasks. Finally, through their Tools for Thought workshop, they invite the CHI community to help define AI’s role in supporting human thinking.


Building LLMs with enhanced jailbreaking capabilities to bolster safety and robustness

Figure: Overview of crafting ADV-LLM. The process begins with refining the target and initializing a starting suffix; ADV-LLM then iteratively generates data for self-tuning.

Recent research shows that LLMs are vulnerable to automated jailbreak attacks, where algorithm-generated adversarial suffixes bypass safety alignment and trigger harmful responses. This paper introduces ADV-LLM, an iterative self-tuning process for crafting adversarial LLMs with enhanced jailbreak capabilities—which could provide valuable insights for future safety alignment research.

ADV-LLM is less computationally expensive than prior mechanisms and achieves higher attack success rates (ASR), especially against well-aligned models like Llama2 and Llama3.

It reaches nearly 100% ASR on various open-source LLMs and demonstrates strong transferability to closed-source models—achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4—despite being optimized solely on Llama3. Beyond improving jailbreak performance, ADV-LLM offers valuable insights for future alignment research by enabling large-scale generation of safety-relevant datasets.


ChatBench: From Static Benchmarks to Human-AI Evaluation

Figure: Flow of the ChatBench user study. In Phase 1, users answer questions on their own; in Phase 2, they answer with AI.

The rapid adoption of LLM-based chatbots raises the need to understand what people and LLMs can achieve together. However, standard benchmarks like MMLU assess LLM capabilities in isolation (i.e., “AI alone”). This paper presents the results of a user study that transforms MMLU questions into interactive user-AI conversations. The researchers seeded the participants with the question and then had them engage in a conversation with the LLM to arrive at an answer. The result is ChatBench, a new dataset comprising AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144,000 answers and 7,336 user-AI conversations.

The researchers’ analysis reveals that AI-alone accuracy does not predict user-AI accuracy, with notable differences across subjects such as math, physics, and moral reasoning. Examining user-AI conversations yields insights into how these interactions differ from AI-alone benchmarks. Finally, the researchers demonstrate that finetuning a user simulator on a subset of ChatBench improves its ability to predict user-AI accuracy, boosting correlation on held-out questions by more than 20 points, thereby enabling scalable interactive evaluation.


Distill-MOS: A compact speech-quality assessment model 

Figure: XLS-R-based speech quality assessment and its use as a teacher model for distillation with unlabeled speech.

Distill-MOS is a compact and efficient speech quality assessment model with dramatically reduced size—over 100x smaller than the reference model—enabling efficient, non-intrusive evaluation in real-world, low-resource settings. 

This paper investigates distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. The researchers’ experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. They retrain this model on a large compilation of mean opinion score datasets, encompassing over 100,000 labeled clips.


Collaborating to Affect Change for Rural Health Care with Innovation and Technology

Senior Vice President of Microsoft Health Jim Weinstein joins Dan Liljenquist, Chief Strategy Officer at Intermountain Health, on the NEJM Catalyst podcast to discuss how they are combining expertise and resources to address healthcare challenges in the rural United States. These challenges include limited access to care, rising mortality rates, and severe staffing shortages. Working together, they aim to create a scalable model that can benefit both rural and urban health care systems. Key goals include expanding access through telemedicine and strengthening cybersecurity, ultimately improving the quality of care delivered and the financial stability of rural communities.


Empowering patients and healthcare consumers in the age of generative AI

Two champions of patient-centered digital health join Microsoft Research President Peter Lee to talk about how AI is reshaping healthcare in terms of patient empowerment and emerging digital health business models. Dave deBronkart, a cancer survivor and longtime advocate for patient empowerment, discusses how AI tools like ChatGPT can help patients better understand their conditions, navigate the healthcare system, and communicate more effectively with clinicians. Christina Farr, a healthcare investor and former journalist, talks about the evolving digital health–startup ecosystem, highlighting where AI is having the most meaningful impact—particularly in women’s health, pediatrics, and elder care. She also explores consumer trends, like the rise of cash-pay healthcare. 


Beyond the Image: AI’s Expanding Role in Healthcare

Jonathan Carlson, Managing Director of Microsoft Research Health Futures, joins the Healthcare Unfiltered show to explore the evolution of AI in medicine, from the early days to cutting-edge innovations like ambient clinical intelligence. This podcast explores how pre-trained models and machine learning are transforming care delivery, as well as the future of biomedicine and healthcare, including important ethical and practical questions.

How the Economics of Inference Can Maximize AI Value

As AI models evolve and adoption grows, enterprises must perform a delicate balancing act to achieve maximum value.

That’s because inference — the process of running data through a model to get an output — offers a different computational challenge than training a model.

Pretraining a model — the process of ingesting data, breaking it down into tokens and finding patterns — is essentially a one-time cost. But in inference, every prompt to a model generates tokens, each of which incurs a cost.

That means that as AI model performance and use increase, so do the number of tokens generated and their associated computational costs. For companies looking to build AI capabilities, the key is generating as many tokens as possible — with maximum speed, accuracy and quality of service — without sending computational costs skyrocketing.
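To make the per-token cost dynamic concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it (request volume, average token counts, price per million tokens) is an illustrative assumption rather than a figure from this article.

```python
# Hypothetical back-of-the-envelope inference cost estimate.
# All numbers below are illustrative assumptions, not figures from the article.

requests_per_day = 50_000          # assumed daily prompt volume
avg_input_tokens = 500             # assumed prompt length
avg_output_tokens = 300            # assumed completion length
price_per_million_tokens = 2.00    # assumed blended $ price per 1M tokens

tokens_per_day = requests_per_day * (avg_input_tokens + avg_output_tokens)
daily_cost = tokens_per_day / 1_000_000 * price_per_million_tokens

print(f"Tokens per day: {tokens_per_day:,}")               # 40,000,000
print(f"Estimated daily cost: ${daily_cost:,.2f}")          # $80.00
print(f"Estimated monthly cost: ${daily_cost * 30:,.2f}")   # $2,400.00
```

Under these assumptions, doubling the average completion length or the request volume roughly doubles the daily token bill, which is why faster, more efficient token generation matters.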

As such, the AI ecosystem has been working to make inference cheaper and more efficient. Inference costs have been trending down for the past year thanks to major leaps in model optimization and to increasingly advanced, energy-efficient accelerated computing infrastructure and full-stack solutions.

According to the Stanford University Institute for Human-Centered AI’s 2025 AI Index Report, “the inference cost for a system performing at the level of GPT-3.5 dropped over 280-fold between November 2022 and October 2024. At the hardware level, costs have declined by 30% annually, while energy efficiency has improved by 40% each year. Open-weight models are also closing the gap with closed models, reducing the performance difference from 8% to just 1.7% on some benchmarks in a single year. Together, these trends are rapidly lowering the barriers to advanced AI.”

As models evolve, generating more demand and creating more tokens, enterprises need to scale their accelerated computing resources to deliver the next generation of AI reasoning tools or risk rising costs and energy consumption.

What follows is a primer on the key concepts of the economics of inference, which enterprises can use to position themselves to achieve efficient, cost-effective and profitable AI solutions at scale.

Key Terminology for the Economics of AI Inference

Knowing key terms of the economics of inference helps set the foundation for understanding its importance.

Tokens are the fundamental unit of data in an AI model. They’re derived during training from data such as text, images, audio clips and videos. Through a process called tokenization, each piece of data is broken down into smaller constituent units. During training, the model learns the relationships between tokens so it can perform inference and generate an accurate, relevant output.
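As a rough illustration of what tokenization does, here is a toy sketch that maps whitespace-separated pieces of text to integer IDs. Real systems use learned subword vocabularies such as byte-pair encoding; this example only shows the text-to-token-ID mapping in miniature.

```python
# A toy, whitespace-level "tokenizer" to illustrate the idea that text is broken
# into small units that are mapped to integer IDs. Production systems use learned
# subword schemes such as byte-pair encoding; this sketch is illustrative only.

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every unique whitespace-separated piece."""
    vocab: dict[str, int] = {}
    for text in corpus:
        for piece in text.lower().split():
            vocab.setdefault(piece, len(vocab))
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token IDs, using -1 for pieces outside the vocabulary."""
    return [vocab.get(piece, -1) for piece in text.lower().split()]

corpus = ["the model learns relationships between tokens",
          "tokens are the fundamental unit of data"]
vocab = build_vocab(corpus)
print(tokenize("the tokens the model learns", vocab))  # [0, 5, 0, 1, 2]
```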

Throughput refers to the amount of data — typically measured in tokens — that the model can output in a specific amount of time, which itself is a function of the infrastructure running the model. Throughput is often measured in tokens per second, with higher throughput meaning greater return on infrastructure.

Latency is a measure of the amount of time between inputting a prompt and the start of the model’s response. Lower latency means faster responses. The two main ways of measuring latency are:

  • Time to First Token: A measurement of the initial processing time required by the model to generate its first output token after a user prompt.
  • Time per Output Token: The average time between consecutive tokens — or the time it takes to generate a completion token for each user querying the model at the same time. It’s also known as “inter-token latency” or token-to-token latency.

Time to first token and time per output token are helpful benchmarks, but they’re just two pieces of a larger equation. Focusing solely on them can still lead to deteriorating performance or rising costs.

To account for other interdependencies, IT leaders are starting to measure “goodput,” which is defined as the throughput achieved by a system while maintaining target time to first token and time per output token levels. This metric allows organizations to evaluate performance in a more holistic manner, ensuring that throughput, latency and cost are aligned to support both operational efficiency and an exceptional user experience.
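Here is a minimal sketch of how throughput and goodput could be computed from per-request measurements. The request records and the latency targets (0.5 seconds to first token, 50 ms per output token) are hypothetical values chosen for illustration, not recommended service levels.

```python
# Minimal sketch of computing throughput and "goodput" from per-request inference
# metrics. The request records and SLO targets below are hypothetical.

from dataclasses import dataclass

@dataclass
class Request:
    ttft_s: float          # time to first token, seconds
    tpot_s: float          # average time per output token, seconds
    output_tokens: int     # tokens generated for this request

# Assumed service-level targets: first token within 0.5 s, 50 ms per output token.
TTFT_TARGET_S = 0.5
TPOT_TARGET_S = 0.05

def throughput(requests: list[Request], window_s: float) -> float:
    """All generated tokens per second over the measurement window."""
    return sum(r.output_tokens for r in requests) / window_s

def goodput(requests: list[Request], window_s: float) -> float:
    """Tokens per second counting only requests that met both latency targets."""
    ok = [r for r in requests if r.ttft_s <= TTFT_TARGET_S and r.tpot_s <= TPOT_TARGET_S]
    return sum(r.output_tokens for r in ok) / window_s

reqs = [Request(0.3, 0.04, 400), Request(0.9, 0.04, 600), Request(0.4, 0.07, 500)]
print(throughput(reqs, window_s=10.0))  # 150.0 tokens/s
print(goodput(reqs, window_s=10.0))     # 40.0 tokens/s; only the first request qualifies
```

The gap between the two numbers shows how much raw throughput is being delivered to users who are not getting an acceptable experience.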

Energy efficiency is the measure of how effectively an AI system converts power into computational output, expressed as performance per watt. By using accelerated computing platforms, organizations can maximize tokens per watt while minimizing energy consumption.

How the Scaling Laws Apply to Inference Cost

The three AI scaling laws are also core to understanding the economics of inference:

  • Pretraining scaling: The original scaling law that demonstrated that by increasing training dataset size, model parameter count and computational resources, models can achieve predictable improvements in intelligence and accuracy.
  • Post-training: A process where models are fine-tuned for accuracy and specificity so they can be applied to application development. Techniques like retrieval-augmented generation can be used to return more relevant answers from an enterprise database.
  • Test-time scaling (aka “long thinking” or “reasoning”): A technique by which models allocate additional computational resources during inference to evaluate multiple possible outcomes before arriving at the best answer. A minimal sketch of this idea follows this list.
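The sketch below illustrates one simple form of test-time scaling, best-of-n sampling: generate several candidate answers and keep the one a scorer prefers. The `generate_candidate` and `score_answer` functions are hypothetical stand-ins for a model call and a verifier or reward model; they are not APIs from this article.

```python
# Minimal sketch of test-time scaling via best-of-n sampling. Spending more
# inference compute (more samples) buys a better chance of a good final answer.

import random

def generate_candidate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for sampling one completion from a model."""
    random.seed(seed)
    return f"candidate answer {random.randint(0, 999)} for: {prompt}"

def score_answer(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a verifier or reward model scoring the answer."""
    return float(len(answer) % 7)  # placeholder scoring logic

def best_of_n(prompt: str, n: int) -> str:
    """Generate n candidates and return the one the scorer prefers."""
    candidates = [generate_candidate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda ans: score_answer(prompt, ans))

print(best_of_n("What is 17 * 24?", n=8))
```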

Even as AI evolves and post-training and test-time scaling techniques become more sophisticated, pretraining isn’t disappearing and remains an important way to scale models. Pretraining will still be needed to support post-training and test-time scaling.

Profitable AI Takes a Full-Stack Approach

In comparison to inference from a model that’s only gone through pretraining and post-training, models that harness test-time scaling generate multiple tokens to solve a complex problem. This results in more accurate and relevant model outputs — but is also much more computationally expensive.

Smarter AI means generating more tokens to solve a problem. And a quality user experience means generating those tokens as fast as possible. The smarter and faster an AI model is, the more utility it will have to companies and customers.

Enterprises need to scale their accelerated computing resources to deliver the next generation of AI reasoning tools that can support complex problem-solving, coding and multistep planning without skyrocketing costs.

This requires both advanced hardware and a fully optimized software stack. NVIDIA’s AI factory product roadmap is designed to meet this computational demand and help solve for the complexity of inference, while achieving greater efficiency.

AI factories integrate high-performance AI infrastructure, high-speed networking and optimized software to produce intelligence at scale. These components are designed to be flexible and programmable, allowing businesses to prioritize the areas most critical to their models or inference needs.

To further streamline operations when deploying massive AI reasoning models, AI factories run on a high-performance, low-latency inference management system that ensures the speed and throughput required for AI reasoning are met at the lowest possible cost to maximize token revenue generation.

Learn more by reading the ebook “AI Inference: Balancing Cost, Latency and Performance.”

Carnegie Mellon University at ICLR 2025

CMU researchers are presenting 143 papers at the Thirteenth International Conference on Learning Representations (ICLR 2025), held from April 24–28 at the Singapore EXPO.

Oral Papers

Backtracking Improves Generation Safety

Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E Weston, Eric Michael Smith

This paper introduces backtracking, a new technique that allows language models to recover from unsafe text generation by using a special [RESET] token to “undo” problematic outputs. Unlike traditional safety methods that aim to prevent harmful responses outright, backtracking trains the model to self-correct mid-generation. The authors demonstrate that backtracking significantly improves safety without sacrificing helpfulness, and it also provides robustness against several adversarial attacks.
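The sketch below illustrates only the decoding-time behavior described above, not the paper’s training procedure or implementation: when a [RESET] token is emitted, the partial response generated so far is discarded and generation continues. The `next_token` callable is a hypothetical stand-in for one decoding step of a backtracking-trained model.

```python
# Illustrative sketch of backtracking at decoding time: a [RESET] token discards
# the partial response and generation continues fresh. Toy code, not the paper's.

RESET = "[RESET]"
EOS = "[EOS]"

def generate_with_backtracking(prompt: str, next_token, max_tokens: int = 256) -> str:
    generated: list[str] = []
    for _ in range(max_tokens):
        tok = next_token(prompt, generated)
        if tok == EOS:
            break
        if tok == RESET:
            generated = []          # undo the problematic partial response
            continue
        generated.append(tok)
    return " ".join(generated)

# Toy decoder: drifts toward an unsafe continuation, emits [RESET], then recovers.
script = ["Sure,", "here's", "something", "harmful", RESET,
          "I", "can't", "help", "with", "that.", EOS]
demo = iter(script)
print(generate_with_backtracking("...", lambda prompt, generated: next(demo)))
# -> "I can't help with that."
```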

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors: Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm De Vries, Leandro Von Werra

Recent advances in LLMs have enabled task automation through Python code, but existing benchmarks mainly focus on simple, self-contained tasks. To assess LLMs’ ability to handle more practical challenges requiring diverse and compositional function use, the authors introduce BigCodeBench—a benchmark covering 1,140 tasks across 139 libraries and 7 domains. Each task includes rigorous testing with high branch coverage, and a variant, BigCodeBench-Instruct, reformulates instructions for natural language evaluation. Results from testing 60 LLMs reveal significant performance gaps, highlighting that current models struggle to follow complex instructions and compose function calls accurately compared to human performance.

Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance

Authors: Sachin Goyal, Christina Baek, J Zico Kolter, Aditi Raghunathan

LLMs are expected to follow user-provided context, especially when they contain new or conflicting information. While instruction finetuning should improve this ability, the authors uncover a surprising failure mode called context-parametric inversion: models initially rely more on input context, but this reliance decreases as finetuning continues—even as benchmark performance improves. Through controlled experiments and theoretical analysis, the authors trace the cause to training examples where context aligns with pretraining knowledge, reinforcing parametric reliance. They suggest mitigation strategies and highlight this as a key challenge in instruction tuning.

EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

Embodied tasks demand fine-grained 3D perception, which is difficult to achieve due to limited high-quality 3D data. To address this, the authors propose a method that leverages the Segment Anything Model (SAM) for online 3D instance segmentation by transforming 2D masks into 3D-aware queries. Their approach enables real-time object matching across video frames and efficient inference using a similarity matrix. Experiments across multiple datasets show that the method outperforms offline alternatives and generalizes well to new settings with minimal data.

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Authors: Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, Chandan K. Reddy

Mathematical equations are remarkably effective at describing natural phenomena, but discovering them from data is challenging due to vast combinatorial search spaces. Existing symbolic regression methods often overlook domain knowledge and rely on limited representations. To address this, the authors propose LLM-SR, a novel approach that uses Large Language Models to generate equation hypotheses informed by scientific priors and refines them through evolutionary search. Evaluated across multiple scientific domains, LLM-SR outperforms existing methods, particularly in generalization, by efficiently exploring the equation space and producing accurate, interpretable models.

Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Authors: Yuda Song, Hanlin Zhang, Udaya Ghai, Carson Eisenach, Sham M. Kakade, Dean Foster

Self-improvement in Large Language Models involves the model verifying its outputs, filtering data accordingly, and using the refined data for further learning. While effective in practice, there has been little theoretical grounding for this technique. This work presents a comprehensive study of LLM self-improvement, introducing a formal framework centered on the generation-verification gap—a key quantity that governs self-improvement. Experiments reveal that this gap scales consistently with pretraining FLOPs across tasks and model families. The authors also explore when and how iterative self-improvement works and offer insights and strategies to enhance it.

On the Benefits of Memory for Modeling Time-Dependent PDEs

Authors: Ricardo Buitrago, Tanya Marwah, Albert Gu, Andrej Risteski

Data-driven methods offer an efficient alternative to traditional numerical solvers for PDEs, but most existing approaches assume Markovian dynamics, limiting their effectiveness when input signals are distorted. Inspired by the Mori-Zwanzig theory, the authors propose MemNO, a Memory Neural Operator that explicitly incorporates past states using structured state-space models and the Fourier Neural Operator. MemNO demonstrates strong performance on various PDE families, especially on low-resolution inputs, achieving over six times lower error than memoryless baselines.

On the Identification of Temporal Causal Representation with Instantaneous Dependence

Authors: Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Guangyi Chen, Kun Zhang

This work introduces IDOL (Identification framework for Instantaneous Latent dynamics), a method designed to identify latent causal processes in time series data, even when instantaneous relationships are present. Unlike existing methods that require interventions or grouping of observations, IDOL imposes a sparse influence constraint, allowing both time-delayed and instantaneous causal relations to be captured. Through a temporally variational inference architecture and gradient-based sparsity regularization, IDOL effectively estimates latent variables. Experimental results show that IDOL can identify latent causal processes in simulations and real-world human motion forecasting tasks, demonstrating its practical applicability.

Progressive distillation induces an implicit curriculum

Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel

This work explores the concept of progressive distillation, where a student model learns from intermediate checkpoints of a teacher model, rather than just the final model. The authors identify an “implicit curriculum” that emerges through these intermediate checkpoints, which accelerates the student’s learning and provides a sample complexity benefit. Using sparse parity as a sandbox, they demonstrate that this curriculum imparts valuable learning steps that are unavailable from the final teacher model. The study extends this idea to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world datasets, showing that the teacher progressively teaches the student to capture longer contexts. Both theoretical and empirical results highlight the effectiveness of progressive distillation across different tasks.

Scaling Laws for Precision

Authors: Tanishq Kumar, Zachary Ankner, Benjamin Frederick Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Re, Aditi Raghunathan

This work introduces precision-aware scaling laws that extend traditional scaling frameworks to account for the effects of low-precision training and inference in language models. The authors show that lower precision effectively reduces a model’s usable parameter count, enabling predictions of performance degradation due to quantization. For inference, they find that post-training quantization causes increasing degradation with more pretraining data, potentially making additional training counterproductive. Their unified framework predicts loss across varying precisions and suggests that training larger models in lower precision may be more compute-efficient. These predictions are validated on over 465 pretraining runs, including models up to 1.7B parameters.

Self-Improvement in Language Models: The Sharpening Mechanism

Authors: Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy

This paper presents a theoretical framework for understanding how LLMs can self-improve by using themselves as verifiers to refine their own outputs, a process the authors call “sharpening.” The key insight is that LLMs are often better at judging response quality than generating high-quality responses outright, so sharpening helps concentrate probability mass on better sequences. The paper analyzes two families of self-improvement algorithms: one based on supervised fine-tuning (SFT) and one on reinforcement learning (RLHF). They show that while the SFT-based approach is optimal under certain conditions, the RLHF-based approach can outperform it by actively exploring beyond the model’s existing knowledge.

When Selection meets Intervention: Additional Complexities in Causal Discovery

Authors: Haoyue Dai, Ignavier Ng, Jianle Sun, Zeyu Tang, Gongxu Luo, Xinshuai Dong, Peter Spirtes, Kun Zhang

This work tackles the often-overlooked issue of selection bias in interventional studies, where participants are selectively included based on specific criteria. Existing causal discovery methods typically ignore this bias, leading to inaccurate conclusions. To address this, the authors introduce a novel graphical model that distinguishes between the observed world with interventions and the counterfactual world where selection occurs. They develop a sound algorithm that identifies both causal relationships and selection mechanisms, demonstrating its effectiveness through experiments on both synthetic and real-world data.

miniCTX: Neural Theorem Proving with (Long-)Contexts

Authors: Jiewen Hu, Thomas Zhu, Sean Welleck

Real-world formal theorem proving relies heavily on rich contextual information, which is often absent from traditional benchmarks. To address this, the authors introduce miniCTX, a benchmark designed to test models’ ability to prove theorems using previously unseen, extensive context from real Lean projects and textbooks. Unlike prior benchmarks, miniCTX includes large repositories with relevant definitions, lemmas, and structures. Baseline experiments show that models conditioned on this broader context significantly outperform those relying solely on the local state. The authors also provide a toolkit to facilitate the expansion of the benchmark.

Spotlight Papers

ADIFF: Explaining audio difference using natural language

Authors: Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj

This paper tackles the novel task of explaining differences between audio recordings, which is important for applications like audio forensics, quality assessment, and generative audio systems. The authors introduce two new datasets and propose a three-tiered explanation framework—ranging from concise event descriptions to rich, emotionally grounded narratives—generated using large language models. They present ADIFF, a new method that improves on baselines by incorporating audio cross-projection, position-aware captioning, and multi-stage training, and show that it significantly outperforms existing audio-language models both quantitatively and via human evaluation.

Better Instruction-Following Through Minimum Bayes Risk

Authors: Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Khoshfetrat Pakazad, Graham Neubig

This paper explores how LLMs can be used as judges to evaluate and improve other LLMs. The authors show that using a method called Minimum Bayes Risk (MBR) decoding—where an LLM judge selects the best output from a set—can significantly improve model performance compared to standard decoding methods. They also find that training models on these high-quality outputs can lead to strong gains even without relying on MBR at test time, making the models faster and more efficient while maintaining or exceeding previous performance.
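As a rough illustration of MBR decoding, the sketch below samples a set of candidate outputs and returns the one with the highest average utility against the others. The `utility` function here is a crude word-overlap stand-in for the LLM judge used in the paper; it is an assumption for demonstration only.

```python
# Minimal sketch of Minimum Bayes Risk (MBR) decoding: pick the candidate with the
# highest expected utility against the other sampled candidates. The utility
# function is a toy word-overlap proxy, not the paper's LLM judge.

def utility(candidate: str, reference: str) -> float:
    """Crude word-overlap similarity between two outputs (stand-in for a judge)."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(a | b), 1)

def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility over the sample set."""
    def expected_utility(c: str) -> float:
        others = [r for r in candidates if r is not c]
        return sum(utility(c, r) for r in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
    "I am not sure about that.",
]
print(mbr_select(samples))  # selects one of the mutually consistent answers
```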

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Authors: Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin

This paper introduces DeFT, a new algorithm that speeds up how large language models handle tasks involving tree-like structures with shared text prefixes, such as multi-step reasoning or few-shot prompting. Existing methods waste time and memory by repeatedly accessing the same data and poorly distributing the workload across the GPU. DeFT solves this by smartly grouping and splitting memory usage to avoid redundant operations and better balance the work, leading to up to 3.6x faster performance on key tasks compared to current approaches.

Holistically Evaluating the Environmental Impact of Creating Language Models

Authors: Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, Jesse Dodge

This paper estimates the full environmental impact of developing large language models, including not just the final training runs but also model development and hardware manufacturing—areas typically underreported. The authors found that training a series of models released 493 metric tons of carbon emissions and used 2.769 million liters of water, even in a highly efficient data center. Notably, around half of the carbon emissions came from the development phase alone, and power usage during training varied significantly, raising concerns for energy grid planning as AI systems grow.

Language Model Alignment in Multilingual Trolley Problems

Authors: Zhijing Jin, Max Kleiman-weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez Adauto, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf

This paper evaluates how well LLMs align with human moral preferences across languages using multilingual trolley problems. The authors introduce MultiTP, a new dataset of moral dilemmas in over 100 languages based on the Moral Machine experiment, enabling cross-lingual analysis of LLM decision-making. By assessing 19 models across six moral dimensions and examining demographic correlations and prompt consistency, they uncover significant variation in moral alignment across languages—highlighting ethical biases and the need for more inclusive, multilingual approaches to responsible AI development.

Lean-STaR: Learning to Interleave Thinking and Proving

Authors: Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

This paper introduces Lean-STaR, a framework that improves language model-based theorem proving by incorporating informal “thoughts” before each proof step. Unlike traditional approaches that rely solely on formal proof data, Lean-STaR generates synthetic thought processes using retrospective proof tactics during training. At inference time, the model generates these thoughts to guide its next action, and expert iteration further refines its performance using the Lean theorem prover. This approach boosts proof success rates and offers new insights into how structured reasoning improves formal mathematical problem solving.

MagicPIG: LSH Sampling for Efficient LLM Generation

Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

This paper introduces MagicPIG, a new system that speeds up LLM inference by approximating attention more efficiently. While many methods assume attention is sparse and use TopK approximations, the authors show this isn’t always accurate and can hurt performance. Instead, MagicPIG uses a sampling method backed by theoretical guarantees and accelerates it using Locality Sensitive Hashing, offloading computations to the CPU to support longer inputs and larger batches without sacrificing accuracy.

Multi-Robot Motion Planning with Diffusion Models

Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev

This paper introduces a method for planning coordinated, collision-free movements for many robots using only data from individual robots. The authors combine learned diffusion models with classical planning algorithms to generate realistic, safe multi-robot trajectories. Their approach, called Multi-robot Multi-model planning Diffusion, also scales to large environments by stitching together multiple diffusion models, showing strong results in simulated logistics scenarios.

Reinforcement Learning for Control of Non-Markovian Cellular Population Dynamics

Authors: Josiah C Kratz, Jacob Adamczyk

This paper explores how reinforcement learning can be used to develop drug dosing strategies for controlling cell populations that adapt over time, such as cancer cells switching between resistant and susceptible states. Traditional methods struggle when the system’s dynamics are unknown or involve memory of past environments, making optimal control difficult. The authors show that deep RL can successfully learn effective strategies even in complex, memory-based systems, offering a promising approach for real-world biomedical applications.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

This paper explores how to improve large language models’ reasoning by giving feedback at each step of their thinking process, rather than only at the final answer. The authors introduce a method where feedback—called a process reward—is based on whether a step helps make a correct final answer more likely, as judged by a separate model (a “prover”) that can recognize progress better than the model being trained. They show both theoretically and experimentally that this strategy makes learning more efficient, leading to significantly better and faster results than traditional outcome-based feedback methods.

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Junxian Guo, Xiuyu Li, Enze Xie, Chenlin Meng, Jun-yan Zhu, Song Han

This paper introduces SVDQuant, a method for significantly speeding up diffusion models by quantizing both weights and activations to 4 bits. Since such aggressive quantization can hurt image quality, the authors use a clever technique: they shift problematic “outlier” values into a separate low-rank component handled with higher precision, while the rest is processed with efficient low-bit operations. To avoid slowing things down due to extra computation, they also design a custom inference engine called Nunchaku, which merges the processing steps to minimize memory access. Together, these techniques reduce memory usage and deliver over 3x speedups without sacrificing image quality.
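The sketch below illustrates the core decomposition idea in NumPy: keep a small low-rank component of the weight matrix in higher precision to absorb outlier energy, and quantize the residual to 4 bits. It is a toy approximation for intuition only, not the SVDQuant method or its Nunchaku inference engine.

```python
# Toy sketch: approximate a weight matrix as a high-precision low-rank part plus a
# 4-bit quantized residual. Illustrative only; not the SVDQuant implementation.

import numpy as np

def lowrank_plus_int4(W: np.ndarray, rank: int = 8):
    # High-precision low-rank part from a truncated SVD.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

    # Symmetric 4-bit quantization of the residual (16 levels: -8 .. 7).
    R = W - L
    scale = np.abs(R).max() / 7.0 + 1e-12
    q = np.clip(np.round(R / scale), -8, 7).astype(np.int8)
    return L.astype(np.float16), q, scale

def dequantize(L, q, scale):
    return L.astype(np.float32) + q.astype(np.float32) * scale

W = np.random.randn(256, 256).astype(np.float32)
L, q, scale = lowrank_plus_int4(W, rank=16)
err = np.abs(W - dequantize(L, q, scale)).mean()
print(f"mean absolute reconstruction error: {err:.4f}")
```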

Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Authors: Eliot Xing, Vernon Luk, Jean Oh

This paper tackles the challenge of applying reinforcement learning (RL) to soft-body robotics, where simulations are usually too slow for data-hungry RL algorithms. The authors introduce SAPO, a new model-based RL algorithm that efficiently learns from differentiable simulations using analytic gradients. The authors also present Rewarped, a fast, parallel simulation platform that supports both rigid and deformable materials, demonstrating that their approach outperforms existing methods on complex manipulation and locomotion tasks.

Streaming Algorithms for $\ell_p$ Flows and $\ell_p$ Regression

Authors: Amit Chakrabarti, Jeffrey Jiang, David Woodruff, Taisuke Yasuda

This paper investigates how to solve underdetermined linear regression problems in a streaming setting, where the data arrives one column at a time and storing the full dataset is impractical. The authors develop algorithms that approximate the regression cost or output a near-optimal solution using much less memory than storing the entire dataset—particularly relevant for applications like computing flows on large graphs. They also establish space lower bounds, showing the limitations of what’s possible, and provide the first algorithms that achieve nontrivial approximations using sublinear space in various settings.

Poster Papers

Alignment, Fairness, Safety, Privacy, And Societal Considerations

$\beta$-calibration of Language Model Confidence Scores for Generative QA

Authors: Putra Manggala, Atalanti A. Mastakouri, Elke Kirschbaum, Shiva Kasiviswanathan, Aaditya Ramdas

AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks

Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, Xander Davies

Aligned LLMs Are Not Aligned Browser Agents

Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine T Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean M. Hendryx, Summer Yue, Zifan Wang

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Authors: Keltin Grimes, Marco Christiani, David Shriver, Marissa Catherine Connor

Decision-Focused Uncertainty Quantification

Authors: Santiago Cortes-gomez, Carlos Miguel Patiño, Yewon Byun, Steven Wu, Eric Horvitz, Bryan Wilder

Dissecting Adversarial Robustness of Multimodal LM Agents

Authors: Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Authors: Zhenting Qi, Hanlin Zhang, Eric P. Xing, Sham M. Kakade, Himabindu Lakkaraju

Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander Cong Li, Ananya Kumar, Deepak Pathak

Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks

Authors: Shengyuan Hu, Yiwei Fu, Steven Wu, Virginia Smith

Noisy Test-Time Adaptation in Vision-Language Models

Authors: Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han

Pacmann: Efficient Private Approximate Nearest Neighbor Search

Authors: Mingxun Zhou, Elaine Shi, Giulia Fanti

Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs

Authors: Xuandong Zhao, Lei Li, Yu-xiang Wang

Persistent Pre-training Poisoning of LLMs

Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

Prompting Fairness: Integrating Causality to Debias Large Language Models

Authors: Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu

Reconciling Model Multiplicity for Downstream Decision Making

Authors: Ally Yalei Du, Dung Daniel Ngo, Steven Wu

Self-Play Preference Optimization for Language Model Alignment

Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

Toward Robust Defenses Against LLM Weight Tampering Attacks

Authors: Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika

Applications To Computer Vision, Audio, Language, And Other Modalities

Agent-to-Sim: Learning Interactive Behavior Model from Casual Longitudinal Videos

Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa

Context-aware Dynamic Pruning for Speech Foundation Models

Authors: Masao Someki, Yifan Peng, Siddhant Arora, Markus Müller, Athanasios Mouchtaris, Grant Strimel, Jing Liu, Shinji Watanabe

Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Authors: Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-xiong Wang

Fugatto 1: Foundational Generative Audio Transformer Opus 1

Authors: Rafael Valle, Rohan Badlani, Zhifeng Kong, Sang-gil Lee, Arushi Goel, Joao Felipe Santos, Aya Aljafari, Sungwon Kim, Shuqi Dai, Siddharth Gururani, Alexander H. Liu, Kevin J. Shih, Ryan Prenger, Wei Ping, Chao-han Huck Yang, Bryan Catanzaro

Gaussian Splatting Lucas-Kanade

Authors: Liuyue Xie, Joel Julin, Koichiro Niinuma, Laszlo Attila Jeni

ImageFolder: Autoregressive Image Generation with Folded Tokens

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

Improving Large Language Model based Multi-Agent Framework through Dynamic Workflow Updating

Authors: Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-yan He, Zhi-qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G Hauptmann

OMG: Opacity Matters in Material Modeling with Gaussian Splatting

Authors: Silong Yong, Venkata Nagarjun Pudureddiyur Manivannan, Bernhard Kerbl, Zifu Wan, Simon Stepputtis, Katia P. Sycara, Yaqi Xie

Scene Flow as a Partial Differential Equation

Authors: Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kemal Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, Joachim Pehserl

TrackTheMind: program-guided adversarial data generation for theory of mind reasoning

Authors: Melanie Sclar, Jane Dwivedi-yu, Maryam Fazel-zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, Asli Celikyilmaz

Understanding Visual Concepts Across Models

Authors: Brandon Trabucco, Max A Gurinas, Kyle Doherty, Russ Salakhutdinov

Applications To Neuroscience & Cognitive Science

Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers

Authors: Andrew Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Rajat Dewan, Margaret Marie Henderson, Leila Wehbe, Michael J. Tarr

Self-Attention-Based Contextual Modulation Improves Neural System Identification

Authors: Isaac Lin, Tianye Wang, Shang Gao, Tang Shiming, Tai Sing Lee

Applications To Physical Sciences (Physics, Chemistry, Biology, Etc.)

Causal Representation Learning from Multimodal Biological Observations

Authors: Yuewen Sun, Lingjing Kong, Guangyi Chen, Loka Li, Gongxu Luo, Zijian Li, Yixuan Zhang, Yujia Zheng, Mengyue Yang, Petar Stojanov, Eran Segal, Eric P. Xing, Kun Zhang

Chemistry-Inspired Diffusion with Non-Differentiable Guidance

Authors: Yuchen Shen, Chenhao Zhang, Sijie Fu, Chenghui Zhou, Newell Washburn, Barnabas Poczos

Text2PDE: Latent Diffusion Models for Accessible Physics Simulation

Authors: Anthony Zhou, Zijie Li, Michael Schneier, John R Buchanan Jr, Amir Barati Farimani

Applications To Robotics, Autonomy, Planning

Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback

Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Yang Wang

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

Causal Reasoning

A Conditional Independence Test in the Presence of Discretization

Authors: Boyang Sun, Yu Yao, Guang-yuan Hao, Yumou Qiu, Kun Zhang

A Robust Method to Discover Causal or Anticausal Relation

Authors: Yu Yao, Yang Zhou, Bo Han, Mingming Gong, Kun Zhang, Tongliang Liu

A Skewness-Based Criterion for Addressing Heteroscedastic Noise in Causal Discovery

Authors: Yingyu Lin, Yuxing Huang, Wenqin Liu, Haoran Deng, Ignavier Ng, Kun Zhang, Mingming Gong, Yian Ma, Biwei Huang

Analytic DAG Constraints for Differentiable DAG Learning

Authors: Zhen Zhang, Ignavier Ng, Dong Gong, Yuhang Liu, Mingming Gong, Biwei Huang, Kun Zhang, Anton Van Den Hengel, Javen Qinfeng Shi

Causal Graph Transformer for Treatment Effect Estimation Under Unknown Interference

Authors: Anpeng Wu, Haiyi Qiu, Zhengming Chen, Zijian Li, Ruoxuan Xiong, Fei Wu, Kun Zhang

Differentiable Causal Discovery for Latent Hierarchical Causal Models

Authors: Parjanya Prajakta Prashant, Ignavier Ng, Kun Zhang, Biwei Huang

Datasets And Benchmarks

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Authors: Chien-yu Huang, Wei-chih Chen, Shu-wen Yang, Andy T. Liu, Chen-an Li, Yu-xiang Lin, Wei-cheng Tseng, Anuj Diwan, Yi-jen Shih, Jiatong Shi, William Chen, Xuanjun Chen, Chi-yuan Hsiao, Puyuan Peng, Shih-heng Wang, Chun-yi Kuan, Ke-han Lu, Kai-wei Chang, Chih-kai Yang, Fabian Alejandro Ritter Gutierrez, Huang Kuan-po, Siddhant Arora, You-kuan Lin, Chuang Ming To, Eunjung Yeo, Kalvin Chang, Chung-ming Chien, Kwanghee Choi, Cheng-hsiu Hsieh, Yi-cheng Lin, Chee-en Yu, I-hsiang Chiu, Heitor Guimarães, Jionghao Han, Tzu-quan Lin, Tzu-yuan Lin, Homu Chang, Ting-wu Chang, Chun Wei Chen, Shou-jen Chen, Yu-hua Chen, Hsi-chun Cheng, Kunal Dhawan, Jia-lin Fang, Shi-xin Fang, Kuan Yu Fang Chiang, Chi An Fu, Hsien-fu Hsiao, Ching Yu Hsu, Shao-syuan Huang, Lee Chen Wei, Hsi-che Lin, Hsuan-hao Lin, Hsuan-ting Lin, Jian-ren Lin, Ting-chun Liu, Li-chun Lu, Tsung-min Pai, Ankita Pasad, Shih-yun Shan Kuan, Suwon Shon, Yuxun Tang, Yun-shao Tsai, Wei Jui Chiang, Tzu-chieh Wei, Chengxi Wu, Dien-ruei Wu, Chao-han Huck Yang, Chieh-chi Yang, Jia Qi Yip, Shao-xiang Yuan, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-yi Lee

GameArena: Evaluating LLM Reasoning through Live Computer Games

Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

Harnessing Webpage UIs for Text-Rich Visual Understanding

Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video

Authors: Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-roberson, Sebastian Scherer, Xiaonan Huang

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Authors: Muhammad A Shah, David Solans Noguero, Mikko A. Heikkilä, Bhiksha Raj, Nicolas Kourtellis

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Authors: Siddhant Arora, Zhiyun Lu, Chung-cheng Chiu, Ruoming Pang, Shinji Watanabe

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Authors: Kush Jain, Gabriel Synnaeve, Baptiste Roziere

Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models

Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Authors: Lawrence Keunho Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

Foundation Or Frontier Models, Including Llms

Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws

Authors: Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, J Zico Kolter

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh R N, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

Authors: Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

Improving Large Language Model Planning with Action Sequence Similarity

Authors: Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, Azade Nova

Inference Optimal VLMs Need Only One Visual Token but Larger Models

Authors: Kevin Li, Sachin Goyal, João D. Semedo, J Zico Kolter

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving

Authors: Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang

Language Models Need Inductive Biases to Count Inductively

Authors: Yingshan Chang, Yonatan Bisk

MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Authors: Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng

Mixture of Parrots: Experts improve memorization more than reasoning

Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-melis, Yuanzhi Li, Sham M. Kakade, Eran Malach

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Authors: Xiang Yue, Yueqi Song, Akari Asai, Simran Khanuja, Anjali Kantharuban, Seungone Kim, Jean De Dieu Nyandwi, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

Physics of Language Models: Part 3.2, Knowledge Manipulation

Authors: Zeyuan Allen-zhu, Yuanzhi Li

Scaling Long Context Training Data by Long-Distance Referrals

Authors: Yonghao Zhuang, Lanxiang Hu, Longfei Yun, Souvik Kundu, Zhengzhong Liu, Eric P. Xing, Hao Zhang

Sparse Matrix in Large Language Model Fine-tuning

Authors: Haoze He, Juncheng B Li, Xuan Jiang, Heather Miller

Specialized Foundation Models struggle to beat Supervised Baselines

Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

Authors: Shengyu Feng, Xiang Kong, Shuang Ma, Aonan Zhang, Dong Yin, Chong Wang, Ruoming Pang, Yiming Yang

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Time, Space and Streaming Efficient Algorithm for Heavy Attentions

Authors: Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David Woodruff

Generative Models

Consistency Models Made Easy

Authors: Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, J Zico Kolter

Human-Aligned Chess With a Bit of Search

Authors: Yiming Zhang, Athul Paul Jacob, Vivian Lai, Daniel Fried, Daphne Ippolito

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Shuaiqi Wang, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Authors: Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-hsu Yen, Avner May, Tianqi Chen, Beidi Chen

OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation

Authors: Yuchen Lin, Chenguo Lin, Jianjin Xu, Yadong Mu

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Authors: Wenda Xu, Rujun Han, Zifeng Wang, Long Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-yu Lee, Tomas Pfister

TFG-Flow: Training-free Guidance in Multimodal Generative Flow

Authors: Haowei Lin, Shanda Li, Haotian Ye, Yiming Yang, Stefano Ermon, Yitao Liang, Jianzhu Ma

Truncated Consistency Models

Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie

TypedThinker: Typed Thinking Improves Large Language Model Reasoning

Authors: Danqing Wang, Jianxin Ma, Fei Fang, Lei Li

Infrastructure, Software Libraries, Hardware, Systems, Etc.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Authors: Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

Interpretability and Explainable AI

Improving Instruction-Following in Language Models through Activation Steering

Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

Interpreting Language Reward Models via Contrastive Explanations

Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

Authors: Zhuorui Ye, Stephanie Milani, Geoffrey J. Gordon, Fei Fang

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Authors: Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-zhu

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Authors: Hyesu Lim, Jinho Choi, Jaegul Choo, Steffen Schneider

Learning on Graphs and Other Geometries & Topologies

Learning Graph Invariance by Harnessing Spuriosity

Authors: Tianjun Yao, Yongqiang Chen, Kai Hu, Tongliang Liu, Kun Zhang, Zhiqiang Shen

Spectro-Riemannian Graph Neural Networks

Authors: Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos

Learning Theory

A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

Larger Language Models Provably Generalize Better

Authors: Marc Anton Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Andrew Gordon Wilson, Christopher De Sa, J Zico Kolter

Learning from weak labelers as constraints

Authors: Vishwajeet Agrawal, Rattana Pukdee, Maria Florina Balcan, Pradeep Kumar Ravikumar

Neurosymbolic & Hybrid AI Systems (Physics-informed, Logic & Formal Reasoning, Etc.)

ImProver: Agent-Based Automated Proof Optimization

Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains

Authors: Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, Honguk Woo

Optimization

Understanding Optimization in Deep Learning with Central Flows

Authors: Jeremy Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, Jason D. Lee

Other Topics in Machine Learning (i.e., None of the Above)

AnoLLM: Large Language Models for Tabular Anomaly Detection

Authors: Che-ping Tsai, Ganyu Teng, Phillip Wallis, Wei Ding

Beyond Worst-Case Dimensionality Reduction for Sparse Vectors

Authors: Sandeep Silwal, David Woodruff, Qiuyi Zhang

Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity

Authors: Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu

Probabilistic Methods (Bayesian Methods, Variational Inference, Sampling, UQ, Etc.)

Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback

Authors: Michelle D Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy

Reinforcement Learning

Diffusing States and Matching Scores: A New Framework for Imitation Learning

Authors: Runzhe Wu, Yiding Chen, Gokul Swamy, Kianté Brantley, Wen Sun

Efficient Imitation under Misspecification

Authors: Nicolas Espinosa-dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Authors: Zhaolin Gao, Wenhao Zhan, Jonathan Daniel Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

Reinforcement learning with combinatorial actions for coupled restless bandits

Authors: Lily Xu, Bryan Wilder, Elias Boutros Khalil, Milind Tambe

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

Transfer Learning, Meta Learning, and Lifelong Learning

Many-Objective Multi-Solution Transport

Authors: Ziyue Li, Tian Li, Virginia Smith, Jeff Bilmes, Tianyi Zhou

pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Authors: Shentong Mo, Xufang Luo, Dongsheng Li

Unsupervised, Self-supervised, Semi-supervised, and Supervised Representation Learning

Learning Representations of Intermittent Temporal Latent Process

Authors: Yuke Li, Yujia Zheng, Guangyi Chen, Kun Zhang, Heng Huang

Memory Mosaics

Authors: Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Leon Bottou

MetaOOD: Automatic Selection of OOD Detection Models

Authors: Yuehan Qin, Yichi Zhang, Yi Nian, Xueying Ding, Yue Zhao

Repetition Improves Language Model Embeddings

Authors: Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang

Read More

Enterprises Onboard AI Teammates Faster With NVIDIA NeMo Tools to Scale Employee Productivity

Enterprises Onboard AI Teammates Faster With NVIDIA NeMo Tools to Scale Employee Productivity

An AI agent is only as accurate, relevant and timely as the data that powers it.

Now generally available, NVIDIA NeMo microservices are helping enterprise IT quickly build AI teammates that tap into data flywheels to scale employee productivity. The microservices provide an end-to-end developer platform for creating state-of-the-art agentic AI systems and continually optimizing them with data flywheels informed by inference and business data, as well as user preferences.

With a data flywheel, enterprise IT can onboard AI agents as digital teammates. These agents can tap into user interactions and data generated during AI inference to continuously improve model performance — turning usage into insight and insight into action.

Building Powerful Data Flywheels for Agentic AI

Without a constant stream of high-quality inputs — from databases, user interactions or real-world signals — an agent’s understanding can weaken, making responses less reliable and agents less productive.

Maintaining and improving the models that power AI agents in production requires three types of data: inference data to gather insights and adapt to evolving data patterns, up-to-date business data to provide intelligence, and user feedback data to advise if the model and application are performing as expected. NeMo microservices help developers tap into these three data types.

NeMo microservices speed AI agent development with end-to-end tools for curating, customizing, evaluating and guardrailing the models that drive their agents.

NVIDIA NeMo microservices — including NeMo Customizer, NeMo Evaluator and NeMo Guardrails — can be used alongside NeMo Retriever and NeMo Curator to ease enterprises’ experiences building, optimizing and scaling AI agents through custom enterprise data flywheels. For example:

  • NeMo Customizer accelerates large language model fine-tuning, delivering up to 1.8x higher training throughput. This high-performance, scalable microservice uses popular post-training techniques including supervised fine-tuning and low-rank adaptation.
  • NeMo Evaluator simplifies the evaluation of AI models and workflows on custom and industry benchmarks with just five application programming interface (API) calls.
  • NeMo Guardrails improves compliance protection by up to 1.4x with only half a second of additional latency, helping organizations implement robust safety and security measures that align with organizational policies and guidelines.

With NeMo microservices, developers can build data flywheels that boost AI agent accuracy and efficiency. Deployed through the NVIDIA AI Enterprise software platform, NeMo microservices are easy to operate and can run on any accelerated computing infrastructure, on premises or in the cloud, with enterprise-grade security, stability and support.

The microservices have become generally available at a time when enterprises are building large-scale multi-agent systems, where hundreds of specialized agents — with distinct goals and workflows — collaborate to tackle complex tasks as digital teammates, working alongside employees to assist, augment and accelerate work across functions.

This enterprise-wide impact positions AI agents as a trillion-dollar opportunity — with applications spanning automated fraud detection, shopping assistants, predictive machine maintenance and document review — and underscores the critical role data flywheels play in transforming business data into actionable insights.

Data flywheels built with NVIDIA NeMo microservices constantly curate data, retrain models and evaluate their performance, all with minimal human interactions and maximum autonomy.

Industry Pioneers Boost AI Agent Accuracy With NeMo Microservices

NVIDIA partners and industry pioneers are using NeMo microservices to build responsive AI agent platforms so that digital teammates can help get more done.

Working with Arize and Quantiphi, AT&T has built an advanced AI-powered agent using NVIDIA NeMo, designed to process a knowledge base of nearly 10,000 documents, refreshed weekly. The scalable, high-performance AI agent is fine-tuned for three key business priorities: speed, cost efficiency and accuracy — all increasingly critical as adoption scales.

AT&T boosted AI agent accuracy by up to 40% with NeMo Customizer and NeMo Evaluator, fine-tuning a Mistral 7B model to help deliver personalized services, prevent fraud and optimize network performance.

BlackRock is working with NeMo microservices for agentic AI capabilities in its Aladdin tech platform, which unifies the investment management process through a common data language.

Teaming with Galileo, Cisco’s Outshift team is using NVIDIA NeMo microservices to power a coding assistant that delivers 40% fewer tool selection errors and achieves up to 10x faster response times.

Nasdaq is accelerating its Nasdaq Gen AI Platform with NeMo Retriever microservices and NVIDIA NIM microservices. NeMo Retriever enhanced the platform’s search capabilities, leading to up to 30% improved accuracy and response times, in addition to cost savings.

Broad Model and Partner Ecosystem Support for NeMo Microservices

NeMo microservices support a broad range of popular open models, including Llama, the Microsoft Phi family of small language models, Google Gemma, Mistral and Llama Nemotron Ultra, currently the top open model on scientific reasoning, coding and complex math benchmarks.

Meta has tapped NVIDIA NeMo microservices through new connectors for Meta Llamastack. Users can access the same capabilities — including Customizer, Evaluator and Guardrails — via APIs, enabling them to run the full suite of agent-building workflows within their environment.

“With Llamastack integration, agent builders can implement data flywheels powered by NeMo microservices,” said Raghotham Murthy, software engineer, GenAI, at Meta. “This allows them to continuously optimize models to improve accuracy, boost efficiency and reduce total cost of ownership.”

Leading AI software providers such as Cloudera, Datadog, Dataiku, DataRobot, DataStax, SuperAnnotate, Weights & Biases and more have integrated NeMo microservices into their platforms. Developers can use NeMo microservices in popular AI frameworks including CrewAI, Haystack by deepset, LangChain, LlamaIndex and Llamastack.

Enterprises can build data flywheels with NeMo Retriever microservices using NVIDIA AI Data Platform offerings from NVIDIA-Certified Storage partners including DDN, Dell Technologies, Hewlett Packard Enterprise, Hitachi Vantara, IBM, NetApp, Nutanix, Pure Storage, VAST Data and WEKA.

Leading enterprise platforms including Amdocs, Cadence, Cohesity, SAP, ServiceNow and Synopsys are using NeMo Retriever microservices in their AI agent solutions.

Enterprises can run AI agents on NVIDIA-accelerated infrastructure, networking and software from leading system providers including Cisco, Dell, Hewlett Packard Enterprise and Lenovo.

Consulting giants including Accenture, Deloitte and EY are building AI agent platforms for enterprises using NeMo microservices.

Developers can download NeMo microservices from the NVIDIA NGC catalog. The microservices can be deployed as part of NVIDIA AI Enterprise with extended-life software branches for API stability, proactive security remediation and enterprise-grade support.

Read More

Project G-Assist Plug-In Builder Lets Anyone Customize AI on GeForce RTX AI PCs

Project G-Assist Plug-In Builder Lets Anyone Customize AI on GeForce RTX AI PCs

AI is rapidly reshaping what’s possible on a PC — whether for real-time image generation or voice-controlled workflows. As AI capabilities grow, so does their complexity. Tapping into the power of AI can entail navigating a maze of system settings, software and hardware configurations.

Enabling users to explore how on-device AI can simplify and enhance the PC experience, Project G-Assist — an AI assistant that helps tune, control and optimize GeForce RTX systems — is now available as an experimental feature in the NVIDIA app. Developers can try out AI-powered voice and text commands for tasks like monitoring performance, adjusting settings and interacting with supporting peripherals. Users can even summon other AIs powered by GeForce RTX AI PCs.

And it doesn’t stop there. For those looking to expand Project G-Assist capabilities in creative ways, the AI supports custom plug-ins. With the new ChatGPT-based G-Assist Plug-In Builder, developers and enthusiasts can create and customize G-Assist’s functionality, adding new commands, connecting external tools and building AI workflows tailored to specific needs. With the plug-in builder, users can generate properly formatted code with AI, then integrate the code into G-Assist — enabling quick, AI-assisted functionality that responds to text and voice commands.

Teaching PCs New Tricks: Plug-Ins and APIs Explained

Plug-ins are lightweight add-ons that give software new capabilities. G-Assist plug-ins can control music, connect with large language models and much more.

Under the hood, these plug-ins tap into application programming interfaces (APIs), which allow different software and services to talk to each other. Developers can define functions in simple JSON formats, write logic in Python and quickly integrate new tools or features into G-Assist.
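For a sense of what this looks like in practice, here is a deliberately simplified sketch; the manifest fields, file names and handler signature are illustrative assumptions, not the actual G-Assist plug-in schema (see NVIDIA's GitHub samples for the real format):

    import json

    # Hypothetical manifest describing one command the plug-in exposes to G-Assist.
    # Field names are illustrative; consult NVIDIA's sample plug-ins for the real schema.
    manifest = {
        "name": "weather",
        "description": "Report current weather for a city",
        "functions": [
            {
                "name": "get_weather",
                "description": "Look up current conditions for a city",
                "parameters": {"city": {"type": "string"}},
            }
        ],
    }

    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    def get_weather(city: str) -> str:
        # Plug-in logic in Python: a real plug-in would call a free weather API here.
        return f"Weather lookup for {city} is not wired up in this sketch."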

With the G-Assist Plug-In Builder, users can:

  • Use a responsive small language model running locally on GeForce RTX GPUs for fast, private inference.
  • Extend G-Assist’s capabilities with custom functionality tailored to specific workflows, games and tools.
  • Interact with G-Assist directly from the NVIDIA overlay, without tabbing out of an application or workflow.
  • Invoke AI-powered GPU and system controls from applications using C++ and Python bindings.
  • Integrate with agentic frameworks using tools like Langflow, letting G-Assist function as a component in larger AI pipelines and multi-agent systems.

Built for Builders: Using Free APIs to Expand AI PC Capabilities 

NVIDIA’s GitHub repository provides everything needed to get started on developing with G-Assist — including sample plug-ins, step-by-step instructions and documentation for building custom functionalities.

Developers can define functions in JSON and drop config files into a designated directory, where G-Assist can automatically load and interpret them. Users can even submit plug-ins for review and potential inclusion in the NVIDIA GitHub repository to make new capabilities available for others.

Hundreds of free, developer-friendly APIs are available today to extend G-Assist capabilities — from automating workflows to optimizing PC setups to boosting online shopping. For ideas, find searchable indices of free APIs for use across entertainment, productivity, smart home, hardware and more on publicapis.dev, free-apis.github.io, apilist.fun and APILayer.

Available sample plug-ins include Spotify, which enables hands-free music and volume control, and Google Gemini, which allows G-Assist to invoke a much larger cloud-based AI for more complex conversations, brainstorming and web searches using a free Google AI Studio API key.

In the clip below, G-Assist asks Gemini for advice on which Legend to pick in the hit game Apex Legends when solo queueing, as well as whether it’s wise to jump into Nightmare mode for level 25 in Diablo IV:

And in the following clip, a developer uses the new plug-in builder to create a Twitch plug-in for G-Assist that checks if a streamer is live. After generating the necessary JSON manifest and Python files, the developer simply drops them into the G-Assist directory to enable voice commands like, “Hey, Twitch, is [streamer] live?”

In addition, users can customize G-Assist to control select peripherals and software applications with simple commands, such as to benchmark or adjust fan speeds, or to change lighting on supported Logitech G, Corsair, MSI and Nanoleaf devices.

Other examples include a Stock Checker plug-in that lets users quickly look up real-time stock prices and performance data, or a Weather plug-in that allows users to ask G-Assist for current weather conditions in any city.

Details on how to build, share and load plug-ins are available on the NVIDIA GitHub repository.

Start Building Today

With the G-Assist Plugin Builder and open API support, anyone can extend G-Assist to fit their exact needs. Explore the GitHub repository and submit features for review to help shape the next wave of AI-powered PC experiences.

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X.

See notice regarding software product information.

Read More

PyTorch 2.7 Release

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

  • support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
  • torch.compile support for Torch Function Modes, which enables users to override any torch.* operation to implement custom user-defined behavior.
  • Mega Cache which allows users to have end-to-end portable caching for torch;
  • new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta:
  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype:
  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on X86 CPUs
  • FlexAttention LLM throughput mode optimization on X86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any torch.* operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to re-write indexing ops.
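A minimal sketch of the idea, assuming PyTorch 2.7; the torch.add rewrite below is purely illustrative, standing in for a backend-specific rewrite:

    import torch
    from torch.overrides import TorchFunctionMode

    class DoubleAdd(TorchFunctionMode):
        # Illustrative override: double the result of torch.add; other ops pass through.
        def __torch_function__(self, func, types, args=(), kwargs=None):
            kwargs = kwargs or {}
            if func is torch.add:
                return 2 * func(*args, **kwargs)
            return func(*args, **kwargs)

    @torch.compile
    def f(x, y):
        return torch.add(x, y)

    with DoubleAdd():
        print(f(torch.ones(3), torch.ones(3)))  # tensor([4., 4., 4.])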

See the tutorial for more information.

[Beta] Mega Cache

Mega Cache allows users to have end-to-end portable caching for torch. The intended use case is after compiling and executing a model, the user calls torch.compiler.save_cache_artifacts() which will return the compiler artifacts in a portable form. Later, potentially on a different machine, the user may call torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches in order to jump-start their cache.
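A minimal sketch of that flow, using the save/load entry points named above (file handling is illustrative):

    import torch

    @torch.compile
    def fn(x):
        return x.sin() + x.cos()

    fn(torch.randn(8))  # compile and execute once so the caches are populated

    # Export the compiler artifacts in a portable form (e.g., to copy to another machine).
    artifacts = torch.compiler.save_cache_artifacts()
    if artifacts is not None:
        artifact_bytes, cache_info = artifacts
        with open("mega_cache.bin", "wb") as f:
            f.write(artifact_bytes)

    # On the target machine/process: pre-populate the torch.compile caches before compiling.
    with open("mega_cache.bin", "rb") as f:
        torch.compiler.load_cache_artifacts(f.read())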

See the tutorial for more information.

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

  • Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
  • PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
  • To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

PyTorch Context Parallel API allows users to create a Python context so that every torch.nn.functional.scaled_dot_product_attention() call within will run with context parallelism. Currently, PyTorch Context Parallel supports 3 attention backends: 1. Flash attention; 2. Efficient attention; and 3. cuDNN attention.

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.

[Prototype] Enhancing Intel GPU Acceleration

This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements:

  • Enable torch.compile on Windows 11 for Intel GPUs, delivering the same performance advantages over eager mode as on Linux.
  • Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency.
  • Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
  • Enable AOTInductor and torch.export on Linux to simplify deployment workflows.
  • Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode.
  • Enable profiler on both Windows and Linux to facilitate model performance analysis.
  • Expand the Intel GPUs support to Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, and Intel® Arc™ B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

See also the tutorials here and here.

[Prototype] FlexAttention LLM first token processing on X86 CPUs

FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
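A minimal sketch of the unified FlexAttention API mentioned above (the shapes and the causal score_mod are illustrative):

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    def causal(score, b, h, q_idx, kv_idx):
        # Mask out future positions by pushing their scores to -inf.
        return torch.where(q_idx >= kv_idx, score, float("-inf"))

    q = torch.randn(1, 4, 128, 64)
    k = torch.randn(1, 4, 128, 64)
    v = torch.randn(1, 4, 128, 64)

    out = torch.compile(flex_attention)(q, k, v, score_mod=causal)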

[Prototype] FlexAttention LLM throughput mode optimization

The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template ability. This addresses the performance bottlenecks for large batch size scenarios present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using FlexAttention APIs and torch.compile for LLM throughput serving on x86 CPUs.

[Prototype] Foreach Map

This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing torch._foreach_* ops. The main advantage over the existing torch._foreach_* ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined Python functions can be lifted to apply to lists of tensors. Torch.compile will automatically generate a horizontally fused kernel for optimal performance.

See tutorial here.

[Prototype] Flex Attention for Inference

In release 2.5.0, FlexAttention (torch.nn.attention.flex_attention) was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support.

[Prototype] Prologue Fusion Support in Inductor

Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory bandwidth.
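A minimal sketch of the kind of pattern this targets; whether fusion actually occurs depends on Inductor selecting a templated matmul, typically on GPU with max-autotune:

    import torch

    def fn(x, w):
        # The dtype cast and scale ahead of the matmul form the "prologue" that
        # Inductor can fuse into the generated matmul kernel.
        return (x.to(torch.bfloat16) * 2.0) @ w

    compiled = torch.compile(fn, mode="max-autotune")
    out = compiled(torch.randn(256, 256), torch.randn(256, 256, dtype=torch.bfloat16))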

Read More

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version now supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).

This release introduces significant performance improvements, expanded model compatibility with multimodality (that is, the ability to understand and analyze text-to-text, images-to-text, and text-to-images data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with the highest performance at scale.

What’s new?

LMI v15 brings several enhancements that improve throughput, latency, and usability:

  1. An async mode that directly integrates with vLLM’s AsyncLLMEngine for improved request handling. This mode creates a more efficient background loop that continuously processes incoming requests, enabling it to handle multiple concurrent requests and stream outputs with higher throughput than the previous Rolling-Batch implementation in v14.
  2. Support for the vLLM V1 engine, which delivers up to 111% higher throughput compared to the previous V0 engine for smaller models at high concurrency. This performance improvement comes from reduced CPU overhead, optimized execution paths, and more efficient resource utilization in the V1 architecture. LMI v15 supports both V1 and V0 engines, with V1 being the default. If you have a need to use V0, you can use the V0 engine by specifying VLLM_USE_V1=0. vLLM V1’s engine also comes with a core re-architecture of the serving engine with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and Flash Attention 3. For more information, see the vLLM Blog.
  3. Expanded API schema support with three flexible options to allow seamless integration with applications built on popular API patterns:
    1. Message format compatible with the OpenAI Chat Completions API.
    2. OpenAI Completions format.
    3. Text Generation Inference (TGI) schema to support backward compatibility with older models.
  4. Multimodal support, with enhanced capabilities for vision-language models including optimizations such as multimodal prefix caching
  5. Built-in support for function calling and tool calling, enabling sophisticated agent-based workflows.

Enhanced model support

LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to, the following:

  • Llama 4 – Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E-Instruct
  • Gemma 3 – Google’s lightweight and efficient models, known for their strong performance despite smaller size
  • Qwen 2.5 – Alibaba’s advanced models including QwQ 2.5 and Qwen2-VL with multimodal capabilities
  • Mistral AI models – High-performance models from Mistral AI that offer efficient scaling and specialized capabilities
  • DeepSeek-R1/V3 – State of the art reasoning models

Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.

Benchmarks

Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:

Model | Batch size | Instance type | LMI v14 throughput [tokens/s] (V0 engine) | LMI v15 throughput [tokens/s] (V1 engine) | Improvement
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | p4d.24xlarge | 1768 | 2198 | 24%
meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37%
mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111%

DeepSeek-R1 Llama 70B for various levels of concurrency

Llama 3.1 8B Instruct for various levels of concurrency

Mistral 7B for various levels of concurrency

The async engine in LMI v15 shows strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that the V1 engine in async mode delivers between 24% and 111% higher throughput compared to LMI v14 using rolling batch in the models tested in high-concurrency scenarios for batch sizes of 64 and 128. We suggest keeping the following considerations in mind for optimal performance:

  • Higher batch sizes increase concurrency but come with a natural tradeoff in terms of latency
  • Batch sizes of 4 and 8 provide the best latency for most use cases
  • Batch sizes up to 64 and 128 achieve maximum throughput with acceptable latency trade-offs

API formats

LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.

  • Chat Completions – Message format is compatible with OpenAI Chat Completions API. Use this schema for tool calling, reasoning, and multimodal use cases. Here is a sample of the invocation with the Messages API:
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

  • OpenAI Completions format – Compatible with the OpenAI Completions API (a legacy endpoint that is no longer receiving updates):
    body = {
     "prompt": "Name popular places to visit in London?",
     "temperature": 0.9,
     "max_tokens": 256,
     "stream": True,
    } 

  • TGI – Supports backward compatibility with older models:
    body = {
    "inputs": "Name popular places to visit in London?",
    "parameters": {
    "max_new_tokens": 256,
    "temperature": 0.9,
    },
    "stream": True,
    }

Getting started with LMI v15

Getting started with LMI v15 is seamless, and you can deploy with LMI v15 in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments can be managed through SageMaker AI endpoints. To deploy models, you need to specify the Hugging Face model ID, instance type, and configuration options as environment variables.

For optimal performance, we recommend the following instances:

  • Llama 4 Scout: ml.p5.48xlarge
  • DeepSeek R1/V3: ml.p5e.48xlarge
  • Qwen 2.5 VL-32B: ml.g5.12xlarge
  • Qwen QwQ 32B: ml.g5.12xlarge
  • Mistral Large: ml.g6e.48xlarge
  • Gemma3-27B: ml.g5.12xlarge
  • Llama 3.3-70B: ml.p4d.24xlarge

To deploy with LMI v15, follow these steps:

  1. Clone the notebook to your Amazon SageMaker Studio notebook or to Visual Studio Code (VS Code). You can then run the notebook to do the initial setup and deploy the model from the Hugging Face repository to the SageMaker AI endpoint. We walk through the key blocks here.
  2. LMI v15 maintains the same configuration pattern as previous versions, using environment variables in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15.
    vllm_config = {
        "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
        "HF_TOKEN": "entertoken",
        "OPTION_MAX_MODEL_LEN": "250000",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1500",
        "SERVING_FAIL_FAST": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
    }

    • HF_MODEL_ID sets the model ID from Hugging Face. You can also download the model from Amazon Simple Storage Service (Amazon S3).
    • HF_TOKEN sets the token to download the model. This is required for gated models like Llama-4
    • OPTION_MAX_MODEL_LEN sets the maximum model context length.
    • OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
    • OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
    • SERVING_FAIL_FAST=true. We recommend setting this flag because it allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs.
    • OPTION_ROLLING_BATCH=disable disables the rolling batch implementation of LMI, which was the default in LMI v14. We recommend using async mode instead, because this newer implementation provides better performance.
    • OPTION_ASYNC_MODE=true enables async mode.
    • OPTION_ENTRYPOINT provides the entrypoint for vLLM’s async integrations
  3. Set the latest container (in this example we used 0.33.0-lmi15.0.0-cu128), AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images.
  4. Deploy the model to the endpoint using model.deploy().
    CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'
    REGION = 'us-east-1'
    # Construct container URI
    container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'
    
    # Select instance type
    instance_type = "ml.p5.48xlarge"
    
    model = Model(image_uri=container_uri,
                  role=role,
                  env=vllm_config)
    endpoint_name = sagemaker.utils.name_from_base("Llama-4")
    
    print(endpoint_name)
    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        container_startup_health_check_timeout = 1800
    )

  5. Invoke the model. SageMaker inference provides two APIs to invoke the model: InvokeEndpoint and InvokeEndpointWithResponseStream. You can choose either option based on your needs.
    import json

    import boto3

    # Create SageMaker Runtime client
    smr_client = boto3.client('sagemaker-runtime')

    # Add your endpoint here
    endpoint_name = ''

    # Invoke with messages format
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

    # Invoke with endpoint streaming
    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
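
For the non-streaming path, a minimal sketch using InvokeEndpoint with the same client; setting "stream" to False is an assumption about the request schema, and the response is read as a single JSON payload:

    # Non-streaming invocation with the same request body
    body["stream"] = False
    resp = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    print(resp["Body"].read().decode("utf-8"))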

To run multi-modal inference with Llama-4 Scout, see the notebook for the full code sample to run inference requests with images.

Conclusion

Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.

We encourage you to explore this release for deploying your generative AI models.

Check out the provided example notebooks to start deploying models with LMI v15.


About the authors

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focusses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers world-wide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Read More

Accuracy evaluation framework for Amazon Q Business – Part 2

Accuracy evaluation framework for Amazon Q Business – Part 2

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.

In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

  • Comprehensive evaluation workflow – This ready-to-deploy solution uses AWS CloudFormation stacks to set up an Amazon Q Business application, complete with user access, a custom UI for review and evaluation, and the supporting evaluation infrastructure
  • Lightweight AWS Lambda based evaluation – Designed for users with an existing Amazon Q Business application, this streamlined solution employs an AWS Lambda function to efficiently assess the application’s accuracy

By the end of this post, you will have a clear understanding of how to implement an evaluation framework that aligns with your specific needs with a detailed walkthrough, so your Amazon Q Business application delivers accurate and reliable results.

Challenges in evaluating Amazon Q Business

Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges due to its integration of retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss key metrics that need to be included for a RAG generative AI solution.

Context recall

Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.

For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

  • Expected: The United States is the third-largest country in the world by land area, covering approximately 9.8 million square kilometers. It has a diverse range of geographical features.
  • High context recall: The United States spans approximately 9.8 million square kilometers, making it the third-largest nation globally by land area. The country’s geography is incredibly diverse, featuring the Rocky Mountains stretching from New Mexico to Alaska, the Appalachian Mountains along the eastern states, the expansive Great Plains in the central region, and arid deserts like the Mojave in the southwest.
  • Low context recall: The United States features significant geographical landmarks. Additionally, the country is home to unique ecosystems like the Everglades in Florida, a vast network of wetlands.

The following diagram illustrates the context recall workflow.

Context precision

Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.

For example, “Why is Silicon Valley great for tech startups?” might give the following answers:

  • Ground truth answer: Silicon Valley is famous for fostering innovation and entrepreneurship in the technology sector.
  • High precision context: Many groundbreaking startups originate from Silicon Valley, benefiting from a culture that encourages innovation and risk-taking.
  • Low precision context: Silicon Valley experiences a Mediterranean climate, with mild, wet winters and warm, dry summers, contributing to its appeal as a place to live and work.

The following diagram illustrates the context precision workflow.

Answer relevancy

Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.

For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

  • High relevance answer: Amazon Q Business Service is a RAG Generative AI solution designed for enterprise use. Key features include a fully managed generative AI solution, integration with enterprise data sources, robust security protocols, and customizable virtual assistants. It benefits enterprise customers by enabling efficient information retrieval, automating customer support tasks, enhancing employee productivity through quick access to data, and providing insights through analytics on user interactions.
  • Low relevance answer: Amazon Q Business Service is part of Amazon’s suite of cloud services. Amazon also offers online shopping and streaming services.

The following diagram illustrates the answer relevancy workflow.

Truthfulness

Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.

For example, a user might ask “What is the capital of Canada?” They could get the following responses:

  • Context: Canada’s capital city is Ottawa, located in the province of Ontario. Ottawa is known for its historic Parliament Hill, the center of government, and the scenic Rideau Canal, a UNESCO World Heritage site
  • High truthfulness answer: The capital of Canada is Ottawa
  • Low truthfulness answer: The capital of Canada is Toronto

The following diagram illustrates the truthfulness workflow.

Evaluation methods

Deciding on who should conduct the evaluation can significantly impact results. Options include:

  • Human-in-the-Loop (HITL) – Human evaluators manually assess the accuracy and relevance of responses, offering nuanced insights that automated systems might miss. However, it is a slow process and difficult to scale.
  • LLM-aided evaluation – Automated methods, such as the Ragas framework, use language models to streamline the evaluation process. However, these might not fully capture the complexities of domain-specific knowledge.

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.
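
To make the LLM-aided option concrete, here is a minimal sketch using the open-source Ragas library; it assumes Ragas’ standard column names, maps the truthfulness metric above to Ragas’ faithfulness metric, and evaluate() also needs an LLM and embeddings configured (for example, through Amazon Bedrock or the library defaults):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

    # One evaluation record: the prompt, the Amazon Q Business answer, the source
    # passages returned with it, and the ground truth from the evaluation dataset.
    data = {
        "question": ["What data encryption does Amazon Q Business support?"],
        "answer": ["<answer returned by Amazon Q Business>"],
        "contexts": [["<source passage cited in the response>"]],
        "ground_truth": ["<reference answer from the evaluation dataset>"],
    }

    result = evaluate(
        Dataset.from_dict(data),
        metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
    )
    print(result)  # per-metric scores between 0 and 1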

Solution overview

In this post, we explore two different solutions to provide you the details of an evaluation framework, so you can use it and adapt it for your own use case.

Solution 1: End-to-end evaluation solution

For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

  • User access and UI – Authenticated users interact with a frontend UI to upload datasets, review RAGAS output, and provide human feedback
  • Evaluation solution infrastructure – Core components include:
    • Ragas scoring – Automated metrics provide an initial layer of evaluation
    • HITL review – Human evaluators refine Ragas scores through the UI, providing nuanced accuracy and reliability

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.

A quick video demo of this solution is shown below:

Solution architecture

The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

  1. User access and UI – Users authenticate through Amazon Cognito, and upon successful login, interact with a Streamlit-based custom UI. This frontend allows users to upload CSV datasets to Amazon Simple Storage Service (Amazon S3), review Ragas evaluation outputs, and provide human feedback for refinement. The application exchanges the Amazon Cognito token for an AWS IAM Identity Center token, granting scoped access to Amazon Q Business.
  2. UI infrastructure – The UI is hosted behind an Application Load Balancer, supported by Amazon Elastic Compute Cloud (Amazon EC2) instances running in an Auto Scaling group for high availability and scalability.
  3. Upload dataset and trigger evaluation – Users upload a CSV file containing queries and ground truth answers to Amazon S3, which triggers an evaluation process. A Lambda function reads the CSV, stores its content in a DynamoDB table, and initiates further processing through a DynamoDB stream.
  4. Consuming DynamoDB stream – A separate Lambda function processes new entries from the DynamoDB stream, and publishes messages to an SQS queue, which serves as a trigger for the evaluation Lambda function.
  5. Ragas scoring – The evaluation Lambda function consumes SQS messages, sending queries (prompts) to Amazon Q Business for generating answers. It then evaluates the prompt, ground truth, and generated answer using the Ragas evaluation framework. Ragas computes automated evaluation metrics such as context recall, context precision, answer relevancy, and truthfulness. The results are stored in DynamoDB and visualized in the UI.

HITL review – Authenticated users can review and refine RAGAS scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.

This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.
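
For orientation, here is a minimal sketch of the kind of S3-triggered Lambda described in step 3; the bucket, table, and column names are illustrative, and the deployed stack ships its own implementation:

    import csv
    import io

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("EvaluationDataset")  # illustrative table name

    def handler(event, context):
        # Read the uploaded CSV of prompts and ground truth, then store each row in
        # DynamoDB, whose stream kicks off the downstream evaluation.
        record = event["Records"][0]["s3"]
        obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
        rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))
        for i, row in enumerate(rows):
            table.put_item(Item={"id": str(i), "prompt": row["prompt"], "ground_truth": row["ground_truth"]})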

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same AWS Region.

Deploy the CloudFormation stack

Complete the following steps to deploy the CloudFormation stack:

  1. Clone the repository or download the files to your local computer.
  2. Unzip the downloaded file (if you used this option).
  3. Using your local computer command line, use the ‘cd’ command and change directory into ./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution
  4. Make sure the ./deploy.sh script can run by executing the command chmod 755 ./deploy.sh.
  5. Execute the CloudFormation deployment script provided as follows:
    ./deploy.sh -s [CNF_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. It takes approximately 15 minutes to complete the deployment, after which you will see a similar page to the following screenshot.

Add users to Amazon Q Business

You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.

Upload the evaluation dataset through the UI

In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.

This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset:

  • What are the index types of Amazon Q Business and the features of each?
  • I want to use Q Apps, which subscription tier is required to use Q Apps?
  • What is the file size limit for Amazon Q Business via file upload?
  • What data encryption does Amazon Q Business support?

To upload the evaluation dataset, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the evals stack that you already launched.
  3. On the Outputs tab, take note of the user name and password to log in to the UI application, and choose the UI URL.

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure Chatsync API call with Amazon Q Business.

  1. Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

  2. After you log in to the custom UI used for Amazon Q evaluation, choose Upload Dataset, then upload the dataset CSV file.

After the file is uploaded, the evaluation framework will send the prompt to Amazon Q Business to generate the answer, and then send the prompt, ground truth, and answer to Ragas to evaluate. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.

After about 7 minutes, the workflow will finish, and you should see the evaluation result for the first question.

Perform HITL evaluation

After the Lambda function has completed its execution, Ragas scoring will be shown in the custom UI. Now you can review metric scores generated using Ragas (an LLM-aided evaluation method), and you can provide human feedback as an evaluator to provide further calibration. This human-in-the-loop calibration can further improve the evaluation accuracy, because the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.

Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, Amazon Q Business generated answers, ground truth, and context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

  • Answer relevancy – Measures relevancy of answers. Higher scores indicate better alignment with the user input, and lower scores are given if the response is incomplete or includes redundant information.
  • Truthfulness – Verifies factual accuracy by comparing responses to verified sources. Higher scores indicate a better consistency with verified sources.
  • Context precision – Assesses the relevance and conciseness of retrieved information. Higher scores indicate that the retrieved information closely matches the query intent, reducing irrelevant data.
  • Context recall – Measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.

For this question, all metrics showed Amazon Q Business achieved a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned with a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both question and answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis, showing a lower score than expected. It’s important to calibrate the Ragas evaluation judgment with a human in the loop. You should read the question and answer carefully, and make necessary changes to the metric score to reflect the HITL analysis. Finally, the results will be updated in DynamoDB.

Lastly, save the metric score in the CSV file, and you can download and review the final metric scores.

Solution 2: Lambda based evaluation

If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

  • Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data
  • Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch
  • Like the end-to-end solution, it provides results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides you sample code to evaluate the Amazon Q Business application response. To use this solution, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the Lambda function in the end-to-end evaluation solution, using RAGAS against a test set of questions and ground truth. This lightweight solution doesn’t have a custom UI, but it can provide result metrics (context recall, context precision, answer relevancy, truthfulness), for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.
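
As an illustration of the CloudWatch output path, here is a minimal sketch of publishing the four scores as custom metrics; the namespace, dimension, and metric names below are assumptions for illustration, not the names used by the sample Lambda:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_scores(scores: dict, application_id: str) -> None:
        # scores example: {"context_recall": 0.92, "context_precision": 0.88, ...}
        cloudwatch.put_metric_data(
            Namespace="AmazonQBusinessEvaluation",  # illustrative namespace
            MetricData=[
                {
                    "MetricName": name,
                    "Dimensions": [{"Name": "ApplicationId", "Value": application_id}],
                    "Value": float(value),
                    "Unit": "None",
                }
                for name, value in scores.items()
            ],
        )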

Using evaluation results to improve Amazon Q Business application accuracy

This section outlines strategies to enhance key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness) for a RAG solution in the context of Amazon Q Business.

Context recall

Let’s examine the following problems and troubleshooting tips:

  • Aggressive query filtering – Overly strict search filters or metadata constraints might exclude relevant records. You should review the metadata filters or boosting settings applied in Amazon Q Business to make sure they don’t unnecessarily restrict results.
  • Data source ingestion errors – Documents from certain data sources aren’t successfully ingested into Amazon Q Business. To address this, check the document sync history report in Amazon Q Business to confirm successful ingestion and resolve ingestion errors.

Context precision

Consider the following potential issues:

  • Over-retrieval of documents – Large top-K values might retrieve semi-related or off-topic passages, which the LLM might incorporate unnecessarily. To address this, refine metadata filters or apply boosting to improve passage relevance and reduce noise in the retrieved context.
  • Poor query specificity – Broad or poorly formed user queries can yield loosely related results. You should make sure user queries are clear and specific. Train users or implement query refinement mechanisms to optimize query quality.

Answer relevancy

Consider the following troubleshooting methods:

  • Partial coverage – Retrieved context addresses parts of the question but fails to cover all aspects, especially in multi-part queries. To address this, decompose complex queries into sub-questions and instruct the LLM or a dedicated module to retrieve and answer each sub-question before composing the final response (see the sketch after this list). For example:
    • Break down the query into sub-questions.
    • Retrieve relevant passages for each sub-question.
    • Compose a final answer addressing each part.
  • Context/answer mismatch – The LLM might misinterpret retrieved passages, omit relevant information, or merge content incorrectly due to hallucination. You can use prompt engineering to guide the LLM more effectively. For example, for the original query “What are the top 3 reasons for X?” you can use the rewritten prompt “List the top 3 reasons for X clearly labeled as #1, #2, and #3, based strictly on the retrieved context.”
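
The following is a minimal sketch of that sub-question pattern, assuming the Amazon Q Business ChatSync API via boto3. The application ID is a placeholder, the sub-questions are hard-coded for illustration (in practice an LLM or a dedicated module would generate them), and depending on your identity configuration you might need to pass additional identity parameters to chat_sync.

import boto3

qbusiness = boto3.client("qbusiness")
APP_ID = "<your-amazon-q-business-application-id>"  # placeholder

def ask(question: str) -> str:
    """Send one sub-question to Amazon Q Business and return its answer."""
    response = qbusiness.chat_sync(applicationId=APP_ID, userMessage=question)
    return response["systemMessage"]

# Sub-questions derived from a multi-part user query (hard-coded here for illustration)
sub_questions = [
    "Which subscription tiers does Amazon Q Business offer?",
    "Which subscription tier includes Q Apps?",
]

# Retrieve an answer for each sub-question, then compose the final response
partial_answers = [ask(q) for q in sub_questions]
final_answer = "\n".join(
    f"- {q}\n  {a}" for q, a in zip(sub_questions, partial_answers)
)
print(final_answer)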

Truthfulness

Consider the following:

  • Stale or inaccurate data sources – Outdated or conflicting information in the knowledge corpus might lead to incorrect answers. To address this, compare the retrieved context with verified sources to confirm accuracy, and collaborate with SMEs to validate the data.
  • LLM hallucination – The model might fabricate or embellish details, even with accurate retrieved context. Although Amazon Q Business is a RAG generative AI solution and should significantly reduce hallucination, it isn’t possible to eliminate it entirely. With the evaluation solution, you can measure the frequency of low context precision answers to identify patterns and quantify the impact of hallucinations in aggregate.

By systematically examining and addressing the root causes of low evaluation metrics, you can optimize your Amazon Q Business application. From document retrieval and ranking to prompt engineering and validation, these strategies will help enhance the effectiveness of your RAG solution.

Clean up

When you’re done, return to the AWS CloudFormation console and delete the stack you deployed to remove the underlying infrastructure and avoid additional costs on your AWS account.
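
If you prefer the AWS CLI, you can delete the stack and wait for the deletion to finish with the following commands; the stack name is a placeholder for whatever name you used when deploying.

aws cloudformation delete-stack --stack-name <your-evaluation-stack-name>
aws cloudformation wait stack-delete-complete --stack-name <your-evaluation-stack-name>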

Conclusion

In this post, we outlined two evaluation solutions for Amazon Q Business: a comprehensive evaluation workflow and a lightweight Lambda-based evaluation. These approaches combine automated evaluation methods such as Ragas with human-in-the-loop validation, providing reliable and accurate assessments.

By using our guidance on how to improve evaluation metrics, you can continuously optimize your Amazon Q Business application to meet enterprise needs. Whether you’re using the end-to-end solution or the lightweight approach, these frameworks provide a scalable and efficient path to improve accuracy and relevance.

To learn more about Amazon Q Business and how to evaluate its results, explore the hands-on workshops available for the service.


About the authors

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She specializes in generative AI, applied data science, and IoT architecture. Currently she is part of the Amazon Bedrock team, and a Gold member/mentor in the Machine Learning Technical Field Community. She works with customers, ranging from startups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about using large language models for advanced data analytics and exploring practical applications that address real-world challenges.

Amit Gupta is a Senior Q Business Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.

Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.

Ricardo Aldao is a Senior Partner Solutions Architect at AWS. He is a passionate AI/ML enthusiast who focuses on supporting partners in building generative AI solutions on AWS.

Read More

Use Amazon Bedrock Intelligent Prompt Routing for cost and latency benefits


In December, we announced the preview availability of Amazon Bedrock Intelligent Prompt Routing, which provides a single serverless endpoint to efficiently route requests between different foundation models within the same model family. To do this, Amazon Bedrock Intelligent Prompt Routing dynamically predicts the response quality of each model for a request and routes the request to the model it determines is most appropriate based on cost and response quality, as shown in the following figure.

Today, we’re happy to announce the general availability of Amazon Bedrock Intelligent Prompt Routing. Over the past several months, we drove several improvements in intelligent prompt routing based on customer feedback and extensive internal testing. Our goal is to enable you to set up automated, optimal routing between large language models (LLMs) through Amazon Bedrock Intelligent Prompt Routing and its deep understanding of model behaviors within each model family, which incorporates state-of-the-art methods for training routers for different sets of models, tasks and prompts.

In this blog post, we share highlights from our internal testing, explain how you can get started, and point out some caveats and best practices. We encourage you to incorporate Amazon Bedrock Intelligent Prompt Routing into your new and existing generative AI applications. Let’s dive in!

Highlights and improvements

Today, you can either use Amazon Bedrock Intelligent Prompt Routing with the default prompt routers provided by Amazon Bedrock or configure your own prompt routers to tune performance linearly between the two candidate LLMs. Default prompt routers are pre-configured routing systems provided by Amazon Bedrock for each model family; they aim to match the performance of the more performant of the two models while lowering costs by sending easier prompts to the cheaper model. These routers come with predefined settings and are designed to work out of the box with specific foundation models, providing a straightforward, ready-to-use solution without needing to configure any routing settings. If you tested Amazon Bedrock Intelligent Prompt Routing in preview (thank you!), you could choose models in the Anthropic and Meta families. Today, you can choose more models from within the Amazon Nova, Anthropic, and Meta families, including:

  • Anthropic’s Claude family: Haiku, Sonnet 3.5 v1, Haiku 3.5, and Sonnet 3.5 v2
  • Meta’s Llama family: Llama 3.1 8B, Llama 3.1 70B, Llama 3.2 11B, Llama 3.2 90B, and Llama 3.3 70B
  • Amazon Nova family: Nova Pro and Nova Lite

You can also configure your own prompt routers to define your own routing configurations tailored to specific needs and preferences. These are more suitable when you require more control over how to route your requests and which models to use. In GA, you can configure your own router by selecting any two models from the same model family and then configuring the response quality difference of your router.

Adding routing components before invoking the selected LLM with the original prompt introduces overhead. We reduced this overhead by over 20%, to approximately 85 ms (P90). Because the router preferentially invokes the less expensive model while maintaining the same baseline accuracy on the task, you can expect an overall latency and cost benefit compared to always calling the larger, more expensive model, despite the added overhead. This is discussed further in the following benchmark results section.

We conducted several internal tests with proprietary and public data to evaluate Amazon Bedrock Intelligent Prompt Routing. First, we used average response quality gain under cost constraints (ARQGC), a normalized (0–1) metric of routing system quality across various cost constraints, referenced against a reward model, where 0.5 represents random routing and 1 represents optimal oracle routing performance. We also captured the cost savings of intelligent prompt routing relative to using the largest model in the family, and estimated the latency benefit based on the average recorded time to first token (TTFT). The results are reported in the following table.

Model family | Router overall performance (average ARQGC) | Cost savings (%)* | Latency benefit (%)*
Nova | 0.75 | 35% | 9.98%
Anthropic | 0.86 | 56% | 6.15%
Meta | 0.78 | 16% | 9.38%

*Cost savings and latency benefit are measured when configuring the router to match the performance of the strong model.

How to read this table?

It’s important to pause and understand these metrics. First, the results in the preceding table are only meant for comparing against random routing within a family (that is, the improvement in ARQGC over 0.5), not across families. Second, the results are relevant only within a family of models and differ from the model benchmarks you might be familiar with that are used to compare models. Third, because real costs and prices change frequently and depend on the input and output token counts, it’s challenging to compare real cost directly. To solve this problem, we define the cost savings metric as the maximum cost saved, compared to using the strongest LLM for everything, while the router achieves a certain level of response quality. Specifically, in the example shown in the table, there’s an average 35% cost savings when using the Nova family router compared to using Nova Pro for all prompts without the router.

You can expect to see varying levels of benefit based on your use case. For example, in an internal test with hundreds of prompts, we achieved 60% cost savings using Amazon Bedrock Intelligent Prompt Routing with the Anthropic family, with the response quality matching that of Claude 3.5 Sonnet v2.

What is response quality difference?

The response quality difference measures the disparity between the responses of the fallback model and the other model. A smaller value indicates that the responses are similar; a higher value indicates a significant difference between them. The choice of fallback model is important. When you configure a response quality difference of 10% with Anthropic’s Claude 3 Sonnet as the fallback model, the router dynamically selects an LLM to achieve an overall performance within a 10% drop in response quality relative to Claude 3 Sonnet. Conversely, if you use a less expensive model such as Claude 3 Haiku as the fallback model, the router dynamically selects an LLM to achieve an overall performance that is more than a 10% improvement over Claude 3 Haiku.

In the following figure, you can see that the response quality difference is set at 10% with Haiku as the fallback model. If customers want to explore optimal configurations beyond the default settings described previously, they can experiment with different response quality difference thresholds, analyze the router’s response quality, cost, and latency on their development dataset, and select the configuration that best fits their application’s requirements.

When configuring your own prompt router, you can set the threshold for response quality difference as shown in the following image of the Configure prompt router page, under Response quality difference (%) in the Amazon Bedrock console. To do this by using APIs, see How to use intelligent prompt routing.

Benchmark results

When using different model pairings, the ability of the smaller model to serve a larger share of input prompts can yield significant latency and cost benefits, depending on the model choice and the use case. For example, when comparing a Claude 3 Haiku pairing against a Claude 3.5 Haiku pairing, each routed alongside Claude 3.5 Sonnet, we observed the following with one of our internal datasets:

Case 1: Routing between Claude 3 Haiku and Claude 3.5 Sonnet V2: Cost savings of 48% while maintaining the same response quality as Claude 3.5 Sonnet v2

Case 2: Routing between Claude 3.5 Haiku and Claude 3.5 Sonnet V2: Cost savings of 56% while maintaining the same response quality as Claude 3.5 Sonnet v2

As you can see in case 1 and case 2, as model capabilities for less expensive models improve with respect to more expensive models in the same family (for example Claude 3 Haiku to 3.5 Haiku), you can expect more complex tasks to be reliably solved by them, therefore causing a higher percentage of routing to the less expensive model while still maintaining the same overall accuracy in the task.

We encourage you to test the effectiveness of Amazon Bedrock Intelligent Prompt Routing on your specialized task and domain, because results can vary. For example, when we tested Amazon Bedrock Intelligent Prompt Routing with open source and internal Retrieval Augmented Generation (RAG) datasets, we saw an average 63.6% cost savings, because a higher percentage (87%) of prompts were routed to Claude 3.5 Haiku while still maintaining the baseline accuracy of the larger, more expensive model alone (Sonnet 3.5 v2 in the following figure), averaged across RAG datasets.

Getting started

You can get started using the AWS Management Console for Amazon Bedrock. As mentioned earlier, you can create your own router or use a default router:

Use the console to configure a router:

  1. In the Amazon Bedrock console, choose Prompt Routers in the navigation pane, and then choose Configure prompt router.
  2. You can then use a previously configured router or a default router in the console-based playground. For example, in the following figure, we attached a 10-K document from Amazon.com and asked a specific question about the cost of sales.
  3. Choose the router metrics icon (next to the refresh icon) to see which model the request was routed to. Because this is a nuanced question, Amazon Bedrock Intelligent Prompt Routing correctly routes to Claude 3.5 Sonnet V2 in this case, as shown in the following figure.

You can also use the AWS Command Line Interface (AWS CLI) or API to configure and use a prompt router.

To use the AWS CLI or API to configure a router:

AWS CLI:

aws bedrock create-prompt-router \
    --prompt-router-name my-prompt-router \
    --models '[{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}, {"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}]' \
    --routing-criteria '{"responseQualityDifference": 0.5}' \
    --fallback-model '{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}'

Boto3 SDK:

import boto3

# Amazon Bedrock control-plane client
client = boto3.client("bedrock")

response = client.create_prompt_router(
    promptRouterName='my-prompt-router',
    # The two candidate models to route between (same model family and Region)
    models=[
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
        },
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelB>'
        },
    ],
    description='Custom prompt router for my application',
    # Acceptable response quality difference relative to the fallback model
    routingCriteria={
        'responseQualityDifference': 0.5
    },
    # The fallback model (one of the two candidate models above)
    fallbackModel={
        'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
    },
    tags=[
        {
            'key': 'project',
            'value': 'intelligent-prompt-routing'
        },
    ]
)
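
After the router is created, a minimal sketch of using it looks like the following: pass the prompt router's ARN (assumed here to be returned as promptRouterArn by the create call above) as the modelId in the Amazon Bedrock Runtime Converse API, and the router invokes whichever candidate model it selects. The prompt text is a placeholder.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

router_arn = response["promptRouterArn"]  # from create_prompt_router above

reply = bedrock_runtime.converse(
    modelId=router_arn,  # use the prompt router ARN in place of a model ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize the cost of sales section of this report."}],
        }
    ],
)

print(reply["output"]["message"]["content"][0]["text"])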

Caveats and best practices

When using intelligent prompt routing in Amazon Bedrock, note that:

  • Amazon Bedrock Intelligent Prompt Routing is optimized for English prompts for typical chat assistant use cases. For use with other languages or customized use cases, conduct your own tests before implementing prompt routing in production applications or reach out to your AWS account team for help designing and conducting these tests.
  • You can select only two models to be part of the router (pairwise routing), with one of these two models being the fallback model. These two models have to be in the same AWS Region.
  • When starting with Amazon Bedrock Intelligent Prompt Routing, we recommend that you experiment with the default routers provided by Amazon Bedrock before configuring custom routers. After you’ve experimented with the default routers, you can configure your own routers as needed for your use cases, evaluate the response quality in the playground, and use them in production applications if they meet your requirements.
  • Amazon Bedrock Intelligent Prompt Routing currently can’t adjust routing decisions or responses based on application-specific performance data, and it might not always provide the most optimal routing for unique or specialized, domain-specific use cases. Contact your AWS account team for help customizing it for specific use cases.

Conclusion

In this post, we explored Amazon Bedrock Intelligent Prompt Routing, highlighting its ability to help optimize both response quality and cost by dynamically routing requests between different foundation models. Benchmark results demonstrate significant cost savings and reduced latency while maintaining high-quality responses across model families. Whether you implement the pre-configured default routers or create custom configurations, Amazon Bedrock Intelligent Prompt Routing offers a powerful way to balance performance and efficiency in generative AI applications. As you implement this feature in your workflows, we recommend testing its effectiveness for your specific use cases to take full advantage of the flexibility it provides. To get started, see Understanding intelligent prompt routing in Amazon Bedrock.


About the authors

Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges with generative AI and deep learning on AWS services. Shreyas has a background in large-scale optimization and ML, and in the use of ML and reinforcement learning for accelerating optimization tasks.

Balasubramaniam Srinivasan is a Senior Applied Scientist at AWS, working on post-training methods for generative AI models. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and football (soccer).

Yun Zhou is an Applied Scientist at AWS, where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interests include generative models and sequential data modeling.

Haibo Ding is a senior applied scientist at the Amazon Machine Learning Solutions Lab. He is broadly interested in deep learning and natural language processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from the University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

Read More