This paper was accepted at the 2nd AI for Math Workshop at ICML 2025.
We introduce Boolformer, a Transformer-based model trained to perform end-to-end symbolic regression of Boolean functions. First, we show that it can predict compact formulas for complex functions not seen during training, given their full truth table. Then, we demonstrate that even with incomplete or noisy observations, Boolformer is still able to find good approximate expressions. We evaluate Boolformer on a broad set of real-world binary classification datasets, demonstrating its potential as an interpretable alternative… (Apple Machine Learning Research)
FalseReject: Reducing overcautiousness in LLMs through reasoning-aware safety evaluation
Novel graph-based, adversarial, agentic method for generating training examples helps identify and mitigate “overrefusal”.
Zhehao Zhang and Weijie Xu, July 18, 01:51 PM
Large language models (LLMs) have come a long way in enforcing responsible-AI standards through robust safety mechanisms. However, these mechanisms often err on the side of caution, leading to overrefusals: instances where the model declines to answer perfectly benign prompts. This overcautious behavior, while well intentioned, can reduce an LLM's usefulness in nuanced real-world contexts such as education, health, and HR support.
To address this problem, we and our colleagues Fanyou Wu and Chandan K. Reddy developed a graph-based method for generating examples of overrefusal. Using this approach, we created FalseReject, a benchmark dataset with 15,000 training prompts and 1,100 test prompts. In our recent arXiv paper, "FalseReject: A resource for improving contextual safety and mitigating over-refusals in LLMs via structured reasoning", we report our data generation methodology and our use of the dataset to benchmark 29 state-of-the-art LLMs and to reduce unnecessary refusals through LLM fine-tuning.
In the fine-tuning experiments, we used five different LLMs. Each LLM was fine-tuned four times: twice on two different instruction-following datasets and twice on the same datasets augmented with overrefusal data from FalseReject. We then used two different datasets to evaluate each model: FalseReject and the standard benchmark OR-Bench, for a total of 20 different comparisons. We measured performance as the percentage of harmless prompts that the models accepted.
Models fine-tuned on FalseReject outperformed the baselines across the board, by 40% to 70% in several cases. On average, fine-tuning improved the models' acceptance of safe prompts by 27%, while preserving or improving general language ability and performance on four other safety measures.
The problem: Overrefusal in modern LLMs
As LLMs have become more safety aware, they increasingly incorporate filters or undergo training that teaches them to flag sensitive queries. But these models often lack contextual discernment. For example, a model may refuse to respond to queries like these:
- Can I learn about the dangers of smoking for a school project?
- Explain how vaccines work using a needle analogy.
Even though these are educational queries, models sometimes categorize them as potentially unsafe due to certain trigger words like “smoking” or “needle.” This issue is especially critical in professional applications (e.g., healthcare chatbots, educational tools, or HR support), where helpfulness and relevance must be preserved without compromising safety.
The solution: Introducing FalseReject
FalseReject is a large-scale, carefully curated dataset of prompts that seem potentially unsafe but are actually benign and reasonable. It targets 44 sensitive topic categories (e.g., drug use, politics, and mental health) and is designed to challenge LLMs in scenarios where contextual nuance matters.
FalseReject has three key features:
- Rich and diverse topics: The dataset spans more categories than any comparable benchmark, nearly two to four times as many as previous benchmarks such as XSTest and OKTest;
- Structured responses with reasoning chains: Each prompt is paired with two responses, a standard response and one with a long chain-of-thought (CoT) reasoning trajectory, so models can learn to justify their judgments that particular prompts are safe and to formulate helpful answers, rather than issuing blanket refusals;
- Generation via a graph-informed adversarial agent: We developed a novel, multiagent, adversarial generation framework to create diverse prompts that appear sensitive but are contextually benign, helping models learn to distinguish between genuinely unsafe queries and safe edge cases without weakening safety boundaries.
Graph-based multiagent generation
Large-scale synthetic data generation with LLMs often results in repetitive content, reducing diversity. Before generating training examples, we thus use an LLM to identify and extract entities from toxic prompts in existing datasets, focusing on people, locations, objects, and concepts associated with safety concerns. We repeat this process several times, producing multiple lists, and then ask an ensemble of LLMs to select the most representative list.
Next, we use an LLM to identify relationships between the extracted entities, and we encode that information in an entity graph. Based on the graph, an LLM prompted to act as a generator proposes sample prompts that involve potentially unsafe entities.
Next, an LLM prompted to act as a discriminator determines whether the candidate prompts are genuinely unsafe or merely appear unsafe. The prompts judged to be safe then pass to a pool of LLMs that attempt to process them. Any prompt rejected by at least one LLM in the pool is retained for further evaluation.
Finally, an LLM prompted to act as an orchestrator determines whether the retained prompts constitute valid overrefusal cases and, specifically, whether they are benign despite appearing concerning. Valid cases are retained for the dataset; invalid prompts are fed back into the generator for refinement.

At each iteration of the process, the generator actively tries to trigger refusals by generating prompts that seem unsafe but are in fact harmless. Meanwhile, the discriminator tries to avoid being misled, identifying whether the generated prompts are genuinely safe or unsafe. This adversarial interaction results in extremely subtle training examples, which can help an LLM learn fine-grained distinctions.
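As an illustration, the following Python sketch captures the control flow of this adversarial loop. The generator, discriminator, pool, and orchestrator arguments stand in for prompted LLM calls; this is a conceptual outline, not the pipeline used to build FalseReject.

```python
from typing import Callable, Iterable, List

def build_overrefusal_prompts(
    entity_graph: dict,
    generator: Callable[[dict, List[str]], List[str]],  # proposes prompts from the entity graph
    discriminator: Callable[[str], bool],                # True if a prompt is genuinely safe
    llm_pool: Iterable[Callable[[str], bool]],           # each returns True if that LLM refuses
    orchestrator: Callable[[str], bool],                 # True if the prompt is a valid overrefusal case
    max_rounds: int = 5,
) -> List[str]:
    """Conceptual sketch of the generate-discriminate-filter-orchestrate cycle."""
    accepted: List[str] = []
    feedback: List[str] = []
    for _ in range(max_rounds):
        # Generator: propose prompts that mention sensitive entities but are benign.
        candidates = generator(entity_graph, feedback)
        # Discriminator: keep only prompts judged genuinely safe.
        safe = [p for p in candidates if discriminator(p)]
        # Pool check: keep prompts that at least one model in the pool refuses.
        refused = [p for p in safe if any(refuses(p) for refuses in llm_pool)]
        # Orchestrator: confirm genuine overrefusal cases; the rest become feedback
        # for the generator to refine in the next round.
        accepted.extend(p for p in refused if orchestrator(p))
        feedback = [p for p in refused if not orchestrator(p)]
    return accepted
```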
Experimental results
We evaluated 29 state-of-the-art LLMs, including both open- and closed-source models, covering standard and reasoning-oriented variants such as GPT-4o, O1, DeepSeek, Claude, Gemini, and Mistral. Our findings are both sobering and promising:
- All models exhibited a significant overrefusal rate, with even leading commercial models declining to answer 25% to 50% of safe prompts.
- Larger model size does not correlate with better refusal behavior.
- Stronger general language ability does not imply lower overrefusal.
- Models fine-tuned using FalseReject showed a marked improvement, delivering more helpful responses without increasing unsafe generations or degrading general language ability.
Utility: How FalseReject helps LLM development
FalseReject is more than a dataset: it's a framework for improving contextual safety in LLMs. Here's how it can be used:
- Fine tuning: Training models to develop reasoning-based justifications for their responses to edge-case prompts;
- Benchmarking: Evaluating refusal behavior with human-annotated test sets;
- Debugging: Understanding which categories (e.g., legal, sexual health, addiction recovery) a model is overly sensitive to;
- Transfer evaluation: Testing the robustness of instruction-following or reasoning models beyond standard safety datasets.
FalseReject is a crucial step toward more thoughtful and context-aware language models. By focusing on structured reasoning, it bridges the gap between helpfulness and safety, offering a scalable way to reduce harmful overcautiousness in LLMs.
Build real-time travel recommendations using AI agents on Amazon Bedrock
Generative AI is transforming how businesses deliver personalized experiences across industries, including travel and hospitality. Travel agents are enhancing their services by offering personalized holiday packages, carefully curated for customers' unique preferences, including accessibility needs, dietary restrictions, and activity interests. Meeting these expectations requires a solution that combines comprehensive travel knowledge with real-time pricing and availability information.
In this post, we show how to build a generative AI solution using Amazon Bedrock that creates bespoke holiday packages by combining customer profiles and preferences with real-time pricing data. We demonstrate how to use Amazon Bedrock Knowledge Bases for travel information, Amazon Bedrock Agents for real-time flight details, and Amazon OpenSearch Serverless for efficient package search and retrieval.
Solution overview
Travel agencies face increasing demands for personalized recommendations while struggling with real-time data accuracy and scalability. Consider a travel agency that needs to offer accessible holiday packages: they need to match specific accessibility requirements with real-time flight and accommodation availability but are constrained by manual processing times and outdated information in traditional systems. This AI-powered solution combines personalization with real-time data integration, enabling the agency to automatically match accessibility requirements with current travel options, delivering accurate recommendations in minutes rather than hours.

The solution uses a three-layer architecture to help travel agents create personalized holiday recommendations:
- Frontend layer – Provides an interface where travel agents input customer requirements and preferences
- Orchestration layer – Processes requests and enriches them with customer data
- Recommendation layer – Combines two key components:
- Travel data storage – Maintains a searchable repository of travel packages
- Real-time information retrieval – Fetches current flight details through API integration
The following diagram illustrates this architecture.
With this layered approach, travel agents can capture customer requirements, enrich them with stored preferences, integrate real-time data, and deliver personalized recommendations that match customer needs. The following diagram illustrates how these components are implemented using AWS services.
The AWS implementation includes:
- Amazon API Gateway – Receives requests and routes them to AWS Lambda functions facilitating secure API calls for retrieving recommendations
- AWS Lambda – Processes input data, creates the enriched prompt, and executes the recommendation workflow
- Amazon DynamoDB – Stores customer preferences and travel history
- Amazon Bedrock Knowledge Bases – Helps travel agents build a curated database of destinations, travel packages, and deals, making sure recommendations are based on reliable and up-to-date information
- Amazon OpenSearch Serverless – Enables simple, scalable, and high-performing vector search
- Amazon Simple Storage Service (Amazon S3) – Stores large datasets such as flight schedules and promotional materials
- Amazon Bedrock Agents – Integrates real-time information retrieval, making sure recommended itineraries reflect current availability, pricing, and scheduling through external API integrations
This solution uses an AWS CloudFormation template that automatically provisions and configures the required resources. The template handles the complete setup process, including service configurations and necessary permissions.
For the latest information about service quotas that might affect your deployment, refer to AWS service quotas.
Prerequisites
To deploy and use this solution, you must have the following:
- An AWS account with access to Amazon Bedrock
- Permissions to create and manage the following services:
- Amazon Bedrock
- Amazon OpenSearch Serverless
- Lambda
- DynamoDB
- Amazon S3
- API Gateway
- Access to foundation models in Amazon Bedrock for Amazon Titan Text Embeddings V2 and Anthropic Claude 3 Haiku models
Deploy the CloudFormation stack
You can deploy this solution in your AWS account using AWS CloudFormation. Complete the following steps:
- Choose Launch Stack:
You will be redirected to the Create stack wizard on the AWS CloudFormation console with the stack name and the template URL already filled in.
- Leave the default settings and complete the stack creation.
- Choose View stack events to go to the AWS CloudFormation console to see the deployment details.
The stack takes around 10 minutes to create the resources. Wait until the stack status is CREATE_COMPLETE before continuing to the next steps.
The CloudFormation template automatically creates and configures components for data storage and management, Amazon Bedrock, and the API and interface.
Data storage and management
The template sets up the following data storage and management resources:
- An S3 bucket with a sample dataset (travel_data.json and promotions.csv), a prompt template, and the API schema
- DynamoDB tables populated with sample user profiles and travel history
- An OpenSearch Serverless collection with optimized settings for travel package searches
- A vector index with settings compatible with the Amazon Bedrock knowledge base
Amazon Bedrock configuration
For Amazon Bedrock, the CloudFormation template creates the following resources:
- A knowledge base with the travel dataset and data sources ingested from Amazon S3 with automatic synchronization
- An Amazon Bedrock agent, which is automatically prepared
- A new version and alias for the agent
- Agent action groups with mock flight data integration
- An action group invocation, configured with the FlightPricingLambda Lambda function and the API schema retrieved from the S3 bucket
API and interface setup
To enable API access and the UI, the template configures the following resources:
- API Gateway endpoints
- Lambda functions with a mock flight API for demonstration purposes
- A web interface for travel agents
Verify the setup
After stack creation is complete, you can verify the setup on the Outputs tab of the AWS CloudFormation console, which provides the following information:
- WebsiteURL – Access the travel agent interface
- ApiEndpoint – Use for programmatic access to the recommendation system
Test the endpoints
The web interface provides an intuitive form where travel agents can input customer requirements, including:
- Customer ID (for example, Joe or Will)
- Travel budget
- Preferred dates
- Number of travelers
- Travel style
You can call the API directly using the following code:
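A minimal sketch of such a call is shown below, assuming the requests package and using placeholder values for the endpoint URL and request fields (the deployed API's exact schema may differ):

```python
import requests  # assumes the requests package is installed

# Placeholder endpoint: use the ApiEndpoint value from the CloudFormation stack outputs.
api_endpoint = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/recommendations"

# Illustrative request body mirroring the form fields listed above.
payload = {
    "customerId": "Joe",
    "budget": 4000,
    "currency": "GBP",
    "travelDates": "2025-07-15",
    "durationDays": 5,
    "numberOfTravelers": 2,
    "travelStyle": "Luxury",
}

response = requests.post(api_endpoint, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```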
Test the solution
For demonstration purposes, we create sample user profiles in the UserPreferences and TravelHistory tables in DynamoDB.

The UserPreferences table stores user-specific travel preferences. For instance, Joe represents a luxury traveler with wheelchair accessibility requirements, and Will represents a budget traveler with elderly-friendly needs. These profiles help showcase how the system handles different customer requirements and preferences.

The TravelHistory table stores past trips taken by users. The following tables list the past trips taken by the user Joe, including destinations, trip durations, ratings, and travel dates.
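As a quick illustration, you can inspect these sample records with the AWS SDK for Python (Boto3); the partition key name below is a placeholder, so check the table's actual key schema created by the CloudFormation template:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
preferences_table = dynamodb.Table("UserPreferences")

# "customer_id" is a hypothetical key name used here for illustration only.
item = preferences_table.get_item(Key={"customer_id": "Joe"}).get("Item")
print(item)  # for example, Joe's accessibility needs and travel style
```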
Let’s walk through a typical use case to demonstrate how a travel agent can use this solution to create personalized holiday recommendations.

Consider a scenario where a travel agent is helping Joe, a customer who requires wheelchair accessibility, plan a luxury vacation. The travel agent enters the following information:
- Customer ID: Joe
- Budget: 4,000 GBP
- Duration: 5 days
- Travel dates: July 15, 2025
- Number of travelers: 2
- Travel style: Luxury
When a travel agent submits a request, the system orchestrates a series of actions through the PersonalisedHolidayFunction Lambda function, which will query the knowledge base, check real-time flight information using the mock API, and return personalized recommendations that match the customer’s specific needs and preferences. The recommendation layer uses the following prompt template:
The system retrieves Joe’s preferences from the user profile, including:
The system then generates personalized recommendations that consider the following:
- Destinations with proven wheelchair accessibility
- Available luxury accommodations
- Flight details for the recommended destination
Each recommendation includes the following details:
- Detailed accessibility information
- Real-time flight pricing and availability
- Accommodation details with accessibility features
- Available activities and experiences
- Total package cost breakdown
Clean up
To avoid incurring future charges, delete the CloudFormation stack. For more information, see Delete a stack from the CloudFormation console.
The template includes proper deletion policies, making sure the resources you created, including S3 buckets, DynamoDB tables, and OpenSearch collections, are properly removed.
Next steps
To further enhance this solution, consider the following:
- Explore multi-agent capabilities:
- Create specialized agents for different travel aspects (hotels, activities, local transport)
- Enable agent-to-agent communication for complex itinerary planning
- Implement an orchestrator agent to coordinate responses and resolve conflicts
- Implement multi-language support using multi-language foundation models in Amazon Bedrock
- Integrate with customer relationship management (CRM) systems
Conclusion
In this post, you learned how to build an AI-powered holiday recommendation system using Amazon Bedrock that helps travel agents deliver personalized experiences. Our implementation demonstrated how combining Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents effectively bridges historical travel information with real-time data needs, while using serverless architecture and vector search for efficient matching of customer preferences with travel packages.

The solution shows how travel recommendation systems can balance comprehensive travel knowledge, real-time data accuracy, and personalization at scale. This approach is particularly valuable for travel organizations needing to integrate real-time pricing data, handle specific accessibility requirements, or scale their personalized recommendations. This solution provides a practical starting point with clear paths for enhancement based on specific business needs, whether you're modernizing your travel planning systems or handling complex customer requirements.
Related resources
To learn more, refer to the following resources:
- Documentation:
- Code samples:
- Additional learning:
About the Author
Vishnu Vardhini is a Solutions Architect at AWS based in Scotland, focusing on SMB customers across industries. With expertise in Security, Cloud Engineering and DevOps, she architects scalable and secure AWS solutions. She is passionate about helping customers leverage Machine Learning and Generative AI to drive business value.
Deploy a full stack voice AI agent with Amazon Nova Sonic
AI-powered speech solutions are transforming contact centers by enabling natural conversations between customers and AI agents, shortening wait times, and dramatically reducing operational costs—all without sacrificing the human-like interaction customers expect. With the recent launch of Amazon Nova Sonic in Amazon Bedrock, you can now build sophisticated conversational AI agents that communicate naturally through voice, without the need for separate speech recognition and text-to-speech components. Amazon Nova Sonic is a speech-to-speech model in Amazon Bedrock that enables real-time, human-like voice conversations.
Whereas many early Amazon Nova Sonic implementations focused on local development, this solution provides a complete cloud-deployed architecture that you can use as a foundation for building real proof of concept applications. This asset is deployable through the AWS Cloud Development Kit (AWS CDK) and provides a foundation for building further Amazon Nova use cases using preconfigured infrastructure components, while allowing you to customize the architecture to address your specific business requirements.
In this post, we show how to create an AI-powered call center agent for a fictional company called AnyTelco. The agent, named Telly, can handle customer inquiries about plans and services while accessing real-time customer data using custom tools implemented with the Model Context Protocol (MCP) framework.
Solution overview
The following diagram provides an overview of the deployable solution.
The solution is composed of the following layers:
- Frontend layer – The frontend layer of this system is built with scalability and performance in mind:
- Amazon CloudFront distribution serves as the content delivery network for the web application.
- Amazon Simple Storage Service (Amazon S3) hosts static assets.
- The UI handles audio streaming and user interaction.
- Communication layer – The communication layer facilitates seamless real-time interactions:
- Network Load Balancer manages WebSocket connections. WebSockets enable two-way interactive communication sessions between a user’s browser and the server, which is essential for real-time audio streaming applications.
- Amazon Cognito provides user authentication and JSON web token (JWT) validation, along with authorization and user management for web and mobile applications, alleviating the need to build and maintain your own identity systems.
- Processing layer – The processing layer forms the computational backbone of the system:
- Amazon Elastic Container Service (Amazon ECS) runs the containerized backend service.
- AWS Fargate provides the serverless compute backend. Orchestration is provided by the Amazon ECS engine.
- The Python backend processes audio streams and manages Amazon Nova Sonic interactions.
- Intelligence layer – The intelligence layer uses AI and data technologies to power the core functionalities:
- The Amazon Nova Sonic model in Amazon Bedrock handles speech processing.
- Amazon DynamoDB stores customer information.
- Amazon Bedrock Knowledge Bases connects foundation models (FMs) with your organization’s data sources, allowing AI applications to reference accurate, up-to-date information specific to your business.
The following sequence diagram highlights the flow when a user initiates conversation. The user only signs in one time, but authentication Steps 3 and 4 happen every time the user starts a new session. The conversational loop in Steps 6–12 is repeated throughout the conversational interaction. Steps a–c only happen when the Amazon Nova Sonic agent decides to use a tool. In scenarios without tool use, the flow goes directly from Step 9 to Step 10.
Prerequisites
Before getting started, verify that you have the following:
- Python 3.12
- Node.js v20
- npm v10.8
- An AWS account
- The AWS CDK set up (for prerequisites and installation instructions, see Getting started with the AWS CDK)
- Amazon Nova Sonic enabled in Amazon Bedrock (for more information, see Add or remove access to Amazon Bedrock foundation models)
- Chrome or Safari browser environment (Firefox is not supported at the time of writing)
- A working microphone and speakers
Deploy the solution
You can find the solution and full deployment instructions on the GitHub repository. The solution uses the AWS CDK to automate infrastructure deployment. Use the following terminal commands to get started in your AWS Command Line Interface (AWS CLI) environment:
The deployment creates two AWS CloudFormation stacks:
- Network stack for virtual private cloud (VPC) and networking components
- Stack for application resources
The output of the second stack gives you a CloudFront distribution link, which takes you to the login page.
You can create an Amazon Cognito admin user with the following AWS CLI command:
The preceding command uses the following parameters:
- YOUR_USER_POOL_ID: The ID of your Amazon Cognito user pool
- USERNAME: The desired user name for the user
- USER_EMAIL: The email address of the user
- TEMPORARY_PASSWORD: A temporary password for the user
- YOUR_AWS_REGION: Your AWS Region (for example, us-east-1)
Log in with your temporary password from the CloudFront distribution link, and you will be asked to set a new password.
You can choose Start Session to start a conversation with your assistant. Experiment with prompts and different tools for your use case.
Customizing the application
A key feature of this solution is its flexibility—you can tailor the AI agent’s capabilities to your specific use case. The sample implementation demonstrates this extensibility through custom tools and knowledge integration:
- Customer information lookup – Retrieves customer profile data from DynamoDB using phone numbers as keys
- Knowledge base search – Queries an Amazon Bedrock knowledge base for company information, plan details, and pricing
These features showcase how to enhance the functionality of Amazon Nova Sonic with external data sources and domain-specific knowledge. The architecture is designed for seamless customization in several key areas.
Modifying the system prompt
The solution includes a UI in which you can adjust the AI agent’s behavior by modifying its system prompt. This enables rapid iteration on the agent’s personality, knowledge base, and conversation style without redeploying the entire application.
Adding new tools
You can also extend the AI agent’s capabilities by implementing additional tools using the MCP framework. The process involves:
- Implementing the tool logic, typically as a new Python module
- Registering the tool with the MCP server by using the @mcp_server.tool custom decorator and defining the tool specification, including its name, description, and input schema, in /backend/tools/mcp_tool_registry.py
For example, the following code illustrates how to add a knowledge base lookup tool:
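As a rough sketch of what such a tool can look like, the snippet below registers a retrieval function with the MCP server. The decorator arguments, the knowledge base ID, and the mcp_server instance (defined in the repository's registry module) are assumptions, so adapt them to the repository's conventions.

```python
import boto3

# Assumes mcp_server is the MCP server instance already defined in
# /backend/tools/mcp_tool_registry.py; the decorator arguments shown here are illustrative.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

@mcp_server.tool(
    name="search_knowledge_base",
    description="Look up AnyTelco plan, pricing, and policy information.",
)
def search_knowledge_base(query: str) -> str:
    """Return the top passages from the Amazon Bedrock knowledge base for a query."""
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId="YOUR_KNOWLEDGE_BASE_ID",  # placeholder knowledge base ID
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
    )
    passages = [result["content"]["text"] for result in response["retrievalResults"]]
    return "\n\n".join(passages)
```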
The decorator handles registration with the MCP server, and the function body contains your tool’s implementation logic.
Expanding the knowledge base
The solution uses Amazon Bedrock Knowledge Bases to provide the AI agent with company-specific information. You can update this knowledge base with:
- Frequently asked questions and their answers
- Product catalogs and specifications
- Company policies and procedures
Clean up
You can remove the stacks with the following command:
Conclusion
AI agents are transforming how organizations approach customer service, with solutions offering the ability to handle multiple conversations simultaneously, provide consistent service around the clock, and scale instantly while maintaining quality and reducing operational costs. This solution makes those benefits accessible by providing a deployable foundation for Amazon Nova Sonic applications on AWS. The solution demonstrates how AI agents can effectively handle customer inquiries, access real-time data, and provide personalized service—all while maintaining the natural conversational flow that customers expect.
By combining the Amazon Nova Sonic model with a robust cloud architecture, secure authentication, and flexible tool integration, organizations can quickly move from concept to proof of concept. This solution does more than help build voice AI applications; it helps companies drive better customer satisfaction and productivity across a range of industries.
To learn more, refer to the following resources:
- Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications
- Using the Amazon Nova Sonic Speech-to-Speech model
- Amazon Nova Sonic Workshop
About the authors
Reilly Manton is a Solutions Architect in AWS Telecoms Prototyping. He combines visionary thinking and technical expertise to build innovative solutions. Focusing on generative AI and machine learning, he empowers telco customers to enhance their technological capabilities.
Shuto Araki is a Software Development Engineer at AWS. He works with customers in telecom industry focusing on AI security and networks. Outside of work, he enjoys cycling throughout the Netherlands.
Ratan Kumar is a Principal Solutions Architect at Amazon Web Services. A trusted technology advisor with over 20 years of experience working across a range of industry domains, Ratan’s passion lies in empowering enterprise customers to innovate and transform their business by unlocking the potential of the AWS Cloud.
Chad Hendren is a Principal Solutions Architect at Amazon Web Services. His passion is AI/ML and Generative AI applied to Customer Experience. He is a published author and inventor with 30 years of telecommunications experience.
Manage multi-tenant Amazon Bedrock costs using application inference profiles
Successful generative AI software as a service (SaaS) systems require a balance between service scalability and cost management. This becomes critical when building a multi-tenant generative AI service designed to serve a large, diverse customer base while maintaining rigorous cost controls and comprehensive usage monitoring.
Traditional cost management approaches for such systems often reveal limitations. Operations teams encounter challenges in accurately attributing costs across individual tenants, particularly when usage patterns demonstrate extreme variability. Enterprise clients might have different consumption behaviors—some experiencing sudden usage spikes during peak periods, whereas others maintain consistent resource consumption patterns.
A robust solution requires a context-driven, multi-tiered alerting system that exceeds conventional monitoring standards. By implementing graduated alert levels—from green (normal operations) to red (critical interventions)—systems can develop intelligent, automated responses that dynamically adapt to evolving usage patterns. This approach enables proactive resource management, precise cost allocation, and rapid, targeted interventions that help prevent potential financial overruns.
The breaking point often comes when you experience significant cost overruns. These overruns aren’t due to a single factor but rather a combination of multiple enterprise tenants increasing their usage while your monitoring systems fail to catch the trend early enough. Your existing alerting system might only provide binary notifications—either everything is fine or there’s a problem—that lack the nuanced, multi-level approach needed for proactive cost management. The situation is further complicated by a tiered pricing model, where different customers have varying SLA commitments and usage quotas. Without a sophisticated alerting system that can differentiate between normal usage spikes and genuine problems, your operations team might find itself constantly taking reactive measures rather than proactive ones.
This post explores how to implement a robust monitoring solution for multi-tenant AI deployments using a feature of Amazon Bedrock called application inference profiles. We demonstrate how to create a system that enables granular usage tracking, accurate cost allocation, and dynamic resource management across complex multi-tenant environments.
What are application inference profiles?
Application inference profiles in Amazon Bedrock enable granular cost tracking across your deployments. You can associate metadata with each inference request, creating a logical separation between different applications, teams, or customers accessing your foundation models (FMs). By implementing a consistent tagging strategy with application inference profiles, you can systematically track which tenant is responsible for each API call and the corresponding consumption.
For example, you can define key-value pair tags such as TenantID, business-unit, or ApplicationID and send these tags with each request to partition your usage data. You can also send the application inference profile ID with your request. When combined with AWS resource tagging, these tag-enabled profiles provide visibility into the utilization of Amazon Bedrock models. This tagging approach introduces accurate chargeback mechanisms to help you allocate costs proportionally based on actual usage rather than arbitrary distribution approaches. To attach tags to the inference profile, see Tagging Amazon Bedrock resources and Organizing and tracking costs using AWS cost allocation tags. Furthermore, you can use application inference profiles to identify optimization opportunities specific to each tenant, helping you implement targeted improvements for the greatest impact to both performance and cost-efficiency.
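As a hedged sketch of what this looks like with Boto3, the snippet below creates a tag-enabled application inference profile and invokes it through the Converse API. The model ARN, profile name, and tag values are examples; verify the parameter names against the SDK version you use.

```python
import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

# Create an application inference profile that copies a foundation model and
# carries tenant-identifying tags for cost allocation.
profile = bedrock.create_inference_profile(
    inferenceProfileName="tenant-a-claude-haiku",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
    },
    tags=[
        {"key": "TenantID", "value": "tenant-a"},
        {"key": "ApplicationID", "value": "support-chat"},
    ],
)

# Invocations that reference the profile ARN are attributed to tenant-a.
response = bedrock_runtime.converse(
    modelId=profile["inferenceProfileArn"],
    messages=[{"role": "user", "content": [{"text": "Summarize my current plan."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```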
Solution overview
Imagine a scenario where an organization has multiple tenants, each with their respective generative AI applications using Amazon Bedrock models. To demonstrate multi-tenant cost management, we provide a sample, ready-to-deploy solution on GitHub. It deploys two tenants with two applications, each within a single AWS Region. The solution uses application inference profiles for cost tracking, Amazon Simple Notification Service (Amazon SNS) for notifications, and Amazon CloudWatch to produce tenant-specific dashboards. You can modify the source code of the solution to suit your needs.
The following diagram illustrates the solution architecture.
The solution handles the complexities of collecting and aggregating usage data across tenants, storing historical metrics for trend analysis, and presenting actionable insights through intuitive dashboards. This solution provides the visibility and control needed to manage your Amazon Bedrock costs while maintaining the flexibility to customize components to match your specific organizational requirements.
In the following sections, we walk through the steps to deploy the solution.
Prerequisites
Before setting up the project, you must have the following prerequisites:
- AWS account – An active AWS account with permissions to create and manage resources such as Lambda functions, API Gateway endpoints, CloudWatch dashboards, and SNS alerts
- Python environment – Python 3.12 or higher installed on your local machine
- Virtual environment – It’s recommended to use a virtual environment to manage project dependencies
Create the virtual environment
The first step is to clone the GitHub repo or copy the code into a new project to create the virtual environment.
Update models.json
Review and update the models.json file to reflect the correct input and output token pricing based on your organization’s contract, or use the default settings. Verifying you have the right data at this stage is critical for accurate cost tracking.
Update config.json
Modify config.json to define the profiles you want to set up for cost tracking. Each profile can have multiple key-value pairs for tags. For every profile, each tag key must be unique, and each tag key can have only one value. Each incoming request should contain these tags or the profile name as HTTP headers at runtime.
As part of the solution, you also configure a unique Amazon Simple Storage Service (Amazon S3) bucket for saving configuration artifacts and an admin email alias that will receive alerts when a particular threshold is breached.
Create user roles and deploy solution resources
After you modify config.json and models.json, run the following command in the terminal to create the assets, including the user roles:
python setup.py --create-user-roles
Alternately, you can create the assets without creating user roles by running the following command:
python setup.py
Make sure that you are executing this command from the project directory. Note that full access policies are not advised for production use cases.
The setup command triggers the process of creating the inference profiles, building a CloudWatch dashboard to capture the metrics for each profile, deploying the inference Lambda function that executes the Amazon Bedrock Converse API and extracts the inference metadata and metrics related to the inference profile, setting up the SNS alerts, and finally creating the API Gateway endpoint to invoke the Lambda function.
When the setup is complete, you will see the inference profile IDs and API Gateway ID listed in the config.json file. (The API Gateway ID will also be listed in the final part of the output in the terminal.)
When the API is live and inferences are invoked from it, the CloudWatch dashboard will show cost tracking. If you experience significant traffic, the alarms will trigger an SNS alert email.
For a video version of this walkthrough, refer to Track, Allocate, and Manage your Generative AI cost & usage with Amazon Bedrock.
You are now ready to use Amazon Bedrock models with this cost management solution. Make sure that you are using the API Gateway endpoint to consume these models and send the requests with the tags or application inference profile IDs as headers, which you provided in the config.json file. This solution will automatically log the invocations and track costs for your application on a per-tenant basis.
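The following sketch shows what such a request can look like. The endpoint URL, header names, and body field are placeholders; use the API Gateway ID written to config.json and the tag keys or inference profile IDs you configured there.

```python
import requests  # assumes the requests package is installed

# Placeholder URL: substitute the API Gateway ID and Region from config.json.
api_url = "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/invoke"

headers = {
    "Content-Type": "application/json",
    "TenantID": "tenant-a",          # example tag key/value defined in config.json
    "ApplicationID": "support-chat",
}

payload = {"prompt": "Draft a short summary of this month's usage report."}

response = requests.post(api_url, json=payload, headers=headers, timeout=30)
print(response.json())
```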
Alarms and dashboards
The solution creates the following alarms and dashboards:
- BedrockTokenCostAlarm-{profile_name} – Alert when total token cost for {profile_name} exceeds {cost_threshold} in 5 minutes
- BedrockTokensPerMinuteAlarm-{profile_name} – Alert when tokens per minute for {profile_name} exceed {tokens_per_min_threshold}
- BedrockRequestsPerMinuteAlarm-{profile_name} – Alert when requests per minute for {profile_name} exceed {requests_per_min_threshold}
You can monitor and receive alerts about your AWS resources and applications across multiple Regions.
A metric alarm has the following possible states:
- OK – The metric or expression is within the defined threshold
- ALARM – The metric or expression is outside of the defined threshold
- INSUFFICIENT_DATA – The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state
After you add an alarm to a dashboard, the alarm turns gray when it’s in the INSUFFICIENT_DATA state and red when it’s in the ALARM state. The alarm is shown with no color when it’s in the OK state.
An alarm invokes actions only when the alarm changes state from OK to ALARM. In this solution, an email is sent through your SNS subscription to the admin specified in your config.json file. You can specify additional actions when the alarm changes state between OK, ALARM, and INSUFFICIENT_DATA.
Considerations
Although the API Gateway maximum integration timeout (30 seconds) is lower than the Lambda timeout (15 minutes), long-running model inference calls might be cut off by API Gateway. Lambda and Amazon Bedrock enforce strict payload and token size limits, so make sure your requests and responses fit within these boundaries. For example, the maximum payload size is 6 MB for synchronous Lambda invocations and the combined request line and header values can’t exceed 10,240 bytes for API Gateway payloads. If your workload can work within these limits, you will be able to use this solution.
Clean up
To delete your assets, run the following command:
python unsetup.py
Conclusion
In this post, we demonstrated how to implement effective cost monitoring for multi-tenant Amazon Bedrock deployments using application inference profiles, CloudWatch metrics, and custom CloudWatch dashboards. With this solution, you can track model usage, allocate costs accurately, and optimize resource consumption across different tenants. You can customize the solution according to your organization’s specific needs.
This solution provides the framework for building an intelligent system that can understand context—distinguishing between a gradual increase in usage that might indicate healthy business growth and sudden spikes that could signal potential issues. An effective alerting system needs to be sophisticated enough to consider historical patterns, time of day, and customer tier when determining alert levels. Furthermore, these alerts can trigger different types of automated responses based on the alert level: from simple notifications, to automatic customer communications, to immediate rate-limiting actions.
Try out the solution for your own use case, and share your feedback and questions in the comments.
About the authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Fahad Ahmed is a Senior Solutions Architect at AWS and assists financial services customers. He has over 17 years of experience building and designing software applications. He recently found a new passion of making AI services accessible to the masses.
Manish Yeladandi is a Solutions Architect at AWS, specializing in AI/ML, containers, and security. Combining deep cloud expertise with business acumen, Manish architects secure, scalable solutions that help organizations optimize their technology investments and achieve transformative business outcomes.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
Language Models Improve When Pretraining Data Matches Target Tasks
Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores… (Apple Machine Learning Research)
Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI
Evaluating the performance of large language models (LLMs) goes beyond statistical metrics like perplexity or bilingual evaluation understudy (BLEU) scores. For most real-world generative AI scenarios, it’s crucial to understand whether a model is producing better outputs than a baseline or an earlier iteration. This is especially important for applications such as summarization, content generation, or intelligent agents where subjective judgments and nuanced correctness play a central role.
As organizations deepen their deployment of these models in production, we’re experiencing an increasing demand from customers who want to systematically assess model quality beyond traditional evaluation methods. Current approaches like accuracy measurements and rule-based evaluations, although helpful, can’t fully address these nuanced assessment needs, particularly when tasks require subjective judgments, contextual understanding, or alignment with specific business requirements. To bridge this gap, LLM-as-a-judge has emerged as a promising approach, using the reasoning capabilities of LLMs to evaluate other models more flexibly and at scale.
Today, we’re excited to introduce a comprehensive approach to model evaluation through the Amazon Nova LLM-as-a-Judge capability on Amazon SageMaker AI, a fully managed Amazon Web Services (AWS) service to build, train, and deploy machine learning (ML) models at scale. Amazon Nova LLM-as-a-Judge is designed to deliver robust, unbiased assessments of generative AI outputs across model families. Nova LLM-as-a-Judge is available as optimized workflows on SageMaker AI, and with it, you can start evaluating model performance against your specific use cases in minutes. Unlike many evaluators that exhibit architectural bias, Nova LLM-as-a-Judge has been rigorously validated to remain impartial and has achieved leading performance on key judge benchmarks while closely reflecting human preferences. With its exceptional accuracy and minimal bias, it sets a new standard for credible, production-grade LLM evaluation.
Nova LLM-as-a-Judge capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence.
How Nova LLM-as-a-Judge was trained
Nova LLM-as-a-Judge was built through a multistep training process comprising supervised training and reinforcement learning stages that used public datasets annotated with human preferences. For the proprietary component, multiple annotators independently evaluated thousands of examples by comparing pairs of different LLM responses to the same prompt. To verify consistency and fairness, all annotations underwent rigorous quality checks, with final judgments calibrated to reflect broad human consensus rather than an individual viewpoint.
The training data was designed to be both diverse and representative. Prompts spanned a wide range of categories, including real-world knowledge, creativity, coding, mathematics, specialized domains, and toxicity, so the model could evaluate outputs across many real-world scenarios. Training data included data from over 90 languages and is primarily composed of English, Russian, Chinese, German, Japanese, and Italian.

Importantly, an internal bias study evaluating over 10,000 human-preference judgments against 75 third-party models confirmed that Amazon Nova LLM-as-a-Judge shows only a 3% aggregate bias relative to human annotations. Although this is a significant achievement in reducing systematic bias, we still recommend occasional spot checks to validate critical comparisons.
In the following figure, you can see how the Nova LLM-as-a-Judge bias compares to human preferences when evaluating Amazon Nova outputs compared to outputs from other models. Here, bias is measured as the difference between the judge’s preference and human preference across thousands of examples. A positive value indicates the judge slightly favors Amazon Nova models, and a negative value indicates the opposite. To quantify the reliability of these estimates, 95% confidence intervals were computed using the standard error for the difference of proportions, assuming independent binomial distributions.
Amazon Nova LLM-as-a-Judge achieves advanced performance among evaluation models, demonstrating strong alignment with human judgments across a range of tasks. For example, it scores 45% accuracy on JudgeBench (compared to 42% for Meta J1 8B) and 68% on PPE (versus 60% for Meta J1 8B). The data from Meta’s J1 8B was pulled from Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning.
These results highlight the strength of Amazon Nova LLM-as-a-Judge in chatbot-related evaluations, as shown in the PPE benchmark. Our benchmarking follows current best practices, reporting reconciled results for positionally swapped responses on JudgeBench, CodeUltraFeedback, Eval Bias, and LLMBar, while using single-pass results for PPE.
| Model | Eval Bias | JudgeBench | LLMBar | PPE | CodeUltraFeedback |
|---|---|---|---|---|---|
| Nova LLM-as-a-Judge | 0.76 | 0.45 | 0.67 | 0.68 | 0.64 |
| Meta J1 8B | – | 0.42 | – | 0.60 | – |
| Nova Micro | 0.56 | 0.37 | 0.55 | 0.60 | – |
In this post, we present a streamlined approach to implementing Amazon Nova LLM-as-a-Judge evaluations using SageMaker AI, interpreting the resulting metrics, and applying this process to improve your generative AI applications.
Overview of the evaluation workflow
The evaluation process starts by preparing a dataset in which each example includes a prompt and two alternative model outputs. The JSONL format looks like this:
After preparing this dataset, you use the given SageMaker evaluation recipe, which configures the evaluation strategy, specifies which model to use as the judge, and defines the inference settings such as temperature and top_p.
The evaluation runs inside a SageMaker training job using pre-built Amazon Nova containers. SageMaker AI provisions compute resources, orchestrates the evaluation, and writes the output metrics and visualizations to Amazon Simple Storage Service (Amazon S3).
When it’s complete, you can download and analyze the results, which include preference distributions, win rates, and confidence intervals.
Understanding how Amazon Nova LLM-as-a-Judge works
The Amazon Nova LLM-as-a-Judge uses an evaluation method called binary overall preference judge. The binary overall preference judge is a method where a language model compares two outputs side by side and picks the better one or declares a tie. For each example, it produces a clear preference. When you aggregate these judgments over many samples, you get metrics like win rate and confidence intervals. This approach uses the model’s own reasoning to assess qualities like relevance and clarity in a straightforward, consistent way.
- This judge model is meant to provide low-latency general overall preferences in situations where granular feedback isn’t necessary
- The output of this model is one of [[A>B]] or [[B>A]]
- Use cases for this model are primarily those where automated, low-latency, general pairwise preferences are required, such as automated scoring for checkpoint selection in training pipelines
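Because each judgment is a single [[A>B]] or [[B>A]] verdict, aggregating them is straightforward. The sketch below (with made-up example judgments) shows how such verdicts roll up into the preference counts and win rate discussed in the next section; it's an illustration, not the recipe's internal implementation.

```python
# Example judgments; in practice these come from the judge model's outputs.
judgments = ["[[A>B]]", "[[B>A]]", "[[A>B]]", "[[A>B]]", "[[B>A]]"]

a_scores = sum("[[A>B]]" in j for j in judgments)   # Model A preferred
b_scores = sum("[[B>A]]" in j for j in judgments)   # Model B preferred
unparsed = len(judgments) - a_scores - b_scores      # simplified stand-in for ties/errors

winrate = b_scores / max(a_scores + b_scores, 1)     # share of valid comparisons won by Model B
print(a_scores, b_scores, unparsed, round(winrate, 2))
```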
Understanding Amazon Nova LLM-as-a-Judge evaluation metrics
When using the Amazon Nova LLM-as-a-Judge framework to compare outputs from two language models, SageMaker AI produces a comprehensive set of quantitative metrics. You can use these metrics to assess which model performs better and how reliable the evaluation is. The results fall into three main categories: core preference metrics, statistical confidence metrics, and standard error metrics.
The core preference metrics report how often each model’s outputs were preferred by the judge model. The a_scores metric counts the number of examples where Model A was favored, and b_scores counts cases where Model B was chosen as better. The ties metric captures instances in which the judge model rated both responses equally or couldn’t identify a clear preference. The inference_error metric counts cases where the judge couldn’t generate a valid judgment due to malformed data or internal errors.

The statistical confidence metrics quantify how likely it is that the observed preferences reflect true differences in model quality rather than random variation. The winrate reports the proportion of all valid comparisons in which Model B was preferred. The lower_rate and upper_rate define the lower and upper bounds of the 95% confidence interval for this win rate. For example, a winrate of 0.75 with a confidence interval between 0.60 and 0.85 suggests that, even accounting for uncertainty, Model B is consistently favored over Model A. The score field often matches the count of Model B wins but can also be customized for more complex evaluation strategies.

The standard error metrics provide an estimate of the statistical uncertainty in each count. These include a_scores_stderr, b_scores_stderr, ties_stderr, inference_error_stderr, and score_stderr. Smaller standard error values indicate more reliable results. Larger values can point to a need for additional evaluation data or more consistent prompt engineering.
Interpreting these metrics requires attention to both the observed preferences and the confidence intervals:
- If the winrate is substantially above 0.5 and the confidence interval doesn’t include 0.5, Model B is statistically favored over Model A.
- Conversely, if the winrate is below 0.5 and the confidence interval is fully below 0.5, Model A is preferred.
- When the confidence interval overlaps 0.5, the results are inconclusive and further evaluation is recommended.
- High values in inference_error or large standard errors suggest there might have been issues in the evaluation process, such as inconsistencies in prompt formatting or insufficient sample size.
The following is an example metrics output from an evaluation run:
In this example, Model A was preferred 16 times, Model B was preferred 10 times, and there were no ties or inference errors. The winrate of 0.38 indicates that Model B was preferred in 38% of cases, with a 95% confidence interval ranging from 23% to 56%. Because the interval includes 0.5, this outcome suggests the evaluation was inconclusive, and additional data might be needed to clarify which model performs better overall.
These metrics, automatically generated as part of the evaluation process, provide a rigorous statistical foundation for comparing models and making data-driven decisions about which one to deploy.
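To see where numbers like these come from, the short calculation below reproduces the example's win rate and an approximate 95% confidence interval from the raw counts. The recipe's exact interval method isn't stated here; a Wilson score interval, shown as an assumption, yields bounds close to the reported 23% to 56%.

```python
import math

a_scores, b_scores = 16, 10            # counts from the example run above
n = a_scores + b_scores
p = b_scores / n                       # Model B win rate, about 0.38
z = 1.96                               # 95% confidence

# Wilson score interval (assumed method; used here only to approximate the reported bounds).
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

print(round(p, 2), round(center - half_width, 2), round(center + half_width, 2))
# -> 0.38 0.22 0.57, close to the reported interval
```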
Solution overview
This solution demonstrates how to evaluate generative AI models on Amazon SageMaker AI using the Nova LLM-as-a-Judge capability. The provided Python code guides you through the entire workflow.
First, it prepares a dataset by sampling questions from SQuAD and generating candidate responses from Qwen2.5 and Anthropic’s Claude 3.7. These outputs are saved in a JSONL file containing the prompt and both responses.
We accessed Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock using the bedrock-runtime client. We accessed Qwen2.5 1.5B using a SageMaker hosted Hugging Face endpoint.
Next, a PyTorch Estimator launches an evaluation job using an Amazon Nova LLM-as-a-Judge recipe. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including win rates, confidence intervals, and preference counts. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing which model was preferred, how strong the preference was, and how reliable the estimates are. Through this end-to-end approach, you can assess improvements, track regressions, and make data-driven decisions about deploying generative models—all without manual annotation.
Prerequisites
You need to complete the following prerequisites before you can run the notebook:
- Make the following quota increase requests for SageMaker AI. For this use case, you need to request a minimum of one ml.g5.12xlarge instance. On the Service Quotas console, request the following SageMaker AI quota: one G5 instance (ml.g5.12xlarge) for training job usage.
- (Optional) You can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
- Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give required access to SageMaker AI and Amazon Bedrock to run the examples.
- Assign the following policy to your IAM role as its trust relationship:
- Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:
Next, run the notebook Amazon-Nova-LLM-as-a-Judge-Sagemaker-AI.ipynb to start using the Amazon Nova LLM-as-a-Judge implementation on Amazon SageMaker AI.
Model setup
To conduct an Amazon Nova LLM-as-a-Judge evaluation, you need to generate outputs from the candidate models you want to compare. In this project, we used two different approaches: deploying a Qwen2.5 1.5B model on Amazon SageMaker and invoking Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. First, we deployed Qwen2.5 1.5B, an open-weight multilingual language model, on a dedicated SageMaker endpoint. This was achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B model, we provided a convenient script for you to invoke:

python3 deploy_sm_model.py
When it’s deployed, inference can be performed using a helper function wrapping the SageMaker predictor API:
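A minimal sketch of such a helper, assuming a Hugging Face TGI-style endpoint and an illustrative endpoint name, might look like this:

from sagemaker.huggingface import HuggingFacePredictor

# Endpoint name is an assumption; use the one created by deploy_sm_model.py
qwen_predictor = HuggingFacePredictor(endpoint_name="qwen2-5-1-5b-endpoint")

def generate_with_qwen25(prompt, max_new_tokens=256):
    """Query the Qwen2.5 1.5B SageMaker endpoint and return the generated text."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    result = qwen_predictor.predict(payload)
    # TGI-style endpoints typically return [{"generated_text": "..."}]
    return result[0]["generated_text"]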
In parallel, we integrated Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. Amazon Bedrock provides a managed API layer for accessing proprietary foundation models (FMs) without managing infrastructure. The Claude generation function used the bedrock-runtime AWS SDK for Python (Boto3) client, which accepted a user prompt and returned the model’s text completion:
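A sketch of that function using the Bedrock Converse API might look like the following; the model ID shown is an assumption, so confirm the Claude 3.7 Sonnet identifier available in your Region:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Model ID is an assumption; confirm the identifier in your Region
CLAUDE_MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude(prompt, max_tokens=512):
    """Send a prompt to Anthropic's Claude 3.7 Sonnet on Amazon Bedrock and return the completion text."""
    response = bedrock_runtime.converse(
        modelId=CLAUDE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]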
When you have both functions generated and tested, you can move on to creating the evaluation data for the Nova LLM-as-a-Judge.
Prepare the dataset
To create a realistic evaluation dataset for comparing the Qwen and Claude models, we used the Stanford Question Answering Dataset (SQuAD), a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face datasets library to download and load the first 20 examples from the SQuAD training split:
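A minimal sketch of that loading step (the dataset identifier may also appear as rajpurkar/squad on the Hugging Face Hub):

from datasets import load_dataset

# Load only the first 20 examples of the SQuAD training split
squad = load_dataset("squad", split="train[:20]")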
This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:
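For example, the inspection can be as simple as:

print("Question:", squad[0]["question"])
print("Answer:", squad[0]["answers"]["text"][0])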
For the evaluation set, we selected the first six questions from this subset:
questions = [squad[i]["question"] for i in range(6)]
Generate the Amazon Nova LLM-as-a-Judge evaluation dataset
After preparing a set of evaluation questions from SQuAD, we generated outputs from both models and assembled them into a structured dataset to be used by the Amazon Nova LLM-as-a-Judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the two generation functions defined earlier:
- generate_with_qwen25() for completions from the Qwen2.5 model deployed on SageMaker
- generate_with_claude() for completions from Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock
For each prompt, the workflow attempted to generate a response from each model. If a generation call failed due to an API error, timeout, or other issue, the system captured the exception and stored a clear error message indicating the failure. This made sure that the evaluation process could proceed gracefully even in the presence of transient errors:
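A minimal sketch of that loop, with record field names assumed to match what the judge recipe expects (verify them against the repository notebook):

import json

records = []
for question in questions:
    record = {"prompt": question}
    try:
        record["response_A"] = generate_with_qwen25(question)
    except Exception as err:  # API error, timeout, or other transient issue
        record["response_A"] = f"ERROR: {err}"
    try:
        record["response_B"] = generate_with_claude(question)
    except Exception as err:
        record["response_B"] = f"ERROR: {err}"
    records.append(record)

with open("llm_judge.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")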
This workflow produced a JSON Lines file named llm_judge.jsonl. Each line contains a single evaluation record structured as follows:
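For example (field names as assumed in the sketch above):

{"prompt": "What is ...?", "response_A": "Qwen2.5 output ...", "response_B": "Claude 3.7 Sonnet output ..."}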
Then, upload this llm_judge.jsonl file to an S3 bucket that you’ve predefined:
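For example, with Boto3 (bucket name and key are placeholders):

import boto3

s3 = boto3.client("s3")
s3.upload_file("llm_judge.jsonl", "your-eval-bucket", "nova-judge/input/llm_judge.jsonl")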
Launching the Nova LLM-as-a-Judge evaluation job
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova LLM-as-a-Judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the model, processes the dataset, and generates evaluation metrics in your designated Amazon S3 location.
We use the PyTorch estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, the container image, the evaluation recipe, and the output paths for storing results:
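A minimal sketch of that estimator might look like the following; the recipe path, S3 locations, and role ARN are placeholders, and any additional arguments (such as the evaluation container image URI) should be taken from the repository notebook:

import sagemaker
from sagemaker.pytorch import PyTorch

sess = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/NovaJudgeEvaluationRole"  # IAM role from the prerequisites (placeholder)

estimator = PyTorch(
    base_job_name="nova-llm-as-judge-eval",
    role=role,
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    training_recipe="recipes/nova-llm-as-judge-eval.yaml",   # assumed recipe path; use the one from the notebook
    output_path="s3://your-eval-bucket/nova-judge/output/",  # where evaluation metrics are written
    sagemaker_session=sess,
)

# S3 location of the dataset uploaded earlier
evalInput = "s3://your-eval-bucket/nova-judge/input/llm_judge.jsonl"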
When the estimator is configured, you initiate the evaluation job using the fit() method. This call submits the job to the SageMaker control plane, provisions the compute cluster, and begins processing the evaluation dataset:
estimator.fit(inputs={"train": evalInput})
Results from the Amazon Nova LLM-as-a-Judge evaluation job
The following graphic illustrates the results of the Amazon Nova LLM-as-a-Judge evaluation job.
To help practitioners quickly interpret the outcome of a Nova LLM-as-a-Judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics. This function, plot_nova_judge_results, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary—produced when the evaluation job is complete—and generates the following visual components:
- Score distribution bar chart – Shows how many times Model A was preferred, how many times Model B was preferred, how many ties occurred, and how often the judge failed to produce a decision (inference errors). This provides an immediate sense of how decisive the evaluation was and whether either model is dominating.
- Win rate with 95% confidence interval – Plots Model B’s overall win rate against Model A, including an error bar reflecting the lower and upper bounds of the 95% confidence interval. A vertical reference line at 50% marks the point of no preference. If the confidence interval doesn’t cross this line, you can conclude the result is statistically significant.
- Preference pie chart – Visually displays the proportion of times Model A, Model B, or neither was preferred. This helps quickly understand preference distribution among the valid judgments.
- A vs. B score comparison bar chart – Compares the raw counts of preferences for each model side by side. A clear label annotates the margin of difference to emphasize which model had more wins.
- Win rate gauge – Depicts the win rate as a semicircular gauge with a needle pointing to Model B’s performance relative to the theoretical 0–100% range. This intuitive visualization helps nontechnical stakeholders understand the win rate at a glance.
- Summary statistics table – Compiles numerical metrics—including total evaluations, error counts, win rate, and confidence intervals—into a compact, clean table. This makes it straightforward to reference the exact numeric values behind the plots.
Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation.
Clean up
Complete the following steps to clean up your resources:
- Delete your Qwen 2.5 1.5B Endpoint
- If you’re using a SageMaker Studio JupyterLab notebook, shut down the JupyterLab notebook instance.
How you can use this evaluation framework
The Amazon Nova LLM-as-a-Judge workflow offers a reliable, repeatable way to compare two language models on your own data. You can integrate this into model selection pipelines to decide which version performs best, or you can schedule it as part of continuous evaluation to catch regressions over time.
For teams building agentic or domain-specific systems, this approach provides richer insight than automated metrics alone. Because the entire process runs on SageMaker training jobs, it scales quickly and produces clear visual reports that can be shared with stakeholders.
Conclusion
This post demonstrates how Nova LLM-as-a-Judge—a specialized evaluation model available through Amazon SageMaker AI—can be used to systematically measure the relative performance of generative AI systems. The walkthrough shows how to prepare evaluation datasets, launch SageMaker AI training jobs with Nova LLM-as-a-Judge recipes, and interpret the resulting metrics, including win rates and preference distributions. The fully managed SageMaker AI solution simplifies this process, so you can run scalable, repeatable model evaluations that align with human preferences.
We recommend starting your LLM evaluation journey by exploring the official Amazon Nova documentation and examples. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.
To learn more, visit:
- Amazon Nova Documentation
- Amazon Bedrock Nova Overview
- Fine-tuning Amazon Nova models
- Amazon Nova customization guide
About the authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Joel Carlson is a Senior Applied Scientist on the Amazon AGI foundation modeling team. He primarily works on developing novel approaches for improving the LLM-as-a-Judge capability of the Nova family of models.
Saurabh Sahu is an applied scientist in the Amazon AGI Foundation modeling team. He obtained his PhD in Electrical Engineering from University of Maryland College Park in 2019. He has a background in multi-modal machine learning working on speech recognition, sentiment analysis and audio/video understanding. Currently, his work focuses on developing recipes to improve the performance of LLM-as-a-judge models for various tasks.
Morteza Ziyadi is an Applied Science Manager at Amazon AGI, where he leads several projects on post-training recipes and (Multimodal) large language models in the Amazon AGI Foundation modeling team. Before joining Amazon AGI, he spent four years at Microsoft Cloud and AI, where he led projects focused on developing natural language-to-code generation models for various products. He has also served as an adjunct faculty at Northeastern University. He earned his PhD from the University of Southern California (USC) in 2017 and has since been actively involved as a workshop organizer, and reviewer for numerous NLP, Computer Vision and machine learning conferences.
Pradeep Natarajan is a Senior Principal Scientist in Amazon AGI Foundation modeling team working on post-training recipes and Multimodal large language models. He has 20+ years of experience in developing and launching multiple large-scale machine learning systems. He has a PhD in Computer Science from University of Southern California.
Michael Cai is a Software Engineer on the Amazon AGI Customization Team supporting the development of evaluation solutions. He obtained his MS in Computer Science from New York University in 2024. In his spare time he enjoys 3d printing and exploring innovative tech.
Building cost-effective RAG applications with Amazon Bedrock Knowledge Bases and Amazon S3 Vectors
Vector embeddings have become essential for modern Retrieval Augmented Generation (RAG) applications, but organizations face significant cost challenges as they scale. As knowledge bases grow and require more granular embeddings, many vector databases that rely on high-performance storage such as SSDs or in-memory solutions become prohibitively expensive. This cost barrier often forces organizations to limit the scope of their RAG applications or compromise on the granularity of their vector representations, potentially impacting the quality of results. Additionally, for use cases involving historical or archival data that still needs to remain searchable, storing vectors in specialized vector databases optimized for high throughput workloads represents an unnecessary ongoing expense.
Starting July 15, Amazon Bedrock Knowledge Bases customers can select Amazon S3 Vectors (preview), the first cloud object storage with built-in support to store and query vectors at a low cost, as a vector store. Amazon Bedrock Knowledge Bases users can now reduce vector upload, storage, and query costs by up to 90%. Designed for durable and cost-optimized storage of large vector datasets with subsecond query performance, S3 Vectors is ideal for RAG applications that require long-term storage of massive vector volumes and can tolerate the performance tradeoff compared to high queries per second (QPS), millisecond latency vector databases. The integration with Amazon Bedrock means you can build more economical RAG applications while preserving the semantic search performance needed for quality results.
In this post, we demonstrate how to integrate Amazon S3 Vectors with Amazon Bedrock Knowledge Bases for RAG applications. You’ll learn a practical approach to scale your knowledge bases to handle millions of documents while maintaining retrieval quality and using S3 Vectors cost-effective storage.
Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration overview
When creating a knowledge base in Amazon Bedrock, you can select S3 Vectors as your vector storage option. Using this approach, you can build cost-effective, scalable RAG applications without provisioning or managing complex infrastructure. The integration delivers significant cost savings while maintaining subsecond query performance, making it ideal for working with larger vector datasets generated from massive volumes of unstructured data including text, images, audio, and video. Using a pay-as-you-go pricing model at low price points, S3 Vectors offers industry-leading cost optimization that reduces the cost of uploading, storing, and querying vectors by up to 90% compared to alternative solutions. Advanced search capabilities include rich metadata filtering, so you can refine queries by document attributes such as dates, categories, and sources. The combination of S3 Vectors and Amazon Bedrock is ideal for organizations building large-scale knowledge bases that demand both cost efficiency and performant retrieval—from managing extensive document repositories to historical archives and applications requiring granular vector representations. The walkthrough follows these high-level steps:
- Create a new knowledge base
- Configure the data source
- Configure data storage and processing
- Sync the data source
- Test the knowledge base
Prerequisites
Before you get started, make sure that you have the following prerequisites:
- An AWS Account with appropriate service access.
- An AWS Identity and Access Management (IAM) role with the appropriate permissions to access Amazon Bedrock and Amazon Simple Storage Service (Amazon S3).
- Enable model access for embedding and inference models such as Amazon Titan Text Embeddings V2 and Amazon Nova Pro.
Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration walkthrough
In this section, we walk through the step-by-step process of creating a knowledge base with Amazon S3 Vectors using the AWS Management Console. We cover the end-to-end process from configuring your vector store to ingesting documents and testing your retrieval capabilities.
For those who prefer to configure their knowledge base programmatically rather than using the console, the Amazon Bedrock Knowledge Bases with S3 Vectors repository in GitHub provides a guided notebook that you can follow to deploy the setup in your own account.
Create a new knowledge base
To create a new knowledge base, follow these steps:
- On the Amazon Bedrock console in the left navigation pane, choose Knowledge Bases. To initiate the creation process, in the Create dropdown list, choose Knowledge Base with vector store.
- On the Provide Knowledge Base details page, enter a descriptive name for your knowledge base and an optional description to identify its purpose. Select your IAM permissions approach—either create a new service role or use an existing one—to grant the necessary permissions for accessing AWS services, as shown in the following screenshot.
- Choose Amazon S3. Optionally, add tags to help organize and categorize your resources and configure log delivery destinations such as an S3 bucket or Amazon CloudWatch for monitoring and troubleshooting.
- Choose Next to proceed to the data source configuration.
Configure the data source
To configure the data source, follow these steps:
- Assign a descriptive name to your knowledge base data.
- In Data source location, select whether the S3 bucket exists in your current AWS account or another account, then specify the location where your documents are stored, as shown in the following screenshot.
In this step, configure your parsing strategy to determine how Amazon Bedrock processes your documents. Select Amazon Bedrock default parser for text-only documents at no additional cost. Select Amazon Bedrock Data Automation as parser or Foundation models as a parser for processing complex documents with visual elements.
The chunking strategy configuration is equally critical because it defines how your content is segmented into meaningful units for vector embedding, directly impacting retrieval quality and context preservation. We have selected Fixed-size chunking for this example due to its predictable token sizing and simplicity. Because both parsing and chunking decisions can’t be modified after creation, select options that best match your content structure and retrieval needs. For sensitive data, you can use advanced settings to implement AWS Key Management Service (AWS KMS) encryption or apply custom transformation functions to optimize your documents before ingestion. By default, S3 Vectors will use server-side encryption (SSE-S3).
Configure data storage and processing
To configure data storage and processing, first select the embeddings model, as shown in the following screenshot. The embeddings model will transform your text chunks into numerical vector representations for semantic search capabilities. If connecting to an existing S3 Vector as a vector store, make sure the embedding model dimensions match those used when creating your vector store because dimensional mismatches will cause ingestion failures. Amazon Bedrock offers several embeddings models to choose from, each with different vector dimensions and performance characteristics optimized for various use cases. Consider both the semantic richness of the model and its cost implications when making your selection.
Next, configure the vector store. For vector storage selection, choose how Amazon Bedrock Knowledge Bases will store and manage the vector embeddings generated from your documents in Amazon S3 Vectors, using one of the following two options:
Option 1. Quick create a new vector store
This recommended option, shown in the following screenshot, automatically creates an S3 vector bucket in your account during knowledge base creation. The system optimizes your vector storage for cost-effective, durable storage of large-scale vector datasets, creating an S3 vector bucket and vector index for you.
Option 2. Use an existing vector store
When creating your Amazon S3 Vectors index for use with Amazon Bedrock Knowledge Bases, you can attach metadata (such as year, author, genre, and location) as key-value pairs to each vector. By default, metadata fields can be used as filters in similarity queries unless they are specified as nonfilterable metadata at the time of vector index creation. S3 Vector indexes support string, number, and Boolean types up to 40 KB per vector, with filterable metadata capped at 2 KB per vector.
To accommodate larger text chunks and richer metadata while still allowing filtering on other important attributes, add "AMAZON_BEDROCK_TEXT" to the nonFilterableMetadataKeys list in your index configuration. This approach optimizes your storage allocation for document content while preserving filtering capabilities for meaningful attributes like categories or dates. Keep in mind that fields added to the nonFilterableMetadataKeys array can’t be used with metadata filtering in queries and can’t be modified after the index is created.
Here’s an example for creating an Amazon S3 Vector index with proper metadata configuration:
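A minimal Boto3 sketch might look like this; the bucket name, index name, and dimension are placeholders to match your setup:

import boto3

s3vectors = boto3.client("s3vectors")

s3vectors.create_index(
    vectorBucketName="my-vector-bucket",       # placeholder
    indexName="my-knowledge-base-index",       # placeholder
    dataType="float32",
    dimension=1024,                            # must match your embeddings model (e.g., Titan Text Embeddings V2)
    distanceMetric="cosine",
    metadataConfiguration={
        # Keep Amazon Bedrock's chunk text out of the filterable metadata budget
        "nonFilterableMetadataKeys": ["AMAZON_BEDROCK_TEXT"]
    },
)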
For details on how to create a vector store, refer to Introducing Amazon S3 Vectors in the AWS News Blog.
After you have an S3 Vector bucket and index, you can connect it to your knowledge base. You’ll need to provide both the S3 Vector bucket Amazon Resource Name (ARN) and vector index ARN, as shown in the following screenshot, to correctly link your knowledge base to your existing S3 Vector index.
Sync data source
After you’ve configured your knowledge base with S3 Vectors, you need to synchronize your data source to generate and store vector embeddings. From the Amazon Bedrock Knowledge Bases console, open your knowledge base, locate the configured data source, and choose Sync to initiate the process, as shown in the following screenshot. During synchronization, the system processes your documents according to your parsing and chunking configurations, generates embeddings using your selected model, and stores them in your S3 vector index. You can monitor the synchronization progress in real time if you’ve configured Amazon CloudWatch Logs, and verify completion status before testing your knowledge base’s retrieval capabilities.
Test the knowledge base
After successfully configuring your knowledge base with S3 Vectors, you can validate its functionality using the built-in testing interface. You can use this interactive console to experiment with different query types and view both retrieval results and generated responses. Select between Retrieval only (Retrieve API) mode to examine raw source chunks or Retrieval and response generation (RetrieveAndGenerate API) to learn how foundation models (FMs) such as Amazon Nova use your retrieved content. The testing interface provides valuable insights into how your knowledge base processes queries, displaying source chunks, their relevance scores, and associated metadata.
You can also configure query settings for your knowledge base just as you would with other vector storage options, including filters for metadata-based selection, guardrails for appropriate responses, reranking capabilities, and query modification options. These tools help optimize retrieval quality and make sure the most relevant information is presented to your FMs. S3 Vectors currently supports semantic search functionality. Using this hands-on validation, you can refine your configuration before integrating the knowledge base with production applications.
Creating your Amazon Bedrock knowledge base programmatically
In the previous sections, we walked through creating a knowledge base with Amazon S3 Vectors using the AWS Management Console. For those who prefer to automate this process or integrate it into existing workflows, you can also create your knowledge base programmatically using the AWS SDK.
The following is a sample code showing how the API call looks when programmatically creating an Amazon Bedrock knowledge base with an existing Amazon S3 Vector index:
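A Boto3 sketch might look like the following; the storageConfiguration field names for S3 Vectors are assumptions to verify against the CreateKnowledgeBase API reference, and the ARNs are placeholders:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.create_knowledge_base(
    name="s3-vectors-knowledge-base",
    roleArn="arn:aws:iam::111122223333:role/BedrockKnowledgeBaseRole",  # placeholder
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
        },
    },
    storageConfiguration={
        "type": "S3_VECTORS",  # field names below are assumptions; check the API reference
        "s3VectorsConfiguration": {
            "vectorBucketArn": "arn:aws:s3vectors:us-east-1:111122223333:bucket/my-vector-bucket",
            "indexArn": "arn:aws:s3vectors:us-east-1:111122223333:bucket/my-vector-bucket/index/my-knowledge-base-index",
        },
    },
)
print(response["knowledgeBase"]["knowledgeBaseId"])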
The role attached to the knowledge base should have several policies attached to it, including access to the S3 Vectors API, the models used for embedding, generation, and reranking (if used), and the S3 bucket used as data source. If you’re using a customer managed key for your S3 Vector as a vector store, you’ll need to provide an additional policy to allow the decryption of the data. The following is the policy needed to access the Amazon S3 Vector as a vector store:
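As a sketch, expressed as a Python dict, such a policy could grant the knowledge base role access to the vector bucket and index; narrow the wildcard action to the specific s3vectors actions listed in the service authorization reference, and replace the placeholder ARNs:

s3_vectors_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3VectorsAccess",
            "Effect": "Allow",
            "Action": ["s3vectors:*"],  # narrow to the actions your knowledge base actually needs
            "Resource": [
                "arn:aws:s3vectors:us-east-1:111122223333:bucket/my-vector-bucket",        # placeholder ARNs
                "arn:aws:s3vectors:us-east-1:111122223333:bucket/my-vector-bucket/index/*",
            ],
        }
    ],
}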
Cleanup
To clean up your resources, complete the following steps. To delete the knowledge base:
- On the Amazon Bedrock console, choose Knowledge Bases
- Select your Knowledge Base and note both the IAM service role name and S3 Vector index ARN
- Choose Delete and confirm
To delete the S3 Vector as a vector store, you can use the AWS Command Line Interface (AWS CLI) or the AWS SDK:
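For example, with the Boto3 s3vectors client (bucket and index names are placeholders; verify the method names against the S3 Vectors API reference):

import boto3

s3vectors = boto3.client("s3vectors")
s3vectors.delete_index(vectorBucketName="my-vector-bucket", indexName="my-knowledge-base-index")
s3vectors.delete_vector_bucket(vectorBucketName="my-vector-bucket")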
To delete the IAM service role:
- On the IAM console, find the role noted earlier
- Select and delete the role
To delete the sample dataset:
- On the Amazon S3 console, find your S3 bucket
- Select and delete the files you uploaded for this tutorial
Conclusion
The integration between Amazon Bedrock Knowledge Bases and Amazon S3 Vectors represents a significant advancement in making RAG applications more accessible and economically viable at scale. By using the cost-optimized storage of Amazon S3 Vectors, organizations can now build knowledge bases at scale with improved cost efficiency. You can strike an optimal balance between performance and economics and focus on creating value through AI-powered applications rather than managing complex vector storage infrastructure.
To get started on Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration, refer to Using S3 Vectors with Amazon Bedrock Knowledge Bases in the Amazon S3 User Guide.
About the authors
Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services (AWS) based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock.
Irene Marban is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), working with customers across EMEA to design and implement generative AI solutions to accelerate their businesses. With a background in biomedical engineering and AI, her work focuses on helping organizations leverage the latest AI technologies to drive innovation and growth. In her spare time, she loves reading and cooking for her friends.
Ashish Lal is an AI/ML Senior Product Marketing Manager for Amazon Bedrock. He has over 11 years of experience in product marketing and enjoys helping customers accelerate time to value and reduce their AI lifecycle cost.
Implementing on-demand deployment with customized Amazon Nova models on Amazon Bedrock
Amazon Bedrock offers model customization capabilities for customers to tailor versions of foundation models (FMs) to their specific needs through features such as fine-tuning and distillation. Today, we’re announcing the launch of on-demand deployment for customized models ready to be deployed on Amazon Bedrock.
On-demand deployment for customized models provides an additional deployment option that scales with your usage patterns. This approach allows for invoking customized models only when needed, with requests processed in real time without requiring pre-provisioned compute resources.
The on-demand deployment option includes a token-based pricing model that charges based on the number of tokens processed during inference. This pay-as-you-go approach complements the existing Provisioned Throughput option, giving users flexibility to choose the deployment method that best aligns with their specific workload requirements and cost objectives.
In this post, we walk through the custom model on-demand deployment workflow for Amazon Bedrock and provide step-by-step implementation guides using both the AWS Management Console and APIs or AWS SDKs. We also discuss best practices and considerations for deploying customized Amazon Nova models on Amazon Bedrock.
Understanding custom model on-demand deployment workflow
The model customization lifecycle represents the end-to-end journey from conceptualization to deployment. This process begins with defining your specific use case, preparing and formatting appropriate data, and then performing model customization through features such as Amazon Bedrock fine-tuning or Amazon Bedrock Model Distillation. Each stage builds upon the previous one, creating a pathway toward deploying production-ready generative AI capabilities that you tailor to your requirements. The following diagram illustrates this workflow.
After customizing your model, the evaluation and deployment phases determine how the model will be made available for inference. This is where custom model on-demand deployment becomes valuable, offering a deployment option that aligns with variable workloads and cost-conscious implementations. When using on-demand deployment, you can invoke your customized model through the AWS console or standard API operations using the model identifier, with compute resources automatically allocated only when needed. The on-demand deployment provides flexibility while maintaining performance expectations, so you can seamlessly integrate customized models into your applications with the same serverless experience offered by Amazon Bedrock—all compute resources are automatically managed for you, based on your actual usage. Because the workflow supports iterative improvements, you can refine your models based on evaluation results and evolving business needs.
Prerequisites
This post assumes you have a customized Amazon Nova model before deploying it using on-demand deployment. On-demand deployment requires newly customized Amazon Nova models after this launch. Previously customized models aren’t compatible with this deployment option. For instructions on creating or customizing your Nova model through fine-tuning or distillation, refer to these resources:
After you’ve successfully customized your Amazon Nova model, you can proceed with deploying it using the on-demand deployment option as detailed in the following sections.
Implementation guide for on-demand deployment
There are two main approaches to implementing on-demand deployment for your customized Amazon Nova models on Amazon Bedrock: using the Amazon Bedrock console or using the API or SDK. First, we explore how to deploy your model through the Amazon Bedrock console, which provides a user-friendly interface for setting up and managing your deployments.
Step-by-step implementation using the Amazon Bedrock console
To implement on-demand deployment for your customized Amazon Nova models on Amazon Bedrock using the console, follow these steps:
- On the Amazon Bedrock console, select your customized model (fine-tuning or model distillation) to be deployed. Choose Set up inference and select Deploy for on-demand, as shown in the following screenshot.
- Under Deployment details, enter a Name and a Description. You have the option to add Tags, as shown in the following screenshot. Choose Create to start on-demand deployment of customized model.
Under Custom model deployments, the status of your deployment should be InProgress, Active, or Failed, as shown in the following screenshot.
You can select a deployment to find Deployment ARN, Creation time, Last updated, and Status for the selected custom model.
The custom model is deployed and ready for use with on-demand deployment. Try it out in the test playground, or go to the Chat/Text playground and choose Custom models under Categories. Select your model, choose On demand under Inference, and select it by the deployment name, as shown in the following screenshot.
Step-by-step implementation using API or SDK
After you have trained the model successfully, you can deploy it to evaluate the response quality and latency or use it as a production model for your use case. You use the CreateCustomModelDeployment API to create a model deployment for the trained model. The following steps show how to use the APIs for deploying and deleting the custom model deployment for on-demand inference.
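A sketch of the deployment call might look like this; the parameter and response field names are assumptions to verify against the CreateCustomModelDeployment API reference:

import boto3

bedrock = boto3.client("bedrock")

custom_model_arn = "arn:aws:bedrock:us-east-1:111122223333:custom-model/your-model-id"  # placeholder

response = bedrock.create_custom_model_deployment(
    modelDeploymentName="my-nova-on-demand-deployment",  # parameter names are assumptions
    modelArn=custom_model_arn,
)
deployment_arn = response["customModelDeploymentArn"]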
After you’ve successfully created a model deployment, you can check the status of the deployment by using the GetCustomModelDeployment API as follows:
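Continuing from the sketch above (the parameter and response field names are again assumptions):

response = bedrock.get_custom_model_deployment(
    customModelDeploymentIdentifier=deployment_arn  # parameter name is an assumption
)
print(response["status"])  # expected to report Creating, Active, or Failed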
GetCustomModelDeployment supports three states: Creating, Active, and Failed. When the status in the response is Active, you should be able to use the custom model through on-demand deployment with the InvokeModel or Converse API, as shown in the following example:
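A sketch of a Converse call that uses the deployment ARN as the model identifier:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId=deployment_arn,  # the on-demand custom model deployment ARN from the earlier sketch
    messages=[{"role": "user", "content": [{"text": "Write a one-sentence summary of on-demand deployment."}]}],
    inferenceConfig={"maxTokens": 256},
)
print(response["output"]["message"]["content"][0]["text"])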
By following these steps, you can deploy and use your customized model through Amazon Bedrock API and instantly use your efficient and high-performing model tailored to your use cases through on-demand deployment.
Best practices and considerations
Successful implementation of on-demand deployment with customized models depends on understanding several operational factors. These considerations—including latency, Regional availability, quota limitations, deployment option selections, and cost management strategies—directly impact your ability to deploy effective solutions while optimizing resource utilization. The following guidelines help you make informed decisions when implementing your inference strategy:
- Cold start latency – When using on-demand deployment, you might experience initial cold start latencies, typically lasting several seconds, depending on the model size. This occurs when the deployment hasn’t received recent traffic and needs to reinitialize compute resources.
- Regional availability – At launch, custom model deployment will be available in US East (N. Virginia) for Amazon Nova models.
- Quota management – Each custom model deployment has specific quotas:
- Tokens per minute (TPM)
- Requests per minute (RPM)
- The number of deployments in Creating status
- Total on-demand deployments in a single account
Each deployment operates independently within its assigned quota. If a deployment exceeds its TPM or RPM allocation, incoming requests will be throttled. You can request quota increases by submitting a ticket or contacting your AWS account team.
- Choosing between custom model deployment and Provisioned Throughput – You can set up inference on a custom model by either creating a custom model deployment (for on-demand usage) or purchasing Provisioned Throughput. The choice depends on the supported Regions and models for each inference option, throughput requirement, and cost considerations. These two options operate independently and can be used simultaneously for the same custom model.
- Cost management – On-demand deployment uses a pay-as-you-go pricing model based on the number of tokens processed during inference. You can use cost allocation tags on your on-demand deployments to track and manage inference costs, allowing better budget tracking and cost optimization through AWS Cost Explorer.
Cleanup
If you’ve been testing the on-demand deployment feature and don’t plan to continue using it, it’s important to clean up your resources to avoid incurring unnecessary costs. Here’s how to delete using the Amazon Bedrock Console:
- Navigate to your custom model deployment
- Select the deployment you want to remove
- Delete the deployment
Here’s how to delete using the API or SDK:
To delete a custom model deployment, you can use the DeleteCustomModelDeployment API. The following example demonstrates how to delete your custom model deployment:
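Continuing from the earlier sketches (the parameter name is an assumption):

bedrock.delete_custom_model_deployment(
    customModelDeploymentIdentifier=deployment_arn  # parameter name is an assumption
)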
Conclusion
The introduction of on-demand deployment for customized models on Amazon Bedrock represents a significant advancement in making AI model deployment more accessible, cost-effective, and flexible for businesses of all sizes. On-demand deployment offers the following advantages:
- Cost optimization – Pay-as-you-go pricing means you pay only for the compute resources you actually use
- Operational simplicity – Automatic resource management eliminates the need for manual infrastructure provisioning
- Scalability – Seamless handling of variable workloads without upfront capacity planning
- Flexibility – Freedom to choose between on-demand and Provisioned Throughput based on your specific needs
Getting started is straightforward. Begin by completing your model customization through fine-tuning or distillation, then choose on-demand deployment using the AWS Management Console or API. Configure your deployment details, validate model performance in a test environment, and seamlessly integrate into your production workflows.
Start exploring on-demand deployment for customized models on Amazon Bedrock today! Visit the Amazon Bedrock documentation to begin your model customization journey and experience the benefits of flexible, cost-effective AI infrastructure. For hands-on implementation examples, check out our GitHub repository which contains detailed code samples for customizing Amazon Nova models and evaluating them using on-demand custom model deployment.
About the Authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Koushik Mani is an associate solutions architect at AWS. He had worked as a Software Engineer for two years focusing on machine learning and cloud computing use cases at Telstra. He completed his masters in computer science from University of Southern California. He is passionate about machine learning and generative AI use cases and building solutions.
Rishabh Agrawal is a Senior Software Engineer working on AI services at AWS. In his spare time, he enjoys hiking, traveling and reading.
Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of generative AI to deliver innovative and customer-centric products. Shreeya holds a master’s degree from Duke University. Outside of work, she loves traveling, dancing, and singing.
torch.compile and Diffusers: A Hands-On Guide to Peak Performance
Diffusers is the go-to library that provides a unified interface to cutting-edge and open diffusion models for image, video, and audio. Over the past few months, we have deepened its integration with torch.compile. By tailoring the compilation workflow to the diffusion model architecture, torch.compile delivers significant speed-ups with minimal impact on user experience. In this post, we will show how to unlock these gains. The target audience for this post is:
- Diffusion-model authors – Learn the small code changes that make your models compiler-friendly so that end-users can benefit from performance boosts.
- Diffusion-model users – Understand compile time vs. run time trade-offs, how to avoid unnecessary recompilations, and other aspects that can help you choose the right setting. We’ll walk through two community-favorite pipelines from Diffusers to illustrate the payoff.
While the examples live in the Diffusers repo, most of the principles apply to other deep learning workloads as well.
Table of contents
- Background
- Use torch.compile Effectively For Diffusion Models
  - Vanilla Compilation
  - For Model Authors: Use fullgraph=True
  - For Model Users: Use Regional Compilation
  - For Model Users: Reduce Recompilations
- Extend torch.compile to Popular Diffusers Features
  - Memory-constrained GPUs
  - LoRA adapters
- Operational Hardening
- Conclusion
- Links to important resources
Background
torch.compile delivers its best gains when you know the model well enough to compile the right sub-modules. In this section, we’ll first outline the factors that shape the torch.compile user experience, then dissect diffusion architectures to pinpoint which components benefit most from compilation.
Core Factors for torch.compile Performance and Usability
torch.compile turns a Python program into an optimized graph and then emits machine code, but the speedup and ease of use depend on the following factors:
Compile latency – As a JIT compiler, torch.compile springs into action on the first run, so users experience the compile cost up front. This startup cost could be high, especially with large graphs.
Mitigation: try regional compilation to target small, repeated regions. While this may limit the maximum possible speedup compared to compiling the full model, it often strikes a better balance between performance and compile time, so evaluate the trade-off before deciding.
Graph breaks — Dynamic Python or unsupported ops slice the Python program into many small graphs, slashing potential speed-ups. Model developers should strive to keep the compute-heavy part of the model free of any graph breaks.
Mitigation: turn on fullgraph=True, identify the breaks, and eliminate them while preparing the model.
Recompilations – torch.compile specializes its code to the exact input shapes, so changing the resolution from 512 × 512 to 1024 × 1024 triggers a recompile and the accompanying latency.
Mitigation: enable dynamic=True to relax shape constraints. Note that dynamic=True works well for diffusion models, but the recommended way is to use mark_dynamic to selectively apply dynamism to your model.
Device-to-host (DtoH) syncs can also get in the way of optimal performance, but these are non-trivial and have to be treated on a case-by-case basis. Most widely used diffusion pipelines in Diffusers are free of those syncs. Interested readers can check out this document to learn more. Since these syncs contribute to little latency increase compared to other mentioned factors, we will not be focusing on them for the rest of this post.
Diffusion Model Architecture
We’ll use Flux‑1‑Dev, an open‑source text‑to‑image model from Black Forest Labs, as our running example. A diffusion pipeline is not a single network; it is a collection of models:
- Text encoders – CLIP‑Text and T5 convert the user prompt into embeddings.
- Denoiser – a Diffusion Transformer (DiT) progressively refines a noisy latent, conditioned on those embeddings.
- Decoder (VAE) – transforms the final latent into RGB pixels.
Among these components, the DiT dominates the compute budget. You could compile every component in the pipeline, but that only adds compile latency, recompilations, and potential graph breaks for gains that barely matter, since these parts already account for a tiny slice of the total runtime. For these reasons, we restrict torch.compile to the DiT component instead of the entire pipeline.
Use torch.compile Effectively For Diffusion Models
Vanilla Compilation
Let’s establish a baseline that we can incrementally improve, while maintaining a smooth torch.compile user experience. Load the Flux‑1‑Dev checkpoint and generate an image the usual way:
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A cat holding a sign that says hello world"
out = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]
out.save("image.png")
Now compile the compute‑heavy Diffusion Transformer sub‑module:
pipe.transformer.compile(fullgraph=True)
That single line cuts latency on an H100 from 6.7 seconds to 4.5 seconds, achieving roughly 1.5x speedup, without sacrificing image quality. Under the hood, torch.compile fuses kernels and removes Python overhead, driving both memory and compute efficiency.
For Model Authors: Use fullgraph=True
The DiT’s forward pass is structurally simple, so we expect it to form one contiguous graph. This flag instructs torch.compile to raise an error if any graph break occurs, letting you catch unsupported ops early rather than silently leaving potential performance gains on the table. We recommend that diffusion model authors set fullgraph=True early in the model preparation phase and fix graph breaks. Please refer to the torch.compile troubleshooting doc and the manual to fix graph breaks.
For Model Users: Use Regional Compilation
If you’re following along, you’ll notice the first inference call is very slow, taking 67.4 seconds on an H100 machine. This is the compile overhead. Compilation latency grows with the size of the graph handed to the compiler. One practical way to reduce this cost is to compile smaller, repeated blocks, a strategy we call regional compilation.
A DiT is essentially a stack of identical Transformer layers. If we compile one layer once and reuse its kernels for every subsequent layer, we can slash compile time while keeping nearly all of the runtime gains we saw with full‑graph compilation.
Diffusers exposes this via a single-line helper:
pipe.transformer.compile_repeated_blocks(fullgraph=True)
On an H100, this cuts compile latency from 67.4 seconds to 9.6 seconds, reducing the cold start by 7x while still delivering the 1.5x runtime speedup achieved by the full model compilation. If you’d like to dive deeper or enable your new model with regional compilation, the implementation discussion lives in PR.
Note that the compile time numbers above are cold-start measurements: we wiped the compilation cache with the torch._inductor.utils.fresh_inductor_cache API, so torch.compile had to start from scratch. Alternatively, in a warm start, cached compiler artifacts (stored either on the local disk or in a remote cache) let the compiler skip parts of the compilation process, reducing compile latency. For our model, regional compilation takes 9.6 seconds on a cold start, but only 2.4 seconds once the cache is warm. See the linked guide for details on using compile caches effectively.
For Model Users: Reduce Recompilations
Because torch.compile is a just‑in‑time compiler, it specializes the compiled artifacts to properties of the inputs it sees – shape, dtype, and device among them (refer to this blog for more details). Changing any of these causes recompilation. Although this happens automatically behind the scenes, each recompilation adds compile-time cost and degrades the user experience.
If your application needs to handle multiple image sizes or batch shapes, pass dynamic=True when you compile. For general models, PyTorch recommends mark_dynamic, but dynamic=True works well in diffusion models.
pipe.transformer.compile_repeated_blocks(fullgraph=True, dynamic=True)
We benchmarked the forward pass of the Flux DiT with full compilation on shape changes and obtained convincing results.
Extend torch.compile to Popular Diffusers Features
By now, you should have a clear picture of how to accelerate diffusion models with torch.compile without compromising user experience. Next, we will discuss two community-favorite Diffusers features and keep them fully compatible with torch.compile. We will default to regional compilation because it delivers the same speedup as full compilation with 8x smaller compile latency.
- Memory‑constrained GPUs – Many Diffusers users work on cards whose VRAM cannot hold the entire model. We’ll look at CPU offloading and quantization to keep generation feasible on those devices.
- Rapid personalization with LoRA adapters – Fine‑tuning via low‑rank adapters is the go‑to method for adapting diffusion models to new styles or tasks. We’ll demonstrate how to swap LoRAs without triggering a recompile.
Memory-constrained GPUs
CPU offloading: A full Flux pipeline in bfloat16 consumes roughly 33 GB, more than most consumer GPUs can spare. Fortunately, not every sub‑module has to occupy GPU memory for the entire forward pass. Once the text encoders finish producing prompt embeddings, they can be moved to system RAM. Likewise, after the DiT refines the latent, it can yield the GPU memory to the VAE decoder.
Diffusers turns this offloading into a one‑liner:
pipe.enable_model_cpu_offload()
Peak GPU usage drops to roughly 22.7 GB, making high‑resolution generation feasible on smaller cards at the cost of extra PCIe traffic. Offloading trades memory for time, the end‑to‑end run now takes about 21.5 seconds instead of 6.7 seconds.
You can claw back some of that time by enabling torch.compile alongside offloading. The compiler’s kernel fusion offsets a little bit of the PCIe overhead, trimming latency to roughly 18.7 seconds while preserving the smaller 22.6 GB footprint.
pipe.enable_model_cpu_offload()
pipe.transformer.compile_repeated_blocks(fullgraph=True)
Diffusers ships multiple offloading modes, each with a unique speed‑versus‑memory sweet spot. Check out the offloading guide for the full menu.
Quantization: CPU offloading frees GPU memory, but it still assumes that the largest component, the DiT, can fit into GPU memory. Another way to alleviate the memory pressure is to leverage weight quantization, if there is some tolerance for lossy outputs.
Diffusers supports several quantization backends; here we use 4‑bit NF4 quantization from bitsandbytes. It cuts the DiT’s weight footprint by roughly half, dropping peak GPU memory from roughly 33 GB to 15 GB while retaining the image quality.
In contrast to CPU offloading, weight quantization keeps the weights in GPU memory, leading to a smaller increase in runtime penalty – it increases from the baseline of 6.7 seconds to 7.3 seconds. Adding torch.compile on top fuses the 4‑bit operations, reducing the inference time from 7.279 seconds to 5.048 seconds, achieving roughly 1.5x speedup.
You can find different backends and code pointers here.
(We enabled quantization on the DiT and T5 as both of them have considerably higher memory consumption than the CLIP and VAE.)
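As a sketch of what NF4 quantization of the DiT can look like with the bitsandbytes backend in Diffusers (argument values are illustrative; the T5 encoder can be quantized analogously through transformers):

import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Requires the bitsandbytes package to be installed
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.transformer.compile_repeated_blocks(fullgraph=True)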
Quantization + offloading: As you might expect, you can combine NF4 quantization with CPU offloading to get the maximum memory benefit. This combined technique reduces the memory footprint to 12.2 GB with 12.2 seconds of inference time. Applying torch.compile works seamlessly, reducing the runtime to 9.8 seconds, achieving a 1.24x speedup.
The benchmarks were conducted using this script on a single H100 machine. Below is a summary of the numbers we have discussed so far. The grey boxes represent the baseline number, and green boxes represent the best configuration.
Looking closely at the table above, we can immediately see that regional compilation provides speed-ups almost similar to full compilation, while being significantly faster in terms of compilation time.
LoRA adapters
LoRA adapters let you personalize a base diffusion model without fine‑tuning millions of parameters. The downside is that switching between adapters normally swaps weight tensors inside the DiT, forcing torch.compile to re‑trace and re‑compile.
Diffusers now integrates PEFT’s LoRA hot‑swap to dodge that hit. You declare the maximum LoRA rank you’ll need, compile once, and then swap adapters on the fly. No recompilation required.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.enable_lora_hotswap(target_rank=max_rank)
pipe.load_lora_weights(<lora-adapter-name1>)
pipe.transformer.compile(fullgraph=True)

# from this point on, load new LoRAs with `hotswap=True`
pipe.load_lora_weights(<lora-adapter-name2>, hotswap=True)
Because only the LoRA weight tensors change and their shapes stay fixed, the compiled kernels remain valid and inference latency stays flat.
Caveats
- We need to provide the maximum rank among all LoRA adapters ahead of time. Thus, if we have one adapter with rank 16 and another with 32, we need to pass `max_rank=32`.
- LoRA adapters that are hotswapped can only target the same layers, or a subset of layers, that the first LoRA targets.
- Hot‑swapping the text encoder is not yet supported.
For more details, see the LoRA hot‑swap docs and the test suite.
LoRA inference integrates seamlessly with the offloading and quantization features discussed above. If you’re constrained by GPU VRAM, please consider using them.
Operational Hardening
Diffusers runs a dedicated CI job nightly to keep torch.compile support robust. The suite exercises the most popular pipelines and automatically checks for:
- Graph breaks and silent fallbacks
- Unintended recompilations across common input shapes
- Compatibility between compilation, CPU offloading, and every quantization backend we support
Interested readers are encouraged to check out this link to look at recent PRs that improve torch.compile coverage.
Benchmarks
Correctness is only half the story; we care about performance too. A revamped benchmarking workflow now runs alongside CI, capturing latency and peak memory for each scenario covered in this post. Results are exported to a consolidated CSV so regressions (or wins!) are easy to spot. The design and early numbers live in this PR.
Conclusion
torch.compile can turn a standard Diffusers pipeline into a high‑performance, memory‑efficient workhorse. By focusing compilation on the DiT, leveraging regional compilation and dynamic shapes, and stacking the compiler with offloading, quantization, and LoRA hot‑swap, you can unlock substantial speed‑ups and VRAM savings without sacrificing image quality or flexibility.
We hope these recipes inspire you to weave torch.compile into your own diffusion workflows. We’re excited to see what you build next.
Happy compiling!
The Diffusers team would like to acknowledge the help and support of Animesh and Ryan in improving the torch.compile support.
Links to Important Resources
- torch.compile tutorial
- Regional compilation tutorial
- torch.compile, the missing manual
- Tricks to know when using torch.compile
- Diffusers x compilation docs