Unlock retail intelligence by transforming data into actionable insights using generative AI with Amazon Q Business

Businesses often face challenges in managing and deriving value from their data. According to McKinsey, 78% of organizations now use AI in at least one business function (as of 2024), underscoring the growing importance of AI solutions in business. Additionally, 21% of organizations using generative AI have fundamentally redesigned their workflows, a sign of how deeply AI is transforming business operations.

Gartner identifies AI-powered analytics and reporting as a core investment area for retail organizations, with most large retailers expected to deploy or scale such solutions within the next 12–18 months. The retail sector’s data complexity demands sophisticated solutions that can integrate seamlessly with existing systems. Amazon Q Business offers features that can be tailored to meet specific business needs, including integration capabilities with popular retail management systems, point-of-sale systems, inventory management software, and ecommerce systems. Through advanced AI algorithms, the system analyzes historical data and current trends, helping businesses prepare effectively for seasonal fluctuations in demand and make data-driven decisions.

Amazon Q Business for Retail Intelligence is an AI-powered assistant designed to help retail businesses streamline operations, improve customer service, and enhance decision-making processes. This solution is specifically engineered to be scalable and adaptable to businesses of various sizes, helping them compete more effectively. In this post, we show how you can use Amazon Q Business for Retail Intelligence to transform your data into actionable insights.

Solution overview

Amazon Q Business for Retail Intelligence is a comprehensive solution that transforms how retailers interact with their data using generative AI. The solution architecture combines the powerful generative AI capabilities of Amazon Q Business and Amazon QuickSight visualizations to deliver actionable insights across the entire retail value chain. Our solution also uses Amazon Q Apps so retail personas and users can create custom AI-powered applications to streamline day-to-day tasks and automate workflows and business processes.

The following diagram illustrates the solution architecture.

Solution architecture diagram

The solution uses the preceding AWS architecture to deliver a secure, high-performance, and reliable retail intelligence experience. Amazon Q Business serves as the primary generative AI engine, enabling natural language interactions and powering custom retail-specific applications. The architecture incorporates AWS IAM Identity Center for robust authentication and access control, and Amazon Simple Storage Service (Amazon S3) provides secure data lake storage for retail data sources. We use QuickSight for interactive visualizations, enhancing data interpretation. The solution’s flexibility is further enhanced by AWS Lambda for serverless processing, Amazon API Gateway for efficient endpoint management, and Amazon CloudFront for optimized content delivery. Amazon Q Business custom plugins call these API endpoints to start automated workflows directly from the Amazon Q Business web application interface based on customer queries and interactions.

This setup implements a three-tier architecture: a data integration layer that securely ingests data from multiple retail sources, a processing layer where Amazon Q Business analyzes queries and generates insights, and a presentation layer that delivers personalized, role-based insights through a unified interface.

We have provided an AWS CloudFormation template, sample datasets, and scripts that you can use to set up the environment for this demonstration.

In the following sections, we dive deeper on how this solution works.

Deployment

We have provided the Amazon Q Business for Retail Intelligence solution as open source—you can use it as a starting point for your own solution and help us make it better by contributing fixes and features through GitHub pull requests. Visit the GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

After you set up the environment, you can access the Amazon Q Business for Retail Intelligence dashboard, as shown in the following screenshot.

Retail Intelligence dashboard

You can interact with the QuickSight visualizations and Amazon Q Business chat interface to ask questions using natural language.

Key features and capabilities

Retail users can interact with this solution in many ways. In this section, we explore the key features.

For C-suite executives and senior leaders who want to know how the business is performing, our solution provides a single pane of glass and makes it straightforward to access and interact with your enterprise’s qualitative and quantitative data using natural language. For example, users can analyze quantitative data like product sales or marketing campaign performance using the interactive visualizations powered by QuickSight, and qualitative data like customer feedback through Amazon Q Business using natural language, all from a single interface.

Consider that you are a marketing analyst who wants to evaluate campaign performance and reach across channels and analyze ad spend vs. revenue. With Amazon Q Business, you can run complex queries using natural language questions and share Amazon Q Apps with multiple teams. The solution provides automated insights about customer behavior and campaign effectiveness, helping marketing teams make faster decisions and quick adjustments to maximize ROI.

Marketing campaign information

Similarly, let’s assume you are a merchandising planner or a vendor manager and you want to understand the impact of cost-prohibitive events on your international business that imports and exports goods and services. You can add inputs to Amazon Q Apps and get responses based on that specific product or product family.

Alternative products

Users can also send requests through APIs using Amazon Q Business custom plugins for real-time interactions with downstream applications. For example, a store manager might want to know which items in the current inventory they need to replenish or rebalance for the next week based on weather predictions or local sporting events.
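To make this concrete, the following is a minimal sketch of a Lambda function behind API Gateway that a custom plugin could call for such a replenishment check. The payload fields and the get_replenishment_recommendations helper are hypothetical placeholders; in practice, the plugin is defined by an OpenAPI schema that maps the user’s request to your API operation.

import json

def get_replenishment_recommendations(store_id: str, horizon_days: int) -> list[dict]:
    # Hypothetical helper: in a real deployment this would query your
    # inventory system or an analytics view and apply forecast logic.
    return [
        {"sku": "SKU-123", "current_stock": 40, "recommended_order_qty": 120},
        {"sku": "SKU-456", "current_stock": 8, "recommended_order_qty": 60},
    ]

def lambda_handler(event, context):
    # With an API Gateway proxy integration, the plugin request body arrives as a JSON string
    body = json.loads(event.get("body") or "{}")
    store_id = body.get("store_id", "unknown")
    horizon_days = int(body.get("horizon_days", 7))

    recommendations = get_replenishment_recommendations(store_id, horizon_days)

    # Amazon Q Business surfaces the JSON response to the user in the chat interface
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"store_id": store_id, "recommendations": recommendations}),
    }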

To learn more, refer to the following complete demo.

For this post, we haven’t used the generative business intelligence (BI) capabilities of Amazon Q with our QuickSight visualizations. To learn more, see Amazon Q in QuickSight.

Empowering retail personas with AI-driven intelligence

Amazon Q Business for Retail Intelligence transforms how retailers handle their data challenges through a generative AI-powered assistant. This solution integrates seamlessly with existing systems, using Retrieval Augmented Generation (RAG) to unify disparate data sources and deliver actionable insights in real time. The following are some of the key benefits for various roles:

  • C-Suite executives – Access comprehensive real-time dashboards for company-wide metrics and KPIs while using AI-driven recommendations for strategic decisions. Use predictive analytics to anticipate consumer shifts and enable proactive strategy adjustments for business growth.
  • Merchandisers – Gain immediate insights into sales trends, profit margins, and inventory turnover rates through automated analysis tools and AI-powered pricing strategies. Identify and capitalize on emerging trends through predictive analytics for optimal product mix and category management.
  • Inventory managers – Implement data-driven stock level optimization across multiple store locations while streamlining operations with automated reorder point calculations. Accurately predict and prepare for seasonal demand fluctuations to maintain optimal inventory levels during peak periods.
  • Store managers – Maximize operational efficiency through AI-predicted staffing optimization while accessing detailed insights about local conditions affecting store performance. Compare store metrics against other locations using sophisticated benchmarking tools to identify improvement opportunities.
  • Marketing analysts – Monitor and analyze marketing campaign effectiveness across channels in real time while developing sophisticated customer segments using AI-driven analysis. Calculate and optimize marketing ROI across channels for efficient budget allocation and improved campaign performance.

Amazon Q Business for Retail Intelligence makes complex data analysis accessible to different users through its natural language interface. This solution enables data-driven decision-making across organizations by providing role-specific insights that break down traditional data silos. By providing each retail persona tailored analytics and actionable recommendations, organizations can achieve greater operational efficiency and maintain a competitive edge in the dynamic retail landscape.

Conclusion

Amazon Q Business for Retail Intelligence combines generative AI capabilities with powerful visualization tools to revolutionize retail operations. By enabling natural language interactions with complex data systems, this solution democratizes data access across organizational levels, from C-suite executives to store managers. The system’s ability to provide role-specific insights, automate workflows, and facilitate real-time decision-making positions it as a crucial tool for retail businesses seeking to maintain competitiveness in today’s dynamic landscape. As retailers continue to embrace AI-driven solutions, Amazon Q Business for Retail Intelligence can help meet the industry’s growing needs for sophisticated data analysis and operational efficiency.

To learn more about our solutions and offerings, refer to Amazon Q Business and Generative AI on AWS. For expert assistance, AWS Professional Services, AWS Generative AI partner solutions, and AWS Generative AI Competency Partners are here to help.


About the authors

Suprakash Dutta is a Senior Solutions Architect at Amazon Web Services, leading strategic cloud transformations for Fortune 500 retailers and large enterprises. He specializes in architecting mission-critical retail solutions that drive significant business outcomes, including cloud-native based systems, generative AI implementations, and retail modernization initiatives. He’s a multi-cloud certified architect and has delivered transformative solutions that modernized operations across thousands of retail locations while driving breakthrough efficiencies through AI-powered retail intelligence solutions.

Alberto Alonso is a Specialist Solutions Architect at Amazon Web Services. He focuses on generative AI and how it can be applied to business challenges.

Abhijit Dutta is a Sr. Solutions Architect in the Retail/CPG vertical at AWS, focusing on key areas like migration and modernization of legacy applications, data-driven decision-making, and implementing AI/ML capabilities. His expertise lies in helping organizations use cloud technologies for their digital transformation initiatives, with particular emphasis on analytics and generative AI solutions.

Ramesh Venkataraman is a Solutions Architect who enjoys working with customers to solve their technical challenges using AWS services. Outside of work, Ramesh enjoys following Stack Overflow questions and answering them in any way he can.

Girish Nazhiyath is a Sr. Solutions Architect in the Amazon Web Services Retail/CPG vertical. He enjoys working with retail/CPG customers to enable technology-driven retail innovation, with over 20 years of expertise in multiple retail segments and domains worldwide.

Krishnan Hariharan is a Sr. Manager, Solutions Architecture at AWS based out of Chicago. In his current role, he uses his diverse blend of customer, product, technology, and operations skills to help retail/CPG customers build the best solutions using AWS. Prior to AWS, Krishnan was President/CEO at Kespry, and COO at LightGuide. He has an MBA from The Fuqua School of Business, Duke University and a Bachelor of Science in Electronics from Delhi University.

Democratize data for timely decisions with text-to-SQL at Parcel Perform

This post was co-written with Le Vy from Parcel Perform.

Access to accurate data is often the true differentiator of excellent and timely decisions. This is even more crucial for customer-facing decisions and actions. A correctly implemented state-of-the-art AI can help your organization simplify access to data for accurate and timely decision-making for the customer-facing business team, while reducing the undifferentiated heavy lifting done by your data team. In this post, we share how Parcel Perform, a leading AI Delivery Experience Platform for e-commerce businesses worldwide, implemented such a solution.

Accurate post-purchase delivery tracking can be crucial for many ecommerce merchants. Parcel Perform provides an AI-driven, intelligent end-to-end data and delivery experience and software as a service (SaaS) system for ecommerce merchants. The system uses AWS services and state-of-the-art AI to process hundreds of millions of parcel delivery movement data points daily and provide a unified tracking capability across couriers for the merchants, with an emphasis on accuracy and simplicity.

The business team in Parcel Perform often needs access to data to answer questions related to merchants’ parcel deliveries, such as “Did we see a spike in delivery delays last week? If so, in which transit facilities were this observed, and what was the primary cause of the issue?” Previously, the data team had to manually form the query and run it to fetch the data. With the new generative AI-powered text-to-SQL capability in Parcel Perform, the business team can self-serve their data needs by using an AI assistant interface. In this post, we discuss how Parcel Perform incorporated generative AI, data storage, and data access through AWS services to make timely decisions.

Data analytics architecture

The solution starts with data ingestion, storage, and access. Parcel Perform adopted the data analytics architecture shown in the following diagram.

Architecture diagram of the parcel event data ingestion at Parcel Perform

One key data type in the Parcel Perform parcel monitoring application is the parcel event data, which can reach billions of rows. This includes the parcel’s shipment status change, location change, and much more. This day-to-day data from multiple business units lands in relational databases hosted on Amazon Relational Database Service (Amazon RDS).

Although relational databases are suitable for rapid data ingestion and consumption from the application, a separate analytics stack is needed to handle analytics in a scalable and performant way without disrupting the main application. These analytics needs include answering aggregation queries from questions like “How many parcels were delayed last week?”

Parcel Perform uses Amazon Simple Storage Service (Amazon S3) with a query engine provided by Amazon Athena to meet their analytics needs. With this approach, Parcel Perform benefits from cost-effective storage while still being able to run SQL queries as needed on the data through Athena, which is priced on usage.
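To illustrate this pattern, the following is a minimal sketch that runs an aggregation query through Athena with boto3 and polls for the result; the database name, table, and output location are placeholders rather than Parcel Perform’s actual setup.

import time
import boto3

athena = boto3.client("athena")

# Placeholder names for illustration only
DATABASE = "parcel_analytics"
OUTPUT_LOCATION = "s3://example-athena-results/queries/"

query = """
SELECT courier, COUNT(*) AS delayed_parcels
FROM parcel_events
WHERE event_type = 'delay'
  AND event_date >= date_add('day', -7, current_date)
GROUP BY courier
ORDER BY delayed_parcels DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])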

Data in Amazon S3 is stored in Apache Iceberg data format that allows data updates, which is useful in this case because the parcel events sometimes get updated. It also supports partitioning for better performance. Amazon S3 Tables, launched in late 2024, is a managed Iceberg tables feature that can also be an option for you.

Parcel Perform uses an Apache Kafka cluster managed by Amazon Managed Streaming for Apache Kafka (Amazon MSK) as the stream to move the data from the source to the S3 bucket. Amazon MSK Connect with a Debezium connector streams data with change data capture (CDC) from Amazon RDS to Amazon MSK.

Apache Flink, running on Amazon Elastic Kubernetes Service (Amazon EKS), processes data streams from Amazon MSK. It writes this data to an S3 bucket according to the Iceberg format, and updates the data schema in the AWS Glue Data Catalog. The data schema enables Athena to correctly query the data in the S3 bucket.

Now that you understand how the data is ingested and stored, let’s show how the data is consumed using the generative AI-powered data serving assistant for the business teams in Parcel Perform.

AI agent that can query data

The users of the data serving AI agent in Parcel Perform are customer-facing business team members who often query the parcel event data to answer questions from ecommerce merchants regarding the parcel deliveries and to proactively assist them. The following screenshot shows the UI experience for the AI agent assistant, powered by text-to-SQL with generative AI.

A screenshot of the AI assistant

This functionality helped the Parcel Perform team and their customers save time, which we discuss later in this post. In the following section, we present the architecture that powers this feature.

Text-to-SQL AI agent architecture

The data serving AI assistant architecture in Parcel Perform is shown in the following diagram.

Architecture diagram of the AI assistant

The AI assistant UI is powered by an application built with the FastAPI framework hosted on Amazon EKS. It is also fronted by an Application Load Balancer to allow for potential horizontal scalability.

The application uses LangGraph to orchestrate the workflow of large language model (LLM) invocations, the use of tools, and the memory checkpointing. The graph uses multiple tools, including those from SQLDatabase Toolkit, to automatically fetch the data schema through Athena. The graph also uses an Amazon Bedrock Knowledge Bases retriever to retrieve business information from a knowledge base. Parcel Perform uses Anthropic’s Claude models in Amazon Bedrock to generate SQL.
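The following is a simplified sketch of how these pieces could be wired together, not Parcel Perform’s production code: a LangGraph ReAct-style agent that combines the SQLDatabase Toolkit over an Athena connection with Anthropic’s Claude on Amazon Bedrock. The connection string, model ID, and prompt are placeholder values, and the knowledge base retrieval and guardrail steps are omitted for brevity.

from langchain_aws import ChatBedrockConverse
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langgraph.prebuilt import create_react_agent

# Placeholder Athena connection; assumes the PyAthena SQLAlchemy driver is installed
db = SQLDatabase.from_uri(
    "awsathena+rest://@athena.us-east-1.amazonaws.com:443/"
    "parcel_analytics?s3_staging_dir=s3://example-athena-results/queries/"
)

# Claude on Amazon Bedrock as the reasoning and SQL-generating model (placeholder model ID)
llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")

# Tools that list tables, fetch schemas, and run SQL through Athena
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

agent = create_react_agent(
    llm,
    toolkit.get_tools(),
    prompt="You are an agent designed to interact with a SQL database. ...",
)

result = agent.invoke(
    {"messages": [("user", "Did we see a spike in delivery delays last week?")]}
)
print(result["messages"][-1].content)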

Although the function of Athena as a query engine to query the parcel event data on Amazon S3 is clear, Parcel Perform still needs a knowledge base. In this use case, the SQL generation performs better when the LLM has more business contextual information to help interpret database fields and translate logistics terminology into data representations. This is better illustrated with the following two examples:

  • Parcel Perform’s data lake operations use specific codes: c for create and u for update. When analyzing data, Parcel Perform sometimes needs to focus only on initial creation records, where operation code is equal to c. Because this business logic might not be inherent in the training of LLMs in general, Parcel Perform explicitly defines this in their business context.
  • In logistics terminology, transit time has specific industry conventions. It’s measured in days, and same-day deliveries are recorded as transit_time = 0. Although this is intuitive for logistics professionals, an LLM might incorrectly interpret a request like “Get me all shipments with same-day delivery” by using WHERE transit_time = 1 instead of WHERE transit_time = 0 in the generated SQL statement.

Therefore, each incoming question goes to a Retrieval Augmented Generation (RAG) workflow to find potentially relevant stored business information, to enrich the context. This mechanism helps provide the specific rules and interpretations that even advanced LLMs might not be able to derive from general training data.

Parcel Perform uses Amazon Bedrock Knowledge Bases as a managed solution for the RAG workflow. They ingest business contextual information by uploading files to Amazon S3. Amazon Bedrock Knowledge Bases processes the files, converts them to chunks, uses embedding models to generate vectors, and stores the vectors in a vector database to make them searchable. The steps are fully managed by Amazon Bedrock Knowledge Bases. Parcel Perform stores the vectors in Amazon OpenSearch Serverless as the vector database of choice to simplify infrastructure management.
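As a small illustration of that ingestion flow, the following sketch uploads a context document and triggers a sync of the knowledge base data source; the bucket, knowledge base ID, and data source ID are placeholder values.

import boto3

s3 = boto3.client("s3")
bedrock_agent = boto3.client("bedrock-agent")

# Upload a business context document to the placeholder bucket
s3.upload_file(
    Filename="business_context/transit_time_conventions.md",
    Bucket="example-business-context-bucket",
    Key="business_context/transit_time_conventions.md",
)

# Start an ingestion job so the new document is chunked, embedded, and indexed
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="EXAMPLEKBID",
    dataSourceId="EXAMPLEDSID",
)
print(job["ingestionJob"]["status"])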

Amazon Bedrock Knowledge Bases provides the Retrieve API, which takes in an input (such as a question from the AI assistant), converts it into a vector embedding, searches for relevant chunks of business context information in the vector database, and returns the top relevant document chunks. It is integrated with the LangChain Amazon Bedrock Knowledge Bases retriever by calling the invoke method.
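The following is a minimal sketch of that retrieval step using the LangChain retriever; the knowledge base ID and the number of results are placeholder values.

from langchain_aws.retrievers import AmazonKnowledgeBasesRetriever

# Placeholder knowledge base ID and retrieval configuration
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="EXAMPLEKBID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

question = "Get me all shipments with same-day delivery"
docs = retriever.invoke(question)

# Concatenate the retrieved chunks into the business context for the SQL generation prompt
rag_context = "\n\n".join(doc.page_content for doc in docs)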

The next step involves invoking an AI agent with the supplied business contextual information and the SQL generation prompt. The prompt was inspired by a prompt in LangChain Hub. The following is a code snippet of the prompt:

You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
Relevant context:
{rag_context}
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
- Only use the below tools. Only use the information returned by the below tools to construct your final answer.
- DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.
- To start querying for final answer you should ALWAYS look at the tables in the database to see what you can query. Do NOT skip this step.
- Then you should query the schema of the most relevant tables

The prompt sample is part of the initial instruction for the agent. The data schema is automatically inserted by the tools from the SQLDatabase Toolkit at a later step of this agentic workflow. The following steps occur after a user enters a question in the AI assistant UI:

  1. The question triggers a run of the LangGraph graph.
  2. The following processes happen in parallel:
    1. The graph fetches the database schema from Athena through SQLDatabase Toolkit.
    2. The graph passes the question to the Amazon Bedrock Knowledge Bases retriever and gets a list of relevant business information regarding the question.
  3. The graph invokes an LLM using Amazon Bedrock by passing the question, the conversation context, data schema, and business context information. The result is the generated SQL.
  4. The graph uses SQLDatabase Toolkit again to run the SQL through Athena and fetch the data output.
  5. The data output is passed into an LLM to generate the final response based on the initial question asked. Amazon Bedrock Guardrails is used as a safeguard to avoid inappropriate inputs and responses.
  6. The final response is returned to the user through the AI assistant UI.

The following diagram illustrates these steps.

Architecture diagram of the AI assistant with numbered steps

This implementation demonstrates how Parcel Perform transforms raw inquiries into actionable data for timely decision-making. Security is also implemented in multiple components. From a network perspective, the EKS pods are placed in private subnets in Amazon Virtual Private Cloud (Amazon VPC) to improve network security of the AI assistant application. This AI agent is placed behind a backend layer that requires authentication. For data security, sensitive data is masked at rest in the S3 bucket. Parcel Perform also limits the permissions of the AWS Identity and Access Management (IAM) role used to access the S3 bucket so it can only access certain tables.

In the following sections, we discuss Parcel Perform’s approach to building this data transformation solution.

From idea to production

Parcel Perform started with the idea of freeing their data team from manually serving the request from the business team, while also improving the timeliness of the data availability to support the business team’s decision-making.

With the help of the AWS Solutions Architect team, Parcel Perform completed a proof of concept using AWS services and a Jupyter notebook in Amazon SageMaker Studio. After an initial success, Parcel Perform integrated the solution with their orchestration tool of choice, LangGraph.

Before going into production, Parcel Perform conducted extensive testing to verify the results were consistent. They added LangSmith Tracing to log the AI agent’s steps and results to evaluate its performance.

The Parcel Perform team discovered challenges during their journey, which we discuss in the following section. They performed prompt engineering to address those challenges. Eventually, the AI agent was integrated into production to be used by the business team. Afterward, Parcel Perform collected user feedback internally and monitored logs from LangSmith Tracing to verify performance was maintained.

The challenges

This journey isn’t free from challenges. Firstly, some ecommerce merchants might have several records in the data lake under various names. For example, a merchant with the name “ABC” might have multiple records, such as “ABC Singapore Holdings Pte. Ltd.,” “ABC Demo Account,” “ABC Test Group,” and so on. For a question like “Was there any parcel shipment delay by ABC last week?”, the generated SQL includes WHERE merchant_name LIKE '%ABC%', which might result in ambiguity. During the proof of concept stage, this problem caused incorrect matching of results.

For this challenge, Parcel Perform relies on careful prompt engineering to instruct the LLM to identify when the name was potentially ambiguous. The AI agent then calls Athena again to look for matching names. The LLM decides which merchant name to use based on multiple factors, including the significance in data volume contribution and the account status in the data lake. In the future, Parcel Perform intends to implement a more sophisticated technique by prompting the user to resolve the ambiguity.

The second challenge is about unrestricted questions that might yield expensive queries running across large amounts of data and resulting in longer query waiting time. Some of these questions might not have a LIMIT clause imposed in the query. To solve this, Parcel Perform instructs the LLM to add a LIMIT clause with a certain number of maximum results if the user doesn’t specify the intended number of results. In the future, Parcel Perform plans to use the query EXPLAIN results to identify heavy queries.

The third challenge is related to tracking usage and incurred cost of this particular solution. Having started multiple generative AI projects using Amazon Bedrock and sometimes with the same LLM ID, Parcel Perform must distinguish usage incurred by projects. Parcel Perform creates an inference profile for each project, associates the profile with tags, and includes that profile in each LLM call for that project. With this setup, Parcel Perform is able to segregate costs based on projects to improve cost visibility and monitoring.
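The following sketch shows one way to set this up with boto3; the profile name, tags, and underlying model ARN are placeholders, and the returned profile ARN is what you would pass as the model ID in subsequent calls so that usage is attributed to the project.

import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

# Create an application inference profile tagged with the project name (placeholder values)
profile = bedrock.create_inference_profile(
    inferenceProfileName="text-to-sql-agent",
    description="Tracks usage and cost for the text-to-SQL AI agent project",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    },
    tags=[{"key": "project", "value": "text-to-sql-agent"}],
)
profile_arn = profile["inferenceProfileArn"]

# Invoke the model through the project-specific profile so usage and cost are attributed to the project
response = bedrock_runtime.converse(
    modelId=profile_arn,
    messages=[{"role": "user", "content": [{"text": "Generate a SQL query for ..."}]}],
)
print(response["output"]["message"]["content"][0]["text"])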

The impact

Previously, to extract data, the business team had to clarify details with the data team, make a request, check feasibility, and wait for the data team’s bandwidth. This process lengthened when requirements came from customers or teams in different time zones, with each clarification adding 12–24 hours due to asynchronous communication. Simpler requests made early in the workday might have been completed within 24 hours, whereas more complex requests or those during busy periods could take 3–5 business days.

With the text-to-SQL AI agent, this process is dramatically streamlined—minimizing the back-and-forth communication for requirement clarification, removing the dependency on data team bandwidth, and automating result interpretation.

Parcel Perform’s measurements show that the text-to-SQL AI agent reduces the average time-to-insight by 99%, from 2.3 days to an average of 10 minutes, saving approximately 3,850 total hours of wait time per month across requesters while maintaining data accuracy.

Users can directly query the data without intermediaries, receiving results in minutes rather than days. Teams across time zones can now access insights any time of day, alleviating the frustrating “wait until Asia wakes up” or “catch EMEA before they leave” delays, leading to happier customers and faster problem-solving.

This transformation has profoundly impacted the data analytics team’s capacity and focus, freeing the data team for more strategic work and helping everyone make faster, more informed decisions. Before, the analysts spent approximately 25% of their working hours handling routine data extraction requests—equivalent to over 260 hours monthly across the team. Now, with basic and intermediate queries automated, this number has dropped to just 10%, freeing up nearly 160 hours each month for high-impact work. Analysts now focus on complex data analysis rather than spending time on basic data retrieval tasks.

Conclusion

Parcel Perform’s solution demonstrates how you can use generative AI to enhance productivity and customer experience. Parcel Perform has built a text-to-SQL AI agent that transforms a business team’s question into SQL that can fetch the actual data. This improves the timeliness of data availability for decision-making that involves customers. Furthermore, the data team can avoid the undifferentiated heavy lifting to focus on complex data analysis tasks.

This solution uses multiple AWS services like Amazon Bedrock and tools like LangGraph. You can start with a proof of concept and consult your AWS Solutions Architect or engage with AWS Partners. If you have questions, post them on AWS re:Post. You can also make the development more straightforward with the help of Amazon Q Developer. When you face challenges, you can iterate to find the solution, which might include prompt engineering or adding additional steps to your workflow.

Security is a top priority. Make sure your AI assistant has proper guardrails in place to protect against prompt threats, inappropriate topics, profanity, leaked data, and other security issues. You can integrate Amazon Bedrock Guardrails with your generative AI application through an API.
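For example, a standalone input check with the ApplyGuardrail API could look like the following sketch; the guardrail identifier and version are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder guardrail identifier and version
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="gr-exampleid",
    guardrailVersion="1",
    source="INPUT",  # use "OUTPUT" to check model responses before returning them
    content=[{"text": {"text": "Show me another customer's delivery address and phone number"}}],
)

if response["action"] == "GUARDRAIL_INTERVENED":
    print("Blocked:", response["outputs"][0]["text"])
else:
    print("Input passed the guardrail checks")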


About the authors

Yudho Ahmad Diponegoro is a Senior Solutions Architect at AWS. Having been part of Amazon for 10+ years, he has held various roles from software development to solutions architecture. He helps startups in Singapore with architecting in the cloud. While he keeps his breadth of knowledge across technologies and industries, he focuses on AI and machine learning, where he has been guiding various startups in ASEAN to adopt machine learning and generative AI at AWS.

Le Vy is the AI Team Lead at Parcel Perform, where she drives the development of AI applications and explores emerging AI research. She started her career in data analysis and deepened her focus on AI through a Master’s in Artificial Intelligence. Passionate about applying data and AI to solve real business problems, she also dedicates time to mentoring aspiring technologists and building a supportive community for youth in tech. Through her work, Vy actively challenges gender norms in the industry and champions lifelong learning as a key to innovation.

Loke Jun Kai is a GenAI/ML Specialist Solutions Architect in AWS, covering strategic customers across the ASEAN region. He works with customers ranging from startups to enterprises to build cutting-edge use cases and scalable GenAI platforms. His passion for the AI space, along with constant research and reading, has led to many innovative solutions built with concrete business outcomes. Outside of work, he enjoys a good game of tennis and chess.

Query Amazon Aurora PostgreSQL using Amazon Bedrock Knowledge Bases structured data

Amazon Bedrock Knowledge Bases offers a fully managed Retrieval Augmented Generation (RAG) feature that connects large language models (LLMs) to internal data sources. This feature enhances foundation model (FM) outputs with contextual information from private data, making responses more relevant and accurate.

At AWS re:Invent 2024, we announced Amazon Bedrock Knowledge Bases support for natural language querying to retrieve structured data from Amazon Redshift and Amazon SageMaker Lakehouse. This feature provides a managed workflow for building generative AI applications that access and incorporate information from structured and unstructured data sources. Through natural language processing, Amazon Bedrock Knowledge Bases transforms natural language queries into SQL queries, so users can retrieve data directly from supported sources without understanding database structure or SQL syntax.

In this post, we discuss how to make your Amazon Aurora PostgreSQL-Compatible Edition data available for natural language querying through Amazon Bedrock Knowledge Bases while maintaining data freshness.

Structured data retrieval in Amazon Bedrock Knowledge Bases and Amazon Redshift Zero-ETL

Structured data retrieval in Amazon Bedrock Knowledge Bases enables natural language interactions with your database by converting user queries into SQL statements. When you connect a supported data source like Amazon Redshift, Amazon Bedrock Knowledge Bases analyzes your database schema, table relationships, query engine, and historical queries to understand the context and structure of your information. This understanding allows the service to generate accurate SQL queries from natural language questions.

At the time of writing, Amazon Bedrock Knowledge Bases supports structured data retrieval directly from Amazon Redshift and SageMaker Lakehouse. Although direct support for Aurora PostgreSQL-Compatible isn’t currently available, you can use the zero-ETL integration between Aurora PostgreSQL-Compatible and Amazon Redshift to make your data accessible to Amazon Bedrock Knowledge Bases structured data retrieval. Zero-ETL integration automatically replicates your Aurora PostgreSQL tables to Amazon Redshift in near real time, alleviating the need for complex extract, transform, and load (ETL) pipelines or data movement processes.

This architectural pattern is particularly valuable for organizations seeking to enable natural language querying of their structured application data stored in Amazon Aurora database tables. By combining zero-ETL integration with Amazon Bedrock Knowledge Bases, you can create powerful applications like AI assistants that use LLMs to provide natural language responses based on their operational data.

Solution overview

The following diagram illustrates the architecture we will implement to connect Aurora PostgreSQL-Compatible to Amazon Bedrock Knowledge Bases using zero-ETL.

Architecture Diagram

The workflow consists of the following steps:

  1. Data is stored in Aurora PostgreSQL-Compatible within the private subnet. We use a bastion host to connect securely to the database from the public subnet.
  2. Using zero-ETL integration, this data is made available in Amazon Redshift, also located in the private subnet.
  3. Amazon Bedrock Knowledge Bases uses Amazon Redshift as its structured data source.
  4. Users can interact with Amazon Bedrock Knowledge Bases using the AWS Management Console or an AWS SDK client, which sends natural language queries. These queries are processed by Amazon Bedrock Knowledge Bases to retrieve information stored in Amazon Redshift (sourced from Aurora).

Prerequisites

Make sure you’re logged in with a user role with access to create an Aurora database, run DDL (CREATE, ALTER, DROP, RENAME) and DML (SELECT, INSERT, UPDATE, DELETE) statements, create a Redshift database, set up zero-ETL integration, and create an Amazon Bedrock knowledge base.

Set up the Aurora PostgreSQL database

In this section, we walk through creating and configuring an Aurora PostgreSQL database with a sample schema for our demonstration. We create three interconnected tables: products, customers, and orders.

Provision the database

Let’s begin by setting up our database environment. Create a new Aurora PostgreSQL database cluster and launch an Amazon Elastic Compute Cloud (Amazon EC2) instance that will serve as our access point for managing the database. The EC2 instance will make it straightforward to create tables and manage data throughout this post.

The following screenshot shows the details of our database cluster and EC2 instance.

Aurora PostgreSQL cluster

For instructions to set up your database, refer to Creating and connecting to an Aurora PostgreSQL DB cluster.

Create the database schema

After you connect to your database using SSH on your EC2 instance (described in Creating and connecting to an Aurora PostgreSQL DB cluster), it’s time to create your data structure. We use the following DDL statements to create three tables:

-- Create Product table
CREATE TABLE product (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL
);

-- Create Customer table
CREATE TABLE customer (
    customer_id SERIAL PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL,
    pincode VARCHAR(10) NOT NULL
);

-- Create Orders table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    FOREIGN KEY (product_id) REFERENCES product(product_id),
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
);

Populate the tables with data

After you create the tables, you can populate them with sample data. When inserting data into the orders table, remember to maintain referential integrity by verifying the following:

  • The product_id exists in the product table
  • The customer_id exists in the customer table

We use the following example code to populate the tables:

INSERT INTO product (product_id, product_name, price) VALUES (1, 'Smartphone X', 699.99);
INSERT INTO product (product_id, product_name, price) VALUES (2, 'Laptop Pro', 1299.99);
INSERT INTO product (product_id, product_name, price) VALUES (3, 'Wireless Earbuds', 129.99);
INSERT INTO customer (customer_id, customer_name, pincode) VALUES (1, 'John Doe', '12345');
INSERT INTO customer (customer_id, customer_name, pincode) VALUES (2, 'Jane Smith', '23456');
INSERT INTO customer (customer_id, customer_name, pincode) VALUES (3, 'Robert Johnson', '34567');
INSERT INTO orders (order_id, product_id, customer_id) VALUES (1, 1, 1);
INSERT INTO orders (order_id, product_id, customer_id) VALUES (2, 1, 2);
INSERT INTO orders (order_id, product_id, customer_id) VALUES (3, 2, 3);
INSERT INTO orders (order_id, product_id, customer_id) VALUES (4, 2, 1);
INSERT INTO orders (order_id, product_id, customer_id) VALUES (5, 3, 2);
INSERT INTO orders (order_id, product_id, customer_id) VALUES (6, 3, 3);

Make sure to maintain referential integrity when populating the orders table to avoid foreign key constraint violations.

You can also adapt these examples to build your own schema and populate it with your data for this demonstration.

Set up the Redshift cluster and configure zero-ETL

Now that you have set up your Aurora PostgreSQL database, you can establish the zero-ETL integration with Amazon Redshift. This integration automatically syncs your data between Aurora PostgreSQL-Compatible and Amazon Redshift.

Set up Amazon Redshift

First, create an Amazon Redshift Serverless workgroup and namespace. For instructions, see Creating a data warehouse with Amazon Redshift Serverless.

Create a zero-ETL integration

The zero-ETL integration process involves two main steps:

  1. Create the zero-ETL integration from your Aurora PostgreSQL database to Redshift Serverless (if you prefer to script this step, see the sketch after this list).
  2. After you establish the integration on the Aurora side, create the corresponding mapping database in Amazon Redshift. This step is crucial for facilitating proper data synchronization between the two services.
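If you prefer to script the first step instead of using the console, the following sketch creates the integration with boto3; the source cluster ARN and target namespace ARN are placeholders. Step 2 is still performed from the Amazon Redshift side, typically with a CREATE DATABASE ... FROM INTEGRATION statement in Query Editor v2.

import boto3

rds = boto3.client("rds")

# Placeholder ARNs for the Aurora PostgreSQL cluster (source) and
# the Redshift Serverless namespace (target)
integration = rds.create_integration(
    SourceArn="arn:aws:rds:us-east-1:111122223333:cluster:aurora-pg-cluster",
    TargetArn="arn:aws:redshift-serverless:us-east-1:111122223333:namespace/example-namespace-id",
    IntegrationName="aurora-pg-to-redshift",
)
print(integration["Status"])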

The following screenshot shows our zero-ETL integration details.

Zero ETL Integration

Verify the integration

After you complete the integration, you can verify its success through several checks.

Firstly, you can check the zero-ETL integration details in the Amazon Redshift console. You should see an Active status for your integration, along with source and destination information, as shown in the following screenshot.

Redshift Zero ETL

Additionally, you can use the Redshift Query Editor v2 to verify that your data has been successfully populated. A simple query like SELECT * FROM customer; should return the synchronized data from your Aurora PostgreSQL database, as shown in the following screenshot.

Amazon Redshift Query Editor

Set up the Amazon Bedrock knowledge base with structured data

The final step is to create an Amazon Bedrock knowledge base that will enable natural language querying of our data.

Create the Amazon Bedrock knowledge base

Create a new Amazon Bedrock knowledge base with the structured data option. For instructions, see Build a knowledge base by connecting to a structured data store. Then you must synchronize the query engine to enable data access.

Configure data access permissions

Before the sync process can succeed, you need to grant appropriate permissions to the Amazon Bedrock Knowledge Bases AWS Identity and Access Management (IAM) role. This involves executing GRANT SELECT commands for each table in your Redshift database.

Run the following command in Redshift Query Editor v2 for each table:

GRANT SELECT ON <table_name> TO "IAMR:<KB Role name>";

For example:

GRANT SELECT ON customer TO "IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_ej0f0";

For production setups, integrating the end-user identity into the data access flow requires identity federation. Refer to AWS documentation on structured database access for the role-based access model. For federating identities from web clients, Amazon Cognito or SAML federation with AWS Security Token Service (AWS STS) might be required depending on your architecture.

Verify the setup

After you complete the configuration, your knowledge base should show the following details:

  • Status as Available
  • Query engine successfully synced with Amazon Redshift
  • COMPLETE status for the database synchronization

You can now start querying your data using natural language.

Example natural language queries

Now that you have set up your Amazon Bedrock knowledge base, you can begin testing its capabilities by running natural language queries against your structured data. Amazon Bedrock Knowledge Bases structured data translates plain English questions into SQL and uses FMs to generate human-readable responses.

You can test your Amazon Bedrock knowledge base in two ways:

  • Amazon Bedrock console – On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane, and select your configured knowledge base. Open the Test tab to interact with your data.
  • AWS SDK or API – Use the AWS SDK (for example, Python’s Boto3) or directly call the RetrieveAndGenerate API to send queries programmatically, as shown in the sketch after this list.
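The following is a minimal sketch of the SDK path; the knowledge base ID and model ARN are placeholders for your own values, and depending on the model you choose, an inference profile ARN might be required instead of a foundation model ARN.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholder knowledge base ID and generation model ARN
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Which customers have purchased the most products?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "EXAMPLEKBID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",
        },
    },
)
print(response["output"]["text"])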

In this section, we illustrate the console experience. On the Amazon Bedrock console, you can interact with your Amazon Bedrock knowledge base in two modes:

  • Generate SQL queries – This option shows you the raw SQL query that gets generated from your natural language question

Amazon Bedrock Knowledge base

  • Retrieval and response generation – This provides a natural language response using an LLM

Retrieval and response generation

The following examples show natural language queries with the SQL returned by the Generate SQL option, the response from retrieval and response generation, and the model used for response generation.

Example 1
Natural language query: How many customers do we have?
Generated SQL:

SELECT COUNT(DISTINCT customer_id) AS num_customers
FROM public.customer;

Response: We currently have 11 unique customers.
Model used for response generation: Amazon Nova Lite

Example 2
Natural language query: Which all customers have purchased the most products?
Generated SQL:

SELECT c."customer_name", COUNT(o."product_id") AS "total_products_purchased"
FROM public.customer c
INNER JOIN public.orders o
ON c."customer_id" = o."customer_id"
GROUP BY c."customer_name"
ORDER BY "total_products_purchased" DESC;

Response: Based on the data, the customers who have purchased the most products are Charlie Davis, Alice Brown, and John Doe, each having purchased 14 products. Following closely are Jane Smith, Grace Lee, and Bob Johnson, who have each purchased 13 products. Henry Taylor, Frank Miller, and Eva Wilson have each purchased 12 products, while Ivy Anderson has purchased 11 products.
Model used for response generation: Amazon Nova Lite

Example 3
Natural language query: Who all have purchased more than one number of the most expensive product?
Generated SQL:

SELECT c.customer_name
FROM public.customer c
INNER JOIN public.orders o
ON c.customer_id = o.customer_id
INNER JOIN public.product p
ON o.product_id = p.product_id
WHERE p.price = (SELECT MAX(price)
FROM public.product)
GROUP BY c.customer_name
HAVING COUNT(DISTINCT o.order_id) > 1;

Response: The customers who have purchased more than one number of the most expensive product are Grace Lee, Jane Smith, Alice Brown, and Eva Wilson.
Model used for response generation: Amazon Nova Micro

Clean up

When you’re done using this solution, clean up the resources you created to avoid ongoing charges.

Conclusion

In this post, we demonstrated how to enable natural language querying of Aurora PostgreSQL data using Amazon Bedrock Knowledge Bases through zero-ETL integration with Amazon Redshift. We showed how to set up the database, configure zero-ETL integration, and establish the knowledge base connection for seamless data access. Although this solution provides an effective way to interact with your data using natural language, you should consider the additional storage costs in Amazon Redshift when implementing this architecture for your use case.

Please try out this solution for yourself and share your feedback in the comments.


About the authors

Girish B is a Senior Solutions Architect at AWS India Pvt Ltd based in Bengaluru. Girish works with many ISV customers to design and architect innovative solutions on AWS.

Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock.

Configure fine-grained access to Amazon Bedrock models using Amazon SageMaker Unified Studio

Enterprises adopting advanced AI solutions recognize that robust security and precise access control are essential for protecting valuable data, maintaining compliance, and preserving user trust. As organizations expand AI usage across teams and applications, they require granular permissions to safeguard sensitive information and manage who can access powerful models. Amazon SageMaker Unified Studio addresses these needs so organizations can configure fine-grained access policies, making sure that only authorized users can interact with foundation models (FMs) while supporting secure, collaborative innovation at scale.

Launched in 2025, SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the best tools across use cases. SageMaker Unified Studio brings together the functionality and tools from existing AWS analytics and AI/ML services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI.

Amazon Bedrock in SageMaker Unified Studio provides various options for discovering and experimenting with Amazon Bedrock models and applications. For example, you can use a chat playground to try a prompt with Anthropic’s Claude without having to write code. You can also create a generative AI application that uses an Amazon Bedrock model and features, such as a knowledge base or a guardrail. To learn more, refer to Amazon Bedrock in SageMaker Unified Studio.

In this post, we demonstrate how to use SageMaker Unified Studio and AWS Identity and Access Management (IAM) to establish a robust permission framework for Amazon Bedrock models. We show how administrators can precisely manage which users and teams have access to specific models within a secure, collaborative environment. We guide you through creating granular permissions to control model access, with code examples for common enterprise governance scenarios. By the end, you will understand how to tailor access to generative AI capabilities to meet your organization’s requirements—addressing a core challenge in enterprise AI adoption by enabling developer flexibility while maintaining strong security standards.

Solution overview

In SageMaker Unified Studio, a domain serves as the primary organizational structure, so you can oversee multiple AWS Regions, accounts, and workloads from a single interface. Each domain is assigned a unique URL and offers centralized control over studio configurations, user accounts, and network settings.

Inside each domain, projects facilitate streamlined collaboration. Projects can span different Regions or accounts within a Region, and their metadata includes details such as the associated Git repository, team members, and their permissions. Every account where a project has resources is assigned at least one project role, which determines the tools, compute resources, datasets, and artificial intelligence and machine learning (AI/ML) assets accessible to project members. To manage data access, you can adjust the IAM permissions tied to the project’s role. SageMaker Unified Studio uses several IAM roles. For a complete list, refer to Identity and access management for Amazon SageMaker Unified Studio.

There are two primary methods for users to interact with Amazon Bedrock models in SageMaker Unified Studio: the SageMaker Unified Studio playground and SageMaker Unified Studio projects.

In the SageMaker Unified Studio playground scenario, model consumption roles provide secure access to Amazon Bedrock FMs. You can choose between automatic role creation for individual models or configuring a single role for all models. The default AmazonSageMakerBedrockModelConsumptionRole comes with preconfigured permissions to consume Amazon Bedrock models, including invoking the Amazon Bedrock application inference profile created for a particular SageMaker Unified Studio domain. To fine-tune access control, you can add inline policies to these roles that explicitly allow or deny access to specific Amazon Bedrock models.

The following diagram illustrates this architecture.

Playground Access

The workflow consists of the following steps:

  1. Initial access path:
    1. The SageMakerDomain execution role connects to the SageMaker Unified Studio domain.
    2. The connection flows to the SageMaker user profile.
    3. The user profile accesses SageMaker Unified Studio.
  2. Restricted access path (top flow):
    1. The direct access attempt from SageMaker Unified Studio to Amazon Bedrock is denied.
    2. The IAM policy blocks access to FMs.
    3. Anthropic’s Claude 3.7 Sonnet access is denied (marked with X).
    4. Anthropic’s Claude 3.5 Haiku access is denied (marked with X).
  3. Permitted access path (bottom flow):
    1. SageMakerBedrockModelConsumptionRole is used.
    2. The appropriate IAM policy allows access (marked with a checkmark).
    3. Amazon Bedrock access is permitted.
    4. Anthropic’s Claude 3.7 Sonnet access is allowed (marked with checkmark).
    5. Anthropic’s Claude 3.5 Haiku access is allowed (marked with checkmark).
  4. Governance mechanism:
    1. IAM policies serve as the control point for model access.
    2. Different roles determine different levels of access permission.
    3. Access controls are explicitly defined for each FM.

In the SageMaker Unified Studio project scenario, SageMaker Unified Studio uses a model provisioning role to create an inference profile for an Amazon Bedrock model in a project. The inference profile is required for the project to interact with the model. You can either let SageMaker Unified Studio automatically create a unique model provisioning role, or you can provide a custom model provisioning role. The default AmazonSageMakerBedrockModelManagementRole has the AWS policy AmazonDataZoneBedrockModelManagementPolicy attached. You can restrict access to specific account IDs through custom trust policies. You can also attach inline policies and use the statement CreateApplicationInferenceProfileUsingFoundationModels to allow or deny access to specific Amazon Bedrock models in your project.

The following diagram illustrates this architecture.

Projects Access

The workflow consists of the following steps:

  1. Initial access path:
    1. The SageMakerDomain execution role connects to the SageMaker Unified Studio domain.
    2. The connection flows to the SageMaker user profile.
    3. The user profile accesses SageMaker Unified Studio.
  2. Restricted access path (top flow):
    1. The direct access attempt from SageMaker Unified Studio to Amazon Bedrock is denied.
    2. The IAM policy blocks access to FMs.
    3. Anthropic’s Claude 3.7 Sonnet access is denied (marked with X).
    4. Anthropic’s Claude 3.5 Haiku access is denied (marked with X).
  3. Permitted access path (bottom flow):
    1. SageMakerBedrockModelManagementRole is used.
    2. The appropriate IAM policy allows access (marked with a checkmark).
    3. Amazon Bedrock access is permitted.
    4. Anthropic’s Claude 3.7 Sonnet access is allowed (marked with checkmark).
    5. Anthropic’s Claude 3.5 Haiku access is allowed (marked with checkmark).
  4. Governance mechanism:
    1. IAM policies serve as the control point for model access.
    2. Different roles determine different levels of access permission.
    3. Access controls are explicitly defined for each FM.

By customizing the policies attached to these roles, you can control which actions are permitted or denied, thereby governing access to generative AI capabilities.

To use a specific model from Amazon Bedrock, SageMaker Unified Studio uses the model ID of the chosen model as part of the API calls. At the time of writing, SageMaker Unified Studio supports the following Amazon Bedrock models (refer to the SageMaker Unified Studio documentation for the full list of current models), grouped by model provider:

  • Amazon:
    • Amazon Titan Text G1 – Premier: amazon.titan-text-premier-v1:0
    • Amazon Nova Pro: amazon.nova-pro-v1:0
    • Amazon Nova Lite: amazon.nova-lite-v1:0
    • Amazon Nova Canvas: amazon.nova-canvas-v1:0
  • Stability AI:
    • SDXL 1.0: stability.stable-diffusion-xl-v1
  • AI21 Labs:
    • Jamba-Instruct: ai21.jamba-instruct-v1:0
    • Jamba 1.5 Large: ai21.jamba-1-5-large-v1:0
    • Jamba 1.5 Mini: ai21.jamba-1-5-mini-v1:0
  • Anthropic:
    • Claude 3.7 Sonnet: anthropic.claude-3-7-sonnet-20250219-v1:0
  • Cohere:
    • Command R+: cohere.command-r-plus-v1:0
    • Command Light: cohere.command-light-text-v14
    • Embed Multilingual: cohere.embed-multilingual-v3
  • DeepSeek:
    • DeepSeek-R1: deepseek.r1-v1:0
  • Meta:
    • Llama 3.3 70B Instruct: meta.llama3-3-70b-instruct-v1:0
    • Llama 4 Scout 17B Instruct: meta.llama4-scout-17b-instruct-v1:0
    • Llama 4 Maverick 17B Instruct: meta.llama4-maverick-17b-instruct-v1:0
  • Mistral AI:
    • Mistral 7B Instruct: mistral.mistral-7b-instruct-v0:2
    • Pixtral Large (25.02): mistral.pixtral-large-2502-v1:0

Create a model consumption role for the playground scenario

In the following steps, you create an IAM role with a trust policy, add two inline policies, and attach them to the role.

Create the IAM role with a trust policy

Complete the following steps to create an IAM role with a trust policy:

  1. On the IAM console, in the navigation pane, choose Roles, then choose Create role.
  2. For Trusted entity type, select Custom trust policy.
  3. Delete the default policy in the editor and enter the following trust policy (replace account-id in the aws:SourceAccount field with your AWS account ID):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "datazone.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "[account-id]"
                }
            }
        }
    ]
}
  4. Choose Next.
  5. Skip the Add permissions page by choosing Next.
  6. Enter a name for the role (for example, DataZoneBedrock-Role) and an optional description.
  7. Choose Create role.

Add the first inline policy

Complete the following steps to add an inline policy:

  1. On the IAM console, open the newly created role details page.
  2. On the Permissions tab, choose Add permissions and then Create inline policy.
  3. On the JSON tab, delete the default policy and enter the first inline policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeDomainInferenceProfiles",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws:bedrock:*:*:application-inference-profile/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/AmazonDataZoneDomain": "${datazone:domainId}",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                },
                "Null": {
                    "aws:ResourceTag/AmazonDataZoneProject": "true"
                }
            }
        }
    ]
}
  4. Choose Review policy.
  5. Name the policy (for example, DataZoneDomainInferencePolicy) and choose Create policy.

Add the second inline policy

Complete the following steps to add another inline policy:

  1. On the role’s Permissions tab, choose Add permissions and then Create inline policy.
  2. On the JSON tab, delete the default policy and enter the second inline policy (replace account-id in the bedrock:InferenceProfileArn field with your account ID):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInferenceProfileToInvokeFoundationModels",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0",
                "arn:aws:bedrock:us-east-2::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0",
                "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"
            ],
            "Condition": {
                "ArnLike": {
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:*:[account-id]:application-inference-profile/*"
                }
            }
        }
    ]
}
  3. Choose Review policy.
  4. Name the policy (for example, BedrockFoundationModelAccessPolicy) and choose Create policy.

Explanation of the policies

In this section, we discuss the details of the policies.

Trust policy explanation

This trust policy defines who can assume the IAM role:

  • It allows the Amazon DataZone service (datazone.amazonaws.com) to assume this role
  • The service can perform sts:AssumeRole and sts:SetContext actions
  • A condition restricts this to only when the source AWS account (aws:SourceAccount) is your AWS account

This makes sure that only Amazon DataZone from your specified account can assume this role.

First inline policy explanation

This policy controls access to Amazon Bedrock inference profiles:

  • It allows invoking Amazon Bedrock models (bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream), but only for resources that are application inference profiles (arn:aws:bedrock:*:*:application-inference-profile/*)
  • It has three important conditions:
    • The profile must be tagged with AmazonDataZoneDomain matching the domain ID of the caller
    • The resource must be in the same AWS account as the principal making the request
    • The resource must not have an AmazonDataZoneProject tag (the Null condition requires that tag key to be absent)

This effectively limits access to only those inference profiles that belong to the same Amazon DataZone domain as the caller and are not associated with a specific project.

Second inline policy explanation

This policy controls which specific FMs can be accessed:

  • It allows the same Amazon Bedrock model invocation actions, but only for specific Anthropic Claude 3.5 Haiku models in three Regions:
    • us-east-1
    • us-east-2
    • us-west-2
  • It has a condition that the request must come through an inference profile from your account

Combined effect on the SageMaker Unified Studio domain generative AI playground

Together, these policies create a secure, controlled environment for using Amazon Bedrock models in the SageMaker Unified Studio domain through the following methods:

  • Limiting model access – Only the specified Anthropic Claude 3.5 Haiku model can be used, not other Amazon Bedrock models
  • Enforcing access through inference profiles – Models can only be accessed through properly configured application inference profiles
  • Maintaining domain isolation – Access is restricted to inference profiles tagged with the user’s Amazon DataZone domain
  • Helping to prevent cross-account access – Resources must be in the same account as the principal
  • Regional restrictions – Model access is only allowed in three specific AWS Regions

This implementation follows the principle of least privilege by providing only the minimum permissions needed for the intended use case, while maintaining proper security boundaries between different domains and projects.

Create a model provisioning role for the project scenario

In this section, we walk through the steps to create an IAM role with a trust policy and add the required inline policy to make sure that the models are limited to the approved ones.

Create the IAM role with a trust policy

Complete the following steps to create an IAM role with a trust policy:

  1. On the IAM console, in the navigation pane, choose Roles, then choose Create role.
  2. For Trusted entity type, select Custom trust policy.
  3. Delete the default policy in the editor and enter the following trust policy (replace account-id in the aws:SourceAccount field with your account ID):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "datazone.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetContext"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "[account-id]"
                }
            }
        }
    ]
}
  4. Choose Next.
  5. Skip the Add permissions page by choosing Next.
  6. Enter a name for the role (for example, SageMakerModelManagementRole) and an optional description, such as Role for managing Bedrock model access in SageMaker Unified Studio.
  7. Choose Create role.

Add the inline policy

Complete the following steps to add an inline policy:

  1. On the IAM console, open the details page of the newly created role.
  2. On the Permissions tab, choose Add permissions and then Create inline policy.
  3. On the JSON tab, delete the default policy and enter the following inline policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageApplicationInferenceProfile",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateInferenceProfile",
                "bedrock:TagResource"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:application-inference-profile/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                },
                "ForAnyValue:StringEquals": {
                    "aws:TagKeys": [
                        "AmazonDataZoneProject"
                    ]
                },
                "Null": {
                    "aws:ResourceTag/AmazonDataZoneProject": "false",
                    "aws:RequestTag/AmazonDataZoneProject": "false"
                }
            }
        },
        {
            "Sid": "DeleteApplicationInferenceProfile",
            "Effect": "Allow",
            "Action": [
                "bedrock:DeleteInferenceProfile"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:application-inference-profile/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                },
                "Null": {
                    "aws:ResourceTag/AmazonDataZoneProject": "false"
                }
            }
        },
        {
            "Sid": "CreateApplicationInferenceProfileUsingFoundationModels",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateInferenceProfile"
            ],
            "Resource": [
                "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"
            ]
        },
        {
            "Sid": "CreateApplicationInferenceProfileUsingBedrockModels",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateInferenceProfile"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:inference-profile/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        }
    ]
}
  4. Choose Review policy.
  5. Name the policy (for example, BedrockModelManagementPolicy) and choose Create policy.

Explanation of the policies

In this section, we discuss the details of the policies.

Trust policy explanation

This trust policy defines who can assume the IAM role:

  • It allows the Amazon DataZone service to assume this role
  • The service can perform sts:AssumeRole and sts:SetContext actions
  • A condition restricts this to only when the source AWS account is your AWS account

This makes sure that only Amazon DataZone from your specific account can assume this role.

Inline policy explanation

This policy controls access to Amazon Bedrock inference profiles and models:

  • It allows creating and managing application inference profiles (bedrock:CreateInferenceProfile, bedrock:TagResource, bedrock:DeleteInferenceProfile)
  • It specifically permits creating inference profiles for only the anthropic.claude-3-5-haiku-20241022-v1:0 model (a sketch of such an API call follows this list)
  • Access is controlled through several conditions:
    • Resources must be in the same AWS account as the principal making the request
    • Operations are restricted based on the AmazonDataZoneProject tag
    • Creating profiles requires proper tagging with AmazonDataZoneProject
    • Deleting profiles is only allowed for resources with the appropriate AmazonDataZoneProject tag
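
To make this concrete, the following minimal sketch shows the kind of API call this policy permits: creating an application inference profile that copies from the approved model and carries the AmazonDataZoneProject tag. The profile name, project ID, and Region are illustrative assumptions, not values used by SageMaker Unified Studio itself.

import boto3

# Minimal sketch (illustrative names only): create an application inference
# profile for the approved Claude 3.5 Haiku foundation model and tag it with
# the AmazonDataZoneProject key required by the policy conditions above.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_inference_profile(
    inferenceProfileName="project-claude-3-5-haiku",  # placeholder profile name
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"
    },
    tags=[{"key": "AmazonDataZoneProject", "value": "example-project-id"}],  # placeholder project ID
)
print(response["inferenceProfileArn"])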

Combined effect on SageMaker Unified Studio domain project

Together, these policies create a secure, controlled environment for using Amazon Bedrock models in SageMaker Unified Studio domain projects through the following methods:

  • Limiting model access – Only the specified Anthropic Claude 3.5 Haiku model can be used
  • Enforcing proper tagging – Resources must be properly tagged with AmazonDataZoneProject identifiers
  • Maintaining account isolation – Resources must be in the same account as the principal
  • Implementing least privilege – Only specific actions (create, tag, delete) are permitted on inference profiles
  • Providing project-level isolation – Access is restricted to inference profiles tagged with appropriate project identifiers

This implementation follows the principle of least privilege by providing only the minimum permissions needed for project-specific use cases, while maintaining proper security boundaries between different projects and making sure that only approved FMs can be accessed.

Configure a SageMaker Unified Studio domain to use the roles

Complete the following steps to create a SageMaker Unified Studio domain and configure it to use the roles you created:

  1. On the SageMaker console, choose the appropriate Region.
  2. Choose Create a Unified Studio domain and then choose Quick setup.
  3. For Name, enter a meaningful domain name.
  4. Scroll down to Generative AI resources.
  5. Under Model provisioning role, choose the model management role created in the previous section.
  6. Under Model consumption role, select Use a single existing role for all models and choose the model consumption role created in the previous section.
  7. Complete the remaining steps according to your AWS IAM Identity Center configurations and create your domain.

Clean up

To avoid incurring future charges related to various services used in SageMaker Unified Studio, log out of the SageMaker Unified Studio domain and delete the domain in SageMaker Unified Studio.

Conclusion

In this post, we demonstrated how the SageMaker Unified Studio playground and SageMaker Unified Studio projects invoke large language models powered by Amazon Bedrock, and how enterprises can govern access to these models, whether you want to limit access to specific models or to every model from the service. You can combine the IAM policies shown in this post in the same IAM role to provide complete control. By following these guidelines, enterprises can make sure their use of generative AI models is both secure and aligned with organizational policies. This approach not only safeguards sensitive data but also empowers business analysts and data scientists to harness the full potential of AI within a controlled environment.

Now that your environment is configured with strong identity-based policies, we suggest reading the following posts to learn how Amazon SageMaker Unified Studio enables you to securely innovate quickly, and at scale, with generative AI:


About the authors

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers a broad range of ML use cases, primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Saptarshi Banarjee serves as a Senior Solutions Architect at AWS, collaborating closely with AWS Partners to design and architect mission-critical solutions. With a specialization in generative AI, AI/ML, serverless architecture, Next-Gen Developer Experience tools and cloud-based solutions, Saptarshi is dedicated to enhancing performance, innovation, scalability, and cost-efficiency for AWS Partners within the cloud ecosystem.

Jon Turdiev is a Senior Solutions Architect at Amazon Web Services, where he helps startup customers build well-architected products in the cloud. With over 20 years of experience creating innovative solutions in cybersecurity, AI/ML, healthcare, and Internet of Things (IoT), Jon brings deep technical expertise to his role. Previously, Jon founded Zehntec, a technology consulting company, and developed award-winning medical bedside terminals deployed in hospitals worldwide. Jon holds a Master’s degree in Computer Science and shares his knowledge through webinars, workshops, and as a judge at hackathons.

Lijan Kuniyil is a Senior Technical Account Manager at AWS. Lijan enjoys helping AWS enterprise customers build highly reliable and cost-effective systems with operational excellence. Lijan has over 25 years of experience in developing solutions for financial, healthcare and consulting companies.


Improve conversational AI response times for enterprise applications with the Amazon Bedrock streaming API and AWS AppSync

Improve conversational AI response times for enterprise applications with the Amazon Bedrock streaming API and AWS AppSync

Many enterprises are using large language models (LLMs) in Amazon Bedrock to gain insights from their internal data sources. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Organizations implementing conversational AI systems often face a common challenge: although their APIs can quickly find answers to targeted questions, more complex queries requiring reasoning-actioning (ReAct) logic can take substantial time to process, negatively impacting user experience. This issue is particularly pronounced in regulated industries where security requirements add additional complexity. For instance, a global financial services organization with over $1.5 trillion in assets under management encountered this exact challenge. Despite successfully implementing a conversational AI system that integrated with multiple LLMs and data sources, they needed a solution that could maintain their rigorous security protocols—including AWS services operating within virtual private cloud (VPC) environments and enterprise OAuth integration—while improving response times for complex queries.

AWS AppSync is a fully managed service that enables developers to build serverless GraphQL APIs with real-time capabilities. This post demonstrates how to combine AWS AppSync subscriptions with Amazon Bedrock streaming endpoints to deliver LLM responses incrementally. We provide an enterprise-grade implementation blueprint that helps organizations in regulated industries maintain security compliance while optimizing user experience through immediate real-time response delivery.

Solution overview

The solution discussed in this post uses AWS AppSync to start the asynchronous conversational workflow. An AWS Lambda function does the heavy lifting of interacting with the Amazon Bedrock streaming API. As the LLM produces tokens, they are streamed to the frontend using AWS AppSync mutations and subscriptions.

A reference implementation of the Lambda function and AWS AppSync API is provided in the sample code in this post. The following diagram illustrates the reference architecture. It provides a high-level overview of how the various AWS services are integrated to achieve the desired outcome.

Solution Architecture

Let’s traverse how a user’s request is handled in the solution, and how the user receives real-time responses from an LLM in Amazon Bedrock:

  1. When the user loads the UI application, the application subscribes to the GraphQL subscription onSendMessage(), which returns whether the WebSocket connection was successful.
  2. After the user enters a query, the application invokes a GraphQL query (getLLMResponse) and triggers the Data Source Lambda function.
  3. The Data Source Lambda function publishes an event to the Amazon Simple Notification Service (Amazon SNS) topic, and a 201 response is returned to the user, completing the synchronous flow.

These steps are better illustrated in the following sequence diagram.

Sequence Diagram 1

  4. The Orchestrator Lambda function gets triggered by a published SNS event and initiates the stream with the Amazon Bedrock API call InvokeModelWithResponseStream.
  5. Amazon Bedrock receives the user query, initiates the stream, and starts sending stream tokens back to the Lambda function.
  6. When the Orchestrator Lambda function receives a stream token from Amazon Bedrock, the function invokes the GraphQL mutation sendMessage.
  7. The mutation triggers the onSendMessage subscription containing the LLM partial response, and the UI prints those stream tokens as it receives them.

The following diagram illustrates these steps in more detail.

Sequence Diagram 2

In the following sections, we discuss the components that make up the solution in more detail.

Data and API design

The AppSync API GraphQL schema consists of query, subscription, and mutation operations.

The following code is the query operation:

input GetLlmResponseInput {
	sessionId: String!
	message: String!
	locale: String!
}
type Query {
	getLlmResponse(input: GetLlmResponseInput!): GetLlmResponse
		@aws_api_key
}

The query operation, getLLMResponse, is synchronous and accepts sessionId, locale, and the user-provided message.

The frontend must send a unique sessionId; this session ID uniquely identifies the user’s chat session. The session ID doesn’t change for the duration of an active conversation. For example, if the user reloads the frontend, a new sessionId is generated and sent to the query operation.

The frontend must also send locale, which indicates the user’s preferred language. For a list of supported locales, see Languages and locales supported by Amazon Lex V2. For example, we use en_US for North American English.

Finally, the user’s message (or query) is set in the message attribute. The value of the message attribute is passed to the LLM for analysis.

The following code is the subscription operation:

type Subscription {
	onSendMessage(sessionId: String!): SendMessageResponse
		@aws_subscribe(mutations: ["sendMessage"])
        @aws_api_key
}

The AWS AppSync subscription operation, onSendMessage, accepts sessionId as a parameter. The frontend calls the onSendMessage subscription operation to subscribe to a WebSocket connection using sessionId. This allows the frontend to receive messages from the AWS AppSync API whenever a mutation operation successfully executes for the given sessionId.

The following code is the mutation operation:

input SendMessageInput {
	sessionId: String!
	message: String!
	locale: String!
}
type Mutation {
	sendMessage(input: SendMessageInput!): SendMessageResponse
		@aws_api_key
        @aws_iam
}

The mutation operation, sendMessage, accepts a payload of type SendMessageInput. The caller must provide all required attributes in the SendMessageInput type, indicated by the exclamation point in the GraphQL schema excerpt, to successfully send a message to the frontend using the mutation operation.

The Orchestrator Lambda function calls the sendMessage mutation to send partially received LLM tokens to the frontend. We discuss the Orchestrator Lambda function in more detail later in this post.
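
The following is a minimal sketch of what such a mutation call could look like from Python, using IAM (SigV4) authorization, which the sendMessage mutation permits through its @aws_iam directive. The endpoint environment variable name and the response field selection are assumptions for illustration; refer to the GitHub repository for the actual implementation.

import json
import os
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Assumed environment variable holding the AWS AppSync GraphQL endpoint URL.
APPSYNC_ENDPOINT = os.environ["APPSYNC_GRAPHQL_ENDPOINT"]

# The field selection on SendMessageResponse below is illustrative.
SEND_MESSAGE = """
mutation SendMessage($input: SendMessageInput!) {
  sendMessage(input: $input) { sessionId message }
}
"""

def invoke_mutation(session_id, message, locale="en_US"):
    body = json.dumps({
        "query": SEND_MESSAGE,
        "variables": {"input": {"sessionId": session_id, "message": message, "locale": locale}},
    }).encode("utf-8")

    # Sign the request with the Lambda execution role's credentials (SigV4).
    session = boto3.Session()
    signed = AWSRequest(method="POST", url=APPSYNC_ENDPOINT, data=body,
                        headers={"Content-Type": "application/json"})
    SigV4Auth(session.get_credentials(), "appsync", session.region_name or "us-west-2").add_auth(signed)

    request = urllib.request.Request(APPSYNC_ENDPOINT, data=body,
                                     headers=dict(signed.headers), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())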

AWS AppSync Data Source Lambda function

AWS AppSync invokes the AWS AppSync Data Source Lambda function when the frontend calls the GraphQL query operation, getLLMResponse. The GraphQL query is a synchronous operation.

The implementation of the AWS AppSync Data Source Lambda function is in the following GitHub repo, called bedrock-appsync-ds-lambda. This Lambda function extracts the user’s message from the incoming GraphQL query operation and sends the value to the SNS topic. The Lambda function then returns a success status code to the caller, indicating that the message has been submitted to the backend for processing.
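
The following minimal sketch captures the essence of that handler, assuming a direct Lambda resolver (so the GraphQL arguments arrive under event["arguments"]["input"]) and an environment variable named TOPIC_ARN; the actual implementation in the repository may differ.

import json
import os

import boto3

sns = boto3.client("sns")

def handler(event, context):
    # With a direct Lambda resolver, AppSync passes the query arguments here.
    args = event["arguments"]["input"]

    # Hand the message off to the asynchronous flow via the SNS topic.
    sns.publish(
        TopicArn=os.environ["TOPIC_ARN"],
        Message=json.dumps({
            "sessionId": args["sessionId"],
            "message": args["message"],
            "locale": args["locale"],
        }),
    )

    # Return a success status immediately; the Orchestrator Lambda function
    # streams the actual LLM response back through the sendMessage mutation.
    return {"statusCode": 201, "message": "Request accepted for processing"}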

AWS AppSync Orchestrator Lambda function

The AWS AppSync Orchestrator Lambda function runs whenever an event is published to the SNS topic. This function initiates the Amazon Bedrock streaming API using the converse_stream Boto3 API call.

The following code snippet shows how the Orchestrator Lambda function receives the SNS event, processes it, and then calls the Boto3 API:

import boto3

# Create the Amazon Bedrock runtime client.
brt = boto3.client(service_name='bedrock-runtime', region_name="us-west-2")

# parsed_event holds the JSON-parsed SNS message published by the Data Source
# Lambda function; model_id comes from a Lambda environment variable.
messages = []
message = {
    "role": "user",
    "content": [{"text": parsed_event["message"]}]
}
messages.append(message)
response = brt.converse_stream(
    modelId=model_id,
    messages=messages
)

The code first instantiates the Boto3 client using the bedrock-runtime service name. The Lambda function receives the SNS event and parses it using the Python JSON library. The parsed contents are stored in the parsed_event dictionary. The code creates an Amazon Bedrock Messages API style prompt with role and content attributes:

message = {
    "role": "user",
    "content": [{"text": parsed_event["message"]}]
}

The content attribute’s value comes from the parsed_event["message"] attribute parsed from the SNS event. Refer to the converse_stream Boto3 API documentation for a list of role values.

The converse_stream API accepts modelId and messages parameters. The value of modelId comes from an environment variable set on the Lambda function. The messages parameter is a list of message dictionaries, and it must only contain Amazon Bedrock Messages API style prompts.

When the converse_stream API successfully runs, it returns an object that the Lambda code further analyzes to send partial tokens to the frontend:

# The converse_stream response exposes the token stream under the 'stream' key.
stream = response.get('stream')
if stream:
    self.appsync = AppSync(locale="en_US", session_id=session_id)
    # Signal the frontend to start rendering tokens.
    self.appsync.invoke_mutation(DEFAULT_STREAM_START_TOKEN)
    event_count = 0
    buffer = ""
    for event in stream:
        if event:
            if list(event)[0] == "contentBlockDelta":
                event_count += 1
                buffer += event["contentBlockDelta"]["delta"]["text"]
            # Flush the buffer after several chunks to limit mutation calls.
            if event_count > 5:
                self.appsync.invoke_mutation(buffer)
                event_count = 0
                buffer = ""
    # Send any remaining buffered tokens after the stream completes.
    if len(buffer) != 0:
        self.appsync.invoke_mutation(buffer)

Before the LLM starts returning tokens for the prompt it received, the Lambda function first sends DEFAULT_STREAM_START_TOKEN to the frontend using the AWS AppSync mutation operation. This token alerts the frontend to start rendering tokens. As the Lambda function receives chunks from the converse_stream API, it calls the AWS AppSync mutation operation, sending partial tokens to the frontend to render.

To improve the user experience and reduce network overhead, the Lambda function doesn’t invoke the AWS AppSync mutation operation for every chunk it receives from the Amazon Bedrock converse_stream API. Instead, the Lambda code buffers partial tokens and invokes the AWS AppSync mutation operation after receiving five chunks. This avoids the overhead of AWS AppSync network calls, thereby reducing latency and improving the user experience.

After the Lambda function has finished sending the tokens, it sends DEFAULT_STREAM_END_TOKEN:

self.appsync.invoke_mutation(DEFAULT_STREAM_END_TOKEN)

This token alerts the frontend that LLM streaming is complete.

For more details, refer to the GitHub repo. It contains a reference implementation of the Orchestrator Lambda function called bedrock-orchestrator-lambda.

Prerequisites

To deploy the solution, you must have the Terraform CLI installed in your environment. Complete all the steps in the Prerequisites section in the accompanying GitHub documentation.

Deploy the solution

Complete the following steps to deploy the solution:

  1. Open a command line terminal window.
  2. Change to the deployment folder.
  3. Edit the sample.tfvars file. Replace the variable values to match your AWS environment.
region = "us-west-2"
lambda_s3_source_bucket_name = "YOUR_DEPLOYMENT_BUCKET"
lambda_s3_source_bucket_key  = "PREFIX_WITHIN_THE_BUCKET"
  4. Run the following commands to deploy the solution:
$ terraform init
$ terraform apply -var-file="sample.tfvars"

Detailed deployment steps are in the Deploy the solution section in the accompanying GitHub repository.

Test the solution

To test the solution, use the provided sample web UI and run it inside VS Code. For more information, refer to the accompanying README documentation.

Clean up

Use the following command to remove the resources deployed in the previous sections from your AWS environment. You must use the same sample.tfvars file that you used to deploy the solution.

$ terraform destroy -var-file="sample.tfvars"

Conclusion

This post demonstrated how integrating an Amazon Bedrock streaming API with AWS AppSync subscriptions significantly enhances AI assistant responsiveness and user satisfaction. By implementing this streaming approach, the global financial services organization reduced initial response times for complex queries by approximately 75%—from 10 seconds to just 2–3 seconds—empowering users to view responses as they’re generated rather than waiting for complete answers. The business benefits are clear: reduced abandonment rates, improved user engagement, and a more responsive AI experience. Organizations can quickly implement this solution using the provided Lambda and Terraform code, quickly bringing these improvements to their own environments.

For even greater flexibility, AWS AppSync Events offers an alternative implementation pattern that can further enhance real-time capabilities using a fully managed WebSocket API. By addressing the fundamental tension between comprehensive AI responses and speed, this streaming approach enables organizations to maintain high-quality interactions while delivering the responsive experience modern users expect.


About the authors

Salman Moghal, a Principal Consultant at AWS Professional Services Canada, specializes in crafting secure generative AI solutions for enterprises. With extensive experience in full-stack development, he excels in transforming complex technical challenges into practical business outcomes across banking, finance, and insurance sectors. In his downtime, he enjoys racquet sports and practicing Funakoshi Genshin’s teachings at his martial arts dojo.

Philippe Duplessis-Guindon is a cloud consultant at AWS, where he has worked on a wide range of generative AI projects. He has touched on most aspects of these projects, from infrastructure and DevOps to software development and AI/ML. After earning his bachelor’s degree in software engineering and a master’s in computer vision and machine learning from Polytechnique Montreal, Philippe joined AWS to put his expertise to work for customers. When he’s not at work, you’re likely to find Philippe outdoors—either rock climbing or going for a run.


Scale generative AI use cases, Part 1: Multi-tenant hub and spoke architecture using AWS Transit Gateway

Scale generative AI use cases, Part 1: Multi-tenant hub and spoke architecture using AWS Transit Gateway

Generative AI continues to reshape how businesses approach innovation and problem-solving. Customers are moving from experimentation to scaling generative AI use cases across their organizations, with more businesses fully integrating these technologies into their core processes. This evolution spans across lines of business (LOBs), teams, and software as a service (SaaS) providers. Although many AWS customers typically started with a single AWS account for running generative AI proof of concept use cases, the growing adoption and transition to production environments have introduced new challenges.

These challenges include effectively managing and scaling implementations, as well as abstracting and reusing common concerns such as multi-tenancy, isolation, authentication, authorization, secure networking, rate limiting, and caching. To address these challenges effectively, a multi-account architecture proves beneficial, particularly for SaaS providers serving multiple enterprise customers, large enterprises with distinct divisions, and organizations with strict compliance requirements. This multi-account approach helps maintain a well-architected system by providing better organization, security, and scalability for your AWS environment. It also enables you to more efficiently manage these common concerns across your expanding generative AI implementations.

In this two-part series, we discuss a hub and spoke architecture pattern for building a multi-tenant and multi-account architecture. This pattern supports abstractions for shared services across use cases and teams, helping create secure, scalable, and reliable generative AI systems. In Part 1, we present a centralized hub for generative AI service abstractions and tenant-specific spokes, using AWS Transit Gateway for cross-account interoperability. The hub account serves as the entry point for end-user requests, centralizing shared functions such as authentication, authorization, model access, and routing decisions. This approach alleviates the need to implement these functions separately in each spoke account. Where applicable, we use virtual private cloud (VPC) endpoints for accessing AWS services.

In Part 2, we discuss a variation of this architecture using AWS PrivateLink to securely share the centralized endpoint in the hub account to teams within your organization or with external partners.

The focus in both posts is on centralizing authentication, authorization, model access, and multi-account secure networking for onboarding and scaling generative AI use cases with Amazon Bedrock. We don’t discuss other system capabilities such as prompt catalog, prompt caching, versioning, model registry, and cost. However, those could be extensions of this architecture.

Solution overview

Our solution implements a hub and spoke pattern that provides a secure, scalable system for managing generative AI implementations across multiple accounts. At its core, the architecture consists of a centralized hub account that serves as the entry point for requests, complemented by spoke accounts that contain tenant-specific resources. The following diagram illustrates this architecture.

Architecture Diagram

The hub account serves as the centralized account that provides common services across tenants and serves as the entry point for end-user requests. It centralizes shared functions such as authentication, authorization, and routing decisions, alleviating the need to implement these functions separately for each tenant. The hub account is operated and maintained by a core engineering team.

The hub infrastructure includes public and private VPCs, an internet-facing Application Load Balancer (ALB), Amazon Cognito for authentication, and necessary VPC endpoints for AWS services.

The spoke accounts contain tenant-specific resources, such as AWS Identity and Access Management (IAM) role permissions and Amazon Bedrock resources. Spoke accounts can be managed by either the core engineering team or the tenant, depending on organizational needs.

Each spoke account maintains its own private VPC, VPC interface endpoints for Amazon Bedrock, specific IAM roles and permissions, and account-level controls. These components are connected through Transit Gateway, which provides secure cross-account networking and manages traffic flow between hub and spoke VPCs. The flow of requests through the system as shown in the preceding architecture includes the following steps:

  1. A user (representing Tenant 1, 2, or N) accesses the client application.
  2. The client application in the hub account’s public subnet authenticates the user and receives an ID/JWT token. In our example, we use an Amazon Cognito user pool as the identity provider (IdP).
  3. The client application uses custom attributes in the JWT token to determine the corresponding route in the ALB. The ALB, based on the context path, routes the request to the tenant’s AWS Lambda function target group.
  4. The tenant-specific Lambda function in the hub account’s private subnet is invoked.
  5. The function assumes a cross-account role in the tenant’s account. The function invokes Amazon Bedrock in the spoke account by referring to the Regional DNS name of the Amazon Bedrock VPC endpoint (see the sketch after this list). The model is invoked and the result is sent back to the user.
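
The following is a minimal Python sketch of step 5 (not the repository’s exact code), with illustrative parameter names; the cross-account role ARN, VPC endpoint DNS name, and model ID come from the tenant mapping stored in DynamoDB, described later in this post.

import boto3

def invoke_in_spoke(role_arn, vpce_dns_name, model_id, prompt, region="us-east-1"):
    # Assume the cross-account role in the tenant's spoke account.
    credentials = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="tenant-invocation"
    )["Credentials"]

    # Create a Bedrock runtime client that resolves to the spoke account's
    # VPC endpoint instead of the public service endpoint.
    bedrock = boto3.client(
        "bedrock-runtime",
        region_name=region,
        endpoint_url=f"https://{vpce_dns_name}",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )

    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]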

This architecture makes sure that requests flow through a central entry point while maintaining tenant isolation. By invoking Amazon Bedrock in the spoke account, each request inherits that account’s limits, access control, cost assignments, service control policies (SCPs), and other account-level controls.

The sample code for this solution is separated into two sections. The first section shows the solution for a single hub and spoke account. The second section extends the solution by deploying another spoke account. Detailed instructions for each step are provided in the repository README. In the following sections, we provide an outline of the deployment steps.

Prerequisites

We assume you are familiar with the fundamentals of AWS networking, including Amazon Virtual Private Cloud (Amazon VPC) and VPC constructs like route tables and VPC interconnectivity options. We assume you are also familiar with multi-tenant architectures and their core principles of serving multiple tenants from a shared infrastructure while maintaining isolation.

To implement the solution, you must have the following prerequisites:

  • Hub and spoke accounts (required):
    • Two AWS accounts: one hub account and one spoke account
    • Access to the amazon.titan-text-lite-v1 model in the spoke account
  • Additional spoke account (optional):
    • A third AWS account (spoke account for a second tenant)
    • Access to the anthropic.claude-3-haiku-20240307-v1:0 model in the second spoke account

Design considerations

The implementation of this architecture involves several important design choices that affect how the solution operates, scales, and can be maintained. In this section, we explore these considerations across different components, explaining the rationale behind each choice and potential alternatives where applicable.

Lambda functions

In our design, we have the ALB target group as Lambda functions running in the hub account instead of the spoke account. This allows for centralized management of business logic and centralized logging and monitoring. As the architecture evolves to include shared functionality such as prompt caching, semantic routing, or using large language model (LLM) proxies (middleware services that provide unified access to multiple models while handling rate limiting and request routing, as discussed in Part 2), implementing these features in the hub account provides consistency across tenants. We chose Lambda functions to implement the token validation and routing logic, but you can use other compute options such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS) depending on your organization’s preferences.

We use a 1-to-1 mapping of Lambda functions to tenants. Even though the current logic in each function is similar, having a dedicated function for each tenant can minimize noisy neighbor issues and allow tenant tier-specific configurations such as memory and concurrency.

VPC endpoints

In this solution, we use dedicated Amazon Bedrock runtime VPC endpoints in the spoke accounts. Dedicated VPC endpoints for each spoke account are suited for organizations where operators of the spoke account manage the tenant features, such as allowing access to models, setting up knowledge bases, and guardrails. Depending on your organization’s policies, a different variation of this architecture can be achieved by using a centralized Amazon Bedrock runtime VPC endpoint in the hub account (as shown in Part 2). Centralized VPC endpoints are suited for organizations where a central engineering team manages the features for the tenants.

Other factors such as costs, access control, and endpoint quotas need to be considered when choosing a centralized or dedicated approach for the location of the Amazon Bedrock VPC endpoints. VPC endpoint policies with the centralized approach might run into the 20,480-character limit as the number of tenants grows. There are hourly fees for VPC endpoints and transit gateway attachments provisioned regardless of usage. If VPC endpoints are provisioned in spoke accounts, each tenant will incur additional hourly fees.

Client application

For demonstration purposes, the client application in this solution is deployed in the public subnet in the hub VPC. The application can be deployed in an account outside of either the hub or spoke accounts, or deployed at the edge as a single-page application using Amazon CloudFront and Amazon Simple Storage Service (Amazon S3).

Tenancy

Enterprises use various tenancy models when scaling generative AI, each with distinct advantages and disadvantages. Our solution implements a silo model, assigning each tenant to a dedicated spoke account. For smaller organizations with fewer tenants and less stringent isolation requirements, an alternative approach using a pooled model (multiple tenants per spoke account) might be more appropriate unless they plan to scale significantly in the future or have specific compliance requirements. For more information on multi-tenancy design, see Let’s Architect! Designing architectures for multi-tenancy. Cell-based architectures for multi-tenant applications can provide benefits such as fault isolation and scaling. See Reducing the Scope of Impact with Cell-Based Architecture for more information.

Frontend gateway

In this solution, we chose ALB as the entry point for requests. ALB offers several advantages for our generative AI use case:

  • Long-running connections – ALB supports connections up to 4,000 seconds, which is beneficial for LLM responses that might take longer than 30 seconds to complete
  • Scalability – ALB can handle a high volume of concurrent connections, making it suitable for enterprise-scale deployments
  • Integration with AWS WAF – ALB seamlessly integrates with AWS WAF, providing enhanced security and protection against common web exploits

Amazon API Gateway is an alternative option when API versioning, usage plans, or granular API management capabilities are required, and when expected message sizes and response times align with its quotas. AWS AppSync is another option suitable when exposing the LLMs through a GraphQL interface.

Choose the gateway that best serves your customers. ALB handles high-volume, long-running connections efficiently. API Gateway provides comprehensive REST API management. AWS AppSync delivers real-time GraphQL capabilities. Evaluate each based on your application’s response time requirements, API needs, scale demands, and specific use case.

Although this post demonstrates connectivity using HTTP for simplicity, this is not recommended for production use. Production deployments should always implement HTTPS with proper SSL/TLS certificates to maintain secure communication.

IP addressing

The AWS CloudFormation template to deploy solution resources uses example CIDRs. When deploying this architecture in a second spoke account, use unique IP addresses that don’t overlap with your existing environments. Transit Gateway operates at Layer 3 and requires distinct IP spaces to route traffic between VPCs.

Deploy a hub and spoke account

In this section, we set up the local AWS Command Line Interface (AWS CLI) environment to deploy this solution in two AWS accounts. Detailed instructions are provided in the repository README.

  1. Deploy a CloudFormation stack in the hub account, and another stack in the spoke account.
  2. Configure connectivity between the hub and spoke VPCs using Transit Gateway attachments.
  3. Create an Amazon Cognito user with tenant1 as the value for a custom user attribute, tenant_id.
  4. Create an item in an Amazon DynamoDB table that maps the tenant ID to model access and routing information specific to a tenant, tenant1 in our case.

The following screenshots show the custom attribute value tenant1 for the user, and the item in the DynamoDB table that maps spoke account details for tenant1.

Tenant user attributes

Tenant mappings

Validate connectivity

In this section, we validate connectivity from a test application in the hub account to the Amazon Bedrock model in the spoke account. We do so by sending a curl request from an EC2 instance (representing our client application) to the ALB. Both the EC2 instance and the ALB are located in the public subnet of the hub account. The request and response are then routed through Transit Gateway attachments between the hub and spoke VPCs. The following screenshot shows the execution of a utility script on your local workstation that authenticates a user and exports the necessary variables. These variables will be used to construct the curl request on the EC2 instance.

Tenant user attributes

The following screenshot shows the curl request being executed from the EC2 instance to the ALB. The response confirms that the request was successfully processed and served by the amazon.titan-text-lite-v1 model, which is the model mapped to this user (tenant1). The model is hosted in the spoke account.

Tenant1 validation

Deploy a second spoke account

In this section, we extend the deployment to include a second spoke account for an additional tenant. We validate the multi-tenant connectivity by sending another curl request from the same EC2 instance to the ALB in the hub account. Detailed instructions are provided in the repository README.

The following screenshot shows the response to this request, demonstrating that the system correctly identifies and routes requests based on tenant information. In this case, the user’s tenant_id attribute value is tenant2, and the request is successfully routed to the anthropic.claude-3-haiku-20240307-v1:0 model, which is mapped to tenant2 in the second spoke account.

Tenant2 validation

Clean up

To clean up your resources, complete the following steps:

  1. If you created the optional resources for a second spoke account, delete them:
    1. Change the directory to genai-secure-patterns/hub-spoke-transit-gateway/scripts/optional
    2. Run the cleanup script ./cleanupOptionalStack.sh.
  2. Clean up the main stack:
    1. Change the directory to genai-secure-patterns/hub-spoke-transit-gateway/scripts/
    2. Run the cleanup script ./cleanup.sh.

Conclusion

As organizations increasingly adopt and scale generative AI use cases across different teams and LOBs, there is a growing need for secure, scalable, and reliable multi-tenant architectures. This two-part series addresses this need by providing guidance on implementing a hub and spoke architecture pattern. By adopting such well-architected practices from the outset, you can build scalable and robust solutions that unlock the full potential of generative AI across your organization.

In this post, we covered how to set up a centralized hub account hosting shared services like authentication, authorization, and networking using Transit Gateway. We also demonstrated how to configure spoke accounts to host tenant-specific resources like Amazon Bedrock. Try out the provided code samples to see this architecture in action.

Part 2 will explore an alternative implementation using PrivateLink to interconnect the VPCs in the hub and spoke accounts.


About the Authors

Nikhil Penmetsa is a Senior Solutions Architect at AWS. He helps organizations understand best practices around advanced cloud-based solutions. He is passionate about diving deep with customers to create solutions that are cost-effective, secure, and performant. Away from the office, you can often find him putting in miles on his road bike or hitting the open road on his motorbike.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle!


Accelerate AI development with Amazon Bedrock API keys

Accelerate AI development with Amazon Bedrock API keys

Today, we’re excited to announce a significant improvement to the developer experience of Amazon Bedrock: API keys. API keys provide quick access to the Amazon Bedrock APIs, streamlining the authentication process so that developers can focus on building rather than configuration.

CamelAI is an open-source, modular framework for building intelligent multi-agent systems for data generation, world simulation, and task automation.

“As a startup with limited resources, streamlined customer onboarding is critical to our success. The Amazon Bedrock API keys enable us to onboard enterprise customers in minutes rather than hours. With Bedrock, our customers can quickly provision access to leading AI models and seamlessly integrate them into CamelAI,”

said Miguel Salinas, CTO, CamelAI.

In this post, we explore how API keys work and how you can start using them today.

API key authentication

Amazon Bedrock now provides API key access to streamline integration with tools and frameworks that expect API key-based authentication. The Amazon Bedrock and Amazon Bedrock runtime SDKs support API key authentication for methods including on-demand inference, provisioned throughput inference, model fine-tuning, distillation, and evaluation.

The diagram compares the default authentication process to Amazon Bedrock (in orange) with the API keys approach (in blue). In the default process, you must create an identity in AWS IAM Identity Center or IAM, attach IAM policies to provide permissions to perform API operations, and generate credentials, which you can then use to make API calls. The grey boxes in the diagram highlight the steps that Amazon Bedrock now streamlines when generating an API key. Developers can now authenticate and access Amazon Bedrock APIs with minimal setup overhead.

You can generate API keys in the Amazon Bedrock console, choosing between two types.

With long-term API keys, you can set expiration times ranging from 1 day to no expiration. These keys are associated with an IAM user that Amazon Bedrock automatically creates for you. The system attaches the AmazonBedrockLimitedAccess managed policy to this IAM user, and you can then modify permissions as needed through the IAM service. We recommend using long-term keys primarily for exploration of Amazon Bedrock.

Short-term API keys use the IAM permissions from your current IAM principal and expire when your account’s session ends or can last up to 12 hours. Short-term API keys use AWS Signature Version 4 for authentication. For continuous application use, you can implement API key refreshing with a script as shown in this example. We recommend that you use short-term API keys for setups that require a higher level of security.

Making your first API call

Once you have access to foundation models, getting started with Amazon Bedrock API keys is straightforward. Here’s how to make your first API call using the AWS SDK for Python (Boto3 SDK) and API keys.

Generate an API key

To generate an API key, follow these steps:

  1. Sign in to the AWS Management Console and open the Amazon Bedrock console
  2. In the left navigation panel, select API keys
  3. Choose either Generate short-term API key or Generate long-term API key
  4. For long-term keys, set your desired expiration time and optionally configure advanced permissions
  5. Choose Generate and copy your API key

Set your API key as an environment variable

You can set your API key as an environment variable so that it’s automatically recognized when you make API requests:

# To set the API key as an environment variable, you can open a terminal and run the following command:
export AWS_BEARER_TOKEN_BEDROCK=${api-key}

The Boto3 SDK automatically detects your environment variable when you create an Amazon Bedrock client.

Make your first API call

You can now make API calls to Amazon Bedrock in multiple ways:

  1. Using curl
    curl -X POST "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
      -d '{
        "messages": [
            {
                "role": "user",
                "content": [{"text": "Hello"}]
            }
        ]
      }'

  2. Using the Amazon Bedrock SDK:
    import boto3
    
    # Create an Amazon Bedrock client
    client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"     # If you've configured a default region, you can omit this line
    ) 
    
    # Define the model and message
    model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
    messages = [{"role": "user", "content": [{"text": "Hello"}]}]
       
    response = client.converse(
        modelId=model_id,
        messages=messages,
    )
    
    # Print the response
    print(response['output']['message']['content'][0]['text'])

  3. You can also use native libraries like Python Requests:
    import requests
    import os
    
    url = "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse"
    
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [{"text": "Hello"}]
            }
        ]
    }
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['AWS_BEARER_TOKEN_BEDROCK']}"
    }
    
    response = requests.request("POST", url, json=payload, headers=headers)
    
    print(response.text)

Bridging developer experience and enterprise security requirements

Enterprise administrators can now streamline their user onboarding to Amazon Bedrock foundation models. With setups that require a higher level of security, administrators can enable short-term API keys for their users. Short-term API keys use AWS Signature Version 4 and existing IAM principals, maintaining established access controls implemented by administrators.

For audit and compliance purposes, all API calls are logged in AWS CloudTrail. API keys are passed as authorization headers to API requests and aren’t logged.

Conclusion

Amazon Bedrock API keys are available in 20 AWS Regions where Amazon Bedrock is available: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Hyderabad, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Milan, Paris, Spain, Stockholm, Zurich), and South America (São Paulo). To learn more about API keys in Amazon Bedrock, visit the API Keys documentation in the Amazon Bedrock user guide.

Give API keys a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.


About the Authors

Sofian Hamiti is a technology leader with over 10 years of experience building AI solutions, and leading high-performing teams to maximize customer outcomes. He is passionate in empowering diverse talent to drive global impact and achieve their career aspirations.

Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and go-to-market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.

Nakul Vankadari Ramesh is a Software Development Engineer with over 7 years of experience building large-scale distributed systems. He currently works on the Amazon Bedrock team, helping accelerate the development of generative AI capabilities. Previously, he contributed to Amazon Managed Blockchain, focusing on scalable and reliable infrastructure.

Huong Nguyen is a Principal Product Manager at AWS. She is a product leader at Amazon Bedrock, with 18 years of experience building customer-centric and data-driven products. She is passionate about democratizing responsible machine learning and generative AI to enable customer experience and business innovation. Outside of work, she enjoys spending time with family and friends, listening to audiobooks, traveling, and gardening.

Massimiliano Angelino is Lead Architect for the EMEA Prototyping team. During the last 3 and half years he has been an IoT Specialist Solution Architect with a particular focus on edge computing, and he contributed to the launch of AWS IoT Greengrass v2 service and its integration with Amazon SageMaker Edge Manager. Based in Stockholm, he enjoys skating on frozen lakes.


Carnegie Mellon University at ICML 2025

Carnegie Mellon University at ICML 2025

CMU researchers are presenting 127 papers at the Forty-Second International Conference on Machine Learning (ICML 2025), held from July 13th-19th at the Vancouver Convention Center. Here is a quick overview of the areas our researchers are working on:

Here are our most frequent collaborator institutions:

Oral Papers

Expected Variational Inequalities

Authors: Brian Zhang, Ioannis Anagnostides, Emanuel Tewolde, Ratip Emin Berker, Gabriele Farina, Vincent Conitzer, Tuomas Sandholm

This paper introduces expected variational inequalities (EVIs), a relaxed version of variational inequalities (VIs) where the goal is to find a distribution that satisfies the VI condition in expectation. While VIs are generally hard to solve, the authors show that EVIs can be solved efficiently, even under challenging, non-monotone conditions, by leveraging ideas from game theory. EVIs generalize the concept of correlated equilibria and unify various results across smooth games, constrained games, and settings with non-concave utilities, making them broadly applicable beyond traditional game-theoretic contexts.
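
For orientation only, here is one way to write the contrast in symbols; the notation is ours and may differ from the paper's exact formulation.

```latex
% Classical (Stampacchia) variational inequality over a convex compact set
% \mathcal{X} with operator F, and one natural "in expectation" relaxation in
% the spirit of the EVI summary above (our notation; the paper's formal
% definition may differ).
\[
\text{VI:}\qquad \exists\, x^{\ast} \in \mathcal{X}\ \text{s.t.}\quad
\langle F(x^{\ast}),\, x - x^{\ast} \rangle \ \ge\ 0
\qquad \forall\, x \in \mathcal{X},
\]
\[
\text{EVI:}\qquad \exists\, \sigma \in \Delta(\mathcal{X})\ \text{s.t.}\quad
\mathbb{E}_{x \sim \sigma}\!\left[\,\langle F(x),\, x' - x \rangle\,\right] \ \ge\ 0
\qquad \forall\, x' \in \mathcal{X}.
\]
```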

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

This paper shows that voting-based benchmarks for evaluating LLMs (such as Chatbot Arena) can be vulnerable to adversarial manipulation if proper defenses aren’t in place. The authors show that an attacker can identify which model generated a response and then strategically vote to boost or demote specific models, altering the leaderboard with only around a thousand votes in a simulated environment. They collaborate with Chatbot Arena’s developers to propose and implement security measures such as reCAPTCHA and login requirements that significantly raise the cost of such attacks and enhance the platform’s robustness.

High-Dimensional Prediction for Sequential Decision Making

Authors: Georgy Noarov, Ramya Ramalingam, Aaron Roth, Stephan Xie

This paper presents a new algorithmic framework for making reliable, multi-dimensional forecasts in adversarial, nonstationary environments. Unlike existing online learning methods, this approach offers simultaneous performance guarantees for many agents, even when they face different objectives, act over large action spaces, or care about specific conditions (e.g. weather or route choice). The algorithm ensures low bias across many conditional events and enables each agent to achieve strong guarantees like diminishing regret. Applications include efficient solutions for online combinatorial optimization and multicalibration.

LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Authors: Parshin Shojaee, Ngoc Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa Doan, Chandan Reddy

This paper introduces LLM-SRBench, a new benchmark designed to rigorously evaluate the ability of LLMs to discover scientific equations (rather than merely recall them from training data). Existing tests often rely on well-known equations, making it hard to tell whether models are truly reasoning or just memorizing. LLM-SRBench addresses this by including 239 challenging problems across four scientific domains, split into two categories: one that disguises familiar physics equations (LSR-Transform) and another that features fully synthetic, reasoning-driven tasks (LSR-Synth). Evaluations show that even the best current models only achieve 31.5% accuracy, highlighting the difficulty of the task and establishing LLM-SRBench as a valuable tool for driving progress in LLM-based scientific discovery.

On Differential Privacy for Adaptively Solving Search Problems via Sketching

Authors: Shiyuan Feng, Ying Feng, George Li, Zhao Song, David Woodruff, Lichen Zhang

This paper explores how to use differential privacy to protect against information leakage in adaptive search queries, a harder problem than traditional private estimation tasks. Unlike prior work that only returns numerical summaries (e.g., cost), the authors design algorithms that return actual solutions, like nearest neighbors or regression vectors, even when the inputs or queries change over time. They show how key problem parameters (like the number of approximate near neighbors or condition number of the data matrix) affect the performance of these private algorithms. This work has practical implications for AI systems that rely on private database searches or real-time regression, enabling them to provide useful results while safeguarding sensitive information from attackers.

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Authors: Vaishnavh Nagarajan, Chen Wu, Charles Ding, Aditi Raghunathan

This paper proposes a set of simple, abstract tasks designed to probe the creative limits of today’s language models in a controlled and measurable way. These tasks mimic real-world open-ended challenges like generating analogies or designing puzzles, where success requires discovering new connections or constructing novel patterns. The authors show that standard next-token prediction tends to be short-sighted and overly reliant on memorization, while alternative approaches like teacherless training and diffusion models produce more diverse, original outputs. They also introduce a technique called seed-conditioning, which adds randomness at the input rather than the output and can improve coherence without sacrificing creativity.

Training a Generally Curious Agent

Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Rahman, Zico Kolter, Jeff Schneider, Russ Salakhutdinov

This paper introduces Paprika, a fine-tuning method that equips language models with general decision-making and exploration strategies, enabling them to adapt to new tasks through interaction alone (i.e. without further training). Paprika trains models on synthetic environments requiring different exploration behaviors, encouraging them to learn flexible strategies rather than memorizing solutions. To improve efficiency, it uses a curriculum learning-based approach that prioritizes tasks with high learning value, making the most of limited interaction data. Models trained with Paprika show strong transfer to completely new tasks, suggesting a promising direction for building AI agents that can learn to solve unfamiliar, sequential problems with minimal supervision.

Spotlight Papers

GMAIL: Generative Modality Alignment for generated Image Learning

Authors: Shentong Mo, Sukmin Yun

Generative models can create realistic images that could help train machine learning models, but using them as if they were real images can lead to problems because of differences between the two. This paper introduces a method called GMAIL that treats real and generated images as separate types (or modalities) and aligns them in a shared latent space during training, rather than just mixing them at the pixel level. The approach fine-tunes models on generated data using a special loss to bridge the gap, then uses these aligned models to improve training on tasks like image captioning and retrieval. The results show that GMAIL improves performance on several vision-language tasks and scales well as more generated data is added.

LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Authors: Paul McVay, Sergio Arnaud, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mahmoud Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

LOCATE 3D is a model that can find specific objects in 3D scenes based on natural language descriptions (like “the small coffee table between the sofa and the lamp”). It achieves state-of-the-art performance on standard benchmarks and works well in real-world settings, like on robots or AR devices, by using RGB-D sensor data. A key component is 3D-JEPA, a new self-supervised learning method that uses features from 2D vision models (like CLIP or DINO) to understand 3D point clouds through masked prediction tasks. The model is trained on a newly introduced large dataset (130K+ examples), helping it generalize better across different environments.

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

This paper introduces MAETok, a masked autoencoder designed to create a high-quality, semantically meaningful latent space for diffusion models. The authors show that having a well-structured latent space, meaning fewer Gaussian modes and more discriminative features, leads to better image generation without needing complex variational autoencoders. MAETok outperforms existing methods on ImageNet using just 128 tokens, and it’s also much faster: 76× quicker to train and 31× faster during inference. The key takeaway is that the structure of the latent space, not variational constraints, is what truly matters for high-quality diffusion-based generation.
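
To make the masked-autoencoder mechanism concrete, here is a generic, self-contained masked-reconstruction training step over token embeddings. It is a toy sketch of plain masked autoencoding, not MAETok itself, and all dimensions are arbitrary assumptions.

```python
# Generic masked-autoencoding step (illustrative only; not MAETok): replace a
# random subset of token embeddings with a learned mask token, encode the
# corrupted sequence, and train a decoder to reconstruct the masked positions.
import torch
import torch.nn as nn

num_tokens, dim, mask_ratio = 128, 64, 0.5

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
decoder = nn.Linear(dim, dim)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

x = torch.randn(8, num_tokens, dim)              # batch of token embeddings
mask = torch.rand(8, num_tokens) < mask_ratio    # True where tokens are hidden

corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
latent = encoder(corrupted)
recon = decoder(latent)

loss = ((recon - x) ** 2)[mask].mean()           # reconstruct only masked tokens
loss.backward()
print(float(loss))
```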

Position: In-House Evaluation Is Not Enough. Towards Robust Third-Party Evaluation and Flaw Disclosure for General-Purpose AI

Authors: Shayne Longpre, Kevin Klyman, Ruth Elisabeth Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark Jaycox, Markus Anderljung, Nadine Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan

This paper highlights the lack of robust systems for identifying and reporting flaws in general-purpose AI (GPAI), especially compared to mature fields like software security. The authors propose three key solutions: (1) standardized reporting formats and engagement rules to streamline flaw reporting and triaging, (2) formal disclosure programs with legal protections for researchers (similar to bug bounties), and (3) better infrastructure for distributing flaw reports to relevant stakeholders. These steps aim to address growing risks like jailbreaks and cross-system vulnerabilities, ultimately improving the safety and accountability of GPAI systems.

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

This paper explores how to best scale test-time compute for large language models (LLMs), comparing two strategies: (1) distilling search traces (verifier-free, or VF) and (2) using verifiers or rewards to guide learning (verifier-based, or VB). The authors show—both theoretically and through experiments—that VB methods significantly outperform VF ones when working with limited compute or data. They explain that this performance gap grows as models and tasks get more complex, especially when solution paths vary in style or quality. Ultimately, the paper argues that verification is essential for effectively scaling LLM performance, especially for reasoning tasks.

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen

As long-context LLMs become more common, their growing memory demands during inference slow down performance, especially due to the expanding key-value (KV) cache. This paper introduces ShadowKV, a system that significantly improves throughput by compressing the key cache using low-rank representations and offloading the value cache without major latency costs. It reconstructs only the necessary KV pairs during decoding to maintain speed and accuracy. Experiments show ShadowKV supports much larger batch sizes (up to 6×) and improves throughput by over 3× on standard hardware, all while preserving model quality across several LLMs and benchmarks.
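
As a toy illustration of the general low-rank idea (not ShadowKV's actual system), the snippet below factorizes an approximately low-rank key matrix with a truncated SVD, compares the memory footprint, and reconstructs only the rows needed at decode time; the shapes, rank, and synthetic data are assumptions.

```python
# Toy sketch of low-rank key-cache compression (illustrative only; not the
# ShadowKV implementation). We build a synthetic, approximately low-rank key
# matrix, keep a truncated SVD instead of the full matrix, and reconstruct
# only the positions needed during decoding.
import numpy as np

seq_len, head_dim, rank = 4096, 128, 16
rng = np.random.default_rng(0)

# Synthetic key cache with approximately low-rank structure plus noise.
K = (rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, head_dim))
     + 0.01 * rng.standard_normal((seq_len, head_dim))).astype(np.float32)

# Compress: store only the top-`rank` singular triplets.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank, :]

full_elems = K.size
lowrank_elems = U_r.size + S_r.size + Vt_r.size
print(f"full cache: {full_elems} floats, low-rank factors: {lowrank_elems} floats "
      f"({lowrank_elems / full_elems:.1%} of original)")

# Decode time: rebuild only the key rows actually needed for attention.
needed = np.array([0, 17, 4095])
K_rows = (U_r[needed] * S_r) @ Vt_r
print("max reconstruction error on needed rows:",
      float(np.abs(K_rows - K[needed]).max()))
```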

Poster Papers

Accountability, Transparency, And Interpretability

Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Authors: Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang

Validating Mechanistic Interpretations: An Axiomatic Approach

Authors: Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina Pasareanu, Somesh Jha

Active Learning And Interactive Learning

Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect

Authors: Ojash Neopane, Aaditya Ramdas, Aarti Singh

Applications

Agent Workflow Memory

Authors: Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zheng Hui

Causality

A Sample Efficient Conditional Independence Test in the Presence of Discretization

Authors: Boyang Sun, Yu Yao, Xinshuai Dong, Zongfang Liu, Tongliang Liu, Yumou Qiu, Kun Zhang

Extracting Rare Dependence Patterns via Adaptive Sample Reweighting

Authors: Yiqing Li, Yewei Xia, Xiaofei Wang, Zhengming Chen, Liuhua Peng, Mingming Gong, Kun Zhang

Isolated Causal Effects of Natural Language

Authors: Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael

Latent Variable Causal Discovery under Selection Bias

Authors: Haoyue Dai, Yiwen Qiu, Ignavier Ng, Xinshuai Dong, Peter Spirtes, Kun Zhang

Permutation-based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data

Authors: Xinshuai Dong, Ignavier Ng, Boyang Sun, Haoyue Dai, Guangyuan Hao, Shunxing Fan, Peter Spirtes, Yumou Qiu, Kun Zhang

Chemistry, Physics, And Earth Sciences

Multi-Timescale Dynamics Model Bayesian Optimization for Plasma Stabilization in Tokamaks

Authors: Rohit Sonker, Alexandre Capone, Andrew Rothstein, Hiro Kaga, Egemen Kolemen, Jeff Schneider

OmniArch: Building Foundation Model for Scientific Computing

Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Rongye Shi, Shanghang Zhang, Jianxin Li

PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design

Authors: Zhenqiao Song, Tianxiao Li, Lei Li, Martin Min

Computer Vision

David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

Authors: Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng

From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

Authors: Ang Cao, Sergio Arnaud, Oleksandr Maksymets, Jianing Yang, Ayush Jain, Ada Martin, Vincent-Pierre Berges, Paul McVay, Ruslan Partsey, Aravind Rajeswaran, Franziska Meier, Justin Johnson, Jeong Joon Park, Alexander Sax

GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

Authors: Shiming Chen, Dingjie Fu, Salman Khan, Fahad Khan

Understanding Complexity in VideoQA via Visual Program Generation

Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rareș Ambruș, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov

Unifying 2D and 3D Vision-Language Understanding

Authors: Ayush Jain, Alexander Swerdlow, Yuzhou Wang, Sergio Arnaud, Ada Martin, Alexander Sax, Franziska Meier, Katerina Fragkiadaki

Deep Learning

Towards characterizing the value of edge embeddings in Graph Neural Networks

Authors: Dhruv Rohatgi, Tanya Marwah, Zachary Lipton, Jianfeng Lu, Ankur Moitra, Andrej Risteski

Discrete And Combinatorial Optimization

EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations

Authors: Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi

Faster Global Minimum Cut with Predictions

Authors: Helia Niaparast, Benjamin Moseley, Karan Singh

Domain Adaptation And Transfer Learning

A General Representation-Based Approach to Multi-Source Domain Adaptation

Authors: Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang

Evaluation

Copilot Arena: A Platform for Code LLM Evaluation in the Wild

Authors: Wayne Chi, Valerie Chen, Anastasios Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

RAGGED: Towards Informed Design of Scalable and Stable RAG Systems

Authors: Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, Graham Neubig

RBench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Authors: Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, Bolin Ni, Guo-Wei Yang, Yongming Rao, Houwen Peng, Han Hu, Gordon Wetzstein, Shi-min Hu

Everything Else

On Fine-Grained Distinct Element Estimation

Authors: Ilias Diakonikolas, Daniel Kane, Jasper Lee, Thanasis Pittas, David Woodruff, Samson Zhou

Understanding the Kronecker Matrix-Vector Complexity of Linear Algebra

Authors: Raphael Meyer, William Swartworth, David Woodruff

Fairness

FDGen: A Fairness-Aware Graph Generation Model

Authors: Zichong Wang, Wenbin Zhang

Fairness on Principal Stratum: A New Perspective on Counterfactual Fairness

Authors: Haoxuan Li, Zeyu Tang, Zhichao Jiang, Zhuangyan Fang, Yue Liu, Zhi Geng, Kun Zhang

Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

Authors: Yinong O Wang, Nivedha Sivakumar, Falaah Arif Khan, Katherine Metcalf, Adam Golinski, Natalie Mackraz, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Authors: Konstantina Bairaktari, Jiayun Wu, Steven Wu

Relative Error Fair Clustering in the Weak-Strong Oracle Model

Authors: Vladimir Braverman, Prathamesh Dharangutte, Shaofeng Jiang, Hoai-An Nguyen, Chen Wang, Yubo Zhang, Samson Zhou

Foundation Models

Rethinking the Bias of Foundation Model under Long-tailed Distribution

Authors: Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su

Game Theory

Observation Interference in Partially Observable Assistance Games

Authors: Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell

General Machine Learning

On the Power of Learning-Augmented Search Trees

Authors: Jingbang Chen, Xinyuan Cao, Alicia Stepin, Li Chen

Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges

Authors: Nayoung Lee, Jack Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos

Graph Neural Networks

CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection

Authors: Karish Grover, Geoff Gordon, Christos Faloutsos

Graph World Model

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

Graphical Models

A Generic Family of Graphical Models: Diversity, Efficiency, and Heterogeneity

Authors: Yufei Huang, Changhu Wang, Junjie Tang, Weichi Wu, Ruibin Xi

Health / Medicine

Distributed Parallel Gradient Stacking (DPGS): Solving Whole Slide Image Stacking Challenge in Multi-Instance Learning

Authors: Boyuan Wu, wang, Xianwei Lin, Jiachun Xu, Jikai Yu, Zhou Shicheng, Hongda Chen, Lianxin Hu

SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics

Authors: Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng

Language, Speech And Dialog

A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Authors: Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Yang, Shinji Watanabe

Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu

Large Language Models

Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Authors: Zhaoyi Zhou, Yuda Song, Andrea Zanette

An Architecture Search Framework for Inference-Time Techniques

Authors: Jon Saad-Falcon, Adrian Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Guha, Estefany Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Re, Azalia Mirhoseini

Demystifying Long Chain-of-Thought Reasoning

Authors: Edward Yeo, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue

GSM-∞: How Do Your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?

Authors: Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen

Idiosyncrasies in Large Language Models

Authors: Mingjie Sun, Yida Yin, Zhiqiu (Oscar) Xu, Zico Kolter, Zhuang Liu

Large Language Models are Demonstration Pre-Selectors for Themselves

Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

Let LLM Tell What to Prune and How Much to Prune

Authors: Mingzhe Yang, Sihao Lin, Changlin Li, Xiaojun Chang

Memorization Sinks: Isolating Memorization during LLM Training

Authors: Gaurav Ghosal, Pratyush Maini, Aditi Raghunathan

Optimizing Temperature for Language Models with Multi-Sample Inference

Authors: Weihua Du, Yiming Yang, Sean Welleck

Optimizing Test-Time Compute via Meta Reinforcement Finetuning

Authors: Yuxiao Qu, Matthew Yang, Amrith Setlur, Lewis Tunstall, Edward Beeching, Russ Salakhutdinov, Aviral Kumar

Overtrained Language Models Are Harder to Fine-Tune

Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan

Reflection-Window Decoding: Text Generation with Selective Refinement

Authors: Zeyu Tang, Zhenhao Chen, Xiangchen Song, Loka Li, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang

Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

Authors: Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models

Authors: Anna Hedström, Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Manuela Veloso

Training Software Engineering Agents and Verifiers with SWE-Gym

Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

Unlocking Post-hoc Dataset Inference with Synthetic Data

Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic

Unnatural Languages Are Not Bugs but Features for LLMs

Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, Zico Kolter, Michael Shieh

What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?

Authors: Katie Kang, Amrith Setlur, Dibya Ghosh, Jacob Steinhardt, Claire Tomlin, Sergey Levine, Aviral Kumar

Learning Theory

Sample-Optimal Agnostic Boosting with Unlabeled Data

Authors: Udaya Ghai, Karan Singh

Multi-agent

Online Learning And Bandits

Offline Learning for Combinatorial Multi-armed Bandits

Authors: Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John C. S. Lui, Wei Chen

Online Learning, Active Learning And Bandits

Optimization

FedECADO: A Dynamical System Model of Federated Learning

Authors: Aayushya Agarwal, Gauri Joshi, Lawrence Pileggi

Graph-Based Algorithms for Diverse Similarity Search

Authors: Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Sepideh Mahabadi, Vikas Raykar, Kirankumar Shiragur, Haike Xu

Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures

Authors: Alina Ene, Alessandro Epasto, Vahab Mirrokni, Hoai-An Nguyen, Huy Nguyen, David Woodruff, Peilin Zhong

Robust Sparsification via Sensitivity

Authors: Chansophea Wathanak In, Yi Li, David Woodruff, Xuan Wu

Privacy

EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption

Authors: Leo de Castro, Daniel Escudero, Adya Agrawal, Antigoni Polychroniadou, Manuela Veloso

Private Federated Learning using Preference-Optimized Synthetic Data

Authors: Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti

Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

Authors: Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreacic, Yifan Li, Xiang Yue, Bo Li, Vamsi Potluru, Pan Li, Eli Chien

Probabilistic Methods

Density Ratio Estimation with Conditional Probability Paths

Authors: Hanlin Yu, Arto Klami, Aapo Hyvarinen, Anna Korba, Lemir Omar Chehab

Reinforcement Learning And Planning

Representation Learning

Contextures: Representations from Contexts

Authors: Runtian Zhai, Kai Yang, Burak Varici, Che-Ping Tsai, Zico Kolter, Pradeep Ravikumar

Learning Vision and Language Concepts for Controllable Image Generation

Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Zeyu Tang, Eric Xing, Guangyi Chen, Kun Zhang

Nonparametric Identification of Latent Concepts

Authors: Yujia Zheng, Shaoan Xie, Kun Zhang

Research Priorities, Methodology, And Evaluation

Position: You Can’t Manufacture a NeRF

Authors: Marta An Kimmel, Mueed Rehman, Yonatan Bisk, Gary Fedder

Robotics

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Authors: Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto

Learning Safe Control via On-the-Fly Bandit Exploration

Authors: Alexandre Capone, Ryan Cosner, Aaron Ames, Sandra Hirche

Towards Learning to Complete Anything in Lidar

Authors: Ayça Takmaz, Cristiano Saltori, Neehar Peri, Tim Meinhardt, Riccardo de Lutio, Laura Leal-Taixé, Aljosa Osep

Safety

DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Authors: André Duarte, Xuandong Zhao, Arlindo Oliveira, Lei Li

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Authors: Taesoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, Gyeong-Moon Park

SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine

WMarkGPT: Watermarked Image Understanding via Multimodal Large Language Models

Authors: Tan Songbai, Xuerui Qiu, Yao Shu, Gang Xu, Linrui Xu, Xiangyu Xu, Huiping Zhuang, Ming Li, Fei Yu

Weak-to-Strong Jailbreaking on Large Language Models

Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Wang

Security

ExpProof: Operationalizing Explanations for Confidential Models with ZKPs

Authors: Chhavi Yadav, Evan Laufer, Dan Boneh, Kamalika Chaudhuri

Sequential Models, Time Series

A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments

Authors: Yuchen Wang, Hongjue Zhao, Haohong Lin, Enze Xu, Lifang He, Huajie Shao

Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization

Authors: Luca Masserano, Abdul Fatir Ansari, Boran Han, Xiyuan Zhang, Christos Faloutsos, Michael Mahoney, Andrew Wilson, Youngsuk Park, Syama Sundar Yadav Rangapuram, Danielle Maddix, Yuyang Wang

LSCD: Lomb–Scargle Conditioned Diffusion for Time Series Imputation

Authors: Elizabeth M Fons Etcheverry, Alejandro Sztrajman, Yousef El-Laham, Luciana Ferrer, Svitlana Vyetrenko, Manuela Veloso

Social Aspects

Data-driven Design of Randomized Control Trials with Guaranteed Treatment Effects

Authors: Santiago Cortes-Gomez, Naveen Raman, Aarti Singh, Bryan Wilder

On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents

Authors: Jen-Tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael Lyu, Maarten Sap

STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings

Authors: Saksham Rastogi, Pratyush Maini, Danish Pruthi

Structure Learning

Identification of Latent Confounders via Investigating the Tensor Ranks of the Nonlinear Observations

Authors: Zhengming Chen, Yewei Xia, Feng Xie, Jie Qiao, Zhifeng Hao, Ruichu Cai, Kun Zhang

Supervised Learning

Preserving AUC Fairness in Learning with Noisy Protected Groups

Authors: Mingyang Wu, Li Lin, Wenbin Zhang, Xin Wang, Zhenhuan Yang, Shu Hu

Theory

Learning-Augmented Hierarchical Clustering

Authors: Vladimir Braverman, Jon C. Ergun, Chen Wang, Samson Zhou

On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan Ash, Cyril Zhang, Andrej Risteski

Sort Before You Prune: Improved Worst-Case Guarantees of the DiskANN Family of Graphs

Authors: Siddharth Gollapudi, Ravishankar Krishnaswamy, Kirankumar Shiragur, Harsh Wardhan

Time Series

Exploring Representations and Interventions in Time Series Foundation Models

Authors: Michal Wilinski, Mononito Goswami, Willa Potosnak, Nina Żukowska, Artur Dubrawski

Read More