Tool choice with Amazon Nova models

Tool choice with Amazon Nova models

In many generative AI applications, a large language model (LLM) like Amazon Nova is used to respond to a user query based on the model’s own knowledge or context that it is provided. However, as use cases have matured, the ability for a model to have access to tools or structures that would be inherently outside of the model’s frame of reference has become paramount. This could be APIs, code functions, or schemas and structures required by your end application. This capability has developed into what is referred to as tool use or function calling.

To add fine-grained control to how tools are used, we have released a feature for tool choice for Amazon Nova models. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place.

In this post, we discuss tool use and the new tool choice feature, with example use cases.

Tool use with Amazon Nova

To illustrate the concept of tool use, we can imagine a situation where we provide Amazon Nova access to a few different tools, such as a calculator or a weather API. Based on the user’s query, Amazon Nova will select the appropriate tool and tell you how to use it. For example, if a user asks “What is the weather in Seattle?” Amazon Nova will use the weather tool.

The following diagram illustrates an example workflow between an Amazon Nova model, its available tools, and related external resources.

Tool use at the core is the selection of the tool and its parameters. The responsibility to execute the external functionality is left to application or developer. After the tool is executed by the application, you can return the results to the model for the generation of the final response.

Let’s explore some examples in more detail. The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a weather API, and returning the response to the user.

The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a calculator tool.

Tool choice with Amazon Nova

The toolChoice API parameter allows you to control when a tool is called. There are three supported options for this parameter:

  • Any – With tool choice Any, the model will select at least one of the available tools each time:
    {
       "toolChoice": {
            "any": {}
        }
    }

  • Tool – With tool choice Tool, the model will always use the requested tool:
    {
       "toolChoice": {
            "tool": {
                "name": "name_of_tool"
            }
        }
    }

  • Auto – Tool choice Auto is the default behavior and will leave the tool selection completely up to the model:
    {
       "toolChoice": {
            "auto": {}
        }
    }

A popular tactic to improve the reasoning capabilities of a model is to use chain of thought. When using the tool choice of auto, Amazon Nova will use chain of thought and the response of the model will include both the reasoning and the tool that was selected.

This behavior will differ depending on the use case. When tool or any are selected as the tool choice, Amazon Nova will output only the tools and not output chain of thought.

Use cases

In this section, we explore different use cases for tool choice.

Structured output/JSON mode

In certain scenarios, you might want Amazon Nova to use a specific tool to answer the user’s question, even if Amazon Nova believes it can provide a response without the use of a tool. A common use case for this approach is enforcing structured output/JSON mode. It’s often critical to have LLMs return structured output, because this enables downstream use cases to more effectively consume and process the generated outputs. In these instances, the tools employed don’t necessarily need to be client-side functions—they can be used whenever the model is required to return JSON output adhering to a predefined schema, thereby compelling Amazon Nova to use the specified tool.

When using tools for enforcing structured output, you provide a single tool with a descriptive JSON inputSchema. You specify the tool with {"tool" : {"name" : "Your tool name"}}. The model will pass the input to the tool, so the name of the tool and its description should be from the model’s perspective.

For example, consider a food website. When provided with a dish description, the website can extract the recipe details, such as cooking time, ingredients, dish name, and difficulty level, in order to facilitate user search and filtering capabilities. See the following example code:

import boto3
import json

tool_config = {
    "toolChoice": {
        "name": { "tool" : "extract_recipe"}
    },
    "tools": [
        {
            "toolSpec": {
                "name": "extract_recipe",
                "description": "Extract recipe for cooking instructions",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "recipe": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string",
                                        "description": "Name of the recipe"
                                    },
                                    "description": {
                                        "type": "string",
                                        "description": "Brief description of the dish"
                                    },
                                    "prep_time": {
                                        "type": "integer",
                                        "description": "Preparation time in minutes"
                                    },
                                    "cook_time": {
                                        "type": "integer",
                                        "description": "Cooking time in minutes"
                                    },
                                    "servings": {
                                        "type": "integer",
                                        "description": "Number of servings"
                                    },
                                    "difficulty": {
                                        "type": "string",
                                        "enum": ["easy", "medium", "hard"],
                                        "description": "Difficulty level of the recipe"
                                    },
                                    "ingredients": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "item": {
                                                    "type": "string",
                                                    "description": "Name of ingredient"
                                                },
                                                "amount": {
                                                    "type": "number",
                                                    "description": "Quantity of ingredient"
                                                },
                                                "unit": {
                                                    "type": "string",
                                                    "description": "Unit of measurement"
                                                }
                                            },
                                            "required": ["item", "amount", "unit"]
                                        }
                                    },
                                    "instructions": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Step-by-step cooking instructions"
                                        }
                                    },
                                    "tags": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Categories or labels for the recipe"
                                        }
                                    }
                                },
                                "required": ["name", "ingredients", "instructions"]
                            }
                        },
                        "required": ["recipe"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)
print(json.dumps(response['output']['message']['content'][0][], indent=2))

We can provide a detailed description of a dish as text input:

Legend has it that this decadent chocolate lava cake was born out of a baking mistake in New York's Any Kitchen back in 1987, when chef John Doe pulled a chocolate sponge cake out of the oven too early, only to discover that the dessert world would never be the same. Today I'm sharing my foolproof version, refined over countless dinner parties. Picture a delicate chocolate cake that, when pierced with a fork, releases a stream of warm, velvety chocolate sauce – it's pure theater at the table. While it looks like a restaurant-worthy masterpiece, the beauty lies in its simplicity: just six ingredients (good quality dark chocolate, unsalted butter, eggs, sugar, flour, and a pinch of salt) transform into individual cakes in under 15 minutes. The secret? Precise timing is everything. Pull them from the oven a minute too late, and you'll miss that magical molten center; too early, and they'll be raw. But hit that sweet spot at exactly 12 minutes, when the edges are set but the center still wobbles slightly, and you've achieved dessert perfection. I love serving these straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream that slowly melts into the warm cake. The contrast of temperatures and textures – warm and cold, crisp and gooey – makes this simple dessert absolutely unforgettable.

We can force Amazon Nova to use the tool extract_recipe, which will generate a structured JSON output that adheres to the predefined schema provided as the tool input schema:

 {
  "toolUseId": "tooluse_4YT_DYwGQlicsNYMbWFGPA",
  "name": "extract_recipe",
  "input": {
    "recipe": {
      "name": "Decadent Chocolate Lava Cake",
      "description": "A delicate chocolate cake that releases a stream of warm, velvety chocolate sauce when pierced with a fork. It's pure theater at the table.",
      "difficulty": "medium",
      "ingredients": [
        {
          "item": "good quality dark chocolate",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "unsalted butter",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "eggs",
          "amount": 4,
          "unit": ""
        },
        {
          "item": "sugar",
          "amount": 100,
          "unit": "g"
        },
        {
          "item": "flour",
          "amount": 50,
          "unit": "g"
        },
        {
          "item": "salt",
          "amount": 0.5,
          "unit": "pinch"
        }
      ],
      "instructions": [
        "Preheat the oven to 200u00b0C (400u00b0F).",
        "Melt the chocolate and butter together in a heatproof bowl over a saucepan of simmering water.",
        "In a separate bowl, whisk the eggs and sugar until pale and creamy.",
        "Fold the melted chocolate mixture into the egg and sugar mixture.",
        "Sift the flour and salt into the mixture and gently fold until just combined.",
        "Divide the mixture among six ramekins and bake for 12 minutes.",
        "Serve straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream."
      ],
      "prep_time": 10,
      "cook_time": 12,
      "servings": 6,
      "tags": [
        "dessert",
        "chocolate",
        "cake"
      ]
    }
  }
}

API generation

Another common scenario is to require Amazon Nova to select a tool from the available options no matter the context of the user query. One example of this is with API endpoint selection. In this situation, we don’t know the specific tool to use, and we allow the model to choose between the ones available.

With the tool choice of any, you can make sure that the model will always use at least one of the available tools. Because of this, we provide a tool that can be used for when an API is not relevant. Another example would be to provide a tool that allows clarifying questions.

In this example, we provide the model with two different APIs, and an unsupported API tool that it will select based on the user query:

import boto3
import json

tool_config = {
    "toolChoice": {
        "any": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "get_all_products",
                "description": "API to retrieve multiple products with filtering and pagination options",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "sort_by": {
                                "type": "string",
                                "description": "Field to sort results by. One of: price, name, created_date, popularity",
                                "default": "created_date"
                            },
                            "sort_order": {
                                "type": "string",
                                "description": "Order of sorting (ascending or descending). One of: asc, desc",
                                "default": "desc"
                            },
                        },
                        "required": []
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "get_products_by_id",
                "description": "API to retrieve retail products based on search criteria",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "product_id": {
                                "type": "string",
                                "description": "Unique identifier of the product"
                            },
                        },
                        "required": ["product_id"]
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "unsupported_api",
                "description": "API to use when the user query does not relate to the other available APIs",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "reasoning": {
                                "type": "string",
                                "description": "The reasoning for why the user query did not have a valid API available"
                            },
                        },
                        "required": ["reasoning"]
                    }
                }
            }
        }
    ]
}


messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)

print(json.dumps(response['output']['message']['content'][0], indent=2))

A user input of “Can you get all of the available products?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_YCNbT0GwSAyjIYOuWnDhkw",
    "name": "get_all_products",
    "input": {}
  }
}

Whereas “Can you get my most recent orders?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_jpiZnrVcQDS1sAa-qPwIQw",
    "name": "unsupported_api",
    "input": {
      "reasoning": "The available tools do not support retrieving user orders. The user's request is for personal order information, which is not covered by the provided APIs."
    }
  }
}

Chat with search

The final option for tool choice is auto. This is the default behavior, so it is consistent with providing no tool choice at all.

Using this tool choice will allow the option of tool use or just text output. If the model selects a tool, there will be a tool block and text block. If the model responds with no tool, only a text block is returned. In the following example, we want to allow the model to respond to the user or call a tool if necessary:

import boto3
import json

tool_config = {
    "toolChoice": {
        "auto": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "search",
                "description": "API that provides access to the internet",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "Query to search by",
                            },
                        },
                        "required": ["query"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

system = [{
    "text": "ou are a helpful chatbot. You can use a tool if necessary or respond to the user query"
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)


if (response["stopReason"] == "tool_use"):
    tool_use = next(
        block["toolUse"]
        for block in response["output"]["message"]["content"]
            if "toolUse" in block
    )
   print(json.dumps(tool_use, indent=2))
 else:
    pattern = r'<thinking>.*?</thinking>\n\n|<thinking>.*?</thinking>'
    text_response = response["output"]["message"]["content"][0]["text"]
    stripped_text = re.sub(pattern, '', text_response, flags=re.DOTALL)
    
    print(stripped_text)

A user input of “What is the weather in San Francisco?” would result in a tool call:

{
  "toolUseId": "tooluse_IwtBnbuuSoynn1qFiGtmHA",
  "name": "search",
  "input": {
    "query": "what is the weather in san francisco"
  }
}

Whereas asking the model a direct question like “How many months are in a year?” would respond with a text response to the user:

There are 12 months in a year.

Considerations

There are a few best practices that are required for tool calling with Nova models. The first is to use greedy decoding parameters. With Amazon Nova models, that requires setting a temperature, top p, and top k of 1. You can refer to the previous code examples for how to set these. Using greedy decoding parameters forces the models to produce deterministic responses and improves the success rate of tool calling.

The second consideration is the JSON schema you are using for the tool consideration. At the time of writing, Amazon Nova models support a limited subset of JSON schemas, so they might not be picked up as expected by the model. Common fields would be $def and $ref fields. Make sure that your schema has the following top-level fields set: type (must be object), properties, and required.

Lastly, for the most impact on the success of tool calling, you should optimize your tool configurations. Descriptions and names should be very clear. If there are nuances to when one tool should be called over the other, make sure to have that concisely included in the tool descriptions.

Conclusion

Using tool choice in tool calling workflows is a scalable way to control how a model invokes tools. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place. However, there are complexities to tool calling; for more information, refer to Tool use (function calling) with Amazon Nova, Tool calling systems, and Troubleshooting tool calls.

Explore how Amazon Nova models can enhance your generative AI use cases today.


About the Authors

Jean Farmer is a Generative AI Solutions Architect on the Amazon Artificial General Intelligence (AGI) team, specializing in agentic applications. Based in Seattle, Washington, she works at the intersection of autonomous AI systems and practical business solutions, helping to shape the future of AGI at Amazon.

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Lulu Wong is an AI UX designer on the Amazon Artificial General Intelligence (AGI) team. With a background in computer science, learning design, and user experience, she bridges the technical and user experience domains by shaping how AI systems interact with humans, refining model input-output behaviors, and creating resources to make AI products more accessible to users.

Read More

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Generative AI is rapidly transforming the modern workplace, offering unprecedented capabilities that augment how we interact with text and data. At Amazon Web Services (AWS), we recognize that many of our customers rely on the familiar Microsoft Office suite of applications, including Word, Excel, and Outlook, as the backbone of their daily workflows. In this blog post, we showcase a powerful solution that seamlessly integrates AWS generative AI capabilities in the form of large language models (LLMs) based on Amazon Bedrock into the Office experience. By harnessing the latest advancements in generative AI, we empower employees to unlock new levels of efficiency and creativity within the tools they already use every day. Whether it’s drafting compelling text, analyzing complex datasets, or gaining more in-depth insights from information, integrating generative AI with Office suite transforms the way teams approach their essential work. Join us as we explore how your organization can leverage this transformative technology to drive innovation and boost employee productivity.

Solution overview


Figure 1: Solution architecture overview

The solution architecture in Figure 1 shows how Office applications interact with a serverless backend hosted on the AWS Cloud through an Add-In. This architecture allows users to leverage Amazon Bedrock’s generative AI capabilities directly from within the Office suite, enabling enhanced productivity and insights within their existing workflows.

Components deep-dive

Office Add-ins

Office Add-ins allow extending Office products with custom extensions built on standard web technologies. Using AWS, organizations can host and serve Office Add-ins for users worldwide with minimal infrastructure overhead.

An Office Add-in is composed of two elements:

The code snippet below demonstrates part of a function that could run whenever a user invokes the plugin, performing the following actions:

  1. Initiate a request to the generative AI backend, providing the user prompt and available context in the request body
  2. Integrate the results from the backend response into the Word document using Microsoft’s Office JavaScript APIs. Note that these APIs use objects as namespaces, alleviating the need for explicit imports. Instead, we use the globally available namespaces, such as Word, to directly access relevant APIs, as shown in following example snippet.
// Initiate backend request (optional context)
const response = await sendPrompt({ user_message: prompt, context: selectedContext });

// Modify Word content with responses from the Backend
await Word.run(async (context) => {
  let documentBody;

  // Target for the document modifications
  if (response.location === 'Replace') {
    documentBody = context.document.getSelection(); // active text selection
  } else {
    documentBody = context.document.body; // entire document body
  }

  // Markdown support for preserving original content layout
  // Dependencies used: React markdown
  const content = renderToString(<Markdown>{ response.content } < /Markdown>);
  const operation = documentBody.insertHtml(content, response.location);

  // set properties for the output content (font, size, color, etc.)
  operation.font.set({ name: 'Arial' });

  // flush changes to the Word document
  await context.sync();
});

Generative AI backend infrastructure

The AWS Cloud backend consists of three components:

  1. Amazon API Gateway acts as an entry point, receiving requests from the Office applications’ Add-in. API Gateway supports multiple mechanisms for controlling and managing access to an API.
  2. AWS Lambda handles the REST API integration, processing the requests and invoking the appropriate AWS services.
  3. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With Bedrock’s serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using the AWS tools without having to manage infrastructure.

LLM prompting

Amazon Bedrock allows you to choose from a wide selection of foundation models for prompting. Here, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock for completions. The system prompt we used in this example is as follows:

You are an office assistant helping humans to write text for their documents.

[When preparing the answer, take into account the following text: <text>{context}</text>]
Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
Then, develop your answer in the user’s language within the <response></response> tags.

In the prompt, we first give the LLM a persona, indicating that it is an office assistant helping humans. The second, optional line contains text that has been selected by the user in the document and is provided as context to the LLM. We specifically instruct the LLM to first mimic a step-by-step thought process for arriving at the answer (chain-of-thought reasoning), an effective measure of prompt-engineering to improve the output quality. Next, we instruct it to detect the user’s language from their question so we can later refer to it. Finally, we instruct the LLM to develop its answer using the previously detected user language within response tags, which are used as the final response. While here, we use the default configuration for inference parameters such as temperature, that can quickly be configured with every LLM prompt. The user input is then added as a user message to the prompt and sent via the Amazon Bedrock Messages API to the LLM.

Implementation details and demo setup in an AWS account

As a prerequisite, we need to make sure that we are working in an AWS Region with Amazon Bedrock support for the foundation model (here, we use Anthropic’s Claude 3.5 Sonnet). Also, access to the required relevant Amazon Bedrock foundation models needs to be added. For this demo setup, we describe the manual steps taken in the AWS console. If required, this setup can also be defined in Infrastructure as Code.

To set up the integration, follow these steps:

  1. Create an AWS Lambda function with Python runtime and below code to be the backend for the API. Make sure that we have Powertools for AWS Lambda (Python) available in our runtime, for example, by attaching aLambda layer to our function. Make sure that the Lambda function’s IAM role provides access to the required FM, for example:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
                ]
            }
        ]
    }
    

    The following code block shows a sample implementation for the REST API Lambda integration based on a Powertools for AWS Lambda (Python) REST API event handler:

    import json
    import re
    from typing import Optional
    
    import boto3
    from aws_lambda_powertools import Logger
    from aws_lambda_powertools.event_handler import APIGatewayRestResolver, CORSConfig
    from aws_lambda_powertools.logging import correlation_paths
    from aws_lambda_powertools.utilities.typing import LambdaContext
    from pydantic import BaseModel
    
    logger = Logger()
    app = APIGatewayRestResolver(
        enable_validation=True,
        cors=CORSConfig(allow_origin="http://localhost:3000"),  # for testing purposes
    )
    
    bedrock_runtime_client = boto3.client("bedrock-runtime")
    
    
    SYSTEM_PROMPT = """
    You are an office assistant helping humans to write text for their documents.
    
    {context}
    Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
    Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
    Then, develop your answer in the user's language in markdown format within the <response></response> tags.
    """
    
    class Query(BaseModel):
        user_message: str  # required
        context: Optional[str] = None  # optional
        max_tokens: int = 1000  # default value
        model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # default value
    
    def wrap_context(context: Optional[str]) -> str:
        if context is None:
            return ""
        else:
            return f"When preparing the answer take into account the following text: <text>{context}</text>"
    
    def parse_completion(completion: str) -> dict:
        response = {"completion": completion}
        try:
            tags = ["thinking", "user_language", "response"]
            tag_matches = re.finditer(
                f"<(?P<tag>{'|'.join(tags)})>(?P<content>.*?)</(?P=tag)>",
                completion,
                re.MULTILINE | re.DOTALL,
            )
            for match in tag_matches:
                response[match.group("tag")] = match.group("content").strip()
        except Exception:
            logger.exception("Unable to parse LLM response")
            response["response"] = completion
    
        return response
    
    
    @app.post("/query")
    def query(query: Query):
        bedrock_response = bedrock_runtime_client.invoke_model(
            modelId=query.model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": query.max_tokens,
                    "system": SYSTEM_PROMPT.format(context=wrap_context(query.context)),
                    "messages": [{"role": "user", "content": query.user_message}],
                }
            ),
        )
        response_body = json.loads(bedrock_response.get("body").read())
        logger.info("Received LLM response", response_body=response_body)
        response_text = response_body.get("content", [{}])[0].get(
            "text", "LLM did not respond with text"
        )
        return parse_completion(response_text)
    
    @logger.inject_lambda_context(correlation_id_path=correlation_paths.API_GATEWAY_REST)
    def lambda_handler(event: dict, context: LambdaContext) -> dict:
        return app.resolve(event, context)
    

  2. Create an API Gateway REST API with a Lambda proxy integration to expose the Lambda function via a REST API. You can follow this tutorial for creating a REST API for the Lambda function by using the API Gateway console. By creating a Lambda proxy integration with a proxy resource, we can route requests to the resources to the Lambda function. Follow the tutorial to deploy the API and take note of the API’s invoke URL. Make sure to configure adequate access control for the REST API.

We can now invoke and test our function via the API’s invoke URL. The following example uses curl to send a request (make sure to replace all placeholders in curly braces as required), and the response generated by the LLM:

$ curl --header "Authorization: {token}" 
     --header "Content-Type: application/json" 
     --request POST 
     --data '{"user_message": "Write a 2 sentence summary about AWS."}' 
     https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/query | jq .
{
 "completion": "<thinking>nTo summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.n</thinking>nn<user_language>en</user_language>nn<response>nnAWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management.nn</response>",
 "thinking": "To summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.",
 "user_language": "en",
 "response": "AWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management."
} 

If required, the created resources can be cleaned up by 1) deleting the API Gateway REST API, and 2) deleting the REST API Lambda function and associated IAM role.

Example use cases

To create an interactive experience, the Office Add-in integrates with the cloud back-end that implements conversational capabilities with support for additional context retrieved from the Office JavaScript API.

Next, we demonstrate two different use cases supported by the proposed solution, text generation and text refinement.

Text generation


Figure 2: Text generation use-case demo

In the demo in Figure 2, we show how the plug-in is prompting the LLM to produce a text from scratch. The user enters their query with some context into the Add-In text input area. Upon sending, the backend will prompt the LLM to generate respective text, and return it back to the frontend. From the Add-in, it is inserted into the Word document at the cursor position using the Office JavaScript API.

Text refinement


Figure 3: Text refinement use-case demo

In Figure 3, the user highlighted a text segment in the work area and entered a prompt into the Add-In text input area to rephrase the text segment. Again, the user input and highlighted text are processed by the backend and returned to the Add-In, thereby replacing the previously highlighted text.

Conclusion

This blog post showcases how the transformative power of generative AI can be incorporated into Office processes. We described an end-to-end sample of integrating Office products with an Add-in for text generation and manipulation with the power of LLMs. In our example, we used managed LLMs on Amazon Bedrock for text generation. The backend is hosted as a fully serverless application on the AWS cloud.

Text generation with LLMs in Office supports employees by streamlining their writing process and boosting productivity. Employees can leverage the power of generative AI to generate and edit high-quality content quickly, freeing up time for other tasks. Additionally, the integration with a familiar tool like Word provides a seamless user experience, minimizing disruptions to existing workflows.

To learn more about boosting productivity, building differentiated experiences, and innovating faster with AWS visit the Generative AI on AWS page.


About the Authors

Martin Maritsch is a Generative AI Architect at AWS ProServe focusing on Generative AI and MLOps. He helps enterprise customers to achieve business outcomes by unlocking the full potential of AI/ML services on the AWS Cloud.

Miguel Pestana is a Cloud Application Architect in the AWS Professional Services team with over 4 years of experience in the automotive industry delivering cloud native solutions. Outside of work Miguel enjoys spending its days at the beach or with a padel racket in one hand and a glass of sangria on the other.

Carlos Antonio Perea Gomez is a Builder with AWS Professional Services. He enables customers to become AWSome during their journey to the cloud. When not up in the cloud he enjoys scuba diving deep in the waters.

Read More

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

As we gather for NVIDIA GTC, organizations of all sizes are at a pivotal moment in their AI journey. The question is no longer whether to adopt generative AI, but how to move from promising pilots to production-ready systems that deliver real business value. The organizations that figure this out first will have a significant competitive advantage—and we’re already seeing compelling examples of what’s possible.

Consider Hippocratic AI’s work to develop AI-powered clinical assistants to support healthcare teams as doctors, nurses, and other clinicians face unprecedented levels of burnout. During a recent hurricane in Florida, their system called 100,000 patients in a day to check on medications and provide preventative healthcare guidance–the kind of coordinated outreach that would be nearly impossible to achieve manually. They aren’t just building another chatbot; they are reimagining healthcare delivery at scale.

Production-ready AI like this requires more than just cutting-edge models or powerful GPUs. In my decade working with customers’ data journeys, I’ve seen that an organization’s most valuable asset is its domain-specific data and expertise. And now leading our data and AI go-to-market, I hear customers consistently emphasize what they need to transform their domain advantage into AI success: infrastructure and services they can trust—with performance, cost-efficiency, security, and flexibility—all delivered at scale. When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale—a challenge that AWS has consistently solved for customers. As the world’s most comprehensive and broadly adopted cloud, our partnership with NVIDIA’s pioneering accelerated computing platform for generative AI amplifies this capability. It’s inspiring to see how, together, we’re enabling customers across industries to confidently move AI into production.

In this post, I will share some of these customers’ remarkable journeys, offering practical insights for any organization looking to harness the power of generative AI.

Transforming content creation with generative AI

Content creation represents one of the most visible and immediate applications of generative AI today. Adobe, a pioneer that has shaped creative workflows for over four decades, has moved with remarkable speed to integrate generative AI across its flagship products, helping millions of creators work in entirely new ways.

Adobe’s approach to generative AI infrastructure exemplifies what their VP of Generative AI, Alexandru Costin, calls an “AI superhighway”—a sophisticated technical foundation that enables rapid iteration of AI models and seamless integration into their creative applications. The success of their Firefly family of generative AI models, integrated across flagship products like Photoshop, demonstrates the power of this approach. For their AI training and inference workloads, Adobe uses NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) P5en (NVIDIA H200 GPUs), P5 (NVIDIA H100 GPUs), P4de (NVIDIA A100 GPUs), and G5 (NVIDIA A10G GPUs) instances. They also use NVIDIA software such as NVIDIA TensorRT and NVIDIA Triton Inference Server for faster, scalable inference. Adobe needed maximum flexibility to build their AI infrastructure, and AWS provided the complete stack of services needed—from Amazon FSx for Lustre for high-performance storage, to Amazon Elastic Kubernetes Service (Amazon EKS) for container orchestration, to Elastic Fabric Adapter (EFA) for high-throughput networking—to create a production environment that could reliably serve millions of creative professionals.

Key takeaway

If you’re building and managing your own AI pipelines, Adobe’s success highlights a critical insight: although GPU-accelerated compute often gets the spotlight in AI infrastructure discussions, what’s equally important is the NVIDIA software stack along with the foundation of orchestration, storage, and networking services that enable production-ready AI. Their results speak for themselves—Adobe achieved a 20-fold scale-up in model training while maintaining the enterprise-grade performance and reliability their customers expect.

Pioneering new AI applications from the ground up

Throughout my career, I’ve been particularly energized by startups that take on audacious challenges—those that aren’t just building incremental improvements but are fundamentally reimagining how things work. Perplexity exemplifies this spirit. They’ve taken on a technology most of us now take for granted: search. It’s the kind of ambitious mission that excites me, not just because of its bold vision, but because of the incredible technical challenges it presents. When you’re processing 340 million queries monthly and serving over 1,500 organizations, transforming search isn’t just about having great ideas—it’s about building robust, scalable systems that can deliver consistent performance in production.

Perplexity’s innovative approach earned them membership in both AWS Activate and NVIDIA Inception—flagship programs designed to accelerate startup innovation and success. These programs provided them with the resources, technical guidance, and support needed to build at scale. They were one of the early adopters of Amazon SageMaker HyperPod, and continue to use its distributed training capabilities to accelerate model training time by up to 40%. They use a highly optimized inference stack built with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to serve both their search application and pplx-api, their public API service that gives developers access to their proprietary models. The results speak for themselves—their inference stack achieves up to 3.1 times lower latency compared to other platforms. Both their training and inference workloads run on NVIDIA GPU-accelerated EC2 P5 instances, delivering the performance and reliability needed to operate at scale. To give their users even more flexibility, Perplexity complements their own models with services such as Amazon Bedrock, and provides access to additional state-of-the-art models in their API. Amazon Bedrock offers ease of use and reliability, which are crucial for their team—as they note, it allows them to effectively maintain the reliability and latency their product demands.

What I find particularly compelling about Perplexity’s journey is their commitment to technical excellence, exemplified by their work optimizing GPU memory transfer with EFA networking. The team achieved 97.1% of the theoretical maximum bandwidth of 3200 Gbps and open sourced their innovations, enabling other organizations to benefit from their learnings.

For those interested in the technical details, I encourage you to read their fascinating post Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS Sagemaker Hyperpod.

Key takeaway

For organizations with complex AI workloads and specific performance requirements, Perplexity’s approach offers a valuable lesson. Sometimes, the path to production-ready AI isn’t about choosing between self-hosted infrastructure and managed services—it’s about strategically combining both. This hybrid strategy can deliver both exceptional performance (evidenced by Perplexity’s 3.1 times lower latency) and the flexibility to evolve.

Transforming enterprise workflows with AI

Enterprise workflows represent the backbone of business operations—and they’re a crucial proving ground for AI’s ability to deliver immediate business value. ServiceNow, which terms itself the AI platform for business transformation, is rapidly integrating AI to reimagine core business processes at scale.

ServiceNow’s innovative AI solutions showcase their vision for enterprise-specific AI optimization. As Srinivas Sunkara, ServiceNow’s Vice President, explains, their approach focuses on deep AI integration with technology workflows, core business processes, and CRM systems—areas where traditional large language models (LLMs) often lack domain-specific knowledge. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. Their architecture combines high-performance FSx for Lustre storage with NVIDIA GPU clusters for training, and NVIDIA Triton Inference Server handles production deployment. This robust technology platform allows ServiceNow to focus on domain-specific AI development and customer value rather than infrastructure management.

Key takeaway

ServiceNow offers an important lesson about enterprise AI adoption: while foundation models (FMs) provide powerful general capabilities, the greatest business value often comes from optimizing models for specific enterprise use cases and workflows. In many cases, it’s precisely this deliberate specialization that transforms AI from an interesting technology into a true business accelerator.

Scaling AI across enterprise applications

Cisco’s Webex team’s journey with generative AI exemplifies how large organizations can methodically transform their applications while maintaining enterprise standards for reliability and efficiency. With a comprehensive suite of telecommunications applications serving customers globally, they needed an approach that would allow them to incorporate LLMs across their portfolio—from AI assistants to speech recognition—without compromising performance or increasing operational complexity.

The Webex team’s key insight was to separate their models from their applications. Previously, they had embedded AI models into the container images for applications running on Amazon EKS, but as their models grew in sophistication and size, this approach became increasingly inefficient. By migrating their LLMs to Amazon SageMaker AI and using NVIDIA Triton Inference Server, they created a clean architectural break between their relatively lean applications and the underlying models, which require more substantial compute resources. This separation allows applications and models to scale independently, significantly reducing development cycle time and increasing resource utilization. The team deployed dozens of models on SageMaker AI endpoints, using Triton Inference Server’s model concurrency capabilities to scale globally across AWS data centers.

The results validate Cisco’s methodical approach to AI transformation. By separating applications from models, their development teams can now fix bugs, perform tests, and add features to applications much faster, without having to manage large models in their workstation memory. The architecture also enables significant cost optimization—applications remain available during off-peak hours for reliability, and model endpoints can scale down when not needed, all without impacting application performance. Looking ahead, the team is evaluating Amazon Bedrock to further improve their price-performance, demonstrating how thoughtful architecture decisions create a foundation for continuous optimization.

Key takeaway

For enterprises with large application portfolios looking to integrate AI at scale, Cisco’s methodical approach offers an important lesson: separating LLMs from applications creates a cleaner architectural boundary that improves both development velocity and cost optimization. By treating models and applications as independent components, Cisco significantly improved development cycle time while reducing costs through more efficient resource utilization.

Building mission-critical AI for healthcare

Earlier, we highlighted how Hippocratic AI reached 100,000 patients during a crisis. Behind this achievement lies a story of rigorous engineering for safety and reliability—essential in healthcare where stakes are extraordinarily high.

Hippocratic AI’s approach to this challenge is both innovative and rigorous. They’ve developed what they call a “constellation architecture”—a sophisticated system of over 20 specialized models working in concert, each focused on specific safety aspects like prescription adherence, lab analysis, and over-the-counter medication guidance. This distributed approach to safety means they have to train multiple models, requiring management of significant computational resources. That’s why they use SageMaker HyperPod for their training infrastructure, using Amazon FSx and Amazon Simple Storage Service (Amazon S3) for high-speed storage access to NVIDIA GPUs, while Grafana and Prometheus provide the comprehensive monitoring needed to provide optimal GPU utilization. They build upon NVIDIA’s low-latency inference stack, and are enhancing conversational AI capabilities using NVIDIA Riva models for speech recognition and text-to-speech translation, and are also using NVIDIA NIM microservices to deploy these models. Given the sensitive nature of healthcare data and HIPAA compliance requirements, they’ve implemented a sophisticated multi-account, multi-cluster strategy on AWS—running production inference workloads with patient data on completely separate accounts and clusters from their development and training environments. This careful attention to both security and performance allows them to handle thousands of patient interactions while maintaining precise control over clinical safety and accuracy.

The impact of Hippocratic AI’s work extends far beyond technical achievements. Their AI-powered clinical assistants address critical healthcare workforce burnout by handling burdensome administrative tasks—from pre-operative preparation to post-discharge follow-ups. For example, during weather emergencies, their system can rapidly assess heat risks and coordinate transport for vulnerable patients—the kind of comprehensive care that would be too burdensome and resource-intensive to coordinate manually at scale.

Key takeaway

For organizations building AI solutions for complex, regulated, and high-stakes environments, Hippocratic AI’s constellation architecture reinforces what we’ve consistently emphasized: there’s rarely a one-size-fits-all model for every use case. Just as Amazon Bedrock offers a choice of models to meet diverse needs, Hippocratic AI’s approach of combining over 20 specialized models—each focused on specific safety aspects—demonstrates how a thoughtfully designed ensemble can achieve both precision and scale.

Conclusion

As the technology partners enabling these and countless other customer innovations, AWS and NVIDIA’s long-standing collaboration continues to evolve to meet the demands of the generative AI era. Our partnership, which began over 14 years ago with the world’s first GPU cloud instance, has grown to offer the industry’s widest range of NVIDIA accelerated computing solutions and software services for optimizing AI deployments. Through initiatives like Project Ceiba—one of the world’s fastest AI supercomputers hosted exclusively on AWS using NVIDIA DGX Cloud for NVIDIA’s own research and development use—we continue to push the boundaries of what’s possible.

As all the examples we’ve covered illustrate, it isn’t just about the technology we build together—it’s how organizations of all sizes are using these capabilities to transform their industries and create new possibilities. These stories ultimately reveal something more fundamental: when we make powerful AI capabilities accessible and reliable, people find remarkable ways to use them to solve meaningful problems. That’s the true promise of our partnership with NVIDIA—enabling innovators to create positive change at scale. I’m excited to continue inventing and partnering with NVIDIA and can’t wait to see what our mutual customers are going to do next.

Resources

Check out the following resources to learn more about our partnership with NVIDIA and generative AI on AWS:


About the Author

Rahul Pathak is Vice President Data and AI GTM at AWS, where he leads the global go-to-market and specialist teams who are helping customers create differentiated value with AWS’s AI and capabilities such as Amazon Bedrock, Amazon Q, Amazon SageMaker, and Amazon EC2 and Data Services such as Amaqzon S3, AWS Glue and Amazon Redshift. Rahul believes that generative AI will transform virtually every single customer experience and that data is a key differentiator for customers as they build AI applications. Prior to his current role, he was Vice President, Relational Database Engines where he led Amazon Aurora, Redshift, and DSQL . During his 13+ years at AWS, Rahul has been focused on launching, building, and growing managed database and analytics services, all aimed at making it easy for customers to get value from their data. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on analytics and the other on IP-geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Read More

Amazon Q Business now available in Europe (Ireland) AWS Region

Amazon Q Business now available in Europe (Ireland) AWS Region

Today, we are excited to announce that Amazon Q Business—a fully managed generative-AI powered assistant that you can configure to answer questions, provide summaries and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.

Since its launch, Amazon Q Business has been helping customers find information, gain insight, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will support customers across Ireland and the EU to transform how their employees work and access information, while maintaining data security and privacy requirements.

AWS customers and partners innovate using Amazon Q Business in Europe

Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.

Katya Dunets, the AWS Lead Sales Engineer for Adastra noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing

With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.

Get started with Amazon Q Business today

To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (for example, text, PDF, images, and tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers like Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Amazon Q Business and the data they have access to. Amazon Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge; this may include getting details, generating content, or surfacing insights. Users can access Amazon Q Business through multiple channels, including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) like Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.

Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.


About the Authors

Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.

Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business, based in Seattle.

Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.

Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Read More

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that enables seamless integration into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more.
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and much more. By using these algorithms, you can align language models to be safer, more harmless, and more helpful.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.

Prerequisites

Before running the training job, you deploy a SageMaker HyperPod cluster. To deploy the cluster, you first need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the CloudFormation template to go to your AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx for Lustre file system.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators; typically used for managing jobs, monitoring, and debugging.
      2. Controller-machine: The head node for the SageMaker HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
  5. Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Create the SageMaker HyperPod cluster.
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.
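
Instead of re-running the CLI command manually, you can also poll the cluster status programmatically. The following is a minimal sketch (not part of the original workshop steps) that uses the boto3 describe_cluster API; the cluster name ml-cluster and the polling interval are assumptions for illustration.

import time

import boto3

sm_client = boto3.client("sagemaker")

def wait_for_cluster(cluster_name: str, poll_seconds: int = 60) -> str:
    """Poll the SageMaker HyperPod cluster until it leaves the Creating state."""
    while True:
        status = sm_client.describe_cluster(ClusterName=cluster_name)["ClusterStatus"]
        print(f"Cluster {cluster_name}: {status}")
        if status != "Creating":
            return status
        time.sleep(poll_seconds)

# Proceed to the next step only when this prints InService
print(wait_for_cluster("ml-cluster"))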

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partition and nodes per partition
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10; verify that this version is installed on the head node before proceeding.
  2. You can use the following steps to set up NeMo-Run dependencies using a virtual environment. The steps create and activate a virtual environment, then execute the venv.sh script to install the dependencies. The dependencies being installed include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, showing the reduced_train_loss metric, which decreases over the training steps.
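
If you prefer to inspect the metric values directly instead of the TensorBoard UI, the following is a minimal sketch (an addition for illustration) that reads the downloaded event file with TensorBoard’s EventAccumulator and prints the reduced_train_loss values; the tag name is taken from the training logs described above and might differ in your run.

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point the accumulator at the directory containing the events.out.tfevents.* file
acc = EventAccumulator(".")
acc.Reload()

# List the available scalar tags to confirm the exact metric name
print(acc.Tags()["scalars"])

for event in acc.Scalars("reduced_train_loss"):
    print(f"step={event.step} loss={event.value:.4f}")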

Troubleshooting

  • Issue: Some of the nodes appear in a down or down* state. In the following example output, both nodes are shown in the down* state.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following output, the nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.

References


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Read More

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.
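
To illustrate how dynamic (Matryoshka) embedding sizes reduce the storage footprint, the following is a minimal sketch of the typical client-side pattern: truncate the returned vector to a smaller dimension and re-normalize it before indexing. The target dimension of 384 and the random stand-in vector are illustrative assumptions, not values from the model documentation.

import numpy as np

def truncate_embedding(embedding: list[float], dim: int = 384) -> np.ndarray:
    """Keep the first dim components of a Matryoshka-style embedding and L2-normalize the result."""
    vec = np.asarray(embedding[:dim], dtype=np.float32)
    return vec / np.linalg.norm(vec)

full = np.random.rand(2048).tolist()  # stand-in for an embedding returned by the endpoint
compact = truncate_embedding(full)
print(compact.shape)  # (384,) -- smaller vectors shrink the vector index and storage costs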

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through AWS Marketplace. If you are already subscribed, then you can move forward with choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

import boto3

# The SageMaker client and execution role are assumed to be available in your environment
sm = boto3.client("sagemaker")
role = "Specify your SageMaker execution role ARN here"  # placeholder

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying instance type; in this case, we are using an ml.g5.2xlarge instance type accelerated by NVIDIA A10G GPUs. Make sure you have the account-level service limit for using ml.g5.2xlarge for endpoint usage as one or more instances. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,  # Model download timeout in seconds
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,  # Health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

The endpoint deployment can take several minutes. Use the following code to monitor the deployment of the NIM microservice; it polls the endpoint status until it changes from Creating to InService:

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService

After you deploy the model, your endpoint is ready for inference. In the following section, we use a sample text to do an inference request. For inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

import boto3

# Runtime client for invoking the SageMaker endpoint
client = boto3.client("sagemaker-runtime")

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}

Batch inference example

When you have many documents, you can vectorize each of them with a for loop. This will often result in many requests. Alternatively, you can send requests consisting of batches of documents to reduce the number of requests to the API endpoint. We use the following example with a dataset of 10 documents. Let’s test the model with a number of documents in different languages:

documents = [
"El futuro de la computación cuántica en aplicaciones criptográficas.",
"L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
"Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
"L’evoluzione del cloud computing nello sviluppo di software aziendale.",
"Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
"Потенциал граничных вычислений для обработки данных в реальном времени.",
"评估人工智能在欺诈检测系统中的有效性。",
"倫理的なAIアルゴリズムの開発における課題と機会。",
"دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
"सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenate vectors into a single list; preserve the original index
    encoded_data.extend({"embedding": data[1]["embedding"], "index": data[0]} for
                        data in zip(range(i, i+batch_size), response["data"]))

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the relevance scores are returned as raw logits rather than normalized probabilities. Higher scores indicate higher relevance to the query, and lower (more negative) scores indicate lower relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g. 4 in your example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
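
If your application needs scores in a fixed range, one common pattern is to map the logits through a sigmoid and sort the documents by the resulting score. The following is a minimal sketch (an illustrative addition, reusing the output and documents variables from the example above) that prints the passages in ranked order:

import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for rank, entry in enumerate(output["rankings"], start=1):
    doc_text = documents[entry["index"]]["text"]
    score = sigmoid(entry["logit"])
    print(f"{rank:>2}. score={score:.3f} logit={entry['logit']:>10.4f} | {doc_text[:60]}")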

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

As generative AI adoption accelerates across enterprises, maintaining safe, responsible, and compliant AI interactions has never been more critical. Amazon Bedrock Guardrails provides configurable safeguards that help organizations build generative AI applications with industry-leading safety protections. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FMs), improving user experiences and standardizing safety controls across generative AI applications. Beyond Amazon Bedrock models, the service offers the flexible ApplyGuardrails API that enables you to assess text using your pre-configured guardrails without invoking FMs, allowing you to implement safety controls across generative AI applications—whether running on Amazon Bedrock or on other systems—at both input and output levels.
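
As a quick illustration of that standalone capability, the following is a minimal sketch that calls the ApplyGuardrail API on a piece of input text without invoking a foundation model; the guardrail identifier, version, and sample text are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="exampleguardrail",  # guardrail ID or ARN (placeholder)
    guardrailVersion="1",
    source="INPUT",  # evaluate a user prompt; use "OUTPUT" for model responses
    content=[{"text": {"text": "How do I return a defective item and get a refund?"}}],
)

# GUARDRAIL_INTERVENED means a policy matched; NONE means the text passed the checks
print(response["action"])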

Today, we’re announcing a significant enhancement to Amazon Bedrock Guardrails: AWS Identity and Access Management (IAM) policy-based enforcement. This powerful capability enables security and compliance teams to establish mandatory guardrails for every model inference call, making sure organizational safety policies are consistently enforced across AI interactions. This feature enhances AI governance by enabling centralized control over guardrail implementation.

Challenges with building generative AI applications

Organizations deploying generative AI face critical governance challenges: content appropriateness, where models might produce undesirable responses to problematic prompts; safety concerns, with potential generation of harmful content even from innocent prompts; privacy protection requirements for handling sensitive information; and consistent policy enforcement across AI deployments.

Perhaps most challenging is making sure that appropriate safeguards are applied consistently across AI interactions within an organization, regardless of which team or individual is developing or deploying applications.

Amazon Bedrock Guardrails capabilities

Amazon Bedrock Guardrails enables you to implement safeguards in generative AI applications customized to your specific use cases and responsible AI policies. Guardrails currently supports six types of policies:

  • Content filters – Configurable thresholds across six harmful categories: hate, insults, sexual, violence, misconduct, and prompt injections
  • Denied topics – Definition of specific topics to be avoided in the context of an application
  • Sensitive information filters – Detection and removal of personally identifiable information (PII) and custom regex entities to protect user privacy
  • Word filters – Blocking of specific words in generative AI applications, such as harmful words, profanity, or competitor names and products
  • Contextual grounding checks – Detection and filtering of hallucinations in model responses by verifying if the response is properly grounded in the provided reference source and relevant to the user query
  • Automated reasoning – Prevention of factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with known facts and aren’t based on fabricated or inconsistent data

Policy-based enforcement of guardrails

Security teams often have organizational requirements to enforce the use of Amazon Bedrock Guardrails for every inference call to Amazon Bedrock. To support this requirement, Amazon Bedrock Guardrails provides the new IAM condition key bedrock:GuardrailIdentifier, which can be used in IAM policies to enforce the use of a specific guardrail for model inference. The condition key can be applied to model inference APIs such as InvokeModel and InvokeModelWithResponseStream, as shown in the policy examples later in this post.

The following diagram illustrates the policy-based enforcement workflow.

If the guardrail configured in your IAM policy doesn’t match the guardrail specified in the request, the request will be rejected with an access denied exception, enforcing compliance with organizational policies.
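
From the caller’s perspective, enforcement looks like the following minimal sketch: the inference request must reference the guardrail named in the IAM policy condition, otherwise the call fails with an access denied error. The model ID, guardrail identifier, and version below are placeholders for illustration.

import json

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
})

try:
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
        body=body,
        guardrailIdentifier="exampleguardrail",  # must satisfy the bedrock:GuardrailIdentifier condition
        guardrailVersion="1",
    )
    print(json.loads(response["body"].read()))
except ClientError as err:
    if err.response["Error"]["Code"] == "AccessDeniedException":
        print("Request rejected: the guardrail does not match the IAM policy condition.")
    else:
        raise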

Policy examples

In this section, we present several policy examples demonstrating how to enforce guardrails for model inference.

Example 1: Enforce the use of a specific guardrail and its numeric version

The following example illustrates the enforcement of exampleguardrail and its numeric version 1 during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

The added explicit deny blocks the listed actions for any other GuardrailIdentifier and GuardrailVersion values, irrespective of other permissions the user might have.

Example 2: Enforce the use of a specific guardrail and its draft version

The following example illustrates the enforcement of exampleguardrail and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 3: Enforce the use of a specific guardrail and its numeric versions

The following example illustrates the enforcement of exampleguardrail and its numeric versions during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 4: Enforce the use of a specific guardrail and its versions, including the draft

The following example illustrates the enforcement of exampleguardrail and its versions, including the draft, during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 5: Enforce the use of a specific guardrail and version pair from a list of guardrail and version pairs

The following example illustrates the enforcement of exampleguardrail1 and its version 1, or exampleguardrail2 and its version 2, or exampleguardrail3 and its version 3 and its draft during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
            ]
        }
    ]
}

Known limitations

When implementing policy-based guardrail enforcement, be aware of these limitations:

  • At the time of this writing, Amazon Bedrock Guardrails doesn’t support resource-based policies for cross-account access.
  • If a user assumes a role that has a specific guardrail configured using the bedrock:GuardrailIdentifier condition key, the user can strategically use input tags to help avoid having guardrail checks applied to certain parts of their prompt. Input tags allow users to mark specific sections of text that should be processed by guardrails, leaving other sections unprocessed. For example, a user could intentionally leave sensitive or potentially harmful content outside of the tagged sections, preventing those portions from being evaluated against the guardrail policies. However, regardless of how the prompt is structured or tagged, the guardrail will still be fully applied to the model’s response.
  • If a user has a role configured with a specific guardrail requirement (using the bedrock:GuardrailIdentifier condition), they shouldn’t use that same role to access services like Amazon Bedrock Knowledge Bases RetrieveAndGenerate or Amazon Bedrock Agents InvokeAgent. These higher-level services work by making multiple InvokeModel calls behind the scenes on the user’s behalf. Although some of these calls might include the required guardrail, others don’t. When the system attempts to make these guardrail-free calls using a role that requires guardrails, it results in AccessDenied errors, breaking the functionality of these services. To help avoid this issue, organizations should separate permissions—using different roles for direct model access with guardrails versus access to these composite Amazon Bedrock services.

Conclusion

The new IAM policy-based guardrail enforcement in Amazon Bedrock represents a crucial advancement in AI governance as generative AI becomes integrated into business operations. By enabling centralized policy enforcement, security teams can maintain consistent safety controls across AI applications regardless of who develops or deploys them, effectively mitigating risks related to harmful content, privacy violations, and bias. This approach offers significant advantages: it scales efficiently as organizations expand their AI initiatives without creating administrative bottlenecks, helps prevent technical debt by standardizing safety implementations, and enhances the developer experience by allowing teams to focus on innovation rather than compliance mechanics.

This capability demonstrates organizational commitment to responsible AI practices through comprehensive monitoring and audit mechanisms. Organizations can use model invocation logging in Amazon Bedrock to capture complete request and response data in Amazon CloudWatch Logs or Amazon Simple Storage Service (Amazon S3) buckets, including specific guardrail trace documentation showing when and how content was filtered. Combined with AWS CloudTrail integration that records guardrail configurations and policy enforcement actions, businesses can confidently scale their generative AI initiatives with appropriate safety mechanisms protecting their brand, customers, and data—striking the essential balance between innovation and ethical responsibility needed to build trust in AI systems.

Get started today with Amazon Bedrock Guardrails and implement configurable safeguards that balance innovation with responsible AI governance across your organization.


About the Authors

Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries for natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. Firstly, LLMs don’t have access to enterprise databases, and the models need to be customized to understand the specific database of an enterprise. Additionally, the complexity increases due to the presence of synonyms for columns and internal metrics available.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application with RAG. We use Anthropic’s Claude 3.5 Sonnet to generate SQL queries, Amazon Titan Text Embeddings for text embedding, and Amazon Bedrock to access these models.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundational model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space, and it is usually performed by a machine learning (ML) model. The following diagram provides more details about embeddings.
  3. RAG – We use RAG to provide more context about table schemas, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes you can build powerful data apps using only Python.

The following diagram shows the solution architecture.


We need to provide the LLM with information about the enterprise-specific database. This makes sure that the model can correctly understand the database and generate responses tailored to the enterprise data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices. Each choice represents one of these four categories and is linked to an individual JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts the text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides input through the chat prompt, we use similarity search to find the relevant table metadata in the vector database for the user’s query. The user input is combined with the relevant table metadata and the prompt template and passed to the FM as a single input. The FM generates the SQL query based on this final input.

To evaluate the model’s accuracy and keep a record of inputs and outputs, we store every user input and model output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with your unique identifier (see the sketch after this list for one way to create it programmatically). Bucket names are globally unique across Amazon S3.
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain
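
As referenced in step 3, you can create the logging bucket programmatically instead of through the console. The following minimal sketch, assuming the us-east-1 Region and a hypothetical bucket name, uses boto3:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Bucket names are globally unique; replace the suffix with your own identifier
    bucket_name = "simplesql-logs-example1234"  # hypothetical name

    # In us-east-1, no CreateBucketConfiguration is needed; other Regions require
    # CreateBucketConfiguration={"LocationConstraint": "<region>"}
    s3.create_bucket(Bucket=bucket_name)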

Procedure

There are three main components in this solution:

  1. JSON files store the table schema and configure the LLM
  2. Vector indexing using Amazon Bedrock
  3. Streamlit for the front-end UI

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file named Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file named library.py by following these steps (a quick usage sketch follows the list):

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain.llms import Bedrock  # LLM wrapper for Amazon Bedrock (BedrockChat is used below)
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize the cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
            
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file
            # file_path="Table_Schema_RP.json",  # Uncomment to use a different file
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # Treat the content as text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide a SQL query using Amazon Redshift syntax that retrieves the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket. This code continues inside the if input_text: block from the previous step:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To visit the application using your browser, navigate to localhost (by default, Streamlit serves the app on port 8501, for example http://localhost:8501).

To visit the application using SageMaker, copy your notebook URL and replace default/lab in the URL with default/proxy/8501/. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL Query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions, and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error raised by the inference endpoint: an error occurred (AccessDeniedException) when calling the InvokeModel operation. You don’t have access to the model with the specified model ID.
Solution – Make sure you have access to the FMs used in this solution on Amazon Bedrock: Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to the following:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More

Unleash AI innovation with Amazon SageMaker HyperPod

Unleash AI innovation with Amazon SageMaker HyperPod

The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters. Customers also face the challenges of writing specialized code for distributed training, continuously optimizing models, addressing hardware issues, and keeping projects on track and within budget. To simplify this process, AWS introduced Amazon SageMaker HyperPod during AWS re:Invent 2023, and it has emerged as a pioneering solution, revolutionizing how companies approach AI development and deployment.

As Amazon CEO Andy Jassy recently shared, “One of the most exciting innovations we’ve introduced is SageMaker HyperPod. HyperPod accelerates the training of machine learning models by distributing and parallelizing workloads across numerous powerful processors like AWS’s Trainium chips or GPUs. HyperPod also constantly monitors your infrastructure for problems, automatically repairing them when detected. During repair, your work is automatically saved, ensuring seamless resumption. This innovation is widely adopted, with most SageMaker AI customers relying on HyperPod for their demanding training needs.”

In this post, we show how SageMaker HyperPod, and its new features introduced at AWS re:Invent 2024, is designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale and attractive price-performance.

Customers using SageMaker HyperPod

Leading startups like Writer, Luma AI, and Perplexity, as well as major enterprises such as Thomson Reuters and Salesforce, are accelerating model development with SageMaker HyperPod. Amazon itself used SageMaker HyperPod to train its new Amazon Nova models, significantly reducing training costs, enhancing infrastructure performance, and saving months of manual effort that would have otherwise been spent on cluster setup and end-to-end process management.

Today, more organizations are eager to fine-tune popular publicly available models or train their own specialized models to revolutionize their businesses and applications with generative AI. To support this demand, SageMaker HyperPod continues to evolve, introducing new innovations that make it straightforward, faster, and more cost-effective for customers to build, train, and deploy these models at scale.

Deep infrastructure control

SageMaker HyperPod offers persistent clusters with deep infrastructure control, enabling builders to securely connect using SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost), minimizing downtime during critical node replacements.

You can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), along with the libraries built on these tools, to enable flexible job scheduling and compute sharing. Integrating SageMaker HyperPod clusters with Slurm also allows the use of NVIDIA’s Enroot and Pyxis for efficient container scheduling in performant, unprivileged sandboxes. The underlying operating system and software stack are based on the Deep Learning AMI, preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. SageMaker HyperPod also is integrated with Amazon SageMaker AI distributed training libraries, optimized for AWS infrastructure, enabling automatic workload distribution across thousands of accelerators for efficient parallel training.

Builders can use built-in ML tools within SageMaker HyperPod to enhance model performance. For example, Amazon SageMaker with TensorBoard helps visualize model architecture and address convergence issues, as shown in the following screenshot. Integration with observability tools like Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, streamlining development time.

SageMaker HyperPod allows you to implement custom libraries and frameworks, enabling the service to be tailored to specific AI project needs. This level of personalization is essential in the rapidly evolving AI landscape, where innovation often requires experimenting with cutting-edge techniques and technologies. The adaptability of SageMaker HyperPod means that businesses are not constrained by infrastructure limitations, fostering creativity and technological advancement.

Intelligent resource management

As organizations increasingly provision large amounts of accelerated compute capacity for model training, they face challenges in effectively governing resource usage. These compute resources are both expensive and finite, making it crucial to prioritize critical model development tasks and avoid waste or underutilization. Without proper controls over task prioritization and resource allocation, some projects stall due to insufficient resources, while others leave resources underused. This creates a significant burden for administrators, who must constantly reallocate resources, and for data scientists, who struggle to maintain progress. These inefficiencies delay AI innovation and drive up costs.

SageMaker HyperPod addresses these challenges with its task governance capabilities, enabling you to maximize accelerator utilization for model training, fine-tuning, and inference. With just a few clicks, you can define task priorities and set limits on compute resource usage for teams. Once configured, SageMaker HyperPod automatically manages the task queue, making sure the most critical work receives the necessary resources. This reduction in operational overhead allows organizations to reallocate valuable human resources toward more innovative and strategic initiatives. This reduces model development costs by up to 40%.

For instance, if an inference task powering a customer-facing service requires urgent compute capacity but all resources are currently in use, SageMaker HyperPod reallocates underutilized or non-urgent resources to prioritize the critical task. Non-urgent tasks are automatically paused, checkpoints are saved to preserve progress, and these tasks resume seamlessly when resources become available. This makes sure you maximize your compute investments without compromising ongoing work.

As a fast-growing generative AI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team’s productivity, allowing them to bring new generative AI innovations to market faster than ever before.

At its core, SageMaker HyperPod represents a paradigm shift in AI infrastructure, moving beyond the traditional emphasis on raw computational power to focus on intelligent and adaptive resource management. By prioritizing optimized resource allocation, SageMaker HyperPod minimizes waste, maximizes efficiency, and accelerates innovation—all while reducing costs. This makes AI development more accessible and scalable for organizations of all sizes.

Get started faster with SageMaker HyperPod recipes

Many customers want to customize popular publicly available models, like Meta’s Llama and Mistral, for their specific use cases using their organization’s data. However, optimizing training performance often requires weeks of iterative testing—experimenting with algorithms, fine-tuning parameters, monitoring training impact, debugging issues, and benchmarking performance.

To simplify this process, SageMaker HyperPod now offers over 30 curated model training recipes for some of today’s most popular models, including DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral. These recipes enable you to get started in minutes by automating key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. This empowers users of all skill levels to achieve better price-performance for model training on AWS infrastructure from the outset, eliminating weeks of manual evaluation and testing.

You can browse the GitHub repo to explore available training recipes, customize parameters to fit your needs, and deploy in minutes. With a simple one-line change, you can seamlessly switch between GPU or AWS Trainium based instances to further optimize price-performance.

Researchers at Salesforce were looking for ways to quickly get started with foundation model (FM) training and fine-tuning, without having to worry about the infrastructure, or spend weeks optimizing their training stack for each new model. With SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Integrating Kubernetes with SageMaker HyperPod

Though the standalone capabilities of SageMaker HyperPod are impressive, its integration with Amazon EKS takes AI workloads to new levels of power and flexibility. Amazon EKS simplifies the deployment, scaling, and management of containerized applications, making it an ideal solution for orchestrating complex AI/ML infrastructure.

By running SageMaker HyperPod on Amazon EKS, organizations can use Kubernetes’s advanced scheduling and orchestration features to dynamically provision and manage compute resources for AI/ML workloads, providing optimal resource utilization and scalability.

“We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” says John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarization and classification.”

This integration also enhances fault tolerance and high availability. With self-healing capabilities, HyperPod automatically replaces failed nodes, maintaining workload continuity. Automated GPU health monitoring and seamless node replacement provide reliable execution of AI/ML workloads with minimal downtime, even during hardware failures.

Additionally, running SageMaker HyperPod on Amazon EKS enables efficient resource isolation and sharing using Kubernetes namespaces and resource quotas. Organizations can isolate different AI/ML workloads or teams while maximizing resource utilization across the cluster.
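
As a concrete illustration of namespace-level isolation on an EKS-backed cluster, the following minimal sketch, assuming the kubernetes Python client, a configured kubeconfig, and a hypothetical team-a namespace, caps the GPUs that one team’s workloads can request. This is generic Kubernetes tooling rather than a HyperPod-specific API.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (assumes access to the EKS cluster)
    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    # Cap the GPUs that workloads in the team-a namespace can request
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-a-gpu-quota", namespace="team-a"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.nvidia.com/gpu": "8",  # total GPUs requestable by team-a
                "limits.nvidia.com/gpu": "8",
            }
        ),
    )

    core_v1.create_namespaced_resource_quota(namespace="team-a", body=quota)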

Flexible training plans help meet timelines and budgets

Although infrastructure innovations help reduce costs and improve training efficiency, customers still face challenges in planning and managing the compute capacity needed to complete training tasks on time and within budget. To address this, AWS is introducing flexible training plans for SageMaker HyperPod.

With just a few clicks, you can specify your desired completion date and the maximum amount of compute resources needed. SageMaker HyperPod then helps acquire capacity and sets up clusters, saving teams weeks of preparation time. This eliminates much of the uncertainty customers encounter when acquiring large compute clusters for model development tasks.


SageMaker HyperPod training plans are now available in US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker pricing page.

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. SageMaker HyperPod flexible training plans made it straightforward for them to gain access to EC2 P5 instances.

Developers and data scientists at OpenBabylon, an AI company that customizes LLMs for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources to run large-scale experiments. Using the multi-node SageMaker HyperPod distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating the ability of SageMaker HyperPod to deliver complex projects on time and within budget.

Integrating training and inference infrastructures

A key focus area is integrating next-generation AI accelerators like the anticipated AWS Trainium2 release. These advanced accelerators promise unparalleled computational performance, offering 30–40% better price-performance than the current generation of GPU-based EC2 instances, significantly boosting AI model training and deployment efficiency and speed. This will be crucial for real-time applications and processing vast datasets simultaneously. The seamless accelerator integration with SageMaker HyperPod enables businesses to harness cutting-edge hardware advancements, driving AI initiatives forward.

Another pivotal aspect is that SageMaker HyperPod, through its integration with Amazon EKS, enables scalable inference solutions. As real-time data processing and decision-making demands grow, the SageMaker HyperPod architecture efficiently handles these requirements. This capability is essential across sectors like healthcare, finance, and autonomous systems, where timely, accurate AI inferences are critical. Offering scalable inference enables deploying high-performance AI models under varying workloads, enhancing operational effectiveness.

Moreover, integrating training and inference infrastructures represents a significant advancement, streamlining the AI lifecycle from development to deployment and providing optimal resource utilization throughout. Bridging this gap facilitates a cohesive, efficient workflow, reducing transition complexities from development to real-world applications. This holistic integration supports continuous learning and adaptation, which is key for next-generation, self-evolving AI models (continuously learning models, which possess the ability to adapt and refine themselves in real time based on their interactions with the environment).

SageMaker HyperPod uses established open source technologies, including MLflow integration through SageMaker, container orchestration through Amazon EKS, and Slurm workload management, providing users with familiar and proven tools for their ML workflows. By engaging the global AI community and encouraging knowledge sharing, SageMaker HyperPod continuously evolves, incorporating the latest research advancements. This collaborative approach helps SageMaker HyperPod remain at the forefront of AI technology, providing the tools to drive transformative change.

Conclusion

SageMaker HyperPod represents a fundamental change in AI infrastructure, offering a future-fit solution that empowers organizations to unlock the full potential of AI technologies. With its intelligent resource management, versatility, scalability, and forward-thinking design, SageMaker HyperPod enables businesses to accelerate innovation, reduce operational costs, and stay ahead of the curve in the rapidly evolving AI landscape.

Whether it’s optimizing the training of LLMs, processing complex datasets for medical imaging inference, or exploring novel AI architectures, SageMaker HyperPod provides a robust and flexible foundation for organizations to push the boundaries of what is possible in AI.

As AI continues to reshape industries and redefine what is possible, SageMaker HyperPod stands at the forefront, enabling organizations to navigate the complexities of AI workloads with unparalleled agility, efficiency, and innovation. With its commitment to continuous improvement, strategic partnerships, and alignment with emerging technologies, SageMaker HyperPod is poised to play a pivotal role in shaping the future of AI, empowering organizations to unlock new realms of possibility and drive transformative change.

Take the first step towards revolutionizing your AI initiatives by scheduling a consultation with our experts. Let us guide you through the process of harnessing the power of SageMaker HyperPod and unlock a world of possibilities for your business.


About the authors

Ilan Gleiser is a Principal GenAI Specialist on the AWS WWSO Frameworks team, focusing on developing scalable Artificial General Intelligence architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 20 blog posts and delivered more than 100 prototypes globally over the last 5 years. Ilan holds a Master’s degree in mathematical economics.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Shubha Kumbadakone is a Senior Manager on the AWS WWSO Frameworks team focusing on foundation model builders and self-managed machine learning, with an emphasis on open source software and tools. She has more than 19 years of experience in cloud infrastructure and machine learning and helps customers build their distributed training and inference at scale for their ML models on AWS. She also holds a patent on a caching algorithm for rapid resume from hibernation for mobile systems.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelor’s degree from the University of Virginia and is based in Boston, Massachusetts.

Revolutionizing clinical trials with the power of voice and AI

Revolutionizing clinical trials with the power of voice and AI

In the rapidly evolving healthcare landscape, patients often find themselves navigating a maze of complex medical information, seeking answers to their questions and concerns. However, accessing accurate and comprehensible information can be a daunting task, leading to confusion and frustration. This is where the integration of cutting-edge technologies, such as audio-to-text translation and large language models (LLMs), holds the potential to revolutionize the way patients receive, process, and act on vital medical information.

As the healthcare industry continues to embrace digital transformation, solutions that combine advanced technologies like audio-to-text translation and LLMs will become increasingly valuable in addressing key challenges, such as patient education, engagement, and empowerment. By taking advantage of these innovative technologies, healthcare providers can deliver more personalized, efficient, and effective care, ultimately improving patient outcomes and driving progress in the life sciences domain.

For instance, envision a voice-enabled virtual assistant that not only understands your spoken queries, but also transcribes them into text with remarkable accuracy. This transcription then serves as the input for a powerful LLM, which draws upon its vast knowledge base to provide personalized, context-aware responses tailored to your specific situation. This solution can transform the patient education experience, empowering individuals to make informed decisions about their healthcare journey.

In this post, we discuss possible use cases for combining speech recognition technology with LLMs, and how the solution can revolutionize clinical trials.

By combining speech recognition technology with LLMs, the solution can accurately transcribe a patient’s spoken queries into text, enabling the LLM to understand and analyze the context of the question. The LLM can then use its extensive knowledge base, which can be regularly updated with the latest medical research and clinical trial data, to provide relevant and trustworthy responses tailored to the patient’s specific situation.

Some of the potential benefits of this integrated approach are that patients can receive instant access to reliable information, empowering them to make more informed decisions about their healthcare. Additionally, the solution can help alleviate the burden on healthcare professionals by providing patients with a convenient and accessible source of information, freeing up valuable time for more critical tasks. Furthermore, the voice-enabled interface can enhance accessibility for patients with disabilities or those who prefer verbal communication, making sure that no one is left behind in the pursuit of better health outcomes.

Use cases overview

In this section, we discuss several possible use cases for this solution.

Use case 1: Audio-to-text translation and LLM integration for clinical trial patient interactions

In the domain of clinical trials, effective communication between patients and physicians is crucial for gathering accurate data, supporting patient adherence, and maintaining study integrity. This use case demonstrates how audio-to-text translation combined with LLM capabilities can streamline and enhance the process of capturing and analyzing patient-physician interactions during clinical trial visits and telemedicine sessions.

The process flow consists of the following steps:

  1. Audio capture – During patient visits or telemedicine sessions, the audio of the patient-physician interaction is recorded securely, with appropriate consent and privacy measures in place.
  2. Audio-to-text translation – The recorded audio is processed through an automatic speech recognition (ASR) system, which converts the audio into text transcripts. This step provides an accurate and efficient conversion of spoken words into a format suitable for further analysis.
  3. Text preprocessing – The transcribed text undergoes preprocessing steps, such as removing identifying information, formatting the data, and enforcing compliance with relevant data privacy regulations.
  4. LLM integration – The preprocessed text is fed into a powerful LLM tailored for the healthcare and life sciences (HCLS) domain. The LLM analyzes the text, identifying key information relevant to the clinical trial, such as patient symptoms, adverse events, medication adherence, and treatment responses.
  5. Intelligent insights and recommendations – Using its large knowledge base and advanced natural language processing (NLP) capabilities, the LLM provides intelligent insights and recommendations based on the analyzed patient-physician interaction. These insights can include:
    1. Potential adverse event detection and reporting.
    2. Identification of protocol deviations or non-compliance.
    3. Recommendations for personalized patient care or adjustments to treatment regimens.
    4. Extraction of relevant data points for electronic health records (EHRs) and clinical trial databases.
  6. Data integration and reporting – The extracted insights and recommendations are integrated into the relevant clinical trial management systems, EHRs, and reporting mechanisms. This streamlines the process of data collection, analysis, and decision-making for clinical trial stakeholders, including investigators, sponsors, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data accuracy – By accurately capturing and analyzing patient-physician interactions, this approach minimizes the risks of manual transcription errors and provides high-quality data for clinical trial analysis and decision-making.
  • Enhanced patient safety – The LLM’s ability to detect potential adverse events and protocol deviations can help identify and mitigate risks, improving patient safety and study integrity.
  • Personalized patient care – Using the LLM’s insights, physicians can provide personalized care recommendations, tailored treatment plans, and better manage patient adherence, leading to improved patient outcomes.
  • Streamlined data collection and analysis – Automating the process of extracting relevant data points from patient-physician interactions can significantly reduce the time and effort required for manual data entry and analysis, enabling more efficient clinical trial management.
  • Regulatory compliance – By integrating the extracted insights and recommendations into clinical trial management systems and EHRs, this approach facilitates compliance with regulatory requirements for data capture, adverse event reporting, and trial monitoring.

This use case demonstrates the potential of combining audio-to-text translation and LLM capabilities to enhance patient-physician communication, improve data quality, and support informed decision-making in the context of clinical trials. By using advanced technologies, this integrated approach can contribute to more efficient, effective, and patient-centric clinical research processes.
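
To make steps 2 and 3 of the preceding process flow more concrete, the following is a minimal sketch of how a recorded visit could be transcribed with PII redaction using Amazon Transcribe batch transcription. The bucket, key, and job names are illustrative, and the exact redaction settings would depend on your compliance requirements.

import boto3

transcribe = boto3.client("transcribe")

# Start a batch transcription job for one recorded visit and redact PII in the output.
# Bucket, key, and job names are illustrative.
transcribe.start_transcription_job(
    TranscriptionJobName="visit-interview-001",
    Media={"MediaFileUri": "s3://example-trial-bucket/visits/interview-001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="example-trial-bucket",
    OutputKey="transcripts/interview-001.json",
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
    },
)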

Use case 2: Intelligent site monitoring with audio-to-text translation and LLM capabilities

In the HCLS domain, site monitoring plays a crucial role in maintaining the integrity and compliance of clinical trials. Site monitors conduct on-site visits, interview personnel, and verify documentation to assess adherence to protocols and regulatory requirements. However, this process can be time-consuming and prone to errors, particularly when dealing with extensive audio recordings and voluminous documentation.

By integrating audio-to-text translation and LLM capabilities, we can streamline and enhance the site monitoring process, leading to improved efficiency, accuracy, and decision-making support.

The process flow consists of the following steps:

  1. Audio capture and transcription – During site visits, monitors record interviews with site personnel, capturing valuable insights and observations. These audio recordings are then converted into text using ASR and audio-to-text translation technologies.
  2. Document ingestion – Relevant site documents, such as patient records, consent forms, and protocol manuals, are digitized and ingested into the system.
  3. LLM-powered data analysis – The transcribed interviews and ingested documents are fed into a powerful LLM, which can understand and correlate the information from multiple sources. The LLM can identify key insights, potential issues, and areas of non-compliance by analyzing the content and context of the data.
  4. Case report form generation – Based on the LLM’s analysis, a comprehensive case report form (CRF) is generated, summarizing the site visit findings, identifying potential risks or deviations, and providing recommendations for corrective actions or improvements.
  5. Decision support and site selection – The CRFs and associated data can be further analyzed by the LLM to identify patterns, trends, and potential risks across multiple sites. This information can be used to support decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and data analysis processes, site monitors can save significant time and effort, allowing them to focus on more critical tasks and cover more sites within the same time frame.
  • Enhanced accuracy – LLMs can identify and correlate subtle patterns and nuances within the data, reducing the risk of overlooking critical information or making erroneous assumptions.
  • Comprehensive documentation – The generated CRFs provide a standardized and detailed record of site visits, facilitating better communication and collaboration among stakeholders.
  • Regulatory compliance – The LLM-powered analysis can help identify potential areas of non-compliance, enabling proactive measures to address issues and mitigate risks.
  • Informed decision-making – The insights derived from the LLM’s analysis can support data-driven decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

By combining audio-to-text translation and LLM capabilities, this integrated approach offers a powerful solution for intelligent site monitoring in the HCLS domain, supporting improved efficiency, accuracy, and decision-making while providing regulatory compliance and quality assurance.

Use case 3: Enhancing adverse event reporting in clinical trials with audio-to-text and LLMs

Clinical trials are crucial for evaluating the safety and efficacy of investigational drugs and therapies. Accurate and comprehensive adverse event reporting is essential for identifying potential risks and making informed decisions. By combining audio-to-text translation with LLM capabilities, we can streamline and augment the adverse event reporting process, leading to improved patient safety and more efficient clinical research.

The process flow consists of the following steps:

  1. Audio data collection – During clinical trial visits or follow-ups, audio recordings of patient-doctor interactions are captured, including detailed descriptions of adverse events or symptoms experienced by the participants. These audio recordings can be obtained through various channels, such as in-person visits, telemedicine consultations, or dedicated voice reporting systems.
  2. Audio-to-text transcription – The audio recordings are processed through an audio-to-text translation system, converting the spoken words into written text format. ASR and NLP techniques provide accurate transcription, accounting for factors like accents, background noise, and medical terminology.
  3. Text data integration – The transcribed text data is integrated with other sources of adverse event reporting, such as electronic case report forms (eCRFs), patient diaries, and medication logs. This comprehensive dataset provides a holistic view of the adverse events reported across multiple data sources.
  4. LLM analysis – The integrated dataset is fed into an LLM specifically trained on medical and clinical trial data. The LLM analyzes the textual data, identifying patterns, extracting relevant information, and generating insights related to adverse event occurrences, severity, and potential causal relationships.
  5. Intelligent reporting and decision support – The LLM generates detailed adverse event reports, highlighting key findings, trends, and potential safety signals. These reports can be presented to clinical trial teams, regulatory bodies, and safety monitoring committees, supporting informed decision-making processes. The LLM can also provide recommendations for further investigation, protocol modifications, or risk mitigation strategies based on the identified adverse event patterns.

The solution offers the following potential benefits:

  • Improved data capture – By using audio-to-text translation, valuable information from patient-doctor interactions can be captured and included in adverse event reporting, reducing the risk of missed or incomplete data.
  • Enhanced accuracy and completeness – The integration of multiple data sources, combined with the LLM’s analysis capabilities, provides a comprehensive and accurate understanding of adverse events, reducing the potential for errors or omissions.
  • Efficient data analysis – The LLM can rapidly process large volumes of textual data, identifying patterns and insights that might be difficult or time-consuming for human analysts to detect manually.
  • Timely decision support – Real-time adverse event reporting and analysis enable clinical trial teams to promptly identify and address potential safety concerns, mitigating risks and protecting participant well-being.
  • Regulatory compliance – Comprehensive adverse event reporting and detailed documentation facilitate compliance with regulatory requirements and support transparent communication with regulatory agencies.

By integrating audio-to-text translation with LLM capabilities, this approach addresses the critical need for accurate and timely adverse event reporting in clinical trials, ultimately enhancing patient safety, improving research efficiency, and supporting informed decision-making in the HCLS domain.
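
As a minimal sketch of the LLM analysis step (step 4 of the process flow above), the following example asks a Bedrock-hosted model to return reported adverse events as structured JSON using the Amazon Bedrock Converse API. The transcript text, prompt wording, and model ID are illustrative; the post does not prescribe a specific extraction prompt.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative transcript snippet from a trial interview.
transcript = "Patient reports mild headaches most mornings and some nausea after the evening dose."

# Ask the model to return reported adverse events as structured JSON.
response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{
        "role": "user",
        "content": [{
            "text": (
                "From the clinical trial interview transcript below, list every "
                "reported adverse event as a JSON array of objects with the keys "
                '"event" and "severity".\n\n'
                f"<transcript>{transcript}</transcript>"
            )
        }],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0},
)

print(response["output"]["message"]["content"][0]["text"])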

Use case 4: Audio-to-text and LLM integration for enhanced patient care

In the healthcare domain, effective communication and accurate data capture are crucial for providing personalized and high-quality care. By integrating audio-to-text translation capabilities with LLM technology, we can streamline processes and unlock valuable insights, ultimately improving patient outcomes.

The process flow consists of the following steps:

  1. Audio input collection – Caregivers or healthcare professionals can record audio updates on a patient’s condition, mood, or relevant observations using a secure and user-friendly interface. This could be done through mobile devices, dedicated recording stations, or during virtual consultations.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a speech-to-text engine, which converts the spoken words into text format. Advanced NLP techniques provide accurate transcription, handling accents, medical terminology, and background noise.
  3. Text processing and contextualization – The transcribed text is then fed into an LLM trained on various healthcare datasets, including medical literature, clinical guidelines, and deidentified patient records. The LLM processes the text, identifies key information, and extracts relevant context and insights.
  4. LLM-powered analysis and recommendations – Using its sizeable knowledge base and natural language understanding capabilities, the LLM can perform various tasks, such as:
    1. Identifying potential health concerns or risks based on the reported symptoms and observations.
    2. Suggesting personalized care plans or treatment options aligned with evidence-based practices.
    3. Providing recommendations for follow-up assessments, diagnostic tests, or specialist consultations.
    4. Flagging potential drug interactions or contraindications based on the patient’s medical history.
    5. Generating summaries or reports in a structured format for efficient documentation and communication.
  5. Integration with EHRs – The analyzed data and recommendations from the LLM can be seamlessly integrated into the patient’s EHR, providing a comprehensive and up-to-date medical profile. This enables healthcare professionals to access relevant information promptly and make informed decisions during consultations or treatment planning.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and analysis process, healthcare professionals can save time and focus on providing personalized care, rather than spending extensive hours on documentation and data entry.
  • Enhanced accuracy – ASR and NLP techniques provide accurate transcription, reducing errors and improving data quality.
  • Comprehensive patient insights – The LLM’s ability to process and contextualize unstructured audio data provides a more holistic understanding of the patient’s condition, enabling better-informed decision-making.
  • Personalized care plans – By using the LLM’s knowledge base and analytical capabilities, healthcare professionals can develop tailored care plans aligned with the patient’s specific needs and medical history.
  • Streamlined communication – Structured reports and summaries generated by the LLM facilitate efficient communication among healthcare teams, making sure everyone has access to the latest patient information.
  • Continuous learning and improvement – As more data is processed, the LLM can continuously learn and refine its recommendations, improving its performance over time.

By integrating audio-to-text translation and LLM capabilities, healthcare organizations can unlock new efficiencies, enhance patient-provider communication, and ultimately deliver superior care while staying at the forefront of technological advancements in the industry.

Use case 5: Audio-to-text translation and LLM integration for clinical trial protocol design

Efficient and accurate protocol design is crucial for successful study execution and regulatory compliance. By combining audio-to-text translation capabilities with the power of LLMs, we can streamline the protocol design process, using diverse data sources and AI-driven insights to create high-quality protocols in a timely manner.

The process flow consists of the following steps:

  1. Audio input collection – Clinical researchers, subject matter experts, and stakeholders provide audio inputs, such as recorded meetings, discussions, or interviews, related to the proposed clinical trial. These audio files can capture valuable insights, requirements, and domain-specific knowledge.
  2. Audio-to-text transcription – Using ASR technology, the audio inputs are converted into text transcripts with high accuracy. This step makes sure that valuable information is captured and transformed into a format suitable for further processing by LLMs.
  3. Data integration – Relevant data sources, such as previous clinical trial protocols, regulatory guidelines, scientific literature, and medical databases, are integrated into the workflow. These data sources provide contextual information and serve as a knowledge base for the LLM.
  4. LLM processing – The transcribed text, along with the integrated data sources, is fed into a powerful LLM. The LLM uses its knowledge base and NLP capabilities to analyze the inputs, identify key elements, and generate a draft clinical trial protocol.
  5. Protocol refinement and review – The draft protocol generated by the LLM is reviewed by clinical researchers, medical experts, and regulatory professionals. They provide feedback, make necessary modifications, and enforce compliance with relevant guidelines and best practices.
  6. Iterative improvement – As the AI system receives feedback and correlated outcomes from completed clinical trials, it continuously learns and refines its protocol design capabilities. This iterative process enables the LLM to become more accurate and efficient over time, leading to higher-quality protocol designs.

The solution offers the following potential benefits:

  • Efficiency – By automating the initial protocol design process, researchers can save valuable time and resources, allowing them to focus on more critical aspects of clinical trial execution.
  • Accuracy and consistency – LLMs can use vast amounts of data and domain-specific knowledge, reducing the risk of errors and providing consistency across protocols.
  • Knowledge integration – The ability to seamlessly integrate diverse data sources, including audio recordings, scientific literature, and regulatory guidelines, enhances the quality and comprehensiveness of the protocol design.
  • Continuous improvement – The iterative learning process allows the AI system to adapt and improve its protocol design capabilities based on real-world outcomes, leading to increasingly accurate and effective protocols over time.
  • Decision-making support – By providing well-structured and comprehensive protocols, the AI-driven approach enables better-informed decision-making for clinical researchers, sponsors, and regulatory bodies.

This integrated approach using audio-to-text translation and LLM capabilities has the potential to revolutionize the clinical trial protocol design process, ultimately contributing to more efficient and successful clinical trials, accelerating the development of life-saving treatments, and improving patient outcomes.

Use case 6: Voice-enabled clinical trial and disease information assistant

In the HCLS domain, effective communication and access to accurate information are crucial for patients, caregivers, and healthcare professionals. This use case demonstrates how audio-to-text translation combined with LLM capabilities can address these needs by providing an intelligent, voice-enabled assistant for clinical trial and disease information.

The process flow consists of the following steps:

  1. Audio input – The user, whether a patient, caregiver, or healthcare professional, can initiate the process by providing a voice query related to a specific disease or clinical trial. This could include questions about the disease itself, treatment options, ongoing trials, eligibility criteria, or other relevant information.
  2. Audio-to-text translation – The audio input is converted into text using state-of-the-art speech recognition technology. This step makes sure that the user’s query is accurately transcribed and ready for further processing by the LLM.
  3. Data integration – The system integrates various data sources, including clinical trial data, disease-specific information from reputable sources (such as PubMed or WebMD), and other relevant third-party resources. This comprehensive data integration makes sure that the LLM has access to a large knowledge base for generating accurate and comprehensive responses.
  4. LLM processing – The transcribed query is fed into the LLM, which uses its natural language understanding capabilities to comprehend the user’s intent and extract relevant information from the integrated data sources. The LLM can provide intelligent responses, insights, and recommendations based on the query and the available data.
  5. Response generation – The LLM generates a detailed, context-aware response addressing the user’s query. This response can be presented in various formats, such as text, audio (using text-to-speech technology), or a combination of both, depending on the user’s preferences and accessibility needs.
  6. Feedback and continuous improvement – The system can incorporate user feedback mechanisms to improve its performance over time. This feedback can be used to refine the LLM’s understanding, enhance the data integration process, and make sure that the system remains up to date with the latest clinical trial and disease information.

The solution offers the following potential benefits:

  • Improved access to information – By using voice input and NLP capabilities, the system empowers patients, caregivers, and healthcare professionals to access accurate and comprehensive information about diseases and clinical trials, regardless of their technical expertise or literacy levels.
  • Enhanced communication – The voice-enabled interface facilitates seamless communication between users and the system, enabling them to ask questions and receive responses in a conversational manner, mimicking human-to-human interaction.
  • Personalized insights – The LLM can provide personalized insights and recommendations based on the user’s specific query and context, enabling more informed decision-making and tailored support for individuals.
  • Time and efficiency gains – By automating the process of information retrieval and providing intelligent responses, the system can significantly reduce the time and effort required for healthcare professionals to manually search and synthesize information from multiple sources.
  • Improved patient engagement – By offering accessible and user-friendly access to disease and clinical trial information, the system can empower patients and caregivers to actively participate in their healthcare journey, fostering better engagement and understanding.

This use case highlights the potential of integrating audio-to-text translation with LLM capabilities to address real-world challenges in the HCLS domain. By using cutting-edge technologies, this solution can improve information accessibility, enhance communication, and support more informed decision-making for all stakeholders involved in clinical trials and disease management.
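
Step 5 of the process flow above mentions presenting responses as audio using text-to-speech. As one possible sketch, assuming Amazon Polly as the text-to-speech service (the post itself does not name one), an LLM answer could be converted to speech as follows; the answer text, voice, and file name are illustrative.

import boto3

polly = boto3.client("polly")

# Illustrative answer produced by the LLM in the previous step.
answer_text = (
    "Based on the trial's eligibility criteria, adults between 18 and 65 who have "
    "not previously received this class of treatment may qualify. Please confirm "
    "the details with the study coordinator."
)

# Convert the LLM's answer into speech so the assistant can reply by voice.
speech = polly.synthesize_speech(
    Text=answer_text,
    OutputFormat="mp3",
    VoiceId="Joanna",
)

with open("assistant_reply.mp3", "wb") as f:
    f.write(speech["AudioStream"].read())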

For demonstration purposes, we will focus on the following use case:

Use case overview: Patient reporting and analysis in clinical trials

In clinical trials, it’s crucial to gather accurate and comprehensive patient data to assess the safety and efficacy of investigational drugs or therapies. Traditional methods of collecting patient reports can be time-consuming, prone to errors, and might result in incomplete or inconsistent data. By combining audio-to-text translation with LLM capabilities, we can streamline the patient reporting process and unlock valuable insights to support decision-making.

The process flow consists of the following steps:

  1. Audio input – Patients participating in clinical trials can provide their updates, symptoms, and feedback through voice recordings using a mobile application or a dedicated recording device.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a cloud-based infrastructure, where they undergo automated transcription using ASR technology. The audio is converted into text, providing accurate and verbatim transcripts.
  3. Data consolidation – The transcribed patient reports are consolidated into a structured database, enabling efficient storage, retrieval, and analysis.
  4. LLM processing – The consolidated textual data is then processed by an LLM trained on biomedical and clinical trial data. The LLM can perform various tasks, including:
    1. Natural language processing – Extracting relevant information and identifying key symptoms, adverse events, or treatment responses from the patient reports.
    2. Sentiment analysis – Analyzing the emotional and psychological state of patients based on their language and tone, which can provide valuable insights into their overall well-being and treatment experience.
    3. Pattern recognition – Identifying recurring themes, trends, or anomalies across multiple patient reports, enabling early detection of potential safety concerns or efficacy signals.
    4. Knowledge extraction – Using the LLM’s understanding of biomedical concepts and clinical trial protocols to derive meaningful insights and recommendations from the patient data.
  5. Insights and reporting – The processed data and insights derived from the LLM are presented through interactive dashboards, visualizations, and reports. These outputs can be tailored to different stakeholders, such as clinical researchers, medical professionals, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data quality – By using audio-to-text transcription, the risk of errors and inconsistencies associated with manual data entry is minimized, providing high-quality patient data.
  • Time and cost-efficiency – Automated transcription and LLM-powered analysis can significantly reduce the time and resources required for data collection, processing, and analysis, leading to faster decision-making and cost savings.
  • Enhanced patient experience – Patients can provide their updates conveniently through voice recordings, reducing the burden of manual data entry and enabling more natural communication.
  • Comprehensive analysis – The combination of NLP, sentiment analysis, and pattern recognition capabilities offered by LLMs allows for a holistic understanding of patient experiences, treatment responses, and potential safety signals.
  • Regulatory compliance – Accurate and comprehensive patient data, coupled with robust analysis, can support compliance with regulatory requirements for clinical trial reporting and data documentation.

By integrating audio-to-text translation and LLM capabilities, clinical trial sponsors and research organizations can benefit from streamlined patient reporting, enhanced data quality, and powerful insights to support informed decision-making throughout the clinical development process.

Solution overview

The following diagram illustrates the solution architecture.

Solution overview: patient reporting and analysis in clinical trials

Key AWS services used in this solution include Amazon Simple Storage Service (Amazon S3), AWS HealthScribe, Amazon Transcribe, and Amazon Bedrock.

Prerequisites

This solution requires the following prerequisites:

Data samples

To illustrate the concept and provide a practical understanding, we have curated a collection of audio samples. These samples serve as representative examples, simulating site interviews conducted by researchers at clinical trial sites with patient participants.

The audio recordings offer a glimpse into the type of data typically encountered during such interviews. We encourage you to listen to these samples to gain a better appreciation of the data and its context.

These samples are for demonstration purposes only and don’t contain any real patient information or sensitive data. They are intended solely to provide a sample structure and format for the audio recordings used in this particular use case.

The following sample audio files are provided:

  • Site Interview 1
  • Site Interview 2
  • Site Interview 3
  • Site Interview 4
  • Site Interview 5

Prompt templates

Before deploying and running this solution, it’s essential to understand the input prompts and the anticipated output from the LLM. Although this is merely a sample, the potential outcomes can be expanded considerably by crafting more creative prompts.

We use the following input prompt template:

You are an expert medical research analyst for clinical trials of medicines.

You will be provided with a dictionary containing text transcriptions of clinical trial interviews conducted between patients and interviewers.

The dictionary keys represent the interview_id, and the values contain the interview transcripts.

<interview_transcripts>add_interview_transcripts</interview_transcripts>

Your task is to analyze all the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.
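
As a minimal sketch of how this template could be filled in and sent to Amazon Bedrock, the following example replaces the add_interview_transcripts placeholder with a dictionary of transcripts and invokes Anthropic’s Claude 3 Sonnet. The transcript contents are illustrative, and the request format assumes the Anthropic Messages API on Amazon Bedrock.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative dictionary of interview_id -> transcript text.
interview_transcripts = {
    "interview_1": "Interviewer: How have you felt since your last dose? Patient: ...",
    "interview_2": "Interviewer: Any side effects this week? Patient: ...",
}

prompt_template = (
    "You are an expert medical research analyst for clinical trials of medicines.\n"
    "You will be provided with a dictionary containing text transcriptions of clinical "
    "trial interviews conducted between patients and interviewers.\n"
    "The dictionary keys represent the interview_id, and the values contain the interview transcripts.\n"
    "<interview_transcripts>add_interview_transcripts</interview_transcripts>\n"
    "Your task is to analyze all the transcripts and generate a comprehensive report "
    "summarizing the key findings and conclusions from the clinical trial."
)

# Fill the placeholder with the serialized transcripts dictionary.
prompt = prompt_template.replace("add_interview_transcripts", json.dumps(interview_transcripts))

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}],
    }),
)

report = json.loads(response["body"].read())["content"][0]["text"]
print(report)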

The response from Amazon Bedrock will be similar to the following:

Based on the interview transcripts provided, here is a comprehensive report summarizing the key findings and conclusions from the clinical trial:

Introduction:

This report analyzes transcripts from interviews conducted with patients participating in a clinical trial for a new investigational drug. The interviews cover various aspects of the trial, including the informed consent process, randomization procedures, dosing schedules, follow-up visits, and patient experiences with potential side effects.

Key Findings:

1. Informed Consent Process:

– The informed consent process was thorough, with detailed explanations provided to patients about the trial’s procedures, potential risks, and benefits (Transcript 5).

– Patients were given ample time to review the consent documents, discuss them with family members, and have their questions addressed satisfactorily by the study team (Transcript 5).

– Overall, patients felt they fully understood the commitments and requirements of participating in the trial (Transcript 5).

2. Randomization and Blinding:

– Patients were randomized to either receive the investigational drug or a placebo, as part of a placebo-controlled study design (Transcript 2).

– The randomization process was adequately explained to patients, and they understood the rationale behind blinding, which is to prevent bias in the results (Transcript 2).

– Patients expressed acceptance of the possibility of receiving a placebo, recognizing its importance for the research (Transcript 2).

3. Dosing Schedule and Adherence:

– The dosing schedule involved taking the medication twice daily, in the morning and evening (Transcript 4).

– Some patients reported occasional difficulties in remembering the evening dose but implemented strategies like setting reminders on their phones to improve adherence (Transcript 4).

4. Follow-up Visits and Assessments:

– Follow-up visits were scheduled at specific intervals, such as 30 days, 3 months, and 6 months after the last dose (Transcripts 1 and 3).

– During these visits, various assessments were conducted, including blood tests, physical exams, ECGs, and evaluation of patient-reported outcomes like pain levels (Transcripts 1 and 3).

– Patients were informed that they would receive clinically significant findings from these assessments (Transcript 3).

5. Patient-Reported Side Effects:

– Some patients reported experiencing mild side effects, such as headaches, nausea, and joint pain improvement (Transcripts 3 and 4).

– The study team diligently documented and monitored these side effects, noting them in case report forms for further evaluation (Transcript 4).

6. Study Conduct and Communication:

– The study team provided 24/7 contact information, allowing patients to reach out with concerns between scheduled visits (Transcript 1).

– Patients were informed that they would receive information about the overall study results once available (Transcript 1).

– Patients were made aware of their ability to withdraw from the study at any time if they became uncomfortable (Transcript 2).

Conclusions:

Based on the interview transcripts, the clinical trial appears to have been conducted in a thorough and ethical manner, adhering to principles of informed consent, randomization, and blinding. Patients were adequately informed about the trial procedures, potential risks, and their rights as participants. The study team diligently monitored patient safety, documented adverse events, and maintained open communication channels. Overall, the transcripts suggest a well-managed clinical trial with a focus on patient safety, data integrity, and adherence to research protocols.

Deploy resources with AWS CloudFormation

To deploy the solution, use the provided AWS CloudFormation template.

Test the application

To test the application, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Locate your bucket starting with blog-hcls-assets-*.
  3. Navigate to the S3 prefix hcls-framework/samples-input-audio/. You will see sample audio files, which we reviewed earlier in this post.
  4. Select these files, and on the Actions menu, choose Copy.
  5. For Destination, choose Browse S3, navigate to the S3 path hcls-framework/input-audio/, and complete the copy (a programmatic alternative is shown after these steps).
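
Alternatively, the same copy can be done programmatically. The following is a minimal sketch using boto3; the bucket name is illustrative because your bucket name carries a random suffix.

import boto3

s3 = boto3.client("s3")
bucket = "blog-hcls-assets-example"  # replace with your actual bucket name

# Copy each sample audio object into the prefix that triggers the pipeline.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="hcls-framework/samples-input-audio/")
for obj in resp.get("Contents", []):
    dest_key = obj["Key"].replace("samples-input-audio", "input-audio", 1)
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": obj["Key"]},
        Key=dest_key,
    )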

Copying these sample files will trigger an S3 event that invokes the AWS Lambda function audio-to-text. To review the invocations of the Lambda function, open the AWS Lambda console, navigate to the audio-to-text function, and choose the Monitor tab, which contains detailed logs.
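
The post doesn’t include the audio-to-text function’s code, but a minimal sketch of such a handler might look like the following, assuming the audio files are MP3s transcribed with Amazon Transcribe and the results are written under hcls-framework/input-text/. The deployed function may differ in its details.

import boto3
from urllib.parse import unquote_plus

transcribe = boto3.client("transcribe")

def lambda_handler(event, context):
    # One S3 event record per audio file copied into hcls-framework/input-audio/.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        job_name = key.split("/")[-1].replace(".", "-")

        # Start a batch transcription job and write the result next to the other transcripts.
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp3",
            LanguageCode="en-US",
            OutputBucketName=bucket,
            OutputKey=f"hcls-framework/input-text/{job_name}.json",
        )
    return {"statusCode": 200}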

Review AWS Lambda execution logs

You can review the status of the Amazon Transcribe jobs on the Amazon Transcribe console.

At this point, the interview transcripts are ready. They should be available in Amazon S3 under the prefix hcls-framework/input-text/.

You can download a sample file and review its contents. The file is JSON, with the text transcript available under the key transcripts, along with other metadata.
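
A small sketch of pulling the transcript text out of one of these files is shown below; the structure assumed here is the standard Amazon Transcribe batch output, and the bucket and key names are illustrative.

import json
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(
    Bucket="blog-hcls-assets-example",  # replace with your actual bucket name
    Key="hcls-framework/input-text/site-interview-1-mp3.json",  # illustrative key
)
doc = json.loads(obj["Body"].read())

# Amazon Transcribe places the full transcript under results.transcripts.
transcript_text = doc["results"]["transcripts"][0]["transcript"]
print(transcript_text[:200])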

Now let’s run Anthropic’s Claude 3 Sonnet using the Lambda function hcls_clinical_trial_analysis to analyze the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

  1. On the Lambda console, navigate to the function named hcls_clinical_trial_analysis.
  2. Choose Test.
  3. If the console prompts you to create a new test event, do so with default or no input to the test event.
  4. Run the test event.

To review the output, open the Lambda console, navigate to the function named hcls_clinical_trial_analysis, and on the Monitor tab, choose View CloudWatch Logs. In the logs, you will see your comprehensive report on the clinical trial.

So far, we have completed a process involving:

  • Collecting audio interviews from clinical trials
  • Transcribing the audio to text
  • Compiling transcripts into a dictionary
  • Using Amazon Bedrock (Anthropic’s Claude 3 Sonnet) to generate a comprehensive summary

Although we focused on summarization, this approach can be extended to other applications such as sentiment analysis, extracting key learnings, identifying common complaints, and more.

Summary

Healthcare patients often find themselves in need of reliable information about their conditions, clinical trials, or treatment options. However, accessing accurate and up-to-date medical knowledge can be a daunting task. Our innovative solution integrates cutting-edge audio-to-text translation and LLM capabilities to revolutionize how patients receive vital healthcare information. By using speech recognition technology, we can accurately transcribe patients’ spoken queries, allowing our LLM to comprehend the context and provide personalized, evidence-based responses tailored to their specific needs. This empowers patients to make informed decisions, enhances accessibility for those with disabilities or preferences for verbal communication, and alleviates the workload on healthcare professionals, ultimately improving patient outcomes and driving progress in the HCLS domain.

Take charge of your healthcare journey with our innovative voice-enabled virtual assistant. Empower yourself with accurate and personalized information by simply asking your questions aloud. Our cutting-edge solution integrates speech recognition and advanced language models to provide reliable, context-aware responses tailored to your specific needs. Embrace the future of healthcare today and experience the convenience of instantaneous access to vital medical information.


About the Authors

Vrinda Dabke leads AWS Professional Services North America Delivery. Prior to joining AWS, Vrinda held a variety of leadership roles in Fortune 100 companies like UnitedHealth Group, The Hartford, Aetna, and Pfizer. Her work has focused on the areas of business intelligence, analytics, and AI/ML. She is a motivational people leader with experience in leading and managing high-performing global teams in complex matrix organizations.

Kannan Raman leads the North America Delivery for the AWS Professional Services Healthcare and Life Sciences practice. He has over 24 years of healthcare and life sciences experience and provides thought leadership in digital transformation. He works with C-level customer executives to help them with their digital transformation agenda.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.
