The future of quality assurance: Shift-left testing with QyrusAI and Amazon Bedrock

This post is co-written with Ameet Deshpande and Vatsal Saglani from Qyrus.

As businesses embrace accelerated development cycles to stay competitive, maintaining rigorous quality standards can pose a significant challenge. Traditional testing methods, which occur late in the development cycle, often result in delays, increased costs, and compromised quality.

Shift-left testing, which emphasizes earlier testing in the development process, aims to address these issues by identifying and resolving problems sooner. However, effectively implementing this approach requires the right tools. By using advanced AI models, QyrusAI improves testing throughout the development cycle—from generating test cases during the requirements phase to uncovering unexpected issues during application exploration.

In this post, we explore how QyrusAI and Amazon Bedrock are revolutionizing shift-left testing, enabling teams to deliver better software faster. Amazon Bedrock is a fully managed service that allows businesses to build and scale generative AI applications using foundation models (FMs) from leading AI providers. It enables seamless integration with AWS services, offering customization, security, and scalability without managing infrastructure.

QyrusAI: Intelligent testing agents powered by Amazon Bedrock

QyrusAI is a suite of AI-driven testing tools that enhances the software testing process across the entire software development lifecycle (SDLC). Using advanced large language models (LLMs) and vision-language models (VLMs) through Amazon Bedrock, QyrusAI provides a suite of capabilities designed to elevate shift-left testing. Let’s dive into each agent and the cutting-edge models that power them.

TestGenerator

TestGenerator generates initial test cases based on requirements using a suite of advanced models:

  • Meta’s Llama 70B – We use this model to generate test cases by analyzing requirements documents and understanding key entities, user actions, and expected behaviors. We use its in-context learning and natural language understanding capabilities to infer possible scenarios and edge cases, creating a comprehensive list of test cases that align with the given requirements.
  • Anthropic’s Claude 3.5 Sonnet – We use this model to evaluate the generated test scenarios, acting as a judge to assess if the scenarios are comprehensive and accurate. We also use it to highlight missing scenarios, potential failure points, or edge cases that might not be apparent in the initial phases. Additionally, we use it to rank test cases based on relevance, helping prioritize the most critical tests covering high-risk areas and key functionalities.
  • Cohere’s English Embed – We use this model to embed text from large documents such as requirement specifications, user stories, or functional requirement documents, enabling efficient semantic search and retrieval.
  • Pinecone on AWS Marketplace – Embedded documents are stored in Pinecone to enable fast and efficient retrieval. During test case generation, these embeddings are used as part of a ReAct agent approach—where the LLM thinks, observes, searches for specific or generic requirements in the document, and generates comprehensive test scenarios. 
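
To make the retrieval step concrete, the following is a minimal sketch of how requirement text might be embedded with Cohere’s Embed model on Amazon Bedrock and searched in Pinecone during test case generation. The index name, metadata fields, and prompt wiring are illustrative assumptions rather than QyrusAI’s actual implementation:

import json

import boto3
from pinecone import Pinecone

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder credentials
index = pc.Index("requirements-index")  # hypothetical index name

def embed(text: str) -> list[float]:
    """Embed a requirement snippet with Cohere Embed English on Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({"texts": [text], "input_type": "search_query"}),
    )
    return json.loads(response["body"].read())["embeddings"][0]

def search_requirements(query: str, top_k: int = 5) -> list[str]:
    """Retrieve the requirement chunks most relevant to the agent's current thought."""
    results = index.query(vector=embed(query), top_k=top_k, include_metadata=True)
    return [match.metadata["text"] for match in results.matches]

# The ReAct loop calls search_requirements() as its "observe" step and feeds the
# retrieved chunks back into the LLM prompt that drafts the test scenarios.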

The following diagram shows how TestGenerator is deployed on AWS using Amazon Elastic Container Service (Amazon ECS) tasks exposed through Application Load Balancer, using Amazon Bedrock, Amazon Simple Storage Service (Amazon S3), and Pinecone for embedding storage and retrieval to generate comprehensive test cases.

VisionNova

VisionNova is QyrusAI’s design test case generator that crafts design-based test cases using Anthropic’s Claude 3.5 Sonnet. The model is used to analyze design documents and generate precise, relevant test cases. This workflow specializes in understanding UX/UI design documents and translating visual elements into testable scenarios.
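
Conceptually, sending a design screen to Anthropic’s Claude 3.5 Sonnet through the Amazon Bedrock Converse API might look like the following sketch; the file name, prompt, and inference settings are illustrative and not VisionNova’s actual implementation:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("checkout_screen.png", "rb") as f:  # hypothetical design export
    image_bytes = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "List testable UI scenarios for this screen, covering field "
                     "validation, navigation, and error states."},
        ],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])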

The following diagram shows how VisionNova is deployed on AWS using ECS tasks exposed through Application Load Balancer, using Anthropic’s Claude 3 and Claude 3.5 Sonnet models on Amazon Bedrock for image understanding, and using Amazon S3 for storing images, to generate design-based test cases for validating UI/UX elements.

UXtract

UXtract is QyrusAI’s agentic workflow that converts Figma prototypes into test scenarios and steps based on the flow of screens in the prototype.

Figma prototype graphs are used to create detailed test cases with step-by-step instructions. The graph is analyzed to understand the different flows and make sure transitions between elements are validated. Anthropic’s Claude 3 Opus is used to process these transitions to identify potential actions and interactions, and Anthropic’s Claude 3.5 Sonnet is used to generate detailed test steps and instructions based on the transitions and higher-level objectives. This layered approach makes sure that UXtract captures both the functional accuracy of each flow and the granularity needed for effective testing.

The following diagram illustrates how UXtract uses ECS tasks, connected through Application Load Balancer, along with Amazon Bedrock models and Amazon S3 storage, to analyze Figma prototypes and create detailed, step-by-step test cases.

API Builder

API Builder creates virtualized APIs for early frontend testing by using various LLMs from Amazon Bedrock. These models interpret API specifications and generate accurate mock responses, facilitating effective testing before full backend implementation.
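
As a rough illustration of the idea (not API Builder’s actual implementation), an endpoint specification could be turned into a mock response with a single model call on Amazon Bedrock; the endpoint fields and prompt below are assumptions:

import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical fragment of a spec for an endpoint that is not yet implemented
endpoint_spec = {
    "path": "/orders/{orderId}",
    "method": "GET",
    "response_schema": {"orderId": "string", "status": "string", "items": "array"},
}

prompt = (
    "Generate a realistic JSON mock response for this API endpoint. "
    "Return only valid JSON.\n" + json.dumps(endpoint_spec)
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.3},
)

# The mock body can now be served to the frontend in place of the real backend
mock_body = json.loads(response["output"]["message"]["content"][0]["text"])
print(mock_body)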

The following diagram illustrates how API Builder uses ECS tasks, connected through Application Load Balancer, along with Amazon Bedrock models and Amazon S3 storage, to create a virtualized and highly scalable microservice with dynamic data provisioning using Amazon Elastic File System (Amazon EFS) on AWS Lambda compute.

QyrusAI offers a range of additional agents that further enhance the testing process:

  • Echo – Echo generates synthetic test data using a blend of Anthropic’s Claude 3 Sonnet, Mistral AI’s Mixtral 8x7B Instruct, and Meta’s Llama 70B to provide comprehensive testing coverage.
  • Rover and TestPilot – These multi-agent frameworks are designed for exploratory and objective-based testing, respectively. They use a combination of LLMs, VLMs, and embedding models from Amazon Bedrock to uncover and address issues effectively.
  • Healer – Healer tackles common test failures caused by locator issues by analyzing test scripts and their current state with various LLMs and VLMs to suggest accurate fixes.

These agents, powered by Amazon Bedrock, collaborate to deliver a robust, AI-driven shift-left testing strategy throughout the SDLC.

QyrusAI and Amazon Bedrock

At the core of QyrusAI’s integration with Amazon Bedrock is our custom-developed qai package, which builds upon aiobotocore, aioboto3, and boto3. This unified interface enables our AI agents to seamlessly access the diverse array of LLMs, VLMs, and embedding models available on Amazon Bedrock. The qai package is essential to our AI-powered testing ecosystem, offering several key benefits:

  • Consistent access – The package standardizes interactions with various models on Amazon Bedrock, providing uniformity across our suite of testing agents.
  • DRY principle – By centralizing Amazon Bedrock interaction logic, we’ve minimized code duplication and enhanced system maintainability, reducing the likelihood of errors.
  • Seamless updates – As Amazon Bedrock evolves and introduces new models or features, updating the qai package allows us to quickly integrate these advancements without altering each agent individually.
  • Specialized classes – The package includes distinct class objects for different model types (LLMs and VLMs) and families, optimizing interactions based on model requirements.
  • Out-of-the-box features – In addition to standard and streaming completions, the qai package offers built-in support for multiple and parallel function calling, providing a comprehensive set of capabilities.

Function calling and JSON mode were critical requirements for our AI workflows and agents. To maximize compatibility across the diverse array of models available on Amazon Bedrock, we implemented consistent interfaces for these features in our qai package. Because prompts for generating structured data can differ among LLMs and VLMs, specialized classes were created for various models and model families to provide consistent function calling and JSON mode capabilities. This approach provides a unified interface across the agents, streamlining interactions and enhancing overall efficiency.

The following code is a simplified overview of how we use the qai package to interact with LLMs and VLMs on Amazon Bedrock:

from qai import QAILLMs 

llm = QAILLMs() 

# can be taken from env or parameter store
provider = "Claude" 
model = "anthropic.claude-3-sonnet-20240229-v1:0" 

getattr(llm, provider).llm.__function_call__(model, messages, functions, tool_choice=None, max_tokens=2046)
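
The messages and functions arguments in the preceding snippet are not shown above; the following is a hypothetical illustration of the shape such payloads could take, not the exact schema used inside the qai package:

# Illustrative payloads only -- field names are assumptions for this example
messages = [
    {"role": "system", "content": "You are a test case generation assistant."},
    {"role": "user", "content": "Create test cases for the login flow."},
]

functions = [
    {
        "name": "record_test_case",
        "description": "Persist a generated test case.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "steps": {"type": "array", "items": {"type": "string"}},
                "priority": {"type": "string", "enum": ["high", "medium", "low"]},
            },
            "required": ["title", "steps"],
        },
    }
]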

The shift-left testing paradigm

Shift-left testing allows teams to catch issues sooner and reduce risk. Here’s how QyrusAI agents facilitate the shift-left approach:

  • Requirement analysis – TestGenerator AI generates initial test cases directly from the requirements, setting a strong foundation for quality from the start.
  • Design – VisionNova and UXtract convert Figma designs and prototypes into detailed test cases and functional steps.
  • Pre-implementation – This includes the following features:
    • API Builder creates virtualized APIs, enabling early frontend testing before the backend is fully developed.
    • Echo generates synthetic test data, allowing comprehensive testing without real data dependencies.
  • Implementation – Teams use the pre-generated test cases and virtualized APIs during development, providing continuous quality checks.
  • Testing – This includes the following features:
    • Rover, a multi-agent system, autonomously explores the application to uncover unexpected issues.
    • TestPilot conducts objective-based testing, making sure the application meets its intended goals.
  • Maintenance – QyrusAI supports ongoing regression testing with advanced test management, version control, and reporting features, providing long-term software quality.

The following diagram visually represents how QyrusAI agents integrate throughout the SDLC, from requirement analysis to maintenance, enabling a shift-left testing approach that makes sure issues are caught early and quality is maintained continuously.

AI Agent Integration

QyrusAI’s integrated approach makes sure that testing is proactive, continuous, and seamlessly aligned with every phase of the SDLC. With this approach, teams can:

  • Detect potential issues earlier in the process
  • Lower the cost of fixing bugs
  • Enhance overall software quality
  • Accelerate development timelines

This shift-left strategy, powered by QyrusAI and Amazon Bedrock, enables teams to deliver higher-quality software faster and more efficiently.

A typical shift-left testing workflow with QyrusAI

To make this more tangible, let’s walk through how QyrusAI and Amazon Bedrock can help create and refine test cases from a sample requirements document:

  • A user uploads a sample requirements document.
  • TestGenerator, powered by Meta’s Llama 3.1, processes the document and generates a list of high-level test cases.
  • These test cases are refined by Anthropic’s Claude 3.5 Sonnet to enforce coverage of key business rules.
  • VisionNova and UXtract use design documents from tools like Figma to generate step-by-step UI tests, validating key user journeys.
  • API Builder virtualizes APIs, allowing frontend developers to begin testing the UI with mock responses before the backend is ready.

By following these steps, teams can get ahead of potential issues, creating a safety net that improves both the quality and speed of software development.

The impact of AI-driven shift-left testing

Our data—collected from early adopters of QyrusAI—demonstrates the significant benefits of our AI-driven shift-left approach:

  • 80% reduction in defect leakage – Finding and fixing defects earlier results in fewer bugs reaching production
  • 20% reduction in UAT effort – Comprehensive testing early on means a more stable product reaching the user acceptance testing (UAT) phase
  • 36% faster time to market – Early defect detection, reduced rework, and more efficient testing lead to faster delivery

These metrics have been gathered through a combination of internal testing and pilot programs with select customers. The results consistently show that incorporating AI early in the SDLC can lead to a significant reduction in defects, development costs, and time to market.

Conclusion

Shift-left testing, powered by QyrusAI and Amazon Bedrock, is set to revolutionize the software development landscape. By integrating AI-driven testing across the entire SDLC—from requirements analysis to maintenance—QyrusAI helps teams:

  • Detect and fix issues early – Significantly cut development costs by identifying and resolving problems sooner
  • Enhance software quality – Achieve higher quality through thorough, AI-powered testing
  • Speed up development – Accelerate development cycles without sacrificing quality
  • Adapt to changes – Quickly adjust to evolving requirements and application structures

Amazon Bedrock provides the essential foundation with its advanced language and vision models, offering unparalleled flexibility and capability in software testing. This integration, along with seamless connectivity to other AWS services, enhances scalability, security, and cost-effectiveness.

As the software industry advances, the collaboration between QyrusAI and Amazon Bedrock positions teams at the cutting edge of AI-driven quality assurance. By adopting this shift-left, AI-powered approach, organizations can not only keep pace with today’s fast-moving digital world, but also set new benchmarks in software quality and development efficiency.

If you’re looking to revolutionize your software testing processes, we invite you to reach out to our team and learn more about QyrusAI. Let’s work together to build better software, faster.

To see how QyrusAI can enhance your development workflow, get in touch today at support@qyrus.com. Let’s redefine your software quality with AI-driven shift-left testing.


About the Authors

Ameet Deshpande is Head of Engineering at Qyrus and leads innovation in AI-driven, codeless software testing solutions. With expertise in quality engineering, cloud platforms, and SaaS, he blends technical acumen with strategic leadership. Ameet has spearheaded large-scale transformation programs and consulting initiatives for global clients, including top financial institutions. An electronics and communication engineer specializing in embedded systems, he brings a strong technical foundation to his leadership in delivering transformative solutions.

Vatsal Saglani is a Data Science and Generative AI Lead at Qyrus, where he builds generative AI-powered test automation tools and services using multi-agent frameworks, large language models, and vision-language models. With a focus on fine-tuning advanced AI systems, Vatsal accelerates software development by empowering teams to shift testing left, enhancing both efficiency and software quality.

Siddan Korbu is a Customer Delivery Architect with AWS. He works with enterprise customers to help them build AI/ML and generative AI solutions using AWS services.

Read More

Automate video insights for contextual advertising using Amazon Bedrock Data Automation

Contextual advertising, a strategy that matches ads with relevant digital content, has transformed digital marketing by delivering personalized experiences to viewers. However, implementing this approach for streaming video-on-demand (VOD) content poses significant challenges, particularly in ad placement and relevance. Traditional methods rely heavily on manual content analysis. For example, a content analyst might spend hours watching a romantic drama, placing an ad break right after a climactic confession scene, but before the resolution. Then, they manually tag the content with metadata such as romance, emotional, or family-friendly to verify appropriate ad matching. Although this manual process helps create a seamless viewer experience and maintains ad relevance, it proves highly impractical at scale.

Recent advancements in generative AI, particularly multimodal foundation models (FMs), demonstrate advanced video understanding capabilities and offer a promising solution to these challenges. We previously explored this potential in the post Media2Cloud on AWS Guidance: Scene and ad-break detection and contextual understanding for advertising using generative AI, where we demonstrated custom workflows using Amazon Titan Multimodal embeddings G1 models and Anthropic’s Claude FMs from Amazon Bedrock. In this post, we’re introducing an even simpler way to build contextual advertising solutions.

Amazon Bedrock Data Automation (BDA) is a new managed feature powered by FMs in Amazon Bedrock. BDA extracts structured outputs from unstructured content—including documents, images, video, and audio—while alleviating the need for complex custom workflows. In this post, we demonstrate how BDA automatically extracts rich video insights such as chapter segments and audio segments, detects text in scenes, and classifies content into Interactive Advertising Bureau (IAB) taxonomies, and then show how to use these insights to build a nonlinear ads solution that enhances contextual advertising effectiveness. A sample Jupyter notebook is available in the following GitHub repository.

Solution overview

Nonlinear ads are digital video advertisements that appear simultaneously with the main video content without interrupting playback. These ads are displayed as overlays, graphics, or rich media elements on top of the video player, typically appearing at the bottom of the screen. The following screenshot is an illustration of the nonlinear ads solution we will implement in this post.

Example of an overlay ad in the lower third of a video player

The following diagram presents an overview of the architecture and its key components.

The workflow is as follows:

  1. Users upload videos to Amazon Simple Storage Service (Amazon S3).
  2. Each new video invokes an AWS Lambda function that triggers BDA for video analysis. An asynchronous job runs to analyze the video.
  3. The analysis output is stored in an output S3 bucket.
  4. The downstream system (AWS Elemental MediaTailor) can consume the chapter segmentation, contextual insights, and metadata (such as IAB taxonomy) to drive better ad decisions in the video.

For simplicity in our notebook example, we provide a dictionary that maps the metadata to a set of local ad inventory files to be displayed with the video segments. This simulates how MediaTailor interacts with content manifest files and requests replacement ads from the Ad Decision Service.

Prerequisites

The following prerequisites are needed to run the notebooks and follow along with the examples in this post:

Video analysis using BDA

Thanks to BDA, processing and analyzing videos has become significantly simpler. The workflow consists of three main steps: creating a project, invoking the analysis, and retrieving analysis results. The first step—creating a project—establishes a reusable configuration template for your analysis tasks. Within the project, you define the types of analyses you want to perform and how you want the results structured. To create a project, use the create_data_automation_project API from the BDA boto3 client. This function returns a dataAutomationProjectArn, which you will need to include with each runtime invocation.

{
    'projectArn': 'string',
    'projectStage': 'DEVELOPMENT'|'LIVE',
    'status': 'COMPLETED'|'IN_PROGRESS'|'FAILED'
}
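
For reference, a minimal project creation call might look like the following sketch. The project name is hypothetical, and the standardOutputConfiguration for video is abbreviated; refer to the BDA documentation or the sample notebook for the complete structure:

import boto3

bda_client = boto3.client("bedrock-data-automation", region_name="us-east-1")

response = bda_client.create_data_automation_project(
    projectName="contextual-ads-video-analysis",  # hypothetical project name
    projectDescription="Video insights for nonlinear ad placement",
    projectStage="DEVELOPMENT",
    standardOutputConfiguration={
        "video": {
            # Extraction and generative field options abbreviated for brevity
            "extraction": {
                "category": {"state": "ENABLED"},
                "boundingBox": {"state": "ENABLED"},
            },
            "generativeField": {"state": "ENABLED"},
        }
    },
)

dataAutomationProjectArn = response["projectArn"]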

Upon project completion (status: COMPLETED), you can use the invoke_data_automation_async API from the BDA runtime client to start video analysis. This API requires input/output S3 locations and a cross-Region profile ARN in your request. BDA requires cross-Region inference support for all file processing tasks, automatically selecting the optimal AWS Region within your geography to maximize compute resources and model availability. This mandatory feature helps provide optimal performance and customer experience at no additional cost. You can also optionally configure Amazon EventBridge notifications for job tracking (for more details, see Tutorial: Send an email when events happen using Amazon EventBridge). After it’s triggered, the process immediately returns a job ID while continuing processing in the background.

default_profile_arn = "arn:aws:bedrock:{region}:{account_id}:data-automation-profile/us.data-automation-v1"

response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': f's3://{data_bucket}/{s3_key}'
    },
    outputConfiguration={
        's3Uri': f's3://{data_bucket}/{output_prefix}'
    },
    dataAutomationConfiguration={
        'dataAutomationProjectArn': dataAutomationProjectArn,
        'stage': 'DEVELOPMENT'
    },
    notificationConfiguration={
        'eventBridgeConfiguration': {
            'eventBridgeEnabled': False
        }
    },
    dataAutomationProfileArn=default_profile_arn
)
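
Because the invocation is asynchronous, the job needs to be polled (or tracked through EventBridge) before reading results from the output bucket. The following sketch assumes the get_data_automation_status API of the BDA runtime client; the status strings shown are illustrative of its terminal states:

import time

invocation_arn = response["invocationArn"]

# Poll until the asynchronous analysis reaches a terminal state
while True:
    status_response = bda_runtime_client.get_data_automation_status(
        invocationArn=invocation_arn
    )
    status = status_response["status"]
    if status not in ("Created", "InProgress"):
        break
    time.sleep(10)

if status == "Success":
    # Points to the job metadata, which references the per-modality result files
    print(status_response["outputConfiguration"]["s3Uri"])
else:
    print(f"BDA job ended with status: {status}")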

BDA standard outputs for video

Let’s explore the outputs from BDA for video analysis. Understanding these outputs is essential to understand what type of insights BDA provides and how to use them to build our contextual advertising solution. The following diagram illustrates the key components of a video, each of which defines a granularity level at which you can analyze the video content.

The key components are as follows:

  • Frame – A single still image that creates the illusion of motion when displayed in rapid succession with other frames in a video.
  • Shot – A continuous series of frames recorded from the moment the camera starts rolling until it stops.
  • Chapter – A sequence of shots that forms a coherent unit of action or narrative within the video, or a continuous conversation topic. BDA determines chapter boundaries by first classifying the video as either visually heavy (such as movies or episodic content) or audio heavy (such as news or presentations). Based on this classification, it then decides whether to establish boundaries using visual-based shot sequences or audio-based conversation topics.
  • Video – The complete content that enables analysis at the full video level.

Video-level analysis

Now that we defined the video granularity terms, let’s examine the insights BDA provides. At the full video level, BDA generates a comprehensive summary that delivers a concise overview of the video’s key themes and main content. The system also includes speaker identification, a process that attempts to derive speakers’ names based on audible cues (for example, “I’m Jane Doe”) or visual cues on the screen whenever possible. To illustrate this capability, we can examine the following full video summary that BDA generated for the short film Meridian:

In a series of mysterious disappearances along a stretch of road above El Matador Beach, three seemingly unconnected men vanished without a trace. The victims – a school teacher, an insurance salesman, and a retiree – shared little in common except for being divorced, with no significant criminal records or ties to criminal organizations…Detective Sullivan investigates the cases, initially dismissing the possibility of suicide due to the absence of bodies. A key breakthrough comes from a credible witness who was walking his dog along the bluffs on the day of the last disappearance. The witness described seeing a man atop a massive rock formation at the shoreline, separated from the mainland. The man appeared to be searching for something or someone when suddenly, unprecedented severe weather struck the area with thunder and lightning….The investigation takes another turn when Captain Foster of the LAPD arrives at the El Matador location, discovering that Detective Sullivan has also gone missing. The case becomes increasingly complex as the connection between the disappearances, the mysterious woman, and the unusual weather phenomena remains unexplained.

Along with the summary, BDA generates a complete audio transcript that includes speaker identification. This transcript captures the spoken content while noting who is speaking throughout the video. The following is an example of a transcript generated by BDA from the Meridian short film:

[spk_0]: So these guys just disappeared.
[spk_1]: Yeah, on that stretch of road right above El Matador. You know it. With the big rock. That’s right, yeah.
[spk_2]: You know, Mickey Cohen used to take his associates out there, get him a bond voyage.

Chapter-level analysis

BDA performs detailed analysis at the chapter level by generating comprehensive chapter summaries. Each chapter summary includes specific start and end timestamps to precisely mark the chapter’s duration. Additionally, when relevant, BDA applies IAB categories to classify the chapter’s content. These IAB categories are part of a standardized classification system created for organizing and mapping publisher content, which serves multiple purposes, including advertising targeting, internet security, and content filtering. The following example demonstrates a typical chapter-level analysis:

[00:00:20;04 – 00:00:23;01] Automotive, Auto Type
The video showcases a vintage urban street scene from the mid-20th century. The focal point is the Florentine Gardens building, an ornate structure with a prominent sign displaying “Florentine GARDENS” and “GRUEN Time”. The building’s facade features decorative elements like columns and arched windows, giving it a grand appearance. Palm trees line the sidewalk in front of the building, adding to the tropical ambiance. Several vintage cars are parked along the street, including a yellow taxi cab and a black sedan. Pedestrians can be seen walking on the sidewalk, contributing to the lively atmosphere. The overall scene captures the essence of a bustling city environment during that era.

For a comprehensive list of supported IAB taxonomy categories, see Videos.

Also at the chapter level, BDA produces detailed audio transcriptions with precise timestamps for each spoken segment. These granular transcriptions are particularly useful for closed captioning and subtitling tasks. The following is an example of a chapter-level transcription:

[26.85 – 29.59] So these guys just disappeared.
[30.93 – 34.27] Yeah, on that stretch of road right above El Matador.
[35.099 – 35.959] You know it.
[36.49 – 39.029] With the big rock. That’s right, yeah.
[40.189 – 44.86] You know, Mickey Cohen used to take his associates out there, get him a bond voyage.

Shot- and frame-level insights

At a more granular level, BDA provides frame-accurate timestamps for shot boundaries. The system also performs text detection and logo detection on individual frames, generating bounding boxes around detected text and logos along with confidence scores for each detection. The following image is an example of text bounding boxes extracted from the Meridian video.

Contextual advertising solution

Let’s apply the insights extracted from BDA to power nonlinear ad solutions. Unlike traditional linear advertising that relies on predetermined time slots, nonlinear advertising enables dynamic ad placement based on content context. At the chapter level, BDA automatically segments videos and provides detailed insights including content summaries, IAB categories, and precise timestamps. These insights serve as intelligent markers for ad placement opportunities, allowing advertisers to target specific chapters that align with their promotional content.

In this example, we prepared a list of ad images and mapped them each to specific IAB categories. When BDA identifies IAB categories at the chapter level, the system automatically matches and selects the most relevant ad from the list to display as an overlay banner during that chapter. In the following example, when BDA identifies a scene with a car driving on a country road (IAB category: Automotive, Travel), the system selects and displays a suitcase at an airport from the pre-mapped ad database. This automated matching process promotes precise ad placement while maintaining optimal viewer experience.
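
A simplified version of this matching logic is sketched below; the category-to-creative mapping and the field names are hypothetical stand-ins for the dictionary used in the sample notebook:

# Hypothetical mapping from IAB categories to local ad creatives -- a production
# system would request replacement ads from an ad decision service instead
AD_INVENTORY = {
    "Automotive": "ads/car_dealership_banner.png",
    "Travel": "ads/airline_banner.png",
    "Food & Drink": "ads/restaurant_banner.png",
}
DEFAULT_AD = "ads/house_ad.png"

def select_overlay_ad(chapter: dict) -> str:
    """Pick the first ad creative whose IAB category matches the chapter."""
    for category in chapter.get("iab_categories", []):
        if category in AD_INVENTORY:
            return AD_INVENTORY[category]
    return DEFAULT_AD

# Example: a chapter BDA tagged as Automotive and Travel gets the automotive banner
chapter = {"start": "00:00:20;04", "end": "00:00:23;01",
           "iab_categories": ["Automotive", "Travel"]}
print(select_overlay_ad(chapter))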

Example of an overlay ad in the lower third of a video player

Clean up

Follow the instructions in the cleanup section of the notebook to delete the projects and resources provisioned to avoid unnecessary charges. Refer to Amazon Bedrock pricing for details regarding BDA cost.

Conclusion

Amazon Bedrock Data Automation, powered by foundation models from Amazon Bedrock, marks a significant advancement in video analysis. BDA minimizes the complex orchestration layers previously required for extracting deep insights from video content, transforming what was once a sophisticated technical challenge into a streamlined, managed solution. This breakthrough empowers media companies to deliver more engaging, personalized advertising experiences while significantly reducing operational overhead. We encourage you to explore the sample Jupyter notebook provided in the GitHub repository to experience BDA firsthand and discover additional BDA use cases across other modalities in the following resources:


About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Alex Burkleaux is a Senior AI/ML Specialist Solution Architect at AWS. She helps customers use AI Services to build media solutions using Generative AI. Her industry experience includes over-the-top video, database management systems, and reliability engineering.

Read More

How Salesforce achieves high-performance model deployment with Amazon SageMaker AI

This post is a joint collaboration between Salesforce and AWS and is being cross-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.

The Salesforce AI Model Serving team is working to push the boundaries of natural language processing and AI capabilities for enterprise applications. Their key focus areas include optimizing large language models (LLMs) by integrating cutting-edge solutions, collaborating with leading technology providers, and driving performance enhancements that impact Salesforce’s AI-driven features. The AI Model Serving team supports a wide range of models for both traditional machine learning (ML) and generative AI including LLMs, multi-modal foundation models (FMs), speech recognition, and computer vision-based models. Through innovation and partnerships with leading technology providers, this team enhances performance and capabilities, tackling challenges such as throughput and latency optimization and secure model deployment for real-time AI applications. They accomplish this through evaluation of ML models across multiple environments and extensive performance testing to achieve scalability and reliability for inferencing on AWS.

The team is responsible for the end-to-end process of gathering requirements and performance objectives, hosting, optimizing, and scaling AI models, including LLMs, built by Salesforce’s data science and research teams. This includes optimizing the models to achieve high throughput and low latency and deploying them quickly through automated, self-service processes across multiple AWS Regions.

In this post, we share how the AI Model Serving team achieved high-performance model deployment using Amazon SageMaker AI.

Key challenges

The team faces several challenges in deploying models for Salesforce. An example would be balancing latency and throughput while achieving cost-efficiency when scaling these models based on demand. Maintaining performance and scalability while minimizing serving costs is vital across the entire inference lifecycle. Inference optimization is a crucial aspect of this process, because the model and their hosting environment must be fine-tuned to meet price-performance requirements in real-time AI applications. Salesforce’s fast-paced AI innovation requires the team to constantly evaluate new models (proprietary, open source, or third-party) across diverse use cases. They then have to quickly deploy these models to stay in cadence with their product teams’ go-to-market motions. Finally, the models must be hosted securely, and customer data must be protected to abide by Salesforce’s commitment to providing a trusted and secure platform.

Solution overview

To support such a critical function for Salesforce AI, the team developed a hosting framework on AWS to simplify their model lifecycle, allowing them to quickly and securely deploy models at scale while optimizing for cost. The following diagram illustrates the solution workflow.

Managing performance and scalability

Managing scalability in the project involves balancing performance with efficiency and resource management. With SageMaker AI, the team supports distributed inference and multi-model deployments, preventing memory bottlenecks and reducing hardware costs. SageMaker AI provides access to advanced GPUs, supports multi-model deployments, and enables intelligent batching strategies to balance throughput with latency. This flexibility makes sure performance improvements don’t compromise scalability, even in high-demand scenarios. To learn more, see Revolutionizing AI: How Amazon SageMaker Enhances Einstein’s Large Language Model Latency and Throughput.

Accelerating development with SageMaker Deep Learning Containers

SageMaker AI Deep Learning Containers (DLCs) play a crucial role in accelerating model development and deployment. These pre-built containers come with optimized deep learning frameworks and best-practice configurations, providing a head start for AI teams. DLCs provide optimized library versions, preconfigured CUDA settings, and other performance enhancements that improve inference speeds and efficiency. This significantly reduces the setup and configuration overhead, allowing engineers to focus on model optimization rather than infrastructure concerns.

Best practice configurations for deployment in SageMaker AI

A key advantage of using SageMaker AI is the best practice configurations for deployment. SageMaker AI provides default parameters for setting GPU utilization and memory allocation, which simplifies the process of configuring high-performance inference environments. These features make it straightforward to deploy optimized models with minimal manual intervention, providing high availability and low-latency responses.

The team uses the DLC’s rolling-batch capability, which optimizes request batching to maximize throughput while maintaining low latency. SageMaker AI DLCs expose configurations for rolling batch inference with best-practice defaults, simplifying the implementation process. By adjusting parameters such as max_rolling_batch_size and job_queue_size, the team was able to fine-tune performance without extensive custom engineering. This streamlined approach provides optimal GPU utilization while maintaining real-time response requirements.
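
As an illustration, a DJL-Serving configuration exposing these knobs might look like the following serving.properties sketch; the values are examples, not Salesforce’s production settings:

engine=MPI
option.model_id=<model artifact location or Hugging Face model ID>
option.rolling_batch=auto
option.max_rolling_batch_size=64
option.tensor_parallel_degree=4
job_queue_size=1000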

SageMaker AI provides elastic load balancing, instance scaling, and real-time model monitoring, and provides Salesforce control over scaling and routing strategies to suit their needs. These measures maintain consistent performance across environments while optimizing scalability, performance, and cost-efficiency.

Because the team supports multiple simultaneous deployments across projects, they needed to make sure enhancements in each project didn’t compromise others. To address this, they adopted a modular development approach. The SageMaker AI DLC architecture is designed with modular components such as the engine abstraction layer, model store, and workload manager. This structure allows the team to isolate and optimize individual components on the container, like rolling batch inference for throughput, without disrupting critical functionality such as latency or multi-framework support. This allows project teams to work on individual projects such as performance tuning while allowing others to focus on enabling other functionalities such as streaming in parallel.

This cross-functional collaboration is complemented by comprehensive testing. The Salesforce AI model team implemented continuous integration (CI) pipelines using a mix of internal and external tools such as Jenkins and Spinnaker to detect any unintended side effects early. Regression testing made sure that optimizations, such as deploying models with TensorRT or vLLM, didn’t negatively impact scalability or user experience. Regular reviews, involving collaboration between the development, foundation model operations (FMOps), and security teams, made sure that optimizations aligned with project-wide objectives.

Configuration management is also part of the CI pipeline. To be precise, configuration is stored in git alongside inference code. Configuration management using simple YAML files enabled rapid experimentation across optimizers and hyperparameters without altering the underlying code. These practices made sure that performance or security improvements were well-coordinated and didn’t introduce trade-offs in other areas.

Maintaining security through rapid deployment

Balancing rapid deployment with high standards of trust and security requires embedding security measures throughout the development and deployment lifecycle. Secure-by-design principles are adopted from the outset, making sure that security requirements are integrated into the architecture. Rigorous testing of all models is conducted in development environments alongside performance testing to provide scalable performance and security before production.

To maintain these high standards throughout the development process, the team employs several strategies:

  • Automated continuous integration and delivery (CI/CD) pipelines with built-in checks for vulnerabilities, compliance validation, and model integrity
  • Employing DJL-Serving’s encryption mechanisms for data in transit and at rest
  • Using AWS services like SageMaker AI that provide enterprise-grade security features such as role-based access control (RBAC) and network isolation

Frequent automated testing for both performance and security is employed through small incremental deployments, allowing for early issue identification while minimizing risks. Collaboration with cloud providers and continuous monitoring of deployments maintain compliance with the latest security standards and make sure rapid deployment aligns seamlessly with robust security, trust, and reliability.

Focus on continuous improvement

As Salesforce’s generative AI needs scale, and with the ever-changing model landscape, the team continually works to improve their deployment infrastructure—ongoing research and development efforts are centered on enhancing the performance, scalability, and efficiency of LLM deployments. The team is exploring new optimization techniques with SageMaker, including:

  • Advanced quantization methods (INT-4, AWQ, FP8)
  • Tensor parallelism (splitting tensors across multiple GPUs)
  • More efficient batching using caching strategies within DJL-Serving to boost throughput and reduce latency

The team is also investigating emerging technologies like AWS AI chips (AWS Trainium and AWS Inferentia) and AWS Graviton processors to further improve cost and energy efficiency. Collaboration with open source communities and public cloud providers like AWS makes sure that the latest advancements are incorporated into deployment pipelines while also pushing the boundaries further. Salesforce is collaborating with AWS to include advanced features into DJL, which makes the usage even better and more robust, such as additional configuration parameters, environment variables, and more granular metrics for logging. A key focus is refining multi-framework support and distributed inference capabilities to provide seamless model integration across various environments.

Efforts are also underway to enhance FMOps practices, such as automated testing and deployment pipelines, to expedite production readiness. These initiatives aim to stay at the forefront of AI innovation, delivering cutting-edge solutions that align with business needs and meet customer expectations. They are in close collaboration with the SageMaker team to continue to explore potential features and capabilities to support these areas.

Conclusion

Though exact metrics vary by use case, the Salesforce AI Model Serving team saw substantial improvements in terms of deployment speed and cost-efficiency with their strategy on SageMaker AI. They experienced faster iteration cycles, measured in days or even hours instead of weeks. With SageMaker AI, they reduced their model deployment time by as much as 50%.

To learn more about how SageMaker AI enhances Einstein’s LLM latency and throughput, see Revolutionizing AI: How Amazon SageMaker Enhances Einstein’s Large Language Model Latency and Throughput. For more information on how to get started with SageMaker AI, refer to Guide to getting set up with Amazon SageMaker AI.


About the authors

Sai Guruju is working as a Lead Member of Technical Staff at Salesforce. He has over 7 years of experience in software and ML engineering with a focus on scalable NLP and speech solutions. He completed his Bachelor’s of Technology in EE from IIT-Delhi, and has published his work at InterSpeech 2021 and AdNLP 2024.

Nitin Surya is working as a Lead Member of Technical Staff at Salesforce. He has over 8 years of experience in software and machine learning engineering, completed his Bachelor’s of Technology in CS from VIT University, with an MS in CS (with a major in Artificial Intelligence and Machine Learning) from the University of Illinois Chicago. He has three patents pending, and has published and contributed to papers at the CoRL Conference.

Srikanta Prasad is a Senior Manager in Product Management specializing in generative AI solutions, with over 20 years of experience across semiconductors, aerospace, aviation, print media, and software technology. At Salesforce, he leads model hosting and inference initiatives, focusing on LLM inference serving, LLMOps, and scalable AI deployments. Srikanta holds an MBA from the University of North Carolina and an MS from the National University of Singapore.

Rielah De Jesus is a Principal Solutions Architect at AWS who has successfully helped various enterprise customers in the DC, Maryland, and Virginia area move to the cloud. In her current role, she acts as a customer advocate and technical advisor focused on helping organizations like Salesforce achieve success on the AWS platform. She is also a staunch supporter of women in IT and is very passionate about finding ways to creatively use technology and data to solve everyday challenges.

Read More

Automate Amazon EKS troubleshooting using an Amazon Bedrock agentic workflow

As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face increasing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfiguration can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents that interface with individual observability signals and a CI/CD workflow to orchestrate and perform tasks based on user prompt.

In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents—deriving insights from K8sGPT and performing actions through the ArgoCD framework—you can build a comprehensive automation that identifies, analyzes, and resolves cluster issues with minimal human intervention.

Solution overview

The architecture consists of the following core components:

  • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context while routing user prompts to specialized agents, managing multistep operations and agent interactions
  • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events through K8sGPT’s Analyze API for security issues, misconfigurations, and performance problems, providing remediation suggestions in natural language
  • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

The following diagram illustrates the solution architecture.

Architecture Diagram

Prerequisites

You need to have the following prerequisites in place:

Set up the Amazon EKS cluster with K8sGPT and ArgoCD

We start with installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.

The K8sGPT operator will help with enabling AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This powerful integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.

  1. Create the necessary namespaces in Amazon EKS:
    kubectl create ns helm-guestbook
    kubectl create ns k8sgpt-operator-system
  2. Add the k8sgpt Helm repository and install the operator:
    helm repo add k8sgpt https://charts.k8sgpt.ai/
    helm repo update
    helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
      --namespace k8sgpt-operator-system
  3. You can verify the installation by entering the following command:
    kubectl get pods -n k8sgpt-operator-system
    
    NAME                                                          READY   STATUS    RESTARTS  AGE
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d
    

After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) will have the large language model (LLM) configuration that will aid in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends to help in AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.

  1. Create an EKS Pod Identity association so that the K8sGPT service account in the cluster can access Amazon Bedrock:
    eksctl create podidentityassociation  --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt  --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess  --region $AWS_REGION
  2. Configure the K8sGPT CRD:
    cat << EOF > k8sgpt.yaml
    apiVersion: core.k8sgpt.ai/v1alpha1
    kind: K8sGPT
    metadata:
      name: k8sgpt-bedrock
      namespace: k8sgpt-operator-system
    spec:
      ai:
        enabled: true
        model: anthropic.claude-v3
        backend: amazonbedrock
        region: us-east-1
        credentials:
          secretRef:
            name: k8sgpt-secret
            namespace: k8sgpt-operator-system
      noCache: false
      repository: ghcr.io/k8sgpt-ai/k8sgpt
      version: v0.3.48
    EOF
    
    kubectl apply -f k8sgpt.yaml
    
  3. Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:
    kubectl get pods -n k8sgpt-operator-system
    NAME                                                          READY   STATUS    RESTARTS      AGE
    k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
    
  4. Now you can configure the ArgoCD controller:
    helm repo add argo https://argoproj.github.io/argo-helm
    helm repo update
    kubectl create namespace argocd
    helm install argocd argo/argo-cd \
      --namespace argocd \
      --create-namespace
  5. Verify the ArgoCD installation:
    kubectl get pods -n argocd
    NAME                                                READY   STATUS    RESTARTS   AGE
    argocd-application-controller-0                     1/1     Running   0          43d
    argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
    argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
    argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
    argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
    argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
    argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
  6. Patch the argocd-server service to expose it through an external load balancer:
    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
  7. You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:
    kubectl get svc argocd-server -n argocd
    NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
    argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
  8. Retrieve the credentials for the ArgoCD UI:
    export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d`
    
    echo ArgoCD admin password - $argocdpassword
  9. Push the credentials to AWS Secrets Manager:
    aws secretsmanager create-secret \
    --name argocdcreds \
    --description "Credentials for argocd" \
    --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
  10. Configure a sample application in ArgoCD:
    cat << EOF > argocd-application.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: helm-guestbook
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/awsvikram/argocd-example-apps
        targetRevision: HEAD
        path: helm-guestbook
      destination:
        server: https://kubernetes.default.svc
        namespace: helm-guestbook
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    EOF
  11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:
    kubectl apply -f argocd-application.yaml

    ArgoCD Application

  12. It takes some time for K8sGPT to analyze the newly created pods. To trigger an immediate analysis, restart the pods in the k8sgpt-operator-system namespace by entering the following command:
    kubectl -n k8sgpt-operator-system rollout restart deploy
    
    deployment.apps/k8sgpt-bedrock restarted
    deployment.apps/k8sgpt-operator-controller-manager restarted

Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

We use an AWS CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. Deploying the CloudFormation template provisions several resources, and costs will be incurred for the AWS resources used.

Use the following parameters for the CloudFormation template:

  • EnvironmentName: The name for the deployment (EKSBlogSetup)
  • ArgoCD_LoadBalancer_URL: The ArgoCD load balancer URL, which you can extract with the following command:
    kubectl get service argocd-server -n argocd -ojsonpath="{.status.loadBalancer.ingress[0].hostname}"
  • AWSSecretName: The Secrets Manager secret name that was created to store ArgoCD credentials

The stack creates the following AWS Lambda functions:

  • <Stack name>-LambdaK8sGPTAgent-<auto-generated>
  • <Stack name>-RestartRollBackApplicationArgoCD-<auto-generated>
  • <Stack name>-ArgocdIncreaseMemory-<auto-generated>
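
As a rough sketch of how one of these functions could be wired up, the following shows the general shape of a Bedrock agent action group Lambda handler that restarts an ArgoCD application. It assumes an action group defined with function details, and the parameter name and ArgoCD call are placeholders rather than the stack's actual implementation:

def lambda_handler(event, context):
    # Parameters passed by the Bedrock agent, for example the application name
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    app_name = params.get("application_name", "helm-guestbook")

    # Placeholder for the real work: call the ArgoCD API (authenticated with the
    # credentials stored in Secrets Manager) to restart or roll back the application
    result_text = f"Requested restart of ArgoCD application {app_name}"

    # Response contract expected by Amazon Bedrock agents for function-based action groups
    return {
        "messageVersion": event["messageVersion"],
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": result_text}}
            },
        },
    }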

The stack creates the following Amazon Bedrock agents:

  • ArgoCDAgent, with the following action groups:
    1. argocd-rollback
    2. argocd-restart
    3. argocd-memory-management
  • K8sGPTAgent, with the following action group:
    1. k8s-cluster-operations
  • CollaboratorAgent, with the ArgoCDAgent and K8sGPTAgent associated to it as collaborator agents

The stack outputs the following:

  • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN will be needed at a later stage of the configuration process.
  • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
  • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock Agent alias
  • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias

Assign appropriate permissions to enable K8sGPT Amazon Bedrock agent to access the EKS cluster

To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

  1. Create an access entry for the Lambda function’s execution role
    export CFN_STACK_NAME=EKS-Troubleshooter
    export EKS_CLUSTER=PetSite
    
    export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
    
    aws eks create-access-entry \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE
  2. Associate the EKS view policy with the access entry
    aws eks associate-access-policy 
        --cluster-name $EKS_CLUSTER 
        --principal-arn  $K8SGPT_LAMBDA_ROLE
        --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy 
        --access-scope type=cluster
  3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.

Bedrock agents
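
If you prefer to script the two permission steps above rather than run the AWS CLI commands, the following is a minimal boto3 sketch; the cluster name matches the earlier steps, and the role ARN placeholder should be replaced with the LambdaK8sGPTAgentRole stack output.

import boto3

eks = boto3.client("eks", region_name="us-east-1")

cluster_name = "PetSite"                                        # same cluster as in the CLI steps
principal_arn = "<LambdaK8sGPTAgentRole ARN from the stack outputs>"

# Step 1: create the access entry for the Lambda execution role.
eks.create_access_entry(clusterName=cluster_name, principalArn=principal_arn)

# Step 2: associate the read-only EKS view policy, scoped to the whole cluster.
eks.associate_access_policy(
    clusterName=cluster_name,
    principalArn=principal_arn,
    policyArn="arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy",
    accessScope={"type": "cluster"},
)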

Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

Now, test the solution. We explore the following two scenarios:

  1. The agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
  2. The collaborator agent coordinates with the ArgoCD agent to provide a response

Agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

The agent not only stated the root cause, but went one step further and suggested a potential fix: increasing the memory allocated to the application.

K8sgpt agent finding
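
You can also send the same prompt to the collaborator agent programmatically through the Bedrock agent runtime API. The following is a minimal sketch; the agent ID and alias ID are placeholders (the alias ID is the CollaboratorAgentAliasId stack output, and the agent ID is visible on the Amazon Bedrock console).

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.invoke_agent(
    agentId="<collaborator-agent-id>",
    agentAliasId="<CollaboratorAgentAliasId>",
    sessionId=str(uuid.uuid4()),
    inputText="We got a down alert for the memory-demo app. Help us with the root cause of the issue.",
)

# The agent response is returned as an event stream of completion chunks.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)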

Collaborator agent coordinates with ArgoCD agent to provide a response

For this scenario, we continue from the previous prompt. We suspect the application wasn’t allocated enough memory and that increasing it will permanently fix the issue. We can also see that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

ArgoUI

Let’s now proceed to increase the memory, as shown in the following screenshot.

Interacting with agent to increase memory

The agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. The change can also be confirmed in the ArgoCD UI.

ArgoUI showing memory increase

Cleanup

If you decide to stop using the solution, complete the following steps:

  1. To delete the associated resources deployed using AWS CloudFormation:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during the deployment process (you assigned a name to it).
    3. Select the stack and choose Delete.
  2. Delete the EKS cluster if you created one specifically for this implementation.

Conclusion

By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases the powerful possibilities when combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it’s important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.

To learn more about Amazon Bedrock and its agent capabilities, refer to the Amazon Bedrock documentation.


About the authors

Vikram Venkataraman is a Principal Specialist Solutions Architect at Amazon Web Services (AWS). He helps customers modernize, scale, and adopt best practices for their containerized workloads. With the emergence of Generative AI, Vikram has been actively working with customers to leverage AWS’s AI/ML services to solve complex operational challenges, streamline monitoring workflows, and enhance incident response through intelligent automation.

Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at Amazon Web Services (AWS). He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, leveraging his expertise to help them design and architect their cloud workloads for optimal scale and resilience.

Sudheer Sangunni is a Senior Technical Account Manager at AWS Enterprise Support. With his extensive expertise in the AWS Cloud and big data, Sudheer plays a pivotal role in assisting customers with enhancing their monitoring and observability capabilities within AWS offerings.

Vikrant Choudhary is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in healthcare and life sciences. With over 15 years of experience in cloud solutions and enterprise architecture, he helps businesses accelerate their digital transformation initiatives. In his current role, Vikrant partners with customers to architect and implement innovative solutions, from cloud migrations and application modernization to emerging technologies such as generative AI, driving successful business outcomes through cloud adoption.

Read More

Host concurrent LLMs with LoRAX

Host concurrent LLMs with LoRAX

Businesses are increasingly seeking domain-adapted and specialized foundation models (FMs) to meet specific needs in areas such as document summarization, industry-specific adaptations, and technical code generation and advisory. Generative AI models now offer tailored experiences with minimal technical expertise required, and organizations are increasingly using these powerful models to drive innovation and enhance their services across various domains, from natural language processing (NLP) to content generation.

However, using generative AI models in enterprise environments presents unique challenges. Out-of-the-box models often lack the specific knowledge required for certain domains or organizational terminologies. To address this, businesses are turning to custom fine-tuned models, also known as domain-specific large language models (LLMs). These models are tailored to perform specialized tasks within specific domains or micro-domains. Similarly, organizations are fine-tuning generative AI models for domains such as finance, sales, marketing, travel, IT, human resources (HR), procurement, healthcare and life sciences, and customer service. Independent software vendors (ISVs) are also building secure, managed, multi-tenant generative AI platforms.

As the demand for personalized and specialized AI solutions grows, businesses face the challenge of efficiently managing and serving a multitude of fine-tuned models across diverse use cases and customer segments. From résumé parsing and job skill matching to domain-specific email generation and natural language understanding, companies often grapple with managing hundreds of fine-tuned models tailored to specific needs. This challenge is further compounded by concerns over scalability and cost-effectiveness. Traditional model serving approaches can become unwieldy and resource-intensive, leading to increased infrastructure costs, operational overhead, and potential performance bottlenecks, due to the size and hardware requirements to maintain a high-performing FM. The following diagram represents a traditional approach to serving multiple LLMs.

Hosting a separate instance for every fine-tuned LLM quickly becomes prohibitively expensive due to the hardware requirements and the costs associated with running dedicated instances for different tasks.

In this post, we explore how Low-Rank Adaptation (LoRA) can be used to address these challenges effectively. Specifically, we discuss using LoRA serving with LoRA eXchange (LoRAX) and Amazon Elastic Compute Cloud (Amazon EC2) GPU instances, allowing organizations to efficiently manage and serve their growing portfolio of fine-tuned models, optimize costs, and provide seamless performance for their customers.

LoRA is a technique for efficiently adapting large pre-trained language models to new tasks or domains by introducing small trainable weight matrices, called adapters, within each linear layer of the pre-trained model. This approach enables efficient adaptation with a significantly reduced number of trainable parameters compared to full model fine-tuning. Although LoRA allows for efficient adaptation, typical hosting of fine-tuned models merges the fine-tuned layers and base model weights together, so organizations with multiple fine-tuned variants normally must host each on separate instances. Because the resultant adapters are relatively small compared to the base model and are the last few layers of inference, this traditional custom model-serving approach is inefficient toward both resource and cost optimization.
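
To illustrate what these small trainable weight matrices look like in practice, the following is a minimal sketch using Hugging Face's PEFT library; the base model, target modules, and hyperparameters are illustrative choices rather than a prescription.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model and LoRA hyperparameters; adjust them for your own task.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # rank of the trainable low-rank matrices
    lora_alpha=32,                         # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],   # which linear layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # reports the small fraction of parameters that are trainable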

A solution for this is provided by an open source software tool called LoRAX that provides weight-swapping mechanisms for inference toward serving multiple variants of a base FM. LoRAX takes away having to manually set up the adapter attaching and detaching process with the pre-trained FM while you’re swapping between inferencing fine-tuned models for different domain or instruction use cases.

With LoRAX, you can fine-tune a base FM for a variety of tasks, including SQL query generation, industry domain adaptations, entity extraction, and instruction responses. You can host the different variants on a single EC2 instance instead of a fleet of model endpoints, saving costs without impacting performance.

Why LoRAX for LoRA deployment on AWS?

The surge in popularity of fine-tuning LLMs has given rise to multiple inference container methods for deploying LoRA adapters on AWS. Two prominent approaches among our customers are LoRAX and vLLM.

vLLM offers rapid inference speeds and high-performance capabilities, making it well suited for applications that demand high serving throughput at low cost, especially when running multiple fine-tuned models that share the same base model. You can run vLLM inference containers using Amazon SageMaker, as demonstrated in Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker in the AWS Machine Learning Blog. However, the complexity of vLLM currently limits the ease of implementing custom integrations for applications, and vLLM has limited quantization support.

For those seeking methods to build applications with strong community support and custom integrations, LoRAX presents an alternative. LoRAX is built upon Hugging Face’s Text Generation Inference (TGI) container, which is optimized for memory and resource efficiency when working with transformer-based models. Furthermore, LoRAX supports quantization methods such as Activation-aware Weight Quantization (AWQ) and Half-Quadratic Quantization (HQQ).

Solution overview

The LoRAX inference container can be deployed on a single EC2 G6 instance, and models and adapters can be loaded in using Amazon Simple Storage Service (Amazon S3) or Hugging Face. The following diagram is the solution architecture.

Prerequisites

For this guide, you need access to the following prerequisites:

  • An AWS account
  • Proper permissions to deploy EC2 G6 instances. LoRAX is built to use NVIDIA CUDA technology, and the G6 family is among the most cost-efficient EC2 instance families with recent NVIDIA GPUs. Specifically, the g6.xlarge is the most cost-efficient instance for the purposes of this tutorial at the time of this writing. Make sure that any required quota increases are active prior to deployment.
  • (Optional) A Jupyter notebook within Amazon SageMaker Studio or SageMaker Notebook Instances. After your requested quotas are applied to your account, you can use the default Studio Python 3 (Data Science) image with an ml.t3.medium instance to run the optional notebook code snippets. For the full list of available kernels, refer to available Amazon SageMaker kernels.

Walkthrough

This post walks you through creating an EC2 instance, downloading and deploying the container image, and hosting a pre-trained language model and custom adapters from Amazon S3. Follow the prerequisite checklist to make sure that you can properly implement this solution.

Configure server details

In this section, we show how to configure and create an EC2 instance to host the LLM. This guide uses the EC2 G6 instance class, and we deploy Meta’s Llama 2 7B model, which is roughly 15 GB at fp16. It’s recommended to have about 1.5x the GPU memory capacity of the model to swiftly run inference on a language model. GPU memory specifications can be found in Amazon ECS task definitions for GPU workloads.

You have the option to quantize the model. Quantizing a language model reduces the precision of the model weights to a size of your choosing. For example, the LLM we use is Meta’s Llama 2 7B, which by default has a weight precision of fp16, or 16-bit floating point. We can convert the model weights to int8 or int4 (8- or 4-bit integers) to shrink the memory footprint of the model to roughly 50% or 25% of its original size, respectively. In this guide, we use the default fp16 representation of Meta’s Llama 2 7B, so we require an instance type with at least 22 GB of GPU memory (VRAM).
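
As a rough back-of-the-envelope check of this sizing, the following sketch estimates the weight footprint and the recommended VRAM at different precisions; the parameter count and the 1.5x headroom factor are approximations.

# Rough GPU memory estimate for serving Llama 2 7B, following the 1.5x guideline above.
params_billions = 7                                   # approximate parameter count
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    weights_gb = params_billions * nbytes             # GB needed for the weights alone
    recommended_gb = weights_gb * 1.5                 # headroom for KV cache and adapters
    print(f"{precision}: ~{weights_gb:.0f} GB weights, ~{recommended_gb:.0f} GB recommended VRAM")

# fp16 works out to roughly 14 GB of weights and about 21 GB of recommended VRAM,
# which fits within the 24 GB GPU on a g6.xlarge.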

Depending on the language model specifications, we need to adjust the amount of Amazon Elastic Block Store (Amazon EBS) storage to properly store the base model and adapter weights.

To set up your inference server, follow these steps:

  1. On the Amazon EC2 console, choose Launch instances, as shown in the following screenshot.
  2. For Name, enter LoRAX - Inference Server.
  3. To open AWS CloudShell, on the bottom left of the AWS Management Console choose CloudShell, as shown in the following screenshot.
  4. Paste the following command into CloudShell and copy the resulting text, as shown in the screenshot that follows. This is the Amazon Machine Image (AMI) ID you will use.
    aws ec2 describe-images --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5*(Ubuntu*' 'Name=state,Values=available' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text

  5. In the Application and OS Images (Amazon Machine Image) search bar, enter the AMI ID that you copied from the CloudShell command and press Enter on your keyboard.
  6. Under Community AMIs, confirm that the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) AMI appears.
  7. Choose Select, as shown in the following screenshot.
  8. Specify the Instance type as g6.xlarge. Depending on the size of the model, you can increase the size of the instance to accommodate your model. For information on GPU memory per instance type, visit Amazon ECS task definitions for GPU workloads.
  9. (Optional) Under Key pair (login), create a new key pair or select an existing key pair if you want to use one to connect to the instance using Secure Shell (SSH).
  10. In Network settings, choose Edit, as shown in the following screenshot.
  11. Leave default settings for VPC, Subnet, and Auto-assign public IP.
  12. Under Firewall (security groups), for Security group name, enter Inference Server Security Group.
  13. For Description, enter Security Group for Inference Server.
  14. Under Inbound Security Group Rules, edit Security group rule 1 to limit SSH traffic to your IP address by changing Source type to My IP.
  15. Choose Add security group rule.
  16. Configure Security group rule 2 by changing Type to All ICMP-IPv4 and Source Type to My IP. This makes sure the server is reachable only from your IP address and not by bad actors.
  17. Under Configure storage, set Root volume size to 128 GiB to allow enough space for storing the base model and adapter weights. For larger models and more adapters, you might need to increase this value accordingly. The model card available with most open source models details the size of the model weights and other usage information. We suggest 128 GiB as the starting storage size here because downloading multiple adapters along with the model weights can add up quickly. Factoring in the operating system, downloaded drivers and dependencies, and various project files, 128 GiB is a safe storage size to start with before adjusting up or down. After setting the desired storage space, expand the Advanced details section.
  18. Under IAM instance profile, either select or create an IAM instance profile that has S3 read access enabled.
  19. Choose Launch instance.
  20. When the instance finishes launching, select either SSH or Instance connect to connect to your instance and enter the following commands:
    sudo apt update
    sudo systemctl start docker 
    sudo nvidia-ctk runtime configure --runtime=docker 
    sudo systemctl restart docker

Install container and launch server

The server is now properly configured to load and run the serving software.

Enter the following commands to download and deploy the LoRAX Docker container image. For more information, refer to Run container with base LLM. Specify a model from Hugging Face or the storage volume and load the model for inference. Replace the parameters in the commands to suit your requirements (for example, <huggingface-access-token>).

Adding the -d flag as shown runs the download and installation process in the background. It can take up to 30 minutes until the container is properly configured. Using the Docker commands docker ps and docker logs <container-name>, you can view the progress of the Docker container and observe when the container has finished setting up. docker logs <container-name> --follow will continue streaming new output from the container for continuous monitoring.

model=meta-llama/Llama-2-7b-hf
volume=$PWD/data
token=<huggingface-access-token>

docker run -d --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/predibase/lorax:main --model-id $model

Test server and adapters

By running the container as a background process using the -d flag, you can prompt the server with incoming requests. By specifying the model-id as a Hugging Face model ID, LoRAX loads the model into memory directly from Hugging Face.

This isn’t recommended for production because relying on Hugging Face introduces yet another point of failure in case the model or adapter is unavailable. It’s recommended that models be stored locally either in Amazon S3, Amazon EBS, or Amazon Elastic File System (Amazon EFS) for consistent deployments. Later in this post, we discuss a way to load models and adapters from Amazon S3 as you go.

LoRAX also can pull adapter files from Hugging Face at runtime. You can use this capability by adding adapter_id and adapter_source within the body of the request. The first time a new adapter is requested, it can take some time to load into the server, but requests afterwards will load from memory.

  1. Enter the following command to prompt the base model:
    curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
      "inputs": "why is the sky blue",
      "parameters": {
        "max_new_tokens": 6
      }
    }' \
    -H 'Content-Type: application/json'

  2. Enter the following command to prompt the base model with the specified adapter (a Python equivalent of both requests follows this list):
    curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
      "inputs": "why is the sky blue",
      "parameters": {
        "max_new_tokens": 64,
        "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-databricks-dolly-15k",
        "adapter_source": "hub"
      }
    }' \
    -H 'Content-Type: application/json'
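
If you prefer to exercise the endpoint from Python rather than curl, the following is an equivalent sketch using the requests library, assuming the LoRAX container is listening on localhost port 8080 as launched above.

import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

# Prompt the base model.
base_resp = requests.post(
    LORAX_URL,
    json={"inputs": "why is the sky blue", "parameters": {"max_new_tokens": 6}},
    timeout=60,
)
print(base_resp.json())

# Prompt the base model with a LoRA adapter pulled from the Hugging Face Hub.
adapter_resp = requests.post(
    LORAX_URL,
    json={
        "inputs": "why is the sky blue",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-databricks-dolly-15k",
            "adapter_source": "hub",
        },
    },
    timeout=120,
)
print(adapter_resp.json())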

[Optional] Create custom adapters with SageMaker training and PEFT

Typical fine-tuning jobs for LLMs merge the adapter weights with the original base model, but using software such as Hugging Face’s PEFT library allows for fine-tuning with adapter separation.

Follow the steps outlined in this AWS Machine Learning blog post to fine-tune Meta’s Llama 2 and get the separated LoRA adapter in Amazon S3.

[Optional] Use adapters from Amazon S3

LoRAX can pull adapter files from Amazon S3 at runtime. You can use this capability by adding adapter_id and adapter_source within the body of the request. The first time a new adapter is requested, it can take some time to load into the server, but requests afterwards will load from server memory. This is the optimal method when running LoRAX in production environments compared to importing from Hugging Face because it doesn’t involve runtime dependencies.

curl 127.0.0.1:8080/generate \
-X POST \
-d '{
  "inputs": "What is process mining?",
  "parameters": {
    "max_new_tokens": 64,
    "adapter_id": "<your-adapter-s3-bucket-path>",
    "adapter_source": "s3"
  }
}' \
-H 'Content-Type: application/json'

[Optional] Use custom models from Amazon S3

LoRAX also can load custom language models from Amazon S3. If the model architecture is supported in the LoRAX documentation, you can specify a bucket name to pull the weights from, as shown in the following code example. Refer to the previous optional section on separating adapter weights from base model weights to customize your own language model.

volume=$PWD/data
bucket_name=<s3-bucket-name>
model=<model-directory-name>

docker run --gpus all --shm-size 1g -e PREDIBASE_MODEL_BUCKET=$bucket_name -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model

Reliable deployments using Amazon S3 for model and adapter storage

Storing models and adapters in Amazon S3 offers a more dependable solution for consistent deployments compared to relying on third-party services such as Hugging Face. By managing your own storage, you can implement robust protocols so your models and adapters remain accessible when needed. Additionally, you can use this approach to maintain version control and isolate your assets from external sources, which is crucial for regulatory compliance and governance.

For even greater flexibility, you can use virtual file systems such as Amazon EFS or Amazon FSx for Lustre. You can use these services to mount the same models and adapters across multiple instances, facilitating seamless access in environments with auto scaling setups. This means that all instances, whether scaling up or down, have uninterrupted access to the necessary resources, enhancing the overall reliability and scalability of your deployments.

Cost comparison and advisory on scaling

Using the LoRAX inference containers on EC2 instances means that you can drastically reduce the costs of hosting multiple fine-tuned versions of language models by storing all adapters in memory and swapping them dynamically at runtime. Because LLM adapters are typically a fraction of the size of the base model, you can efficiently scale your infrastructure according to overall server usage rather than by individual variant utilization. LoRA adapters are usually anywhere from 1/10th to 1/4th the size of the base model, although the exact size depends on the implementation and the complexity of the task the adapter is trained for; regular (full-size) adapters can be as large as the base model.

In the preceding example, the model adapters resulting from the training method were 5 MB.

Though this storage amount depends on the specific model architecture, you can dynamically swap up to thousands of fine-tuned variants on a single instance with little to no change in inference speed. It’s recommended to use instances with GPU memory of around 150% of the combined model and variant size to account for model, adapter, and KV cache (or attention cache) storage in VRAM. For GPU memory specifications, refer to Amazon ECS task definitions for GPU workloads.

Depending on the chosen base model and the number of fine-tuned adapters, you can train and deploy hundreds or thousands of customized language models sharing the same base model by using LoRAX to dynamically swap out adapters. With adapter swapping, if you have five fine-tuned variants, you can save 80% on hosting costs because all the custom adapters can be served from the same instance.
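
The savings figure is straightforward arithmetic: serving N variants from one shared instance instead of N dedicated instances avoids N-1 instances. A quick sketch, using a placeholder hourly rate:

# Illustrative cost comparison for serving N fine-tuned variants of one base model.
hourly_rate = 1.0            # placeholder on-demand rate for one GPU instance
num_variants = 5

dedicated = num_variants * hourly_rate    # one instance per fine-tuned model
shared = 1 * hourly_rate                  # one LoRAX instance swapping adapters at runtime

savings = (dedicated - shared) / dedicated
print(f"Hosting cost savings: {savings:.0%}")   # prints 80% for five variants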

Launch templates in Amazon EC2 can be used to deploy multiple instances, with options for load balancing or auto scaling. You can additionally use AWS Systems Manager to deploy patches or changes. As discussed previously, a shared file system can be used across all deployed EC2 resources to store the LLM weights for multiple adapters, resulting in faster transfer to the instances compared to Amazon S3. The difference between using a shared file system such as Amazon EFS and direct Amazon S3 access is the number of steps needed to load the model weights and adapters into memory. With Amazon S3, the adapters and weights need to be transferred to the local file system of the instance before being loaded, whereas shared file systems can be read directly without a local copy. There are implementation trade-offs that should be taken into consideration. You can also use Amazon API Gateway as an API endpoint for REST-based applications.

Host LoRAX servers for multiple models in production

If you intend to use multiple custom FMs for specific tasks with LoRAX, follow this guide for hosting multiple variants of models. Follow this AWS blog on hosting text classification with BERT to perform task routing between the expert models. For an example implementation of efficient model hosting using adapter swapping, refer to LoRA Land, which was released by Predibase, the organization responsible for LoRAX. LoRA Land is a collection of 25 fine-tuned variants of Mistral AI’s Mistral-7B LLM that collectively outperform top-performing LLMs while being hosted behind a single endpoint. The following diagram shows the architecture.

Cleanup

In this guide, we created security groups, an S3 bucket, an optional SageMaker notebook instance, and an EC2 inference server. It’s important to terminate resources created during this walkthrough to avoid incurring additional costs:

  1. Delete the S3 bucket
  2. Terminate the EC2 inference server
  3. Terminate the SageMaker notebook instance

Conclusion

After following this guide, you can set up an EC2 instance with LoRAX for language model hosting and serving, store and access custom model weights and adapters in Amazon S3, and manage pre-trained and custom models and variants using SageMaker. LoRAX offers a cost-efficient approach for those who want to host multiple language models at scale. For more information about working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.


About the Authors

John Kitaoka is a Solutions Architect at Amazon Web Services, working with government entities, universities, nonprofits, and other public sector organizations to design and scale artificial intelligence solutions. With a background in mathematics and computer science, John’s work covers a broad range of ML use cases, with a primary interest in inference, AI responsibility, and security. In his spare time, he loves woodworking and snowboarding.

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in computer science, his work covers a broad range of ML use cases, primarily focusing on LLM training and inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and on using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based out of San Francisco, Baladithya enjoys tinkering, developing applications, and working on his homelab in his free time.

Read More

Build a computer vision-based asset inventory application with low or no training

Build a computer vision-based asset inventory application with low or no training

Keeping an up-to-date asset inventory with real devices deployed in the field can be a challenging and time-consuming task. Many electricity providers use manufacturer’s labels as key information to link their physical assets within asset inventory systems. Computer vision can be a viable solution to speed up operator inspections and reduce human errors by automatically extracting relevant data from the label. However, building a standard computer vision application capable of managing hundreds of different types of labels can be a complex and time-consuming endeavor.

In this post, we present a solution using generative AI and large language models (LLMs) to alleviate the time-consuming and labor-intensive tasks required to build a computer vision application, enabling you to immediately start taking pictures of your asset labels and extract the necessary information to update the inventory using AWS services like AWS Lambda, Amazon Bedrock, Amazon Titan, Anthropic’s Claude 3 on Amazon Bedrock, Amazon API Gateway, AWS Amplify, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

LLMs are large deep learning models that are pre-trained on vast amounts of data. They are capable of understanding and generating human-like text, making them incredibly versatile tools with a wide range of applications. This approach harnesses the image understanding capabilities of Anthropic’s Claude 3 model to extract information directly from photographs taken on-site, by analyzing the labels present in those field images.

Solution overview

The AI-powered asset inventory labeling solution aims to streamline the process of updating inventory databases by automatically extracting relevant information from asset labels through computer vision and generative AI capabilities. The solution uses various AWS services to create an end-to-end system that enables field technicians to capture label images, extract data using AI models, verify the accuracy, and seamlessly update the inventory database.

The following diagram illustrates the solution architecture.

Architecure diagram

The workflow consists of the following steps:

  1. The process starts when an operator takes and uploads a picture of the assets using the mobile app.
  2. The operator submits a request to extract data from the asset image.
  3. A Lambda function retrieves the uploaded asset image from the uploaded images data store.
  4. The function generates the asset image embeddings (vector representations of data) invoking the Amazon Titan Multimodal Embeddings G1 model.
  5. The function performs a similarity search in the knowledge base to retrieve similar asset labels. The most relevant results will augment the prompt as similar examples to improve the response accuracy, and are sent with the instructions to the LLM to extract data from the asset image.
  6. The function invokes Anthropic’s Claude 3 Sonnet on Amazon Bedrock to extract data (serial number, vendor name, and so on) using the augmented prompt and the related instructions.
  7. The function sends the response to the mobile app with the extracted data.
  8. The mobile app verifies the extracted data and assigns a confidence level. It invokes the API to process the data. Data with high confidence will be directly ingested into the system.
  9. A Lambda function is invoked to update the asset inventory database with the extracted data if the confidence level has been indicated as high by the mobile app.
  10. The function sends data with low confidence to Amazon Augmented AI (Amazon A2I) for further processing.
  11. The human reviewers from Amazon A2I validate or correct the low-confidence data.
  12. Human reviewers, such as subject matter experts, validate the extracted data, flag it, and store it in an S3 bucket.
  13. A rule in Amazon EventBridge is defined to trigger a Lambda function to get the information from the S3 bucket when the Amazon A2I workflow processing is complete.
  14. A Lambda function processes the output of the Amazon A2I workflow by loading data from the JSON file that stored the backend operator-validated information.
  15. The function updates the asset inventory database with the new extracted data.
  16. The function sends the extracted data marked as new by human reviewers to an Amazon Simple Queue Service (Amazon SQS) queue to be further processed.
  17. Another Lambda function fetches messages from the queue and serializes the updates to the knowledge base database.
  18. The function generates the asset image embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model.
  19. The function updates the knowledge base with the generated embeddings and notifies other functions that the database has been updated.

Let’s look at the key components of the solution in more detail.

Mobile app

The mobile app component plays a crucial role in this AI-powered asset inventory labeling solution. It serves as the primary interface for field technicians on their tablets or mobile devices to capture and upload images of asset labels using the device’s camera. The implementation of the mobile app includes an authentication mechanism that will allow access only to authenticated users. It’s also built using a serverless approach to minimize recurring costs and have a highly scalable and robust solution.

The mobile app has been built using the following services:

  • AWS Amplify – This provides a development framework and hosting for the static content of the mobile app. By using Amplify, the mobile app component benefits from features like seamless integration with other AWS services, offline capabilities, secure authentication, and scalable hosting.
  • Amazon Cognito – This handles user authentication and authorization for the mobile app.

AI data extraction service

The AI data extraction service is designed to extract critical information, such as manufacturer name, model number, and serial number from images of asset labels.

To enhance the accuracy and efficiency of the data extraction process, the service employs a knowledge base comprising sample label images and their corresponding data fields. This knowledge base serves as a reference guide for the AI model, enabling it to learn and generalize from labeled examples to new label formats effectively. The knowledge base is stored as vector embeddings in a high-performance vector database: Meta’s FAISS (Facebook AI Similarity Search), hosted on Amazon S3.

Embeddings are dense numerical representations that capture the essence of complex data like text or images in a vector space. Each data point is mapped to a vector or ordered list of numbers, where similar data points are positioned closer together. This embedding space allows for efficient similarity calculations by measuring the distance between vectors. Embeddings enable machine learning (ML) models to effectively process and understand relationships within complex data, leading to improved performance on various tasks like natural language processing and computer vision.
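
To make this concrete, the following sketch embeds a new asset label image with Amazon Titan Multimodal Embeddings and searches a FAISS index for the closest knowledge base entries. The model ID, file names, and the use of a local copy of the index are assumptions for illustration.

import base64
import json

import boto3
import faiss
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_image(image_path: str) -> np.ndarray:
    """Generate a vector embedding for an asset label image with Amazon Titan."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",   # Titan Multimodal Embeddings G1
        body=json.dumps({"inputImage": image_b64}),
    )
    embedding = json.loads(response["body"].read())["embedding"]
    return np.array(embedding, dtype="float32")

# Load a local copy of the FAISS index previously built from the knowledge base images.
index = faiss.read_index("knowledge_base.index")

query = embed_image("new_asset_label.jpg").reshape(1, -1)
distances, neighbors = index.search(query, 2)    # the two closest labeled examples
print(neighbors[0], distances[0])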

The following diagram illustrates an example workflow.

Vector embeddings

The vector embeddings are generated using Amazon Titan, a powerful embedding generation service, which converts the labeled examples into numerical representations suitable for efficient similarity searches. The workflow consists of the following steps:

  1. When a new asset label image is submitted for processing, the AI data extraction service, through a Lambda function, retrieves the uploaded image from the bucket where it was uploaded.
  2. The Lambda function performs a similarity search using Meta’s FAISS vector search engine. This search compares the new image against the vector embeddings in the knowledge base generated by Amazon Titan Multimodal Embeddings invoked through Amazon Bedrock, identifying the most relevant labeled examples.
  3. Using the augmented prompt with context information from the similarity search, the Lambda function invokes Amazon Bedrock, specifically Anthropic’s Claude 3, a state-of-the-art generative AI model, for image understanding and optical character recognition (OCR) tasks. By using the similar examples, the AI model can more accurately extract and interpret the critical information from the new asset label image.
  4. The response is then sent to the mobile app to be confirmed by the field technician.

In this phase, the AWS services used are:

  • Amazon Bedrock – A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities.
  • AWS Lambda – A serverless computing service that allows you to run your code without the need to provision or manage physical servers or virtual machines. A Lambda function runs the data extraction logic and orchestrates the overall data extraction process.
  • Amazon S3 – A storage service offering industry-leading durability, availability, performance, security, and virtually unlimited scalability at low costs. It’s used to store the asset images uploaded by the field technicians.

Data verification

Data verification plays a crucial role in maintaining the accuracy and reliability of the extracted data before updating the asset inventory database and is included in the mobile app.

The workflow consists of the following steps:

  1. The extracted data is shown to the field operator.
  2. If the field operator determines that the extracted data is accurate and matches an existing asset label in the knowledge base, they can confirm the correctness of the extraction; if not, they can update the values directly using the app.
  3. When the field technician confirms the data is correct, that information is automatically forwarded to the backend review component.

Data verification uses the following AWS services:

  • Amazon API Gateway – A secure and scalable API gateway that exposes the data verification component’s functionality to the mobile app and other components.
  • AWS Lambda – Serverless functions for implementing the verification logic and routing data based on confidence levels.

Backend review

This component assesses the discrepancy of automatically identified data by the AI data extraction service and the final data approved by the field operator and computes the difference. If the difference is below a configured threshold, the data is sent to update the inventory database; otherwise a human review process is engaged:

  1. Subject matter experts asynchronously review flagged data entries on the Amazon A2I console.
  2. Significant discrepancies are marked to update the generative AI’s knowledge base.
  3. Minor OCR errors are corrected without updating the AI model’s knowledge base.

The backend review component uses the following AWS services:

  • Amazon A2I – A service that provides a web-based interface for human reviewers to inspect and correct the extracted data and asset label images.
  • Amazon EventBridge – A serverless service that uses events to connect application components together. When the Amazon A2I human workflow is complete, EventBridge is used to detect this event and trigger a Lambda function to process the output data.
  • Amazon S3 – Object storage used to save the information marked by the Amazon A2I reviewers.

Inventory database

The inventory database component plays a crucial role in storing and managing the verified asset data in a scalable and efficient manner. Amazon DynamoDB, a fully managed NoSQL database service from AWS, is used for this purpose. DynamoDB is a serverless, scalable, and highly available key-value and document database service. It’s designed to handle massive amounts of data and high traffic workloads, making it well-suited for storing and retrieving large-scale inventory data.

The verified data from the AI extraction and human verification processes is ingested into the DynamoDB table. This includes data with high confidence from the initial extraction, as well as data that has been reviewed and corrected by human reviewers.

Knowledge base update

The knowledge base update component enables continuous improvement and adaptation of the generative AI models used for asset label data extraction:

  1. During the backend review process, human reviewers from Amazon A2I validate and correct the data extracted from asset labels by the AI model.
  2. The corrected and verified data, along with the corresponding asset label images, is marked as new label examples if not already present in the knowledge base.
  3. A Lambda function is triggered to update the asset inventory and send the new labels to the FIFO (First-In-First-Out) queue.
  4. A Lambda function processes the messages in the queue, updating the knowledge base vector store (S3 bucket) with the new label examples.
  5. The update process generates the vector embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model exposed by Amazon Bedrock and storing the embeddings in a Meta’s FAISS database in Amazon S3.

The knowledge base update process makes sure that the solution remains adaptive and continuously improves its performance over time, reducing the likelihood of unseen label examples and the involvement of subject matter experts to correct the extracted data.

This component uses the following AWS services:

  • Amazon Titan Multimodal Embeddings G1 model – This model generates the embeddings (vector representations) for the new asset images and their associated data.
  • AWS Lambda – Lambda functions are used to update the asset inventory database, to send and process the extracted data to the FIFO queue, and to update the knowledge base in case of new unseen labels.
  • Amazon SQS – Amazon SQS offers fully managed message queuing for microservices, distributed systems, and serverless applications. The extracted data marked as new by human reviewers is sent to an SQS FIFO (First-In-First-Out) queue. This makes sure that the messages are processed in the correct order; FIFO queues preserve the order in which messages are sent and received. If you use a FIFO queue, you don’t have to place sequencing information in your messages.
  • Amazon S3 – The knowledge base is stored in an S3 bucket, with the newly generated embeddings. This allows the AI system to improve its accuracy for future asset label recognition tasks.

Navigation flow

This section explains how users interact with the system and how data flows between different components of the solution. We’ll examine each key component’s role in the process, from initial user access through data verification and storage.

Mobile app

The end user accesses the mobile app using the browser included in the handheld device. The application URL to access the mobile app is available after you have deployed the frontend application. Using the browser on a handheld device or your PC, browse to the application URL address, where a login window will appear. Because this is a demo environment, you can register on the application by following the automated registration workflow implemented through Amazon Cognito and choosing Create Account, as shown in the following screenshot.

During the registration process, you must provide a valid email address that will be used to verify your identity, and define a password. After you’re registered, you can log in with your credentials.

After authentication is complete, the mobile app appears, as shown in the following screenshot.

The process to use the app is the following:

  1. Use the camera button to capture a label image.
  2. The app facilitates the upload of the captured image to a private S3 bucket specifically designated for storing asset images. S3 Transfer Acceleration is a separate AWS service that can be integrated with Amazon S3 to improve the transfer speed of data uploads and downloads. It works by using AWS edge locations, which are globally distributed and closer to the client applications, as intermediaries for data transfer. This reduces the latency and improves the overall transfer speed, especially for clients that are geographically distant from the S3 bucket’s AWS Region.
  3. After the image is uploaded, the app sends a request to the AI data extraction service, triggering the subsequent process of data extraction and analysis. The extracted data returned by the service is displayed and editable within the form, as described later in this post. This allows for data verification.

AI data extraction service

This module uses Anthropic’s Claude 3 FM, a multimodal system capable of processing both images and text. To extract relevant data, we employ a prompt technique that uses samples to guide the model’s output. Our prompt includes two sample images along with their corresponding extracted text. The model identifies which sample image most closely resembles the one we want to analyze and uses that sample’s extracted text as a reference to determine the relevant information in the target image.

We use the following prompt to achieve this result:

{
 "role": "user",
 "content": [
 {
 "type": "text",
 "text": "first_sample_image:",
 },
 {
 "type": "image",
 "source": {
 "type": "base64",
 "media_type": "image/jpeg",
 "data": first_sample_encoded_image,
 },
 },
 {
 "type": "text",
 "text": "target_image:",
 },
 {
 "type": "image",
 "source": {
 "type": "base64",
 "media_type": "image/jpeg",
 "data": encoded_image,
 },
 },
 {"type": "text",
 "text": f"""
 answer the question using the following example as reference.
 match exactly the same set of fields and information as in the provided example.
 
 <example>
 analyze first_sample_image and answer with a json file with the following information: Model, SerialN, ZOD.
 answer only with json.
 
 Answer:
 {first_sample_answer}
 </example>
 
 <question>
 analyze target_image and answer with a json file with the following information: Model, SerialN, ZOD.
 answer only with json.
 
 Answer:
 </question>
 """},
 
 ],
 }

In the preceding code, first_sample_encoded_image and first_sample_answer are the reference image and expected output, respectively, and encoded_image contains the new image that has to be analyzed.
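
The following is a minimal sketch of how this message can be sent to Anthropic's Claude 3 Sonnet through the Amazon Bedrock Runtime API; the token limit is illustrative, and prompt_message stands for the dictionary shown in the preceding block.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# prompt_message is the dictionary shown in the preceding block
# (sample image, target image, and the few-shot instructions).
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [prompt_message],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
extracted_json = result["content"][0]["text"]   # the model is instructed to answer with JSON only
print(extracted_json)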

Data verification

After the image is processed by the AI data extraction service, the control goes back to the mobile app:

  1. The mobile app receives the extracted data from the AI data extraction service, which has processed the uploaded asset label image and extracted relevant information using computer vision and ML models.
  2. Upon receiving the extracted data, the mobile app presents it to the field operator, allowing them to review and confirm the accuracy of the information (see the following screenshot). If the extracted data is correct and matches the physical asset label, the technician can submit a confirmation through the app, indicating that the data is valid and ready to be inserted into the asset inventory database.
  3. If the field operator sees any discrepancies or errors in the extracted data compared to the actual asset label, they have the option to correct those values.
  4. The values returned by the AI data extraction service and the final values validated by the field operators are sent to the backend review service.

Backend review

This process is implemented using Amazon A2I:

  1. A distance metric is computed to evaluate the difference between what the data extraction service has identified and the correction performed by the on-site operator.
  2. If the difference is larger than a predefined threshold, the image and the operator modified data are submitted to an Amazon A2I workflow, creating a human-in-the-loop request.
  3. When a backend operator becomes available, the new request is assigned.
  4. The operator uses the Amazon A2I provided web interface, as depicted in the following screenshot, to check what the on-site operator has done and, if this type of label is not included in the knowledge base, can decide to add it by entering Yes in the Add to Knowledge Base field.

    App screenshot

  5. When the Amazon A2I process is complete, a Lambda function is triggered.
  6. This Lambda function stores the information in the inventory database and verifies whether this image also needs to be used to update the knowledge base.
  7. If this is the case, the Lambda function files the request with the relevant data in an SQS FIFO queue (a minimal sketch of such a request follows this list).
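
As referenced in the last step, the following is a minimal sketch of how the Lambda function might file such a request in the SQS FIFO queue; the queue URL and message attributes are placeholders.

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "<knowledge-base-updates.fifo queue URL>"   # placeholder FIFO queue URL

# File the knowledge base update request. A fixed MessageGroupId keeps updates ordered,
# so the single consumer Lambda applies them one at a time.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        "image_s3_uri": "s3://<asset-images-bucket>/labels/SN123456.jpg",
        "manufacturer": "ACME",
        "model_id": "MX-200",
        "serial_number": "SN123456",
    }),
    MessageGroupId="knowledge-base-updates",
    MessageDeduplicationId="SN123456",   # or enable content-based deduplication on the queue
)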

Inventory database

To keep this solution as simple as possible while covering the required capability, we selected DynamoDB as our inventory database. This is a NoSQL database, and we store data in a table with the following information:

  • Manufacturer, model ID, and serial number, which together form the key of the table
  • A link to the picture containing the label used during the on-site inspection

DynamoDB offers an on-demand pricing model that allows costs to directly depend on actual database usage.
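
As an illustration, the following sketch writes one verified record to this table with boto3; the table name, key format, and attribute names are placeholders consistent with the description above.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("asset-inventory")   # placeholder table name

# Store one verified asset record: the manufacturer, model ID, and serial number form
# the key, plus a link to the label picture used during the on-site inspection.
table.put_item(
    Item={
        "asset_id": "ACME#MX-200#SN123456",   # manufacturer#model#serial as the table key
        "manufacturer": "ACME",
        "model_id": "MX-200",
        "serial_number": "SN123456",
        "label_image_s3_uri": "s3://<asset-images-bucket>/labels/SN123456.jpg",
    }
)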

Knowledge base database

The knowledge base database is stored as two files in an S3 bucket:

  • The first file is a JSON array containing the metadata (manufacturer, serial number, model ID, and link to reference image) for each of the knowledge base entries
  • The second file is a FAISS database containing an index with the embedding for each of the images included in the first file

To be able to minimize race conditions when updating the database, a single Lambda function is configured as the consumer of the SQS queue. The Lambda function extracts the information about the link to the reference image and the metadata, certified by the back-office operator, updates both files, and stores the new version in the S3 bucket.
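
The following sketch outlines what that consumer Lambda might do for each message; the bucket name, object keys, and the shape of the metadata are assumptions for illustration.

import json

import boto3
import faiss
import numpy as np

s3 = boto3.client("s3")
BUCKET = "<knowledge-base-bucket>"   # placeholder bucket name

def add_entry_to_knowledge_base(metadata: dict, embedding: np.ndarray) -> None:
    """Append one validated label (metadata plus image embedding) to both knowledge base files."""
    # 1. Update the JSON metadata array.
    s3.download_file(BUCKET, "knowledge_base.json", "/tmp/knowledge_base.json")
    with open("/tmp/knowledge_base.json") as f:
        entries = json.load(f)
    entries.append(metadata)
    with open("/tmp/knowledge_base.json", "w") as f:
        json.dump(entries, f)
    s3.upload_file("/tmp/knowledge_base.json", BUCKET, "knowledge_base.json")

    # 2. Add the new embedding to the FAISS index and store the new version.
    s3.download_file(BUCKET, "knowledge_base.index", "/tmp/knowledge_base.index")
    index = faiss.read_index("/tmp/knowledge_base.index")
    index.add(embedding.reshape(1, -1).astype("float32"))
    faiss.write_index(index, "/tmp/knowledge_base.index")
    s3.upload_file("/tmp/knowledge_base.index", BUCKET, "knowledge_base.index")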

In the following sections, we create a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.

Prerequisites

You need the following prerequisites before you can proceed with this solution. For this post, we use the us-east-1 Region. You also need an AWS Identity and Access Management (IAM) user with administrative privileges to deploy the required components and a development environment with access to AWS resources already configured.

For the development environment, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance (select at least a t3.small instance type in order to be able to build the web application) or a development environment of your own choice. Install Python 3.9, then install and configure the AWS Command Line Interface (AWS CLI).

You will also need to install the Amplify CLI. Refer to Set up Amplify CLI for more information.

The next step is to enable the models used in this workshop in Amazon Bedrock. To do this, complete the following steps:

  1. On the Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Enable specific models.

Model access

  3. Select all Anthropic and Amazon models and choose Next.

A new window will list the requested models.

  4. Confirm that the Amazon Titan models and Anthropic Claude models are on this list and choose Submit.

The next step is to create an Amazon SageMaker Ground Truth private labeling workforce that will be used to perform back-office activities. If you don’t already have a private labeling workforce in your account, you can create one following these steps:

  1. On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling workforces.
  2. On the Private tab, choose Create private team.
  3. Provide a name for the team and your organization, and insert your email address (it must be a valid one) for both Email addresses and Contact email.
  4. Leave all the other options as default.
  5. Choose Create private team.
  6. After your workforce is created, copy your workforce Amazon Resource Name (ARN) on the Private tab and save it for later use.

Lastly, build a Lambda layer that includes two Python libraries. To build this layer, connect to your development environment and issue the following commands:
git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
bash build_lambda_layer.sh

You should get an output similar to the following screenshot.


Save the LAMBDA_LAYER_VERSION_ARN for later use.

You are now ready to deploy the backend infrastructure and frontend application.

Deploy the backend infrastructure

The backend is deployed using AWS CloudFormation to build the following components:

      • An API Gateway to act as an integration layer between the frontend application and the backend
      • An S3 bucket to store the uploaded images and the knowledge base
      • Amazon Cognito to allow end-user authentication
      • A set of Lambda functions to implement backend services
      • An Amazon A2I workflow to support the back-office activities
      • An SQS queue to store knowledge base update requests
      • An EventBridge rule to trigger a Lambda function as soon as an Amazon A2I workflow is complete
      • A DynamoDB table to store inventory data
      • IAM roles and policies to allow access to the different components to interact with each other and also access Amazon Bedrock for generative AI-related tasks

Download the CloudFormation template, then complete the following steps:

      1. On the AWS CloudFormation console, choose Create stack.
      2. Choose Upload a template file and choose Choose file to upload the downloaded template.
      3. Choose Next.
      4. For Stack name, enter a name (for example, asset-inventory).
      5. For A2IWorkforceARN, enter the ARN of the labeling workforce you identified.
      6. For LambdaLayerARN, enter the ARN of the Lambda layer version you created earlier.
      7. Choose Next and Next again.
      8. Acknowledge that AWS CloudFormation is going to create IAM resources and choose Submit.


Wait until the CloudFormation stack creation process is complete; it will take about 15–20 minutes. You can then view the stack details.


Note the values on the Outputs tab. You will use the output data later to complete the configuration of the frontend application.

Deploy the frontend application

In this section, you will build the web application that is used by the on-site operator to collect a picture of the labels, submit it to the backend services to extract relevant information, validate or correct returned information, and submit the validated or corrected information to be stored in the asset inventory.

The web application is built with React and uses the Amplify JavaScript library.

Amplify provides several products to build full stack applications:

      • Amplify CLI – A simple command line interface to set up the needed services
      • Amplify Libraries – Use case-centric client libraries to integrate the frontend code with the backend
      • Amplify UI Components – UI libraries for React, React Native, Angular, Vue, and Flutter

In this example, you have already created the needed services with the CloudFormation template, so the Amplify CLI will deploy the application on the Amplify provided hosting service.

      1. Log in to your development environment and download the client code from the GitHub repository using the following command:
git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd webapp
      2. If you’re running on AWS Cloud9 as a development environment, issue the following command to let the Amplify CLI use the AWS Cloud9 managed credentials:
ln -s $HOME/.aws/credentials $HOME/.aws/config
      3. Now you can initialize the Amplify application using the CLI:
amplify init

After issuing this command, the Amplify CLI will ask you for some parameters.

      4. Accept the default values by pressing Enter for each question.
      5. The next step is to modify amplifyconfiguration.js.template (you can find it in the folder webapp/src) with the information collected from the output of the CloudFormation stack and save it as amplifyconfiguration.js. This file tells Amplify which endpoints to use to interact with the backend resources created for this application. The information required is as follows:
        1. aws_project_region and aws_cognito_region – To be filled in with the Region in which you ran the CloudFormation template (for example, us-east-1).
        2. aws_cognito_identity_pool_id, aws_user_pools_id, aws_user_pools_web_client_id – The values from the Outputs tab of the CloudFormation stack.
        3. Endpoint – In the API section, update the endpoint with the API Gateway URL listed on the Outputs tab of the CloudFormation stack.
      6. You now need to add a hosting option for the single-page application. You can use Amplify to configure and host the web application by issuing the following command:
amplify hosting add

The Amplify CLI will ask you which type of hosting service you prefer and what type of deployment.

      7. Answer both questions by accepting the default options (press the Enter key).
      8. You now need to install the JavaScript libraries used by this application using npm:
npm install
      9. Deploy the application using the following command:
amplify publish
      10. Confirm you want to proceed by entering Y.

At the end of the deployment phase, Amplify will return the public URL of the web application, similar to the following:

...
Find out more about deployment here:

https://cra.link/deployment

 Zipping artifacts completed.
 Deployment complete!
https://dev.xxx.amplifyapp.com

Now you can use your browser to connect to the application using the provided URL.

Clean up

To delete the resources used to build this solution, complete the following steps:

      1. Delete the Amplify application:
        1. Issue the following command:
amplify delete
        2. Confirm that you are willing to delete the application.
      2. Remove the backend resources:
        1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
        2. Select the stack and choose Delete.
        3. Choose Delete to confirm.

At the end of the deletion process, you should not see the entry related to asset-inventory on the list of stacks.

      3. Remove the Lambda layer by issuing the following command in the development environment:
aws lambda delete-layer-version --layer-name asset-inventory-blog --version-number 1
      4. If you created a new labeling workforce, remove it by using the following command:
aws sagemaker delete-workteam --workteam-name <the name you defined when you created the workteam>

Conclusion

In this post, we presented a solution that incorporates various AWS services to handle image storage (Amazon S3), mobile app development (Amplify), AI model hosting (Amazon Bedrock using Anthropic’s Claude), data verification (Amazon A2I), database (DynamoDB), and vector embeddings (Amazon Bedrock using Amazon Titan Multimodal Embeddings). It creates a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.

By taking advantage of the breadth of AWS services and integrating generative AI capabilities, this solution dramatically improves the efficiency and accuracy of asset inventory management processes. It reduces manual labor, accelerates data entry, and maintains high-quality inventory records, enabling organizations to optimize asset tracking and maintenance operations.

You can deploy this solution and immediately start collecting images of your assets to build or update your asset inventory.


About the authors


Federico D’Alessio is an AWS Solutions Architect and joined AWS in 2018. He is currently working in the Power and Utility and Transportation market. Federico is cloud addict and when not at work, he tries to reach clouds with his hang glider.


Leonardo Fenu is a Solutions Architect, who has been helping AWS customers align their technology with their business goals since 2018. When he is not hiking in the mountains or spending time with his family, he enjoys tinkering with hardware and software, exploring the latest cloud technologies, and finding creative ways to solve complex problems.


Elisabetta Castellano is an AWS Solutions Architect focused on empowering customers to maximize their cloud computing potential, with expertise in machine learning and generative AI. She enjoys immersing herself in cinema, live music performances, and books.


Carmela Gambardella is an AWS Solutions Architect since April 2018. Before AWS, Carmela has held various roles in large IT companies, such as software engineer, security consultant and solutions architect. She has been using her experience in security, compliance and cloud operations to help public sector organizations in their transformation journey to the cloud. In her spare time, she is a passionate reader, she enjoys hiking, traveling and playing yoga.

Read More

Clario enhances the quality of the clinical trial documentation process with Amazon Bedrock

Clario enhances the quality of the clinical trial documentation process with Amazon Bedrock

This post is co-written with Kim Nguyen and Shyam Banuprakash from Clario.

Clario is a leading provider of endpoint data solutions to the clinical trials industry, generating high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since Clario’s founding more than 50 years ago, the company’s endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces when supporting its clients is the time-consuming process of generating documentation for clinical trials, which can take weeks.

The business challenge

When medical imaging analysis is part of a clinical trial it is supporting, Clario prepares a medical imaging charter process document that outlines the format and requirements of the central review of clinical trial images (the Charter). Based on the Charter, Clario’s imaging team creates several subsequent documents (as shown in the following figure), including the business requirement specification (BRS), training slides, and ancillary documents. The content of these documents is largely derived from the Charter, with significant reformatting and rephrasing required. This process is time-consuming, can be subject to inadvertent manual error, and carries the risk of inconsistent or redundant information, which can delay or otherwise negatively impact the clinical trial.

Document Flow

Clario’s imaging team recognized the need to modernize the document generation process and streamline the processes used to create end-to-end document workflows. Clario engaged with their AWS account team and AWS Generative AI Innovation Center to explore how generative AI could help streamline the process.

The solution

The AWS team worked closely with Clario to develop a prototype solution that uses AWS AI services to automate the BRS generation process. The solution involves the following key services:

  • Amazon Simple Storage Service (Amazon S3): A scalable object storage service used to store the charter-derived and generated BRS documents.
  • Amazon OpenSearch Serverless: An on-demand serverless configuration for Amazon OpenSearch Service used as a vector store.
  • Amazon Bedrock: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG) and build agents that execute tasks using your enterprise systems and data sources.

The solution is shown in the following figure:

Solution Overview

Architecture walkthrough

  1. Charter-derived documents are processed in an on-premises script in preparation for uploading.
  2. Files are sent to AWS using AWS Direct Connect.
  3. The script chunks the documents and calls an embedding model to produce the document embeddings. It then stores the embeddings in an OpenSearch vector database for retrieval by our application. Clario uses an Amazon Titan Text Embeddings model offered by Amazon Bedrock, invoking it once per chunk.
  4. Amazon OpenSearch Serverless is used as the durable vector store. Document chunk embeddings are stored in an OpenSearch vector index, which enables the application to search for the most semantically relevant documents. Clario also stores attributes for the source document and associated trial to allow for a richer search experience. (A brief code sketch illustrating steps 3, 4, and 6 follows this list.)
  5. A custom-built user interface is the primary access point for users to access the system, initiate generation jobs, and interact with a chat UI. The UI is integrated with the workflow engine that manages the orchestration process.
  6. The workflow engine calls the Amazon Bedrock API and orchestrates the business requirement specification document generation process. The engine:
    • Uses a global specification that stores the prompts to be used as input when calling the large language model.
    • Queries OpenSearch for the relevant Imaging charter.
    • Loops through every business requirement.
    • Calls the Claude 3.7 Sonnet large language model from Amazon Bedrock to generate responses.
  7. The engine outputs the business requirement specification document to the user interface, where a business requirement writer can review the answers to produce a final document. Clario uses Claude 3.7 Sonnet from Amazon Bedrock for the question-answering and the conversational AI application.
  8. The final documents are written to Amazon S3 to be consumed and published by additional document workflows that will be built in the future.
  9. An as-needed AI chat agent allows document-based discovery and enables users to converse with one or more documents.
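
To make steps 3, 4, and 6 more concrete, the following minimal Python sketch shows one way to embed a document chunk with an Amazon Titan Text Embeddings model on Amazon Bedrock, store it in an OpenSearch Serverless vector index, and generate text with Anthropic’s Claude through the Bedrock Converse API. This is an illustrative sketch, not Clario’s implementation: the collection endpoint, index name, field names, and exact model IDs are assumptions, and the vector index is assumed to already exist.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "us-east-1"                                         # assumption
HOST = "your-collection-id.us-east-1.aoss.amazonaws.com"     # hypothetical collection endpoint
INDEX = "charter-chunks"                                     # hypothetical vector index

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
aoss = OpenSearch(hosts=[{"host": HOST, "port": 443}], http_auth=auth,
                  use_ssl=True, verify_certs=True,
                  connection_class=RequestsHttpConnection)

def embed(text: str) -> list:
    # Step 3: call an Amazon Titan Text Embeddings model (model ID is an assumption)
    response = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1",
                                    body=json.dumps({"inputText": text}))
    return json.loads(response["body"].read())["embedding"]

def index_chunk(chunk: str, source_document: str, trial_id: str) -> None:
    # Step 4: store the chunk, its embedding, and source attributes in the vector index
    aoss.index(index=INDEX, body={
        "chunk_text": chunk,
        "embedding": embed(chunk),
        "source_document": source_document,
        "trial_id": trial_id,
    })

def generate(prompt: str, context: str) -> str:
    # Step 6: call Claude on Amazon Bedrock through the Converse API (model ID is an assumption)
    response = bedrock.converse(
        modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",
        messages=[{"role": "user", "content": [{"text": f"{context}\n\n{prompt}"}]}],
    )
    return response["output"]["message"]["content"][0]["text"]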

Benefits and results

By using AWS AI services, Clario has streamlined the complicated BRS generation process significantly. The prototype solution demonstrated the following benefits:

  • Improved accuracy: The use of generative AI models minimized the risk of translation errors and inconsistencies, reducing the need for rework and study delays.
  • Scalability and flexibility: The serverless architecture provided by AWS services allows the solution to scale seamlessly as demand increases, while the modular design enables straightforward integration with other Clario systems.
  • Security: Clario’s data security strategy revolves around confining all its information within the secure AWS ecosystem using the security features of Amazon Bedrock. By keeping data isolated within the AWS infrastructure, Clario helps ensure protection against external threats and unauthorized access. This approach enables Clario to meet compliance requirements and provide clients with confidence in the confidentiality and integrity of their sensitive data.

Lessons learned

The successful implementation of this prototype solution reinforced the value of using generative AI models for domain-specific applications like those prevalent in the life sciences industry. It also highlighted the importance of involving business stakeholders early in the process and having a clear understanding of the business value to be realized. Following the success of this project, Clario is working to productionize the solution in their Medical Imaging business during 2025 to continue offering state-of-the-art services to its customers for best quality data and successful clinical trials.

Conclusion

The collaboration between Clario and AWS demonstrated the potential of AWS AI and machine learning (AI/ML) services and generative AI models, such as Anthropic’s Claude, to streamline document generation processes in the life sciences industry and, specifically, for complicated clinical trial processes. By using these technologies, Clario was able to enhance and streamline the BRS generation process significantly, improving accuracy and scalability. As Clario continues to adopt AI/ML across its operations, the company is well-positioned to drive innovation and deliver better outcomes for its partners and patients.


About the Authors

Kim Nguyen serves as the Sr Director of Data Science at Clario, where he leads a team of data scientists in developing innovative AI/ML solutions for the healthcare and clinical trials industry. With over a decade of experience in clinical data management and analytics, Kim has established himself as an expert in transforming complex life sciences data into actionable insights that drive business outcomes. His career journey includes leadership roles at Clario and Gilead Sciences, where he consistently pioneered data automation and standardization initiatives across multiple functional teams. Kim holds a Master’s degree in Data Science and Engineering from UC San Diego and a Bachelor’s degree from the University of California, Berkeley, providing him with the technical foundation to excel in developing predictive models and data-driven strategies. Based in San Diego, California, he leverages his expertise to drive forward-thinking approaches to data science in the clinical research space.

Shyam Banuprakash serves as the Senior Vice President of Data Science and Delivery at Clario, where he leads complex analytics programs and develops innovative data solutions for the medical imaging sector. With nearly 12 years of progressive experience at Clario, he has demonstrated exceptional leadership in data-driven decision making and business process improvement. His expertise extends beyond his primary role, as he contributes his knowledge as an Advisory Board Member for both Modal and UC Irvine’s Customer Experience Program. Shyam holds a Master of Advanced Study in Data Science and Engineering from UC San Diego, complemented by specialized training from MIT in data science and big data analytics. His career exemplifies the powerful intersection of healthcare, technology, and data science, positioning him as a thought leader in leveraging analytics to transform clinical research and medical imaging.

John O’Donnell is a Principal Solutions Architect at Amazon Web Services (AWS) where he provides CIO-level engagement and design for complex cloud-based solutions in the healthcare and life sciences (HCLS) industry. With over 20 years of hands-on experience, he has a proven track record of delivering value and innovation to HCLS customers across the globe. As a trusted technical leader, he has partnered with AWS teams to dive deep into customer challenges, propose outcomes, and ensure high-value, predictable, and successful cloud transformations. John is passionate about helping HCLS customers achieve their goals and accelerate their cloud native modernization efforts.

Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS) where he provides expert guidance and architects secure, scalable cloud solutions for diverse enterprise customers. With nearly two decades of IT experience, including over ten years specializing in Cloud Computing, he has a proven track record of delivering transformative cloud implementations across multiple industries. As a trusted technical advisor, Praveen has successfully partnered with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. Praveen is passionate about solving complex business challenges through cutting-edge cloud architectures and helping organizations achieve successful digital transformations powered by artificial intelligence and machine learning technologies.

Read More

Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Organizations are constantly seeking ways to harness the power of advanced large language models (LLMs) to enable a wide range of applications such as text generation, summarization, question answering, and many others. As these models grow more powerful and capable, deploying them in production environments while optimizing performance and cost-efficiency becomes more challenging.

Amazon Web Services (AWS) provides highly optimized and cost-effective solutions for deploying AI models, like the Mixtral 8x7B language model, for inference at scale. AWS Inferentia and AWS Trainium are AWS AI chips, purpose-built to deliver high throughput and low latency inference and training performance for even the largest deep learning models. The Mixtral 8x7B model adopts the Mixture-of-Experts (MoE) architecture with eight experts. AWS Neuron—the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances—employs expert parallelism for MoE architecture, sharding the eight experts across multiple NeuronCores.

This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We’ll walk through model compilation using Hugging Face Optimum Neuron, which provides a set of tools enabling straightforward model loading, training, and inference, and the Text Generation Inference (TGI) Container, which has the toolkit for deploying and serving LLMs with Hugging Face. This will be followed by deployment to an Amazon SageMaker real-time inference endpoint, which automatically provisions and manages the Inferentia2 instances behind the scenes and provides a containerized environment to run the model securely and at scale.

While pre-compiled model versions exist, we’ll cover the compilation process to illustrate important configuration options and instance sizing considerations. This end-to-end guide combines Amazon Elastic Compute Cloud (Amazon EC2)-based compilation with SageMaker deployment to help you use Mixtral 8x7B’s capabilities with optimal performance and cost efficiency.

Step 1: Set up Hugging Face access

Before you can deploy the Mixtral 8x7B model, there are some prerequisites that you need to have in place.

  • The model is hosted on Hugging Face and uses their transformers library. To download and use the model, you need to authenticate with Hugging Face using a user access token. These tokens allow secure access for applications and notebooks to Hugging Face’s services. You first need to create a Hugging Face account if you don’t already have one, which you can then use to generate and manage your access tokens through the user settings.
  • The mistralai/Mixtral-8x7B-Instruct-v0.1 model that you will be working with in this post is a gated model. This means that you need to specifically request access from Hugging Face before you can download and work with the model.

Step 2: Launch an Inferentia2-powered EC2 Inf2 instance

To get started with an Amazon EC2 Inf2 instance for deploying the Mixtral 8x7B, either deploy the AWS CloudFormation template or use the AWS Management Console.

To launch an Inferentia2 instance using the console:

  1. Navigate to the Amazon EC2 console and choose Launch Instance.
  2. Enter a descriptive name for your instance.
  3. Under Application and OS Images, search for and select the Hugging Face Neuron Deep Learning AMI, which comes pre-configured with the Neuron software stack for AWS Inferentia.
  4. For Instance type, select inf2.24xlarge, which contains six Inferentia chips (12 NeuronCores).
  5. Create or select an existing key pair to enable SSH access.
  6. Create or select a security group that allows inbound SSH connections from the internet.
  7. Under Configure Storage, set the root EBS volume to 512 GiB to accommodate the large model size.
  8. After the settings are reviewed, choose Launch Instance.

With your Inf2 instance launched, connect to it over SSH by first locating the public IP or DNS name in the Amazon EC2 console. Later in this post, you will connect to a Jupyter notebook using a browser on port 8888. To do that, SSH tunnel to the instance using the key pair you configured during instance creation.

ssh -i "<pem file>" ubuntu@<instance DNS name> -L 8888:127.0.0.1:8888

After signing in, list the NeuronCores attached to the instance and their associated topology:

neuron-ls

For inf2.24xlarge, you should see the following output listing six Neuron devices:

instance-type: inf2.24xlarge
instance-id: i-...
+--------+--------+--------+-----------+---------+
| NEURON | NEURON | NEURON | CONNECTED |   PCI   |
| DEVICE | CORES  | MEMORY |  DEVICES  |   BDF   |
+--------+--------+--------+-----------+---------+
| 0      | 2      | 32 GB  | 1         | 10:1e.0 |
| 1      | 2      | 32 GB  | 0, 2      | 20:1e.0 |
| 2      | 2      | 32 GB  | 1, 3      | 10:1d.0 |
| 3      | 2      | 32 GB  | 2, 4      | 20:1f.0 |
| 4      | 2      | 32 GB  | 3, 5      | 10:1f.0 |
| 5      | 2      | 32 GB  | 4         | 20:1d.0 |
+--------+--------+--------+-----------+---------+

For more information on the neuron-ls command, see the Neuron LS User Guide.

Make sure the Inf2 instance is sized correctly to host the model. Each Inferentia NeuronCore processor contains 16 GB of high-bandwidth memory (HBM). To accommodate an LLM like the Mixtral 8x7B on AWS Inferentia2 (inf2) instances, a technique called tensor parallelism is used. This allows the model’s weights, activations, and computations to be split and distributed across multiple NeuronCores in parallel. To determine the degree of tensor parallelism required, you need to calculate the total memory footprint of the model. This can be computed as:

total memory = bytes per parameter * number of parameters

The Mixtral-8x7B model consists of 46.7 billion parameters. With weights cast to float16, you need 93.4 GB to store the model weights. The total space required is often greater than just the model parameters because of caching attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size. With a batch size of 1 and a sequence length of 1024 tokens, the total memory footprint for the caching is 0.5 GB. The exact formula can be found in the AWS Neuron documentation, and the hyper-parameter configuration required for these calculations is stored in the model config.json file.

Given that each NeuronCore has 16 GB of HBM, and the model requires approximately 94 GB of memory, a minimum tensor parallelism degree of 6 would theoretically suffice. However, with 32 attention heads, the tensor parallelism degree must be a divisor of this number.

Furthermore, considering the model’s size and the MoE implementation in transformers-neuronx, the supported tensor parallelism degrees are limited to 8, 16, and 32. For the example in this post, you will distribute the model across eight NeuronCores.
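
As a quick sanity check on this sizing argument, the following Python snippet reproduces the arithmetic above: model weights in float16 plus the KV-cache figure quoted from the post, compared against the 16 GB of HBM available per NeuronCore. The numbers come directly from the preceding paragraphs; only the code itself is new.

import math

num_parameters = 46.7e9          # Mixtral-8x7B parameter count
bytes_per_parameter = 2          # float16
weights_gb = num_parameters * bytes_per_parameter / 1e9   # ~93.4 GB
kv_cache_gb = 0.5                # batch size 1, sequence length 1024 (figure from the post)
total_gb = weights_gb + kv_cache_gb

hbm_per_core_gb = 16             # HBM per Inferentia2 NeuronCore
min_cores_by_memory = math.ceil(total_gb / hbm_per_core_gb)

print(f"Weights: ~{weights_gb:.1f} GB, total: ~{total_gb:.1f} GB")
print(f"Minimum cores by memory alone: {min_cores_by_memory}")   # 6
# The tensor parallelism degree must also divide the 32 attention heads and be a
# value supported by transformers-neuronx (8, 16, or 32), hence 8 cores are used.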

Compile Mixtral-8x7B model to AWS Inferentia2

The Neuron SDK includes a specialized compiler that automatically optimizes the model format for efficient execution on AWS Inferentia2.

  1. To start this process, launch the container and pass the Inferentia devices to the container. For more information about launching the neuronx-tgi container see Deploy the Text Generation Inference (TGI) Container on a dedicated host.
docker run -it --entrypoint /bin/bash \
  --net=host -v $(pwd):$(pwd) -w $(pwd) \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  ghcr.io/huggingface/neuronx-tgi:0.0.25
  2. Inside the container, sign in to the Hugging Face Hub to access gated models, such as the Mixtral-8x7B-Instruct-v0.1. See the earlier section, Set up Hugging Face access. Make sure to use a token with read and write permissions so you can later save the compiled model to the Hugging Face Hub.
huggingface-cli login --token hf_...
  3. After signing in, compile the model with optimum-cli. This process will download the model artifacts, compile the model, and save the results in the specified directory.
  4. The Neuron chips are designed to execute models with fixed input shapes for optimal performance. This requires that the compiled artifact shapes be known at compilation time. In the following command, you will set the batch size, input/output sequence length, data type, and tensor-parallelism degree (number of NeuronCores). For more information about these parameters, see Export a model to Inferentia.

Let’s discuss these parameters in more detail:

  • The parameter batch_size is the number of input sequences that the model will accept.
  • sequence_length specifies the maximum number of tokens in an input sequence. This affects memory usage and model performance during inference or training on Neuron hardware. A larger number will increase the model’s memory requirements because the attention mechanism needs to operate over the entire sequence, which leads to more computations and memory usage; while a smaller number will do the opposite. The value 1024 will be adequate for this example.
  • The auto_cast_type parameter controls precision casting. It allows type casting for model weights and computations during inference. The options are bf16, fp16, or tf32. For more information about defining which lower-precision data type the compiler should use, see Mixed Precision and Performance-accuracy Tuning. For models trained in float32, the 16-bit mixed precision options (bf16, fp16) generally provide sufficient accuracy while significantly improving performance. We use data type float16 with the argument auto_cast_type fp16.
  • The num_cores parameter controls the number of cores on which the model should be deployed. This will dictate the number of parallel shards or partitions the model is split into. Each shard is then executed on a separate NeuronCore, taking advantage of the 16 GB of high-bandwidth memory available per core. As discussed in the previous section, given the Mixtral-8x7B model’s requirements, Neuron supports tensor parallelism degrees of 8, 16, or 32. The inf2.24xlarge instance contains 12 Inferentia NeuronCores. Therefore, to optimally distribute the model, we set num_cores to 8.
optimum-cli export neuron \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --batch_size 1 \
  --sequence_length 1024 \
  --auto_cast_type fp16 \
  --num_cores 8 \
  ./neuron_model_path
  5. Download and compilation should take 10–20 minutes. After the compilation completes successfully, you can check the artifacts created in the output directory:
neuron_model_path
├── compiled
│ ├── 2ea52780bf51a876a581.neff
│ ├── 3fe4f2529b098b312b3d.neff
│ ├── ...
│ ├── ...
│ ├── cfda3dc8284fff50864d.neff
│ └── d6c11b23d8989af31d83.neff
├── config.json
├── generation_config.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
  6. Push the compiled model to the Hugging Face Hub with the following command. Make sure to change <user_id> to your Hugging Face username. If the model repository doesn’t exist, it will be created automatically. Alternatively, store the model on Amazon Simple Storage Service (Amazon S3).

huggingface-cli upload <user_id>/Mixtral-8x7B-Instruct-v0.1 ./neuron_model_path ./

Deploy Mixtral-8x7B SageMaker real-time inference endpoint

Now that the model has been compiled and stored, you can deploy it for inference using SageMaker. To orchestrate the deployment, you will run Python code from a notebook hosted on an EC2 instance. You can use the instance created in the first section or create a new instance. Note that this EC2 instance can be of any type (for example t2.micro with an Amazon Linux 2023 image). Alternatively, you can use a notebook hosted in Amazon SageMaker Studio.

Set up AWS authorization for SageMaker deployment

You need AWS Identity and Access Management (IAM) permissions to manage SageMaker resources. If you created the instance with the provided CloudFormation template, these permissions are already created for you. If not, the following section takes you through the process of setting up the permissions for an EC2 instance to run a notebook that deploys a real-time SageMaker inference endpoint.

Create an AWS IAM role and attach SageMaker permission policy

  1. Go to the IAM console.
  2. Choose the Roles tab in the navigation pane.
  3. Choose Create role.
  4. Under Select trusted entity, select AWS service.
  5. Choose Use case and select EC2.
  6. Select EC2 (Allows EC2 instances to call AWS services on your behalf.)
  7. Choose Next: Permissions.
  8. In the Add permissions policies screen, select AmazonSageMakerFullAccess and IAMReadOnlyAccess. Note that the AmazonSageMakerFullAccess permission is overly permissive. We use it in this example to simplify the process but recommend applying the principle of least privilege when setting up IAM permissions.
  9. Choose Next: Review.
  10. In the Role name field, enter a role name.
  11. Choose Create role to complete the creation.
  12. With the role created, choose the Roles tab in the navigation pane and select the role you just created.
  13. Choose the Trust relationships tab and then choose Edit trust policy.
  14. Choose Add next to Add a principal.
  15. For Principal type, select AWS services.
  16. Enter sagemaker.amazonaws.com and choose Add a principal.
  17. Choose Update policy. Your trust relationship should look like the following:
{
    "Version": "2012-10-17",
    "Statement": [
    {
        "Effect": "Allow",
        "Principal": {
            "Service": [
                "ec2.amazonaws.com",
                "sagemaker.amazonaws.com"
            ]
        },
        "Action": "sts:AssumeRole"
        }
    ]
}

Attach the IAM role to your EC2 instance

  1. Go to the Amazon EC2 console.
  2. Choose Instances in the navigation pane.
  3. Select your EC2 instance.
  4. Choose Actions, Security, and then Modify IAM role.
  5. Select the role you created in the previous step.
  6. Choose Update IAM role.

Launch a Jupyter notebook

Your next goal is to run a Jupyter notebook hosted in a container running on the EC2 instance. The notebook will be run using a browser on port 8888 by default. For this example, you will use SSH port forwarding from your local machine to the instance to access the notebook.

  1. Continuing from the previous section, you are still within the container. The following steps install Jupyter Notebook:
pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python Neuronx"
pip install jupyter notebook
pip install environment_kernels
  2. Launch the notebook server using:
jupyter notebook
  3. Then connect to the notebook in your browser over the SSH tunnel:

http://localhost:8888/tree?token=…

If you get a blank screen, try opening this address using your browser’s incognito mode.

Deploy the model for inference with SageMaker

After connecting to Jupyter Notebook, follow this notebook. Alternatively, choose File, New Notebook, and then select Python 3 as the kernel. Use the following instructions and run the notebook cells.

  1. In the notebook, install the sagemaker and huggingface_hub libraries.
!pip install sagemaker huggingface_hub
  2. Next, get a SageMaker session and execution role that will allow you to create and manage SageMaker resources. You’ll use a Deep Learning Container.
import os
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
print(f"sagemaker role arn: {role}")

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
	"huggingface-neuronx",
	version="0.0.25"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
  3. Deploy the compiled model to a SageMaker real-time endpoint on AWS Inferentia2.

Change user_id in the following code to your Hugging Face username. Make sure to update HF_MODEL_ID and HUGGING_FACE_HUB_TOKEN with your Hugging Face username and your access token.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.24xlarge"
health_check_timeout=2400 # additional time to load the model
volume_size=512 # size in GB of the EBS volume

# Define Model and Endpoint configuration parameter
config = {
	"HF_MODEL_ID": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace with your model id if you are using your own model
	"HF_NUM_CORES": "4", # number of neuron cores
	"HF_AUTO_CAST_TYPE": "fp16",  # dtype of the model
	"MAX_BATCH_SIZE": "1", # max batch size for the model
	"MAX_INPUT_LENGTH": "1000", # max length of input text
	"MAX_TOTAL_TOKENS": "1024", # max length of generated text
	"MESSAGES_API_ENABLED": "true", # Enable the messages API
	"HUGGING_FACE_HUB_TOKEN": "hf_..." # Add your Hugging Face token here
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
	role=role,
	image_uri=llm_image,
	env=config
)
  4. You’re now ready to deploy the model to a SageMaker real-time inference endpoint. SageMaker will provision the necessary compute resources and retrieve and launch the inference container. This downloads the model artifacts from your Hugging Face repository, loads the model onto the Inferentia devices, and starts inference serving. This process can take several minutes.
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

llm_model._is_compiled_model = True # We precompiled the model

llm = llm_model.deploy(
	initial_instance_count=1,
	instance_type=instance_type,
	container_startup_health_check_timeout=health_check_timeout,
	volume_size=volume_size
)
  5. Next, run a test to check the endpoint. Update user_id to match your Hugging Face username, then create the prompt and parameters.
# Prompt to generate
messages=[
	{ "role": "system", "content": "You are a helpful assistant." },
	{ "role": "user", "content": "What is deep learning?" }
]

# Generation arguments
parameters = {
	"model": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace user_id
	"top_p": 0.6,
	"temperature": 0.9,
	"max_tokens": 1000,
}
  6. Send the prompt to the SageMaker real-time endpoint for inference:
chat = llm.predict({"messages" :messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())
  7. In the future, if you want to connect to this inference endpoint from other applications, first find the name of the inference endpoint. Alternatively, you can use the SageMaker console and choose Inference, and then Endpoints to see a list of the SageMaker endpoints deployed in your account.
endpoints = sess.sagemaker_client.list_endpoints()

for endpoint in endpoints['Endpoints']:
	print(endpoint['EndpointName'])
  8. Use the endpoint name to update the following code, which can also be run in other locations.
from sagemaker.huggingface import HuggingFacePredictor

endpoint_name="endpoint_name..."

llm = HuggingFacePredictor(
	endpoint_name=endpoint_name,
	sagemaker_session=sess
)
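
Once the predictor is reattached, it can be used exactly like the one returned by deploy(). A brief usage sketch, reusing the llm predictor created in the previous cell and the messages API format shown earlier (replace user_id with your Hugging Face username):

chat = llm.predict({
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    "model": "user_id/Mixtral-8x7B-Instruct-v0.1",  # replace user_id
    "max_tokens": 256,
})
print(chat["choices"][0]["message"]["content"].strip())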

Clean up

Delete the endpoint to prevent future charges for the provisioned resources.

llm.delete_model()
llm.delete_endpoint()

Conclusion

In this post, we covered how to compile and deploy the Mixtral 8x7B language model on AWS Inferentia2 using the Hugging Face Optimum Neuron container and Amazon SageMaker. AWS Inferentia2 offers a cost-effective solution for hosting models like Mixtral, providing high-performance inference at a lower cost.

For more information, see Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum.

For other methods to compile and run Mixtral inference on Inferentia2 and Trainium see the Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 tutorial located in the AWS Neuron Documentation and Notebook.


About the authors

Headshot of Lior Sadan (author)Lior Sadan is a Senior Solutions Architect at AWS, with an affinity for storage solutions and AI/ML implementations. He helps customers architect scalable cloud systems and optimize their infrastructure. Outside of work, Lior enjoys hands-on home renovation and construction projects.

Headshot of Stenio de Lima Ferreira (author)Stenio de Lima Ferreira is a Senior Solutions Architect passionate about AI and automation. With over 15 years of work experience in the field, he has a background in cloud infrastructure, devops and data science. He specializes in codifying complex requirements into reusable patterns and breaking down difficult topics into accessible content.

Read More

Elevate business productivity with Amazon Q and Amazon Connect

Elevate business productivity with Amazon Q and Amazon Connect

Modern banking faces dual challenges: delivering rapid loan processing while maintaining robust security against sophisticated fraud. Amazon Q Business provides AI-driven analysis of regulatory requirements and lending patterns. Additionally, you can now report fraud from the same interface with a custom plugin capability that can integrate with Amazon Connect. This fusion of technology transforms traditional lending by enabling faster processing times, stronger fraud prevention, and a seamless user experience.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business provides plugins to interact with popular third-party applications, such as Jira, ServiceNow, Salesforce, PagerDuty, and more. Administrators can enable these plugins with a ready-to-use library of over 50 actions to their Amazon Q Business application. Where pre-built plugins are not available, Amazon Q Business provides capabilities to build custom plugins to integrate with your application. Plugins help streamline tasks and boost productivity by integrating external services into the Amazon Q Business chat interface.

Amazon Connect is an AI-powered application that provides one seamless experience for your contact center customers and users. It’s comprised of a full suite of features across communication channels. Amazon Connect Cases, a feature of Amazon Connect, allows your agents to track and manage customer issues that require multiple interactions, follow-up tasks, and teams in your contact center. Agents can document customer issues with the relevant case details, such as date/time opened, issue summary, customer information, and status, in a single unified view.

The solution integrates with Okta Identity Management Platform to provide robust authentication, authorization, and single sign-on (SSO) capabilities across applications. Okta can support enterprise federation clients like Active Directory, LDAP, or Ping.

For loan approval officers reviewing mortgage applications, the seamless integration of Amazon Q Business directly into their primary workflow transforms the user experience. Rather than context-switching between applications, officers can harness the capabilities of Amazon Q to conduct research, analyze data, and report potential fraud cases within their mortgage approval interface.

In this post, we demonstrate how to elevate business productivity by leveraging Amazon Q to provide insights that enable research, data analysis, and report potential fraud cases within Amazon Connect.

Solution overview

The following diagram illustrates the solution architecture.

Architecture

The solution includes the following steps:

  1. Users in Okta are configured to be federated to AWS IAM Identity Center, and a unique ID (audience) is configured for Amazon API Gateway.
  2. When the user chooses to chat in the web application, the following flow is initiated:
    1. The Amazon Q Business application uses the client ID and client secret key to exchange the Okta-generated JSON Web Token (JWT) with IAM Identity Center. The token includes the AWS Security Token Service (AWS STS) context identity.
    2. A temporary token is issued to the application server to assume the role and access the Amazon Q Business API.
  3. The Amazon Q Business application fetches information from the Amazon Simple Storage Service (Amazon S3) data source to answer questions or generate summaries.
  4. The Amazon Q custom plugin uses an Open API schema to discover and understand the capabilities of the API Gateway API.
  5. A client secret is stored in AWS Secrets Manager and the information is provided to the plugin.
  6. The plugin assumes the AWS Identity and Access Management (IAM) role with the kms:decrypt action to access the secrets in Secret Manager.
  7. When a user wants to send a case, the custom plugin invokes the API hosted on API Gateway.
  8. API Gateway uses the same Okta user’s session and authorizes the access.
  9. API Gateway invokes AWS Lambda to create a case in Amazon Connect (a minimal Lambda sketch follows this list).
  10. Lambda hosted in Amazon Virtual Private Cloud (Amazon VPC) internally calls the Amazon Connect API using an Amazon Connect VPC interface endpoint powered by AWS PrivateLink.
  11. The contact center agents can also use Amazon Q in Connect to further assist the user.
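
To illustrate steps 9 and 10, here is a minimal sketch of what the Lambda function behind API Gateway could look like, using the Amazon Connect Cases API to create a case. This is not the code deployed by the CloudFormation template in this post: the domain ID, template ID, field IDs, and request body shape are placeholders that would come from your own Amazon Connect Cases configuration.

import json
import os
import uuid
import boto3

# The connectcases API call is resolved through the Amazon Connect VPC interface
# endpoint when the function runs inside the VPC described in step 10.
cases = boto3.client("connectcases")

def lambda_handler(event, context):
    body = json.loads(event.get("body", "{}"))
    response = cases.create_case(
        clientToken=str(uuid.uuid4()),                   # idempotency token
        domainId=os.environ["CASES_DOMAIN_ID"],          # hypothetical environment variable
        templateId=os.environ["CASES_TEMPLATE_ID"],      # hypothetical environment variable
        fields=[
            {"id": "title", "value": {"stringValue": body.get("title", "Fraud report")}},
            {"id": "summary", "value": {"stringValue": body.get("summary", "")}},
        ],
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"caseId": response["caseId"]}),
    }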

Prerequisites

The following prerequisites need to be met before you can build the solution:

  • Have a valid AWS account.
  • Have an Amazon Q Business Pro subscription to create Amazon Q applications.
  • Have the service-linked IAM role AWSServiceRoleForQBusiness. If you don’t have one, create it with the qbusiness.amazonaws.com service name.
  • Have an IAM role in the account that will allow the AWS CloudFormation template to create new roles and add policies. If you have administrator access to the account, no action is required.
  • Enable logging in AWS CloudTrail for operational and risk auditing.

Okta prerequisites:

  1. Have an Okta developer account and set up an application and API. If you don’t have an Okta account, see the following instructions.

Set up an application and API in Okta

Complete the following steps to set up an application and API in Okta:

  1. Log in to the Okta console.
  2. Provide credentials and choose Login.
  3. Choose Continue with Google.
  4. You might need to set up multi-factor authentication following the instructions on the page.
  5. Log in using the authentication code.
  6. In the navigation pane, choose Applications and choose Create App Integration.

Okta Developer edition

  1. Select OIDC – OpenID for Sign-in method and Web Application for Application type, then choose Next.

Create new app integration

  1. For App integration name, enter a name (for example, myConnectApp).
  2. Select Authorization Code and Refresh Token for Grant type.
  3. Select Skip group assignment for now for Control Access.
  4. Choose Save to create an application.
  5. Take note of the client ID and secret.

Add an authorization server and metadata

  1. In the navigation pane, choose Security, then choose API.
  2. Choose Add Authorization Server, provide the necessary details, and choose Save.

Add authorization server

  1. Take note of the Audience value and choose Metadata URI.

Audience is provided as an input to the CloudFormation template later in the section.

add audience and metadata url

The response will provide the metadata.

  1. From the response, take note of the following:
    • issuer
    • authorization_endpoint
    • token_endpoint
  2. Under Scopes, choose Add Scope, provide the name write/tasks, and choose Create.

Add scope

  1. On the Access Policies tab, choose Add Policy.
  2. Provide a name and description.
  3. Select The following clients, enter my in the text box, and choose the application you created earlier.
  4. Choose Create Policy to add a policy.

add policy

  1. Choose Add Rule to add a rule and select only Authorization Code for Grant type is.
  2. For Scopes requested, select The following scopes, then enter write in the text box and select the write/tasks scope.
  3. Adjust the Access token lifetime is and Refresh token lifetime is values (in minutes) as needed.
  4. Set but will expire if not used every to 5 minutes.
  5. Choose Create rule to create the rule.

Add rule

Add users

  1. In the navigation pane, choose Directory and choose People.
  2. Choose Add person.

add person

  1. Complete the fields:
    1. First name
    2. Last name
    3. Username (use the same as the primary email)
    4. Primary email
  2. Select Send user activation email now.
  3. Choose Save to save the user.

add and save person

  1. You will receive an email. Choose the link in the email to activate the user.
  2. Choose Groups, then choose Add group to add the group.
  3. Provide a name and optional description.
  4. Refresh the page and choose the newly created group.
  5. Choose Assign people to assign users.
  6. Add the newly created user by choosing the plus sign next to the user.

assign person

  1. Under Applications, select the application name created earlier.
  2. On the Assignments tab, choose Assign to People.

assign app to people

  1. Select the user and choose Assign.
  2. Choose Done to complete the assignment.

assign user

Set up Okta as an identity source in IAM Identity Center

Complete the following steps to set up Okta as an identity source:

  1. Enable an IAM Identity Center instance.
  2. Configure SAML and SCIM with Okta and IAM Identity Center.
  3. On the IAM Identity Center console, navigate to the instance.
  4. Under Settings, copy the value Instance ARN. You will need it when you run the CloudFormation template.

Deploy resources using AWS CloudFormation

In this step, we use a CloudFormation template to deploy a Lambda function, configure the REST API, and create identities. Complete the following steps:

  1. Open the AWS CloudFormation console in the us-east-1 AWS Region.
  2. Choose Create stack.
  3. Download the CloudFormation template and upload it in the Specify template section.
  4. Choose Next.
  5. For Stack name, enter a name (for example, QIntegrationWithConnect).
  6. In the Parameters section, provide values for the following:
    1. Audience
    2. AuthorizationUrl
    3. ClientId
    4. ClientSecret
    5. IdcInstanceArn
    6. Issuer
    7. TokenUrl

Add parameters to CloudFormation

  1. Choose Next.
  2. Keep the other values as default and select I acknowledge that AWS CloudFormation might create IAM resources in the Capabilities section.
  3. Select I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND in the Capabilities section.
  4. Choose Submit to create the CloudFormation stack.
  5. After the successful deployment of the stack, on the Outputs tab, note the value for ALBDNSName.

The CloudFormation template does not deploy certificates for Application Load Balancer. We strongly recommend creating a secure listener for the Application Load Balancer and deploying at least one certificate.

Assign user to Amazon Q Application

  1. On the Amazon Q Business console, navigate to the application named qbusiness-connect-case.
  2. Under User Access, choose Manage user access.
  3. On the Users tab, choose Add groups and users and search for the user you created in Okta and propagated to IAM Identity Center.
  4. Choose Assign and Done.

Add Q users

  1. Choose Confirm to confirm the subscription.
  2. Copy the link for Deployed URL.

Q URL

  1. Create a callback URL: <Deployed URL>/oauth/callback.

We recommend that you enable a budget policy notification to prevent unwanted billing.

Configure login credentials for the web application

Complete the following steps to configure login credentials for the web application:

  1. Navigate to the Okta developer login.
  2. Under Applications, choose the web application myConnectApp created earlier.
  3. Choose Edit in the General Settings section.
  4. Enter the callback URL for Sign-in redirect URIs.
  5. Choose Save.

Q Redirect URL

Sync the knowledge base

Complete the following steps to sync your knowledge base:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Search for AmazonQDataSourceBucket and choose the bucket.
  3. Download the sample AnyBank regulations document.
  4. Upload the PDF file to the S3 bucket.
  5. On the Amazon Q Business console, navigate to the Amazon Q Business application.
  6. In the Data sources section, select the data source.
  7. Choose Sync now to sync the data source.

Q data source

Embed the web application

Complete the following steps to embed the web application:

  1. On the Amazon Q Business console, under Enhancements, choose Amazon Q embedded.
  2. Choose Add allowed website.
  3. For Enter website URL, enter http://<ALBDNSName>.

Test the solution

Complete the following steps to test the solution:

  1. Copy the ALBDNSName value from the outputs section of the CloudFormation stack and open it in a browser.

You will see an AnyBank website.

anybank portal page

  1. Choose Chat with us and the Okta sign-in page will pop up.
  2. Provide the sign-in details.

Okta single sign-on

  1. Upon verification, close the browser tab.
  2. Navigate to the Amazon Q Business application in the chat window.
  3. In the chat window, enter “What are the Fraud Detection and Prevention Measures?”

Amazon Q Business will provide the answers from the knowledge base.

Next, let’s assume that you detected a fraud and want to create a case.

  1. Choose the plugin CreateCase and ask the question, “Can you create a case reporting fraud?”

create case

Amazon Q Business generates the title of the case based on the question.

Create case custom plugin submission page

  1. Choose Submit.
  2. If Amazon Q Business asks you to authorize your access, choose Authorize.

The CreateCase plugin will create a case in Amazon Connect.

  1. Navigate to Amazon Connect and open the access URL in a browser.
  2. Provide the user name admin and retrieve the password from the AWS Systems Manager Parameter Store.

Connect login page

  1. Choose Agent Workspace.

Agent workspace in Amazon Connect

You can see the case that was created by Amazon Q Business using the custom plugin.

Case in Amazon Connect

Clean up

To avoid incurring future charges, delete the resources that you created and clean up your account:

  1. Empty the contents of the S3 buckets you created as part of the CloudFormation stack.
  2. Delete the CloudFormation stack you created as part of this post.
  3. Disable the application from IAM Identity Center.

Conclusion

As businesses navigate the ever-changing corporate environment, the combination of Amazon Q Business and Amazon Connect emerges as a transformative approach to optimizing employee assistance and operational effectiveness. Harnessing the capabilities of AI-powered assistants and advanced contact center tools, organizations can empower their teams to access data, initiate support requests, and collaborate cohesively through a unified solution. This post showcased a banking portal, but this can be used for other industrial sectors or organizational verticals.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.


About the Authors

Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.

Dr Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Read More

Build multi-agent systems with LangGraph and Amazon Bedrock

Build multi-agent systems with LangGraph and Amazon Bedrock

Large language models (LLMs) have raised the bar for human-computer interaction where the expectation from users is that they can communicate with their applications through natural language. Beyond simple language understanding, real-world applications require managing complex workflows, connecting to external data, and coordinating multiple AI capabilities. Imagine scheduling a doctor’s appointment where an AI agent checks your calendar, accesses your provider’s system, verifies insurance, and confirms everything in one go—no more app-switching or hold times. In these real-world scenarios, agents can be a game changer, delivering more customized generative AI applications.

LLM agents serve as decision-making systems for application control flow. However, these systems face several operational challenges during scaling and development. The primary issues include tool selection inefficiency, where agents with access to numerous tools struggle with optimal tool selection and sequencing, context management limitations that prevent single agents from effectively managing increasingly complex contextual information, and specialization requirements as complex applications demand diverse expertise areas such as planning, research, and analysis. The solution lies in implementing a multi-agent architecture, which involves decomposing the main system into smaller, specialized agents that operate independently. Implementation options range from basic prompt-LLM combinations to sophisticated ReAct (Reasoning and Acting) agents, allowing for more efficient task distribution and specialized handling of different application components. This modular approach enhances system manageability and allows for better scaling of LLM-based applications while maintaining functional efficiency through specialized components.

This post demonstrates how to integrate open-source multi-agent framework, LangGraph, with Amazon Bedrock. It explains how to use LangGraph and Amazon Bedrock to build powerful, interactive multi-agent applications that use graph-based orchestration.

AWS has introduced a multi-agent collaboration capability for Amazon Bedrock Agents, enabling developers to build, deploy, and manage multiple AI agents working together on complex tasks. This feature allows for the creation of specialized agents that handle different aspects of a process, coordinated by a supervisor agent that breaks down requests, delegates tasks, and consolidates outputs. This approach improves task success rates, accuracy, and productivity, especially for complex, multi-step tasks.

Challenges with multi-agent systems

In a single-agent system, planning involves the LLM agent breaking down tasks into a sequence of small tasks, whereas a multi-agent system must have workflow management involving task distribution across multiple agents. Unlike single-agent environments, multi-agent systems require a coordination mechanism where each agent must maintain alignment with others while contributing to the overall objective. This introduces unique challenges in managing inter-agent dependencies, resource allocation, and synchronization, necessitating robust frameworks that maintain system-wide consistency while optimizing performance.

Memory management in AI systems differs between single-agent and multi-agent architectures. Single-agent systems use a three-tier structure: short-term conversational memory, long-term historical storage, and external data sources like Retrieval Augmented Generation (RAG). Multi-agent systems require more advanced frameworks to manage contextual data, track interactions, and synchronize historical records across agents. These systems must handle real-time interactions, context synchronization, and efficient data retrieval, necessitating careful design of memory hierarchies, access patterns, and inter-agent sharing.

Agent frameworks are essential for multi-agent systems because they provide the infrastructure for coordinating autonomous agents, managing communication and resources, and orchestrating workflows. Agent frameworks alleviate the need to build these complex components from scratch.

LangGraph, part of LangChain, orchestrates agentic workflows through a graph-based architecture that handles complex processes and maintains context across agent interactions. It uses supervisory control patterns and memory systems for coordination.

LangGraph Studio enhances development with graph visualization, execution monitoring, and runtime debugging capabilities. The integration of LangGraph with Amazon Bedrock empowers you to take advantage of the strengths of multiple agents seamlessly, fostering a collaborative environment that enhances the efficiency and effectiveness of LLM-based systems.

Understanding LangGraph and LangGraph Studio

LangGraph implements state machines and directed graphs for multi-agent orchestration. The framework provides fine-grained control over both the flow and state of your agent applications. LangGraph models agent workflows as graphs. You define the behavior of your agents using three key components (a brief code sketch follows the lists below):

  • State – A shared data structure that represents the current snapshot of your application.
  • Nodes – Python functions that encode the logic of your agents.
  • Edges – Python functions that determine which Node to execute next based on the current state. They can be conditional branches or fixed transitions.
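
To make these three components concrete, the following minimal sketch (hypothetical names, not taken from this post's repository) defines a small two-node graph with a typed state and fixed edges:

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class TripState(TypedDict):
    destination: str
    itinerary: str

def suggest_destination(state: TripState):
    # A node returns a partial update that LangGraph merges into the state
    return {"destination": "Lisbon"}

def build_itinerary(state: TripState):
    return {"itinerary": f"5-day plan for {state['destination']}"}

builder = StateGraph(TripState)
builder.add_node("suggest_destination", suggest_destination)
builder.add_node("build_itinerary", build_itinerary)
builder.add_edge(START, "suggest_destination")        # fixed transition
builder.add_edge("suggest_destination", "build_itinerary")
builder.add_edge("build_itinerary", END)
graph = builder.compile()

print(graph.invoke({"destination": "", "itinerary": ""}))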

LangGraph implements a central persistence layer, enabling features that are common to most agent architectures, including:

  • Memory – LangGraph persists arbitrary aspects of your application’s state, supporting memory of conversations and other updates within and across user interactions.
  • Human-in-the-loop – Because state is checkpointed, execution can be interrupted and resumed, allowing for decisions, validation, and corrections at key stages through human input.
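
Continuing the sketch above, both features hinge on compiling the graph with a checkpointer. The snippet below is a hypothetical illustration of that pattern, not code from this post's repository:

from langgraph.checkpoint.memory import MemorySaver

# Persist state per thread and pause before "build_itinerary" so a human
# can inspect or correct the state before execution resumes
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["build_itinerary"],
)

config = {"configurable": {"thread_id": "trip-001"}}
graph.invoke({"destination": "", "itinerary": ""}, config)  # runs until the interrupt
print(graph.get_state(config).values)                       # inspect checkpointed state
graph.invoke(None, config)                                   # resume from the checkpoint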

LangGraph Studio is an integrated development environment (IDE) specifically designed for AI agent development. It provides developers with powerful tools for visualization, real-time interaction, and debugging capabilities. The key features of LangGraph Studio are:

  • Visual agent graphs – The IDE’s visualization tools allow developers to represent agent flows as intuitive graphs, making it straightforward to understand and modify complex system architectures.
  • Real-time debugging – The ability to interact with agents in real time and modify responses mid-execution creates a more dynamic development experience.
  • Stateful architecture – Support for stateful and adaptive agents within a graph-based architecture enables more sophisticated behaviors and interactions.

The following screenshot shows the nodes, edges, and state of a typical LangGraph agent workflow as viewed in LangGraph Studio.

Figure 1: LangGraph Studio UI

In the preceding example, execution begins at __start__ and ends at __end__. You define the nodes that invoke the model and tools, and the edges determine which paths the workflow can follow.

LangGraph Studio is available as a desktop application for macOS users. Alternatively, you can run a local in-memory development server that connects a local LangGraph application with a web version of the studio.

Solution overview

This example demonstrates the supervisor agentic pattern, where a supervisor agent coordinates multiple specialized agents. Each agent maintains its own scratchpad while the supervisor orchestrates communication and delegates tasks based on agent capabilities. This distributed approach improves efficiency by allowing agents to focus on specific tasks while enabling parallel processing and system scalability.

Let’s walk through an example with the following user query: “Suggest a travel destination and search flight and hotel for me. I want to travel on 15-March-2025 for 5 days.” The workflow consists of the following steps:

  1. The Supervisor Agent receives the initial query and breaks it down into sequential tasks:
    1. Destination recommendation required.
    2. Flight search needed for March 15, 2025.
    3. Hotel booking required for 5 days.
  2. The Destination Agent begins its work by accessing the user’s stored profile. It searches its historical database, analyzing patterns from similar user profiles to recommend the destination. Then it passes the destination back to the Supervisor Agent.
  3. The Supervisor Agent forwards the chosen destination to the Flight Agent, which searches available flights for the given date.
  4. The Supervisor Agent activates the Hotel Agent, which searches for hotels in the destination city.
  5. The Supervisor Agent compiles the recommendations into a comprehensive travel plan, presenting the user with a complete itinerary including destination rationale, flight options, and hotel suggestions.

The following figure shows a multi-agent workflow of how these agents connect to each other and which tools are involved with each agent.

Figure 2: Multi-agent workflow
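
As a rough, stubbed illustration of this coordination loop (the agent bodies and routing logic below are placeholders; the actual repository uses LLM-driven agents with tools), a LangGraph version of the supervisor pattern might look like the following:

from langgraph.graph import StateGraph, START, END, MessagesState

def supervisor(state: MessagesState):
    # Stub: the real supervisor calls an LLM to decide the next step
    return {"messages": [("ai", "Supervisor: delegating the next step")]}

def destination_agent(state: MessagesState):
    return {"messages": [("ai", "Recommended destination: Lisbon")]}

def flight_agent(state: MessagesState):
    return {"messages": [("ai", "Found flights for 15-March-2025")]}

def hotel_agent(state: MessagesState):
    return {"messages": [("ai", "Found hotels for a 5-night stay")]}

def route(state: MessagesState) -> str:
    # Stub routing: visit each specialist once, then finish; the real
    # supervisor decides based on the conversation so far
    order = ["destination_agent", "flight_agent", "hotel_agent"]
    visits = (len(state["messages"]) - 1) // 2  # completed supervisor/agent turns
    return order[visits] if visits < len(order) else END

builder = StateGraph(MessagesState)
builder.add_node("supervisor", supervisor)
builder.add_node("destination_agent", destination_agent)
builder.add_node("flight_agent", flight_agent)
builder.add_node("hotel_agent", hotel_agent)
builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
for agent in ("destination_agent", "flight_agent", "hotel_agent"):
    builder.add_edge(agent, "supervisor")  # each specialist reports back
graph = builder.compile()

result = graph.invoke({"messages": [("user", "Plan a 5-day trip starting 15-March-2025")]})
for message in result["messages"]:
    print(message.content)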

Prerequisites

You will need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

Core components

Each agent is structured with two primary components:

  • graph.py – This script defines the agent’s workflow and decision-making logic. It implements the LangGraph state machine for managing agent behavior and configures the communication flow between different components. For example:
    • The Flight Agent’s graph manages the flow between chat and tool operations.
    • The Hotel Agent’s graph handles conditional routing between search, booking, and modification operations.
    • The Supervisor Agent’s graph orchestrates the overall multi-agent workflow.
  • tools.py – This script contains the concrete implementations of agent capabilities. It implements the business logic for each operation and handles data access and manipulation. It provides specific functionalities like:
    • Flight tools: search_flights, book_flights, change_flight_booking, cancel_flight_booking.
    • Hotel tools: suggest_hotels, book_hotels, change_hotel_booking, cancel_hotel_booking.

This separation between graph (workflow) and tools (implementation) allows for a clean architecture where the decision-making process is separate from the actual execution of tasks. The agents communicate through a state-based graph system implemented using LangGraph, where the Supervisor Agent directs the flow of information and tasks between the specialized agents.
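
For illustration, a tools.py entry might look like the following sketch; the function names mirror the list above, but the bodies here are stand-ins rather than the repository's actual implementations:

from langchain_core.tools import tool

@tool
def search_flights(origin: str, destination: str, date: str) -> str:
    """Search available flights for the given route and date."""
    # Stand-in for a real flight search API call
    return f"Found 3 flights from {origin} to {destination} on {date}"

@tool
def book_flights(flight_id: str, passenger_name: str) -> str:
    """Book the selected flight for the passenger."""
    # Stand-in for a real booking API call
    return f"Booked flight {flight_id} for {passenger_name}"

The Flight Agent's graph can then expose these tools to the model (for example, by binding them to the LLM and adding a tool-execution node), keeping the decision-making flow in graph.py separate from the business logic in tools.py.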

To set up Amazon Bedrock with LangGraph, refer to the following GitHub repo. The high-level steps are as follows:

  1. Install the required packages:
pip install boto3 langchain-aws

These packages are essential for Amazon Bedrock integration:

  • boto3: AWS SDK for Python, handles AWS service communication
  • langchain-aws: Provides LangChain integrations for AWS services
  2. Import the modules:
import boto3

from langchain_aws import ChatBedrockConverse
from langchain_aws import ChatBedrock

  3. Create an LLM object:
# Create a Bedrock runtime client in your chosen Region
bedrock_client = boto3.client("bedrock-runtime", region_name="<region_name>")

# Wrap the Amazon Bedrock Converse API in a LangChain chat model
llm = ChatBedrockConverse(
    model="anthropic.claude-3-haiku-20240307-v1:0",
    temperature=0,
    max_tokens=None,
    client=bedrock_client,
    # other params...
)
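
As an optional sanity check (not part of the repository's setup steps), you can invoke the model once before wiring it into a graph:

# Quick test that model access and AWS credentials are working
response = llm.invoke("Suggest one travel destination for a 5-day trip.")
print(response.content)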

LangGraph Studio configuration

This project uses a langgraph.json configuration file to define the application structure and dependencies. This file is essential for LangGraph Studio to understand how to run and visualize your agent graphs.

{
  "dependencies": [
    "boto3>=1.35.87",
    "langchain-aws>=0.2.10",
    "."
  ],
  "graphs": {
    "supervisor": "./src/supervisor_agent/graph.py:graph",
    "flight": "./src/flight_agent/graph.py:graph",
    "hotel": "./src/hotel_agent/graph.py:graph"
  },
  "env": "./.env"
}

LangGraph Studio uses this file to build and visualize the agent workflows, allowing you to monitor and debug the multi-agent interactions in real time.

Testing and debugging

You’re now ready to test the multi-agent travel assistant. You can start the graph using the langgraph dev command. It will start the LangGraph API server in development mode with hot reloading and debugging capabilities. As shown in the following screenshot, the interface provides a straightforward way to select which graph you want to test through the dropdown menu at the top left. The Manage Configuration button at the bottom lets you set up specific testing parameters before you begin. This development environment provides everything you need to thoroughly test and debug your multi-agent system with real-time feedback and monitoring capabilities.
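
Assuming the LangGraph CLI is installed with its local development extra (an assumption about your environment, not a step from the repository), starting the development server typically looks like this:

pip install -U "langgraph-cli[inmem]"
langgraph dev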

Figure 3: LangGraph Studio with Destination Agent recommendation

LangGraph Studio offers flexible configuration management through its intuitive interface. As shown in the following screenshot, you can create and manage multiple configuration versions (v1, v2, v3) for your graph execution. For example, in this scenario, we want to use user_id to fetch historical user information. This versioning system makes it simple to track and switch between different test configurations while debugging your multi-agent system.

Figure 4: Runnable configuration details

In the preceding example, we set up the user_id that tools can use to retrieve history or other details.

Let’s test the Planner Agent. This agent has the compare_and_recommend_destination tool, which can check past travel data and recommend travel destinations based on the user profile. We set user_id in the configuration so that it can be used by the tool.
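
The following is a hedged sketch of how a tool can read such run-scoped configuration; the tool body and the config values are hypothetical, not the repository's implementation:

from langchain_core.runnables import RunnableConfig
from langchain_core.tools import tool

@tool
def compare_and_recommend_destination(config: RunnableConfig) -> str:
    """Recommend a destination based on the user's travel history."""
    # RunnableConfig is injected at run time and hidden from the model
    user_id = config.get("configurable", {}).get("user_id")
    return f"Based on past trips for user {user_id}, we recommend Lisbon."

# Pass user_id (and a thread_id for checkpointing) when invoking the agent graph:
config = {"configurable": {"thread_id": "thread-1", "user_id": "user-42"}}
# destination_graph.invoke({"messages": [("user", "Where should I go?")]}, config)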

LangGraph has a concept of checkpointed memory that is managed using threads. The following screenshot shows that you can quickly manage threads in LangGraph Studio.

Figure 5: View graph state in the thread

In this example, destination_agent is using a tool; you can also check the tool’s output. Similarly, you can test flight_agent and hotel_agent to verify each agent.

When all the agents are working well, you’re ready to test the full workflow. You can evaluate the state and verify the input and output of each agent.

The following screenshot shows the full view of the Supervisor Agent with its sub-agents.

Figure 6: Supervisor Agent with complete workflow

Considerations

Multi-agent architectures must address agent coordination, state management, inter-agent communication, output consolidation, and guardrails, while maintaining processing context, handling errors, and orchestrating the overall workflow. Graph-based architectures offer significant advantages over linear pipelines, enabling complex workflows with nonlinear communication patterns and clearer system visualization. These structures allow for dynamic pathways and adaptive communication, ideal for large-scale deployments with simultaneous agent interactions. They excel in parallel processing and resource allocation but require sophisticated setup and might demand higher computational resources. Implementing these systems necessitates careful planning of system topology, robust monitoring, and well-designed fallback mechanisms for failed interactions.

When implementing multi-agent architectures in your organization, it’s crucial to align with your company’s established generative AI operations and governance frameworks. Prior to deployment, verify alignment with your organization’s AI safety protocols, data handling policies, and model deployment guidelines. Although this architectural pattern offers significant benefits, its implementation should be tailored to fit within your organization’s specific AI governance structure and risk management frameworks.

Clean up

Delete any IAM roles and policies created specifically for this post. Delete the local copy of this post’s code. If you no longer need access to an Amazon Bedrock FM, you can remove access to it. For instructions, see Add or remove access to Amazon Bedrock foundation models.

Conclusion

The integration of LangGraph with Amazon Bedrock significantly advances multi-agent system development by providing a robust framework for sophisticated AI applications. This combination uses LangGraph’s orchestration capabilities and FMs in Amazon Bedrock to create scalable, efficient systems. It addresses challenges in multi-agent architectures through state management, agent coordination, and workflow orchestration, offering features like memory management, error handling, and human-in-the-loop capabilities. LangGraph Studio’s visualization and debugging tools enable efficient design and maintenance of complex agent interactions. This integration offers a powerful foundation for next-generation multi-agent systems, providing effective workflow handling, context maintenance, reliable results, and optimal resource utilization.

For the example code and demonstration discussed in this post, refer to the accompanying GitHub repository. You can also refer to the following GitHub repo for Amazon Bedrock multi-agent collaboration code samples.


About the Authors

Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for generative AI to help customers and partners build generative AI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture, and ML applications.

Ajeet Tewari is a Senior Solutions Architect for Amazon Web Services. He works with enterprise customers to help them navigate their journey to AWS. His specialties include architecting and implementing scalable OLTP systems and leading strategic AWS initiatives.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
