Index your web crawled content using the new Web Crawler for Amazon Kendra

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.

Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to provide you with a fully managed experience and simplify the process of indexing your content from a variety of data sources in the enterprise.

Internal and external websites are one such unstructured data repository. Sites may need to be crawled to create news feeds, analyze language use, or build bots that answer questions based on website content.

We’re excited to announce the new Amazon Kendra Web Crawler, which you can use to index content stored in internal and external websites and search it with Amazon Kendra intelligent search, or to power chatbots. In this post, we show how to index information stored in websites and use Amazon Kendra to search for answers from that content. The ML-powered intelligent search can accurately answer questions from unstructured documents with natural language narrative content, for which keyword search is not very effective.

The Web Crawler offers the following new features:

  • Support for Basic, NTLM/Kerberos, Form, and SAML authentication
  • The ability to specify up to 100 seed URLs and to store the connection configuration in Amazon Simple Storage Service (Amazon S3)
  • Support for a web and internet proxy with the ability to provide proxy credentials
  • Support for crawling dynamic content, such as a website containing JavaScript
  • Field mapping and regex filtering features

Solution overview

With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a crawled website using the Amazon Kendra Web Crawler. The solution consists of the following steps:

  1. Choose an authentication mechanism for the website (if required) and store the details in AWS Secrets Manager.
  2. Create an Amazon Kendra index.
  3. Create a Web Crawler data source V2 via the Amazon Kendra console.
  4. Run a sample query to test the solution.
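If you prefer to script these steps rather than use the console, the same flow can be driven with the AWS SDK. The following sketch (using boto3; the index name and role ARN are placeholders you would supply) creates a Developer-edition index:

```python
def index_request(name: str, role_arn: str, edition: str = "DEVELOPER_EDITION") -> dict:
    """Build the CreateIndex request parameters."""
    if edition not in ("DEVELOPER_EDITION", "ENTERPRISE_EDITION"):
        raise ValueError(f"unknown edition: {edition}")
    return {"Name": name, "RoleArn": role_arn, "Edition": edition}

def create_index(name: str, role_arn: str) -> str:
    """Create the index and return its ID; creation can take up to 30 minutes."""
    import boto3  # requires AWS credentials at call time
    kendra = boto3.client("kendra")
    return kendra.create_index(**index_request(name, role_arn))["Id"]
```

CreateIndex is asynchronous, so poll DescribeIndex until the index status is ACTIVE before adding a data source.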

Prerequisites

To try out the Amazon Kendra Web Crawler, you need the following:

Gather authentication details

For protected and secure websites, the following authentication types and standards are supported:

  • Basic
  • NTLM/Kerberos
  • Form authentication
  • SAML

You need the authentication information when you set up the data source.

For Basic or NTLM authentication, you need to provide your Secrets Manager secret, user name, and password.

Form and SAML authentication require additional information, as shown in the following screenshot. Some fields, such as User name button XPath, are optional and depend on whether the site you’re crawling uses a button after entering the user name. You also need to know how to determine the XPath of the user name and password fields and the submit buttons.
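As a sketch of what the secret might contain, the following helper serializes the credential fields as a JSON string before storing them in Secrets Manager. The key names (userName, password, and the XPath fields) are illustrative assumptions; use the exact field names the connector console shows for your authentication type.

```python
import json

# Illustrative required fields per authentication type (assumed key names)
REQUIRED_FIELDS = {
    "basic": {"userName", "password"},
    "form": {"userName", "password", "userNameFieldXpath",
             "passwordFieldXpath", "passwordButtonXpath"},
}

def build_auth_secret(auth_type: str, **fields) -> str:
    """Serialize crawler credentials as the JSON string stored in the secret."""
    missing = REQUIRED_FIELDS[auth_type] - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return json.dumps(fields)

def store_secret(name: str, secret_string: str) -> str:
    """Create the secret and return its ARN (requires AWS credentials)."""
    import boto3
    return boto3.client("secretsmanager").create_secret(
        Name=name, SecretString=secret_string
    )["ARN"]
```

You then reference the secret's ARN when configuring the data source.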


Create an Amazon Kendra index

To create an Amazon Kendra index, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, Web Crawler).
  3. Enter an optional description.
  4. For Role name, enter an IAM role name.
  5. Configure optional encryption settings and tags.
  6. Choose Next.
  7. In the Configure user access control section, leave the settings at their defaults and choose Next.
  8. For Provisioning editions, select Developer edition and choose Next.
  9. On the review page, choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.


Create an Amazon Kendra Web Crawler data source

Complete the following steps to create your data source:

  1. On the Amazon Kendra console, choose Data sources in the navigation pane.
  2. Locate the WebCrawler connector V2.0 tile and choose Add connector.
  3. For Data source name, enter a name (for example, crawl-fda).
  4. Enter an optional description.
  5. Choose Next.
  6. In the Source section, select Source URL and enter a URL. For this post, we use https://www.fda.gov/ as an example source URL.
  7. In the Authentication section, choose the appropriate authentication based on the site you want to crawl. For this post, we select No authentication because it’s a public site and doesn’t need authentication.
  8. In the Web proxy section, you can specify a Secrets Manager secret (if required).
    1. Choose Create and Add New Secret.
    2. Enter the authentication details that you gathered previously.
    3. Choose Save.
  9. In the IAM role section, choose Create a new role and enter a name (for example, AmazonKendra-Web Crawler-datasource-role).
  10. Choose Next.
  11. In the Sync scope section, configure your sync settings based on the site you are crawling. For this post, we leave all the default settings.
  12. For Sync mode, choose how you want to update your index. For this post, we select Full sync.
  13. For Sync run schedule, choose Run on demand.
  14. Choose Next.
  15. Optionally, you can set field mappings. For this post, we keep the defaults for now.

Mapping fields is a useful exercise in which you can substitute field names with user-friendly values that fit your organization’s vocabulary.

  16. Choose Next.
  17. Choose Add data source.
  18. To sync the data source, choose Sync now on the data source details page.
  19. Wait for the sync to complete.
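The on-demand sync can also be triggered programmatically. A minimal sketch (assuming boto3 and existing index and data source IDs) starts a sync job and polls until it reaches a terminal status:

```python
import time

def run_sync(index_id: str, data_source_id: str, poll_seconds: int = 30) -> str:
    """Start an on-demand sync job and wait for a terminal status."""
    import boto3  # requires AWS credentials at call time
    kendra = boto3.client("kendra")
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id
    )["ExecutionId"]
    while True:
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # Find our job in the sync history and check whether it has finished
        job = next(j for j in history if j["ExecutionId"] == execution_id)
        if job["Status"] in ("SUCCEEDED", "FAILED", "ABORTED", "INCOMPLETE"):
            return job["Status"]
        time.sleep(poll_seconds)
```

Sync duration depends on the size of the site and the crawl scope you configured.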

Example of an authenticated website

If you want to crawl a site that requires authentication, specify the authentication details in the Authentication section described in the previous steps. The following example uses Form authentication.

  1. In the Source section, select Source URL and enter a URL. For this example, we use https://accounts.autodesk.com.
  2. In the Authentication section, select Form authentication.
  3. In the Web proxy section, specify your Secrets Manager secret. This is required for any option other than No authentication.
    1. Choose Create and Add New Secret.
    2. Enter the authentication details that you gathered previously.
    3. Choose Save.


Test the solution

Now that you have ingested the content from the site into your Amazon Kendra index, you can test some queries.

  1. Go to your index and choose Search indexed content.
  2. Enter a sample search query and test your search results (results will vary based on the content of the site you crawled and the query you enter).

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from the site you crawled.
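The same search is available through the Kendra Query API. A minimal sketch (assuming boto3 and your index ID) returns the top results:

```python
def search(index_id: str, query_text: str, top_n: int = 3) -> list:
    """Query the index and return (type, title, URI) tuples for the top results."""
    import boto3  # requires AWS credentials at call time
    kendra = boto3.client("kendra")
    result = kendra.query(IndexId=index_id, QueryText=query_text)
    return [
        (
            item["Type"],  # e.g., ANSWER or DOCUMENT
            item.get("DocumentTitle", {}).get("Text"),
            item.get("DocumentURI"),
        )
        for item in result["ResultItems"][:top_n]
    ]
```

This is useful when embedding search into an application rather than using the console's search page.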

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra Web Crawler V2, delete that data source.

Conclusion

With the new Amazon Kendra Web Crawler V2, organizations can crawl any website that is public or behind authentication and use it for intelligent search powered by Amazon Kendra.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the Author

Jiten Dedhia is a Sr. Solutions Architect with over 20 years of experience in the software industry. He has worked with global financial services clients, advising them on modernizing with AWS services.

Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.

New – No-code generative AI capabilities now available in Amazon SageMaker Canvas

Launched in 2021, Amazon SageMaker Canvas is a visual, point-and-click service that allows business analysts and citizen data scientists to use ready-to-use machine learning (ML) models and build custom ML models to generate accurate predictions without the need to write any code. Ready-to-use models enable you to derive immediate insights from text, image, and document data (such as sentiment analysis, document processing, or object detection in images). Custom models allow you to build predictive models for use cases such as demand forecasting, customer churn, and defect detection in manufacturing.

We are excited to announce that SageMaker Canvas is expanding its support of ready-to-use models to include foundation models (FMs), enabling you to use generative AI to generate and summarize content. You can use natural language with a conversational chat interface to perform tasks such as creating narratives, reports, and blog posts; answering questions; summarizing notes and articles; and explaining concepts, without writing a single line of code. Your data is not used to improve the base models, is not shared with third-party model providers, and stays entirely within your secure AWS environment.

SageMaker Canvas allows you to access a variety of FMs, including Amazon Bedrock models (such as Claude 2 from Anthropic and Jurassic-2 from AI21 Labs) and publicly available Amazon SageMaker JumpStart models (including Falcon-7B-Instruct, Falcon-40B-Instruct, and MPT-7B-Instruct). You can use a single model or up to three models to compare model responses side by side. In SageMaker Canvas, Amazon Bedrock models are always active, allowing you to use them instantly. SageMaker JumpStart models can be started and deployed in your AWS account on demand and are automatically shut down after two hours of inactivity.

Let’s explore how to use the generative AI capabilities of SageMaker Canvas. For this post, we work with a fictitious enterprise customer support use case as an example.

Prerequisites

Complete the following prerequisite steps:

  1. Create an AWS account.
  2. Set up SageMaker Canvas and optionally configure it to use a VPC without internet access.
  3. Set up model access in Amazon Bedrock.
  4. Request service quota increases for g5.12xlarge and g5.2xlarge, if required, in your Region. These instances are required to host the SageMaker JumpStart model endpoints. Other instances may be selected based on availability.
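Quota increases can also be requested through the Service Quotas API. In the following sketch, the quota code passed by the caller is a hypothetical placeholder; look up the real code for your instance type first (for example, via list_service_quotas):

```python
def request_quota_increase(quota_code: str, desired_value: float) -> str:
    """Request a SageMaker quota increase and return the request ID.

    quota_code must be looked up beforehand (e.g., via list_service_quotas);
    any value shown in examples here is an assumption, not a real code.
    """
    import boto3  # requires AWS credentials at call time
    sq = boto3.client("service-quotas")
    resp = sq.request_service_quota_increase(
        ServiceCode="sagemaker", QuotaCode=quota_code, DesiredValue=desired_value
    )
    return resp["RequestedQuota"]["Id"]
```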

Handling customer complaints

Let’s say that you’re a customer support analyst who handles complaints for a bicycle company. When receiving a customer complaint, you can use SageMaker Canvas to analyze the complaint and generate a personalized response to the customer. To do so, complete the following steps:

  1. On the SageMaker console, choose Canvas in the navigation pane.
  2. Choose your domain and user profile and choose Open Canvas to open the SageMaker Canvas application.

SageMaker Canvas is also accessible using single sign-on or other existing identity providers (IdPs) without having to first access the SageMaker console.

  3. Choose Generate, extract and summarize content to open the chat console.
  4. With the Claude 2 model selected, enter your instructions to retrieve the customer sentiment for the provided complaint and press Enter.
  5. You may want to know the specific problems with the bicycle, especially if it’s a long complaint, so ask for the problems with the bicycle. Note that you don’t have to repost the complaint, because SageMaker Canvas stores the context of your chat.

Now that you understand the customer’s problem, you can send them a response, including a link to the company’s feedback form.

  6. In the input window, request a response to the customer complaint.
  7. If you want to generate another response from the FM, choose the refresh icon in the response section.

The original response and all new responses are paginated within the response section. Note that the new response is different from the original response. You can choose the copy icon in the response section to copy the response to an email or document, as required.

  8. You can also modify the model’s response by requesting specific changes. For example, let’s ask the model to add a $50 gift card offer to the email response.

Comparing model responses

You can compare the model responses from multiple models (up to three). Let’s compare two Amazon Bedrock models (Claude 2 and Jurassic-2 Ultra) with a SageMaker JumpStart model (Falcon-7B-Instruct) to evaluate and find the best model for your use case:

  1. Choose New chat to open a chat interface.
  2. On the model drop-down menu, choose Start up another model.
  3. On the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Start up model.

The model will take around 10 minutes to start.

  4. On the Foundation models page, confirm that the Falcon-7B-Instruct model is active before proceeding to the next step.
  5. Choose New chat to open a chat interface.
  6. Choose Compare to display a drop-down menu for the second model, then choose Compare again to display a drop-down menu for the third model.
  7. Choose the Falcon-7B-Instruct model on the first drop-down menu, Claude 2 on the second, and Jurassic-2 Ultra on the third.
  8. Enter your instructions in the chat input box and press Enter.

You will see responses from all three models.

Clean up

Any SageMaker JumpStart models started from SageMaker Canvas will be automatically shut down after 2 hours of inactivity. If you want to shut down these models sooner to save costs, follow the instructions in this section. Note that Amazon Bedrock models are not deployed in your account, so there is no need to shut these down.

  1. To shut down the Falcon-7B-Instruct SageMaker JumpStart model, you can choose from two methods:
    1. On the results comparison page, choose the Falcon-7B-Instruct model’s options menu (three dots), then choose Shut down model.
    2. Alternatively, choose New chat, and on the model drop-down menu, choose Start up another model. Then, on the Foundation models page, under Amazon SageMaker JumpStart models, choose Falcon-7B-Instruct and in the right pane, choose Shut down model.
  2. Choose Log out in the left pane to log out of the SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources used by the workspace instance.

Conclusion

In this post, you learned how to use SageMaker Canvas to generate text with ready-to-use models from Amazon Bedrock and SageMaker JumpStart. You used the Claude 2 model to analyze the sentiment of a customer complaint, ask questions, and generate a response without a single line of code. You also started a publicly available model and compared responses from three models.

For Amazon Bedrock models, you are charged based on the volume of input tokens and output tokens as per the Amazon Bedrock pricing page. Because SageMaker JumpStart models are deployed on SageMaker instances, you are charged for the duration of usage based on the instance type as per the Amazon SageMaker pricing page.

SageMaker Canvas continues to democratize AI with a no-code visual, interactive workspace that allows business analysts to build ML models that address a wide variety of use cases. Try out the new generative AI capabilities in SageMaker Canvas today! These capabilities are available in all Regions where Amazon Bedrock or SageMaker JumpStart are available.


About the Authors

Anand Iyer has been a Principal Solutions Architect at AWS since 2016. Anand has helped global healthcare, financial services, and telecommunications clients architect and implement enterprise software solutions using AWS and hybrid cloud technologies. He has an MS in Computer Science from Louisiana State University Baton Rouge, and an MBA from USC Marshall School of Business, Los Angeles. He is AWS certified in the areas of Security, Solutions Architecture, and DevOps Engineering.

Gavin Satur is a Principal Solutions Architect at Amazon Web Services. He works with enterprise customers to build strategic, well-architected solutions and is passionate about automation. Outside of work, he enjoys family time, tennis, cooking, and traveling.

Gunjan Jain is an AWS Solutions Architect in SoCal and primarily works with large financial services companies. He helps with cloud adoption, cloud optimization, and adopting best practices for being Well-Architected on the cloud.

Harpreet Dhanoa, a seasoned Senior Solutions Architect at AWS, has a strong background in designing and building scalable distributed systems. He is passionate about machine learning, observability, and analytics. He enjoys helping large-scale customers build their cloud enterprise strategy and transform their business in AWS. In his free time, Harpreet enjoys playing basketball with his two sons and spending time with his family.

Whisper models for automatic speech recognition now available in Amazon SageMaker JumpStart

Today, we’re excited to announce that the OpenAI Whisper foundation model is available for customers using Amazon SageMaker JumpStart. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. SageMaker JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms and end-to-end solution templates to help you quickly get started with ML.

You can also perform ASR using Amazon Transcribe, a fully managed and continuously trained automatic speech recognition service.

In this post, we show you how to deploy the OpenAI Whisper model and invoke the model to transcribe and translate audio.

The OpenAI Whisper model uses the huggingface-pytorch-inference container. As a SageMaker JumpStart model hub customer, you can use ASR without having to maintain the model script outside of the SageMaker SDK. SageMaker JumpStart models also improve security posture with endpoints that enable network isolation.

Foundation models in SageMaker

SageMaker JumpStart provides access to a range of models from popular model hubs including Hugging Face, PyTorch Hub, and TensorFlow Hub, which you can use within your ML development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which are typically trained on billions of parameters and can be adapted to a wide category of use cases, such as text summarization, generating digital art, and language translation. Because these models are expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

You can now find foundation models from different model providers within SageMaker JumpStart, enabling you to get started with foundation models quickly. SageMaker JumpStart offers foundation models based on different tasks or model providers, and you can easily review model characteristics and usage terms. You can also try these models using a test UI widget. When you want to use a foundation model at scale, you can do so without leaving SageMaker by using pre-built notebooks from model providers. Because the models are hosted and deployed on AWS, you trust that your data, whether used for evaluating or using the model at scale, won’t be shared with third parties.

OpenAI Whisper foundation models

Whisper is a pre-trained model for ASR and speech translation. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, and others, from OpenAI. The original code can be found in this GitHub repository.

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680,000 hours of labeled speech data annotated using large-scale weak supervision. Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarized in the following table:

Model name Number of parameters Multilingual
whisper-tiny 39 M Yes
whisper-base 74 M Yes
whisper-small 244 M Yes
whisper-medium 769 M Yes
whisper-large 1550 M Yes
whisper-large-v2 1550 M Yes

Let’s explore how you can use Whisper models in SageMaker JumpStart.

OpenAI Whisper foundation models WER and latency comparison

The word error rate (WER) for different OpenAI Whisper models based on the LibriSpeech test-clean is shown in the following table.  WER is a common metric for the performance of a speech recognition or machine translation system. It measures the difference between the reference text (the ground truth or the correct transcription) and the output of an ASR system in terms of the number of errors, including substitutions, insertions, and deletions that are needed to transform the ASR output into the reference text. These numbers have been taken from the Hugging Face website.

Model WER (percent)
whisper-tiny 7.54
whisper-base 5.08
whisper-small 3.43
whisper-medium 2.9
whisper-large 3
whisper-large-v2 3
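For intuition, WER is the word-level edit distance between the ASR output and the reference, divided by the number of reference words. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions, insertions,
    deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (one DP row)
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                                 # deletion
                d[j - 1] + 1,                             # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),   # substitution
            )
    return d[-1] / len(ref)
```

Note that published WER figures typically normalize text (casing, punctuation) before scoring, which this sketch omits.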

For this post, we took the following audio file and compared the latency of speech recognition across different Whisper models. Latency is the amount of time from the moment a user sends a request until your application indicates the request is complete. The numbers in the following table represent the average latency for 100 requests using the same audio file, with the model hosted on an ml.g5.2xlarge instance.

Model Average latency(s) Model output
whisper-tiny 0.43 We are living in very exciting times with machine lighting. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody.
whisper-base 0.49 We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we won in the next coming years. Unless we actually make these models more accessible to everybody.
whisper-small 0.84 We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-medium 1.5 We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-large 1.96 We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.
whisper-large-v2 1.98 We are living in very exciting times with machine learning. The speed of ML model development will really actually increase. But you won’t get to that end state that we want in the next coming years unless we actually make these models more accessible to everybody.

Solution walkthrough

You can deploy Whisper models using the Amazon SageMaker console or using an Amazon SageMaker Notebook. In this post, we demonstrate how to deploy the Whisper API using the SageMaker Studio console or a SageMaker Notebook and then use the deployed model for speech recognition and language translation. The code used in this post can be found in this GitHub notebook.

Let’s expand each step in detail.

Deploy Whisper from the console

  1. To get started with SageMaker JumpStart, open the Amazon SageMaker Studio console and go to the launch page of SageMaker JumpStart and select Get Started with JumpStart.
  2. To choose a Whisper model, you can either use the tabs at the top or use the search box at the top right as shown in the following screenshot. For this example, use the search box on the top right and enter Whisper, and then select the appropriate Whisper model from the dropdown menu.
  3. After you select the Whisper model, you can use the console to deploy the model. You can select an instance for deployment or use the default.

Deploy the foundation model from a SageMaker notebook

The steps to first deploy and then use the deployed model to solve different tasks are:

  1. Set up
  2. Select a model
  3. Retrieve artifacts and deploy an endpoint
  4. Use deployed model for ASR
  5. Use deployed model for language translation
  6. Clean up the endpoint

Set up

This notebook was tested on an ml.t3.medium instance in SageMaker Studio with the Python 3 (data science) kernel and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.

%pip install --upgrade sagemaker --quiet

Select a pre-trained model

Set up a SageMaker Session using Boto3, and then select the model ID that you want to deploy.

model_id = "huggingface-asr-whisper-large-v2"

Retrieve artifacts and deploy an endpoint

Using SageMaker, you can perform inference on the pre-trained model, even without fine-tuning it first on a new dataset. To host the pre-trained model, create an instance of the JumpStartModel class and deploy it. The following code uses the default instance, ml.g5.2xlarge, for the inference endpoint of a whisper-large-v2 model. You can deploy the model on other instance types by passing instance_type to the JumpStartModel class. Deployment might take a few minutes.

# Deploy the model

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer

my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()

Automatic speech recognition

Next, you read the sample audio file, sample1.wav, from a SageMaker JumpStart public Amazon Simple Storage Service (Amazon S3) location and pass it to the predictor for speech recognition. You can replace this sample file with any other audio file, but make sure the .wav file is sampled at 16 kHz, because that is required by the automatic speech recognition models. The input audio file must be less than 30 seconds long.

import boto3
from sagemaker.jumpstart import utils

# The .wav file must be sampled at 16 kHz (required by the automatic speech
# recognition models), so resample it if necessary. The input audio must be
# less than 30 seconds.
s3_bucket = utils.get_jumpstart_content_bucket(boto3.Session().region_name)
key_prefix = "training-datasets/asr_notebook_data"
input_audio_file_name = "sample1.wav"

s3_client = boto3.client("s3")
s3_client.download_file(s3_bucket, f"{key_prefix}/{input_audio_file_name}", input_audio_file_name)

with open(input_audio_file_name, "rb") as file:
    wav_file_read = file.read()

# If you receive a client error (413), check the payload size; SageMaker
# invoke-endpoint payloads are limited to about 5 MB.
response = predictor.predict(wav_file_read)
print(response["text"])

This model supports many parameters when performing inference. They include:

  • max_length: The model generates text until the output length reaches max_length. If specified, it must be a positive integer.
  • language and task: Specify the output language and task here. The model supports the task of transcription or translation.
  • max_new_tokens: The maximum numbers of tokens to generate.
  • num_return_sequences: The number of output sequences returned. If specified, it must be a positive integer.
  • num_beams: The number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.
  • no_repeat_ngram_size: The model ensures that a sequence of words of no_repeat_ngram_size isn’t repeated in the output sequence. If specified, it must be a positive integer greater than 1.
  • temperature: This controls the randomness in the output. Higher temperature results in an output sequence with low-probability words and lower temperature results in an output sequence with high-probability words. If temperature approaches 0, it results in greedy decoding. If specified, it must be a positive float.
  • early_stopping: If True, text generation finishes when all beam hypotheses reach the end-of-sentence token. If specified, it must be Boolean.
  • do_sample: If True, the next word is sampled according to its likelihood. If specified, it must be Boolean.
  • top_k: In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.
  • top_p: In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.

You can specify any subset of the preceding parameters when invoking an endpoint. Next, we show you an example of how to invoke an endpoint with these arguments.
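As a sketch, these parameters ride alongside the hex-encoded audio in the JSON payload (the same shape the translation example in the next section uses); the specific parameter values below are illustrative:

```python
def build_payload(wav_bytes: bytes, **params) -> dict:
    """Combine hex-encoded audio with any subset of the generation parameters."""
    allowed = {
        "max_length", "language", "task", "max_new_tokens",
        "num_return_sequences", "num_beams", "no_repeat_ngram_size",
        "temperature", "early_stopping", "do_sample", "top_k", "top_p",
    }
    unknown = params.keys() - allowed
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return {"audio_input": wav_bytes.hex(), **params}
```

You would then send build_payload(wav_file_read, task="transcribe", max_new_tokens=256) through the predictor with the JSONSerializer, as shown in the next section.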

Language translation

To showcase language translation using Whisper models, use the following audio file in French and translate it to English. The file must be sampled at 16 kHz (as required by the ASR models), so make sure to resample files if required and make sure your samples don’t exceed 30 seconds.

  1. Download sample_french1.wav from the SageMaker JumpStart public S3 location so it can be passed in the payload for translation by the Whisper model.

    input_audio_file_name = "sample_french1.wav"
    
    s3_client.download_file(s3_bucket, f"{key_prefix}/{input_audio_file_name}", input_audio_file_name)

  2. Set the task parameter as translate and language as French to force the Whisper model to perform speech translation.
    from sagemaker.serializers import JSONSerializer
    
    with open(input_audio_file_name, "rb") as file:
        wav_file_read = file.read()
    
    payload = {"audio_input": wav_file_read.hex(), "language": "french", "task": "translate"}
    
    predictor.serializer = JSONSerializer()
    predictor.content_type = "application/json"

  3. Use the predictor to predict the translation. If you receive a client error (error 413), check the size of the payload sent to the endpoint. Payloads for SageMaker invoke endpoint requests are limited to about 5 MB.
    response = predictor.predict(payload)
    print(response["text"])

  4. The text output translated to English from the French audio file follows:
    [' Welcome to JPBSystem. We have more than 150 employees and 90% of sales. We have developed about 15 patents.']

Clean up

After you’ve tested the endpoint, delete the SageMaker inference endpoint and delete the model to avoid incurring charges.

Conclusion

In this post, we showed you how to test and use OpenAI Whisper models to build interesting applications using Amazon SageMaker. Try out the foundation model in SageMaker today and let us know your feedback!

This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.


About the authors

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.


Reinventing a cloud-native federated learning architecture on AWS

Machine learning (ML), especially deep learning, requires a large amount of data for improving model performance. Customers often need to train a model with data from different regions, organizations, or AWS accounts. It is challenging to centralize such data for ML due to privacy requirements, high cost of data transfer, or operational complexity.

Federated learning (FL) is a distributed ML approach that trains ML models on distributed datasets. The goal of FL is to improve the accuracy of ML models by using more data, while preserving the privacy and the locality of distributed datasets. FL increases the amount of data available for training ML models, especially data associated with rare and new events, resulting in a more general ML model. Existing partner open-source FL solutions on AWS include FedML and NVIDIA FLARE. These open-source packages are deployed in the cloud by running in virtual machines, without using the cloud-native services available on AWS.

In this blog, you will learn to build a cloud-native FL architecture on AWS. By using infrastructure as code (IaC) tools on AWS, you can deploy FL architectures with ease. Also, a cloud-native architecture takes full advantage of a variety of AWS services with proven security and operational excellence, thereby simplifying the development of FL.

We first discuss different approaches and challenges of FL. We then demonstrate how to build a cloud-native FL architecture on AWS. The sample code to build this architecture is available on GitHub. We use the AWS Cloud Development Kit (AWS CDK) to deploy the architecture with one-click deployment. The sample code demonstrates a scenario where the server and all clients belong to the same organization (the same AWS account), but their datasets cannot be centralized due to data localization requirements. The sample code supports horizontal and synchronous FL for training neural network models. The ML framework used at FL clients is TensorFlow.

Overview of federated learning

FL typically involves a central FL server and a group of clients. Clients are compute nodes that perform local training. In an FL training round, the central server first sends a common global model to a group of clients. Clients train the global model with local data, then provide local models back to the server. The server aggregates the local models into a new global model, then starts a new training round. There may be tens of training rounds until the global model converges or until the number of training rounds reaches a threshold. Therefore, FL exchanges ML models between the central FL server and clients, without moving training data to a central location.

There are two major categories of FL depending on the client type: cross-device and cross-silo. Cross-device FL trains a common global model by keeping all the training data locally on a large number of devices, such as mobile phones or IoT devices, with limited and unstable network connections. Therefore, the design of cross-device FL needs to consider the frequent joining and dropout of FL clients.

Cross-silo FL trains a global model on datasets distributed at different organizations and geo-distributed data centers. These datasets are prohibited from moving out of organizations and data center regions due to data protection regulations, operational challenges (such as data duplication and synchronization), or high costs. In contrast with cross-device FL, cross-silo FL assumes that organizations or data centers have reliable network connections, powerful computing resources, and addressable datasets.

FL has been applied to various industries, such as finance, healthcare, medicine, and telecommunications, where privacy preservation is critical or data localization is required. FL has been used to train a global model for financial crime detection among multiple financial institutions. The global model outperforms models trained with only local datasets by 20%. In healthcare, FL has been used to predict mortality of hospitalized patients based on electronic health records from multiple hospitals. The global model predicting mortality outperforms local models at all participating hospitals. FL has also been used for brain tumor segmentation. The global models for brain tumor segmentation perform similarly to the model trained by collecting distributed datasets at a central location. In telecommunications, FL can be applied to edge computing, wireless spectrum management, and 5G core networks.

There are many other ways to classify FL:

  • Horizontal or vertical – Depending on the partition of features in distributed datasets, FL can be classified as horizontal or vertical. In horizontal FL, all distributed datasets have the same set of features. In vertical FL, datasets have different groups of features, requiring additional communication patterns to align samples based on overlapped features.
  • Synchronous or asynchronous – Depending on the aggregation strategy at an FL server, FL can be classified as synchronous or asynchronous. A synchronous FL server aggregates local models from a selected set of clients into a global model. An asynchronous FL server immediately updates the global model after a local model is received from a client, thereby reducing the waiting time and improving training efficiency.
  • Hub-and-spoke or peer-to-peer – The typical FL topology is hub-and-spoke, where a central FL server coordinates a set of clients. Another FL topology is peer-to-peer without any centralized FL server, where FL clients aggregate information from neighboring clients to learn a model.

Challenges in FL

You can address the following challenges using algorithms running at FL servers and clients in a common FL architecture:

  • Data heterogeneity – FL clients’ local data can vary (i.e., data heterogeneity) due to particular geographic locations, organizations, or time windows. Data heterogeneity impacts the accuracy of global models, leading to more training iterations and longer training time. Many solutions have been proposed to mitigate the impact of data heterogeneity, such as optimization algorithms, partial data sharing among clients, and domain adaptation.
  • Privacy preservation – Local and global models may leak private information via an adversarial attack. Many privacy preservation approaches have been proposed for FL. A secure aggregation approach can be used to preserve the privacy of local models exchanged between FL servers and clients. Local and global differential privacy approaches bound the privacy loss by adding noise to local or global models, which provides a controlled trade-off between privacy and model accuracy. Depending on the privacy requirements, combinations of different privacy preservation approaches can be used.
  • Federated analytics – Federated analytics provides statistical measurements of distributed datasets without violating privacy requirements. Federated analytics is important not only for data analysis across distributed datasets before training, but also for model monitoring at inference.
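
To make the privacy preservation point concrete, the following is a minimal sketch of global differential privacy applied at the server, assuming the global model can be flattened into a NumPy weight vector. The clipping bound and noise scale are illustrative values, not a calibrated privacy budget:

```python
import numpy as np

def privatize_global_model(weights, clip_norm=1.0, noise_std=0.1, seed=0):
    """Clip the weight vector to a bounded L2 norm, then add Gaussian noise.

    Clipping bounds any single contribution's influence on the model; the
    added noise bounds the privacy loss, trading a controlled amount of
    model accuracy for privacy.
    """
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(weights)
    clipped = weights * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=weights.shape)

weights = np.array([3.0, 4.0])  # L2 norm 5.0, so clipping rescales it to norm 1.0
private = privatize_global_model(weights)
```

Local differential privacy follows the same pattern, except each client applies the clipping and noise to its local model before uploading it.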

Despite these challenges of FL algorithms, it is critical to build a secure architecture that provides end-to-end FL operations. One important challenge to building such an architecture is to enable the ease of deployment. The architecture must coordinate FL servers and clients for FL model building, training, and deployment, including continuous integration and continuous delivery (CI/CD) among clients, traceability, and authentication and access control for FL servers and clients. These features are similar to centralized ML operations (ML Ops), but are more challenging to implement because more parties are involved. The architecture also needs to be flexible to implement different FL topologies and synchronous or asynchronous aggregation.

Solution overview

We propose a cloud-native FL architecture on AWS, as shown in the following diagram. The architecture includes a central FL server and two FL clients. In reality, the number of FL clients can reach hundreds for cross-silo clients. The FL server must be on the AWS Cloud because it consists of a suite of microservices offered on the cloud. The FL clients can be on AWS or on the customer premises. The FL clients host their own local dataset and have their own IT and ML system for training ML models.

During FL model training, the FL server and a group of clients exchange ML models. That is, the clients download a global ML model from the server, perform local training, and upload local models to the server. The server downloads the local models and aggregates them into a new global model. This model exchange procedure is a single FL training round. The FL training round repeats until the global model reaches a given accuracy or the number of training rounds reaches a threshold.

FL-architecture

Figure 1 – A cloud-native FL architecture for model training between an FL server and FL clients.

Prerequisites

To implement this solution, you need an AWS account to launch the services for a central FL server and the two clients. On-premises FL clients need to install the AWS Command Line Interface (AWS CLI), which allows access to the AWS services at the FL server, including Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

Federated learning steps

In this section, we walk through the proposed architecture in Figure 1. At the FL server, the AWS Step Functions state machine runs a workflow as shown in Figure 2, which executes Steps 0, 1, and 5 from Figure 1. The state machine initiates the AWS services at the server (Step 0) and iterates FL training rounds. For each training round, the state machine sends out an Amazon Simple Notification Service (Amazon SNS) notification to the topic global_model_ready, along with a task token (Step 1). The state machine then pauses and waits for a callback with the task token. There are SQS queues subscribing to the global_model_ready topic. Each SQS queue corresponds to an FL client and queues the notifications sent from the server to the client.

Figure 2 – The workflow at the Step Functions state machine.

Each client keeps pulling messages from its assigned SQS queue. When a global_model_ready notification is received, the client downloads a global model from Amazon S3 (Step 2) and starts local training (Step 3). Local training generates a local model. The client then uploads the local model to Amazon S3 and writes the local model information, along with the received task token, to the DynamoDB table (Step 4).

We implement the FL model registry using Amazon S3 and DynamoDB. We use Amazon S3 to store the global and local models. We use a DynamoDB table to store local model information, because this information can differ between FL algorithms and therefore requires the flexible schema that a DynamoDB table supports.

We also enable a DynamoDB stream to trigger a Lambda function, so that whenever a record is written into the DynamoDB table (when a new local model is received), a Lambda function is triggered to check if required local models are collected (Step 5). If so, the Lambda function runs the aggregation function to aggregate the local models into global models. The resulting global model is written to Amazon S3. The function also sends a callback, along with the task token retrieved from the DynamoDB table, to the Step Functions state machine. The state machine then determines if the FL training should be continued with a new training round or should be stopped based on a condition, for example, the number of training rounds reaching a threshold.

Each FL client uses the following sample code to interact with the FL server. If you want to customize the local training at your FL clients, the localTraining() function can be modified as long as the returned values are local_model_name and local_model_info for uploading to the FL server. You can select any ML framework for training local models at FL clients as long as all clients use the same ML framework.

# Step 2: receive notifications and model file name from its SQS queue
client.receiveNotificationsFromServer(sqs_region, client_queue_name)

# Step 3: download a global model and train locally
local_model_name, local_model_info = client.localTraining(global_model_name, s3_fl_model_registry)

# Step 4: upload the local model and local model info to the FL server
client.uploadToFLServer(s3_fl_model_registry, local_model_name, dynamodb_table_model_info, local_model_info)
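
The helper calls above come from the sample code on GitHub. As a rough sketch of what they involve, the following shows how a client might parse a global_model_ready notification pulled from its SQS queue (Step 2) and assemble the DynamoDB item written in Step 4. The message fields and item attribute names here are illustrative, not the sample code's exact schema:

```python
import json

def parse_global_model_notification(sqs_message_body):
    """Extract the global model name and task token from an SNS notification
    delivered through the client's SQS queue (SNS wraps its payload in a
    'Message' field of the SQS message body)."""
    notification = json.loads(sqs_message_body)
    inner = json.loads(notification["Message"])
    return inner["globalModelName"], inner["taskToken"]

def build_local_model_item(task_name, round_id, local_model_name, task_token, local_model_info):
    """Assemble the DynamoDB item written in Step 4. The aggregation Lambda
    later reads the task token back from this item to call back the
    Step Functions state machine."""
    return {
        "taskName": {"S": task_name},
        "roundId": {"N": str(round_id)},
        "localModelName": {"S": local_model_name},
        "taskToken": {"S": task_token},
        "localModelInfo": {"S": json.dumps(local_model_info)},
    }

# Example with a mock SQS message body of the assumed shape:
body = json.dumps({"Message": json.dumps({"globalModelName": "global_round_3", "taskToken": "token-abc"})})
model_name, token = parse_global_model_notification(body)
item = build_local_model_item("fl-task", 3, "local_client1_round_3", token, {"num_samples": 1000})
```

In the actual client, the parsed model name drives the S3 download of the global model, and the assembled item is written with a DynamoDB put_item call after the local model is uploaded to S3.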

The Lambda function for running the aggregation function at the server has the following sample code. If you want to customize the aggregation algorithm, you need to modify the fedAvg() function and the output.

# Step 5: aggregate local models in the Lambda function
def lambda_handler(event, context):
	# obtain task_name from the event triggered by the DynamoDB Stream
	task_name = event['Records'][0]['dynamodb']['Keys']['taskName']['S']

	# retrieve transactions from the DynamoDB table
	transactions = readFromFLServerTaskTable(os.environ['TASKS_TABLE_NAME'], task_name)

	# read local model info from required clients 
	# token is a call back token from the Step Functions state machine
	local_model_info, round_id, token = receiveUpdatedModelsFromClients(transactions, task_name)

	# fedAvg function aggregates local models into a global model and stores the global model in S3
	global_model_name, avg_train_acc, avg_test_acc, avg_train_loss, avg_test_loss = fedAvg(local_model_info, round_id)

	# output sent to the Step Function state machine
	output = {'taskName': task_name, 'roundId': str(round_id), 'trainAcc': str(avg_train_acc), 'testAcc': str(avg_test_acc), 'trainLoss': str(avg_train_loss), 'testLoss': str(avg_test_loss), 'weightsFile': str(global_model_name)}

	# send call back to the Step Functions state machine to report that the task identified by the token successfully completed
	step_client = boto3.client('stepfunctions')
	out_str = json.dumps(output)
	step_client.send_task_success(taskToken=token, output=out_str)
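
The fedAvg() function above implements federated averaging. A minimal sketch of the idea, assuming each local model is a flat NumPy weight vector and clients are weighted by their number of training samples, looks like the following (the sample code's actual fedAvg() also loads model files from Amazon S3 and averages the reported metrics):

```python
import numpy as np

def fed_avg(local_weights, sample_counts):
    """Aggregate local models into a global model by a sample-weighted
    average, so clients with more training data contribute proportionally
    more to the new global model."""
    total = sum(sample_counts)
    stacked = np.stack(local_weights)
    client_weights = np.array(sample_counts, dtype=float) / total
    return np.average(stacked, axis=0, weights=client_weights)

# Two clients: one trained on 100 samples, one on 300.
global_model = fed_avg(
    [np.array([1.0, 1.0]), np.array([3.0, 3.0])],
    sample_counts=[100, 300],
)  # weighted average: 0.25*1 + 0.75*3 = 2.5 per coordinate
```

Because the aggregation runs as a script inside the Lambda function, swapping in a different algorithm (for example, a median-based robust aggregation) only requires replacing this function and its output.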

This architecture has two innovative designs. First, the FL server uses serverless services, such as Step Functions and Lambda. Therefore, no computing instance is kept running for the FL server, which minimizes the computing cost. Second, FL clients pull messages from their assigned SQS queues and upload or download models and info to or from services at the FL server. This design avoids the FL server directly accessing resources at the clients, which is critical to provide private and flexible IT and ML environments (on premises or on the AWS Cloud) to FL clients.

Advantages of being cloud-native

This architecture is cloud-native and provides end-to-end transparency by using AWS services with proven security and operational excellence. For example, you can have cross-account clients to assume roles to access the resource at the FL server. For on-premises clients, the AWS CLI and AWS SDK for Python (Boto3) at clients automatically provide secure network connections between the FL server and clients. For clients on the AWS Cloud, you can use AWS PrivateLink and AWS services with data encryption in transit and at rest for data protection. You can use Amazon Cognito and AWS Identity and Access Management (IAM) for the authentication and access control of FL servers and clients. For deploying the trained global model, you can use ML Ops capabilities in Amazon SageMaker.

The cloud-native architecture also enables integration with customized ML frameworks and federated learning algorithms and protocols. For example, you can select an ML framework for training local models at FL clients and customize different aggregation algorithms as scripts running in Lambda functions at the server. Also, you can modify the workflows in Step Functions to accommodate different communication protocols between the server and clients.

Another advantage of the cloud-native architecture is the ease of deployment by using IaC tools offered for the cloud. You can use the AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation for one-click deployment.

Conclusion

New privacy laws continue to be implemented worldwide, and technology infrastructures are rapidly expanding across multiple regions and extending to network edges. Federated learning helps cloud customers use distributed datasets to train accurate ML models in a privacy-preserving manner. Federated learning also supports data localization and potentially saves costs, because it does not require large amounts of raw data to be moved or shared.

You can start experimenting and building cloud-native federated learning architectures for your use cases. You can customize the architecture to support various ML frameworks, such as TensorFlow or PyTorch. You can also customize it to support different FL algorithms, including asynchronous federated learning, aggregation algorithms, and differential privacy algorithms. You can enable this architecture with FL Ops functionalities using ML Ops capabilities in Amazon SageMaker.


About the Authors

Qiong (Jo) Zhang, PhD, is a Senior Partner SA at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI.  She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.


Parker Newton
is an applied scientist in AWS Cryptography. He received his Ph.D. in cryptography from U.C. Riverside, specializing in lattice-based cryptography and the complexity of computational learning problems. He is currently working at AWS in secure computation and privacy, designing cryptographic protocols to enable customers to securely run workloads in the cloud while preserving the privacy of their data.

Olivia Choudhury, PhD, is a Senior Partner SA at AWS. She helps partners, in the Healthcare and Life Sciences domain, design, develop, and scale state-of-the-art solutions leveraging AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Gang Fu  is a Healthcare Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has over ten years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

Kris is a renowned leader in machine learning and generative AI, with a career spanning Goldman Sachs, consulting for major banks, and successful ventures like Foglight and SiteRock. He founded Indigo Capital Management and co-founded adaptiveARC, focusing on green energy tech. Kris also supports non-profits aiding assault victims and disadvantaged youth.

Bill Horne is a General Manager in AWS Cryptography. He leads the Cryptographic Computing Program, consisting of a team of applied scientists and engineers who are solving customer problems using emerging technologies like secure multiparty computation and homomorphic encryption. Prior to joining AWS in 2020 he was the VP and General Manager of Intertrust Secure Systems and was the Director of Security Research at Hewlett-Packard Enterprise. He is the author of 60 peer reviewed publications in the areas of security and machine learning, and holds 50 granted patents and 58 patents pending.


Mistral 7B foundation models from Mistral AI are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the Mistral 7B foundation models, developed by Mistral AI, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. With 7 billion parameters, Mistral 7B can be easily customized and quickly deployed. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Mistral 7B model.

What is Mistral 7B

Mistral 7B is a foundation model developed by Mistral AI, supporting English text and code generation abilities. It supports a variety of use cases, such as text summarization, classification, text completion, and code completion. To demonstrate the easy customizability of the model, Mistral AI has also released a Mistral 7B Instruct model for chat use cases, fine-tuned using a variety of publicly available conversation datasets.

Mistral 7B is a transformer model that uses grouped-query attention and sliding-window attention to achieve faster inference (low latency) and handle longer sequences. Grouped-query attention is an architecture that combines multi-query and multi-head attention to achieve output quality close to multi-head attention with speed comparable to multi-query attention. Sliding-window attention uses the stacked layers of a transformer to attend to past tokens beyond the window size, increasing the effective context length. Mistral 7B has an 8,000-token context length, demonstrates low latency and high throughput, and has strong performance when compared to larger model alternatives, providing low memory requirements at a 7B model size. The model is made available under the permissive Apache 2.0 license, for use without restrictions.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.

You can now discover and deploy Mistral 7B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security.

Discover models

You can access Mistral 7B foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Mistral 7B in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Models or searching for “Mistral.”

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find two buttons, Deploy and Open notebook, which will help you use the model (the following screenshot shows the Deploy option).

Deploy models

Deployment starts when you choose Deploy. Alternatively, you can deploy through the example notebook that shows up when you choose Open notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using the notebook, we start by selecting the Mistral 7B model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct")
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type (ml.g5.2xlarge) and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {"inputs": "<s>[INST] Hello! [/INST]"}
predictor.predict(payload)
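
The endpoint serves the model with Text Generation Inference, which typically returns a list of dictionaries carrying a generated_text field. A small helper like the following (a sketch, assuming that response shape) extracts the completion:

```python
def extract_generated_text(response):
    """Pull the completion out of a TGI-style response, which is a list of
    dicts each with a 'generated_text' field."""
    return response[0]["generated_text"]

# Example with a mock response of the assumed shape:
mock_response = [{"generated_text": "Hello! How can I help you today?"}]
completion = extract_generated_text(mock_response)
```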

Optimizing the deployment configuration

Mistral models use Text Generation Inference (TGI version 1.1) model serving. When deploying models with the TGI deep learning container (DLC), you can configure a variety of launcher arguments via environment variables when deploying your endpoint. To support the 8,000-token context length of Mistral 7B models, SageMaker JumpStart has configured some of these parameters by default: we set MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS to 8191 and 8192, respectively. You can view the full list by inspecting your model object:

print(model.env)

By default, SageMaker JumpStart doesn’t clamp the number of concurrent users below the TGI default value of 128 via the MAX_CONCURRENT_REQUESTS environment variable, because some users have typical workloads with small payload context lengths and want high concurrency. Note that the SageMaker TGI DLC supports multiple concurrent users through rolling batch. When deploying your endpoint for your application, consider whether to clamp MAX_TOTAL_TOKENS or MAX_CONCURRENT_REQUESTS prior to deployment to provide the best performance for your workload:

model.env["MAX_CONCURRENT_REQUESTS"] = "4"

Here, we show how model performance might differ for your typical endpoint workload. In the following tables, you can observe that small-sized queries (128 input words and 128 output tokens) are quite performant under a large number of concurrent users, reaching token throughput on the order of 1,000 tokens per second. However, as the number of input words increases to 512 input words, the endpoint saturates its batching capacity—the number of concurrent requests allowed to be processed simultaneously—resulting in a throughput plateau and significant latency degradations starting around 16 concurrent users. Finally, when querying the endpoint with large input contexts (for example, 6,400 words) simultaneously by multiple concurrent users, this throughput plateau occurs relatively quickly, to the point where your SageMaker account will start encountering 60-second response timeout limits for your overloaded requests.

Throughput (tokens/s), mistral-7b-instruct on ml.g5.2xlarge, 128 output tokens:

                         concurrent users
input words      1     2     4     8     16    32    64    128
128              30    54    89    166   287   499   793   1030
512              29    50    80    140   210   315   383   458
6400             17    25    30    35    -     -     -     -

P50 latency (ms/token), mistral-7b-instruct on ml.g5.2xlarge, 128 output tokens:

                         concurrent users
input words      1     2     4     8     16    32    64    128
128              32    33    34    36    41    46    59    88
512              34    36    39    43    54    71    112   213
6400             57    71    98    154   -     -     -     -

Inference and example prompts

Mistral 7B

You can interact with a base Mistral 7B model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. The following is a simple example with multi-shot learning, where the model is provided with several examples and the final example response is generated with contextual knowledge of these previous examples:

> Input
Tweet: "I get sad when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been :+1:"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:

> Output
 Positive

Mistral 7B instruct

The instruction-tuned version of Mistral accepts formatted instructions where conversation roles must start with a user prompt and alternate between user and assistant. A simple user prompt may look like the following:

<s>[INST] {user_prompt} [/INST]

A multi-turn prompt would look like the following:

<s>[INST] {user_prompt_1} [/INST] {assistant_response_1} </s><s>[INST] {user_prompt_2} [/INST]

This pattern repeats for however many turns are in the conversation.
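
The turn pattern above can be assembled programmatically. The following sketch builds a Mistral 7B Instruct prompt from an alternating list of user and assistant messages (the helper name is ours, not part of any SDK):

```python
def build_mistral_prompt(turns):
    """Build a Mistral 7B Instruct prompt from alternating user/assistant turns.

    Each user turn is wrapped in [INST] ... [/INST]; each completed assistant
    turn follows its instruction and is closed with </s>, after which the
    next <s> opens a new exchange.
    """
    prompt = ""
    for i in range(0, len(turns), 2):
        prompt += f"<s>[INST] {turns[i]} [/INST]"
        if i + 1 < len(turns):
            prompt += f" {turns[i + 1]} </s>"
    return prompt

prompt = build_mistral_prompt(["Hi!", "Hello, how can I help?", "Tell me a joke."])
# -> "<s>[INST] Hi! [/INST] Hello, how can I help? </s><s>[INST] Tell me a joke. [/INST]"
```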

In the following sections, we explore some examples using the Mistral 7B Instruct model.

Knowledge retrieval

The following is an example of knowledge retrieval:

> Input
<s>[INST] Which country has the most natural lakes? Answer with only the country name. [/INST] 

> Output
1. Canada

Large context question answering

To demonstrate how to use this model to support large input context lengths, the following example embeds a passage, titled “Rats” by Robert Sullivan (reference), from the MCAS Grade 10 English Language Arts Reading Comprehension test into the input prompt instruction and asks the model a directed question about the text:

> Input
<s>[INST] A rat is a rodent, the most common mammal in the world. Rattus norvegicus is one of the approximately four hundred different kinds of rodents, and it is known by many names, each of which describes a trait or a perceived trait or sometimes a habitat: the earth rat, the roving rat, the barn rat, the field rat, the migratory rat, the house rat, the sewer rat, the water rat, the wharf rat, the alley rat, the gray rat, the brown rat, and the common rat. The average brown rat is large and stocky; it grows to be approximately sixteen inches long from its nose to its tail—the size of a large adult human male’s foot—and weighs about a pound, though brown rats have been measured by scientists and exterminators at twenty inches and up to two pounds. The brown rat is sometimes confused with the black rat, or Rattus rattus, which is smaller and once inhabited New York City and all of the cities of America but, since Rattus norvegicus pushed it out, is now relegated to a minor role. (The two species still survive alongside each other in some Southern coastal cities and on the West Coast, in places like Los Angeles, for example, where the black rat lives in attics and palm trees.) The black rat is always a very dark gray, almost black, and the brown rat is gray or brown, with a belly that can be light gray, yellow, or even a pure-seeming white. One spring, beneath the Brooklyn Bridge, I saw a red-haired brown rat that had been run over by a car. Both pet rats and laboratory rats are Rattus norvegicus, but they are not wild and therefore, I would emphasize, not the subject of this book. Sometimes pet rats are called fancy rats. But if anyone has picked up this book to learn about fancy rats, then they should put this book down right away; none of the rats mentioned herein are at all fancy.

Rats are nocturnal, and out in the night the brown rat’s eyes are small and black and shiny; when a flashlight shines into them in the dark, the eyes of a rat light up like the eyes of a deer. Though it forages* in darkness, the brown rat has poor eyesight. It makes up for this with, first of all, an excellent sense of smell. . . . They have an excellent sense of taste, detecting the most minute amounts of poison, down to one part per million. A brown rat has strong feet, the two front paws each equipped with four clawlike nails, the rear paws even longer and stronger. It can run and climb with squirrel-like agility. It is an excellent swimmer, surviving in rivers and bays, in sewer streams and toilet bowls.

The brown rat’s teeth are yellow, the front two incisors being especially long and sharp, like buckteeth. When the brown rat bites, its front two teeth spread apart. When it gnaws, a flap of skin plugs the space behind its incisors. Hence, when the rat gnaws on indigestible materials—concrete or steel, for example—the shavings don’t go down the rat’s throat and kill it. Its incisors grow at a rate of five inches per year. Rats always gnaw, and no one is certain why—there are few modern rat studies. It is sometimes erroneously stated that the rat gnaws solely to limit the length of its incisors, which would otherwise grow out of its head, but this is not the case: the incisors wear down naturally. In terms of hardness, the brown rat’s teeth are stronger than aluminum, copper, lead, and iron. They are comparable to steel. With the alligator-like structure of their jaws, rats can exert a biting pressure of up to seven thousand pounds per square inch. Rats, like mice, seem to be attracted to wires—to utility wires, computer wires, wires in vehicles, in addition to gas and water pipes. One rat expert theorizes that wires may be attractive to rats because of their resemblance to vines and the stalks of plants; cables are the vines of the city. By one estimate, 26 percent of all electric-cable breaks and 18 percent of all phone-cable disruptions are caused by rats. According to one study, as many as 25 percent of all fires of unknown origin are rat-caused. Rats chew electrical cables. Sitting in a nest of tattered rags and newspapers, in the floorboards of an old tenement, a rat gnaws the head of a match—the lightning in the city forest.

When it is not gnawing or feeding on trash, the brown rat digs. Anywhere there is dirt in a city, brown rats are likely to be digging—in parks, in flowerbeds, in little dirt-poor backyards. They dig holes to enter buildings and to make nests. Rat nests can be in the floorboards of apartments, in the waste-stuffed corners of subway stations, in sewers, or beneath old furniture in basements. “Cluttered and unkempt alleyways in cities provide ideal rat habitat, especially those alleyways associated with food-serving establishments,” writes Robert Corrigan in Rodent Control, a pest control manual. “Alley rats can forage safely within the shadows created by the alleyway, as well as quickly retreat to the safety of cover in these narrow channels.” Often, rats burrow under concrete sidewalk slabs. Entrance to a typical under-the-sidewalk rat’s nest is gained through a two-inch-wide hole—their skeletons collapse and they can squeeze into a hole as small as three quarters of an inch wide, the average width of their skull. This tunnel then travels about a foot down to where it widens into a nest or den. The den is lined with soft debris, often shredded plastic garbage or shopping bags, but sometimes even grasses or plants; some rat nests have been found stuffed with the gnawed shavings of the wood-based, spring-loaded snap traps that are used in attempts to kill them. The back of the den then narrows into a long tunnel that opens up on another hole back on the street. This second hole is called a bolt hole; it is an emergency exit. A bolt hole is typically covered lightly with dirt or trash—camouflage. Sometimes there are networks of burrows, which can stretch beneath a few concrete squares on a sidewalk, or a number of backyards, or even an entire city block—when Rattus norvegicus first came to Selkirk, England, in 1776, there were so many burrows that people feared the town might sink. Rats can also nest in basements, sewers, manholes, abandoned pipes of any kind, floorboards, or any hole or depression. “Often,” Robert Corrigan writes, “‘city rats’ will live unbeknownst to people right beneath their feet.”

Rats also inhabit subways, as most people in New York City and any city with a subway system are well aware. Every once in a while, there are reports of rats boarding trains, but for the most part rats stay on the tracks—subway workers I have talked to refer to rats as “track rabbits.” People tend to think that the subways are filled with rats, but in fact rats are not everywhere in the system; they live in the subways according to the supply of discarded human food and sewer leaks. Sometimes, rats use the subway purely for nesting purposes; they find ways through the walls of the subway stations leading from the tracks to the restaurants and stores on the street—the vibrations of subway trains tend to create rat-size cracks and holes. Many subway rats tend to live near stations that are themselves near fast-food restaurants. At the various subway stations near Herald Square, for example, people come down from the streets and throw the food that they have not eaten onto the tracks, along with newspapers and soda bottles and, I have noticed, thousands of no-longer-charged AA batteries, waiting to leak acid. The rats eat freely from the waste and sit at the side of the little streams of creamy brown sewery water that flows between the rails. They sip the water the way rats do, either with their front paws or by scooping it up with their incisors.


What is the most likely reason the author states, "cables are the vines of the city"? [/INST] 


> Output
1. The author states that rats are attracted to wires, such as utility wires, computer wires, and vehicle wires, which may be due to their resemblance to vines and the stalks of plants.

Mathematics and reasoning

The Mistral models also report strong mathematics accuracy. Mistral can work through multi-step reasoning such as the following math problem:

> Input
<s>[INST] I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering. [/INST]

> Output
1. First, let's figure out how much the ice cream cost in total. Since each cone was $1.25 and there were 6 kids, the total cost would be:

Total cost = Number of cones * Cost per cone
Total cost = 6 * $1.25
Total cost = $7.50

2. Next, we'll subtract the total cost from the amount paid ($10) to find out how much money was returned:

Money returned = Amount paid - Total cost
Money returned = $10 - $7.50
Money returned = $2.50

So, you got back $2.50 after buying the ice cream for the 6 kids.

Coding

The following is an example of a coding prompt:

> Input
<s>[INST] In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month? [/INST] 


> Output
1. Use the `find` command to search for text files in the current directory (excluding subdirectories) that have been modified in the last month. The `-type f` option specifies that we only want to search for files, not directories. The `-name "*.txt"` option specifies that we only want to search for files with the `.txt` extension. The `-mtime -30` option specifies that we only want to search for files that have been modified in the last 30 days (1 month).

```bash
find . -type f -name "*.txt" -mtime -30
```

This command will list all text files in the current directory (excluding subdirectories) that have been modified in the last month.

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

```python
predictor.delete_model()
predictor.delete_endpoint()
```

Conclusion

In this post, we showed you how to get started with Mistral 7B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit Amazon SageMaker JumpStart now to get started.


About the Authors

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Vivek Singh is a product manager with Amazon SageMaker JumpStart. He focuses on enabling customers to onboard SageMaker JumpStart to simplify and accelerate their ML journey to build generative AI applications.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Use no-code machine learning to derive insights from product reviews using Amazon SageMaker Canvas sentiment analysis and text analysis models

According to Gartner, 85% of software buyers trust online reviews as much as personal recommendations. Customers provide feedback and reviews about products they have purchased through many channels, including review websites, vendor websites, sales calls, social media, and many others. The problem with the increasing volume of customer reviews across multiple channels is that it can be challenging for companies to process and derive meaningful insights from the data using traditional methods. Machine learning (ML) can analyze large volumes of product reviews and identify patterns, sentiments, and topics discussed. With this information, companies can gain a better understanding of customer preferences, pain points, and satisfaction levels. They can also use this information to improve products and services, identify trends, and take strategic actions that drive business growth. However, implementing ML can be a challenge for companies that lack resources such as ML practitioners, data scientists, or artificial intelligence (AI) developers. With the new Amazon SageMaker Canvas features, business analysts can now use ML to derive insights from product reviews.

SageMaker Canvas is designed for the functional needs of business analysts to use no-code ML on AWS for ad hoc analysis of tabular data. SageMaker Canvas is a visual, point-and-click service that allows business analysts to generate accurate ML predictions without writing a single line of code or requiring ML expertise. You can use models to make predictions interactively and for batch scoring on bulk datasets. SageMaker Canvas offers fully managed ready-to-use AI models and custom model solutions. For common ML use cases, you can use a ready-to-use AI model to generate predictions with your data without any model training. For ML use cases specific to your business domain, you can train an ML model with your own data for custom prediction.

In this post, we demonstrate how to use the ready-to-use sentiment analysis model and custom text analysis model to derive insights from product reviews. In this use case, we have a set of synthesized product reviews that we want to analyze for sentiments and categorize the reviews by product type, to make it easy to draw patterns and trends that can help business stakeholders make better informed decisions. First, we describe the steps to determine the sentiment of the reviews using the ready-to-use sentiment analysis model. Then, we walk you through the process to train a text analysis model to categorize the reviews by product type. Next, we explain how to review the trained model for performance. Finally, we explain how to use the trained model to perform predictions.

Sentiment analysis is a ready-to-use natural language processing (NLP) model that analyzes text for sentiments. Sentiment analysis can be run for single-line or batch predictions. The predicted sentiment for each line of text is positive, negative, mixed, or neutral.

Text analysis allows you to classify text into two or more categories using custom models. In this post, we want to classify product reviews based on product type. To train a text analysis custom model, you simply provide a dataset consisting of the text and the associated categories in a CSV file. The dataset requires a minimum of two categories and 125 rows of text per category. After the model is trained, you can review the model’s performance and retrain the model if needed, before using it for predictions.
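Before training, it can be handy to confirm a CSV meets these minimums locally. The following sketch checks both requirements (the column names product_review and product_category are assumptions based on the sample dataset used in this post):

```python
import csv
from collections import Counter

def validate_canvas_text_dataset(path, text_col="product_review", label_col="product_category"):
    """Check a CSV against the SageMaker Canvas text analysis training minimums
    described in this post: at least two categories, and at least 125 rows of
    text per category. Returns a list of problems; an empty list means the
    dataset meets the minimums.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Count rows per category, ignoring rows with empty text
    counts = Counter(row[label_col] for row in rows if row.get(text_col, "").strip())
    problems = [
        f"category '{cat}' has only {n} rows (need at least 125)"
        for cat, n in counts.items() if n < 125
    ]
    if len(counts) < 2:
        problems.append(f"only {len(counts)} category present (need at least 2)")
    return problems
```

Running this against sample_product_reviews_training.csv (600 reviews across three categories) should return an empty list.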

Prerequisites

Complete the following prerequisites:

  1. Have an AWS account.
  2. Set up SageMaker Canvas.
  3. Download the sample product reviews datasets:
    • sample_product_reviews.csv – Contains 2,000 synthesized product reviews and is used for sentiment analysis and Text Analysis predictions.
    • sample_product_reviews_training.csv – Contains 600 synthesized product reviews and three product categories, and is for text analysis model training.

Sentiment analysis

First, you use sentiment analysis to determine the sentiments of the product reviews by completing the following steps.

  1. On the SageMaker console, click Canvas in the navigation pane, then click Open Canvas to open the SageMaker Canvas application.
  2. Click Ready-to-use models in the navigation pane, then click Sentiment analysis.
  3. Click Batch prediction, then click Create dataset.
  4. Provide a Dataset name and click Create.
  5. Click Select files from your computer to import the sample_product_reviews.csv dataset.
  6. Click Create dataset and review the data. The first column contains the reviews and is used for sentiment analysis. The second column contains the review ID and is used for reference only.
  7. Click Create dataset to complete the data upload process.
  8. In the Select dataset for predictions view, select sample_product_reviews.csv and then click Generate predictions. 
  9. When the batch prediction is complete, click View to view the predictions.

Sentiment Analysis Steps

The Sentiment and Confidence columns provide the sentiment and confidence score, respectively. A confidence score is a statistical value between 0 and 100% that shows the probability that the sentiment is correctly predicted.

  1. Click Download CSV to download the results to your computer.
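Once downloaded, the results can be summarized outside of SageMaker Canvas. The following is a minimal sketch, assuming the results CSV contains the Sentiment column described above:

```python
import csv
from collections import Counter

def summarize_sentiments(path):
    """Count the predicted sentiments in a downloaded Canvas results CSV.
    The 'Sentiment' column name matches the results described in this post.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["Sentiment"] for row in csv.DictReader(f))
```

This gives a quick breakdown of how many reviews were predicted positive, negative, mixed, or neutral.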

Text analysis

In this section, we go through the steps to perform text analysis with a custom model: importing the data, training the model and then making predictions.

Import the data

First import the training dataset. Complete the following steps:

  1. On the Ready-to-use models page, click Create a custom model.
  2. For Model name, enter a name (for example, Product Reviews Analysis). Click Text analysis, then click Create.
  3. On the Select tab, click Create dataset to import the sample_product_reviews_training.csv dataset.
  4. Provide a Dataset name and click Create.
  5. Click Create dataset and review the data. The training dataset contains a third column with the product category, which is the target column and consists of three product types: books, video, and music.
  6. Click Create dataset to complete the data upload process.
  7. On the Select dataset page, select sample_product_reviews_training.csv and click Select dataset.

Classification Steps

Train the model

Next, you configure the model to begin the training process.

  1. On the Build tab, on the Target column drop-down menu, click product_category as the training target.
  2. Click product_review as the source.
  3. Click Quick build to start the model training.

For more information about the differences between Quick build and Standard build, refer to Build a custom model.

When the model training is complete, you may review the performance of the model before you use it for prediction.

  1. On the Analyze tab, the model’s confidence score will be displayed. A confidence score indicates how certain a model is that its predictions are correct. On the Overview tab, review the performance for each category.
  2. Click Scoring to review the model accuracy insights.
  3. Click Advanced metrics to review the confusion matrix and F1 score.

Make predictions

To make a prediction with your custom model, complete the following steps:

  1. On the Predict tab, click Batch prediction, then click Manual.
  2. Click the same dataset, sample_product_reviews.csv, that you used previously for the sentiment analysis, then click Generate predictions.
  3. When the batch prediction is complete, click View to view the predictions.

For custom model prediction, it takes some time for SageMaker Canvas to deploy the model for initial use. SageMaker Canvas automatically de-provisions the model if idle for 15 minutes to save costs.

The Prediction (Category) and Confidence columns provide the predicted product categories and confidence scores, respectively.

  1. Highlight the completed job, select the three dots, and click Download to download the results to your computer.

Clean up

Click Log out in the navigation pane to log out of the SageMaker Canvas application. This stops the consumption of Canvas session hours and releases all resources.

Conclusion

In this post, we demonstrated how you can use Amazon SageMaker Canvas to derive insights from product reviews without ML expertise. First, you used a ready-to-use sentiment analysis model to determine the sentiments of the product reviews. Next, you used text analysis to train a custom model with the quick build process. Finally, you used the trained model to categorize the product reviews into product categories. All without writing a single line of code. We recommend that you repeat the text analysis process with the standard build process to compare the model results and prediction confidence.


About the Authors

Gavin Satur is a Principal Solutions Architect at Amazon Web Services. He works with enterprise customers to build strategic, well-architected solutions and is passionate about automation. Outside work, he enjoys family time, tennis, cooking and traveling.

Les Chan is a Sr. Solutions Architect at Amazon Web Services, based in Irvine, California. Les is passionate about working with enterprise customers on adopting and implementing technology solutions with the sole focus of driving customer business outcomes. His expertise spans application architecture, DevOps, serverless, and machine learning.

Aaqib Bickiya is a Solutions Architect at Amazon Web Services based in Southern California. He helps enterprise customers in the retail space accelerate projects and implement new technologies. Aaqib’s focus areas include machine learning, serverless, analytics, and communication services.

Prepare your data for Amazon Personalize with Amazon SageMaker Data Wrangler

A recommendation engine is only as good as the data used to prepare it. Transforming raw data into a format that is suitable for a model is key to getting better personalized recommendations for end-users.

In this post, we walk through how to use Amazon SageMaker Data Wrangler to prepare the MovieLens dataset, a collection of user movie ratings maintained by GroupLens Research at the University of Minnesota, and import it into Amazon Personalize. [1]

Solution overview

Amazon Personalize is a managed service whose core value proposition is its ability to learn user preferences from past behavior and quickly adjust those learned preferences to account for changing user behavior in near-real time. To develop this understanding of users, Amazon Personalize needs to train on historical user behavior so that it can find patterns that generalize towards the future. Specifically, the main type of data that Amazon Personalize learns from is what we call an interactions dataset, which is a tabular dataset that consists of at minimum three critical columns, userID, itemID, and timestamp, representing a positive interaction between a user and an item at a specific time. Users of Amazon Personalize need to upload data containing their own customers’ interactions in order for the model to learn these behavioral trends. Although the internal algorithms within Amazon Personalize have been chosen based on Amazon’s experience in the machine learning space, a personalized model doesn’t come pre-loaded with any sort of data and trains models on a customer-by-customer basis.
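To make the expected shape concrete, the following minimal sketch builds an interactions dataset with the three critical columns and writes it as CSV. It uses only the Python standard library; the user IDs, item IDs, and timestamps are made up for illustration:

```python
import csv
import io

# Each row is a positive interaction between a user and an item
# at a specific time (timestamp in Unix epoch seconds).
interactions = [
    {"userID": "u1", "itemID": "movie_10", "timestamp": 1696118400},
    {"userID": "u1", "itemID": "movie_42", "timestamp": 1696204800},
    {"userID": "u2", "itemID": "movie_10", "timestamp": 1696291200},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["userID", "itemID", "timestamp"])
writer.writeheader()
writer.writerows(interactions)
print(buf.getvalue())
```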

The MovieLens dataset explored in this walkthrough isn’t in this format, so to prepare it for Amazon Personalize, we use SageMaker Data Wrangler, a purpose-built data aggregation and preparation tool for machine learning. It has over 300 preconfigured data transformations as well as the ability to bring in custom code to create custom transformations in PySpark, SQL, and a variety of data processing libraries, such as pandas.

Prerequisites

First, we need to have an Amazon SageMaker Studio domain set up. For details on how to set it up, refer to Onboard to Amazon SageMaker Domain using Quick setup.

Also, we need to set up the right permissions using AWS Identity and Access Management (IAM) for Amazon Personalize and Amazon SageMaker service roles so that they can access the needed functionalities.

You can create a new Amazon Personalize dataset group to use in this walkthrough or use an existing one.

Finally, we need to download and unzip the MovieLens dataset and place it in an Amazon Simple Storage Service (Amazon S3) bucket.

Launch SageMaker Data Wrangler from Amazon Personalize

To start with the SageMaker Data Wrangler integration with Amazon Personalize, complete the following steps:

  1. On the Amazon Personalize console, navigate to the Overview page of your dataset group.
  2. Choose Import interaction data, Import user data, or Import item data, depending on the dataset type (for this post, we choose Import interaction data).
  3. For Import method, select Import data using Data Wrangler.
  4. Choose Next.
  5. Specify the SageMaker domain, user profile, and IAM service role that you created earlier as part of the prerequisites.
  6. Choose Next.
  7. Continue through the steps to launch an instance of SageMaker Data Wrangler.

Setting up the environment for the first time can take up to 5 minutes.

Import the raw data into SageMaker Data Wrangler

When using SageMaker Data Wrangler to prepare and import data, we use a data flow. A data flow defines a series of transformations and analyses on data to prepare it to create a machine learning model. Each time we add a step to our flow, SageMaker Data Wrangler takes an action on our data, such as joining it with another dataset or dropping some rows and columns.

To start, let’s import the raw data.

  1. On the data flow page, choose Import data.

With SageMaker Data Wrangler, we can import data from over 50 supported data sources.

  1. For Data sources, choose Amazon S3.

  1. Choose the dataset you uploaded to your S3 bucket.

SageMaker Data Wrangler automatically displays a preview of the data.

  1. Keep the default settings and choose Import.

After the data is imported, SageMaker Data Wrangler automatically validates the datasets and detects the data types for all the columns based on its sampling.

  1. Choose Data flow at the top of the Data types page to view the main data flow before moving to the next step.

One of the main advantages of SageMaker Data Wrangler is the ability to run previews of your transformations on a small subset of data before committing to apply the transformations on the entire dataset. To run the same transformation on multiple partitioned files in Amazon S3, you can use parameters.

Transform the data

To transform data in SageMaker Data Wrangler, you add a Transform step to your data flow. SageMaker Data Wrangler includes over 300 transforms that you can use to prepare your data, including a Map columns for Amazon Personalize transform. You can use the general SageMaker Data Wrangler transforms to fix issues such as outliers, type issues, and missing values, or apply data preprocessing steps.

To use Amazon Personalize, the data you provide in the interactions dataset must match your dataset schema. For our movie recommender engine, the proposed interactions dataset schema includes:

  • user_id (string)
  • item_id (string)
  • event_type (string)
  • timestamp (in Unix epoch time format)

To learn more about Amazon Personalize datasets and schemas, refer to Datasets and schemas.
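For reference, Amazon Personalize dataset schemas are defined in Avro JSON. A minimal sketch of the proposed interactions schema above might look like the following; the field names follow this post's proposed schema, so treat this as illustrative rather than a drop-in schema:

```python
import json

# Avro JSON sketch of the interactions dataset schema proposed above:
# user_id, item_id, and event_type as strings, timestamp as a long
# (Unix epoch time).
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "item_id", "type": "string"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
    "version": "1.0",
}
print(json.dumps(interactions_schema, indent=2))
```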

The ratings.csv file, shown in the last step of the previous section, includes movies rated from 1–5. We want to build a movie recommender engine based on that. To do so, we must complete the following steps:

  1. Modify the columns data types.
  2. Create two event types: Click and Watch.
  3. Assign all movies rated 2 and above as Click and movies rated 4 and above as both Click and Watch.
  4. Drop the ratings column.
  5. Map the columns to the Amazon Personalize interactions dataset schema.
  6. Validate that our timestamp is in Unix epoch time.

Note that Step 3 isn't needed to make a personalization model. If we want to use one of the Amazon Personalize streamlined video on demand domain recommenders, such as Top Picks for You, Click and Watch would be required event types. However, if we don't have these, we could instead omit the event type field (or add our own event types, such as the raw user ratings) and use a custom recipe such as User Personalization. Regardless of what type of recommendation engine we use, we need to ensure our dataset contains only representations of positive user intent. So whatever approach you choose, you need to drop all one-star ratings (and possibly two-star ratings as well).
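The rating-to-event rules in Steps 2–4 can be sketched as a small Python function (illustrative only; in this walkthrough the equivalent logic is applied with SageMaker Data Wrangler transform steps):

```python
def rating_to_events(rating):
    """Map a 1-5 star movie rating to Amazon Personalize event types,
    following the rules in this post: a rating of 2 and above yields a
    Click event, and 4 and above yields both Click and Watch. One-star
    ratings yield no events because they don't represent positive intent.
    """
    events = []
    if rating >= 2:
        events.append("Click")
    if rating >= 4:
        events.append("Watch")
    return events
```

For example, a 3-star rating produces only a Click event, while a 5-star rating produces both Click and Watch.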

Now let’s use SageMaker Data Wrangler to perform the preceding steps.

  1. On the Data flow page, choose the first transform, called Data types.

  1. Update the type for each column.
  2. Choose Preview to reflect the changes, then choose Update.

  1. To add a step in the data flow, choose the plus sign next to the step you want to perform the transform on, then choose Add transform.

  1. To create the Click events from the movie ratings, we add a Filter data step to keep the movies rated 2 and above.

  1. Add another custom transform step to add a new column, eventType, with Click as the assigned value. In this case, we write some PySpark code to add a column called eventType whose value will be uniformly Click for all of our two-star through five-star movies:

```python
from pyspark.sql.functions import lit

df = df.withColumn("eventType", lit("Click"))
```

  2. Choose Preview to review your transformation and double-check the results are as intended, then choose Add.

  1. For the Watch events, repeat the previous steps for movies rated 4 and above and assign the Watch value by adding the steps to the Data types step. Our PySpark code for these steps is as follows:

```python
from pyspark.sql.functions import lit

df = df.withColumn("eventType", lit("Watch"))
```

Up to this point, the data flow should look like the following screenshot.

Concatenate datasets

Because we have two datasets for watched and clicked events, let’s walk through how to concatenate these into one interactions dataset.

  1. On the Data flow page, choose the plus sign next to Create Watch Event and choose Concatenate.

  1. Choose the other final step (Create Click Event), and this should automatically converge both sets into a concatenate preview.

  1. Choose Configure to view a preview of the concatenated datasets.
  2. Add a name to the step.
  3. Choose Add to add the step.

The data flow now looks like the following screenshot.

  1. Now, let’s add a Manage columns step to drop the original rating column.

Amazon Personalize has default column names for users, items, and timestamps. These default column names are user_id, item_id, and timestamp.

  1. Let’s add a Transform for Amazon Personalize step to replace the existing column headers with the default headers.
  2. In our case, we also use the event_type field, so let’s map that as well.

With this step, the data transformation activity is complete and the interactions dataset is ready for the next step.

Next, let’s validate our timestamps.

  1. We can do this by adding a Custom transform step. For this post, we choose Python (User-Defined Function).
  2. Choose timestamp as the input column and as the output, create a new column called readable_timestamp.
  3. Choose Python as the mode for the transformation and insert the following code for the Python function:
```python
from datetime import datetime

def custom_func(value: int) -> str:
    # Convert a Unix epoch timestamp (in seconds) to a readable UTC date string
    return datetime.utcfromtimestamp(value).strftime('%Y-%m-%d %H:%M:%S')
```
  1. Choose Preview to review the changes.

In this case, we see dates in the 2000s—because MovieLens started collecting data in 1996, this aligns with what is expected. If we don’t choose Add, this transformation won’t be added to our data flow.

  1. Because this was merely a sanity check, you can navigate back to the data flow by choosing Data flow in the upper left corner.

Finally, we add an analysis step to create a summary report about the dataset. This step performs an analysis to assess the suitability of the dataset for Amazon Personalize.

  1. Choose the plus sign next to the final step on the data flow and choose Add analysis.
  2. For Analysis type, choose Data Quality And Insights Report for Amazon Personalize.
  3. For Dataset type, choose Interactions.
  4. Choose Create.

The MovieLens dataset is quite clean, so the analysis shows no issues. If any issues were identified, you could iterate on the dataset and rerun the analysis until they are addressed.

Note that by default, the analysis runs on a sample of 50,000 rows.

Import the dataset to Amazon Personalize

At this point, our raw data has been transformed and we are ready to import the transformed interactions dataset to Amazon Personalize. SageMaker Data Wrangler gives you the ability to export your data to a location within an S3 bucket. You can specify the location using one of the following methods:

  • Destination node – Where SageMaker Data Wrangler stores the data after it has processed it
  • Export to – Exports the data resulting from a transformation to Amazon S3
  • Export data – For small datasets, you can quickly export the data that you’ve transformed

With the Destination node method, to export your data, you create destination nodes and a SageMaker Data Wrangler job. Creating a SageMaker Data Wrangler job starts a SageMaker Processing job to export your flow. You can choose the destination nodes that you want to export after you’ve created them.

  1. Choose the plus sign next to the node that represents the transformations you want to export.

  1. Choose Export to and then choose Amazon S3 (via Jupyter Notebook).

Note we could have also chosen to export the data to Amazon Personalize via a Jupyter notebook available in SageMaker Data Wrangler.

  1. For Dataset name, enter a name, which will be used as a folder name in the S3 bucket provided as a destination.
  2. You can specify the file type, field delimiter, and compression method.
  3. Optionally, specify the number of partitions and column to partition by.
  4. Choose Add destination.

The data flow should look like the following screenshot.

  1. Create a job to process the data flow and store the data in the destination (S3 bucket) that we configured in the previous step.
  2. Enter a job name, then choose Configure job.

SageMaker Data Wrangler provides the ability to configure the instance type, instance count, and job configuration, and the ability to create a schedule to process the job. For guidance on how to choose an instance count, refer to Create and Use a Data Wrangler Flow.

To monitor the status of the job, navigate to the Dashboard page on the SageMaker console. The Processing section shows the number of completed and created jobs. You can drill down to get more details about the completed job.

When the job is complete, a new file of the transformed data is created in the destination specified.

  1. Return to the Amazon Personalize console and navigate to the dataset group to import another dataset.
  2. Choose Import interaction data.

  1. Select Import data directly into Amazon Personalize datasets to import the transformed dataset directly from Amazon S3, then choose Next.

  1. Define the schema. For this post, our dataset consists of the user_id (string), item_id (string), event_type (string), and timestamp (long) fields.
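The schema defined in this step corresponds to the Avro-format interactions schema that Amazon Personalize expects. A sketch of that schema follows; note that Personalize schema field names are uppercase, and the import maps our exported user_id/item_id/event_type/timestamp columns to them:

```python
import json

# Sketch of an Amazon Personalize interactions schema (Avro format).
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}
print(json.dumps(interactions_schema, indent=2))
```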

At this point, you can create a video on demand domain recommender or a custom solution. To do so, follow the steps in Preparing and importing data.

Conclusion

In this post, we described how to use SageMaker Data Wrangler to prepare a sample dataset for Amazon Personalize. SageMaker Data Wrangler offers over 300 transformations. These transformations and the ability to add custom user transformations can help streamline the process of creating a quality dataset to offer hyper-personalized content to end-users.

Although we only explored how to prepare an interactions dataset in this post, you can use SageMaker Data Wrangler to prepare user and item datasets as well. For more information on the types of data that can be used with Amazon Personalize, refer to Datasets and schemas.

If you’re new to Amazon Personalize or SageMaker Data Wrangler, refer to Get Started with Amazon Personalize or Get Started with SageMaker Data Wrangler, respectively. If you have any questions related to this post, please add them in the comments section.


About the Authors

Maysara Hamdan is a Partner Solutions Architect based in Atlanta, Georgia. Maysara has over 15 years of experience in building and architecting Software Applications and IoT Connected Products in Telecom and Automotive Industries. In AWS, Maysara helps partners in building their cloud practices and growing their businesses. Maysara is passionate about new technologies and is always looking for ways to help partners innovate and grow.

Eric Bolme is a Specialist Solution Architect with AWS based on the East Coast of the United States. He has 8 years of experience building out a variety of deep learning and other AI use cases and focuses on Personalization and Recommendation use cases with AWS.


References

[1] Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Personalize your generative AI applications with Amazon SageMaker Feature Store

Large language models (LLMs) are revolutionizing fields like search engines, natural language processing (NLP), healthcare, robotics, and code generation. The applications also extend into retail, where they can enhance customer experiences through dynamic chatbots and AI assistants, and into digital marketing, where they can organize customer feedback and recommend products based on descriptions and purchase behaviors.

The personalization of LLM applications can be achieved by incorporating up-to-date user information, which typically involves integrating several components. One such component is a feature store, a tool that stores, shares, and manages features for machine learning (ML) models. Features are the inputs used during training and inference of ML models. For instance, in an application that recommends movies, features could include previous ratings, preference categories, and demographics. Amazon SageMaker Feature Store is a fully managed repository designed specifically for storing, sharing, and managing ML model features. Another essential component is an orchestration tool suitable for prompt engineering and managing different types of subtasks. Generative AI developers can use frameworks like LangChain, which offers modules for integrating with LLMs and orchestration tools for task management and prompt engineering.

Building on the concept of dynamically fetching up-to-date data to produce personalized content, the use of LLMs has garnered significant attention in recent research for recommender systems. The underlying principle of these approaches involves the construction of prompts that encapsulate the recommendation task, user profiles, item attributes, and user-item interactions. These task-specific prompts are then fed into the LLM, which is tasked with predicting the likelihood of interaction between a particular user and item. As stated in the paper Personalized Recommendation via Prompting Large Language Models, recommendation-driven and engagement-guided prompting components play a crucial role in enabling LLMs to focus on relevant context and align with user preferences.

In this post, we elucidate the simple yet powerful idea of combining user profiles and item attributes to generate personalized content recommendations using LLMs. As demonstrated throughout the post, these models hold immense potential in generating high-quality, context-aware input text, which leads to enhanced recommendations. To illustrate this, we guide you through the process of integrating a feature store (representing user profiles) with an LLM to generate these personalized recommendations.

Solution overview

Let’s imagine a scenario where a movie entertainment company promotes movies to different users via an email campaign. The promotion contains 25 well-known movies, and we want to select the top three recommendations for each user based on their interests and previous rating behaviors.

For example, given a user’s interest in different movie genres like action, romance, and sci-fi, we could have an AI system determine the top three recommended movies for that particular user. In addition, the system might generate personalized messages for each user in a tone tailored to their preferences. We include some examples of personalized messages later in this post.

This AI application would include several components working together, as illustrated in the following diagram:

  1. A user profiling engine takes in a user’s previous behaviors and outputs a user profile reflecting their interests.
  2. A feature store maintains user profile data.
  3. A media metadata store keeps the promotion movie list up to date.
  4. A language model takes the current movie list and user profile data, and outputs the top three recommended movies for each user, written in their preferred tone.
  5. An orchestrating agent coordinates the different components.

In summary, intelligent agents could construct prompts using user- and item-related data and deliver customized natural language responses to users. This would represent a typical content-based recommendation system, which recommends items to users based on their profiles. The user’s profile is stored and maintained in the feature store and revolves around their preferences and tastes. It is commonly derived based on their previous behaviors, such as ratings.

The following diagram illustrates how it works.

The application follows these steps to provide responses to a user’s recommendation:

  1. The user profiling engine takes a user's historical movie ratings as input, outputs their interests, and stores the feature in SageMaker Feature Store. This process can be updated on a schedule.
  2. The agent takes the user ID as input, searches for the user interest, and completes the prompt template following the user’s interests.
  3. The agent takes the promotion item list (movie name, description, genre) from a media metadata store.
  4. The interests prompt template and promotion item list are fed into an LLM for email campaign messages.
  5. The agent sends the personalized email campaign to the end user.

The user profiling engine builds a profile for each user, capturing their preferences and interests. This profile can be represented as a vector with elements mapping to features like movie genres, with values indicating the user’s level of interest. The user profiles in the feature store allow the system to suggest personalized recommendations matching their interests. User profiling is a well-studied domain within recommendation systems. To simplify, you can build a regression algorithm using a user’s previous ratings across different categories to infer their overall preferences. This can be done with algorithms like XGBoost.
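As a deliberately simplified alternative to training a regression model such as XGBoost, a profiling sketch could average a user's ratings per genre. The ratings below are hypothetical:

```python
from collections import defaultdict

# Simplified profiling sketch: average a user's ratings per genre to build
# an interest vector (a trained model like XGBoost would replace this).
ratings = [("Sci-Fi", 5), ("Sci-Fi", 4), ("Romance", 2), ("War", 5), ("War", 4)]
totals, counts = defaultdict(float), defaultdict(int)
for genre, rating in ratings:
    totals[genre] += rating
    counts[genre] += 1
profile = {genre: totals[genre] / counts[genre] for genre in totals}
print(profile)  # {'Sci-Fi': 4.5, 'Romance': 2.0, 'War': 4.5}
```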

Code walkthrough

In this section, we provide examples of the code. The full code walkthrough is available in the GitHub repo.

After obtaining the user interests feature from the user profiling engine, we can store the results in the feature store. SageMaker Feature Store supports batch feature ingestion and online storage for real-time inference. For ingestion, data can be updated in an offline mode, whereas inference needs to happen in milliseconds. SageMaker Feature Store ensures that offline and online datasets remain in sync.

For data ingestion, we use the following code:

from sagemaker.feature_store.feature_group import FeatureGroup

# feature_definitions, sess (a SageMaker session), and data_frame are created
# earlier in the notebook
feature_group_name = 'user-profile-feature-group'
feature_group = FeatureGroup(name=feature_group_name, feature_definitions=feature_definitions, sagemaker_session=sess)

# Ingest data
feature_group.ingest(data_frame=data_frame, max_workers=6, wait=True)

For real-time online storage, we could use the following code to extract the user profile based on the user ID:

feature_record = featurestore_runtime_client.get_record(FeatureGroupName=feature_group_name, RecordIdentifierValueAsString=customer_id)
print(feature_record)

Then we rank the top three interested movie categories to feed the downstream recommendation engine:

User ID: 42
Top3 Categories: ['Animation', 'Thriller', 'Adventure']
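The ranking itself can be sketched in a few lines. The genre scores below are a hypothetical record for user 42; the actual scores come from the feature store:

```python
# Sketch of the ranking step over a hypothetical feature record of genre scores.
record = {"Animation": 0.92, "Thriller": 0.81, "Adventure": 0.77, "Romance": 0.40}
top3 = sorted(record, key=record.get, reverse=True)[:3]
print(top3)  # ['Animation', 'Thriller', 'Adventure']
```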

Our application employs two primary components. The first component retrieves data from a feature store, and the second component acquires a list of movie promotions from the metadata store. The coordination between these components is managed by Chains from LangChain, which represent a sequence of calls to components.

It’s worth mentioning that in complex scenarios, the application may need more than a fixed sequence of calls to LLMs or other tools. Agents, equipped with a suite of tools, use an LLM to determine the sequence of actions to be taken. Whereas Chains encode a hardcoded sequence of actions, agents use the reasoning power of a language model to dictate the order and nature of actions.

The connection between different data sources, including SageMaker Feature Store, is demonstrated in the following code. All the retrieved data is consolidated to construct an extensive prompt, serving as input for the LLM. We dive deep into the specifics of prompt design in the subsequent section. The following is a prompt template definition that interfaces with multiple data sources:

from langchain.prompts import StringPromptTemplate

class FeatureStorePromptTemplate(StringPromptTemplate):
    
    feature_group_name = 'user-profile-feature-group'
    
    def format(self, **kwargs) -> str:
        user_id = kwargs.pop("user_id")
        feature_record = self.fetch_user_preference_from_feature_store(user_id)
        user_preference = self.rank_user_preference(feature_record)
        
        kwargs["promotion_movie_list"] = self.read_promotion_list()
        kwargs["user_preference"] = user_preference
        return prompt.format(**kwargs)
    
    def fetch_user_preference_from_feature_store(self, user_id):
        
        boto_session = boto3.Session()
        featurestore_runtime_client = boto_session.client('sagemaker-featurestore-runtime')
        feature_record = featurestore_runtime_client.get_record(FeatureGroupName=self.feature_group_name, RecordIdentifierValueAsString=str(user_id))
        return feature_record['Record']
    
    # Rank Top_3_Categories for given user's preference
    def rank_user_preference(self, data) -> str:
        # refer to the details in the notebook
        return str(top_categories_df.values.tolist())
        
    # Get promotion movie list from metadata store
    def read_promotion_list(self,) -> str:
        # refer to the details in the notebook
        return output_string

In addition, we use Amazon SageMaker to host our LLM model and expose it as the LangChain SageMaker endpoint. To deploy the LLM, we use Amazon SageMaker JumpStart (for more details, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart). After the model is deployed, we can create the LLM module:

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # refer to the details in the notebook
        ...

    def transform_output(self, output: bytes) -> str:
        # refer to the details in the notebook
        ...

content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name = endpoint_name,
    region_name = aws_region,
    model_kwargs = parameters,
    endpoint_kwargs={"CustomAttributes": 'accept_eula=true'},
    content_handler = content_handler,
)

In the context of our application, the agent runs a sequence of steps, called an LLMChain. It integrates a prompt template, model, and guardrails to format the user input, pass it to the model, get a response, and then validate (and, if necessary, rectify) the model output.

from langchain.chains import LLMChain
llmchain = LLMChain(llm=sm_llm, prompt=prompt_template)
email_content = llmchain.run({'user_id': 4})
print(email_content)

In the next section, we walk through the prompt engineering for the LLM to output expected results.

LLM recommendation prompting and results

Following the high-level concept of engagement-guided prompting as described in the research study Personalized Recommendation via Prompting Large Language Models, the fundamental principle of our prompting strategy is to integrate user preferences in creating prompts. These prompts are designed to guide the LLM towards more effectively identifying attributes within the content description that align with user preferences. To elaborate further, our prompt comprises several components:

  • Contextual relevance – The initial part of our prompt template incorporates media metadata such as item name (movie title), description (movie synopsis), and attribute (movie genre). By incorporating this information, the prompt provides the LLM with a broader context and a more comprehensive understanding of the content. This contextual information aids the LLM in better understanding the item through its description and attributes, thereby enhancing its utility in content recommendation scenarios.
  • User preference alignment – By taking into account a user profile that signifies user preferences, potential recommendations are better positioned to identify content characteristics and features that resonate with target users. This alignment augments the utility of the item descriptions because it enhances the efficiency of recommending items that are relevant and in line with user preferences.
  • Enhanced recommendation quality – The engagement-guided prompt uses user preferences to identify relevant promotional items. We can also use user preference to adjust the tone of the LLM for the final output. This can result in an accurate, informative, and personalized experience, thereby improving the overall performance of the content recommendation system.

The following code shows an example prompt template:

prompt_template = """
Our company, "Classic Cinema", frequently promotes movies that we aim to recommend to our customers. This month, we have several popular movies on promotion.

As an AI agent, you are tasked to assist "Classic Cinema" in crafting an email campaign to recommend relevant movies to users. The recommendations should adhere to several guidelines, including contextual relevance, ensuring the recommendations are strictly from our promotional movie list. Additionally, the recommendations should align with user preferences, suggesting items that are relevant and in harmony with the user's preferred categories. You are to provide precisely three top recommended movies. Finally, please draft the email to reflect the tone of the user's preferred categories. The email should not exceed 100 words.

The recommended movies should be sourced from this contextual relevance movie list:
{promotion_movie_list}.

The user has expressed interest in {user_preference}.

Please ensure the recommendations are relevant, and the tone of the email reflects the tastes of those interested in the {user_preference} movie category.

Ensure the letter appeals to those interested in the {user_preference} movie category, and keep the email campaign within a 100-word limit. """

The following is an example result with user preferences of sci-fi, adventure, and war genres:

Subject: Explore the Frontier of Classic Cinema with Our Sci-Fi, Adventure, and War Movies!

Dear [Name],
Are you ready to embark on a journey through time and space, experience heart-pumping action, and witness the bravery of heroes on the battlefield? Look no further! Classic Cinema proudly presents our carefully curated selection of movies, tailored specifically to your interests in Sci-Fi, Adventure, and War.

First up, we have the iconic “Star Wars: Episode V – The Empire Strikes Back” (1980). This classic installment in the Star Wars saga follows Luke Skywalker’s journey to becoming a Jedi Master, while his friends face off against the evil Empire. With its groundbreaking special effects and memorable characters, this movie is a must-see for any fan of the genre.

Next, we recommend “Interstellar” (2014), a thought-provoking and visually stunning film that delves into the mysteries of time and space. As a group of explorers embark on a quest to save humanity, they encounter breathtaking landscapes and uncover the secrets of the universe.

Lastly, we have “Saving Private Ryan” (1998), a gripping and intense war drama that follows a group of soldiers on a dangerous mission to find and rescue a paratrooper whose brothers have been killed in action. This critically acclaimed film is a powerful tribute to the heroes of World War II.
Don’t miss out on these cinematic masterpieces! Watch them now and experience the thrill of adventure, the wonder of sci-fi, and the bravery of war heroes.
Happy viewing, and may the force be with you!

Best regards,
Classic Cinema Team

The following is another example result with a user preference of documentary, musical, and drama:

Subject: Classic Cinema’s Recommendations for Documentary, Musical, and Drama Lovers

Dear [Name],
We hope this email finds you well and that you’re enjoying the variety of movies available on our platform. At Classic Cinema, we take pride in catering to the diverse tastes of our customers, and we’ve selected three exceptional movies that we believe will resonate with your interest in Documentary, Musical, and Drama.


First up, we have “The Shawshank Redemption” (1994), a powerful and uplifting drama that follows the journey of two prisoners as they find hope and redemption in a corrupt and unforgiving prison system. With its gripping storyline, outstanding performances, and timeless themes, this movie is a must-see for anyone who loves a well-crafted drama.


Next, we recommend “The Lord of the Rings: The Fellowship of the Ring” (2001), an epic adventure that combines breathtaking visuals, memorable characters, and a richly detailed world. This movie is a masterclass in storytelling, with a deep sense of history and culture that will transport you to Middle-earth and leave you wanting more.


Lastly, we suggest “The Pianist” (2002), a profound and moving documentary that tells the true story of Władysław Szpilman, a Polish Jewish pianist who struggled to survive the destruction of the Warsaw ghetto during World War II. This film is a powerful reminder of the human spirit’s capacity for resilience and hope, even in the face of unimaginable tragedy.


We hope these recommendations resonate with your interests and provide you with an enjoyable and enriching movie experience. Don’t miss out on these timeless classics – watch them now and discover the magic of Classic Cinema!
Best regards,
The Classic Cinema Team

We carried out tests with both Llama 2 7B-Chat and Llama 2 70B for comparison. Both models performed well, yielding consistent conclusions. By using a prompt template filled with up-to-date data, we found it easier to test arbitrary LLMs, helping us choose the right balance between performance and cost. We also made several shared observations that are worth noting.

Firstly, we can see that the recommendations provided genuinely align with user preferences. The movie recommendations are guided by various components within our application, most notably the user profile stored in the feature store.

Additionally, the tone of the emails corresponds to user preferences. Thanks to the advanced language understanding capabilities of LLM, we can customize the movie descriptions and email content, tailoring them to each individual user.

Furthermore, the final output format can be designed into the prompt. For example, in our case, the salutation "Dear [Name]" needs to be filled by the email service. It's important to note that although we avoid exposing personally identifiable information (PII) within our generative AI application, there is the possibility to reintroduce this information during postprocessing, assuming the right level of permissions is granted.
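Such postprocessing might look like the following sketch, run by the email service outside the generative AI application so that no PII reaches the LLM. The name value is a hypothetical placeholder:

```python
# Hypothetical post-processing step performed by the email service:
# fill the "[Name]" placeholder the LLM was instructed to emit.
email_body = "Dear [Name],\n\nWelcome back to Classic Cinema!"
personalized = email_body.replace("[Name]", "Alex")  # "Alex" is a placeholder value
print(personalized.splitlines()[0])  # Dear Alex,
```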

Clean up

To avoid unnecessary costs, delete the resources you created as part of this solution, including the feature store and LLM inference endpoint deployed with SageMaker JumpStart.

Conclusion

The power of LLMs in generating personalized recommendations is immense and transformative, particularly when coupled with the right tools. By integrating SageMaker Feature Store and LangChain for prompt engineering, developers can construct and manage highly tailored user profiles. This results in high-quality, context-aware inputs that significantly enhance recommendation performance. In our illustrative scenario, we saw how this can be applied to tailor movie recommendations to individual user preferences, resulting in a highly personalized experience.

As the LLM landscape continues to evolve, we anticipate seeing more innovative applications that use these models to deliver even more engaging, personalized experiences. The possibilities are boundless, and we are excited to see what you will create with these tools. With resources such as SageMaker JumpStart and Amazon Bedrock now available to accelerate the development of generative AI applications, we strongly recommend exploring the construction of recommendation solutions using LLMs on AWS.


About the Authors

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices cross many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.

Michelle Hong, PhD, works as Prototyping Solutions Architect at Amazon Web Services, where she helps customers build innovative applications using a variety of AWS components. She demonstrated her expertise in machine learning, particularly in natural language processing, to develop data-driven solutions that optimize business processes and improve customer experiences.

Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems. Outside of his professional life, he enjoys music, reading, and traveling.
