Learn how Google DeepMind and Google Cloud are helping to bring a cinema classic to larger-than-life scale in Las Vegas.
Repurposing Protein Folding Models for Generation with Latent Diffusion
PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models.
The awarding of the 2024 Nobel Prize for AlphaFold2 marks an important moment of recognition for the role of AI in biology. What comes next after protein folding?
In PLAID, we develop a method that learns to sample from the latent space of protein folding models to generate new proteins. It can accept compositional function and organism prompts, and can be trained on sequence databases, which are 2-4 orders of magnitude larger than structure databases. Unlike many previous protein structure generative models, PLAID addresses the multimodal co-generation problem setting: simultaneously generating both discrete sequence and continuous all-atom structural coordinates.
Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers
Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation.
A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers.
Why Run Whisper on Arm?
Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive.
What You’ll Learn
The Arm Learning Path provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you cover:
1. Set Up Your Environment
Before running Whisper, you must set up your development environment. The learning path walks you through setting up an Arm-based cloud instance and installing all dependencies, such as PyTorch, Transformers, and ffmpeg.
2. Run Whisper with PyTorch and Hugging Face Transformers
Once the environment is ready, you will use the Hugging Face Transformers library with PyTorch to load and run Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach for processing audio files and generating transcripts.
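As a minimal sketch (not the Learning Path's exact code), transcription with the Transformers pipeline might look like the following; the checkpoint and audio file name are placeholders:

```python
# Transcribe an audio file with Whisper via the Hugging Face Transformers pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # any Whisper checkpoint can be substituted
)

# chunk_length_s lets the pipeline handle audio longer than Whisper's 30-second window
result = asr("sample.wav", chunk_length_s=30)
print(result["text"])
```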
3. Measure and Evaluate Performance
To ensure efficient execution, you learn how to measure transcription speeds and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions on your deployment.
Try it Yourself
Upon completion of this tutorial, you will know how to:
- Deploy Whisper on an Arm-based cloud instance.
- Implement performance optimizations for efficient execution.
- Evaluate transcription speeds and optimize further based on results.
Try the live demo today and see audio transcription in action on Arm: Whisper on Arm Demo.
Revisit Large-Scale Image–Caption Data in Pre-training Multimodal Foundation Models
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. Notably, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still unclear. Additionally, different multimodal foundation models may have distinct preferences for specific caption formats, while efforts to study the optimal captions for each foundation model remain limited. In this work, we introduce a novel, controllable, and scalable captioning pipeline that generates diverse caption formats… (Apple Machine Learning Research)
Do LLMs Estimate Uncertainty Well in Instruction-Following?
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies… (Apple Machine Learning Research)
The Llama 4 family of models from Meta is now available in SageMaker JumpStart
Today, we’re excited to announce the availability of Llama 4 Scout and Maverick models in Amazon SageMaker JumpStart and coming soon in Amazon Bedrock. Llama 4 represents Meta’s most advanced multimodal models to date, featuring a mixture of experts (MoE) architecture and context window support up to 10 million tokens. With native multimodality and early fusion technology, Meta states that these new models demonstrate unprecedented performance across text and vision tasks while maintaining efficient compute requirements. With a dramatic increase in supported context length from 128K in Llama 3, Llama 4 is now suitable for multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over extensive codebases. You can now deploy the Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8 models using SageMaker JumpStart in the US East (N. Virginia) AWS Region.
In this blog post, we walk you through how to deploy and prompt a Llama-4-Scout-17B-16E-Instruct model using SageMaker JumpStart.
Llama 4 overview
Meta announced Llama 4 today, introducing three distinct model variants: Scout, which offers advanced multimodal capabilities and a 10M token context window; Maverick, a cost-effective solution with a 128K context window; and Behemoth, in preview. These models are optimized for multimodal reasoning, multilingual tasks, coding, tool-calling, and powering agentic systems.
Llama 4 Maverick is a powerful general-purpose model with 17 billion active parameters, 128 experts, and 400 billion total parameters, optimized for high-quality general assistant and chat use cases. Llama 4 Maverick is available as base and instruct models; the instruct model is offered in both a quantized (FP8) version for efficient deployment and a non-quantized (BF16) version for maximum accuracy.
Llama 4 Scout, the smaller and more compact model, has 17 billion active parameters, 16 experts, and 109 billion total parameters, and features an industry-leading 10M token context window. The Llama 4 models are designed for industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of AI applications that bridge language barriers.
See Meta’s community license agreement for usage terms and more details.
SageMaker JumpStart overview
SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.
With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and isolated within your virtual private cloud (VPC). After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.
Prerequisites
To try the Llama 4 models in SageMaker JumpStart, you need the following prerequisites:
- An AWS account that will contain all your AWS resources.
- An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
- Access to Amazon SageMaker Studio and a SageMaker AI notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
- Access to accelerated instances (GPUs) for hosting the LLMs.
Discover Llama 4 models in SageMaker JumpStart
SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the Amazon SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.
SageMaker Studio is a comprehensive integrated development environment (IDE) that offers a unified, web-based interface for performing all aspects of the AI development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process.
In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment on SageMaker Inference. You can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page in SageMaker Studio, as shown in the following figure.
Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI and machine learning (AI/ML) workflows and pipelines.
By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.
Deploy Llama 4 models for inference through the SageMaker JumpStart UI
On the SageMaker JumpStart landing page, you can find all the public pre-trained models offered by SageMaker AI. You can then choose the Meta model provider tab to discover all the available Meta models.
If you’re using SageMaker Classic Studio and don’t see the Llama 4 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, see Shut down and Update Studio Classic Apps.
- Search for Meta to view the Meta model card. Each model card shows key information, including:
- Model name
- Provider name
- Task category (for example, Text Generation)
- Select the model card to view the model details page.
The model details page includes the following information:
- The model name and provider information
- Deploy button to deploy the model
- About and Notebooks tabs with detailed information
The About tab includes important details, such as:
- Model description
- License information
- Technical specifications
- Usage guidelines
Before you deploy the model, we recommend you review the model details and license terms to confirm compatibility with your use case.
- Choose Deploy to proceed with deployment.
- For Endpoint name, use the automatically generated name or enter a custom one.
- For Instance type, use the default: ml.p5.48xlarge.
- For Initial instance count, enter the number of instances (default: 1).
Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed.
- Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.
- Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
- Choose Deploy. The deployment process can take several minutes to complete.
When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
Deploy Llama 4 models for inference using the SageMaker Python SDK
When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.
To deploy using a notebook, start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker AI.
You can deploy the Llama 4 Scout model using SageMaker JumpStart with the following SageMaker Python SDK code:
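A minimal sketch of what this deployment step might look like (arguments other than the model ID and EULA acceptance are left at their defaults):

```python
# Deploy Llama 4 Scout from SageMaker JumpStart using the SageMaker Python SDK.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-vlm-llama-4-scout-17b-16e-instruct")

# accept_eula=True is required to accept Meta's license terms at deploy time
predictor = model.deploy(accept_eula=True)
```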
This deploys the model on SageMaker AI with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
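As a hedged sketch (the exact request schema depends on the deployed container; the messages-style payload below is an assumption rather than the post's exact payload):

```python
# Invoke the deployed endpoint through the SageMaker predictor.
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are three benefits of mixture of experts (MoE) models?"},
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

response = predictor.predict(payload)
print(response)
```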
Recommended instances and benchmark
The following table lists the Llama 4 models available in SageMaker JumpStart along with their model IDs, default instance types, and supported instance types. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

| Model name | Model ID | Default instance type | Supported instance types |
|---|---|---|---|
| Llama-4-Scout-17B-16E-Instruct | meta-vlm-llama-4-scout-17b-16e-instruct | ml.p5.48xlarge | ml.g6e.48xlarge, ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct | meta-vlm-llama-4-maverick-17b-128e-instruct | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | meta-vlm-llama-4-maverick-17b-128-instruct-fp8 | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
Inference and example prompts for Llama 4 Scout 17B 16 Experts model
You can use the Llama 4 Scout model for text and image or vision reasoning use cases. With that model, you can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more.
In the following sections, we show example payloads, invocations, and responses for Llama 4 Scout that you can use against your Llama 4 model deployments using SageMaker JumpStart.
Text-only input
Input:
Response:
Single-image input
In this section, let’s test Llama 4’s multimodal capabilities. By merging text and vision tokens into a unified processing backbone, Llama 4 can seamlessly understand and respond to queries about an image. The following is an example of how you can prompt Llama 4 to answer questions about an image such as the one in the example:
Image:
Input:
Response:
The Llama 4 model on JumpStart can take in the image provided via a URL, underlining its powerful potential for real-time multimodal applications.
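As a hedged sketch of what such a multimodal request might look like against the predictor created earlier (the content-part schema and the image URL are placeholders, not the post's exact payload):

```python
# Ask a question about an image referenced by URL.
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample-image.jpg"}},  # placeholder URL
            ],
        }
    ],
    "max_tokens": 512,
}

response = predictor.predict(payload)
print(response)
```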
Multi-image input
Building on its advanced multimodal functionality, Llama 4 can effortlessly process multiple images at the same time. In this demonstration, the model is prompted with two image URLs and tasked with describing each image and explaining their relationship, showcasing its capacity to synthesize information across several visual inputs. Let’s test this below by passing in the URLs of the following images in the payload.
Image 1:
Image 2:
Input:
Response:
As you can see, Llama 4 excels in handling multiple images simultaneously, providing detailed and contextually relevant insights that emphasize its robust multimodal processing abilities.
Codebase analysis with Llama 4
Using Llama 4 Scout’s industry-leading context window, this section showcases its ability to deeply analyze expansive codebases. The example extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the model synthesizes information across an entire repository. We used a tool to ingest the whole repository into plaintext that we provided to the model as context:
Input:
Output:
Multi-document processing
Harnessing the same extensive token context window, Llama 4 Scout excels in multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2017-2024), demonstrating its capability to integrate and analyze data spanning multiple years—all without the need for additional processing tools.
Input:
Output:
Clean up
To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:
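A minimal sketch using the predictor created earlier:

```python
# Delete the model and endpoint to stop incurring charges.
predictor.delete_model()
predictor.delete_endpoint()
```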
Alternatively, using the SageMaker console, complete the following steps:
- On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
- Search for the embedding and text generation endpoints.
- On the endpoint details page, choose Delete.
- Choose Delete again to confirm.
Conclusion
In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 4 models today.
For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.
About the authors
Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. As a member of the Third-party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a global lead for the Meta–AWS Partnership and technical strategy. Based in Seattle, Washington, Marco enjoys writing, reading, exercising, and building applications in his free time.
Chakravarthy Nagarajan is a Principal Solutions Architect specializing in machine learning, big data, and high performance computing. In his current role, he helps customers solve real-world, complex business problems using machine learning and generative AI solutions.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to take advantage of state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master’s degree in Computer Science.
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon Sagemaker and Amazon EC2. Based in San Francisco, Baladithya enjoys tinkering, developing applications, and his home lab in his free time.
John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 and Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.
Multi-tenancy in RAG applications in a single Amazon Bedrock knowledge base with metadata filtering
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and AWS. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval Augmented Generation (RAG) workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources.
Organizations need to control access to their data across different business units, including companies, departments, or even individuals, while maintaining scalability. When organizations try to separate data sources manually, they often create unnecessary complexity and hit service limitations. This post demonstrates how Amazon Bedrock Knowledge Bases can help you scale your data management effectively while maintaining proper access controls on different management levels.
One of these strategies is using Amazon Simple Storage Service (Amazon S3) folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segmentation within a single knowledge base. Additionally, we dive into integrating common vector database solutions available for Amazon Bedrock Knowledge Bases and how these integrations enable advanced metadata filtering and querying capabilities.
Organizing S3 folder structures for scalable knowledge bases
Organizations working with multiple customers need a secure and scalable way to keep each customer’s data separate while maintaining efficient access controls. Without proper data segregation, companies risk exposing sensitive information between customers or creating complex, hard-to-maintain systems. For this post, we focus on maintaining access controls across multiple business units within the same management level.
A key strategy involves using S3 folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segregation within a single knowledge base. Instead of creating separate knowledge bases for each customer, you can use a consolidated knowledge base with a well-structured S3 folder hierarchy. For example, imagine a consulting firm that manages documentation for multiple healthcare providers—each customer’s sensitive patient records and operational documents must remain strictly separated. The Amazon S3 structure might look as follows:
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/policies/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/procedures/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/policies/
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/procedures/
This structure makes sure that Customer A’s healthcare documentation remains completely separate from Customer B’s data. When combined with Amazon Bedrock Knowledge Bases metadata filtering, you can verify that users associated with Customer A can only access their organization’s documents, and Customer B’s users can only see their own data—maintaining strict data boundaries while using a single, efficient knowledge base infrastructure.
The Amazon Bedrock Knowledge Bases metadata filtering capability enhances this segregation by allowing you to tag documents with customer-specific identifiers and other relevant attributes. These metadata filters provide an additional layer of security and organization, making sure that queries only return results from the appropriate customer’s dataset.
Solution overview
The following diagram provides a high-level overview of AWS services and features through a sample use case. Although the example uses Customer A and Customer B for illustration, these can represent distinct business units (such as departments, companies, or teams) with different compliance requirements, rather than only individual customers.
The workflow consists of the following steps:
- Customer data is uploaded along with metadata indicating data ownership and other properties to specific folders in an S3 bucket.
- The S3 bucket, containing customer data and metadata, is configured as a knowledge base data source. Amazon Bedrock Knowledge Bases ingests the data, along with the metadata, from the source repository and a knowledge base sync is performed.
- A customer initiates a query using a frontend application with metadata filters against the Amazon Bedrock knowledge base. An access control metadata filter must be in place to make sure that the customer only accesses data they own; the customer can apply additional filters to further refine query results. This combined query and filter is passed to the RetrieveAndGenerate API.
- The RetrieveAndGenerate API handles the core RAG workflow. It consists of several sub-steps:
  - The user query is converted into a vector representation (embedding).
  - Using the query embedding and the metadata filter, relevant documents are retrieved from the knowledge base.
  - The original query is augmented with the retrieved documents, providing context for the large language model (LLM).
  - The LLM generates a response based on the augmented query and retrieved context.
- Finally, the generated response is sent back to the user.
When implementing Amazon Bedrock Knowledge Bases in scenarios involving sensitive information or requiring access controls, developers must implement proper metadata filtering in their application code. Failure to enforce appropriate metadata-based filtering could result in unauthorized access to sensitive documents within the knowledge base. Metadata filtering serves as a critical security boundary and should be consistently applied across all queries. For comprehensive guidance on implementing secure metadata filtering practices, refer to the Amazon Bedrock Knowledge Base Security documentation.
Implement metadata filtering
For this use case, two specific example customers, Customer A and Customer B, are aligned to different proprietary compliance documents. The number of customers and folders can scale to N depending on the size of the customer base. We will use the following public documents, which will reside in the respective customer’s S3 folder. Customer A requires the Architecting for HIPAA Security and Compliance on AWS document. Customer B requires access to the Using AWS in the Context of NHS Cloud Security Guidance document.
- Create a JSON file representing the corresponding metadata for both Customer A and Customer B:
The following is the JSON metadata for Customer A’s data:
{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}
The following is the JSON metadata for Customer B’s data:
{ "metadataAttributes": { "customer": "CustomerB", "documentType": "NHS Compliance Guidance", "focus": "UK Healthcare Compliance", "publicationYear": 2023, "region": "Europe" }}
- Save these files separately with the naming convention <filename>.pdf.metadata.json and store them in the same S3 folder or prefix that stores the source document (see the upload sketch after this list). For Customer A, name the metadata file architecting-hipaa-compliance-on-aws.pdf.metadata.json and upload it to the folder corresponding to Customer A’s documents. Repeat these steps for Customer B.
- Create an Amazon Bedrock knowledge base. For instructions, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases.
- After you create your knowledge base, you can sync the data source. For more details, see Sync your data with your Amazon Bedrock knowledge base.
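A hedged sketch of the upload step for Customer A (the local file names and the exact prefix are placeholders based on the folder structure shown earlier):

```python
# Upload a source document and its metadata file to the same prefix so they are
# associated during knowledge base ingestion.
import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-my-knowledge-base-bucket"   # bucket name used in this post
prefix = "customer-data/customerA/policies/"

for filename in [
    "architecting-hipaa-compliance-on-aws.pdf",
    "architecting-hipaa-compliance-on-aws.pdf.metadata.json",
]:
    s3.upload_file(filename, bucket, prefix + filename)
```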
Test metadata filtering
After you sync the data source, you can test the metadata filtering.
The following is an example of setting the customer = CustomerA metadata filter to show that Customer A only has access to the HIPAA compliance document and not the NHS Compliance Guidance that relates to Customer B.
To use the metadata filtering options on the Amazon Bedrock console, complete the following steps:
- On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
- Choose the knowledge base you created.
- Choose Test knowledge base.
- Choose the Configurations icon, then expand Filters.
- Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
- When finished, enter your query in the message box, then choose Run.
We enter two queries, “summarize NHS Compliance Guidance” and “summarize HIPAA Compliance Guide.” The following figure shows the two queries: one attempting to query data related to NHS compliance guidance, which fails because it is outside of the Customer A segment, and another successfully querying data on HIPAA compliance, which has been tagged for Customer A.
Implement field-specific chunking
Amazon Bedrock Knowledge Bases supports several document types for Amazon S3 metadata filtering. The supported file formats include:
- Plain text (.txt)
- Markdown (.md)
- HTML (.html)
- Microsoft Word documents (.doc and .docx)
- CSV files (.csv)
- Microsoft Excel spreadsheets (.xls and .xlsx)
When working with CSV data, customers often want to chunk on a specific field in their CSV documents to gain granular control over data retrieval and enhance the efficiency and accuracy of queries. By creating logical divisions based on fields, users can quickly access relevant subsets of data without needing to process the entire dataset.
Additionally, field-specific chunking aids in organizing and maintaining large datasets, facilitating updating or modifying specific portions without affecting the whole. This granularity supports better version control and data lineage tracking, which are crucial for data integrity and compliance. Focusing on relevant chunks can improve the performance of LLMs, ultimately leading to more accurate insights and better decision-making processes within organizations. For more information, see Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.
To demonstrate field-specific chunking, we use two sample datasets with the following schemas:
- Schema 1 – Customer A uses the following synthetic dataset for recording medical case reports (case_reports.csv):
| CaseID | DoctorID | PatientID | Diagnosis | TreatmentPlan | Content |
|---|---|---|---|---|---|
| C001 | D001 | P001 | Hypertension | Lifestyle changes, Medication (Lisinopril) | “Patient diagnosed with hypertension, advised lifestyle changes, and started on Lisinopril.” |
| C002 | D002 | P002 | Diabetes Type 2 | Medication (Metformin), Diet adjustment | “Diabetes Type 2 confirmed, prescribed Metformin, and discussed a low-carb diet plan.” |
| C003 | D003 | P003 | Asthma | Inhaler (Albuterol) | “Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management.” |
| C004 | D004 | P004 | Coronary Artery Disease | Medication (Atorvastatin), Surgery Consultation | “Coronary artery disease diagnosed, started on Atorvastatin, surgery consultation recommended.” |
| … | … | … | … | … | … |
- Schema 2 – Customer B uses the following dataset for recording genetic testing results (genetic_testings.csv):
| SampleID | PatientID | TestType | Result |
|---|---|---|---|
| S001 | P001 | Genome Sequencing | Positive |
| S002 | P002 | Exome Sequencing | Negative |
| S003 | P003 | Targeted Gene Panel | Positive |
| S004 | P004 | Whole Genome Sequencing | Negative |
| … | … | … | … |
Complete the following steps:
- Create a JSON file representing the corresponding metadata for both Customer A and Customer B:
The following is the JSON metadata for Customer A’s data (note that recordBasedStructureMetadata supports exactly one content field):
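The original post presents these metadata files as figures; the following is a hedged reconstruction of what Customer A’s file might look like. The documentStructureConfiguration field names are assumptions based on the Amazon Bedrock Knowledge Bases CSV metadata format and should be checked against the current documentation:

```json
{
  "metadataAttributes": {
    "customer": "CustomerA"
  },
  "documentStructureConfiguration": {
    "type": "RECORD_BASED_STRUCTURE_METADATA",
    "recordBasedStructureMetadata": {
      "contentFields": [
        { "fieldName": "Content" }
      ],
      "metadataFieldsSpecification": {
        "fieldsToInclude": [
          { "fieldName": "CaseID" },
          { "fieldName": "DoctorID" },
          { "fieldName": "PatientID" },
          { "fieldName": "Diagnosis" },
          { "fieldName": "TreatmentPlan" }
        ],
        "fieldsToExclude": []
      }
    }
  }
}
```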
The following is the JSON metadata for Customer B’s data:
- Save your files with the naming convention <filename>.csv.metadata.json and store the new JSON file in the same S3 prefix of the bucket where you stored the dataset. For Customer A, name the metadata file case_reports.csv.metadata.json and upload the file to the same folder corresponding to Customer A’s datasets.
Repeat the process for Customer B. You have now created metadata from the source CSV itself, as well as an additional metadata field customer that doesn’t exist in the original dataset. The following image highlights the metadata.
Test field-specific chunking
The following is an example of setting the customer = CustomerA metadata filter demonstrating that Customer A only has access to the medical case reports dataset and not the genetic testing dataset that relates to Customer B. We enter a query requesting information about a patient with PatientID as P003.
To test, complete the following steps:
- On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
- Choose the knowledge base you created.
- Choose Test knowledge base.
- Choose the Configurations icon, then expand Filters.
- Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
- When finished, enter your query in the message box, then choose Run.
The knowledge base returns, “Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management,” which is the Content column entry from Customer A’s medical case reports dataset for that PatientID. Although there is a record with the same PatientID in Customer B’s genetic testing dataset, Customer A has access only to the medical case reports data due to the metadata filtering.
Apply metadata filtering for the Amazon Bedrock API
You can call the Amazon Bedrock API RetrieveAndGenerate to query a knowledge base and generate responses based on the retrieved results using the specified FM or inference profile. The response only cites sources that are relevant to the query.
The following Python Boto3 example API call applies the metadata filtering for retrieving Customer B data and generates responses based on the retrieved results using the specified FM (Anthropic’s Claude 3 Sonnet) in RetrieveAndGenerate:
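A hedged sketch of what this call might look like (the knowledge base ID, Region, and query text are placeholders):

```python
# Query the knowledge base with a metadata filter restricting results to Customer B.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "Summarize the NHS Cloud Security Guidance document."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "filter": {"equals": {"key": "customer", "value": "CustomerB"}}
                }
            },
        },
    },
)

print(response["output"]["text"])
```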
The following GitHub repository provides a notebook that you can follow to deploy an Amazon Bedrock knowledge base with access control implemented using metadata filtering in your own AWS account.
Integrate existing vector databases with Amazon Bedrock Knowledge Bases and validate metadata
There are multiple ways to create vector databases from AWS services and partner offerings to build scalable solutions. If a vector database doesn’t exist, you can use Amazon Bedrock Knowledge Bases to create one using Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL Serverless, or Amazon Neptune Analytics to store embeddings, or you can specify an existing vector database supported by Redis Enterprise Cloud, Amazon Aurora PostgreSQL with the pgvector extension, MongoDB Atlas, or Pinecone. After you create your knowledge base and either ingest or sync your data, the metadata attached to the data will be ingested and automatically populated to the vector database.
In this section, we review how to incorporate and validate metadata filtering with existing vector databases using OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone. To learn how to set up each individual vector database, follow the instructions in Prerequisites for your own vector store for a knowledge base.
OpenSearch Serverless as a knowledge base vector store
With OpenSearch Serverless vector database capabilities, you can implement semantic search, RAG with LLMs, and recommendation engines. To address data segregation between business segments within each Amazon Bedrock knowledge base with an OpenSearch Serverless vector database, use metadata filtering. Metadata filtering allows you to segment data inside of an OpenSearch Serverless vector database. This can be useful when you want to add descriptive data to your documents for more control and granularity in searches.
Each OpenSearch Serverless dashboard has a URL that can be used to add documents and query your database; the structure of the URL is domain-endpoint/_dashboard.
After creating a vector database index, you can use metadata filtering to selectively retrieve items by using JSON query options in the request body. For example, to return records owned by Customer A, you can use the following request:
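A hedged sketch of such a request from the OpenSearch Dashboards Dev Tools console (the index and field names are assumptions and depend on how the knowledge base index was configured):

```json
GET /bedrock-knowledge-base-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "customer": "CustomerA" } }
      ]
    }
  }
}
```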
This query will return a JSON response containing the document index with the document labeled as belonging to Customer A.
Aurora PostgreSQL with the pgvector extension as a knowledge base vector store
Pgvector is an extension of PostgreSQL that allows you to extend your relational database into a high-dimensional vector database. It stores each document’s vector in a separate row of a database table. For details on creating an Aurora PostgreSQL table to be used as the vector store for a knowledge base, see Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock.
When storing a vector index for your knowledge base in an Aurora database cluster, make sure that the table for your index contains a column for each metadata property in your metadata files before starting data ingestion.
Continuing with the Customer A example, the customer requires the Architecting for HIPAA Security and Compliance on AWS document.
The following is the JSON metadata for Customer A’s data:
{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}
The schema of the PostgreSQL table you create must contain four essential columns for ID, text content, vector values, and service managed metadata; it must also include additional metadata columns (customer, documentType, focus, publicationYear, region) for each metadata property in the corresponding metadata file. This allows pgvector to perform efficient vector searches and similarity comparisons by running queries directly on the database table. The following table summarizes the columns.
| Column Name | Data Type | Description |
|---|---|---|
| id | UUID primary key | Contains unique identifiers for each record |
| chunks | Text | Contains the chunks of raw text from your data sources |
| embedding | Vector | Contains the vector embeddings of the data sources |
| metadata | JSON | Contains Amazon Bedrock managed metadata required to carry out source attribution and to enable data ingestion and querying |
| customer | Text | Contains the customer ID |
| documentType | Text | Contains the type of document |
| focus | Text | Contains the document focus |
| publicationYear | Int | Contains the year the document was published |
| region | Text | Contains the document’s related AWS Region |
During Amazon Bedrock knowledge base data ingestion, these columns will be populated with the corresponding attribute values. Chunking can break down a single document into multiple separate records (each associated with a different ID).
This PostgreSQL table structure allows for efficient storage and retrieval of document vectors, using PostgreSQL’s robustness and pgvector’s specialized vector handling capabilities for applications like recommendation systems, search engines, or other systems requiring similarity searches in high-dimensional space.
Using this approach, you can implement access control at the table level by creating database tables for each segment. Additional metadata columns can also be included in the table for properties such as the specific document owner (user_id), tags, and so on to further enable and enforce fine-grained (row-level) access control and result filtering if you restrict each user to only query the rows that contain their user ID (document owner).
After creating a vector database table, you can use metadata filtering to selectively retrieve items by using a PostgreSQL query. For example, to return table records owned by Customer A, you can use the following query:
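A hedged sketch of such a query (the table name bedrock_kb is a placeholder for the table you created for the knowledge base index):

```sql
-- Retrieve rows belonging to Customer A from the knowledge base table.
SELECT id, chunks, customer
FROM bedrock_kb
WHERE customer = 'CustomerA';
```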
This query will return a response containing the database records with the document labeled as belonging to Customer A.
Pinecone as a knowledge base vector store
Pinecone, a fully managed vector database, enables semantic search, high-performance search, and similarity matching. Pinecone databases can be integrated into your AWS environment in the form of Amazon Bedrock knowledge bases, but are first created through the Pinecone console. For detailed documentation about setting up a vector store in Pinecone, see Pinecone as a Knowledge Base for Amazon Bedrock. Then, you can integrate the databases using the Amazon Bedrock console. For more information about Pinecone integration with Amazon Bedrock, see Bring reliable GenAI applications to market with Amazon Bedrock and Pinecone.
You can segment a Pinecone database by adding descriptive metadata to each index and using that metadata to inform query results. Pinecone supports strings and lists of strings to filter vector searches on customer names, customer industry, and so on. Pinecone also supports numbers and booleans.
Use metadata query language to filter output ($eq, $ne, $in, $nin, $and, and $or). The following example shows a snippet of metadata and queries that will return that index. The example queries in Python demonstrate how you can retrieve a list of records associated with Customer A from the Pinecone database.
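A hedged sketch of such a query (the API key, index name, and query embedding are placeholders):

```python
# Query a Pinecone index and filter results on the "customer" metadata field.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")              # placeholder
index = pc.Index("bedrock-knowledge-base-index")   # placeholder index name

query_embedding = [0.0] * 1024  # placeholder vector; must match the index dimension

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"customer": {"$eq": "CustomerA"}},
)

for match in results.matches:
    print(match.id, match.metadata.get("customer"))
```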
This query will return a response containing the database records labeled as belonging to Customer A.
Enhanced scaling with multiple data sources
Amazon Bedrock Knowledge Bases now supports multiple data sources across AWS accounts. Amazon Bedrock Knowledge Bases can ingest data from up to five data sources, enhancing the comprehensiveness and relevancy of a knowledge base. This feature allows customers with complex IT systems to incorporate data into generative AI applications without restructuring or migrating data sources. It also provides flexibility for you to scale your Amazon Bedrock knowledge bases when data resides in different AWS accounts.
The feature includes cross-account data access, enabling the configuration of S3 buckets as data sources across different accounts, and efficient data management options for retaining or deleting data when a source is removed. These enhancements alleviate the need for creating multiple knowledge bases or redundant data copies.
Clean up
After completing the steps in this blog post, make sure to clean up your resources to avoid incurring unnecessary charges. Delete the Amazon Bedrock knowledge base by navigating to the Amazon Bedrock console, selecting your knowledge base, and choosing Delete from the Actions dropdown menu. If you created vector databases for testing, remember to delete OpenSearch Serverless collections, stop or delete Aurora PostgreSQL instances, and remove any Pinecone indexes you created. Additionally, consider deleting test documents uploaded to S3 buckets specifically for this blog example to avoid storage charges. Review and clean up any IAM roles or policies created for this demonstration if they’re no longer needed.
While Amazon Bedrock Knowledge Bases include charges for data indexing and queries, the underlying storage in S3 and vector databases will continue to incur charges until those resources are removed. For specific pricing details, refer to the Amazon Bedrock pricing page.
Conclusion
In this post, we covered several key strategies for building scalable, secure, and segmented Amazon Bedrock knowledge bases. These include using S3 folder structures and metadata to organize data sources and segment data within a single knowledge base. Using metadata filtering to create custom queries that target specific data segments helps improve retrieval accuracy and maintain data privacy. We also explored integrating and validating metadata for vector databases including OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone.
By consolidating multiple business segments or customer data within a single Amazon Bedrock knowledge base, organizations can achieve cost optimization compared to creating and managing them separately. The improved data segmentation and access control measures help make sure each team or customer can only access the information relevant to their domain. The enhanced scalability helps meet the diverse needs of organizations, while maintaining the necessary data segregation and access control.
Try out metadata filtering with Amazon Bedrock Knowledge Bases, and share your thoughts and questions with the authors or in the comments.
About the Authors
Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing 1P and 3P model adoption. Breanne is also on the Women at Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign.
Justin Lin is a Small & Medium Business Solutions Architect at Amazon Web Services. He studied computer science at UW Seattle. Dedicated to designing and developing innovative solutions that empower customers, Justin has been dedicating his time to experimenting with applications in generative AI, natural language processing, and forecasting.
Chloe Gorgen is an Enterprise Solutions Architect at Amazon Web Services, advising AWS customers in various topics including security, analytics, data management, and automation. Chloe is passionate about youth engagement in technology, and supports several AWS initiatives to foster youth interest in cloud-based technology. Chloe holds a Bachelor of Science in Statistics and Analytics from the University of North Carolina at Chapel Hill.
Effectively use prompt caching on Amazon Bedrock
Prompt caching, now generally available on Amazon Bedrock with Anthropic’s Claude 3.5 Haiku and Claude 3.7 Sonnet, along with Nova Micro, Nova Lite, and Nova Pro models, lowers response latency by up to 85% and reduces costs up to 90% by caching frequently used prompts across multiple API calls.
With prompt caching, you can mark the specific contiguous portions of your prompts to be cached (known as a prompt prefix). When a request is made with the specified prompt prefix, the model processes the input and caches the internal state associated with the prefix. On subsequent requests with a matching prompt prefix, the model reads from the cache and skips the computation steps required to process the input tokens. This reduces the time to first token (TTFT) and makes more efficient use of hardware such that we can share the cost savings with you.
This post provides a detailed overview of the prompt caching feature on Amazon Bedrock and offers guidance on how to effectively use this feature to achieve improved latency and cost savings.
How prompt caching works
Large language model (LLM) processing is made up of two primary stages: input token processing and output token generation. The prompt caching feature on Amazon Bedrock optimizes the input token processing stage.
You can begin by marking the relevant portions of your prompt with cache checkpoints. The entire section of the prompt preceding the checkpoint then becomes the cached prompt prefix. As you send more requests with the same prompt prefix, marked by the cache checkpoint, the LLM will check if the prompt prefix is already stored in the cache. If a matching prefix is found, the LLM can read from the cache, allowing the input processing to resume from the last cached prefix. This saves the time and cost that would otherwise be spent recomputing the prompt prefix.
Be advised that the prompt caching feature is model-specific. You should review the supported models and details on the minimum number of tokens per cache checkpoint and maximum number of cache checkpoints per request.
Cache hits only occur when the exact prefix matches. To fully realize the benefits of prompt caching, it’s recommended to position static content such as instructions and examples at the beginning of the prompt. Dynamic content, including user-specific information, should be placed at the end of the prompt. This principle also extends to images and tools, which must remain identical across requests in order to enable caching.
The following diagram illustrates how cache hits work. A, B, C, D represent distinct portions of the prompt. A, B and C are marked as the prompt prefix. Cache hits occur when subsequent requests contain the same A, B, C prompt prefix.
When to use prompt caching
Prompt caching on Amazon Bedrock is recommended for workloads that involve long context prompts that are frequently reused across multiple API calls. This capability can significantly improve response latency by up to 85% and reduce inference costs by up to 90%, making it well-suited for applications that use repetitive, long input context. To determine if prompt caching is beneficial for your use case, you will need to estimate the number of tokens you plan to cache, the frequency of reuse, and the time between requests.
The following use cases are well-suited for prompt caching:
- Chat with document – By caching the document as input context on the first request, each user query becomes more efficient, enabling simpler architectures that avoid heavier solutions like vector databases.
- Coding assistants – Reusing long code files in prompts enables near real-time inline suggestions, eliminating much of the time spent reprocessing code files.
- Agentic workflows – Longer system prompts can be used to refine agent behavior without degrading the end-user experience. By caching the system prompts and complex tool definitions, the time to process each step in the agentic flow can be reduced.
- Few-shot learning – Including numerous high-quality examples and complex instructions, such as for customer service or technical troubleshooting, can benefit from prompt caching.
How to use prompt caching
When evaluating a use case to use prompt caching, it’s crucial to categorize the components of a given prompt into two distinct groups: the static and repetitive portion, and the dynamic portion. The prompt template should adhere to the structure illustrated in the following figure.
You can create multiple cache checkpoints within a request, subject to model-specific limitations. It should follow the same static portion, cache checkpoint, dynamic portion structure, as illustrated in the following figure.
Use case example
The “chat with document” use case, where the document is included in the prompt, is well-suited for prompt caching. In this example, the static portion of the prompt would comprise instructions on response formatting and the body of the document. The dynamic portion would be the user’s query, which changes with each request.
In this scenario, the static portions of the prompt should be marked as the prompt prefixes to enable prompt caching. The following code snippet demonstrates how to implement this approach using the Invoke Model API. Here we create two cache checkpoints in the request, one for the instructions and one for the document content, as illustrated in the following figure.
We use the following prompt:
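The original post shows the full request as a figure; the following is a hedged sketch of the approach using the InvokeModel API with an Anthropic Claude model. The cache_control fields follow Anthropic's prompt-caching request format, and the document file and query are placeholders:

```python
# Two cache checkpoints: one after the static instructions, one after the document body.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

instructions = "Answer questions using only the document below. Be concise."
document_text = open("blog_post.txt").read()   # placeholder long document
user_query = "What is the main topic of this document?"

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": [
        # Checkpoint 1: static instructions
        {"type": "text", "text": instructions, "cache_control": {"type": "ephemeral"}},
        # Checkpoint 2: document body, also static across user questions
        {"type": "text", "text": document_text, "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": user_query}]}
    ],
}

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["usage"])   # reports cache write/read token counts
```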
In the response to the preceding code snippet, there is a usage section that provides metrics on the cache reads and writes. The following is the example response from the first model invocation:
The cache checkpoint has been successfully created with 37,209 tokens cached, as indicated by the cache_creation_input_tokens value, as illustrated in the following figure.
For the subsequent request, we can ask a different question:
The dynamic portion of the prompt has been changed, but the static portion and prompt prefixes remain the same. We can expect cache hits from the subsequent invocations. See the following code:
37,209 tokens are for the document and instructions read from the cache, and 10 input tokens are for the user query, as illustrated in the following figure.
Let’s change the document to a different blog post, but our instructions remain the same. We can expect cache hits for the instructions prompt prefix because it was positioned before the document body in our requests. See the following code:
In the response, we can see 1,038 cache read tokens for the instructions and 37,888 cache write tokens for the new document content, as illustrated in the following figure.
Cost savings
When a cache hit happens, Amazon Bedrock passes along the compute savings to customers by giving a per-token discount on cached context. To calculate the potential cost savings, you should first understand your prompt caching usage pattern with cache write/read metrics in the Amazon Bedrock response. Then you can calculate your potential cost savings with price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read). For more price details, see Amazon Bedrock pricing.
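As a back-of-the-envelope sketch with purely hypothetical prices (substitute the actual per-1,000-token rates from the Amazon Bedrock pricing page):

```python
# Estimate savings from cache reads versus reprocessing the same tokens every request.
base_price_per_1k = 0.003            # hypothetical price per 1,000 regular input tokens
cache_write_price_per_1k = 0.00375   # hypothetical price per 1,000 cache-write tokens
cache_read_price_per_1k = 0.0003     # hypothetical price per 1,000 cache-read tokens

cached_tokens = 37_209   # tokens cached in the example above
reuse_count = 50         # subsequent requests that hit the cache

without_cache = (cached_tokens / 1000) * base_price_per_1k * (reuse_count + 1)
with_cache = (cached_tokens / 1000) * (
    cache_write_price_per_1k + cache_read_price_per_1k * reuse_count
)
print(f"Estimated savings: ${without_cache - with_cache:.2f}")
```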
Latency benchmark
Prompt caching is optimized to improve the TTFT performance on repetitive prompts. Prompt caching is well-suited for conversational applications that involve multi-turn interactions, similar to chat playground experiences. It can also benefit use cases that require repeatedly referencing a large document.
However, prompt caching might be less effective for workloads that involve a lengthy 2,000-token system prompt with a long set of dynamically changing text afterwards. In such cases, the benefits of prompt caching might be limited.
We have published a notebook on how to use prompt caching and how to benchmark it in our GitHub repo. The benchmark results depend on the use case: input token count, cached token count, or output token count.
Amazon Bedrock cross-Region inference
Prompt caching can be used in conjunction with cross-region inference (CRIS). Cross-region inference automatically selects the optimal AWS Region within your geography to serve your inference request, thereby maximizing available resources and model availability. At times of high demand, these optimizations may lead to increased cache writes.
Metrics and observability
Prompt caching observability is essential for optimizing cost savings and improving latency in applications using Amazon Bedrock. By monitoring key performance metrics, developers can achieve significant efficiency improvements—such as reducing TTFT by up to 85% and cutting costs by up to 90% for lengthy prompts. These metrics are pivotal because they enable developers to assess cache performance accurately and make strategic decisions regarding cache management.
Monitoring with Amazon Bedrock
Amazon Bedrock exposes cache performance data through the usage section of the API response, allowing developers to track essential metrics such as cache hit rates, token consumption (both read and write), and latency improvements. By using these insights, teams can effectively manage caching strategies to enhance application responsiveness and reduce operational costs.
Monitoring with Amazon CloudWatch
Amazon CloudWatch provides a robust platform for monitoring the health and performance of AWS services, including new automatic dashboards tailored specifically for Amazon Bedrock models. These dashboards offer quick access to key metrics and facilitate deeper insights into model performance.
To create custom observability dashboards, complete the following steps:
- On the CloudWatch console, create a new dashboard. For a full example, see Improve visibility into Amazon Bedrock usage and performance with Amazon CloudWatch.
- Choose CloudWatch as your data source and select Pie for the initial widget type (this can be adjusted later).
- Update the time range for metrics (such as 1 hour, 3 hours, or 1 day) to suit your monitoring needs.
- Select Bedrock under AWS namespaces.
- Enter “cache” in the search box to filter cache-related metrics.
- For the model, locate anthropic.claude-3-7-sonnet-20250219-v1:0, and select both CacheWriteInputTokenCount and CacheReadInputTokenCount.
- Choose Create widget and then Save to save your dashboard.
The following is a sample JSON configuration for creating this widget:
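A minimal sketch of such a widget definition is shown below, expressed as a Python dictionary and published with boto3; the AWS/Bedrock namespace and ModelId dimension follow the standard Amazon Bedrock CloudWatch metrics, while the Region, dashboard name, and layout values are placeholders to adjust for your account.

```python
import json
import boto3

widget = {
    "type": "metric",
    "x": 0, "y": 0, "width": 12, "height": 6,
    "properties": {
        "title": "Prompt caching - cache read/write tokens",
        "view": "pie",
        "region": "us-east-1",
        "stat": "Sum",
        "period": 3600,
        "metrics": [
            ["AWS/Bedrock", "CacheReadInputTokenCount", "ModelId",
             "anthropic.claude-3-7-sonnet-20250219-v1:0"],
            ["AWS/Bedrock", "CacheWriteInputTokenCount", "ModelId",
             "anthropic.claude-3-7-sonnet-20250219-v1:0"],
        ],
    },
}

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_dashboard(
    DashboardName="bedrock-prompt-caching",
    DashboardBody=json.dumps({"widgets": [widget]}),
)
```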
Understanding cache hit rates
Analyzing cache hit rates involves observing both CacheReadInputTokens and CacheWriteInputTokens. By summing these metrics over a defined period, developers can gain insights into the efficiency of their caching strategies. With the model-specific price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read) published on the Amazon Bedrock pricing page, you can estimate the potential cost savings for your specific use case.
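The following sketch shows one way to do this with boto3, assuming the AWS/Bedrock namespace and ModelId dimension and treating the token-level read share as the hit rate (one possible definition):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

def metric_sum(metric_name: str) -> float:
    """Sum a Bedrock CloudWatch metric over the last 24 hours."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,
        Dimensions=[{"Name": "ModelId", "Value": MODEL_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
        EndTime=datetime.now(timezone.utc),
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

reads = metric_sum("CacheReadInputTokenCount")
writes = metric_sum("CacheWriteInputTokenCount")
hit_rate = reads / (reads + writes) if (reads + writes) else 0.0
print(f"Cache read tokens: {reads:,.0f}, cache write tokens: {writes:,.0f}, "
      f"token-level hit rate: {hit_rate:.1%}")
```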
Conclusion
This post explored the prompt caching feature in Amazon Bedrock, demonstrating how it works, when to use it, and how to use it effectively. Carefully evaluate whether your use case will benefit from this feature: the payoff depends on thoughtful prompt structuring, understanding the distinction between static and dynamic content, and selecting appropriate caching strategies for your specific needs. By using CloudWatch metrics to monitor cache performance and following the implementation patterns outlined in this post, you can build more efficient and cost-effective AI applications while maintaining high performance.
For more information about working with prompt caching on Amazon Bedrock, see Prompt caching for faster model inference.
About the authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges with generative AI and deep learning on AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Kosta Belz is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where he helps customers design and build generative AI solutions to solve key business problems.
Sean Eichenberger is a Sr Product Manager at AWS.
Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow
Developing generative AI agents that can tackle real-world tasks is complex, and building production-grade agentic applications requires integrating agents with additional tools such as user interfaces, evaluation frameworks, and continuous improvement mechanisms. Developers often find themselves grappling with unpredictable behaviors, intricate workflows, and a web of complex interactions. The experimentation phase for agents is particularly challenging, often tedious and error prone. Without robust tracking mechanisms, developers face daunting tasks such as identifying bottlenecks, understanding agent reasoning, ensuring seamless coordination across multiple tools, and optimizing performance. These challenges make the process of creating effective and reliable AI agents a formidable undertaking, requiring innovative solutions to streamline development and enhance overall system reliability.
In this context, Amazon SageMaker AI with MLflow offers a powerful solution to streamline generative AI agent experimentation. For this post, I use LangChain's popular open source LangGraph agent framework to build an agent and show how to enable detailed tracing and evaluation of LangGraph generative AI agents. This post explores how Amazon SageMaker AI with MLflow can help you as a developer and a machine learning (ML) practitioner efficiently experiment, evaluate generative AI agent performance, and optimize your applications for production readiness. I also show you how to introduce advanced evaluation metrics with Retrieval Augmented Generation Assessment (RAGAS), illustrating how MLflow can be customized to track custom and third-party metrics such as those from RAGAS.
The need for advanced tracing and evaluation in generative AI agent development
A crucial functionality for experimentation is the ability to observe, record, and analyze the internal execution path of an agent as it processes a request. This is essential for pinpointing errors, evaluating decision-making processes, and improving overall system reliability. Tracing workflows not only aids in debugging but also ensures that agents perform consistently across diverse scenarios.
Further complexity arises from the open-ended nature of tasks that generative AI agents perform, such as text generation, summarization, or question answering. Unlike traditional software testing, evaluating generative AI agents requires new metrics and methodologies that go beyond basic accuracy or latency measures. You must assess multiple dimensions—such as correctness, toxicity, relevance, coherence, tool call, and groundedness—while also tracing execution paths to identify errors or bottlenecks.
Why SageMaker AI with MLflow?
Amazon SageMaker AI, which provides a fully managed version of the popular open source MLflow, offers a robust platform for machine learning experimentation and generative AI management. This combination is particularly powerful for working with generative AI agents. SageMaker AI with MLflow builds on MLflow's open source legacy as a tool widely adopted for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison with visualization. For generative AI agent development, it offers the following advantages:
- Scalability: SageMaker AI allows you to easily scale generative AI agentic experiments, running multiple iterations simultaneously.
- Integrated tracking: MLflow integration enables efficient management of experiment tracking, versioning, and agentic workflows.
- Visualization: Monitor and visualize the performance of each experiment run with built-in MLflow capabilities.
- Continuity for ML teams: Organizations already using MLflow for classic ML can adopt agents without overhauling their MLOps stack, reducing friction for generative AI adoption.
- AWS ecosystem advantage: Beyond MLflow, SageMaker AI provides a comprehensive ecosystem for generative AI development, including access to foundation models, many managed services, simplified infrastructure, and integrated security.
This evolution positions SageMaker AI with MLflow as a unified platform for both traditional ML and cutting-edge generative AI agent development.
Key features of SageMaker AI with MLflow
The capabilities of SageMaker AI with MLflow directly address the core challenges of agentic experimentation—tracing agent behavior, evaluating agent performance, and unified governance.
- Experiment tracking: Compare different runs of the LangGraph agent and track changes in performance across iterations.
- Agent versioning: Keep track of different versions of the agent throughout its development lifecycle to iteratively refine and improve agents.
- Unified agent governance: Agents registered in SageMaker AI with MLflow automatically appear in the SageMaker AI with MLflow console, enabling a collaborative approach to management, evaluation, and governance across teams.
- Scalable infrastructure: Use the managed infrastructure of SageMaker AI to run large-scale experiments without worrying about resource management.
LangGraph generative AI agents
LangGraph offers a powerful and flexible approach to designing generative AI agents tailored to your company’s specific needs. LangGraph’s controllable agent framework is engineered for production use, providing low-level customization options to craft bespoke solutions.
In this post, I show you how to create a simple finance assistant agent equipped with a tool to retrieve financial data from a datastore, as depicted in the following diagram. This post's sample agent, along with all necessary code, is available in the GitHub repository, ready for you to replicate and adapt for your own applications.
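For orientation, a heavily simplified sketch of such an agent (not the repository's implementation) could look like the following; the in-memory datastore, tool, company name, and Bedrock model ID are illustrative placeholders.

```python
from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

# Hypothetical in-memory "datastore" standing in for the repository's data source.
FINANCIAL_DATA = {"AnyCompany": {"revenue_2024": "1.2B USD", "eps_2024": "3.45"}}

@tool
def get_financial_data(company: str) -> str:
    """Retrieve basic financial data for a company from the datastore."""
    record = FINANCIAL_DATA.get(company)
    return str(record) if record else f"No data found for {company}."

# Bedrock-hosted LLM for the agent; swap in a model your account has access to.
llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")

# Prebuilt ReAct-style LangGraph agent that calls the tool when needed.
agent = create_react_agent(llm, tools=[get_financial_data])

result = agent.invoke({"messages": [("user", "What was AnyCompany's revenue in 2024?")]})
print(result["messages"][-1].content)
```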
Solution code
You can follow and execute the full example code from the aws-samples GitHub repository. I use snippets from the code in the repository to illustrate the evaluation and tracking approaches in the remainder of this post.
Prerequisites
- An AWS account with billing enabled.
- A SageMaker AI domain. For more information, see Use quick setup for Amazon SageMaker AI.
- Access to a running SageMaker AI with MLflow tracking server in Amazon SageMaker Studio. For more information, see the instructions for setting up a new MLflow tracking server.
- Access to the Amazon Bedrock foundation models for agent and evaluation tasks.
Trace generative AI agents with SageMaker AI with MLflow
MLflow's tracing capabilities are essential for understanding the behavior of your LangGraph agent. MLflow tracking provides an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results.
MLflow tracing is a feature that enhances observability in your generative AI agent by capturing detailed information about the execution of the agent services, nodes, and tools. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.
The MLflow tracking UI displays the traces exported under the MLflow Traces tab for the selected MLflow experiment, as shown in the following image.
Furthermore, you can see the detailed trace for an agent input or prompt invocation by selecting the Request ID. Choosing Request ID opens a collapsible view with results captured at each step of the invocation workflow from input to the final output, as shown in the following image.
SageMaker AI with MLflow traces all the nodes in the LangGraph agent and displays the trace in the MLflow UI with detailed inputs, outputs, usage tokens, and multi-sequence messages with origin type (human, tool, AI) for each node. The display also captures the execution time over the entire agentic workflow, providing a per-node breakdown of time. Overall, tracing is crucial for generative AI agents for the following reasons:
- Performance monitoring: Tracing enables you to oversee the agent’s behavior and make sure that it operates effectively, helping identify malfunctions, inaccuracies, or biased outputs.
- Timeout management: Tracing with timeouts helps prevent agents from getting stuck in long-running operations or infinite loops, helping to ensure better resource management and responsiveness.
- Debugging and troubleshooting: For complex agents with multiple steps and varying sequences based on user input, tracing helps pinpoint where issues are introduced in the execution process.
- Explainability: Tracing provides insights into the agent’s decision-making process, helping you to understand the reasoning behind its actions. For example, you can see what tools are called and the processing type—human, tool, or AI.
- Optimization: Capturing and propagating an AI system’s execution trace enables end-to-end optimization of AI systems, including optimization of heterogeneous parameters such as prompts and metadata.
- Compliance and security: Tracing helps in maintaining regulatory compliance and secure operations by providing audit logs and real-time monitoring capabilities.
- Cost tracking: Tracing can help in analyzing resource usage (input tokens, output tokens) and extrapolating the associated costs of running AI agents.
- Adaptation and learning: Tracing allows for observing how agents interact with prompts and data, providing insights that can be used to improve and adapt the agent’s performance over time.
In the MLflow UI, you can choose the Task name to see details captured at any agent step as it services the input request prompt or invocation, as shown in the following image.
By implementing proper tracing, you can gain deeper insights into your generative AI agents’ behavior, optimize their performance, and make sure that they operate reliably and securely.
Configure tracing for the agent
For fine-grained control and flexibility in tracking, you can use MLflow’s tracing decorator APIs. With these APIs, you can add tracing to specific agentic nodes, functions, or code blocks with minimal modifications.
This configuration allows users to:
- Pinpoint performance bottlenecks in the LangGraph agent
- Track decision-making processes
- Monitor error rates and types
- Analyze patterns in agent behavior across different scenarios
This approach allows you to specify exactly what you want to track in your experiment. Additionally, MLflow offers out-of-the-box tracing compatibility with LangChain for basic tracing through MLflow's autologging feature, mlflow.langchain.autolog(). With SageMaker AI with MLflow, you can gain deep insights into the LangGraph agent's performance and behavior, facilitating easier debugging, optimization, and monitoring in both development and production environments.
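A minimal configuration along these lines might look like the following sketch; the tracking server ARN, experiment and run names, and the traced helper function are placeholders, and the sagemaker-mlflow plugin is assumed to be installed alongside mlflow.

```python
import mlflow

# Point MLflow at the SageMaker AI with MLflow tracking server (placeholder ARN).
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-tracking-server"
)
mlflow.set_experiment("langgraph-finance-agent")

# Out-of-the-box tracing for LangChain/LangGraph components via autologging.
mlflow.langchain.autolog()

# Fine-grained tracing for a specific function, node, or code block with the decorator API.
@mlflow.trace(name="lookup_financial_data", span_type="TOOL")
def lookup_financial_data(company: str) -> str:
    # Placeholder for a datastore lookup you want captured as its own span.
    return f"Example record for {company}"

with mlflow.start_run(run_name="agent-tracing-example"):
    # Invoke your LangGraph agent here (for example, the finance assistant sketched
    # earlier); autologging records the trace of nodes, tools, and LLM calls.
    lookup_financial_data("AnyCompany")
```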
Evaluate with MLflow
You can use MLflow’s evaluation capabilities to help assess the performance of the LangGraph large language model (LLM) agent and objectively measure its effectiveness in various scenarios. The important aspects of evaluation are:
- Evaluation metrics: MLflow offers many default metrics, such as LLM-as-a-judge, accuracy, and latency metrics, that you can specify for evaluation, and it gives you the flexibility to define custom LLM-specific metrics tailored to your agent. For instance, you can introduce custom metrics for Correct Financial Advice, Adherence to Regulatory Guidelines, and Usefulness of Tool Invocations.
- Evaluation dataset: Prepare a dataset for evaluation that reflects real-world queries and scenarios. The dataset should include example questions, expected answers, and relevant context data.
- Run evaluation using the MLflow evaluate library: MLflow's mlflow.evaluate() returns comprehensive evaluation results, which can be viewed directly in the code or through the SageMaker AI with MLflow UI for a more visual representation.
The following is a snippet showing how mlflow.evaluate() can be used to run evaluation on agents. You can follow this example by running the code in the same aws-samples GitHub repository.
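The repository contains the full implementation; the sketch below illustrates the pattern, assuming the agent object from the earlier sketch and a golden_questions_answer.jsonl file with question and answer fields (the column names and the built-in latency metric stand in for the repository's custom metrics).

```python
import json
import pandas as pd
import mlflow

# Load the golden Q&A dataset (column names are assumptions about the file's schema).
records = [json.loads(line) for line in open("golden_questions_answer.jsonl")]
eval_df = pd.DataFrame(records)  # e.g., columns: "question", "answer"

def agent_predict(inputs: pd.DataFrame) -> list:
    """Run each question through the LangGraph agent and return the final answers."""
    answers = []
    for question in inputs["question"]:
        result = agent.invoke({"messages": [("user", question)]})
        answers.append(result["messages"][-1].content)
    return answers

with mlflow.start_run(run_name="agent-evaluation"):
    results = mlflow.evaluate(
        model=agent_predict,
        data=eval_df,
        targets="answer",
        model_type="question-answering",
        extra_metrics=[mlflow.metrics.latency()],  # add custom or domain-specific metrics here
    )
    print(results.metrics)
```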
This code snippet employs MLflow's evaluate() function to rigorously assess the performance of a LangGraph LLM agent, comparing its responses to a predefined ground truth dataset that's maintained in the golden_questions_answer.jsonl file in the aws-samples GitHub repository. By specifying model_type="question-answering", MLflow applies relevant evaluation metrics for question-answering tasks, such as accuracy and coherence. Additionally, the extra_metrics parameter allows you to incorporate custom, domain-specific metrics tailored to the agent's specific application, enabling a comprehensive and nuanced evaluation beyond standard benchmarks. The results of this evaluation are then logged in MLflow (as shown in the following image), providing a centralized and traceable record of the agent's performance, facilitating iterative improvement and informed deployment decisions. The MLflow evaluation is captured as part of the MLflow execution run.
You can open the SageMaker AI with MLflow tracking server and see the list of MLflow execution runs for the specified MLflow experiment, as shown in the following image.
The evaluation metrics are captured within the MLflow execution along with model metrics and the accompanying artifacts, as shown in the following image.
Furthermore, the evaluation metrics are also displayed under the Model metrics tab within a selected MLflow execution run, as shown in the following image.
Finally, as shown in the following image, you can compare different variations and versions of the agent during the development phase by selecting the checkboxes next to the MLflow execution runs you want to compare in the MLflow UI and choosing the compare option. This can help you select the best-performing agent version for deployment or inform other decision-making processes during agent development.
Register the LangGraph agent
You can use SageMaker AI with MLflow artifacts to register the LangGraph agent along with any other artifacts you require or have produced. All the artifacts are stored in the Amazon Simple Storage Service (Amazon S3) bucket configured for the SageMaker AI with MLflow tracking server. Registering the LangGraph agent is crucial for governance and lifecycle management. It provides a centralized repository for tracking, versioning, and deploying the agents. Think of it as a catalog of your validated AI assets.
As shown in the following figure, you can see the artifacts captured under the Artifact tab within the MLflow execution run.
MLflow automatically captures and logs agent-related files, such as the evaluation results and the consumed libraries in the requirements.txt file. Furthermore, a LangGraph agent successfully logged as an MLflow model can be loaded and used for inference using mlflow.langchain.load_model(model_uri). Registering the generative AI agent after rigorous evaluation helps ensure that you're promoting a proven and validated agent to production. This practice helps prevent the deployment of poorly performing or unreliable agents, helping to safeguard the user experience and the integrity of your applications. Post-evaluation registration is critical to make sure that the experiment with the best result is the one that gets promoted to production.
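As a sketch of what logging and registering the agent could look like, the snippet below uses the langchain flavor's models-from-code approach; agent.py, the artifact path, and the registered model name are hypothetical, and exact support for LangGraph objects depends on your MLflow version.

```python
import mlflow

with mlflow.start_run(run_name="register-agent"):
    model_info = mlflow.langchain.log_model(
        lc_model="agent.py",              # models-from-code: a script that builds the compiled agent
        artifact_path="langgraph_agent",
        registered_model_name="finance-assistant-agent",
    )

# Later, load the validated agent back for inference.
loaded_agent = mlflow.langchain.load_model(model_info.model_uri)
```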
Use MLflow to experiment and evaluate with external libraries (such as RAGAS)
MLflow’s flexibility allows for seamless integration with external libraries, enhancing your ability to experiment and evaluate LangChain LangGraph agents. You can extend SageMaker MLflow to include external evaluation libraries such as RAGAS for comprehensive LangGraph agent assessment. This integration enables ML practitioners to use RAGAS’s specialized LLM evaluation metrics while benefiting from MLflow’s experiment tracking and visualization capabilities. By logging RAGAS metrics directly to SageMaker AI with MLflow, you can easily compare different versions of the LangGraph agent across multiple runs, gaining deeper insights into its performance.
RAGAS is an open source library that provides tools specifically for evaluating LLM applications and generative AI agents. RAGAS includes a method, ragas.evaluate(), that runs evaluations for LLM agents with your choice of LLM (the evaluator) for scoring, along with an extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, you can use the following approach.
You can follow this example by running the notebook additional_evaluations_with_ragas.ipynb in the GitHub repository.
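The notebook contains the full implementation; the following condensed sketch shows the general pattern, with example Bedrock model IDs, a toy single-row dataset, and column names that may need adjusting to match the RAGAS version you use.

```python
from datasets import Dataset
from langchain_aws import BedrockEmbeddings, ChatBedrockConverse
from ragas import evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_relevancy, faithfulness

# Evaluator LLM and embeddings hosted on Amazon Bedrock (model IDs are examples).
evaluator_llm = LangchainLLMWrapper(
    ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")
)
evaluator_embeddings = LangchainEmbeddingsWrapper(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
)

# Toy evaluation rows; in practice these come from running the agent over the golden questions.
eval_dataset = Dataset.from_dict({
    "question": ["What was AnyCompany's revenue in 2024?"],
    "answer": ["AnyCompany reported revenue of 1.2B USD in 2024."],
    "contexts": [["AnyCompany 2024 revenue: 1.2B USD; EPS: 3.45"]],
})

ragas_results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(ragas_results)
```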
The evaluation results using RAGAS metrics from the above code are shown in the following figure.
Subsequently, the computed RAGAS evaluation metrics can be exported and tracked in the SageMaker AI with MLflow tracking server as part of the MLflow experimentation run. See the following code snippet for illustration; the full code can be found in the notebook in the same aws-samples GitHub repository.
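A minimal sketch of that export, assuming the ragas_results object from the previous sketch (metric and run names are placeholders):

```python
import mlflow

# Log the mean RAGAS score for each metric to the current MLflow experiment run.
ragas_df = ragas_results.to_pandas()
with mlflow.start_run(run_name="ragas-evaluation"):
    for metric_name in ["faithfulness", "answer_relevancy"]:
        mlflow.log_metric(f"ragas_{metric_name}", ragas_df[metric_name].mean())
```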
You can view the RAGAS metrics logged by MLflow in the SageMaker AI with MLflow UI on the Model metrics tab, as shown in the following image.
From experimentation to production: Collaborative approval with SageMaker AI with MLflow tracing and evaluation
In a real-world deployment scenario, MLflow’s tracing and evaluation capabilities with LangGraph agents can significantly streamline the process of moving from experimentation to production.
Imagine a large team of data scientists and ML engineers working on an agentic platform, as shown in the following image. With MLflow, they can create sophisticated agents that can handle complex queries, process returns, and provide product recommendations. During the experimentation phase, the team can use MLflow to log different versions of the agent, tracking performance and evaluation metrics such as response accuracy, latency, and other measures. MLflow's tracing feature allows them to analyze the agent's decision-making process, identifying areas for improvement. The results across numerous experiments are automatically logged to SageMaker AI with MLflow. The team can use the MLflow UI to collaborate, compare, and select the best performing version of the agent as the production-ready candidate, all informed by the diverse set of data logged in SageMaker AI with MLflow.
With this data, the team can present a clear, data-driven case to stakeholders for promoting the agent to production. Managers and compliance officers can review the agent’s performance history, examine specific interaction traces, and verify that the agent meets all necessary criteria. After being approved, the SageMaker AI with MLflow registered agent facilitates a smooth transition to deployment, helping to ensure that the exact version of the agent that passed evaluation is the one that goes live. This collaborative, traceable approach not only accelerates the development cycle but also instills confidence in the reliability and effectiveness of the generative AI agent in production.
Clean up
To avoid incurring unnecessary charges, use the following steps to clean up the resources used in this post:
- Remove SageMaker AI with MLflow tracking server:
- In SageMaker Studio, stop and delete any running MLflow tracking server instances
- Revoke Amazon Bedrock model access:
- Go to the Amazon Bedrock console.
- Navigate to Model access and remove access to any models you enabled for this project.
- Delete the SageMaker domain (if no longer needed):
- Open the SageMaker console.
- Navigate to the Domains section.
- Select the domain you created for this project.
- Choose Delete domain and confirm the action.
- Also delete any associated S3 buckets and IAM roles.
Conclusion
In this post, I showed you how to combine LangChain’s LangGraph, Amazon SageMaker AI, and MLflow to demonstrate a powerful workflow for developing, evaluating, and deploying sophisticated generative AI agents. This integration provides the tools needed to gain deep insights into the generative AI agent’s performance, iterate quickly, and maintain version control throughout the development process.
As the field of AI continues to advance, tools like these will be essential for managing the increasing complexity of generative AI agents and ensuring their effectiveness. Keep the following considerations in mind:
- Traceability is paramount: Effective tracing of agent execution paths using SageMaker MLflow is crucial for debugging, optimization, and helping to ensure consistent performance in complex generative AI workflows. Pinpoint issues, understand decision-making, examine interaction traces, and improve overall system reliability through detailed, recorded analysis of agent processes.
- Evaluation drives improvement: Standardized and customized evaluation metrics, using MLflow's evaluate() function and integrations with external libraries like RAGAS, provide quantifiable insights into agent performance, guiding iterative refinement and informed deployment decisions.
- Collaboration and governance are essential: Unified governance facilitated by SageMaker AI with MLflow enables seamless collaboration across teams, from data scientists to compliance officers, helping to ensure responsible and reliable deployment of generative AI agents in production environments.
By embracing these principles and using the tools outlined in this post, developers and ML practitioners can confidently navigate the complexities of generative AI agent development and deployment, building robust and reliable applications that deliver real business value. Now, it’s your turn to unlock the potential of advanced tracing, evaluation, and collaboration in your agentic workflows! Dive into the aws-samples GitHub repository and start using the power of LangChain’s LangGraph, Amazon SageMaker AI, and MLflow for your generative AI projects.
About the Author
Sandeep Raveesh is a Generative AI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.