Evaluate RAG responses with Amazon Bedrock, LlamaIndex and RAGAS

In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) has emerged as a game-changer, revolutionizing how Foundation Models (FMs) interact with organization-specific data. As businesses increasingly rely on AI-powered solutions, the need for accurate, context-aware, and tailored responses has never been more critical.

Enter the powerful trio of Amazon Bedrock, LlamaIndex, and RAGAS: a cutting-edge combination that’s set to redefine the evaluation and optimization of RAG responses. This blog post delves into how these innovative tools synergize to elevate the performance of your AI applications, ensuring they not only meet but exceed the exacting standards of enterprise-level deployments.

Whether you’re a seasoned AI practitioner or a business leader exploring the potential of generative AI, this guide will equip you with the knowledge and tools to:

  1. Harness the full potential of Amazon Bedrock’s robust foundation models
  2. Utilize RAGAS’s comprehensive evaluation metrics for RAG systems

In this post, we’ll explore how to leverage Amazon Bedrock, LlamaIndex, and RAGAS to enhance your RAG implementations. You’ll learn practical techniques to evaluate and optimize your AI systems, enabling more accurate, context-aware responses that align with your organization’s specific needs. Let’s dive in and discover how these powerful tools can help you build more effective and reliable AI-powered solutions.

RAG Evaluation

RAG evaluation is important to ensure that RAG models produce accurate, coherent, and relevant responses. By analyzing the retrieval and generator components both jointly and independently, RAG evaluation helps identify bottlenecks, monitor performance, and improve the overall system. Current RAG pipelines frequently employ similarity-based metrics such as ROUGE, BLEU, and BERTScore to assess the quality of the generated responses, which is essential for refining and enhancing the model’s capabilities.

These similarity-based metrics (ROUGE, BLEU, and BERTScore) have limitations in assessing relevance and detecting hallucinations. More sophisticated metrics are needed to evaluate factual alignment and accuracy.
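To make this limitation concrete, the following minimal sketch uses the open source rouge-score package (not part of the notebook) to compare a reference answer with a paraphrase that is factually equivalent but shares little wording; the overlap-based score is low even though the meaning is preserved.

from rouge_score import rouge_scorer

reference = "Amazon SageMaker is a fully managed service for building, training, and deploying ML models."
paraphrase = "SageMaker lets you build, train, and deploy machine learning models without managing infrastructure."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, paraphrase)

# The paraphrase is factually consistent with the reference, yet its n-gram
# overlap (and therefore its ROUGE score) is low; the metric cannot tell
# a good paraphrase from a hallucination.
print(scores)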

Evaluate RAG components with Foundation models

We can also use a Foundation Model as a judge to compute various metrics for both retrieval and generation. Here are some examples of these metrics (a minimal judge-prompt sketch follows the list):

  • Retrieval component
    • Context precision – Evaluates whether the ground-truth relevant items present in the retrieved contexts are ranked at the top.
    • Context recall – Ensures that the context contains all relevant information needed to answer the question.
  • Generator component
    • Faithfulness – Verifies that the generated answer is factually accurate based on the provided context, helping to identify errors or “hallucinations.”
    • Answer relevancy – Measures how well the answer matches the question. Higher scores mean the answer is complete and relevant, while lower scores indicate missing or redundant information.
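To illustrate the judge idea before turning to RAGAS and LlamaIndex, the following is a minimal sketch that asks a Bedrock-hosted model to rate the faithfulness of an answer against its retrieved context. The prompt wording and the 0–1 scale here are illustrative assumptions, not the RAGAS implementation.

from langchain_aws import ChatBedrock

judge = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs={"temperature": 0},
)

judge_prompt = (
    "You are grading a RAG answer.\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Return a single number between 0 and 1 indicating how faithful the "
    "answer is to the context (1 = fully supported, 0 = unsupported)."
)

def judge_faithfulness(context: str, answer: str) -> str:
    # The judge LLM scores the answer strictly against the retrieved context
    return judge.invoke(judge_prompt.format(context=context, answer=answer)).content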

Ragas Metrics - generation, retrieval

Overview of solution

This post guides you through the process of assessing the quality of RAG responses with evaluation frameworks such as RAGAS and LlamaIndex, using Amazon Bedrock.

In this post, we also use LangChain to create a sample RAG application.

Amazon Bedrock is a fully managed service that offers a choice of high-performing Foundation Models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

The Retrieval Augmented Generation Assessment (RAGAS) framework offers multiple metrics to evaluate each part of the RAG system pipeline, identifying areas for improvement. It utilizes foundation models to test individual components, aiding in pinpointing modules for development to enhance overall results.

LlamaIndex is a framework for building LLM applications. It simplifies data integration from various sources and provides tools for data indexing, engines, agents, and application integrations. Optimized for search and retrieval, it streamlines querying LLMs and retrieving documents. This blog post focuses on using its Observability/Evaluation modules.

LangChain is an open-source framework that simplifies the creation of applications powered by foundation models. It provides tools for chaining LLM operations, managing context, and integrating external data sources. LangChain is primarily used for building chatbots, question-answering systems, and other AI-driven applications that require complex language processing capabilities.

Architecture diagram

The following diagram is a high-level reference architecture that explains how you can evaluate the RAG solution with RAGAS or LlamaIndex.

Architecture Diagram

The solution consists of the following components:

  1. Evaluation dataset – The source data for the RAG comes from the Amazon SageMaker FAQ, which represents 170 question-answer pairs. This corresponds to Step 1 in the architecture diagram.
  2. Build sample RAG – Documents are segmented into chunks and stored in an Amazon Bedrock knowledge base (Steps 2–4). We use a LangChain retrieval Q&A chain to answer user queries. This process retrieves relevant data from an index at runtime and passes it to the Foundation Model (FM).
  3. RAG evaluation – To assess the quality of the RAG solution, we can use both RAGAS and LlamaIndex. An LLM performs the evaluation by comparing its predictions with ground truths (Steps 5–6).

You must follow the provided notebook to reproduce the solution. We elaborate on the main code components in this post.

Prerequisites

To implement this solution, you need the following:

  1. An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
  2. Access enabled for the Amazon Titan Text Embeddings V2 model and Anthropic Claude 3 Sonnet on Amazon Bedrock. For instructions, see Model access.
  3. Run the prerequisite code provided in the Python notebook.

Ingest FAQ data

The first step is to ingest the SageMaker FAQ data. For this purpose, LangChain provides a WebBaseLoader object to load text from HTML webpages into a document format. Then we split each document into multiple chunks of 2,000 tokens with a 100-token overlap. See the following code:

text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size= 2000,  chunk_overlap=100)
retriever_db= get_retriever(text_chunks, bedrock_embeddings)
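The split_document_from_url helper comes from the notebook’s utils module. The following is a minimal sketch of what such a helper might do with LangChain (loading the page with WebBaseLoader and splitting it with RecursiveCharacterTextSplitter); the exact implementation in the notebook may differ, and the splitter below counts characters rather than tokens.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_document_from_url_sketch(url, chunk_size=2000, chunk_overlap=100):
    # Load the FAQ page as LangChain documents
    docs = WebBaseLoader(url).load()
    # Split into overlapping chunks; sizes are approximate, not exact token counts
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(docs)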

Set up embeddings and LLM with Amazon Bedrock and LangChain

In order to build a sample RAG application, we need an LLM and an embedding model:

  • LLM – Anthropic Claude 3 Sonnet
  • Embeddings – Amazon Titan Text Embeddings V2

This code sets up a LangChain application using Amazon Bedrock, configuring embeddings with Titan and a Claude 3 Sonnet model for text generation, with specific parameters for controlling the model’s output. See the following code from the notebook:

from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_aws import ChatBedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
import nest_asyncio
nest_asyncio.apply()

#URL to fetch the document
SAGEMAKER_URL="https://aws.amazon.com/sagemaker/faqs/"

#Bedrock parameters
EMBEDDING_MODEL="amazon.titan-embed-text-v2:0"
BEDROCK_MODEL_ID="anthropic.claude-3-sonnet-20240229-v1:0"
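# bedrock_client is assumed to be a boto3 Bedrock Runtime client created in the
# notebook's prerequisite code (for example, boto3.client("bedrock-runtime"))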

bedrock_embeddings = BedrockEmbeddings(model_id=EMBEDDING_MODEL,client=bedrock_client)

model_kwargs = {
    "temperature": 0, 
    "top_k": 250, 
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"]
}    

llm_bedrock = ChatBedrock(
    model_id=BEDROCK_MODEL_ID,
    model_kwargs=model_kwargs
)

Set up Knowledge Bases

We create an Amazon Bedrock Knowledge Bases web crawler data source and process the SageMaker FAQ data.

In the following code, we load the embedded documents into the knowledge base and set up the retriever with LangChain:

from utils import split_document_from_url, get_bedrock_retriever
from botocore.exceptions import ClientError

text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size= 2000,  chunk_overlap=100)
retriever_db= get_bedrock_retriever(text_chunks, region)

Build a Q&A chain to query the retrieval API

After the database is populated, create a Q&A retrieval chain to perform question answering with context extracted from the vector store. You also define a prompt template following Claude prompt engineering guidelines. See the following code from the notebook:

from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise and short. "
    "Context: {context}"
    )

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
    ]
)
question_answer_chain = create_stuff_documents_chain(llm_bedrock, prompt_template)
chain = create_retrieval_chain(retriever_db, question_answer_chain)

Build a dataset to evaluate the RAG application

To evaluate a RAG application, we need a combination of the following datasets (a minimal sketch of assembling them follows the list):

  • Questions – The user query that serves as input to the RAG pipeline
  • Context – The information retrieved from enterprise or external data sources based on the provided query
  • Answers – The responses generated by LLMs
  • Ground truths – Human-annotated, ideal responses for the questions that can be used as the benchmark to compare against the LLM-generated answers
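As a minimal sketch, assuming the Hugging Face datasets package (which RAGAS uses under the hood), these four elements can be assembled into an evaluation dataset with one row per question; the values below are illustrative placeholders.

from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["Can I stop a SageMaker Autopilot job manually?"],
    "contexts": [["SageMaker Autopilot jobs can be stopped at any time ..."]],  # retrieved chunks per question
    "answer": ["Yes, you can stop an Autopilot job at any time."],              # RAG-generated answer
    "ground_truth": ["Yes. You can stop a job at any time."],                   # human-annotated reference
})
# Column names follow recent RAGAS versions; older versions expect "ground_truths" as a list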

We are ready to evaluate the RAG application. As described in the introduction, we select three metrics to assess our RAG solution:

  1. Faithfulness
  2. Answer Relevancy
  3. Answer Correctness

For more information, refer to Metrics.

This step involves defining an evaluation dataset with a set of ground truth questions and answers. For this post, we choose four random questions from the SageMaker FAQ. See the following code from the notebook:

EVAL_QUESTIONS = [
"Can I stop a SageMaker Autopilot job manually?",
"Do I get charged separately for each notebook created and run in SageMaker Studio?",
"Do I get charged for creating and setting up an SageMaker Studio domain?",
"Will my data be used or shared to update the base model that is offered to customers using SageMaker JumpStart?",
]

#Defining the ground truth answers for each question

EVAL_ANSWERS = [
"Yes. You can stop a job at any time. When a SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.",
"""No. You can create and run multiple notebooks on the same compute instance.
You pay only for the compute that you use, not for individual items.
You can read more about this in our metering guide.
In addition to the notebooks, you can also start and run terminals and interactive shells in SageMaker Studio, all on the same compute instance.""",
"No, you don’t get charged for creating or configuring an SageMaker Studio domain, including adding, updating, and deleting user profiles.",
"No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers."
]

Evaluation of RAG with RAGAS

Evaluating the RAG solution requires comparing LLM predictions with ground truth answers. To do so, we use the batch() function from LangChain to perform inference on all questions in our evaluation dataset.

Then we can use the evaluate() function from RAGAS to compute each metric (answer relevancy, faithfulness, and answer correctness). It uses an LLM to compute the metrics. Feel free to use other metrics from RAGAS.

See the following code from the notebook:

from ragas.metrics import answer_relevancy, faithfulness, answer_correctness
from ragas import evaluate
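# build_dataset (used below) is assumed to be a helper from the notebook's utils
# module that combines questions, ground truths, RAG responses, and retrieved
# contexts into a RAGAS-compatible dataset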

#Batch invoke and dataset creation
result_batch_questions = chain.batch([{"input": q} for q in EVAL_QUESTIONS])

dataset= build_dataset(EVAL_QUESTIONS,EVAL_ANSWERS,result_batch_questions, text_chunks)

result = evaluate(dataset=dataset, metrics=[ answer_relevancy, faithfulness, answer_correctness ],llm=llm_bedrock, embeddings=bedrock_embeddings, raise_exceptions=False )
df = result.to_pandas()
df.head()
The following screenshot shows the evaluation results and the RAGAS answer relevancy score.

Relevancy score

Answer Relevancy

In the answer_relevancy_score column, a score closer to 1 indicates the response generated is relevant to the input query.

Faithfulness

In the second column, the first query result has a lower faithfulness_score (0.2), which indicates the responses are not derived from the context and are hallucinations. The rest of the query results have a higher faithfulness_score (1.0), which indicates the responses are derived from the context.

Answer Correctness

In the last column, answer_correctness, the second and last rows have high answer correctness, meaning that the answer provided by the LLM is close to the ground truth answer.

Evaluation of RAG with LlamaIndex

LlamaIndex, similar to RAGAS, provides a comprehensive RAG evaluation module. This module offers a variety of metrics to assess the performance of your RAG system. The evaluation process generates two key outputs:

  1. Feedback: The judge LLM provides detailed evaluation feedback in the form of a string, offering qualitative insights into the system’s performance.
  2. Score: This numerical value indicates how well the answer meets the evaluation criteria. The scoring system varies depending on the specific metric being evaluated. For example, metrics like Answer Relevancy and Faithfulness are typically scored on a scale from 0 to 1.

These outputs allow for both qualitative and quantitative assessment of your RAG system’s performance, enabling you to identify areas for improvement and track progress over time.

The following is a code sample from the notebook:

from llama_index.llms.bedrock import Bedrock
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    CorrectnessEvaluator,
    FaithfulnessEvaluator
)
from utils import evaluate_llama_index_metric

bedrock_llm_llama = Bedrock(model=BEDROCK_MODEL_ID)
faithfulness= FaithfulnessEvaluator(llm=bedrock_llm_llama)
answer_relevancy= AnswerRelevancyEvaluator(llm=bedrock_llm_llama)
correctness= CorrectnessEvaluator(llm=bedrock_llm_llama)

Answer Relevancy

df_answer_relevancy= evaluate_llama_index_metric(answer_relevancy, dataset)
df_answer_relevancy.head()

The column Score shows the result for the answer_relevancy evaluation criteria. All passing values are set to 1, meaning that all predictions are relevant to the retrieved context.

Additionally, the column Feedback provides a clear explanation for the passing score. We can observe that all answers align with the context extracted from the retriever.

Answer Correctness

df_correctness= evaluate_llama_index_metric(correctness, dataset)
df_correctness.head()

All values in the Score column are set to 5.0, meaning that all predictions are consistent with the ground truth answers.

Faithfulness

The following screenshot shows the evaluation results for answer faithfulness.

df_faithfulness= evaluate_llama_index_metric(faithfulness, dataset)
df_faithfulness.head()

All values in the Score column are set to 1.0, which means all answers generated by the LLM are consistent with the retrieved context.

Conclusion

While Foundation Models offer impressive generative capabilities, their effectiveness in addressing organization-specific queries has been a persistent challenge. The Retrieval Augmented Generation framework emerges as a powerful solution, bridging this gap by enabling LLMs to leverage external, organization-specific data sources.

To truly unlock the potential of RAG pipelines, the RAGAS framework, in conjunction with LlamaIndex, provides a comprehensive evaluation solution. By meticulously assessing both retrieval and generation components, this approach empowers organizations to pinpoint areas for improvement and refine their RAG implementations. The result? Responses that are not only factually accurate but also highly relevant to user queries.

By adopting this holistic evaluation approach, enterprises can fully harness the transformative power of generative AI applications. This not only maximizes the value derived from these technologies but also paves the way for more intelligent, context-aware, and reliable AI systems that can truly understand and address an organization’s unique needs.

As we continue to push the boundaries of what’s possible with AI, tools like Amazon Bedrock, LlamaIndex, and RAGAS will play a pivotal role in shaping the future of enterprise AI applications. By embracing these innovations, organizations can confidently navigate the exciting frontier of generative AI, unlocking new levels of efficiency, insight, and competitive advantage.

For further exploration, readers interested in enhancing the reliability of AI-generated content may want to look into Amazon Bedrock’s Guardrails feature, which offers additional tools like the Contextual Grounding Check.


About the authors

Madhu is a Senior Partner Solutions Architect specializing in worldwide public sector cybersecurity partners. With over 20 years in software design and development, he collaborates with AWS partners to ensure customers implement solutions that meet strict compliance and security objectives. His expertise lies in building scalable, highly available, secure, and resilient applications for diverse enterprise needs.

Babu Kariyaden Parambath is a Senior AI/ML Specialist at AWS. At AWS, he enjoys working with customers in helping them identify the right business use case with business value and solve it using AWS AI/ML solutions and services. Prior to joining AWS, Babu was an AI evangelist with 20 years of diverse industry experience delivering AI driven business value for customers.

Innovating at speed: BMW’s generative AI solution for cloud incident analysis

This post was co-authored with Johann Wildgruber, Dr. Jens Kohl, Thilo Bindel, and Luisa-Sophie Gloger from BMW Group.

The BMW Group—headquartered in Munich, Germany—is a vehicle manufacturer with more than 154,000 employees, and 30 production and assembly facilities worldwide as well as research and development locations across 17 countries. Today, the BMW Group (BMW) is the world’s leading manufacturer of premium automobiles and motorcycles, and provider of premium financial and mobility services.

BMW Connected Company is a division within BMW responsible for developing and operating premium digital services for BMW’s connected fleet, which currently numbers more than 23 million vehicles worldwide. These digital services are used by many BMW vehicle owners daily; for example, to lock or open car doors remotely using an app on their phone, to start window defrost remotely, to buy navigation map updates from the car’s menu, or to listen to music streamed over the internet in their car.

In this post, we explain how BMW uses generative AI technology on AWS to help run these digital services with high availability. Specifically, BMW uses Amazon Bedrock Agents to make remediating (partial) service outages quicker by speeding up the otherwise cumbersome and time-consuming process of root cause analysis (RCA). The fully automated RCA agent correctly identifies the root cause in most cases (measured at 85%), and helps engineers better understand the system and gain real-time insights into their cases. This performance was further validated during the proof of concept, where employing the RCA agent on representative use cases clearly demonstrated the benefits of this solution, allowing BMW to achieve significantly lower diagnosis times.

The challenges of root cause analysis

Digital services are often implemented by chaining multiple software components together; components that might be built and run by different teams. For example, consider the service of remotely opening and locking vehicle doors. There might be a development team building and running the iOS app, another team for the Android app, a team building and running the backend-for-frontend used by both the iOS and Android app, and so on. Moreover, these teams might be geographically dispersed and run their workloads in different locations and regions; many hosted on AWS, some elsewhere.

Now consider a (fictitious) scenario where reports come in from car owners complaining that remotely locking doors with the app no longer works. Is the iOS app responsible for the outage, or the backend-for-frontend? Did a firewall rule change somewhere? Did an internal TLS certificate expire? Is the MQTT system experiencing delays? Was there an inadvertent breaking change in recent API changes? When did they actually deploy that? Or was the database password for the central subscription service rotated again?

It can be difficult to determine the root cause of issues in situations like this. It requires checking many systems and teams, many of which might be failing, because they’re interdependent. Developers need to reason about the system architecture, form hypotheses, and follow the chain of components until they have located the one that is the culprit. They often have to backtrack and reassess their hypotheses, and pursue the investigation in another chain of components.

Understanding the challenges in such complex systems highlights the need for a robust and efficient approach to root cause analysis. With this context in mind, let’s explore how BMW and AWS collaborated to develop a solution using Amazon Bedrock Agents to streamline and enhance the RCA process.

Solution overview

At a high level, the solution uses an Amazon Bedrock agent to do automated RCA. This agent has several custom-built tools at its disposal to do its job. These tools, implemented by AWS Lambda functions, use services like Amazon CloudWatch and AWS CloudTrail to analyze system logs and metrics. The following diagram illustrates the solution architecture.

High level diagram of the solution

When an incident occurs, an on-call engineer gives a description of the issue at hand to the Amazon Bedrock agent. The agent will then start investigating for the root cause of the issue, using its tools to do tasks that the on-call engineer would otherwise do manually, such as searching through logs. Based on the clues it uncovers, the agent proposes several likely hypotheses to the on-call engineer. The engineer can then resolve the issue, or give pointers to the agent to direct the investigation further. In the following section, we take a closer look at the tools the agent uses.
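For reference, the following is a minimal sketch of how an on-call engineer (or a thin UI on their behalf) might hand an incident description to an Amazon Bedrock agent with the AWS SDK for Python; the agent and alias IDs are placeholders, not BMW’s actual identifiers.

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="AGENT_ID",            # placeholder: the RCA agent's ID
    agentAliasId="AGENT_ALIAS_ID", # placeholder: the deployed alias
    sessionId=str(uuid.uuid4()),
    inputText="Users of the iOS app cannot unlock car doors remotely.",
)

# The agent streams its findings back as an event stream
for event in response["completion"]:
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"))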

Amazon Bedrock agent tools

The Amazon Bedrock agent’s effectiveness in performing RCA lies in its ability to seamlessly integrate with custom tools. These tools, designed as Lambda functions, use AWS services like CloudWatch and CloudTrail to automate tasks that are typically manual and time-intensive for engineers. By organizing its capabilities into specialized tools, the Amazon Bedrock agent makes sure that RCA is both efficient and precise.

Architecture Tool

The Architecture Tool uses C4 diagrams to provide a comprehensive view of the system’s architecture. These diagrams, enhanced through Structurizr, give the agent a hierarchical understanding of component relationships, dependencies, and workflows. This allows the agent to target the most relevant areas during its RCA process, effectively narrowing down potential causes of failure based on how different systems interact.

For instance, if an issue affects a specific service, the Architecture Tool can identify upstream or downstream dependencies and suggest hypotheses focused on those systems. This accelerates diagnostics by enabling the agent to reason contextually about the architecture instead of blindly searching through logs or metrics.
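As a simplified illustration (the actual tool derives this information from C4 models in Structurizr, and the component names below are hypothetical), the kind of dependency lookup the Architecture Tool performs might look like this:

# Hypothetical, hard-coded view of component relationships; the real tool
# reads these from the C4 architecture model
DEPENDENCIES = {
    "ios-app": ["backend-for-frontend"],
    "android-app": ["backend-for-frontend"],
    "backend-for-frontend": ["remote-vehicle-management-api"],
    "remote-vehicle-management-api": ["mqtt-broker"],
}

def downstream_components(component: str) -> list:
    # Return every component the given one depends on, directly or indirectly
    result = []
    for dep in DEPENDENCIES.get(component, []):
        result.append(dep)
        result.extend(downstream_components(dep))
    return result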

Logs Tool

The Logs Tool uses CloudWatch Logs Insights to analyze log data in real time. By searching for patterns, errors, or anomalies, as well as comparing the trend to the previous period, it helps the agent pinpoint issues related to specific events, such as failed authentications or system crashes.

For example, in a scenario involving database access failures, the Logs Tool might identify a new spike in the number of error messages such as “FATAL: password authentication failed” compared to the previous hour. This insight allows the agent to quickly associate the failure with potential root causes, such as an improperly rotated database password.
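A minimal sketch of such a log search with CloudWatch Logs Insights follows; the log group name and query string are illustrative assumptions.

import time
from datetime import datetime, timedelta
import boto3

logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName="/bmw/subscription-service",  # placeholder log group
    startTime=int((datetime.utcnow() - timedelta(hours=1)).timestamp()),
    endTime=int(datetime.utcnow().timestamp()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /password authentication failed/ "
        "| stats count() as failures by bin(5m)"
    ),
)["queryId"]

# Logs Insights queries run asynchronously, so poll until the results are ready
results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Running", "Scheduled"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query_id)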

Metrics Tool

The Metrics Tool provides the agent with real-time insights into the system’s health by monitoring key metrics through CloudWatch. This tool identifies statistical anomalies in critical performance indicators such as latency, error rates, resource utilization, or unusual spikes in usage patterns, which can often signal potential issues or deviations from normal behavior.

For instance, in a Kubernetes memory overload scenario, the Metrics Tool might detect a sharp increase in memory consumption or unusual resource allocation prior to the failure. By surfacing CloudWatch metric alarms for such anomalies, the tool enables the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load, guiding the investigation more effectively toward resolving the issue.
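The following is a simplified sketch of the kind of period-over-period comparison the Metrics Tool performs with CloudWatch; the metric name and dimensions are placeholders.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

def average_errors(start, end):
    # Average 5XX error count for a placeholder API over the given window
    data = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApiGateway",
        MetricName="5XXError",
        Dimensions=[{"Name": "ApiName", "Value": "remote-vehicle-management"}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )["Datapoints"]
    return sum(p["Average"] for p in data) / max(len(data), 1)

now = datetime.utcnow()
current = average_errors(now - timedelta(hours=1), now)
previous = average_errors(now - timedelta(hours=2), now - timedelta(hours=1))
# A large jump versus the previous hour is surfaced to the agent as an anomaly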

Infrastructure Tool

The Infrastructure Tool uses CloudTrail data to analyze critical control-plane events, such as configuration changes, security group updates, or API calls. This tool is particularly effective in identifying misconfigurations or breaking changes that might trigger cascading failures.

Consider a case where a security group ingress rule is inadvertently removed, causing connectivity issues between services. The Infrastructure Tool can detect and correlate this event with the reported incident, providing the agent with actionable insights to guide its RCA process.
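A minimal sketch of such a control-plane lookup with CloudTrail is shown below; the event name is one example of a change worth correlating with an incident.

from datetime import datetime, timedelta
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look for recently revoked security group ingress rules, a common cause of
# sudden connectivity failures between services
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "RevokeSecurityGroupIngress"}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=2),
    EndTime=datetime.utcnow(),
)["Events"]

for event in events:
    print(event["EventTime"], event.get("Username"), event["EventName"])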

By combining these tools, the Amazon Bedrock agent mimics the step-by-step reasoning of an experienced engineer while executing tasks at machine speed. The modular nature of the tools allows for flexibility and customization, making sure that RCA is tailored to the unique needs of BMW’s complex, multi-regional cloud infrastructure.

In the next section, we discuss how these tools work together within the agent’s workflow.

Amazon Bedrock agents: The ReAct framework in action

At the heart of BMW’s rapid RCA lies the ReAct (Reasoning and Action) agent framework, an innovative approach that dynamically combines logical reasoning with task execution. By integrating ReAct with Amazon Bedrock, BMW gains a flexible solution for diagnosing and resolving complex cloud-based incidents. Unlike traditional methods, which rely on predefined workflows, ReAct agents use real-time inputs and iterative decision-making to adapt to the specific circumstances of an incident.

The ReAct agent in BMW’s RCA solution uses a structured yet adaptive workflow to diagnose and resolve issues. First, it interprets the textual description of an incident (for example, “Vehicle doors cannot be locked via the app”) to identify which parts of the system are most likely impacted. Guided by the ReAct framework’s iterative reasoning, the agent then gathers evidence by calling specialized tools, using data centrally aggregated in a cross-account observability setup. By continuously reevaluating the results of each tool invocation, the agent zeros in on potential causes—whether an expired certificate, a revoked firewall rule, or a spike in traffic—until it isolates the root cause. The following diagram illustrates this workflow.

The ReAct framework offers the following benefits:

  • Dynamic and adaptive – The ReAct agent tailors its approach to the specific incident, rather than a one-size-fits-all methodology. This adaptability is especially critical in BMW’s multi-regional, multi-service architecture.
  • Efficient tool utilization – By reasoning about which tools to invoke and when, the ReAct agent minimizes redundant queries, providing faster diagnostics without overloading AWS services like CloudWatch or CloudTrail.
  • Human-like reasoning – The ReAct agent mimics the logical thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause. This capability bridges the gap between automation and human expertise.

By employing Amazon Bedrock ReAct agents, BMW achieves significantly lower diagnosis times. These agents not only enhance operational efficiency but also empower engineers to focus on strategic improvements rather than labor-intensive diagnostics.

Case study: Root cause analysis “Unlocking vehicles via the iOS app”

To illustrate the power of Amazon Bedrock agents in action, let us explore a possible real-world scenario involving the interplay between BMW’s connected fleet and the digital services running in the cloud backend.

We deliberately change the security group for the central networking account in a test environment. This has the effect that requests from the fleet are (correctly) blocked by the changed security group and do not reach the services hosted in the backend. Hence, a test user cannot lock or unlock her vehicle door remotely.

Incident details

BMW engineers received a report from a tester indicating the remote lock/unlock functionality on the mobile app does not work.

This report raised immediate questions: was the issue in the app itself, the backend-for-frontend service, or deeper within the system, such as in the MQTT connectivity or authentication mechanisms?

How the ReAct agent addresses the problem

The problem is described to the Amazon Bedrock ReAct agent: “Users of the iOS app cannot unlock car doors remotely.” The agent immediately begins its analysis:

  1. The agent begins by understanding the overall system architecture, calling the Architecture Tool. The outputs of the architecture tool reveal that the iOS app, like the Android app, is connected to a backend-for-frontend API, and that the backend-for-frontend API itself is connected to several other internal APIs, such as the Remote Vehicle Management API. The Remote Vehicle Management API is responsible for sending commands to cars by using MQTT messaging.
  2. The agent uses the other tools at its disposal in a targeted way: it scans the logs, metrics, and control plane activities of only those components that are involved in remotely unlocking car doors: iOS app remote logs, backend-for-frontend API logs, and so on. The agent finds several clues:
    1. Anomalous logs that indicate connectivity issues (network timeouts).
    2. A sharp decrease in the number of successful invocations of the Remote Vehicle Management API.
    3. Control plane activities: several security groups in the central networking account hosted on the testing environment were changed.
  3. Based on those findings, the agent infers and defines several hypotheses and presents these to the user, ordered by their likelihood. In this case, the first hypothesis is the actual root cause: a security group was inadvertently changed in the central networking account, which meant that network traffic between the backend-for-frontend and the Remote Vehicle Management API was now blocked. The agent correctly correlated logs (“fetch timeout error”), metrics (decrease in invocations) and control plane changes (security group ingress rule removed) to come to this conclusion.
  4. If the on-call engineer wants further information, they can now ask follow-up questions to the agent, or instruct the agent to investigate elsewhere as well.

The entire process—from incident detection to resolution—took minutes, compared to the hours it could have taken with traditional RCA methods. The ReAct agent’s ability to dynamically reason, access cross-account observability data, and iterate on its hypotheses alleviated the need for tedious manual investigations.

Conclusion

By using Amazon Bedrock ReAct agents, BMW has shown how to improve its approach to root cause analysis, turning a complex and manual process into an efficient, automated workflow. The tools integrated within the ReAct framework significantly narrow down potential reasoning space, and enable dynamic hypotheses generation and targeted diagnostics, mimicking the reasoning process of seasoned engineers while operating at machine speed. This innovation has reduced the time required to identify and resolve service disruptions, further enhancing the reliability of BMW’s connected services and improving the experience for millions of customers worldwide.

The solution has demonstrated measurable success, with the agent identifying root causes in 85% of test cases and providing detailed insights in the remainder, greatly expediting engineers’ investigations. By lowering the barrier to entry for junior engineers, it has enabled less-experienced team members to diagnose issues effectively, maintaining reliability and scalability across BMW’s operations.

Incorporating generative AI into RCA processes showcases the transformative potential of AI in modern cloud-based operations. The ability to adapt dynamically, reason contextually, and handle complex, multi-regional infrastructures makes Amazon Bedrock Agents a game changer for organizations aiming to maintain high availability in their digital services.

As BMW continues to expand its connected fleet and digital offerings, the adoption of generative AI-driven solutions like Amazon Bedrock will play an important role in maintaining operational excellence and delivering seamless experiences to customers. By following BMW’s example, your organization can also benefit from Amazon Bedrock Agents for root cause analysis to enhance service reliability.

Get started by exploring Amazon Bedrock Agents to optimize your incident diagnostics or use CloudWatch Logs Insights to identify anomalies in your system logs. If you want a hands-on introduction to creating your own Amazon Bedrock agents—complete with code examples and best practices—check out the following GitHub repo. These tools are setting a new industry standard for efficient RCA and operational excellence.


About the Authors

Johann Wildgruber is a transformation lead reliability engineer at BMW Group, working currently to set up an observability platform to strengthen the reliability of ConnectedDrive services. Johann has several years of experience as a product owner in operating and developing large and complex cloud solutions. He is interested in applying new technologies and methods in software development.

Dr. Jens Kohl is a technology leader and builder with over 13 years of experience at the BMW Group. He is responsible for shaping the architecture and continuous optimization of the Connected Vehicle cloud backend. Jens has been leading software development and machine learning teams with a focus on embedded, distributed systems and machine learning for more than 10 years.

Thilo Bindel is leading the Offboard Reliability & Data Engineering team at BMW Group. He is responsible for defining and implementing strategies to ensure reliability, availability, and maintainability of BMW’s backend services in the Connected Vehicle domain. His goal is to establish reliability and data engineering best practices consistently across the organization and to position the BMW Group as a leader in data-driven observability within the automotive industry and beyond.

Luisa-Sophie Gloger is a Data Scientist at the BMW Group with a focus on Machine Learning. As a lead developer within the Connected Company’s Connected AI platform team, she enjoys helping teams to improve their products and workflows with Generative AI. She also has a background in working on Natural Language processing (NLP) and a degree in psychology.

Tanrajbir Takher is a Data Scientist at AWS’s Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.

Otto Kruse is a Principal Solutions Developer within AWS Industries – Prototyping and Customer Engineering (PACE), a multi-disciplinary team dedicated to helping large companies utilize the potential of the AWS cloud by exploring and implementing innovative ideas. Otto focuses on application development and security.

Huong Vu is a Data Scientist at AWS Generative AI Innovation Centre. She drives projects to deliver generative-AI applications for enterprise customers from a diverse range of industries. Prior to AWS, she worked on improving NLP models for Alexa shopping assistant both on the Amazon.com website and on Echo devices.

Aishwarya is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using Generative AI and cloud-based technologies.

Satyam Saxena is an Applied Science Manager at AWS Generative AI Innovation Center team. He leads Generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.

Kim Robins, a Senior AI Strategist at AWS’s Generative AI Innovation Center, leverages his extensive artificial intelligence and machine learning expertise to help organizations develop innovative products and refine their AI strategies, driving tangible business value.

Time series forecasting with LLM-based foundation models and scalable AIOps on AWS

Time series forecasting is critical for decision-making across industries. From predicting traffic flow to sales forecasting, accurate predictions enable organizations to make informed decisions, mitigate risks, and allocate resources efficiently. However, traditional machine learning approaches often require extensive data-specific tuning and model customization, resulting in lengthy and resource-heavy development.

Enter Chronos, a cutting-edge family of time series models that uses the power of large language model (LLM) architectures to break through these hurdles. As a foundation model, Chronos is pre-trained on large and diverse datasets, enabling it to generalize forecasting capabilities across multiple domains. This innovative approach allows Chronos to excel at zero-shot forecasts—predictions made without specific training on the target dataset. Chronos outperforms task-specific models across most benchmarked datasets.

Chronos is founded on a key insight: both LLMs and time series forecasting aim to decode sequential patterns to predict future events. This parallel allows us to treat time series data as a language to be modeled by off-the-shelf transformer architectures. To make this possible, Chronos converts continuous time series data into a discrete vocabulary through a two-step process of scaling the time series by its absolute mean and then quantizing the scaled time series into a fixed number of equally spaced bins.
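As a rough illustration of this two-step process (the bin count and clipping range below are arbitrary choices for the sketch, not the values used by Chronos), the tokenization can be expressed as follows:

import numpy as np

def tokenize_series(series, num_bins=4096, limit=15.0):
    # Step 1: mean scaling - divide by the mean absolute value of the series
    scale = np.abs(series).mean()
    scaled = series / scale

    # Step 2: uniform quantization - map each scaled value to one of
    # num_bins equally spaced bins inside [-limit, limit]
    bin_edges = np.linspace(-limit, limit, num_bins - 1)
    tokens = np.digitize(scaled, bin_edges)
    return tokens, scale  # the scale is kept so forecasts can be de-quantized

tokens, scale = tokenize_series(np.array([10.0, 12.0, 11.0, 15.0, 14.0]))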

In this blog post, we will guide you through the process of integrating Chronos into Amazon SageMaker Pipelines using a synthetic dataset that simulates a sales forecasting scenario, unlocking accurate and efficient predictions with minimal data. You will learn how to use SageMaker Pipelines features to orchestrate the entire workflow from fine-tuning to deployment. By the end of this journey, you will be equipped to streamline your development process and apply Chronos to any time series data, transforming your forecasting approach.

Prerequisites

SageMaker domain access with required IAM permissions: You need to have access to a SageMaker domain with the necessary AWS Identity and Access Management (IAM) permissions to create and manage resources. Make sure that you have the required permissions to create notebooks, deploy models, and perform other tasks outlined in this post. See quick setup for Amazon SageMaker AI for instructions about setting up a SageMaker domain. To follow along, see the code in GitHub.


Overview of SageMaker Pipelines

We use SageMaker Pipelines to orchestrate training and evaluation experiments. With Amazon SageMaker Pipelines, you can:

  • Run multiple experiment iterations simultaneously, reducing overall processing time and cost
  • Monitor and visualize the performance of each experiment run with Studio integration
  • Invoke downstream workflows for further analysis, deployment, or model selection

Training pipeline

SageMaker Pipelines

Generate data

The availability and quality of public time series data are limited compared to the extensive high-quality text datasets available in the natural language processing (NLP) domain. This disparity poses challenges for training models intended for zero-shot forecasting, which requires large-scale, diverse time series data. Given that we’re fine-tuning a pretrained Chronos model, we use only a small set of synthetically generated data.

To generate diverse time series patterns, the first step in our pipeline generates a synthetic dataset using a kernel bank of basis kernels. These kernels define fundamental time series patterns, including linear trends, smooth local variations, and seasonality. By combining these kernels through random binary operations, we create complex, synthetic time series data. This process allows us to generate intricate patterns from simple basis kernels.
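The following is a highly simplified sketch of this idea (the actual generate_data.py works with a richer kernel bank): basic pattern generators are drawn at random and combined through random addition or multiplication.

import numpy as np

LENGTH = 1024

# A tiny "kernel bank" of basis patterns: linear trend, seasonality, smooth noise
def linear_trend(t):
    return np.random.uniform(-0.01, 0.01) * t

def seasonality(t):
    period = np.random.choice([24, 168, 365])
    return np.sin(2 * np.pi * t / period)

def smooth_noise(t):
    return np.convolve(np.random.randn(len(t)), np.ones(20) / 20, mode="same")

def synthetic_series(num_components=3):
    t = np.arange(LENGTH, dtype=float)
    series = np.random.choice([linear_trend, seasonality, smooth_noise])(t)
    for _ in range(num_components - 1):
        component = np.random.choice([linear_trend, seasonality, smooth_noise])(t)
        # Randomly combine patterns through addition or multiplication
        series = series + component if np.random.rand() < 0.5 else series * component
    return series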

This data processing job is accomplished using a PyTorchProcessor, which runs PyTorch code (generate_data.py) within a container managed by SageMaker. Data and other relevant artifacts for debugging are located in the default Amazon Simple Storage Service (Amazon S3) bucket associated with the SageMaker account. Logs for each step in the pipeline can be found in Amazon CloudWatch.

base_job_name = f"{pipeline_name}/data-generation-step"

script_processor = PyTorchProcessor( 
    command=['python3'],
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    base_job_name=base_job_name,
    sagemaker_session=pipeline_session,
    framework_version='1.13',
    py_version='py39'
)

Hyperparameter search

After data generation, we fine-tune a pretrained Chronos model. Fine-tuning allows it to specialize in a specific use case that may not be well represented in its pretraining data. In this post, we use amazon/chronos-t5-small, but you can use any model that fits your needs. The following table shows the available models.

Model Parameters Based on
chronos-t5-tiny 8M t5-efficient-tiny
chronos-t5-mini 20M t5-efficient-mini
chronos-t5-small 46M t5-efficient-small
chronos-t5-base 200M t5-efficient-base
chronos-t5-large 710M t5-efficient-large

For optimal output, we use automatic model tuning to find the best version of a model through hyperparameter tuning. This step is integrated into SageMaker Pipelines and enables running multiple training jobs in parallel, employing various methods and predefined hyperparameter ranges. In our pipeline, we specifically tune the learning rate to optimize our model’s performance. With the hyperparameter tuning capability in SageMaker, we increase the likelihood that our model achieves optimal accuracy and generalization for the given task.

estimator = PyTorch(
    role=role,
    instance_type=pipeline_parameters['training_instance_type'],
    output_path=f"s3://{bucket_name}/{pipeline_name}/models/",
    instance_count=1,
    source_dir='model',
    image_uri=train_image_uri,
    entry_point=model_name + ".py",
    base_job_name = f"{pipeline_name}/training/job",
)

hyper_ranges = {
     'learning-rate': ContinuousParameter(1e-5, 1e-4),
}

objective_name = "logloss"
metric_definitions = [{"Name": objective_name, "Regex": "'loss': ([0-9\.]+),"}]

tuner_log = HyperparameterTuner(
    estimator,
    objective_name,
    hyper_ranges,
    metric_definitions,
    max_jobs=pipeline_parameters['max_jobs'], 
    max_parallel_jobs=pipeline_parameters['max_parallel_jobs'],
    objective_type="Minimize",
    base_tuning_job_name=f"{pipeline_name}/HPTuning/{model_name}",
    random_seed=10
)

Amazon SageMaker Model Registry

The selected model is then uploaded to SageMaker Model Registry, which plays a critical role in managing models that are ready for production. It stores models, organizes model versions, captures essential metadata and artifacts such as container images, and governs the approval status of each model. By using the registry, we can efficiently deploy models to accessible SageMaker environments and establish a foundation for model versioning.

registration_steps = {}

register_args = best_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=[instance_type],
    transform_instances=[instance_type],
    model_package_group_name=model_package_group_name,
    domain="MACHINE_LEARNING",
    description="Chronos",
    task="REGRESSION",
    framework="PYTORCH",
    image_uri=inference_image_uri
)
registration_steps = ModelStep(
    name=model_name, 
    step_args=register_args
)

Inference

Upon completion of our training pipeline, our model is then deployed using SageMaker hosting services, which enables the creation of an inference endpoint for real-time predictions. This endpoint allows seamless integration with applications and systems, providing on-demand access to the model’s predictive capabilities through a secure HTTPS interface. Real-time predictions can be used in scenarios such as stock price and energy demand forecasts.

endpoint_name = "chronos-endpoint-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(f"EndpointName: {endpoint_name}")
model.deploy(
    initial_instance_count=1, 
    instance_type="ml.p3.2xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    endpoint_name=endpoint_name
)

predictor = Predictor(endpoint_name=endpoint_name)

payload = {"inputs": input_data}
jstr = json.dumps(payload)

p = predictor.predict(
    jstr,
    initial_args={
        "ContentType": 'application/json'
    }
)

Sample prediction output

The following figure demonstrates a sample forecast from the Chronos endpoint.

Sample Forecast

Chronos benchmark performance

Benchmarks

The preceding graph shows the performance evaluation of various time series forecasting models based on 27 datasets not used in training the Chronos models. The benchmark assesses zero-shot performance of Chronos models against local statistical models, task-specific models, and pretrained models. The evaluation uses two metrics: probabilistic forecasting (WQL) and point forecasting (MASE); both normalized using a Seasonal Naive baseline. The results are aggregated using geometric means. It’s noted that some of the above pretrained models had prior exposure to the benchmark datasets.
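As a small illustration of this aggregation (the numbers below are made up), per-dataset scores are first divided by the Seasonal Naive baseline, and the relative scores are then combined with a geometric mean:

import numpy as np

model_wql = np.array([0.08, 0.12, 0.05])      # hypothetical per-dataset WQL for a model
baseline_wql = np.array([0.10, 0.15, 0.04])   # hypothetical WQL of the Seasonal Naive baseline

relative = model_wql / baseline_wql           # < 1 means better than the baseline
aggregate = np.exp(np.log(relative).mean())   # geometric mean across datasets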

Zero shot results are from Chronos: Learning the Language of Time Series.

Conclusion

In this blog post, we’ve demonstrated how to use Amazon SageMaker AIOps features to deploy Chronos, a powerful time series forecasting model based on LLM architectures. By using SageMaker Pipelines, we’ve showcased a comprehensive approach to building, training, and deploying sophisticated forecasting models at scale. This implementation offers efficiency in model development, scalability, streamlined AIOps, real-time inference capabilities, and cost-effectiveness. The integration of Chronos with SageMaker opens up new possibilities for businesses across various sectors to implement advanced time series forecasting without extensive in-house machine learning expertise. As AI and machine learning continue to evolve, solutions like Chronos on Amazon SageMaker represent a significant step forward in making sophisticated forecasting techniques more accessible and actionable, potentially leading to more informed decision-making and improved operational efficiency across industries.


Feel free to leave a comment with any thoughts or questions!


About the Authors

Alston Chan is a Software Development Engineer at Amazon Ads. He builds machine learning pipelines and recommendation systems for product recommendations on the Detail Page. Outside of work, he enjoys game development and rock climbing.

Maria Masood specializes in building data pipelines and data visualizations at AWS Commerce Platform. She has expertise in Machine Learning, covering natural language processing, computer vision, and time-series analysis. A sustainability enthusiast at heart, Maria enjoys gardening and playing with her dog during her downtime.

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Ground truth generation and review best practices for evaluating generative AI question-answering with FMEval

Generative AI question-answering applications are pushing the boundaries of enterprise productivity. These assistants can be powered by various backend architectures including Retrieval Augmented Generation (RAG), agentic workflows, fine-tuned large language models (LLMs), or a combination of these techniques. However, building and deploying trustworthy AI assistants requires a robust ground truth and evaluation framework.

Ground truth data in AI refers to data that is known to be factual, representing the expected use case outcome for the system being modeled. By providing an expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Running deterministic evaluation of generative AI assistants against use case ground truth data enables the creation of custom benchmarks. These benchmarks are essential for tracking performance drift over time and for statistically comparing multiple assistants in accomplishing the same task. Additionally, they enable quantifying performance changes as a function of enhancements to the underlying assistant, all within a controlled setting. With deterministic evaluation processes such as the Factual Knowledge and QA Accuracy metrics of FMEval, ground truth generation and evaluation metric implementation are tightly coupled. To ensure the highest quality measurement of your question answering application against ground truth, the evaluation metric’s implementation must inform ground truth curation.

In this post, we discuss best practices for applying LLMs to generate ground truth for evaluating question-answering assistants with FMEval on an enterprise scale. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, and provides standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, see Evaluate large language models for quality and responsibility of LLMs. Additionally, see the Generative AI Security Scoping Matrix for guidance on moderating confidential and personally identifiable information (PII) as part of your generative AI solution.

By following these guidelines, data teams can implement high fidelity ground truth generation for question-answering use case evaluation with FMEval. For ground truth curation best practices for question answering evaluations with FMEval that you can use to design FMEval ground truth prompt templates, see Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval.

Generating ground truth for FMEval question-answering evaluation

One option to get started with ground truth generation is human curation of a small question-answer dataset. The human curated dataset should be small (based on bandwidth), high in signal, and ideally prepared by use case subject matter experts (SMEs). The exercise of generating this dataset forces a data alignment exercise early in the evaluation process, raising important questions and conversations among use case stakeholders about what questions are important to measure over time for the business. The outcomes for this exercise are three-fold:

  • Stakeholder alignment on the top N important questions
  • Stakeholder awareness of the evaluation process
  • A high-fidelity starter ground truth dataset for the first proof of concept evaluation, produced as a result of that alignment and awareness exercise

While an SME ground truth curation exercise is a strong start, at the scale of an enterprise knowledge base, pure SME generation of ground truth will become prohibitively time and resource intensive. To scale ground truth generation and curation, you can apply a risk-based approach in conjunction with a prompt-based strategy using LLMs. It’s important to note that LLM-generated ground truth isn’t a substitute for use case SME involvement. For example, if ground truth is generated by LLMs before the involvement of SMEs, SMEs will still be needed to identify which questions are fundamental to the business and then align the ground truth with business value as part of a human-in-the-loop process.

To demonstrate, we provide a step-by-step walkthrough using Amazon’s 2023 letter to shareholders as source data.

In keeping with ground truth curation best practices for FMEval question-answering, ground truth is curated as question-answer-fact triplets. The question and answer are curated to suit the ideal question-answering assistant response in terms of content, length, and style. The fact is a minimal representation of the ground truth answer, comprising one or more subject entities of the question.

For example, consider how the following source document chunk from the Amazon 2023 letter to shareholders can be converted to question-answering ground truth.

Dear Shareholders:

Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B. Further, Amazon’s operating income and Free Cash Flow (“FCF”) dramatically improved. Operating income in 2023 improved 201% YoY from $12.2B (an operating margin of 2.4%) to $36.9B (an operating margin of 6.4%).

To convert the source document excerpt into ground truth, we provide a base LLM prompt template. In the template, we instruct the LLM to take a fact-based approach to interpreting the chunk using chain-of-thought logic. For our example, we work with Anthropic’s Claude LLM on Amazon Bedrock. The template is compatible with and can be modified for other LLMs, such as LLMs hosted on Amazon SageMaker JumpStart and self-hosted on AWS infrastructure. To modify the prompt for use by other LLMs, a different approach to denoting prompt sections than XML tags might be required. For example, Meta Llama models apply tags such as <s> [INST] and <<SYS>>. For more information, see the Amazon Bedrock documentation on LLM prompt design and the FMEval documentation.

The LLM is assigned a persona to set its point of view for carrying out the task. In the instructions, the LLM identifies facts as entities from the source document chunk. For each fact, a question-answer-fact triplet is assembled based on the fact detected and its surrounding context. In the prompt, we provide detailed examples for controlling the content of ground truth questions. The examples focus on questions on chunk-wise business knowledge while ignoring irrelevant metadata that might be contained in a chunk. You can customize the prompt examples to fit your ground truth use case.

We further instruct the LLM to apply ground truth curation best practices for FMEval, such as generating multiple variations of facts to fit multiple possible unit expressions. Additional curation elements subject to the task at hand—such as brand language and tone—can be introduced into the ground truth generation prompt. With the following template, we verified that Anthropic’s Claude Sonnet 3.5 can generate custom ground truth attributes accommodating FMEval features, such as the <OR> delimiter to denote alternative acceptable answers for a ground truth fact.

"""You are an expert in ground truth curation for generative AI application evaluation on AWS.

Follow the instructions provided in the <instructions> XML tag for generating question answer fact triplets from a source document excerpt.

<instructions>
- Let's work this out in a step-by-step way to be sure we have the right answer.
- Review the source document excerpt provided in <document> XML tags below
- For each meaningful domain fact in the <document>, extract an unambiguous question-answer-fact set in JSON format including a question and answer pair encapsulating the fact in the form of a short sentence, followed by a minimally expressed fact extracted from the answer.

<domain_knowledge_focus>
- Focus ONLY on substantive domain knowledge contained within the document content
- Ignore all metadata and structural elements including but not limited to:
- Document dates, versions, page numbers
- Section numbers or titles
- Table structure or row/column positions
- List positions or ordering
- Questions must reference specific domain entities rather than generic document elements
</domain_knowledge_focus>

<context_specification_requirements>
Document Source Identification
- Always reference the specific source document and its date/version
- Example: "According to the [Document Name + Date], what is [specific query]?"

Cross-Reference Prevention
- Each question must be answerable from the current document chunk only
- Do not create questions requiring information from multiple documents
- Example: "In this [Document Name], what are [specific requirements]?"

Department/LOB Specification
- Always specify the relevant department, line of business, or organizational unit
- Example: "What are the [Department Name]'s requirements for [specific process]?"

Document Section Targeting
- Reference specific sections when the information location is relevant
- Example: "In Section [X] of [Document Name], what are the steps for [specific process]?"

Role-Based Context
- Specify relevant roles, responsibilities, or authority levels
- Example: "Which [specific roles] are authorized to [specific action]?"

Version Control Elements
- Include relevant version or revision information
- Example: "What changes were implemented in the [Month Year] revision of [Document]?"

Policy/Procedure Numbers
- Include specific policy or procedure reference numbers
- Example: "Under Policy [Number], what are the requirements for [specific action]?"

Regulatory Framework References
- Specify relevant regulatory frameworks or compliance requirements
- Example: "What [Regulation] compliance requirements are specified for [specific process]?"

System/Platform Specification
- Name specific systems, platforms, or tools
- Example: "What steps are required in [System Name] to [specific action]?"

Document Type Classification
- Specify the type of document (SOP, Policy, Manual, etc.)
- Example: "In the [Document Type + Number], where is [specific information] stored?"

Temporal Validity
- Include effective dates or time periods
- Example: "What process is effective from [Date] according to [Document]?"

Geographic Jurisdiction
- Specify relevant geographic regions or jurisdictions
- Example: "What requirements apply to [Region] according to [Document]?"

Business Process Owner
- Identify relevant process owners or responsible parties
- Example: "According to [Document], who owns the process for [specific action]?"

Classification Level
- Include relevant security or confidentiality classifications
- Example: "What are the requirements for [Classification Level] data?"

Stakeholder Scope
- Specify relevant stakeholders or approval authorities
- Example: "Which [stakeholder level] must approve [specific action]?"
</context_specification_requirements>

<question_quality_criteria>
- Questions must be specific enough that a vector database can match them to the relevant document chunk
- Questions should include key identifying terms, names, and context
- Questions should target concrete, actionable information
- Answers should provide complete context without referring back to document elements
</question_quality_criteria>

<output_format>
The question-answer-fact set should each be a short string in JSON format with the keys: "question", "ground_truth_answer", "fact"
</output_format>

<best_practices>
- Questions, answers, and facts should not refer to the subject entity as "it" or "they", and instead refer to it directly by name
- Questions, answers, and facts should be individually unique to the document chunk, such that based on the question a new call to the retriever will address the correct document section when posing the ground truth question
- Facts should be represented in 3 or fewer words describing an entity in the <document>
- If there are units in the fact, the "fact" entry must provide multiple versions of the fact using <OR> as a delimiter. See <unit_variations> for examples.
<unit_variations>
- Dollar Unit Equivalencies: `1,234 million<OR>1.234 billion`
- Date Format Equivalencies: `2024-01-01<OR>January 1st 2024`
- Number Equivalencies: `1<OR>one`
</unit_variations>
</best_practices>

- Start your response immediately with the question-answer-fact set JSON, and separate each extracted JSON record with a newline.
</instructions>

<document>
{context_document}
</document>

Now, extract the question answer pairs and fact from the document excerpt according to your instructions, starting immediately with JSON and no preamble."""
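To make the generation step concrete, the following minimal sketch shows one way to invoke this template through the Amazon Bedrock Converse API (via boto3's bedrock-runtime client) and parse the response into records. The model ID and the generate_triplets helper are illustrative assumptions, not part of the published pipeline.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Paste the full ground truth prompt template shown above into this constant
GROUND_TRUTH_PROMPT = """..."""

def generate_triplets(chunk_text, model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Fill the prompt with a document chunk and parse the JSONLines response into dicts."""
    prompt = GROUND_TRUTH_PROMPT.replace("{context_document}", chunk_text)
    response = bedrock_runtime.converse(
        modelId=model_id,  # illustrative model ID; use the model configured for your account
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 2048},
    )
    output_text = response["output"]["message"]["content"][0]["text"]
    # Each non-empty line of the response is expected to be one question-answer-fact record
    return [json.loads(line) for line in output_text.splitlines() if line.strip()]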

The generation output is provided as fact-wise JSONLines records in the following format, where elements in square brackets represent values from a line in Table 1.

{
    "question": "[Question]",
    "ground_truth_answer": "[Ground Truth Answer]",
    "fact": "[Fact]"
}

Here are a few examples of generated ground truth:

Question | Ground Truth Answer | Fact
What was Amazon’s total revenue growth in 2023? | Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023. | 12%<OR>$514B to $575B
How much did North America revenue increase in 2023? | North America revenue increased 12% year-over-year from $316B to $353B. | 12%<OR>$316B to $353B
What was the growth in International revenue for Amazon in 2023? | International revenue grew 11% year-over-year from $118B to $131B. | 11%<OR>$118B to $131B
How much did AWS revenue increase in 2023? | AWS revenue increased 13% year-over-year from $80B to $91B. | 13%<OR>$80B to $91B
What was Amazon’s operating income improvement in 2023? | Operating income in 2023 improved 201% year-over-year from $12.2B to $36.9B. | 201%<OR>$12.2B to $36.9B
What was Amazon’s operating margin in 2023? | Amazon’s operating margin in 2023 was 6.4%. | 6.4%

Scaling ground truth generation with a pipeline

To automate ground truth generation, we provide a serverless batch pipeline architecture, shown in the following figure. At a high level, the AWS Step Functions pipeline accepts source data in Amazon Simple Storage Service (Amazon S3), and orchestrates AWS Lambda functions for ingestion, chunking, and prompting on Amazon Bedrock to generate the fact-wise JSONLines ground truth.

A step function pipeline containing chunking map state and generation map state with fork for human in the loop

There are three user inputs to the step function:

  • A custom name for the ground truth dataset
  • The input Amazon S3 prefix for the source data
  • The percentage to sample for review

Additional configurations are set by Lambda environment variables, such as the S3 source bucket and Amazon Bedrock Model ID to invoke on generation.

{
    "dataset_name": "YOUR_DATASET_NAME",
    "input_prefix": "YOUR_INPUT_PREFIX",
    "review_percentage": "REVIEW_PERCENTAGE"
}

After the initial payload is passed, a validation function assembles the global event payload structure in terms of system input and user input.

{
    "system_input": {
        "run_id": "<AWS Step Function execution ID>",
        "input_bucket": "<Input data Amazon S3 bucket>",
        "output_bucket": "<Output data Amazon S3 bucket>",
        "output_document_chunks_prefix": "<Amazon S3 bucket prefix to store chunks>",
        "chunk_size": "<Document chunk size>",
        "chunk_overlap": "<Number of tokens that will overlap across consecutive chunks>"
    },
    "user_input": {
        "dataset_name": "<Dataset name>",
        "input_prefix": "<Amazon S3 bucket prefix for ground truth generation input data>",
        "review_percentage": "<Percent of records to flag for human review>"
    }
}
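The following is a minimal sketch of what such a validation function might look like as a Lambda handler; the environment variable names, defaults, and handler shape are assumptions for illustration.

import os
import uuid

def lambda_handler(event, context):
    """Assemble the global event payload from system configuration and user input."""
    user_input = {
        "dataset_name": event["dataset_name"],
        "input_prefix": event["input_prefix"],
        "review_percentage": event["review_percentage"],
    }
    system_input = {
        # Fall back to a generated ID if the Step Functions execution ID is not passed in
        "run_id": event.get("execution_id", str(uuid.uuid4())),
        "input_bucket": os.environ["INPUT_BUCKET"],
        "output_bucket": os.environ["OUTPUT_BUCKET"],
        "output_document_chunks_prefix": os.environ["CHUNKS_PREFIX"],
        "chunk_size": int(os.environ.get("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.environ.get("CHUNK_OVERLAP", "100")),
    }
    return {"system_input": system_input, "user_input": user_input}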

After validation, the first distributed map state iterates over the files in the input bucket to start the document ingestion and chunking processes with horizontal scaling. The resulting chunks are stored in an intermediate S3 bucket.
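As an illustration of the chunking work performed by this map state, the sketch below splits one source document with LangChain's RecursiveCharacterTextSplitter and writes the chunks to an intermediate prefix; the splitter choice, bucket parameters, and key layout are assumptions rather than the exact implementation.

import json
import boto3
from langchain_text_splitters import RecursiveCharacterTextSplitter

s3 = boto3.client("s3")

def chunk_document(bucket, key, output_bucket, chunks_prefix, chunk_size=1000, chunk_overlap=100):
    """Split one source document into overlapping chunks and store them in the intermediate bucket."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunk_keys = []
    for i, chunk in enumerate(splitter.split_text(body)):
        chunk_key = f"{chunks_prefix}/{key}/chunk_{i}.json"
        s3.put_object(Bucket=output_bucket, Key=chunk_key, Body=json.dumps({"chunk": chunk}))
        chunk_keys.append(chunk_key)
    return chunk_keys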

The second distributed map is the generation core of the pipeline. Each chunk generated by the previous map is fed as an input to the ground truth generation prompt on Amazon Bedrock. For each chunk, a JSONLines file containing the question-answer-fact triplets is validated and stored in an S3 bucket at the output prefix.

The following figure shows a view of the data structure and lineage from document paragraphs to the final ground truth chunk across the chunking and generation map states. The numbering between the two figures indicates the data structure present at each point in the pipeline. Finally, the JSONLines files are aggregated in an Amazon SageMaker Processing Job, including the assignment of a random sample for human review based on user input.

a diagram showing a document broken into chunks further broken into ground truth triplets derived from each chunk

The last step of the pipeline is the aggregation step using a SageMaker Processing job. The aggregation step consists of concatenating the JSONLines records generated by every child execution of the generation map into a single ground truth output file. A randomly selected percentage of the records in the output file are sampled and flagged for review as part of a human-in-the-loop process.
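A simplified version of this aggregation and sampling logic could look like the following sketch; the file layout and field names are assumptions, and the SageMaker Processing job wiring is omitted.

import json
import random
from pathlib import Path

def aggregate_and_flag(jsonl_dir, output_path, review_percentage):
    """Concatenate per-chunk JSONLines files and flag a random sample of records for human review."""
    records = []
    for path in Path(jsonl_dir).glob("*.jsonl"):
        with open(path) as f:
            records.extend(json.loads(line) for line in f if line.strip())

    num_to_review = round(len(records) * review_percentage / 100)
    review_indices = set(random.sample(range(len(records)), k=num_to_review))

    with open(output_path, "w") as f:
        for i, record in enumerate(records):
            record["flagged_for_review"] = i in review_indices
            f.write(json.dumps(record) + "\n")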

Judging ground truth for FMEval question-answering evaluation

In this section, we discuss two key components of evaluating ground truth quality: human-in-the-loop review and applying an LLM as a judge. Measuring ground truth quality is an essential component of the evaluation lifecycle.

Human-in-the-loop

The level of ground truth human review required is determined by the risk of having incorrect ground truth and its negative implications. Ground truth review by use case SMEs can verify whether critical business logic is appropriately represented by the ground truth. The process of ground truth review by humans is called human-in-the-loop (HITL), and an example of the HITL process is shown in the following figure.

The steps of HITL are:

  1. Classify risk: Performing a risk analysis will establish the severity and likelihood of negative events occurring as a result of incorrect ground truth used for evaluation of a generative AI use case. Based on the outcome of the analysis, assign the ground truth dataset a risk level: Low, Medium, High, or Critical. The table below outlines the relationship between event severity, likelihood, and risk level. See Learn how to assess the risk of AI systems for a deep dive on performing AI risk assessment.
  2. Human review: Based on the assigned risk level, use-case expert reviewers examine a proportional amount of the use-case ground truth. Organizations can set acceptability thresholds for percentage of HITL intervention based on their tolerance for risk. Similarly, if a ground truth dataset is promoted from a low risk to a medium risk use case, an increased level of HITL intervention will be necessary.
  3. Identify findings: Reviewers can identify any hallucinations relative to source data, challenges with information veracity according to their expertise, or other criteria set by the organization. In this post, we focus on hallucination detection and information veracity.
  4. Action results: Reviewers can take business actions based on their judgement, such as updating and deleting records, or re-writing applicable source documents. Bringing in LLMOps SMEs to apply dataset curation best practices can also be an outcome.

Four part diagram showing risk classification, human review, identifying findings, and actioning results

Putting the risk table from Learn how to assess the risk of AI systems into action, the severity and likelihood of risks for a ground truth dataset validating a production chatbot with frequent customer use would be greater than an internal evaluation dataset used by developers to advance a prototype.

Severity \ Likelihood | Rare | Unlikely | Possible | Likely | Frequent
Extreme | Low | Medium | High | Critical | Critical
Major | Very low | Low | Medium | High | Critical
Moderate | Very low | Low | Medium | Medium | High
Low | Very low | Very low | Low | Low | Medium
Very Low | Very low | Very low | Very low | Very low | Low

Next, we walk through the step-by-step process of conducting a human review for hallucination detection and information veracity. Human review is performed by comparing the source data chunk that was input to the LLM prompt against the generated question-answer-fact triplets, as shown in the following example.

Source data chunk:

Dear Shareholders:

Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B.

Ground truth triplets:

{“question”: “What was Amazon’s total revenue growth in 2023?”, “ground_truth_answer”: “Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023.”, “fact”: “12%<OR>$514B to $575B”}

{“question”: “How much did North America revenue increase in 2023?”, “ground_truth_answer”: “North America revenue increased 12% year-over-year from $316B to $353B.”, “fact”: “12%<OR>$316B to $353B”}

{“question”: “What was the growth in International revenue for Amazon in 2023?”, “ground_truth_answer”: “International revenue grew 11% year-over-year from $118B to $131B.”, “fact”: “11%<OR>$118B to $131B”}

Human reviewers then identify and take action based on findings to correct the system. LLM hallucination is the phenomenon where LLMs generate plausible-sounding but factually incorrect or nonsensical information, presented confidently as factual. Organizations can introduce additional qualities for evaluating and scoring ground truth, as suited to the risk level and use case requirements.

In hallucination detection, reviewers seek to identify text that has been incorrectly generated by the LLM. An example of hallucination and its remediation is shown in the following example. A reviewer would notice in the source data that Amazon’s total revenue grew 12% year over year, yet the ground truth answer hallucinated a 15% figure. In remediation, the reviewer can change this back to 12%.

Source data chunk:

In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B.

Example hallucination:

{“question”: “What was Amazon’s total revenue growth in 2023?”, “ground_truth_answer”: “Amazon’s total revenue grew 15% year-over-year from $514B to $575B in 2023.”, “fact”: “12%<OR>$514B to $575B”}

Example hallucination remediation:

{“question”: “What was Amazon’s total revenue growth in 2023?”, “ground_truth_answer”: “Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023.”, “fact”: “12%<OR>$514B to $575B”}
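To help reviewers triage at scale, a lightweight programmatic pre-screen can flag facts that never appear in the source chunk. The following is a rough heuristic sketch: exact string matching misses paraphrases and unit variations, so it complements rather than replaces human review.

def fact_grounded_in_source(fact, source_chunk):
    """Return True if at least one <OR>-delimited form of the fact appears verbatim in the source chunk."""
    alternatives = [alt.strip().lower() for alt in fact.split("<OR>")]
    return any(alt in source_chunk.lower() for alt in alternatives)

# Example: a fact of "12%<OR>$514B to $575B" passes if either "12%" or "$514B to $575B"
# appears in the chunk; facts with no match are routed to a human reviewer first.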

In SME review for veracity, reviewers seek to validate whether the ground truth is in fact truthful. For example, the source data used for the ground truth generation prompt might be out of date or incorrect. The following example shows the perspective of an HITL review by a domain SME.

Source data chunk:

Effective June 1st, 2023, AnyCompany is pleased to announce the implementation of “Casual Friday” as part of our updated dress code policy. On Fridays, employees are permitted to wear business casual attire, including neat jeans, polo shirts, and comfortable closed-toe shoes.

Example SME review:

“As an HR Specialist, this looks incorrect to me. We did not implement the Casual Friday policy after all at AnyCompany – the source data for this ground truth must be out of date.”

Example remediation actions:

  • Delete the incorrect ground truth record
  • Update the source data document
  • Other use case-specific actions

Traditional machine learning applications can also inform the HITL process design. For examples of HITL for traditional machine learning, see Human-in-the-loop review of model explanations with Amazon SageMaker Clarify and Amazon A2I. 

LLM-as-a-judge

When scaling HITL, LLM reviewers can perform hallucination detection and remediation. This idea is known as self-reflective RAG, and can be used to decrease—but not eliminate—the level of human effort in the process for hallucination detection. As a means of scaling LLM-as-a-judge review, Amazon Bedrock now offers the ability to use LLM reviewers and to perform automated reasoning checks with Amazon Bedrock Guardrails for mathematically sound self-validation against predefined policies. For more information about implementation, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock and Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview).

The following figure shows an example high-level diagram of a self-reflective RAG pattern. A generative AI application based on RAG yields responses fed to a judge application. The judge application reflects on whether responses are incomplete, hallucinated, or irrelevant. Based on the judgement, data is routed along the corresponding remediation.

A diagram showing a generation chain followed by a judge chain which intelligently routes requests back if required for re-ranking
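The following sketch shows what a minimal judge step could look like; the labels, prompt wording, and model ID are illustrative assumptions, and production implementations would typically rely on the managed LLM-as-a-judge and Guardrails capabilities referenced above.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are a strict evaluator. Given a question, the retrieved context, and a draft answer,
respond with exactly one word: COMPLETE, INCOMPLETE, HALLUCINATED, or IRRELEVANT.

Question: {question}
Context: {context}
Draft answer: {answer}"""

def judge_response(question, context, answer, model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Ask an LLM judge to label a draft answer so the application can route it for remediation."""
    response = bedrock_runtime.converse(
        modelId=model_id,  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(question=question, context=context, answer=answer)}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 10},
    )
    return response["output"]["message"]["content"][0]["text"].strip().upper()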

The golden rule in implementing HITL or LLM-as-a-judge as part of ground truth generation is to make sure the organization’s review process aligns with the accepted risk level for the ground truth dataset.

Conclusion

In this post, we provided guidance on generating and reviewing ground truth for evaluating question-answering applications using FMEval. We explored best practices for applying LLMs to scale ground truth generation while maintaining quality and accuracy. The serverless batch pipeline architecture we presented offers a scalable solution for automating this process across large enterprise knowledge bases. We provide a ground truth generation prompt that you can use to get started with evaluating knowledge assistants using the FMEval Factual Knowledge and QA Accuracy evaluation metrics.

By following these guidelines, organizations can follow responsible AI best practices for creating high-quality ground truth datasets for deterministic evaluation of question-answering assistants. Use case-specific evaluations supported by well-curated ground truth play a crucial role in developing and deploying AI solutions that meet the highest standards of quality and responsibility.

Whether you’re developing an internal tool or a customer-facing virtual assistant, or exploring the potential of generative AI for your organization, we encourage you to adopt these best practices. Start implementing robust ground truth generation and review processes for your generative AI question-answering evaluations today with FMEval.


About the authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Philippe Duplessis-Guindon is a cloud consultant at AWS, where he has worked on a wide range of generative AI projects. He has touched on most aspects of these projects, from infrastructure and DevOps to software development and AI/ML. After earning his bachelor’s degree in software engineering and a master’s in computer vision and machine learning from Polytechnique Montreal, Philippe joined AWS to put his expertise to work for customers. When he’s not at work, you’re likely to find Philippe outdoors—either rock climbing or going for a run.

Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He specializes in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Accelerate AWS Well-Architected reviews with Generative AI

Building cloud infrastructure based on proven best practices promotes security, reliability and cost efficiency. To achieve these goals, the AWS Well-Architected Framework provides comprehensive guidance for building and improving cloud architectures. As systems scale, conducting thorough AWS Well-Architected Framework Reviews (WAFRs) becomes even more crucial, offering deeper insights and strategic value to help organizations optimize their growing cloud environments.

In this post, we explore a generative AI solution leveraging Amazon Bedrock to streamline the WAFR process. We demonstrate how to harness the power of LLMs to build an intelligent, scalable system that analyzes architecture documents and generates insightful recommendations based on AWS Well-Architected best practices. This solution automates portions of the WAFR report creation, helping solutions architects improve the efficiency and thoroughness of architectural assessments while supporting their decision-making process.

Scaling Well-Architected reviews using a generative AI-powered solution

As organizations expand their cloud footprint, they face several challenges in adhering to the Well-Architected Framework:

  • Time-consuming and resource-intensive manual reviews
  • Inconsistent application of Well-Architected principles across different teams
  • Difficulty in keeping pace with the latest best practices
  • Challenges in scaling reviews for large or numerous architectures

To address these challenges, we have built a WAFR Accelerator solution that uses generative AI to help streamline and expedite the WAFR process. By automating the initial assessment and documentation process, this solution significantly reduces time spent on evaluations while providing consistent architecture assessments against AWS Well-Architected principles. This allows teams to focus more on implementing improvements and optimizing AWS infrastructure. The solution incorporates the following key features:

  • Using a Retrieval Augmented Generation (RAG) architecture, the system generates a context-aware detailed assessment. The assessment includes a solution summary, an evaluation against Well-Architected pillars, an analysis of adherence to best practices, actionable improvement recommendations, and a risk assessment.
  • An interactive chat interface allows deeper exploration of both the original document and generated content.
  • Integration with the AWS Well-Architected Tool pre-populates workload information and initial assessment responses.

This solution offers the following key benefits:

  • Rapid analysis and resource optimization – What previously took days of manual review can now be accomplished in minutes, allowing for faster iteration and improvement of architectures. This time efficiency translates to significant cost savings and optimized resource allocation in the review process.
  • Consistency and enhanced accuracy – The approach provides a consistent application of AWS Well-Architected principles across reviews, reducing human bias and oversight. This systematic approach leads to more reliable and standardized evaluations.
  • Depth of insight – Advanced analysis can identify subtle patterns and potential issues that might be missed in manual reviews, providing deeper insights into architectural strengths and weaknesses.
  • Scalability – The solution can handle multiple reviews simultaneously, making it suitable for organizations of all sizes, from startups to enterprises. This scalability allows for more frequent and comprehensive reviews.
  • Interactive exploration – The generative AI-driven chat interface allows users to dive deeper into the assessment, asking follow-up questions and gaining a better understanding of the recommendations. This interactivity enhances engagement and promotes more thorough comprehension of the results.

Solution overview

The WAFR Accelerator is designed to streamline and enhance the architecture review process by using the capabilities of generative AI through Amazon Bedrock and other AWS services. This solution automates the analysis of complex architecture documents, evaluating them against the AWS Well-Architected Framework’s pillars and providing detailed assessments and recommendations.

The solution consists of the following capabilities:

  • Generative AI-powered analysis – Uses Amazon Bedrock to rapidly analyze architecture documents against AWS Well-Architected best practices, generating detailed assessments and recommendations.
  • Knowledge base integration – Incorporates up-to-date WAFR documentation and cloud best practices using Amazon Bedrock Knowledge Bases, providing accurate and context-aware evaluations.
  • Customizable – Uses prompt engineering, which enables customization and iterative refinement of the prompts used to drive the large language model (LLM), allowing for refining and continuous enhancement of the assessment process.
  • Integration with the AWS Well-Architected Tool – Creates a Well-Architected workload milestone for the assessment and prepopulates answers for WAFR questions based on generative AI-based assessment.
  • Generative AI-assisted chat – Offers an AI-driven chat interface for in-depth exploration of assessment results, supporting multi-turn conversations with context management.
  • Scalable architecture – Uses AWS services like AWS Lambda and Amazon Simple Queue Service (Amazon SQS) for efficient processing of multiple reviews.
  • Data privacy and network security – With Amazon Bedrock, you are in control of your data, and all your inputs and customizations remain private to your AWS account. Your data, such as prompts, completions, custom models, and data used for fine-tuning or continued pre-training, is not used for service improvement and is never shared with third-party model providers. Your data remains in the AWS Region where the API call is processed. All data is encrypted in transit and at rest. You can use AWS PrivateLink to create a private connection between your VPC and Amazon Bedrock.

A human-in-the-loop review is still crucial to validate the generative AI findings, checking for accuracy and alignment with organizational requirements.

The following diagram illustrates the solution’s technical architecture.

solution-architecture

The workflow consists of the following steps:

  1. WAFR guidance documents are uploaded to a bucket in Amazon Simple Storage Service (Amazon S3). These documents form the foundation of the RAG architecture. Using Amazon Bedrock Knowledge Bases, the sample solution ingests these documents and generates embeddings, which are then stored and indexed in Amazon OpenSearch Serverless. This creates a vector database that enables retrieval of relevant WAFR guidance during the review process.
  2. Users access the WAFR Accelerator Streamlit application through Amazon CloudFront, which provides secure and scalable content delivery. User authentication is handled by Amazon Cognito, making sure only authenticated users have access.
  3. Users upload their solution architecture document in PDF format using the Streamlit application running on an Amazon Elastic Compute Cloud (Amazon EC2) instance that stores it in an S3 bucket. On submission, the WAFR review process is invoked by Amazon SQS, which queues the review request.
  4. The WAFR reviewer, based on Lambda and AWS Step Functions, is activated by Amazon SQS. It orchestrates the review process, including document content extraction, prompt generation, solution summary, knowledge embedding retrieval, and generation.
  5. Amazon Textract extracts the content from the uploaded documents, making it machine-readable for further processing.
  6. The WAFR reviewer uses Amazon Bedrock Knowledge Bases’ fully managed RAG workflow to query the vector database in OpenSearch Serverless, retrieving relevant WAFR guidance based on the selected WAFR pillar and questions. Metadata filtering is used to improve retrieval accuracy.
  7. Using the extracted document content and retrieved embeddings, the WAFR reviewer generates an assessment using Amazon Bedrock (a simplified retrieval and generation call is sketched after this list). A workload is created in the AWS Well-Architected Tool with answers populated from the assessment results. This allows users to download an initial version of the AWS Well-Architected report from the AWS Well-Architected Tool console on completion of the assessment.
  8. The assessment is also stored in an Amazon DynamoDB table for quick retrieval and future reference.
  9. The WAFR Accelerator application retrieves the review status from the DynamoDB table to keep the user informed.
  10. Users can chat with the content using Amazon Bedrock, allowing for deeper exploration of the document, assessment, and recommendations.
  11. Once the assessment is complete, human reviewers can review it in the AWS Well-Architected Tool.
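As a rough illustration of steps 6 and 7, the sketch below uses the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API to ground an assessment for a single Well-Architected question in pillar-specific guidance. The knowledge base ID, model ARN, and the pillar metadata key are placeholders, not the solution's actual configuration.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def assess_question(question, document_text, pillar, knowledge_base_id, model_arn):
    """Retrieve pillar-specific WAFR guidance and generate an assessment grounded in it."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": f"Architecture document:\n{document_text}\n\nWell-Architected question: {question}"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        # Metadata filtering: "pillar" is an assumed metadata key on the ingested guidance
                        "filter": {"equals": {"key": "pillar", "value": pillar}},
                    }
                },
            },
        },
    )
    return response["output"]["text"]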

Deploy the solution

To implement the solution in your own environment, we’ve provided resources in the following GitHub repo to guide you through the process. The setup is streamlined using the AWS Cloud Development Kit (AWS CDK), which allows for infrastructure as code (IaC) deployment. For step-by-step instructions, we’ve prepared a detailed README file that walks you through the entire setup process.

To get started, complete the following steps:

  1. Clone the provided repository containing the AWS CDK code and README file.
  2. Review the README file for prerequisites and environment setup instructions.
  3. Follow the AWS CDK deployment steps outlined in the documentation.
  4. Configure necessary environment-specific parameters as described.

Deploying and running this solution in your AWS environment will incur costs for the AWS services used, including but not limited to Amazon Bedrock, Amazon EC2, Amazon S3, and DynamoDB. It is highly recommended that you use a separate AWS account and set up an AWS Budget to monitor the costs.

DISCLAIMER: This is sample code for non-production usage. You should work with your security and legal teams to adhere to your organizational security, regulatory, and compliance requirements before deployment.

Test the solution

The following diagram illustrates the workflow for using the application.

workflow

To demonstrate how generative AI can accelerate AWS Well-Architected reviews, we have developed a Streamlit-based demo web application that serves as the front-end interface for initiating and managing the WAFR review process.

Complete the following steps to test the demo application:

  1. Open a new browser window and enter the CloudFront URL provided during the setup.
  2. Add a new user to the Amazon Cognito user pool deployed by the AWS CDK during the setup. Log in to the application using this user’s credentials.
  3. Choose New WAFR Review in the navigation pane.
  4. For Analysis type, choose one of the following options:
    • Quick – You can generate a quick analysis without creating a workload in the AWS Well-Architected Tool. This option is faster because it groups the questions for an individual pillar into a single prompt. It’s suitable for an initial assessment.
    • Deep with Well-Architected Tool – You can generate a comprehensive and detailed analysis that automatically creates a workload in the AWS Well-Architected Tool. This thorough review process requires more time to complete as it evaluates each question individually rather than grouping them together. The deep review typically takes approximately 20 minutes, though the actual duration may vary depending on the document size and the number of Well-Architected pillars selected for evaluation.
  5. Enter the analysis name and description.
  6. Choose the AWS Well-Architected lens and desired pillars.
  7. Upload your solution architecture or technical design document.
  8. Choose Create WAFR Analysis.
  9. Choose Existing WAFR Reviews in the navigation pane.
  10. Choose your newly submitted analysis.

After the status changes to Completed, you can view the WAFR analysis at the bottom of the page. For multiple reviews, choose the relevant analysis from the dropdown menu.

You can chat with the uploaded document as well as the other generated content by using the WAFR Chat section on the Existing WAFR Reviews page.

Improving assessment quality

The solution uses prompt engineering to optimize the textual input to the foundation model (FM) to obtain the desired assessment responses. The quality of the prompt (the system prompt, in this case) has a significant impact on the model output. The solution provides a sample system prompt that is used to drive the assessment. You could enhance this prompt further to align with specific organizational needs. This becomes more crucial when defining and ingesting your own custom lenses.

Another important factor is the quality of the document that is uploaded for assessment. Detailed and architecture-rich documents can result in better inferences and therefore finer assessments. Prompts are defined in such a way that if there is inadequate information for assessment, then it’s highlighted in the output. This minimizes hallucination by the FM and provides a potential opportunity to enrich your design templates in alignment with AWS Well-Architected content.

You could further enhance this solution by using Amazon Bedrock Guardrails to further reduce hallucinations and ground responses in your own source information.
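For example, a guardrail created in your account can be attached to a Converse call roughly as follows; the guardrail identifier, version, model ID, and prompt text are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Assess this architecture against the reliability pillar ..."}]}],
    guardrailConfig={
        "guardrailIdentifier": "YOUR_GUARDRAIL_ID",  # placeholder guardrail ID and version
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])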

At the time of writing, only the AWS Well-Architected Framework, Financial Services Industry, and Analytics lenses have been provisioned. However, other lenses, including custom lenses, could be added with a few refinements to the UI application and underlying data store.

Clean up

After you’ve finished exploring or using the solution and no longer require these resources, be sure to clean them up to avoid ongoing charges. Follow these steps to remove all associated resources:

  1. Navigate to the directory containing your AWS CDK code.
  2. Run the following command: cdk destroy.
  3. Confirm the deletion when prompted.
  4. Manually check for and delete any resources that might not have been automatically removed, such as S3 buckets with content or custom IAM roles.
  5. Verify that all related resources have been successfully deleted.

Conclusion

In this post, we showed how generative AI and Amazon Bedrock can play a crucial role in expediting and scaling the AWS Well-Architected Framework reviews within an organization. By automating document analysis and using a WAFR-aware knowledge base, the solution offers rapid and in-depth assessments, helping organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads.

To learn more, refer to the following:


About the Authors

Shoeb Bustani is a Senior Enterprise Solutions Architect at AWS, based in the United Kingdom. As a senior enterprise architect, innovator, and public speaker, he provides strategic architectural partnership and guidance to help customers achieve their business outcomes by leveraging AWS services and best practices.

Brijesh Pati is an Enterprise Solutions Architect at AWS, helping enterprise customers adopt cloud technologies. With a background in application development and enterprise architecture, he has worked with customers across sports, finance, energy, and professional services sectors. Brijesh specializes in AI/ML solutions and has experience with serverless architectures.

Rohan Ghosh is an Enterprise Solutions Architect at Amazon Web Services (AWS), specializing in the Advertising and Marketing sector. With extensive experience in Cloud Solutions Engineering, Application Development, and Enterprise Support, he helps organizations architect and implement cutting-edge cloud solutions. His current focus areas include Data Analytics and Generative AI, where he guides customers in leveraging AWS technologies to drive innovation and business transformation.

Dynamic metadata filtering for Amazon Bedrock Knowledge Bases with LangChain

Amazon Bedrock Knowledge Bases offers a fully managed Retrieval Augmented Generation (RAG) feature that connects large language models (LLMs) to internal data sources. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts. It also provides developers with greater control over the LLM’s outputs, including the ability to include citations and manage sensitive information.

Amazon Bedrock Knowledge Bases has a metadata filtering capability that allows you to refine search results based on specific attributes of the documents, improving retrieval accuracy and the relevance of responses. These metadata filters can be used in combination with the typical semantic (or hybrid) similarity search. Improving document retrieval results helps personalize the responses generated for each user. Dynamic metadata filters allow you to instantly create custom queries based on varying user profiles or user-inputted responses so the documents retrieved only contain information relevant to the user’s needs.

In this post, we discuss using metadata filters with Amazon Bedrock Knowledge Bases.

Solution overview

The following code is an example metadata filter for Amazon Bedrock Knowledge Bases. Logical operators (such as AND or OR) can be nested to combine other logical operators and filter conditions. For more information, refer to the Retrieve API.

{
    "andAll": [
        {
            "equals": {
                "key": "desired_destination",
                "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
            }
        },
        {
            "equals": {
                "key": "travelling_with_children",
                "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
            }
        }
    ]
}

For our use case, we use an example of a travel website where the user answers a few questions about their travel preferences (including desired destination, preferred activities, and traveling companions) and then the system retrieves relevant documents.

We exclusively focus on the retrieval portion of RAG in this post. We provide the upstream components, including document ingestion and query formatting, as static data instead of code. The downstream generation component is out of scope for this post.

Prerequisites

To follow along with this post, you should understand basic retrieval techniques such as similarity search.

Additionally, you need an Amazon Bedrock knowledge base populated with documents and metadata. For instructions, see Create an Amazon Bedrock knowledge base. We have provided example documents and metadata in the accompanying GitHub repo for you to upload.

The associated notebook contains the required library imports and environment variables. Make sure you run the notebook using an AWS Identity and Access Management (IAM) role with the correct permissions for Amazon Simple Storage Service (Amazon S3) and Amazon Bedrock (AmazonS3FullAccess and AmazonBedrockFullAccess, respectively). We recommend running the notebook locally or in Amazon SageMaker. Then you can run the following code to test your AWS and knowledge base connection:

# Test AWS connection
# Create a session using your AWS credentials
session = boto3.Session()

# Create an STS client
sts_client = session.client('sts')

# Get the caller identity
response = sts_client.get_caller_identity()

# Print the response
print(response)

knowledge_base_id = 'XXXXXXXXXX'

retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 4,
        "overrideSearchType": "HYBRID"
    }
}

# Test bedrock knowledge bases connection
client = boto3.client('bedrock-agent-runtime')

response = client.retrieve(
    knowledgeBaseId=knowledge_base_id,
    retrievalConfiguration=retrieval_config,
    retrievalQuery={"text": "Hello world"}
)

print(response)

Create a dynamic filter

The "value" field within the filter needs to be updated at request time. This means overwriting the retrieval_config object, as shown in the following figure. The placeholder values in the filter get overwritten with the user data at runtime.

Visual of how the placeholder value of keys is updated with the actual values in the user data

Because the retrieval_config object is a nested hierarchy of logical conditions (a tree), you can recursively traverse it to identify and replace all the "value" field values (where "value" is the key and "<UNKNOWN>" is the placeholder value) with the corresponding value from the user data. See the following code:

def setup_retrieval_config(inputs):

    # Make a copy because the filter is updated dynamically based on the user_data, this allows you to start from the default each time
    local_retrieval_config = copy.deepcopy(retrieval_config)

    updated_vector_search_config = replace_values(local_retrieval_config["vectorSearchConfiguration"], inputs["user_data"])
    local_retrieval_config["vectorSearchConfiguration"] = updated_vector_search_config

    return local_retrieval_config

def replace_values(vector_search_config: Dict, user_data: Dict):
    # Replace the value fields in the filter with the correct value according to the user_data
    # Recursively traverses the filter tree to find all of the value fields

    # Filter is not a required key, if you do not want any filters get rid of the key
    if "filter" in vector_search_config and not vector_search_config["filter"]:
        del vector_search_config["filter"]

    # Recursively traverse from the root
    elif 'filter' in vector_search_config:
        vector_search_config['filter'] = replace_values(vector_search_config['filter'], user_data)

    # At a node that is not the root
    else:
        for key, value in vector_search_config.items():
            if isinstance(value, dict):

                # At a leaf e.g. {"key": "age", "value": ""}}
                if 'key' in value and 'value' in value:

                    # Only overwrite value['value'] that are not unknown
                    if value['key'] in user_data and not (value["value"] == "unknown" or value["value"] == ["unknown"]):

                        # Primitive data type
                        if type(value["value"]) in [str, int, float, bool]:
                            value['value'] = user_data[value['key']]

                        # List data type
                        elif isinstance(value["value"], list):
                            value['value'] = [user_data[value['key']]]
                        else:
                            raise ValueError(f"Unsupported value['value'] type {type(value['value'])}")
                else:
                    vector_search_config[key] = replace_values(value, user_data)

            # Recurse on each item in the list
            elif isinstance(value, list):
                vector_search_config[key] = [replace_values(item, user_data) for item in value]
            else:
                raise ValueError(f"Unsupported value type {type(value)}")

    return vector_search_config

Option 1: Create a retriever each time

To define the retrieval_config parameter dynamically, you can instantiate AmazonKnowledgeBasesRetriever each time. This integrates into a larger LangChain-centric code base. See the following code:

def create_retrieval_chain() -> Runnable:
        """
        Creates a retrieval chain for the retriever.

        Returns:
            Runnable: The retrieval chain.
        """

        query = create_query_for_retrieval()

        def create_retriever(inputs):
            # This wrapper is necessary because if you return a callable object LangChain will automatically call it immediately, which is not the desired behavior
            # instead we want to call the retriever on the next step of the chain
            retriever_wrapper = {"retriever": AmazonKnowledgeBasesRetriever(knowledge_base_id=knowledge_base_id, retrieval_config=inputs["retrieval_config"])}
            return retriever_wrapper

        # Retrieval chain has three steps: (1) create the filter based off of the user data, (2) create the retriever, and (3) invoke the retriever
        retrieval_chain = (
            {
                "user_data" : itemgetter("user_data"),
                "retrieval_config" : lambda inputs: setup_retrieval_config(inputs)
            } |
            {
                "query" : query,
                "retriever" : lambda inputs: create_retriever(inputs)
            } |
            RunnableLambda(lambda inputs: inputs["retriever"]["retriever"].invoke(inputs["query"]))
        )
        return retrieval_chain

Option 2: Access the underlying Boto3 API

The Boto3 API is able to directly retrieve with a dynamic retrieval_config. You can take advantage of this by accessing the object that AmazonKnowledgeBasesRetriever wraps. This is slightly faster but is less pythonic because it relies on LangChain implementation details, which may change without notice. This requires additional code to adapt the output to the proper format for a LangChain retriever. See the following code:

retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config=retrieval_config
)

def create_retrieval_chain() -> Runnable:
        """
        Creates a retrieval chain for the retriever.

        Returns:
            Runnable: The retrieval chain.
        """

        query = create_query_for_retrieval()
        
        def retrieve_and_format(inputs):
            results = retriever.client.retrieve(
                retrievalQuery={"text": inputs["query"]}, 
                knowledgeBaseId=knowledge_base_id, 
                retrievalConfiguration=inputs["retrieval_config"]
            )
        
            documents = []
            for result in results["retrievalResults"]:
                metadata = {
                    "location": result["location"],
                    "source_metadata": result["metadata"],
                    "score": result["score"],
                }

                document = Document(
                    page_content=result["content"]["text"],
                    metadata=metadata
                )
                documents.append(document)
            
            return documents

        retrieval_chain = (
            {
                "query" : query,
                "retrieval_config" : lambda inputs: setup_retrieval_config(inputs)
            } |
            RunnableLambda(lambda inputs: retrieve_and_format(inputs))
            # RunnableLambda(lambda inputs: retriever.client.retrieve(retrievalQuery={"text": inputs["query"]}, knowledgeBaseId=knowledge_base_id, retrievalConfiguration=inputs["retrieval_config"]))
        )
        return retrieval_chain

retrieval_chain_2 = create_retrieval_chain()

Results

Begin by reading in the user data. This example data contains user answers to an online questionnaire about travel preferences. The user_data fields must match the metadata fields.

with open("data/user_data.json", "r") as file:
    user_data = json.load(file)

print(json.dumps(user_data[:2], indent=2))

Here is a preview of the user_data.json file from which certain fields will be extracted as values for filters.

    {
        "trip_id": 1,
        "desired_destination": "Bali, Indonesia",
        "stay_duration": 7,
        "age": 35,
        "gender": "male",
        "companion": "solo",
        "travelling_with_children": "no",
        "travelling_with_pets": "no"
    },
    {
        "trip_id": 2,
        "desired_destination": "Paris, France",
        "stay_duration": 5,
        "age": 28,
        "gender": "female",
        "companion": "solo",
        "travelling_with_children": "no",
        "travelling_with_pets": "yes"
    },

Test the code with filters turned on and off. Only use a few filtering criteria because restrictive filters might return zero documents.

filters_to_test: List = [
    {
        "andAll": [
            {
                "equals": {
                    "key": "desired_destination",
                    "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
                }
            },
            {
                "equals": {
                    "key": "travelling_with_children",
                    "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
                }
            }
        ]
    },
    None
]

Finally, run both retrieval chains through both sets of filters for each user:

retrieval_chains = [retrieval_chain_1, retrieval_chain_2]

results = []

for retrieval_chain_id, retrieval_chain in enumerate(retrieval_chains):
    logger.info(retrieval_chain_id)
    # Loop through each filter options
    for filter in filters_to_test:
        retrieval_config["vectorSearchConfiguration"]["filter"] = filter
        # Loop through each user data entry
        for user_entry in user_data:
            inputs = {
                    "user_data": user_entry,
                    "retrieval_config": retrieval_config
                }

            # Run the retrieval chain with the current user entry
            try:
                result = retrieval_chain.invoke(inputs)
                # print(f"Result for user entry {user_entry['trip_id']}: {result}")
                results.append(({'retrieval_chain_id': retrieval_chain_id, 'user': user_entry, 'documents': result}))

            except Exception as e:
                print(f"Error during retrieval for user entry {user_entry['trip_id']}: {e}")

When analyzing the results, you can see that the first half of the documents are identical to the second half. In addition, when metadata filters aren’t used, the documents retrieved are occasionally for the wrong location. For example, trip ID 2 is to Paris, but the retriever pulls documents about London.
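A small helper along these lines (a sketch that assumes pandas and the result dictionaries built in the loop above) can flatten the collected results into a comparison table like the excerpt that follows.

import pandas as pd

rows = []
for result in results:
    for document in result["documents"]:
        rows.append({
            "retrieval_chain_id": result["retrieval_chain_id"],
            "trip_id": result["user"]["trip_id"],
            "desired_destination": result["user"]["desired_destination"],
            # Field names mirror the metadata attached to each retrieved Document
            "retrieved_uri": document.metadata.get("location", {}).get("s3Location", {}).get("uri"),
            "score": document.metadata.get("score"),
        })

comparison_df = pd.DataFrame(rows)
print(comparison_df.head())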

Excerpt of output table for reference:

Retrieval Approach Filter Trip ID Destination Page Content Metadata
Option_0 TRUE 2 Paris, France As a 70-year-old retiree, I recently had the pleasure of visiting Paris for the first time. It was a trip I had been looking forward to for years, and I was not disappointed. Here are some of my favorite attractions and activities that I would recommend to other seniors visiting the city.  First on my list is the Eiffel Tower… {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_6.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.48863396, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_6.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘museums’, ‘palaces’, ‘strolling’, ‘boat tours’, ‘neighborhood tours’], ‘companion’: ‘unknown’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘unknown’, ‘age’: [’71’, ’80’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AiNKlapMBdxcT3sYpRK-d’, ‘desired_destination’: ‘Paris, France’}}
Option_0 TRUE 2 Paris, France As a 35-year-old traveling with my two dogs, I found Paris to be a pet-friendly city with plenty of attractions and activities for pet owners. Here are some of my top recommendations for traveling with pets in Paris:  The Jardin des Tuileries is a beautiful park located between the Louvre Museum and the Place de la Concorde… {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.474106, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘parks’, ‘museums’, ‘river cruises’, ‘neighborhood exploration’], ‘companion’: ‘pets’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’30’, ’31’, ’32’, ’33’, ’34’, ’35’, ’36’, ’37’, ’38’, ’39’, ’40’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3Aj52lapMBuHB13c7-hl-4’, ‘desired_destination’: ‘Paris, France’}}
Option_0 TRUE 2 Paris, France If you are looking for something a little more active, I would suggest visiting the Bois de Boulogne. This large park is located on the western edge of Paris and is a great place to go for a walk or a bike ride with your pet. The park has several lakes and ponds, as well as several gardens and playgrounds… {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.45283788, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘strolling’, ‘picnic’, ‘walk or bike ride’, ‘cafes and restaurants’, ‘art galleries and shops’], ‘companion’: ‘pet’, ‘x-amz-bedrock-kb-data-source-id’: ‘{YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’40’, ’41’, ’42’, ’43’, ’44’, ’45’, ’46’, ’47’, ’48’, ’49’, ’50’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AmtKlapMBdxcT3sYpSK_N’, ‘desired_destination’: ‘Paris, France’}}
Option_0 FALSE 2 Paris, France {   “metadataAttributes”: {     “age”: [       “30”     ],     “desired_destination”: “London, United Kingdom”,     “stay_duration”: “unknown”,     “preferred_month”: [       “unknown”     ],     “activities_interest”: [       “strolling”,       “sightseeing”,       “boating”,       “eating out”     ],     “companion”: “pets”,     “travelling_with_children”: “no”,     “travelling_with_pets”: “yes”   } } {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/London_2.txt.metadata (1).json’}, ‘type’: ‘S3’}, ‘score’: 0.49567315, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/London_2.txt.metadata (1).json’, ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3A5tKlapMBdxcT3sYpYq_r’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}}}
Option_0 FALSE 2 Paris, France As a 35-year-old traveling with my two dogs, I found Paris to be a pet-friendly city with plenty of attractions and activities for pet owners. Here are some of my top recommendations for traveling with pets in Paris:  The Jardin des Tuileries is a beautiful park located between the Louvre Museum and the Place de la Concorde… {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.4741059, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘parks’, ‘museums’, ‘river cruises’, ‘neighborhood exploration’], ‘companion’: ‘pets’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’30’, ’31’, ’32’, ’33’, ’34’, ’35’, ’36’, ’37’, ’38’, ’39’, ’40’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3Aj52lapMBuHB13c7-hl-4’, ‘desired_destination’: ‘Paris, France’}}
Option_0 FALSE 2 Paris, France If you are looking for something a little more active, I would suggest visiting the Bois de Boulogne. This large park is located on the western edge of Paris and is a great place to go for a walk or a bike ride with your pet. The park has several lakes and ponds, as well as several gardens and playgrounds… {‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.45283788, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘strolling’, ‘picnic’, ‘walk or bike ride’, ‘cafes and restaurants’, ‘art galleries and shops’], ‘companion’: ‘pet’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’40’, ’41’, ’42’, ’43’, ’44’, ’45’, ’46’, ’47’, ’48’, ’49’, ’50’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AmtKlapMBdxcT3sYpSK_N’, ‘desired_destination’: ‘Paris, France’}}

Clean up

To avoid incurring additional charges, be sure to delete your knowledge base, the Amazon OpenSearch Serverless vector store, and the underlying S3 bucket.

Conclusion

Enabling dynamic filtering through Knowledge Base’s metadata filtering enhances document retrieval in RAG systems by tailoring outputs to user-specific needs, significantly improving the relevance and accuracy of LLM-generated responses. In the travel website example, filters make sure that retrieved documents closely match user preferences.

This approach can be applied to other use cases, such as customer support, personalized recommendations, and content curation, where context-sensitive information retrieval is essential. Properly configured filters are crucial for maintaining accuracy across different applications, making this feature a powerful tool for refining LLM outputs in diverse scenarios.

Be sure to take advantage of this powerful and flexible solution in your application. For more information on metadata in Amazon Bedrock Knowledge Bases, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy. Also, Amazon Bedrock Knowledge Bases now provides autogenerated query filters.

Security Best Practices

For AWS IAM Policies:

  • Apply least-privilege permissions by being explicit with IAM actions and listing only required permissions rather than using wildcards
  • Use temporary credentials with IAM roles for workloads
  • Avoid using wildcards (*) in the Action element as this grants access to all actions for specific AWS services
  • Remove wildcards from the Resource element and explicitly list the specific resources that IAM entities should access
  • Review AWS managed policies carefully before using them and consider using customer managed policies if AWS managed policies grant more permissions than needed

For more detailed security best practices for AWS IAM, see Security best practices in IAM.

For Amazon S3:

  • Block public access unless explicitly required: use the S3 Block Public Access feature and implement appropriate bucket policies to make sure S3 buckets are not publicly accessible
  • Enable encryption for data at rest (all S3 buckets have default encryption) and enforce encryption for data in transit using HTTPS/TLS
  • Grant only the minimum permissions required using IAM policies and bucket policies, and disable ACLs (Access Control Lists), which are no longer recommended for most modern use cases
  • Enable server access logging, AWS CloudTrail, and use AWS security services like GuardDuty, Macie, and IAM Access Analyzer to monitor and detect potential security issues

For more detailed security best practices for Amazon S3, see Security best practices for Amazon S3.

For Amazon Bedrock:

  • Use IAM roles and policies to control access to Bedrock resources and APIs.
  • Implement VPC endpoints to access Bedrock securely from within your VPC.
  • Encrypt data at rest and in transit when working with Bedrock to protect sensitive information.
  • Monitor Bedrock usage and access patterns using AWS CloudTrail for auditing purposes.

For more information on security in Amazon Bedrock, see Security in Amazon Bedrock.

For Amazon SageMaker:

  • Use IAM roles to control access to SageMaker resources and limit permissions based on job functions.
  • Encrypt SageMaker notebooks, training jobs, and endpoints using AWS KMS keys for data protection.
  • Implement VPC configurations for SageMaker resources to restrict network access and enhance security.
  • Use SageMaker private endpoints to access APIs without traversing the public internet.

About the Authors

Haley Tien is a Deep Learning Architect at AWS Generative AI Innovation Center. She has a Master’s degree in Data Science and assists customers in building generative AI solutions on AWS to optimize their workloads and achieve desired outcomes.

Adam Weinberger is an Applied Scientist II at AWS Generative AI Innovation Center. He has 10 years of experience in data science and machine learning. He holds a Master’s of Information and Data Science from the University of California, Berkeley.

Dan Ford is an Applied Scientist II at AWS Generative AI Innovation Center, where he helps public sector customers build state-of-the-art GenAI solutions.

Read More

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

Increasingly, organizations across industries are turning to generative AI foundation models (FMs) to enhance their applications. To achieve optimal performance for specific use cases, customers are adopting and adapting these FMs to their unique domain requirements. This need for customization has become even more pronounced with the emergence of new models, such as those released by DeepSeek.

However, customizing DeepSeek models effectively while managing computational resources remains a significant challenge. Tuning model architecture requires technical expertise, training and fine-tuning parameters, and managing distributed training infrastructure, among others. This often forces companies to choose between model performance and practical implementation constraints, creating a critical need for more accessible and streamlined model customization solutions.

In this two-part series, we discuss how you can reduce the DeepSeek model customization complexity by using the pre-built fine-tuning workflows (also called “recipes”) for both DeepSeek-R1 model and its distilled variations, released as part of Amazon SageMaker HyperPod recipes.

In this first post, we build a solution architecture for fine-tuning DeepSeek-R1 distilled models and demonstrate the approach with a step-by-step example of customizing the DeepSeek-R1 Distill Qwen 7B model using recipes, achieving an average improvement of about 25% across all the ROUGE scores, with a maximum of 49% on the ROUGE-2 score, using both SageMaker HyperPod and SageMaker training jobs. The second part of the series will focus on fine-tuning the DeepSeek-R1 671b model itself.

At the time of this writing, the DeepSeek-R1 model and its distilled variations for Llama and Qwen were the most recently released recipes. Check out sagemaker-hyperpod-recipes on GitHub for the latest released recipes, including support for fine-tuning the DeepSeek-R1 671b parameter model.

Amazon SageMaker HyperPod recipes

At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. SageMaker HyperPod recipes help data scientists and developers of all skill levels get started training and fine-tuning popular publicly available generative AI models in minutes with state-of-the-art training performance. These recipes include a training stack validated by Amazon Web Services (AWS), which removes the tedious work of experimenting with different model configurations, minimizing the time it takes for iterative evaluation and testing. They automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

Recipes, paired with resilient AWS infrastructure (Amazon SageMaker HyperPod and Amazon SageMaker Model Training), provide a robust training environment for fine-tuning FMs such as DeepSeek-R1 with out-of-the-box customization.

To help customers quickly use DeepSeek’s powerful and cost-efficient models to accelerate generative AI innovation, we released new recipes to fine-tune six DeepSeek models, including DeepSeek-R1 distilled Llama and Qwen models, using supervised fine-tuning (SFT), Quantized Low-Rank Adaptation (QLoRA), and Low-Rank Adaptation (LoRA) techniques. In this post, we introduce these new recipes and walk you through a solution to fine-tune a DeepSeek Qwen 7B model for an advanced medical reasoning use case.

Solution overview

At its core, as depicted in the following diagram, the recipe architecture implements a hierarchical workflow that begins with a recipe specification: a comprehensive configuration defining the training parameters, model architecture, and distributed training strategies. These recipes are processed through the HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or training jobs, which handle resource allocation and scheduling. It’s a familiar NeMo-style launcher with which you can choose a recipe and run it on your infrastructure of choice (SageMaker HyperPod or training).

For example, after choosing your recipe, you can pre-train or fine-tune a model by running python3 main.py recipes=recipe-name. Alternatively, you can use a launcher script, which is a bash script that is preconfigured to run the chosen training or fine-tuning job on your cluster. You can check out main.py (NeMo style launcher) and launcher scripts for DeepSeek on the GitHub repository hosting SageMaker HyperPod recipes.

A key component of this architecture is the HyperPod training adapter for NeMo, which is built on the NVIDIA NeMo framework and the Neuronx Distributed training package. The adapter loads data, creates models, and facilitates efficient data parallelism, model parallelism, and hybrid parallelism strategies, enabling optimal utilization of computational resources across the distributed infrastructure. The architecture’s modular design allows for scalability and flexibility, making it particularly effective for training LLMs that require distributed computing capabilities.

You can run these recipes using SageMaker HyperPod or as SageMaker training jobs. For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more details about these service features, refer to Generative AI foundation model training on Amazon SageMaker.

In the next sections, we go over the solution architecture for these services before presenting a step-by-step implementation example for each.

SageMaker HyperPod

To submit jobs using SageMaker HyperPod, you can use the HyperPod recipes launcher, which provides a straightforward mechanism to run recipes on both Slurm and Kubernetes. After you choose your orchestrator, you can choose your recipe’s launcher and have it run on your HyperPod cluster. The launcher will interface with your cluster using Slurm or Kubernetes native constructs. For this post, we use the HyperPod recipes launcher mechanism to run the training on a Slurm cluster. The following image shows the solution architecture for SageMaker HyperPod.

SageMaker training jobs

The workflow for SageMaker training jobs begins with an API request that interfaces with the SageMaker control plane, which manages the orchestration of training resources. The system uses the training jobs launcher to efficiently run workloads on a managed cluster.

The architecture uses Amazon Elastic Container Registry (Amazon ECR) for container image management. Training jobs are executed across a distributed cluster, with seamless integration to multiple storage solutions, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. All of this runs under the SageMaker managed environment, providing optimal resource utilization and security.

This design simplifies the complexity of distributed training while maintaining the flexibility needed for diverse machine learning (ML) workloads, making it an ideal solution for enterprise AI development. The following image shows the solution architecture for SageMaker training jobs.

Solution walkthrough

For this solution, consider a use case for a healthcare industry startup that aims to create an accurate, medically verified chat assistant application that bridges complex medical information with patient-friendly explanations. By fine-tuning DeepSeek-R1 Distill Qwen 7b using the FreedomIntelligence/medical-o1-reasoning-SFT dataset, you can use its medical reasoning capabilities to produce content that maintains clinical accuracy.

Prerequisites

You need to complete the following prerequisites before you can run the DeepSeek-R1 Distill Qwen 7B model fine-tuning notebook.

  1. Make the following quota increase requests for SageMaker. You need to request a minimum of one p4d.24xlarge instance (with 8 NVIDIA A100 GPUs) and a maximum of two p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

    • P4 instances (p4d.24xlarge) for training job usage: 1–2
    • P4 instances (p4d.24xlarge) for HyperPod clusters (“ml.p4d.24xlarge for cluster usage”): 1–2
  2. If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster following the documentation at Tutorial for getting started with SageMaker HyperPod. Alternatively, you can use the AWS CloudFormation template provided in the AWS Workshop Studio at Amazon SageMaker HyperPod Own Account and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
  3. (Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
  4. Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:
git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git 
cd 18_sagemaker_training_recipes/ft_deepseek_qwen_lora

Next, we run the model_trainer_deepseek_r1_recipe_lora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.

Prepare the dataset

To prepare the dataset, you need to load the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenize and chunk the dataset, and configure the data channels for SageMaker training on Amazon S3. Complete the following steps:

  1. Format the dataset by applying the prompt format for DeepSeek-R1 Distill Qwen 7B:
def generate_prompt(data_point):
    full_prompt = f"""
    Below is an instruction that describes a task, paired with an input that provides further context.
    Write a response that appropriately completes the request.
    Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

    ### Instruction:
    You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
    Please answer the following medical question.

    ### Question:
    {data_point["Question"]}

    ### Response:
    {data_point["Complex_CoT"]}

    """
    return {"prompt": full_prompt.strip()}
  2. Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:
# Load dataset from the hub
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

...

train_dataset = train_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)
  3. Load the DeepSeek-R1 Distill Qwen 7B tokenizer from the Hugging Face Transformers library and generate tokens for the train and validation datasets:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
max_seq_length=1024

# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

...

train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])
  4. Prepare the training and validation datasets for SageMaker training by saving them as arrow files, which is required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded:
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

train_dataset.save_to_disk(train_dataset_s3_path)
test_dataset.save_to_disk(val_dataset_s3_path)

The datasets prepared above will be used in the examples for both SageMaker training jobs and SageMaker HyperPod.
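
The mapping steps above reference two helper functions, generate_and_tokenize_prompt and tokenize, whose bodies are elided. The following is a minimal sketch of what they might look like, based only on the surrounding code (the prompt template from generate_prompt and the tokenizer and max_seq_length defined earlier); it is an illustrative assumption, not the notebook's exact implementation:

def generate_and_tokenize_prompt(data_point):
    # Apply the DeepSeek-R1 prompt template; tokenization happens in the second map step.
    return generate_prompt(data_point)

def tokenize(data_point):
    # Convert the formatted prompt into token IDs, truncating to the configured maximum sequence length.
    return tokenizer(
        data_point["prompt"],
        truncation=True,
        max_length=max_seq_length,
    )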

Option A: Fine-tune using SageMaker training jobs

To fine-tune the model using SageMaker training jobs with recipes, this example uses the ModelTrainer class.

The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances the user experience and supports distributed training, Bring Your Own Container (BYOC), and recipes. For additional information about ModelTrainer, refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer.

To set up the fine-tuning workload, complete the following steps:

  1. Select the instance type and the container image for the training job, and define the checkpoint path where the model will be stored:
instance_type = "ml.p4d.24xlarge"

image_uri = (
    f"658645717510.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)

checkpoint_s3_path = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints"
  2. Create the ModelTrainer function to encapsulate the training setup from a selected recipe:
from sagemaker.modules.configs import CheckpointConfig, Compute, InputData, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

instance_count = 1

# Working override for custom dataset
recipe_overrides = {
    ...
    "trainer": {
        "num_nodes": instance_count,
        ...
    },
    ...
    "use_smp_model": False, # Required for PEFT
    "model": {
        "hf_model_name_or_path": model_id,
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/test",
        },
    },
}

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

model_trainer = ModelTrainer.from_recipe(
    training_image=image_uri,
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora",
    recipe_overrides=recipe_overrides,
    requirements="./requirements.txt",
    compute=compute_configs,
    ...
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}"
    ),
)

You can point to the specific recipe with the training_recipe argument and override the recipe arguments by providing a dictionary as the recipe_overrides argument. In the previous example:

  • num_nodes: Indicates the number of instances that will be used for the fine-tuning execution
  • checkpoint_dir: Location in the container where the job will save model checkpoints

The ModelTrainer class simplifies the experience by encapsulating code and training setup directly from the selected recipe. In this example:

  • training_recipe: hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora defines the fine-tuning setup for the LoRA technique
  3. Set up the input channels for ModelTrainer by creating InputData objects from the S3 bucket paths for the training and validation datasets (a sketch of this step follows the list)
  4. Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)
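
The InputData channels in step 3 can be sketched as follows, reusing the S3 paths constructed during dataset preparation. The channel names here are an assumption chosen to match the /opt/ml/input/data/train and /opt/ml/input/data/test directories referenced in recipe_overrides; the notebook may use different names:

from sagemaker.modules.configs import InputData

# Assumed channel names; SageMaker mounts each channel at /opt/ml/input/data/<channel_name>,
# which is where recipe_overrides points the recipe for train_dir and val_dir.
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,  # S3 path of the tokenized training split
)

test_input = InputData(
    channel_name="test",
    data_source=val_dataset_s3_path,  # S3 path of the tokenized validation split
)

data = [train_input, test_input]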

Option B: Fine-tune using SageMaker HyperPod with Slurm

To fine-tune the model using HyperPod, make sure your cluster is up and ready by following the prerequisites. To access the login or head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at Log in to your cluster in the Amazon SageMaker HyperPod workshop.

Alternatively, you can also use AWS Systems Manager and run a command like the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.

aws ssm start-session --target sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] --region region_name
  1. In the cluster’s login or head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID to access the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.
# create a virtual environment 
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository and set up the environment
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
  2. Create a squash file using Enroot to run the job on the cluster. Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with high performance computing (HPC) environments, making it ideal for running our workflows securely.
# create a squash file using Enroot
REGION=<region>
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  3. After you’ve created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:
...

cluster_type: slurm 
...

instance_type: p4d.24xlarge
...

container: /fsx/<path-to-smdistributed-modelparallel>.sqsh
...
  4. Download the prepared dataset that you uploaded to S3 into the FSx for Lustre volume attached to the cluster. Run the following commands to download the files from Amazon S3:
aws s3 cp s3://{bucket_name}/{input_path}/train /fsx/ubuntu/deepseek/data/train --recursive
aws s3 cp s3://{bucket_name}/{input_path}/test /fsx/ubuntu/deepseek/data/test --recursive
  5. Update the launcher script for fine-tuning the DeepSeek-R1 Distill Qwen 7B model. The launcher scripts serve as convenient wrappers for executing the training script (main.py), which streamlines the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek-R1 Qwen 7B model, you can find the specific script at:
launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh
  6. Before running the script, you need to modify the location of the training and validation files and update the HuggingFace model ID and, optionally, the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you’re using a multi-node cluster):
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="hf_xxxx" # Optional HuggingFace access token

TRAIN_DIR="/fsx/ubuntu/deepseek/data/train" # Location of training dataset 
VAL_DIR="/fsx/ubuntu/deepseek/data/test" # Location of validation dataset

EXP_DIR="/fsx/ubuntu/deepseek/results" # Location to save experiment info including logging, checkpoints, etc

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-deepseek-r1-distilled-qwen-7b-fine-tuning" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"

You can view the recipe for this fine-tuning task at the following path, overriding any additional parameters as needed:

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning.yaml
  7. Submit the job by running the launcher script:
bash launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh

You can monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. After the job is complete, the trained model will also be available in the results folder, as shown in the following code:

cd results
 ls -R
.:
checkpoints  experiment

./checkpoints:
full

./checkpoints/full:
steps_50

./checkpoints/full/steps_50:
config.json  pytorch_model.bin

./experiment:
...
  8. Upload the fine-tuned model checkpoint to Amazon S3 for evaluating the model using the validation data:
aws s3 cp /fsx/<path_to_checkpoint> s3://{bucket_name}/{model_prefix}/qwen7b --recursive

Evaluate the fine-tuned model

To objectively evaluate your fine-tuned model, you can run an evaluation job on the validation portion of the dataset.

You can run a SageMaker training job and use ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text. The SageMaker training job will compute ROUGE metrics for both the base DeepSeek-R1 Distill Qwen 7B model and the fine-tuned one. You can access the code sample for ROUGE evaluation in the sagemaker-distributed-training-workshop repository on GitHub. Refer to this notebook for details.
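
If you want to reproduce this kind of scoring outside of the training job, the same ROUGE metrics can be computed locally with the Hugging Face evaluate library. The following is a minimal, illustrative sketch (the prediction and reference strings are placeholders, not outputs of the actual evaluation script, and the evaluate and rouge_score packages are assumed to be installed):

import evaluate

# Load the ROUGE metric; it reports rouge1, rouge2, rougeL, and rougeLsum.
rouge = evaluate.load("rouge")

predictions = ["The presentation is most consistent with iron deficiency anemia."]  # hypothetical model output
references = ["These findings are most consistent with iron deficiency anemia."]    # hypothetical reference answer

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # dict with keys: rouge1, rouge2, rougeL, rougeLsum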

Complete the following steps:

  1. Define the S3 path where the fine-tuned checkpoints are stored, the instance_type, and the image uri to use in the training job:
trained_model = <S3_PATH>
instance_type = "ml.p4d.24xlarge"

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.4",
    instance_type=instance_type,
    image_scope="training"
)
#763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311
  2. Create the ModelTrainer function to encapsulate the evaluation script and define the input data:
from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="evaluate_recipe.py",
)

# Define the compute
...

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    compute=compute_configs,
    ...
    hyperparameters={
        "model_id": model_id,  # Hugging Face model id
        "dataset_name": dataset_name
    }
)

# Pass the input data
train_input = InputData(
   channel_name="adapterdir",
   data_source=trained_model,
)

test_input = InputData(
   channel_name="testdata",
   data_source=test_dataset_s3_path, # S3 path where the test data is stored
)

# Check input channels configured
data = [train_input, test_input]
  3. Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

The following table shows the task output for the fine-tuned model and the base model.

Model Rouge 1 Rouge 2 Rouge L Rouge L Sum
Base 0.36362 0.08739 0.16345 0.3204
Fine-tuned 0.44232 0.13022 0.17769 0.38989
% Difference 21.64207 49.01703 8.7121 21.68871
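
The % Difference row is the relative improvement of the fine-tuned model over the base model for each metric. It can be reproduced from the table values as follows (small deviations from the table's row are due to rounding of the displayed scores):

# Relative improvement (%) of the fine-tuned model over the base model, per ROUGE metric.
base = {"rouge1": 0.36362, "rouge2": 0.08739, "rougeL": 0.16345, "rougeLsum": 0.3204}
fine_tuned = {"rouge1": 0.44232, "rouge2": 0.13022, "rougeL": 0.17769, "rougeLsum": 0.38989}

for metric in base:
    improvement = (fine_tuned[metric] - base[metric]) / base[metric] * 100
    print(f"{metric}: {improvement:.2f}%")
# rouge1: 21.64%, rouge2: 49.01%, rougeL: 8.71%, rougeLsum: 21.69%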

Our fine-tuned model demonstrates remarkable efficiency, achieving about 22% overall improvement on the reasoning task after only one training epoch. The most significant gain appears in the ROUGE-2 score, which measures bigram overlap, with an increase of about 49%, indicating better alignment between generated and reference summaries.

Notably, preliminary experiments suggest these results could be further enhanced by extending the training duration. Increasing the number of epochs shows promising potential for additional performance gains while maintaining computational efficiency.

Clean up

To clean up your resources to avoid incurring any more charges, follow these steps:

  1. Delete any unused SageMaker Studio resources
  2. (Optional) Delete the SageMaker Studio domain
  3. Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
  4. If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion

In the first post of this two-part DeepSeek-R1 series, we discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs) including DeepSeek. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes.

We recommend starting your LLM customization journey by exploring our sample recipes in the Amazon SageMaker HyperPod documentation. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.

To begin using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes repo on GitHub for comprehensive documentation and example implementations. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.

In our second post, we discuss how these recipes can further be used to fine-tune the DeepSeek-R1 671b model. Stay tuned!


About the Authors

 Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.

 Bruno Pistone is a Senior World Wide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and Machine Learning solutions that take full advantage of the AWS cloud and Amazon Machine Learning stack. His expertise includes: End-to-end Machine Learning, model customization, and generative AI. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

 Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

 Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

 Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.

Read More

Reduce conversational AI response time through inference at the edge with AWS Local Zones

Reduce conversational AI response time through inference at the edge with AWS Local Zones

Recent advances in generative AI have led to the proliferation of a new generation of conversational AI assistants powered by foundation models (FMs). These latency-sensitive applications enable real-time text and voice interactions, responding naturally to human conversations. Their applications span a variety of sectors, including customer service, healthcare, education, personal and business productivity, and many others.

Conversational AI assistants are typically deployed directly on users’ devices, such as smartphones, tablets, or desktop computers, enabling quick, local processing of voice or text input. However, the FM that powers the assistant’s natural language understanding and response generation is usually cloud-hosted, running on powerful GPUs. When a user interacts with the AI assistant, their device first processes the input locally, including speech-to-text (STT) conversion for voice agents, and compiles a prompt. This prompt is then securely transmitted to the cloud-based FM over the network. The FM analyzes the prompt and begins generating an appropriate response, streaming it back to the user’s device. The device further processes this response, including text-to-speech (TTS) conversion for voice agents, before presenting it to the user. This efficient workflow strikes a balance between the powerful capabilities of cloud-based FMs and the convenience and responsiveness of local device interaction, as illustrated in the following figure.

Request flow for a conversational AI assistant

A critical challenge in developing such applications is reducing response latency to enable real-time, natural interactions. Response latency refers to the time between the user finishing their speech and beginning to hear the AI assistant’s response. This delay typically comprises two primary components:

  • On-device processing latency – This encompasses the time required for local processing, including TTS and STT operations.
  • Time to first token (TTFT) – This measures the interval between the device sending a prompt to the cloud and receiving the first token of the response. TTFT consists of two components. First is the network latency, which is the round-trip time for data transmission between the device and the cloud. Second is the first token generation time, which is the period between the FM receiving a complete prompt and generating the first output token. TTFT is crucial for user experience in conversational AI interfaces that use response streaming with FMs. With response streaming, users start receiving the response while it’s still being generated, significantly improving perceived latency.
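
To make this decomposition concrete, the following sketch combines the two latency components into the overall response latency described above. All numbers are hypothetical; actual values depend on the network, the device, and the model:

# Hypothetical example values, in milliseconds.
network_round_trip_ms = 120        # device <-> cloud round trip
first_token_generation_ms = 60     # time for the FM to emit the first output token
ttft_ms = network_round_trip_ms + first_token_generation_ms          # 180 ms

on_device_processing_ms = 150      # STT before sending the prompt plus TTS on the first response chunk
response_latency_ms = on_device_processing_ms + ttft_ms              # 330 ms

print(f"TTFT: {ttft_ms} ms, response latency: {response_latency_ms} ms")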

The ideal response latency for humanlike conversation flow is generally considered to be in the 200–500 milliseconds (ms) range, closely mimicking natural pauses in human conversation. Given the additional on-device processing latency, achieving this target requires a TTFT well below 200 ms.

Although many customers focus on optimizing the technology stack behind the FM inference endpoint through techniques such as model optimization, hardware acceleration, and semantic caching to reduce the TTFT, they often overlook the significant impact of network latency. This latency can vary considerably due to geographic distance between users and cloud services, as well as the diverse quality of internet connectivity.

Hybrid architecture with AWS Local Zones

To minimize the impact of network latency on TTFT for users regardless of their locations, a hybrid architecture can be implemented by extending AWS services from commercial Regions to edge locations closer to end users. This approach involves deploying additional inference endpoints on AWS edge services and using Amazon Route 53 to implement dynamic routing policies, such as geolocation routing, geoproximity routing, or latency-based routing. These strategies dynamically distribute traffic between edge locations and commercial Regions, providing fast response times based on real-time network conditions and user locations.

AWS Local Zones are a type of edge infrastructure deployment that places select AWS services close to large population and industry centers. They enable applications requiring very low latency or local data processing using familiar APIs and tool sets. Each Local Zone is a logical extension of a corresponding parent AWS Region, which means customers can extend their Amazon Virtual Private Cloud (Amazon VPC) connections by creating a new subnet with a Local Zone assignment.

This guide demonstrates how to deploy an open source FM from Hugging Face on Amazon Elastic Compute Cloud (Amazon EC2) instances across three locations: a commercial AWS Region and two AWS Local Zones. Through comparative benchmarking tests, we illustrate how deploying FMs in Local Zones closer to end users can significantly reduce latency—a critical factor for real-time applications such as conversational AI assistants.

Prerequisites

To run this demo, complete the following prerequisites:

Solution walkthrough

This section walks you through the steps to launch an Amazon EC2 G4dn instance and deploy an FM for inference in the Los Angeles Local Zone. The instructions are also applicable for deployments in the parent Region, US West (Oregon), and the Honolulu Local Zone.

We use Meta’s open source Llama 3.2-3B as the FM for this demonstration. This is a lightweight FM from the Llama 3.2 family, classified as a small language model (SLM) due to its small number of parameters. Compared to large language models (LLMs), SLMs are more efficient and cost-effective to train and deploy, excel when fine-tuned for specific tasks, offer faster inference times, and have lower resource requirements. These characteristics make SLMs particularly well-suited for deployment on edge services such as AWS Local Zones.

To launch an EC2 instance in the Los Angeles Local Zone subnet, follow these steps:

  1. On the Amazon EC2 console dashboard, in the Launch instance box, choose Launch instance.
  2. Under Name and tags, enter a descriptive name for the instance (for example, la-local-zone-instance).
  3. Under Application and OS Images (Amazon Machine Image), select an AWS Deep Learning AMI that comes preconfigured with NVIDIA OSS driver and PyTorch. For our deployment, we used Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Amazon Linux 2).
  4. Under Instance type, from the Instance type list, select the hardware configuration for your instance that’s supported in a Local Zone. We selected G4dn.2xlarge for this solution. This instance is equipped with one NVIDIA T4 Tensor Core GPU and 16 GB of GPU memory, which makes it ideal for high performance and cost-effective inference of SLMs on the edge. Available instance types for each Local Zone can be found at AWS Local Zones features. Review the hardware requirements for your FM to select the appropriate instance.
  5. Under Key pair (login), choose an existing key pair or create a new one.
  6. Next to Network settings, choose Edit, and then:
    1. Select your VPC.
    2. Select your Local Zone subnet.
    3. Create a security group or select an existing one. Configure the security group’s inbound rules to allow traffic only from your client’s IP address on port 8080.
  7. You can keep the default selections for the other configuration settings for your instance. To determine the storage types that are supported, refer to the Compute and storage section in AWS Local Zones features.
  8. Review the summary of your instance configuration in the Summary panel and, when you’re ready, choose Launch instance.
  9. A confirmation page lets you know that your instance is launching. Choose View all instances to close the confirmation page and return to the console.

Next, complete the following steps to deploy Llama 3.2-3B using the Hugging Face Text Generation Inference (TGI) as the model server:

  1. Connect to the instance by using Secure Shell (SSH).
  2. Start the Docker service using the following command. Docker comes preinstalled with the AMI we selected.
sudo service docker start
  3. Run the following command to download and run the Docker image for the TGI server as well as the Llama 3.2-3B model. In our deployment, we used Docker image version 2.4.0, but results might vary based on your selected version. The full list of models supported by TGI can be found at Hugging Face Supported Models. For more details about the deployment and optimization of TGI, refer to the text-generation-inference GitHub page.
model=meta-llama/Llama-3.2-3B
volume=$PWD/data
token=<ENTER YOUR HUGGING FACE TOKEN>

sudo docker run -d --gpus all \
    --shm-size 1g \
    -e HF_TOKEN=$token \
    -p 8080:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model
  4. After the TGI container is running, you can test your endpoint by running the following command from your local environment:
curl <REPLACE WITH YOUR EC2 PUBLIC IP>:8080/generate -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":200, "temperature":0.2, "top_p":0.9}}' \
    -H 'Content-Type: application/json'

Performance evaluation

To demonstrate TTFT improvements with FM inference on Local Zones, we followed the steps in the previous section to deploy Llama 3.2 3B in three locations: in the us-west-2-c Availability Zone in the parent Region, US West (Oregon); in the us-west-2-lax-1a Local Zone in Los Angeles; and in the us-west-2-hnl-1a Local Zone in Honolulu. This is illustrated in the following figure. Notice that the architecture provided in this post is meant to be used for performance evaluation in a development environment. Before migrating any of the provided architecture to production, we recommend following the AWS Well-Architected Framework.

We conducted two separate test scenarios to evaluate TTFT as explained in the following:

Los Angeles test scenario:

  • Test user’s location – Los Angeles metropolitan area
  • Test A – 150 requests sent to FM deployed in Los Angeles Local Zone
  • Test B – 150 requests sent to FM deployed in US West (Oregon)

Honolulu test scenario:

  • Test user’s location – Honolulu metropolitan area
  • Test C – 150 requests sent to FM deployed in Honolulu Local Zone
  • Test D – 150 requests sent to FM deployed in US West (Oregon)

Architecture diagram for the deployment of FM inference endpoints

Evaluation setup

To conduct TTFT measurements, we use the load testing capabilities of the open source project LLMPerf. This tool launches multiple requests from the test user’s client to the FM endpoint and measures various performance metrics, including TTFT. Each request contains a random prompt with a mean token count of 250 tokens. Although a single prompt for short-form conversations typically consists of 50 tokens, we set the mean input token size to 250 tokens to account for multi-turn conversation history, system prompts, and contextual information that better represents real-world usage patterns.

Detailed instructions for installing LLMPerf and executing the load testing are available in the project’s documentation. Additionally, because we are using the Hugging Face TGI as the inference server, we follow the corresponding instructions from LLMPerf to perform the load testing. The following is the example command to initiate the load testing from the command line:

export HUGGINGFACE_API_BASE="http://<REPLACE WITH YOUR EC2 PUBLIC IP>:8080"
export HUGGINGFACE_API_KEY=""

python token_benchmark_ray.py \
    --model "huggingface/meta-llama/Llama-3.2-3B" \
    --mean-input-tokens 250 \
    --stddev-input-tokens 50 \
    --mean-output-tokens 100 \
    --stddev-output-tokens 20 \
    --max-num-completed-requests 150 \
    --timeout 600 \
    --num-concurrent-requests 1 \
    --results-dir "result_outputs" \
    --llm-api "litellm" \
    --additional-sampling-params '{}'

Each test scenario compares the TTFT latency between the Local Zone and the parent Region endpoints to assess the impact of geographical distance. Latency results might vary based on several factors, including:

  • Test parameters and configuration
  • Time of day and network traffic
  • Internet service provider
  • Specific client location within the test Region
  • Current server load

Results

The following tables present TTFT measurements in milliseconds (ms) for two distinct test scenarios. The results demonstrate significant TTFT reductions when using a Local Zone compared to the parent Region for both the Los Angeles and the Honolulu test scenarios. The observed differences in TTFT are solely attributed to network latency because identical FM inference configurations were employed in both the Local Zone and the parent Region.

User location: Los Angeles Metropolitan Area
LLM inference endpoint Mean (ms) Min (ms) P25 (ms) P50 (ms) P75 (ms) P95 (ms) P99 (ms) Max (ms)
Parent Region: US West (Oregon) 135 118 125 130 139 165 197 288
Local Zone: Los Angeles 80 50 72 75 86 116 141 232

The user in Los Angeles achieved a mean TTFT of 80 ms when calling the FM endpoint in the Los Angeles Local Zone, compared to 135 ms for the endpoint in the US West (Oregon) Region. This represents a 55 ms (about 41%) reduction in latency.

User location: Honolulu Metropolitan Area
LLM inference endpoint Mean (ms) Min (ms) P25 (ms) P50 (ms) P75 (ms) P95 (ms) P99 (ms) Max (ms)
Parent Region: US West (Oregon) 197 172 180 183 187 243 472 683
Local Zone: Honolulu 114 58 70 85 164 209 273 369

The user in Honolulu achieved a mean TTFT of 114 ms when calling the FM endpoint in the Honolulu Local Zone, compared to 197 ms for the endpoint in the US West (Oregon) Region. This represents an 83 ms (about 42%) reduction in latency.

Moreover, the TTFT reduction achieved by Local Zone deployments is consistent across all metrics in both test scenarios, from minimum to maximum values and throughout all percentiles (P25–P99), indicating a consistent improvement across all requests.

Finally, remember that TTFT is just one component of overall response latency, alongside on-device processing latency. By reducing TTFT using Local Zones, you create additional margin for on-device processing latency, making it easier to achieve the target response latency range needed for humanlike conversation.

Cleanup

In this post, we created Local Zones, subnets, security groups, and EC2 instances. To avoid incurring additional charges, it’s crucial to properly clean up these resources when they’re no longer needed. To do so, follow these steps:

  1. Terminate the EC2 instances and delete their associated Amazon Elastic Block Store (Amazon EBS) volumes.
  2. Delete the security groups and subnets.
  3. Disable the Local Zones.

Conclusion

In conclusion, this post highlights how edge computing services, such as AWS Local Zones, play a crucial role in reducing FM inference latency for conversational AI applications. Our test deployments of Meta’s Llama 3.2-3B demonstrated that placing FM inference endpoints closer to end users through Local Zones dramatically reduces TTFT compared to traditional Regional deployments. This TTFT reduction plays a critical role in optimizing the overall response latency, helping achieve the target response times essential for natural, humanlike interactions regardless of user location.

To use these benefits for your own applications, we encourage you to explore the AWS Local Zones documentation. There, you’ll find information on available locations and supported AWS services so you can bring the power of edge computing to your conversational AI solutions.


About the Authors

Nima Seifi is a Solutions Architect at AWS, based in Southern California, where he specializes in SaaS and LLMOps. He serves as a technical advisor to startups building on AWS. Prior to AWS, he worked as a DevOps architect in the e-commerce industry for over 5 years, following a decade of R&D work in mobile internet technologies. Nima has authored 20+ technical publications and holds 7 U.S. patents. Outside of work, he enjoys reading, watching documentaries, and taking beach walks.

Nelson Ong is a Solutions Architect at Amazon Web Services. He works with early stage startups across industries to accelerate their cloud adoption.

Read More

Pixtral-12B-2409 is now available on Amazon Bedrock Marketplace

Pixtral-12B-2409 is now available on Amazon Bedrock Marketplace

Today, we are excited to announce that Pixtral 12B (pixtral-12b-2409), a state-of-the-art 12 billion parameter vision language model (VLM) from Mistral AI that excels in both text-only and multimodal tasks, is available for customers through Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that enables developers to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with Amazon SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference.

In this post, we walk through how to discover, deploy, and use the Pixtral 12B model for a variety of real-world vision use cases.

Overview of Pixtral 12B

Pixtral 12B, Mistral’s inaugural VLM, delivers robust performance across a range of benchmarks, surpassing other open models and rivaling larger counterparts, according to Mistral’s evaluation. Designed for both image and document comprehension, Pixtral demonstrates advanced capabilities in vision-related tasks, including chart and figure interpretation, document question answering, multimodal reasoning, and instruction following—several of which are illustrated with examples later in this post. The model processes images at their native resolution and aspect ratio, providing high-fidelity input handling. Unlike many open source alternatives, Pixtral 12B achieves strong results in text-based benchmarks—such as instruction following, coding, and mathematical reasoning—without sacrificing its proficiency in multimodal tasks.

Mistral developed a novel architecture for Pixtral 12B, optimized for both computational efficiency and performance. The model consists of two main components: a 400-million-parameter vision encoder, responsible for tokenizing images, and a 12-billion-parameter multimodal transformer decoder, which predicts the next text token based on a sequence of text and images. The vision encoder was specifically trained to natively handle variable image sizes, enabling Pixtral to accurately interpret high-resolution diagrams, charts, and documents while maintaining fast inference speeds for smaller images such as icons, clipart, and equations. This architecture supports processing an arbitrary number of images of varying sizes within a large context window of 128k tokens.

License agreements are a critical decision factor when using open-weights models. Similar to other Mistral models, such as Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and Mistral Nemo 12B, Pixtral 12B is released under the commercially permissive Apache 2.0, providing enterprise and startup customers with a high-performing VLM option to build complex multimodal applications.

Performance metrics and benchmarks

Pixtral 12B is trained to understand both natural images and documents, achieving 52.5% on the Massive Multitask Language Understanding (MMLU) reasoning benchmark, surpassing a number of larger models according to Mistral. The MMLU benchmark is a test that evaluates a language model’s ability to understand and use language across a variety of subjects. The MMLU consists of over 10,000 multiple-choice questions spanning a variety of academic subjects, including mathematics, philosophy, law, and medicine. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process multiple images in its long context window of 128,000 tokens. Unlike previous open source models, Pixtral doesn’t compromise on text benchmark performance to excel in multimodal tasks, according to Mistral.

You can review the benchmarks published by Mistral.

Prerequisites

To try out Pixtral 12B in Amazon Bedrock Marketplace, you will need the following prerequisites:

Deploy Pixtral 12B in Amazon Bedrock Marketplace

On the Amazon Bedrock console, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

To access Pixtral 12B in Amazon Bedrock Marketplace, follow these steps:

  1. On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
  2. Filter for Hugging Face as a provider and choose the Pixtral 12B model, or search for Pixtral in the Filter for a model input box.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Pixtral 12B in your applications.

  3. To begin using Pixtral 12B, choose Deploy.

You will be prompted to configure the deployment details for Pixtral 12B. The model ID will be prepopulated.

  4. Read carefully and accept the End User License Agreement (EULA).
  5. The Endpoint Name is automatically populated. Customers can choose to rename the endpoint.
  6. For Number of instances, enter a number of instances (between 1 and 100).
  7. For Instance type, choose your instance type. For optimal performance with Pixtral 12B, a GPU-based instance type like ml.g6.12xlarge is recommended.

Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.

  8. Choose Deploy to begin using the model.

When the deployment is complete, Endpoint status should change to In Service. After the endpoint is in service, you can test Pixtral 12B capabilities directly in the Amazon Bedrock playground.

  9. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with the Amazon Bedrock APIs, you need to use the endpoint ARN as the modelId in the Amazon Bedrock SDK.
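As a brief illustration of that pattern, the following sketch calls the Converse API with a placeholder endpoint ARN; substitute the ARN from your own Marketplace deployment.

import boto3

# Placeholder ARN -- copy the actual value from your Amazon Bedrock Marketplace deployment
endpoint_arn = "arn:aws:sagemaker:us-west-2:111122223333:endpoint/pixtral-12b-endpoint"

bedrock_runtime = boto3.client("bedrock-runtime")

# Pass the endpoint ARN where a model ID would normally go
response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "Describe what a vision language model can do."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.6, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])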

Pixtral 12B use cases

In this section, we provide example use cases of Pixtral 12B using sample prompts. We have defined helper functions to invoke the Pixtral 12B model using Amazon Bedrock Converse APIs:

import boto3
from PIL import Image

# Amazon Bedrock runtime client used to invoke the deployed endpoint
bedrock_runtime = boto3.client('bedrock-runtime')

def get_image_format(image_path):
    with Image.open(image_path) as img:
        # Normalize the format to a known valid one
        fmt = img.format.lower() if img.format else 'jpeg'
        # Convert 'jpg' to 'jpeg'
        if fmt == 'jpg':
            fmt = 'jpeg'
    return fmt

def call_bedrock_model(model_id=None, prompt="", image_paths=None, system_prompt="", temperature=0.6, top_p=0.9, max_tokens=3000):

    if isinstance(image_paths, str):
        image_paths = [image_paths]
    if image_paths is None:
        image_paths = []

    # Start building the content array for the user message
    content_blocks = []

    # Include a text block if a prompt is provided
    if prompt.strip():
        content_blocks.append({"text": prompt})

    # Add images as raw bytes (the Converse API expects raw bytes, not base64)
    for img_path in image_paths:
        fmt = get_image_format(img_path)
        with open(img_path, 'rb') as f:
            image_raw_bytes = f.read()

        content_blocks.append({
            "image": {
                "format": fmt,
                "source": {
                    "bytes": image_raw_bytes
                }
            }
        })

    # Construct the messages structure
    messages = [
        {
            "role": "user",
            "content": content_blocks
        }
    ]

    # Include a system prompt only if one is provided
    kwargs = {}
    if system_prompt.strip():
        kwargs["system"] = [{"text": system_prompt}]

    # Build the arguments for the `converse` call
    converse_kwargs = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p
        },
        **kwargs
    }

    # Call the Converse API and parse the assistant response
    try:
        response = bedrock_runtime.converse(**converse_kwargs)
        assistant_message = response.get('output', {}).get('message', {})
        assistant_content = assistant_message.get('content', [])
        result_text = "".join(block.get('text', '') for block in assistant_content)
    except Exception as e:
        result_text = f"Error message: {e}"
    return result_text

Visual logical reasoning

One interesting use case for vision models is solving logical reasoning problems and visual puzzles. The Pixtral 12B vision model is highly capable at this kind of reasoning. Let’s explore an example.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are solving logical reasoning problems.'
task = 'Which of these figures differ from the other four?'
image_path = './Pixtral_data/logical_reasoning.jpg'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get following response:

The image shows five figures labeled A, B, C, D, and E. Each figure consists of a head, body, and legs. The task is to identify which figure differs from the others.

**Reasoning:**
- Figure A: The head is connected directly to the body, and the legs are straight.
- Figure B: The head is connected directly to the body, and the legs are straight.
- Figure C: The head is connected directly to the body, and the legs are straight.
- Figure D: The head is connected directly to the body, and the legs are straight.
- Figure E: The head is connected directly to the body, but the legs are bent at the knees.

**Conclusion:** Figure E is the one that differs from the others.

Structured product information

Extracting product information is crucial for the retail industry, especially on sites that host third-party sellers, where product images are the most accessible resource. Accurately capturing relevant details from these images is vital for a product’s success in ecommerce. For instance, using advanced visual models like Pixtral 12B, retailers can efficiently extract key attributes from clothing product images, such as color, style, and patterns. This capability not only streamlines inventory management but also enhances customer experiences by providing essential information that aids in informed purchasing decisions.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='''You are a catalog manager for an ecommerce portal. You have an image of a product. 
Think very carefully in a step-by-step manner to extract product information. 
Always provide response in prescribed JSON format.
'''
image_path = './Pixtral_data/cap.png'
task = '''
Organize product information in JSON format to store in a database. 

Output json schema:

{
"product_name": "",
"description": "",
"category": "",
"sub_category": "",
"color": "",
"size": "",
"brand": "",
"material": "",
"features": [],
"image_alt_text": ""
}
'''

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

Certainly! Based on the image provided, here is the extracted product information in JSON format:

```json
{
  "product_name": "Nike Blue Cap",
  "description": "A stylish blue cap from Nike with a sleek design and adjustable fit.",
  "category": "Apparel",
  "sub_category": "Headwear",
  "color": "Blue",
  "size": "Adjustable",
  "brand": "Nike",
  "material": "Nylon",
  "features": [
    "Adjustable fit",
    "Lightweight",
    "Breathable material",
    "Embroidered Nike logo"
  ],
  "image_alt_text": "A blue Nike cap with a white Nike swoosh logo on the front."
}
```
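Because the model wraps the JSON in a fenced code block and adds conversational text around it, you typically need to extract and parse the payload before writing it to a database. The following is a minimal sketch of one way to do that; the regular expression and fallback slicing are illustrative assumptions rather than part of the original example.

import json
import re

def extract_json(model_response: str) -> dict:
    # Prefer the contents of a ```json fenced block if one is present (illustrative heuristic)
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", model_response, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the outermost braces in the response
        candidate = model_response[model_response.find("{"): model_response.rfind("}") + 1]
    return json.loads(candidate)

product = extract_json(response)
print(product["product_name"], product["color"])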

Vehicle damage assessment

In the insurance industry, image analysis plays a crucial role in claims processing. For vehicle damage assessment, vision models like Pixtral 12B can be used to compare images taken at policy issuance with those submitted during a claim. This approach can streamline the evaluation process, potentially reducing loss adjustment expenses and expediting claim resolution. By automating the identification and characterization of automobile damage, insurers can enhance efficiency, improve accuracy, and ultimately provide a better experience for policyholders.

We use the following input images.

Our prompt and input payload are as follows:

image_paths = [
    "Pixtral_data/car_image_before.png",
    "Pixtral_data/car_image_after.png"
]

system_prompt='''
You are a helpful ai assistant for an insurance agent. 
Insurance agent has received a claim for a vehicle damage. 
'''

task = '''This claim includes two images. 
One of the image was taken before the incident and another was taken after the incident.

Analyse these images and answer below questions:
1. describe if there is any damage to the vehicle
2. should insurance agent accept or reject the claim

'''

print('Input Images:\n\n')
Image.open(image_paths[0]).show()
Image.open(image_paths[1]).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_paths)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

### Analysis of Vehicle Damage

#### Image 1 (Before Incident)
- **Vehicle Condition**: The vehicle appears to be in good condition with no visible damage.
- **Paint and Body**: The paint is uniform, and there are no visible scratches, dents, or other forms of damage.
- **Tires and Wheels**: The tires and wheels look intact and in good condition.

#### Image 2 (After Incident)
- **Vehicle Condition**: The vehicle shows noticeable damage.
- **Paint and Body**: There is a visible dent or scratch on the rear side of the vehicle.
- **Tires and Wheels**: The tires and wheels appear to be intact, but the focus is on the body damage.

### Questions and Answers

1. **Describe if there is any damage to the vehicle:**
   - **Before Incident**: No visible damage.
   - **After Incident**: There is visible damage on the rear side of the vehicle, likely a dent or scratch.

2. **Should the insurance agent accept or reject the claim:**
   - **Recommendation**: The insurance agent should accept the claim. The images clearly show that the vehicle sustained damage after the incident. The before and after images provide sufficient evidence of the damage, which was likely caused by an incident.

### Conclusion
Based on the analysis of the images, the insurance agent should accept the claim as the damage to the vehicle is evident and appears to have occurred after the incident.

Handwriting recognition

Another feature of vision language models is their ability to recognize handwriting and extract handwritten text. Pixtral 12B performs well at extracting content from complex and poorly handwritten notes.

We use the following input image.


Our prompt and input payload are as follows:

system_prompt='You are a Graphologists'
task = '''
Analyze the image and transcribe any handwritten text present. 
Convert the handwriting into a single, continuous string of text. 
Maintain the original spelling, punctuation, and capitalization as written. Ignore any printed text, drawings, or other non-handwritten elements in the image.
'''

image_path = './Pixtral_data/a01-000u-04.png'


print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

put down a resolution on the subject

Reasoning of complex figures

VLMs excel at interpreting and reasoning about complex figures, charts, and diagrams. In this particular use case, we use Pixtral 12B to analyze an intricate image containing GDP data. Pixtral 12B’s advanced capabilities in document understanding and complex figure analysis make it well-suited for extracting insights from visual representations of economic data. By processing both the visual elements and accompanying text, Pixtral 12B can provide detailed interpretations and reasoned analysis of the GDP figures presented in the image.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are a Global Economist.'
task = 'List the top 5 countries in Europe with the highest GDP'
image_path = './Pixtral_data/gdp.png'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

The top 5 countries in Europe with the highest GDP, based on the provided diagram, are:

1. **Germany**
   - GDP: $3.99 trillion
   - GDP Percentage: 4.65%

2. **United Kingdom**
   - GDP: $2.82 trillion
   - GDP Percentage: 3.29%

3. **France**
   - GDP: $2.78 trillion
   - GDP Percentage: 3.24%

4. **Italy**
   - GDP: $2.07 trillion
   - GDP Percentage: 2.42%

5. **Spain**
   - GDP: $1.43 trillion
   - GDP Percentage: 1.66%

These countries are highlighted in green on the diagram.

Clean up

To avoid unwanted charges, clean up your resources. If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

Delete the Amazon Bedrock Marketplace deployment

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  4. Select the endpoint, and choose Delete.
  5. Choose Delete to delete the endpoint.
  6. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Conclusion

In this post, we showed you how to get started with the Pixtral 12B model in Amazon Bedrock Marketplace and deploy it for inference. The Pixtral 12B vision model enables you to solve multiple use cases, including document understanding, logical reasoning, handwriting recognition, image comparison, entity extraction, extraction of structured data from scanned images, and caption generation. These capabilities can drive productivity in a number of enterprise use cases, including ecommerce (retail), marketing, financial services (FSI), and more.

For more Mistral resources on AWS, check out the GitHub repo. The complete code for the samples featured in this post is available on GitHub. Pixtral 12B is also available in Amazon SageMaker JumpStart; refer to Pixtral 12B is now available on Amazon SageMaker JumpStart for details.


About the Authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he assists financial services and fintech clients in scaling and optimizing their applications on the AWS platform. He specializes in core machine learning and generative AI. Outside of work, Deepesh enjoys spending time with his family and experimenting with various cuisines.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.

John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.

Read More

Streamline work insights with the Amazon Q Business connector for Smartsheet

Streamline work insights with the Amazon Q Business connector for Smartsheet

Amazon Q Business is a fully managed, generative AI–powered assistant that empowers enterprises to unlock the full potential of their data and organizational knowledge. With Amazon Q Business, you can quickly access answers to questions, generate summaries and content, and complete tasks by using the expertise and information stored across various data sources and enterprise systems within your organization. At the heart of this innovative solution are data source connectors, which seamlessly integrate and index content from multiple data sources and enterprise systems such as SharePoint, Confluence, and Smartsheet.

This post explains how to integrate Smartsheet with Amazon Q Business to use natural language and generative AI capabilities for enhanced insights. Smartsheet, the AI-enhanced enterprise-grade work management platform, helps users manage projects, programs, and processes at scale. By connecting Amazon Q Business with Smartsheet, business users, customer solutions managers, product managers, and others can gain a deeper understanding of their work by asking natural language questions.

The following are examples of questions you can ask Amazon Q Business to gain actionable insights:

  • Project status updates – Get quick insights into project health
    • What’s the status of the website redesign project?
    • Is the mobile app launch on track for the planned date?
    • Which projects are currently behind schedule in the Q3 roadmap?
  • Task management – Find information about tasks and action items
    • What tasks are assigned to John Doe?
    • Has the marketing plan been completed?
    • What’s the due date for the customer research presentation?
  • Resource allocation – Understand resource distribution and workload
    • How many resources are allocated to the product launch project?
    • Which projects require additional staffing based on current task loads?
  • Budget tracking – Monitor project and departmental budgets in real time
    • What is the current budget status for the marketing campaign?
    • How much budget is remaining for the customer service training initiative?

Overview of Smartsheet

Smartsheet combines the simplicity of a spreadsheet with powerful features for collaboration, workflow automation, content management, and reporting. Smartsheet powers mission-critical work securely and reliably at scale for thousands of organizations worldwide, including over 85% of Fortune 500 companies. Customers rely on Smartsheet to open thousands of new restaurant locations, distribute vaccines, build rockets, and more.

In this example, we’re using Smartsheet to track tasks for a software development project. This sheet includes columns for Task, Owner, Team, Stage, Start Date, End Date, and more.

Overview of the Smartsheet connector for Amazon Q Business

By integrating Smartsheet as a data source in Amazon Q Business, you can seamlessly extract insights. For example, service operations managers can use the new connector to deliver complex projects more efficiently and consistently. By asking the Amazon Q Business intelligent assistant specific questions, the team can access insights from multiple data sources, including sheets, conversations, and attachments in Smartsheet that have been connected. The generative AI–powered assistant performs deep searches within the data while respecting access and permission levels, saving valuable time and enhancing project oversight. This streamlined process improves client retention, increases accuracy, and elevates overall service quality.

You can integrate Smartsheet with Amazon Q Business through the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the CreateDataSource API.
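For teams that prefer to script the setup, the following is a minimal, hedged sketch of the CreateDataSource path using boto3. The application ID, index ID, role ARN, and connector configuration shown here are placeholders rather than values from this walkthrough; the exact Smartsheet connector configuration schema is described in the Amazon Q Business connector documentation.

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholder identifiers -- substitute the values from your own Amazon Q Business application
application_id = "your-application-id"
index_id = "your-index-id"
data_source_role_arn = "arn:aws:iam::111122223333:role/QBusinessSmartsheetDataSourceRole"

create_response = qbusiness.create_data_source(
    applicationId=application_id,
    indexId=index_id,
    displayName="smartsheet-data-source",
    roleArn=data_source_role_arn,
    configuration={
        # Illustrative placeholder: the real Smartsheet connector template defines
        # connection, repository, and sync-scope fields -- see the connector documentation.
        "type": "SMARTSHEET",
    },
)
print(create_response["dataSourceId"])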

The Amazon Q Business Smartsheet connector understands user access permissions and strictly enforces them at the time of the query. This makes sure that users can’t access content they don’t have permissions for. For managing security, refer to Identity and access management for Amazon Q Business.

Prerequisites

Before you begin, make sure that you have completed the required prerequisites. For detailed guidance on completing these steps, refer to Prerequisites for connecting Amazon Q Business to Smartsheet.

Configure and prepare the Amazon Q Business Smartsheet connector

Follow the steps below to create the retriever and data source:

  1. Under Enhancements in the navigation pane, select Data sources. Then choose Add an index, as shown in the following screenshot.
  2. Under Index provisioning, select Enterprise and then choose Add an index, as shown in the following screenshot. The Enterprise option is ideal for workloads requiring maximum update performance.
  3. Under Enhancements in the left navigation pane, select Data sources.
  4. On the Data sources page, choose Add data source, as shown in the following screenshot.
  5. On the Add data source page, in the Data sources section, add the Smartsheet data source to your Amazon Q Business application and follow the steps at Connecting Amazon Q Business to Smartsheet using the console.
  6. On the Smartsheet data source page, select and enter the following information:
    • Data source name
    • AWS Secrets Manager secret
    • IAM role and Role name
    • Sync scope
    • Frequency

Creating the data source should only take a few minutes. After it’s set up, you’ll notice a green success notification on the console and the data source will be displayed in the Data source details section, as shown in the following screenshot.

Next, you need to sync the Smartsheet data source. In the Sync history section, choose Sync now to initiate the process of crawling and ingesting data from your source into Amazon Q Business. After the sync job is complete, your data source will be fully ready for use, as shown in the following screenshots.
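If you prefer to start the sync programmatically instead of choosing Sync now in the console, the Amazon Q Business API also exposes a StartDataSourceSyncJob operation. The following is a brief sketch that reuses the placeholder identifiers from the earlier CreateDataSource example:

# Start a sync of the Smartsheet data source (identifiers are placeholders from the earlier sketch)
sync_response = qbusiness.start_data_source_sync_job(
    applicationId=application_id,
    indexId=index_id,
    dataSourceId=create_response["dataSourceId"],
)
print(f"Started sync job: {sync_response['executionId']}")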

Amazon Q Business and Smartsheet connector in action

After creating your Amazon Q Business application and successfully syncing the Smartsheet data source, you can test the integration. Ask questions related to a project and observe how the app responds in real time. Follow these steps:

  1. In the left navigation pane, select AmazonQ-smartsheet-connector application and choose Deployed URL, as shown in the following screenshot.

In our example, we asked the following questions regarding our project captured in Smartsheet, and Amazon Q Business generated responses for the project owners regarding status and provided additional information for each task.

Question 1 – What is the project status for creating the UI?

Amazon Q Business response – As shown in the following screenshot, Amazon Q Business generated a response identifying the status as in progress, the name of the team member performing the work, and the scheduled completion date.

Question 2 – List all the projects with their deadlines

Amazon Q Business response – As shown in the following screenshot, Amazon Q Business generated an answer and listed the projects with their deadlines, including citation links from Smartsheet.

Question 3 – What project is Chloe Evans handling? Can you provide more information about it?

Amazon Q Business response – As shown in the following screenshot, Amazon Q Business generated an answer summarizing the tasks that Chloe Evans is handling.

Troubleshooting

If you encounter issues while asking questions in Amazon Q Business, the problem might be due to missing permissions required to access certain information. Amazon Q Business strictly enforces the document permissions set in its data source. Follow these steps to troubleshoot:

  1. Check indexing status – Confirm whether the Smartsheet connector has been successfully indexed in the Amazon Q Business application. This makes sure that the data source is properly integrated.
  2. Verify user permissions – Make sure that the Smartsheet user account has the necessary permissions to access and read the information from the sheet. Proper permissions are critical for enabling Amazon Q Business to retrieve and process the required data.

Additionally, as an administrator managing the Amazon Q Business application, you can troubleshoot these issues by using document-level sync reports, which enhance visibility into data source sync operations. These reports provide comprehensive, detailed insights integrated into the sync history, including granular indexing status, metadata, and access control list (ACL) details for every document processed during a data source sync job.

The detailed document reports are stored in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Q Business application log group, making sure that critical sync job details are available on demand when troubleshooting. For more information, refer to document-level sync reports.

As shown in the following screenshot, we used Amazon CloudWatch Logs Insights to query the SYNC_RUN_HISTORY_REPORT log stream so we could review the sync status in detail.
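As a hedged sketch of that kind of query, you can also run a Logs Insights query programmatically with boto3. The log group name below is a placeholder; use the log group associated with your own Amazon Q Business application.

import time
import boto3

logs = boto3.client("logs")

# Placeholder log group -- use the log group of your Amazon Q Business application
log_group = "/aws/qbusiness/your-application-id"

query = """
fields @timestamp, @message
| filter @logStream like /SYNC_RUN_HISTORY_REPORT/
| sort @timestamp desc
| limit 20
"""

start = logs.start_query(
    logGroupName=log_group,
    startTime=int(time.time()) - 24 * 3600,  # last 24 hours
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the matching report entries
while True:
    results = logs.get_query_results(queryId=start["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})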

Clean up

Complete the following steps to clean up your resources:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Select the application you created, and on the Actions menu, choose Delete.

Conclusion

In this post, we explored how Amazon Q Business can seamlessly integrate with Smartsheet to help enterprises unlock the full potential of their data and knowledge. With the Smartsheet connector, organizations can empower their teams to find answers quickly, accelerate project tracking, streamline task management, automate workflows, and enhance collaboration.

Now that you’ve learned how to integrate Amazon Q Business with your Smartsheet content, it’s time to tap into the full potential of your organization’s data. To get started, sign up for an Amazon Q Business account and follow the steps in this post to set up the Smartsheet connector. Then, you can start asking Amazon Q Business natural language questions and watch it surface insights in seconds.


About the Authors

Brandon Seiter is a Senior Director, Corporate Development at Smartsheet. He has 15+ years of experience helping companies develop their inorganic growth strategies and overseeing their Corporate Development activities, including mergers & acquisitions, strategic partnerships and new business incubation. At Smartsheet he plays a pivotal role in nurturing relationships with Smartsheet’s technology partners and developing joint partner initiatives.

Aidin Khosrowshahi is an Amazon Q Specialist Solutions Architect at AWS, where he brings his passion for generative AI and serverless applications to life. As an active member of the AI/ML and serverless community, he specializes in Amazon Q Business and Developer solutions while serving as a generative AI expert. He helps customers implement best practices while fostering collaboration and driving innovation across the AI/ML ecosystem.

Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps Independent Software Vendors (ISVs) accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.

Lokesh Chauhan is a Sr. Technical Account Manager at AWS, where he partners with Enterprise Customers to optimize their AWS journey and drive cloud success. He is a member of the AI/ML community and serves as a generative AI expert. As a 12x AWS-certified TAM, he brings deep technical expertise across the AWS platform. Prior to joining AWS, he held leadership positions including Project Lead and Sr. Database admin, building extensive experience in database and operations across multiple organizations.

Read More