Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

AI agents are quickly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, adopting AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation lets you gauge how well an agent performs its actions and gain key insights into its behavior, enhancing AI agent safety, control, trust, transparency, and performance optimization.

Amazon Bedrock Agents uses the reasoning of foundation models (FMs) available on Amazon Bedrock, APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks—freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Ragas is an open source library for testing and evaluating large language model (LLM) applications across various LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of the RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.

LLM-as-a-judge is an evaluation approach that uses LLMs to assess the quality of AI-generated outputs. This method employs an LLM to act as an impartial evaluator, to analyze and score outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.

Langfuse is an open source LLM engineering platform, which provides features such as traces, evals, prompt management, and metrics to debug and improve your LLM application.

In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend the prior work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

  • Evaluating Amazon Bedrock Agents on its capabilities (RAG, text-to-SQL, custom tool use) and overall chain-of-thought
  • Comprehensive evaluation results and trace data sent to Langfuse with built-in visual dashboards
  • Trace parsing and evaluations for various Amazon Bedrock Agents configuration options

First, we conduct evaluations on a variety of different Amazon Bedrock Agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we showcase navigating the Langfuse dashboard to view traces and evaluation results.

Technical challenges

Today, AI agent developers generally face the following technical challenges:

  • End-to-end agent evaluation – Although Amazon Bedrock provides built-in evaluation capabilities for LLM models and RAG retrieval, it lacks metrics specifically designed for Amazon Bedrock Agents. There is a need for evaluating the holistic agent goal, as well as individual agent trace steps for specific tasks and tool invocations. Support is also needed for both single-agent and multi-agent setups, and for both single-turn and multi-turn datasets.
  • Challenging experiment management – Amazon Bedrock Agents offers numerous configuration options, including LLM model selection, agent instructions, tool configurations, and multi-agent setups. However, conducting rapid experimentation with these parameters is technically challenging due to the lack of systematic ways to track, compare, and measure the impact of configuration changes across different agent versions. This makes it difficult to effectively optimize agent performance through iterative testing.

Solution overview

The following figure illustrates how Open Source Bedrock Agent Evaluation works at a high level. The framework runs an evaluation job that invokes your own agent in Amazon Bedrock and evaluates its response.

Evaluation Workflow

The workflow consists of the following steps (a short configuration sketch follows the list):

  1. The user specifies the agent ID, alias, evaluation model, and dataset containing question and ground truth pairs.
  2. The user executes the evaluation job, which will invoke the specified Amazon Bedrock agent.
  3. The retrieved agent invocation traces are run through custom parsing logic in the framework.
  4. The framework conducts an evaluation based on the agent invocation results and the question type:
    1. Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted for every evaluation run for different types of questions)
    2. RAG – Ragas evaluation library
    3. Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls
  5. Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.
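
For illustration, the inputs to a single evaluation run can be captured in a small configuration object, as in the following minimal sketch. The key names and file layout here are hypothetical; refer to the framework's README for the actual configuration format.

# Hypothetical configuration for one evaluation run; key names are
# illustrative only, not the framework's actual schema.
evaluation_job = {
    "agent_id": "ABCD1234",                       # Amazon Bedrock agent to invoke
    "agent_alias_id": "TSTALIASID",               # alias pinning the agent version
    "evaluator_model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
    "trajectory_file": "data/trajectories.json",  # question and ground truth pairs
    "langfuse_host": "https://cloud.langfuse.com",
}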

Prerequisites

To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.

To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.

Overview of evaluation metrics and input data

First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikipedia dataset.

Evaluation metrics

The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

  • Agent goal – Chain-of-thought (run on every question)
  • Task accuracy – RAG, text-to-SQL (run only when the specific tool is used to answer a question)

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and evaluation without reference. Examples can be found in Agent Goal accuracy as defined by Ragas:

  • Reference-based evaluation – The user provides a reference that will be used as the ideal outcome. The metric is computed by comparing the reference with the goal achieved by the end of the workflow.
  • Evaluation without reference – The metric evaluates the performance of the LLM in identifying and achieving the goals of the user without reference.

We showcase evaluation without reference using chain-of-thought evaluation, comparing the agent’s reasoning against the agent’s instructions. For this evaluation, we use some metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, the chain-of-thought evaluations are run on every question that the agent is evaluated against.
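
Conceptually, each chain-of-thought evaluation is a single LLM-as-a-judge call. The following is a minimal sketch using the Amazon Bedrock Converse API; the judge prompt and model ID are illustrative placeholders rather than the framework's actual evaluator prompts.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative judge prompt; the framework's evaluator prompts differ.
JUDGE_TEMPLATE = """You are an impartial judge. Given an agent's instructions
and its chain-of-thought, score instruction following from 0 to 1.

<instructions>{instructions}</instructions>
<reasoning>{reasoning}</reasoning>

Reply as JSON: {{"score": <float>, "explanation": "<text>"}}"""

def judge_chain_of_thought(instructions, reasoning):
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_TEMPLATE.format(
                instructions=instructions, reasoning=reasoning)}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    # A production judge would validate this JSON before trusting it.
    return json.loads(response["output"]["message"]["content"][0]["text"])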

Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy metrics, RAG and text-to-SQL, evaluations are conducted based on comparing the actual agent answer against the ground truth dataset that must be provided in the input dataset. The task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.

The following is a breakdown of the key metrics used in each evaluation type included in the framework (a brief Ragas sketch follows the list):

  • RAG:
    • Faithfulness – How factually consistent a response is with the retrieved context
    • Answer relevancy – How directly and appropriately the original question is addressed
    • Context recall – How many of the relevant pieces of information were successfully retrieved
    • Semantic similarity – The assessment of the semantic resemblance between the generated answer and the ground truth
  • Text-to-SQL:
    • Answer correctness – How closely the agent’s answer matches the ground truth answer
    • SQL semantic equivalence – Whether the generated SQL expresses the same logic as the ground truth query
  • Chain-of-thought:
    • Helpfulness – How well the agent satisfies explicit and implicit expectations
    • Faithfulness – How well the agent sticks to available information and context
    • Instruction following – How well the agent respects all explicit directions
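
As a brief sketch of the RAG metrics above, the following uses the Ragas library's classic Dataset-based evaluate API. Ragas also needs a judge LLM and embeddings configured (the framework wires these to Amazon Bedrock; that setup is omitted here).

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_similarity,
)

# One evaluation row: the agent's answer plus the contexts its RAG tool retrieved.
rows = Dataset.from_dict({
    "question": ["Was Abraham Lincoln the sixteenth President of the United States?"],
    "answer": ["Yes, Abraham Lincoln was the sixteenth President."],
    "contexts": [["Abraham Lincoln was the sixteenth President of the United States."]],
    "ground_truth": ["yes"],
})

scores = evaluate(
    rows,
    metrics=[faithfulness, answer_relevancy, context_recall, answer_similarity],
)
print(scores)  # for example: {'faithfulness': 1.0, 'answer_relevancy': 0.97, ...}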

User-agent trajectories

The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.

For simpler agent setups like the RAG and text-to-SQL sample agents, we created trajectories consisting of a single question, as shown in the following examples.

The following is an example of a RAG sample agent trajectory:

{
	"Trajectory0": [
		{
			"question_id": 0,
			"question_type": "RAG",
			"question": "Was Abraham Lincoln the sixteenth President of the United States?",
			"ground_truth": "yes"
		}
	]
}

The following is an example of a text-to-SQL sample agent trajectory:

{
	"Trajectory1": [
		{
			"question_id": 1,
			"question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
			"question_type": "TEXT2SQL",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
				"ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': [('cdscode', 'varchar'), ('academic year', 'varchar'), ...",
				"ground_truth_query_result": "1.0",
				"ground_truth_answer": "The highest eligible free rate for K-12 students in schools in Alameda County is 1.0."
			}
		}
	]
}

Pharmaceutical research agent use case example

In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents. That post showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert, in collaboration with a supervisor agent.

The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

HCLS Agents Architecture

As shown in the diagram, the RAG evaluations are conducted on the clinical evidence researcher sub-agent. Similarly, text-to-SQL evaluations are run on the biomarker database analyst sub-agent. The chain-of-thought evaluation assesses the final answer of the supervisor agent to check whether it properly orchestrated the sub-agents and answered the user’s question.

Research agent trajectories

For a more complex setup like the pharmaceutical research agents, we used a set of industry-relevant, pregenerated test questions. By grouping questions based on their topic, regardless of the sub-agents that might be invoked to answer them, we created trajectories that include multiple questions spanning multiple types of tool use. With relevant questions already generated, integrating with the evaluation framework simply required formatting the ground truth data into trajectories.

We walk through evaluating this agent against a trajectory containing a RAG question and a text-to-SQL question:

{
	"Trajectory1": [
		{
			"question_id": 3,
			"question_type": "RAG",
			"question": "According to the knowledge base, how did the EGF pathway associate with CT imaging features?",
			"ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass opacity and irregular nodules or nodules with poorly defined margins."
		},
		{
			"question_id": 4,
			"question_type": "TEXT2SQL",
			"question": "According to the database, What percentage of patients have EGFR mutations?",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT (COUNT(CASE WHEN EGFR_mutation_status = 'Mutant' THEN 1 END) * 100.0 / COUNT(*)) AS percentage FROM clinical_genomic;",
				"ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - EGFR_mutation_status: VARCHAR(50)",
				"ground_truth_query_result": "14.285714",
				"ground_truth_answer": "According to the query results, approximately 14.29% of patients in the clinical_genomic table have EGFR mutations."
			}
		}
	]
}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This is illustrated in the following screenshots of agent traces and evaluations on the Langfuse dashboard.
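
Under the hood, reporting a result to Langfuse amounts to creating a trace and attaching metric scores to it. The following is a minimal sketch assuming the Langfuse Python SDK's v2-style trace and score interface; the names and values are illustrative.

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# One trace per evaluated question, with evaluation metrics attached as scores.
trace = langfuse.trace(
    name="trajectory1-question3",
    input={"question": "How did the EGF pathway associate with CT imaging features?"},
    output={"answer": "The EGF pathway was significantly correlated with ..."},
)
for metric_name, value in {"faithfulness": 0.50, "context_recall": 0.53}.items():
    langfuse.score(trace_id=trace.id, name=metric_name, value=value)
langfuse.flush()  # make sure events are delivered before the job exits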

After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

Langfuse RAG Trace

The screenshot displays the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (RAG and chain-of-thought metrics)

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

Langfuse text-to-SQL Trace

The screenshot shows the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (text-to-SQL and chain-of-thought metrics)

The chain-of-thought evaluation is included in both questions’ evaluation traces. For both traces, LLM-as-a-judge is used to generate scores and an explanation of the Amazon Bedrock agent’s reasoning on a given question.

Overall, we ran 56 questions grouped into 21 trajectories against the agent. The traces, model costs, and scores are shown in the following screenshot.

Langfuse Dashboard

The following table contains the average evaluation scores across 56 evaluation traces.

Metric Category   Metric Type   Metric Name                        Number of Traces   Metric Avg. Value
Agent Goal        COT           Helpfulness                        50                 0.77
Agent Goal        COT           Faithfulness                       50                 0.87
Agent Goal        COT           Instruction following              50                 0.69
Agent Goal        COT           Overall (average of all metrics)   50                 0.77
Task Accuracy     TEXT2SQL      Answer correctness                 26                 0.83
Task Accuracy     TEXT2SQL      SQL semantic equivalence           26                 0.81
Task Accuracy     RAG           Semantic similarity                20                 0.66
Task Accuracy     RAG           Faithfulness                       20                 0.50
Task Accuracy     RAG           Answer relevancy                   20                 0.68
Task Accuracy     RAG           Context recall                     20                 0.53

Security considerations

Consider the following security measures:

  • Enable Amazon Bedrock agent logging – For security best practices of using Amazon Bedrock Agents, enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account.
  • Check for compliance requirements – Before implementing Amazon Bedrock Agents in your production environment, make sure that the Amazon Bedrock compliance certifications and standards align with your regulatory requirements. Refer to Compliance validation for Amazon Bedrock for more information and resources on meeting compliance requirements.

Clean up

If you deployed the sample agents, run the following notebooks to delete the resources created.

If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.

Conclusion

In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes equipped with built-in evaluation logic for RAG, text-to-SQL, and chain-of-thought reasoning, plus integration with Langfuse for viewing evaluation metrics. With Open Source Bedrock Agent Evaluation, developers can quickly evaluate their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.

We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent traces to Langfuse to view evaluation metrics across question types.

The Open Source Bedrock Agent Evaluation framework enables you to accelerate your generative AI application building process using Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.

Refer to Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.


About the authors

Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with healthcare and life sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Blake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.

Rishiraj Chandra is an Associate Specialist Solutions Architect at AWS, passionate about building innovative artificial intelligence and machine learning solutions. He is committed to continuously learning and implementing emerging AI/ML technologies. Outside of work, Rishiraj enjoys running, reading, and playing tennis.

Read More

Enterprise-grade natural language to SQL generation using LLMs: Balancing accuracy, latency, and scale

This blog post is co-written with Renuka Kumar and Thomas Matthew from Cisco.

Enterprise data by its very nature spans diverse data domains, such as security, finance, product, and HR. Data across these domains is often maintained across disparate data environments (such as Amazon Aurora, Oracle, and Teradata), with each managing hundreds or perhaps thousands of tables to represent and persist business data. These tables house complex domain-specific schemas, with instances of nested tables and multi-dimensional data that require complex database queries and domain-specific knowledge for data retrieval.

Recent advances in generative AI have led to the rapid evolution of natural language to SQL (NL2SQL) technology, which uses pre-trained large language models (LLMs) and natural language to generate database queries on demand. Although this technology promises simplicity and ease of use for data access, converting natural language queries to complex database queries with accuracy and at enterprise scale has remained a significant challenge. For enterprise data, a major difficulty stems from the common case of database tables having embedded structures that require specific knowledge or highly nuanced processing (for example, an embedded XML formatted string). As a result, NL2SQL solutions for enterprise data are often incomplete or inaccurate.

This post describes a pattern that AWS and Cisco teams have developed and deployed that is viable at scale and addresses a broad set of challenging enterprise use cases. The methodology allows for the use of simpler, and therefore more cost-effective and lower latency, generative models by reducing the processing required for SQL generation.

Specific challenges for enterprise-scale NL2SQL

Generative accuracy is paramount for NL2SQL use cases; inaccurate SQL queries might result in a sensitive enterprise data leak, or lead to inaccurate results impacting critical business decisions. Enterprise-scale data presents specific challenges for NL2SQL, including the following:

  • Complex schemas optimized for storage (and not retrieval) – Enterprise databases are often distributed in nature and optimized for storage rather than retrieval. As a result, the table schemas are complex, involving nested tables and multi-dimensional data structures (for example, a cell containing an array of data). Consequently, creating queries for retrieval from these data stores requires specific expertise and involves complex filtering and joins.
  • Diverse and complex natural language queries – The user’s natural language input might also be complex because they might refer to a list of entities of interest or date ranges. Converting the logical meaning of these user queries into a database query can lead to overly long and complex SQL queries due to the original design of the data schema.
  • LLM knowledge gap – NL2SQL language models are typically trained on data schemas that are publicly available for education purposes and might not have the necessary knowledge complexity required of large, distributed databases in production environments. Consequently, when faced with complex enterprise table schemas or complex user queries, LLMs have difficulty generating correct query statements because they have difficulty understanding interrelationships between the values and entities of the schema.
  • LLM attention burden and latency – Queries containing multi-dimensional data often involve multi-level filtering over each cell of the data. To generate queries for cases such as these, the generative model requires more attention to support attending to the increase in relevant tables, columns, and values; analyzing the patterns; and generating more tokens. This increases the LLM’s query generation latency, and the likelihood of query generation errors, because of the LLM misunderstanding data relationships and generating incorrect filter statements.
  • Fine-tuning challenge – One common approach to achieve higher accuracy with query generation is to fine-tune the model with more SQL query samples. However, it is non-trivial to craft training data for generating SQL for embedded structures within columns (for example, JSON, or XML), to handle sets of identifiers, and so on, to get baseline performance (which is the problem we are trying to solve in the first place). This also introduces a slowdown in the development cycle.

Solution design and methodology

The solution described in this post provides a set of optimizations that solve the aforementioned challenges while reducing the amount of work that has to be performed by an LLM for generating accurate output. This work extends upon the post Generating value from enterprise data: Best practices for Text2SQL and generative AI. That post has many useful recommendations for generating high-quality SQL, and the guidelines outlined might be sufficient for your needs, depending on the inherent complexity of the database schemas.

To achieve generative accuracy for complex scenarios, the solution breaks down NL2SQL generation into a sequence of focused steps and sub-problems, narrowing the generative focus to the appropriate data domain. Using data abstractions for complex joins and data structure, this approach enables the use of smaller and more affordable LLMs for the task. This approach results in reduced prompt size and complexity for inference, reduced response latency, and improved accuracy, while enabling the use of off-the-shelf pre-trained models.

Narrowing scope to specific data domains

The solution workflow narrows down the overall schema space into the data domain targeted by the user’s query. Each data domain corresponds to the set of database data structures (tables, views, and so on) that are commonly used together to answer a set of related user queries, for an application or business domain. The solution uses the data domain to construct prompt inputs for the generative LLM.

This pattern consists of the following elements:

  • Mapping input queries to domains – This involves mapping each user query to the data domain that is appropriate for generating the response for NL2SQL at runtime. This mapping is similar in nature to intent classification, and enables the construction of an LLM prompt that is scoped for each input query (described next).
  • Scoping data domain for focused prompt construction – This is a divide-and-conquer pattern. By focusing on the data domain of the input query, redundant information, such as schemas for other data domains in the enterprise data store, can be excluded. This might be considered as a form of prompt pruning; however, it offers more than prompt reduction alone. Reducing the prompt context to the in-focus data domain enables greater scope for few-shot learning examples, declaration of specific business rules, and more.
  • Augmenting SQL DDL definitions with metadata to enhance LLM inference – This involves enhancing the LLM prompt context by augmenting the SQL DDL for the data domain with descriptions of tables, columns, and rules to be used by the LLM as guidance on its generation. This is described in more detail later in this post.
  • Determining query dialect and connection information – For each data domain, the database server metadata (such as the SQL dialect and connection URI) is captured during use case onboarding and made available at runtime to be automatically included in the prompt for SQL generation and subsequent query execution. This enables scalability through decoupling the natural language query from the specific queried data source. Together, the SQL dialect and connectivity abstractions allow for the solution to be data source agnostic; data sources might be distributed within or across different clouds, or provided by different vendors. This modularity enables scalable addition of new data sources and data domains, because each is independent.

Managing identifiers for SQL generation (resource IDs)

Resolving identifiers involves extracting the named resources, as named entities, from the user’s query and mapping the values to unique IDs appropriate for the target data source prior to NL2SQL generation. This can be implemented using natural language processing (NLP) or LLMs to apply named entity recognition (NER) capabilities to drive the resolution process. This optional step has the most value when there are many named resources and the lookup process is complex. For instance, in a user query such as “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” there are named resources: ‘allyson felix’, ‘isabelle werth’, and ‘nedo nadi’. This step allows for rapid and precise feedback to the user when a resource can’t be resolved to an identifier (for example, due to ambiguity).

This optional process of handling many or paired identifiers is included to offload from the LLM the burden of user queries with challenging sets of identifiers, such as those that might come in pairs (such as ID-type, ID-value), or where there are many identifiers. Rather than having the generative LLM insert each unique ID into the SQL directly, the identifiers are made available by defining a temporary data structure (such as a temporary table) and a set of corresponding insert statements. The LLM is prompted with few-shot learning examples to generate SQL for the user query by joining with the temporary data structure, rather than attempting identifier injection. This results in a simpler and more consistent query pattern for cases when there are one, many, or pairs of identifiers.
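
As a sketch of this idea, the resolved identifiers can be staged by a small helper that emits the temporary table DDL and insert statements; the table shape mirrors the athletes_in_focus example later in this post.

# Hedged sketch: stage resolved identifiers in a temp table so the LLM can
# join against it instead of inlining each ID into the generated SQL.
def build_sql_preamble(identifiers):
    ddl = ("CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, "
           "id INTEGER, full_name TEXT DEFAULT NULL);")
    values = ", ".join(
        f"({row_id + 1},{ident['id']},'{ident['name']}')"
        for row_id, ident in enumerate(identifiers)
    )
    return [ddl, f"INSERT INTO athletes_in_focus VALUES {values};"]

# Example: build_sql_preamble([{'id': 84026, 'name': 'nedo nadi'}])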

Handling complex data structures: Abstracting domain data structures

This step is aimed at simplifying complex data structures into a form that can be understood by the language model without having to decipher complex inter-data relationships. Complex data structures might appear as nested tables or lists within a table column, for instance.

We can define temporary data structures (such as views and tables) that abstract complex multi-table joins, nested structures, and more. These higher-level abstractions provide simplified data structures for query generation and execution. The top-level definitions of these abstractions are included as part of the prompt context for query generation, and the full definitions are provided to the SQL execution engine, along with the generated query. The resulting queries from this process can use simple set operations (such as IN, as opposed to complex joins) that LLMs are well trained on, thereby alleviating the need for nested joins and filters over complex data structures.
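
For example, a temporary view can hide a multi-table join behind a flat, easy-to-query shape. The following DDL is illustrative only; the competitor_event and medal tables are assumed for the sketch.

# Illustrative DDL: abstract a three-table join behind a simple view so the
# generated SQL can filter on flat columns instead of nested joins.
MEDALS_VIEW_DDL = """
CREATE TEMP VIEW athlete_medal_counts AS
SELECT gc.person_id,
       m.medal_name,
       COUNT(*) AS medal_count
FROM games_competitor gc
JOIN competitor_event ce ON ce.competitor_id = gc.id
JOIN medal m ON m.id = ce.medal_id
GROUP BY gc.person_id, m.medal_name;
"""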

Augmenting data with data definitions for prompt construction

Several of the optimizations noted earlier require making some of the specifics of the data domain explicit. Fortunately, this only has to be done when schemas and use cases are onboarded or updated. The benefit is higher generative accuracy, reduced generative latency and cost, and the ability to support arbitrarily complex query requirements.

To capture the semantics of a data domain, the following elements are defined:

  • The standard tables and views in the data schema, along with comments to describe the tables and columns.
  • Join hints for the tables and views, such as when to use outer joins.
  • Data domain-specific rules, such as which columns might not appear in a final select statement.
  • The set of few-shot examples of user queries and corresponding SQL statements. A good set of examples would include a wide variety of user queries for that domain.
  • Definitions of the data schemas for any temporary tables and views used in the solution.
  • A domain-specific system prompt that specifies the role and expertise that the LLM has, the SQL dialect, and the scope of its operation.
  • A domain-specific user prompt.
  • Additionally, if temporary tables or views are used for the data domain, a SQL script that, when executed, creates the desired temporary data structures needs to be defined. Depending on the use case, this can be a static or dynamically generated script.

Accordingly, the prompt for generating the SQL is dynamic and constructed based on the data domain of the input question, with a set of specific definitions of data structure and rules appropriate for the input query. We refer to this set of elements as the data domain context. The purpose of the data domain context is to provide the necessary prompt metadata for the generative LLM. Examples of this, and the methods described in the previous sections, are included in the GitHub repository. There is one context for each data domain, as illustrated in the following figure.
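
In code, a data domain context can be captured as a simple structure assembled at onboarding time. The following is a hypothetical sketch; the key names are illustrative, and the GitHub repository shows the actual format used by the example implementation.

# Hypothetical data domain context; key names are illustrative only.
OLYMPICS_DOMAIN_CONTEXT = {
    "system_prompt": "You are a SQL expert for the Olympics statistics database ...",
    "schema_ddl": "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);",
    "join_hints": ["Join games_competitor to games through games_id."],
    "rules": ["Do not include raw person_id values in the final SELECT."],
    "few_shot_examples": [
        {"question": "How many gold medals has Yukio Endo won?",
         "sql": "SELECT a.id, count(m.medal_name) AS count FROM athletes_in_focus a ..."},
    ],
    "sql_preamble_template": "CREATE temp TABLE athletes_in_focus (...);",
    "dialect": "sqlite",
    "connection_uri": "sqlite:///olympics.db",
}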

Bringing it all together: The execution flow

This section describes the execution flow of the solution. An example implementation of this pattern is available in the GitHub repository. Access the repository to follow along with the code.

To illustrate the execution flow, we use an example database with data about Olympics statistics and another with the company’s employee vacation schedule. We follow the execution flow for the domain regarding Olympics statistics using the user query “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” to show the inputs and outputs of the steps in the execution flow, as illustrated in the following figure.

High-level processing workflow

Preprocess the request

The first step of the NL2SQL flow is to preprocess the request. The main objective of this step is to classify the user query into a domain. As explained earlier, this narrows down the scope of the problem to the appropriate data domain for SQL generation. Additionally, this step identifies and extracts the referenced named resources in the user query. These are then used to call the identity service in the next step to get the database identifiers for these named resources.

Using the earlier mentioned example, the inputs and outputs of this step are as follows:

user_query = "In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'} }

Resolve identifiers (to database IDs)

This step processes the named resources’ strings extracted in the previous step and resolves them to identifiers that can be used in database queries. As mentioned earlier, the named resources (for example, “group22”, “user123”, and “I”) are looked up using solution-specific means, such as through database lookups or an ID service.

The following code shows the execution of this step in our running example:

named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
  identifiers = id_service_facade.resolve(named_resources)
  # add identifiers to the pre_processed_request object
  pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
  pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
   'identifiers': [ {'id': 34551, 'role': 32, 'name': 'allyson felix'},
   {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
   {'id': 84026, 'role': 32, 'name': 'nedo nadi'} ] }

Prepare the request

This step is pivotal in this pattern. Having obtained the domain and the named resources along with their looked-up IDs, we use the corresponding context for that domain to generate the following:

  • A prompt for the LLM to generate a SQL query corresponding to the user query
  • A SQL script to create the domain-specific schema

To create the prompt for the LLM, this step assembles the system prompt, the user prompt, and the received user query from the input, along with the domain-specific schema definition, including new temporary tables created as well as any join hints, and finally the few-shot examples for the domain. Other than the user query that is received as an input, the other components are based on the values provided in the context for that domain.

A SQL script for creating required domain-specific temporary structures (such as views and tables) is constructed from the information in the context. The domain-specific schema in the LLM prompt, join hints, and the few-shot examples are aligned with the schema that gets generated by running this script. In our example, this step is shown in the following code. The output is a dictionary with two keys, llm_prompt and sql_preamble. The value strings for these have been clipped here; the full output can be seen in the Jupyter notebook.

prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
{'llm_prompt': 'You are a SQL expert. Given the following SQL tables definitions, ...
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);
...
<example>
question: How many gold medals has Yukio Endo won? answer: ```{"sql":
"SELECT a.id, count(m.medal_name) as "count"
FROM athletes_in_focus a INNER JOIN games_competitor gc ...
WHERE m.medal_name = 'Gold' GROUP BY a.id;" }```
</example>
...
'sql_preamble': [ 'CREATE temp TABLE athletes_in_focus (row_id INTEGER
PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
'INSERT INTO athletes_in_focus VALUES
(1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');"]}

Generate SQL

Now that the prompt has been prepared along with any information necessary to provide the proper context to the LLM, we provide that information to the SQL-generating LLM in this step. The goal is to have the LLM output SQL with the correct join structure, filters, and columns. See the following code:

llm_response = llm_service_facade.invoke(prepared_request[ 'llm_prompt' ])
generated_sql = llm_response[ 'llm_output' ]

# Output generated_sql:
{'sql': 'SELECT g.games_name, g.games_year FROM athletes_in_focus a
JOIN games_competitor gc ON gc.person_id = a.id
JOIN games g ON gc.games_id = g.id;'}

Execute the SQL

After the SQL query is generated by the LLM, the SQL preamble and the generated SQL are merged to create a complete SQL script for execution. The complete SQL script is then executed against the data store, the response is fetched, and the response is passed back to the client or end user. See the following code:

sql_script = prepared_request[ 'sql_preamble' ] + [ generated_sql[ 'sql' ] ]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
{'rdbms_output': [
('games_name', 'games_year'),
('2004 Summer', 2004),
...
('2016 Summer', 2016)],
'processing_status': 'success'}

Solution benefits

Overall, our tests have shown several benefits, such as:

  • High accuracy – This is measured by a string matching of the generated query with the target SQL query for each test case. In our tests, we observed over 95% accuracy for 100 queries, spanning three data domains.
  • High consistency – This is measured in terms of the same SQL being generated across multiple runs. We observed over 95% consistency for 100 queries, spanning three data domains. With the test configuration, the queries were accurate most of the time; a small number occasionally produced inconsistent results.
  • Low cost and latency – The approach supports the use of small, low-cost, low-latency LLMs. We observed SQL generation in the 1–3 second range using models such as Meta’s Code Llama 13B and Anthropic’s Claude 3 Haiku.
  • Scalability – The methods that we employed in terms of data abstractions facilitate scaling independent of the number of entities or identifiers in the data for a given use case. For instance, in our tests consisting of a list of 200 different named resources per row of a table, and over 10,000 such rows, we measured a latency range of 2–5 seconds for SQL generation and 3.5–4.0 seconds for SQL execution.
  • Solving complexity – Using the data abstractions for simplifying complexity enabled the accurate generation of arbitrarily complex enterprise queries, which almost certainly would not be possible otherwise.

We attribute the success of the solution with these excellent but lightweight models (compared to a Meta Llama 70B variant or Anthropic’s Claude Sonnet) to the points noted earlier, with the reduced LLM task complexity being the driving force. The implementation code demonstrates how this is achieved. Overall, by using the optimizations outlined in this post, natural language SQL generation for enterprise data is much more feasible than would be otherwise.

AWS solution architecture

In this section, we illustrate how you might implement the architecture on AWS. The end-user sends their natural language queries to the NL2SQL solution using a REST API. Amazon API Gateway is used to provision the REST API, which can be secured by Amazon Cognito. The API is linked to an AWS Lambda function, which implements and orchestrates the processing steps described earlier using a programming language of the user’s choice (such as Python) in a serverless manner. In this example implementation, where Amazon Bedrock is noted, the solution uses Anthropic’s Claude 3 Haiku.

Briefly, the processing steps are as follows:

  1. Determine the domain by invoking an LLM on Amazon Bedrock for classification.
  2. Invoke Amazon Bedrock to extract relevant named resources from the request.
  3. After the named resources are determined, this step calls a service (the Identity Service) that returns identifier specifics relevant to the named resources for the task at hand. The Identity Service is logically a key/value lookup service, which might support multiple domains.
  4. This step runs on Lambda to create the LLM prompt to generate the SQL, and to define temporary SQL structures that will be executed by the SQL engine along with the SQL generated by the LLM (in the next step).
  5. Given the prepared prompt, this step invokes an LLM running on Amazon Bedrock to generate the SQL statements that correspond to the input natural language query.
  6. This step executes the generated SQL query against the target database. In our example implementation, we used an SQLite database for illustration purposes, but you could use another database server.

The final result is obtained by running the preceding pipeline on Lambda. When the workflow is complete, the result is provided as a response to the REST API request.
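
The following is a condensed sketch of that Lambda orchestration, reusing the module names from the execution-flow snippets earlier; it assumes those modules are packaged with the function and omits authentication and error handling.

import json

def lambda_handler(event, context):
    user_query = json.loads(event["body"])["user_query"]

    # Steps 1-2: classify the domain and extract named resources.
    pre_processed_request = request_pre_processor.run(user_query)

    # Step 3: resolve named resources to database identifiers.
    named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
    pre_processed_request[app_consts.IDENTIFIERS] = (
        id_service_facade.resolve(named_resources) if named_resources else []
    )

    # Step 4: build the domain-scoped LLM prompt and SQL preamble.
    prepared_request = request_preparer.run(pre_processed_request)

    # Step 5: generate the SQL with an LLM on Amazon Bedrock.
    llm_response = llm_service_facade.invoke(prepared_request["llm_prompt"])
    sql_script = prepared_request["sql_preamble"] + [llm_response["llm_output"]["sql"]]

    # Step 6: execute the full script against the domain's database.
    database = app_consts.get_database_for_domain(pre_processed_request[app_consts.DOMAIN])
    results = rdbms_service_facade.execute_sql(database, sql_script)
    return {"statusCode": 200, "body": json.dumps(results["rdbms_output"])}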

The following diagram illustrates the solution architecture.

Example solution architecture

Conclusion

In this post, the AWS and Cisco teams unveiled a new methodical approach that addresses the challenges of enterprise-grade SQL generation. The teams were able to reduce the complexity of the NL2SQL process while delivering higher accuracy and better overall performance.

Though we’ve walked you through an example use case focused on answering questions about Olympic athletes, this versatile pattern can be seamlessly adapted to a wide range of business applications and use cases. The demo code is available in the GitHub repository. We invite you to leave any questions and feedback in the comments.


About the authors

Renuka Kumar is a Senior Engineering Technical Lead at Cisco, where she has architected and led the development of Cisco’s Cloud Security BU’s AI/ML capabilities in the last 2 years, including launching first-to-market innovations in this space. She has over 20 years of experience in several cutting-edge domains, with over a decade in security and privacy. She holds a PhD from the University of Michigan in Computer Science and Engineering.

Toby Fotherby is a Senior AI and ML Specialist Solutions Architect at AWS, helping customers use the latest advances in AI/ML and generative AI to scale their innovations. He has over a decade of cross-industry expertise leading strategic initiatives and master’s degrees in AI and Data Science. Toby also leads a program training the next generation of AI Solutions Architects.

Shweta Keshavanarayana is a Senior Customer Solutions Manager at AWS. She works with AWS Strategic Customers and helps them in their cloud migration and modernization journey. Shweta is passionate about solving complex customer challenges using creative solutions. She holds an undergraduate degree in Computer Science & Engineering. Beyond her professional life, she volunteers as a team manager for her sons’ U9 cricket team, while also mentoring women in tech and serving the local community.

Thomas Matthew is an AI/ML Engineer at Cisco. Over the past decade, he has worked on applying methods from graph theory and time series analysis to solve detection and exfiltration problems found in network security. He has presented his research and work at Blackhat and DevCon. Currently, he helps integrate generative AI technology into Cisco’s Cloud Security product offerings.

Daniel Vaquero is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers solve business challenges using artificial intelligence and machine learning, creating solutions ranging from traditional ML approaches to generative AI. Daniel has more than 12 years of industry experience working on computer vision, computational photography, machine learning, and data science, and he holds a PhD in Computer Science from UCSB.

Atul Varshneya is a former Principal AI/ML Specialist Solutions Architect with AWS. He currently focuses on developing solutions in the areas of AI/ML, particularly in generative AI. In his four-decade career, Atul has worked as the technology R&D leader in multiple large companies and startups.

Jessica Wu is an Associate Solutions Architect at AWS. She helps customers build highly performant, resilient, fault-tolerant, cost-optimized, and sustainable architectures.

Read More

AWS Field Experience reduced cost and delivered low latency and high performance with Amazon Nova Lite foundation model

AWS Field Experience (AFX) empowers Amazon Web Services (AWS) sales teams with generative AI solutions built on Amazon Bedrock, improving how AWS sellers and customers interact. The AFX team uses AI to automate tasks and provide intelligent insights and recommendations, streamlining workflows for both customer-facing roles and internal support functions. Their approach emphasizes operational efficiency and practical enhancements to daily processes.

Last year, AFX introduced Account Summaries as the first in a forthcoming lineup of tools designed to support and streamline sales workflows. By integrating structured and unstructured data—from sales collateral and customer engagements to external insights and machine learning (ML) outputs—the tool delivers summarized insights that offer a comprehensive view of customer accounts. These summaries provide concise overviews and timely updates, enabling teams to make informed decisions during customer interactions.

The following screenshot shows an example of Account Summary for a customer account, including an executive summary, company overview, and recent account changes.

An image showing account summary of AnyCompany Air Lines, including financial metrics, AWS usage, support cases, and recent updates.

Migration to the Amazon Nova Lite foundation model

Initially, AFX selected a range of models available on Amazon Bedrock, each chosen for its specific capabilities tailored to the diverse requirements of various summary sections. This was done to optimize accuracy, response time, and cost efficiency. However, following the introduction of state-of-the-art Amazon Nova foundation models in December 2024, the AFX team consolidated all of its generative AI workloads onto the Nova Lite model to capitalize on its industry-leading price performance and optimized latency.

Since moving to the Nova Lite model, the AFX team has achieved a remarkable 90% reduction in inference costs. This has empowered them to scale operations and deliver greater business value that directly supports their mission of creating efficient, high-performing sales processes.

Because Account Summaries are often used by sellers during on-the-go customer engagements, response speed is critical for maintaining seller efficiency. The Nova Lite model’s ultra-low latency helps ensure that sellers receive fast, reliable responses, without compromising on the quality of the insights.
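
For teams exploring a similar setup, invoking the Nova Lite model through the Amazon Bedrock Converse API follows the pattern in this sketch; the prompt and inference settings are illustrative, not the AFX implementation.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Some Regions expose the model through an inference profile ID such as
# "us.amazon.nova-lite-v1:0"; check your Region's model catalog.
response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize the recent changes for this account: ..."}],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])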

The AFX team also highlighted the seamless migration experience, noting that their existing prompting, reasoning, and evaluation criteria transferred smoothly to the Amazon Nova Lite model without requiring significant modifications. The combination of tailored prompt controls and authorized reference content creates a bounded response framework, minimizing hallucinations and inaccuracies.

Overall impact

Since the move to the Nova Lite model, 3,600 sellers have generated over 15,600 summaries, with 1,500 of those sellers producing more than four summaries each. Impressively, the generative AI Account Summaries have achieved a 72% favorability rate, underscoring strong seller confidence and widespread approval.

AWS sellers report saving an average of 35 minutes per summary, a benefit that significantly boosts productivity and allocates more time for customer engagements. Additionally, about one-third of surveyed sellers noted that the summaries positively influenced their customer interactions, and those using generative AI Account Summaries experienced a 4.9% increase in the value of opportunities created.

A member of the AFX team explained, “The Amazon Nova Lite model has significantly reduced our costs without compromising performance. It allowed us to get fast, reliable account summaries, making customer interaction more productive and impactful.”

Conclusion

The AFX team’s product migration to the Nova Lite model has delivered tangible enterprise value by enhancing sales workflows. By migrating to the Amazon Nova Lite model, the team has not only achieved significant cost savings and reduced latency, but has also empowered sellers with a leading intelligent and reliable solution. This process has translated into real-world benefits—saving time, simplifying research, and bolstering customer engagement—laying a solid foundation for ongoing business goals and sustained success.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.


About the Authors

Anuj Jauhari is a Senior Product Marketing Manager at Amazon Web Services, where he helps customers realize value from innovations in generative AI.

Ashwin Nadagoudar is a Software Development Manager at Amazon Web Services, leading go-to-market (GTM) strategies and user journey initiatives with generative AI.

Sonciary Perez is a Principal Product Manager at Amazon Web Services, supporting the transformation of AWS Sales through AI-powered solutions that drive seller productivity and accelerate revenue growth.

Read More

Combine keyword and semantic search for text and images using Amazon Bedrock and Amazon OpenSearch Service

Customers today expect to find products quickly and efficiently through intuitive search functionality. A seamless search journey not only enhances the overall user experience, but also directly impacts key business metrics such as conversion rates, average order value, and customer loyalty. According to a McKinsey study, 78% of consumers are more likely to make repeat purchases from companies that provide personalized experiences. As a result, delivering exceptional search functionality has become a strategic differentiator for modern ecommerce services. With ever expanding product catalogs and increasing diversity of brands, harnessing advanced search technologies is essential for success.

Semantic search enables digital commerce providers to deliver more relevant search results by going beyond keyword matching. It uses an embeddings model to create vector embeddings that capture the meaning of the input query. This helps the search be more resilient to phrasing variations and to accept multimodal inputs such as text, image, audio, and video. For example, a user inputs a query containing text and an image of a product they like, and the search engine translates both into vector embeddings using a multimodal embeddings model and retrieves related items from the catalog using embeddings similarities. To learn more about semantic search and how Amazon Prime Video uses it to help customers find their favorite content, see Amazon Prime Video advances search for sports using Amazon OpenSearch Service.

While semantic search provides contextual understanding and flexibility, keyword search remains a crucial component for a comprehensive ecommerce search solution. At its core, keyword search provides the essential baseline functionality of accurately matching user queries to product data and metadata, making sure explicit product names, brands, or attributes can be reliably retrieved. This matching capability is vital, because users often have specific items in mind when initiating a search, and meeting these explicit needs with precision is important to deliver a satisfactory experience.

Hybrid search combines the strengths of keyword search and semantic search, enabling retailers to deliver more accurate and relevant results to their customers. According to an OpenSearch blog post, hybrid search improves result quality by 8–12% compared to keyword search and by 15% compared to natural language search. However, combining keyword search and semantic search presents significant complexity because different query types provide scores on different scales. Using Amazon OpenSearch Service hybrid search, customers can seamlessly integrate these approaches by combining relevance scores from multiple search types into one unified score.

OpenSearch Service is the AWS recommended vector database for Amazon Bedrock. It’s a fully managed service that you can use to deploy, operate, and scale OpenSearch on AWS. OpenSearch is a distributed open-source search and analytics engine composed of a search engine and vector database. OpenSearch Service can help you deploy and operate your search infrastructure with native vector database capabilities delivering as low as single-digit millisecond latencies for searches across billions of vectors, making it ideal for real-time AI applications. To learn more, see Improve search results for AI using Amazon OpenSearch Service as a vector database with Amazon Bedrock.

Multimodal embedding models like Amazon Titan Multimodal Embeddings G1, available through Amazon Bedrock, play a critical role in enabling hybrid search functionality. These models generate embeddings for both text and images by representing them in a shared semantic space. This allows systems to retrieve relevant results across modalities such as finding images using text queries or combining text with image inputs.
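
As an illustration of that shared embedding space, the following sketch calls the Amazon Titan Multimodal Embeddings G1 model through the Amazon Bedrock InvokeModel API with text, an image, or both; confirm the request shape against the model's current input documentation.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def titan_multimodal_embedding(text=None, image_base64=None):
    """Return one embedding in the shared text/image semantic space."""
    body = {}
    if text:
        body["inputText"] = text
    if image_base64:
        body["inputImage"] = image_base64  # Base64-encoded image bytes
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["embedding"]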

In this post, we walk you through how to build a hybrid search solution using OpenSearch Service powered by multimodal embeddings from the Amazon Titan Multimodal Embeddings G1 model through Amazon Bedrock. This solution demonstrates how you can enable users to submit both text and images as queries to retrieve relevant results from a sample retail image dataset.

Overview of solution

In this post, you will build a solution that you can use to search through a sample image dataset in the retail space, using a multimodal hybrid search system powered by OpenSearch Service. This solution has two key workflows: a data ingestion workflow and a query workflow.

Data ingestion workflow

The data ingestion workflow generates vector embeddings for text, images, and metadata using Amazon Bedrock and the Amazon Titan Multimodal Embeddings G1 model. Then, it stores the vector embeddings, text, and metadata in an OpenSearch Service domain.

In this workflow, shown in the following figure, we use a SageMaker JupyterLab notebook to perform the following actions (a condensed code sketch follows the list):

  1. Read text, images, and metadata from an Amazon Simple Storage Service (Amazon S3) bucket, and encode images in Base64 format.
  2. Send the text, images, and metadata to Amazon Bedrock using its API to generate embeddings using the Amazon Titan Multimodal Embeddings G1 model.
  3. The Amazon Bedrock API replies with embeddings to the Jupyter notebook.
  4. Store both the embeddings and metadata in an OpenSearch Service domain.
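
The following condensed sketch covers these four steps; the bucket, key, index name, and endpoint are placeholders, authentication is omitted, and titan_multimodal_embedding is the helper sketched earlier.

import base64
import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")

# Step 1: read an image from Amazon S3 and encode it in Base64 format.
obj = s3.get_object(Bucket="my-retail-dataset", Key="images/product-001.jpg")
image_b64 = base64.b64encode(obj["Body"].read()).decode("utf-8")

# Steps 2-3: generate a multimodal embedding with Amazon Bedrock.
embedding = titan_multimodal_embedding(text="red running shoes",
                                       image_base64=image_b64)

# Step 4: store the embedding and metadata in the OpenSearch Service domain.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,  # authentication (for example, SigV4) omitted for brevity
)
client.index(index="retail-multimodal", body={
    "product_description": "red running shoes",
    "image_b64": image_b64,
    "vector_embedding": embedding,
})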

Query workflow

In the query workflow, an OpenSearch search pipeline is used to convert the query input to embeddings using the embeddings model registered with OpenSearch. Then, within the search pipeline’s results processor, the results of semantic search and keyword search are combined using the normalization processor to provide relevant search results to users. Search pipelines take away the heavy lifting of building score normalization and combination outside your OpenSearch Service domain.

The workflow consists of the following steps, shown in the following figure (a query sketch follows the list):

  1. The client submits a query input containing text, a Base64 encoded image, or both to OpenSearch Service. Text submitted is used for both semantic and keyword search, and the image is used for semantic search.
  2. The OpenSearch search pipeline performs the keyword search using textual inputs and a neural search using vector embeddings generated by Amazon Bedrock using Titan Multimodal Embeddings G1 model.
  3. The normalization processor within the pipeline scales search results using techniques like min_max and combines keyword and semantic scores using arithmetic_mean.
  4. Ranked search results are returned to the client.
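
Putting these steps together, a hybrid query carries both a keyword clause and a neural clause and is routed through the search pipeline. The following is a hedged sketch; the field names, model ID, and pipeline name are placeholders for values created elsewhere in this walkthrough.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,  # authentication omitted for brevity
)
image_b64 = "<Base64-encoded query image>"  # optional multimodal input

hybrid_query = {
    "_source": {"excludes": ["vector_embedding"]},
    "query": {
        "hybrid": {
            "queries": [
                # Keyword search over the text field.
                {"match": {"product_description": {"query": "red running shoes"}}},
                # Semantic (neural) search over the embedding field.
                {"neural": {
                    "vector_embedding": {
                        "query_text": "red running shoes",
                        "query_image": image_b64,
                        "model_id": "<registered-model-id>",
                        "k": 5,
                    }
                }},
            ]
        }
    },
}

# The search pipeline applies score normalization and combination.
results = client.search(
    index="retail-multimodal",
    body=hybrid_query,
    params={"search_pipeline": "nlp-search-pipeline"},
)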

Walkthrough overview

To deploy the solution, complete the following high-level steps:

  1. Create a connector for Amazon Bedrock in OpenSearch Service.
  2. Create an OpenSearch search pipeline and enable hybrid search.
  3. Create an OpenSearch Service index for storing the multimodal embeddings and metadata.
  4. Ingest sample data to the OpenSearch Service index.
  5. Create OpenSearch Service query functions to test search functionality.

Prerequisites

For this walkthrough, you should have the following prerequisites:

The code is open source and hosted on GitHub.

Create a connector for Amazon Bedrock in OpenSearch Service

To use OpenSearch Service machine learning (ML) connectors with other AWS services, you need to set up an IAM role allowing access to that service. In this section, we demonstrate the steps to create an IAM role and then create the connector.

Create an IAM role

Complete the following steps to set up an IAM role to delegate Amazon Bedrock permissions to OpenSearch Service:

  1. Add the following policy to the new role to allow OpenSearch Service to invoke the Amazon Titan Multimodal Embeddings G1 model:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": "arn:aws:bedrock:region:account-id:foundation-model/amazon.titan-embed-image-v1"
            }
        ]
    }
  2. Modify the role trust policy as follows. You can follow the instructions in IAM role management to edit the trust relationship of the role.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "opensearchservice.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

Connect an Amazon Bedrock model to OpenSearch

After you create the role, you can use the Amazon Resource Name (ARN) of the role to define the constant in the SageMaker notebook along with the OpenSearch domain endpoint. Complete the following steps:

  1. Register a model group. Note the model group ID returned in the response; you use it to register the model in a later step.
  2. Create a connector, which facilitates registering and deploying external models in OpenSearch. The response contains the connector ID.
  3. Register the external model to the model group and deploy it. By setting deploy=true in the register request, the model is registered and deployed in one step. A minimal sketch of these three calls follows this list.
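
The following sketch shows these three ML Commons calls using the same requests client and constants as the rest of the notebook. The connector body is abbreviated from the published Amazon Bedrock connector blueprint; the group and model names, AWS_REGION, and BEDROCK_INVOKE_ROLE_ARN (the role created earlier) are illustrative assumptions.

import requests

# 1. Register a model group; keep the returned model_group_id
mg = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/model_groups/_register",
    json={"name": "bedrock_model_group", "description": "Bedrock embedding models"},
    auth=open_search_auth,
).json()

# 2. Create a connector that points at the Titan Multimodal Embeddings model
conn = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/connectors/_create",
    json={
        "name": "bedrock-titan-multimodal",
        "version": 1,
        "protocol": "aws_sigv4",
        "parameters": {"region": AWS_REGION, "service_name": "bedrock"},
        "credential": {"roleArn": BEDROCK_INVOKE_ROLE_ARN},
        "actions": [{
            "action_type": "predict",
            "method": "POST",
            "url": f"https://bedrock-runtime.{AWS_REGION}.amazonaws.com/model/amazon.titan-embed-image-v1/invoke",
            "headers": {"content-type": "application/json"},
            "request_body": '{"inputText": "${parameters.inputText:-null}", "inputImage": "${parameters.inputImage:-null}"}',
        }],
    },
    auth=open_search_auth,
).json()

# 3. Register the external model to the group and deploy it in one step
registration = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/models/_register?deploy=true",
    json={
        "name": "titan-multimodal-embeddings",
        "function_name": "remote",
        "model_group_id": mg["model_group_id"],
        "connector_id": conn["connector_id"],
    },
    auth=open_search_auth,
).json()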

Create an OpenSearch search pipeline and enable hybrid search

A search pipeline runs inside the OpenSearch Service domain and can have three types of processors: search request processors, search response processors, and search phase results processors. For our search pipeline, we use a search phase results processor, which runs between search phases at the coordinating-node level. Specifically, we configure the normalization processor, which normalizes the scores from keyword and semantic search. For hybrid search, the min_max normalization and arithmetic_mean combination techniques are preferred, but you can also try L2 normalization with geometric_mean or harmonic_mean combination, depending on your data and use case.

payload = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [
                            OPENSEARCH_KEYWORD_WEIGHT,
                            1 - OPENSEARCH_KEYWORD_WEIGHT,
                        ]
                    },
                },
            }
        }
    ]
}
response = requests.put(
    url=f"{OPENSEARCH_ENDPOINT}/_search/pipeline/{OPENSEARCH_SEARCH_PIPELINE_NAME}",
    json=payload,
    headers={"Content-Type": "application/json"},
    auth=open_search_auth,
)

Create an OpenSearch Service index for storing the multimodal embeddings and metadata

For this post, we use the Amazon Berkeley Objects Dataset, a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. In this example, we use only the Shoes category and listings in en_US, as shown in the Prepare listings dataset for Amazon OpenSearch ingestion section of the notebook.

Use the following code to create an OpenSearch index to ingest the sample data:

response = opensearch_client.indices.create(
	index=OPENSEARCH_INDEX_NAME,
	body={
		"settings": {
			"index.knn": True,
			"number_of_shards": 2
		},
		"mappings": {
			"properties": {
				"amazon_titan_multimodal_embeddings": {
					"type": "knn_vector",
					"dimension": 1024,
					"method": {
						"name": "hnsw",
						"engine": "lucene",
						"parameters": {}
					}
				}
			}
		}
	}
)

Ingest sample data to the OpenSearch Service index

In this step, you select the relevant features used for generating embeddings. The images are converted to Base64. The combination of a selected feature and a Base64 image is used to generate multimodal embeddings, which are stored in the OpenSearch Service index along with the metadata using an OpenSearch bulk operation that ingests listings in batches.
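
The following is a minimal sketch of that bulk ingestion, assuming the get_multimodal_embedding helper from the ingestion workflow and listing dictionaries with illustrative item_name, image_path, and metadata fields:

from opensearchpy import helpers

def build_actions(listings):
    # Emit one bulk index action per product listing
    for listing in listings:
        embedding = get_multimodal_embedding(
            text=listing["item_name"], image_path=listing["image_path"]
        )
        yield {
            "_index": OPENSEARCH_INDEX_NAME,
            "_source": {
                "amazon_titan_multimodal_embeddings": embedding,
                "item_name": listing["item_name"],
                **listing["metadata"],  # product metadata stored alongside the vector
            },
        }

# Ingest the prepared listings in chunks of 50 documents per bulk request
helpers.bulk(opensearch_client, build_actions(listings), chunk_size=50)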

Create OpenSearch Service query functions to test search functionality

With the sample data ingested, you can run queries against this data to test the hybrid search functionality. To facilitate this process, we created helper functions to perform the queries in the query workflow section of the notebook. In this section, you explore specific parts of the functions that differentiate the search methods.

Keyword search

For keyword search, send the following payload to the OpenSearch domain search endpoint:

payload = {
	"query": {
		"multi_match": { 
			"query": query_text,
		}
	},
}

Semantic search

For semantic search, you can send the text and image as part of the payload. The model_id in the request identifies the external embeddings model that you connected earlier. OpenSearch invokes the model to convert the text and image to embeddings.

payload = {
	"query": {
		"neural": {
			"vector_embedding": {
				"query_text": query_text,
				"query_image": query_jpg_image,
				"model_id": model_id,
				"k": 5
			}
		}
	}
}

Hybrid search

This method uses the OpenSearch search pipeline you created. The payload combines the keyword (multi_match) query and the neural query.

payload = {
    "query": {
        "hybrid": {
            "queries": [
                {
                    "multi_match": {
                        "query": query_text,
                    }
                },
                {
                    "neural": {
                        "vector_embedding": {
                            "query_text": query_text,
                            "query_image": query_jpg_image,
                            "model_id": model_id,
                            "k": 5,
                        }
                    }
                },
            ]
        }
    }
}

Test search methods

To compare the search methods, you can query the index using query_text, which provides specific information about the desired output, and query_jpg_image, which provides an overall abstraction of the desired style.

query_text = "leather sandals in Petal Blush"
search_image_path = '16/16e48774.jpg'
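
To run any of these payloads, the helper functions send them to the index search endpoint; for hybrid search, the search_pipeline query parameter routes the request through the pipeline created earlier. The following sketch reuses the notebook's constants and auth object, with illustrative result formatting:

import base64

import requests

with open(search_image_path, "rb") as f:
    query_jpg_image = base64.b64encode(f.read()).decode("utf-8")

response = requests.get(
    f"{OPENSEARCH_ENDPOINT}/{OPENSEARCH_INDEX_NAME}/_search",
    params={"search_pipeline": OPENSEARCH_SEARCH_PIPELINE_NAME},  # hybrid search only
    json=payload,  # keyword, semantic, or hybrid payload from the previous sections
    headers={"Content-Type": "application/json"},
    auth=open_search_auth,
)
for hit in response.json()["hits"]["hits"][:3]:
    print(hit["_score"], hit["_source"].get("item_name"))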

Keyword search

The following output lists the top three keyword search results. The keyword search successfully located leather sandals in the color Petal Blush, but it didn’t take the desired style into consideration.

--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B06XH8M37Q
Item Name: Amazon Brand - The Fix Women's Farah Single Buckle Platform Dress Sandal, Petal Blush, 6.5 B US
Fabric Type: 100% Leather	 Material: None 	 Color: Petal Blush	 Style: Farah Single Buckle Platform Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B01MSCV2YB
Item Name: Amazon Brand - The Fix Women's Conley Lucite Heel Dress Sandal,Petal Blush,7.5 B US
Fabric Type: Leather	 Material: Suede 	 Color: Petal Blush	 Style: Conley Lucite Heel Sandal
--------------------------------------------------------------------------------------------------------------------------------

Semantic search

Semantic search successfully located leather sandals and considered the desired style. However, similarity to the provided image took priority over the specific color given in query_text.

--------------------------------------------------------------------------------------------------------------------------------
Score: 0.7072 	 Item ID: B01MZF96N7
Item Name: Amazon Brand - The Fix Women's Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather	 Material: Suede 	 Color: Havana Tan	 Style: Bonilla Block Heel Cutout Tribal Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.7018 	 Item ID: B01MUG3C0Q
Item Name: Amazon Brand - The Fix Women's Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic	 Material: Leather 	 Color: Light Rose/Gold	 Style: Farrell Cutout Tribal Square Toe Flat Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6858 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------

Hybrid search

Hybrid search returned results similar to the semantic search results because both use the same embeddings model. However, by combining the outputs of keyword and semantic search, the ranking of the Petal Blush sandal that most closely matches query_jpg_image increases, moving it to the top of the results list.

--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6838 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6 	 Item ID: B01MZF96N7
Item Name: Amazon Brand - The Fix Women's Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather	 Material: Suede 	 Color: Havana Tan	 Style: Bonilla Block Heel Cutout Tribal Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.5198 	 Item ID: B01MUG3C0Q
Item Name: Amazon Brand - The Fix Women's Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic	 Material: Leather 	 Color: Light Rose/Gold	 Style: Farrell Cutout Tribal Square Toe Flat Sandal
--------------------------------------------------------------------------------------------------------------------------------

Clean up

After you complete this walkthrough, clean up all the resources you created as part of this post. This is an important step to make sure you don’t incur any unexpected charges. If you used an existing OpenSearch Service domain, the Cleanup section of the notebook suggests cleanup actions, including deleting the index, undeploying the model, deleting the model, deleting the model group, and deleting the Amazon Bedrock connector. If you created an OpenSearch Service domain exclusively for this exercise, you can skip these actions and delete the domain.

Conclusion

In this post, we explained how to implement multimodal hybrid search by combining keyword and semantic search capabilities using Amazon Bedrock and Amazon OpenSearch Service. We showcased a solution that uses Amazon Titan Multimodal Embeddings G1 to generate embeddings for text and images, enabling users to search using both modalities. The hybrid approach combines the strengths of keyword search and semantic search, delivering accurate and relevant results to customers.

We encourage you to test the notebook in your own account and get firsthand experience with hybrid search variations. In addition to the outputs shown in this post, we provide a few variations in the notebook. If you’re interested in using custom embeddings models in Amazon SageMaker AI instead, see Hybrid Search with Amazon OpenSearch Service. If you want a solution that offers semantic search only, see Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless and Build multimodal search with Amazon OpenSearch Service.


About the Authors

Renan Bertolazzi is an Enterprise Solutions Architect helping customers realize the potential of cloud computing on AWS. In this role, Renan is a technical leader advising executives and engineers on cloud solutions and strategies designed to innovate, simplify, and deliver results.

Birender Pal is a Senior Solutions Architect at AWS, where he works with strategic enterprise customers to design scalable, secure and resilient cloud architectures. He supports digital transformation initiatives with a focus on cloud-native modernization, machine learning, and Generative AI. Outside of work, Birender enjoys experimenting with recipes from around the world.

Sarath Krishnan is a Senior Solutions Architect with Amazon Web Services. He is passionate about enabling enterprise customers on their digital transformation journey. Sarath has extensive experience in architecting highly available, scalable, cost-effective, and resilient applications on the cloud. His area of focus includes DevOps, machine learning, MLOps, and generative AI.

Read More

Build an AI-powered document processing platform with open source NER model and LLM on Amazon SageMaker

Build an AI-powered document processing platform with open source NER model and LLM on Amazon SageMaker

Archival data in research institutions and national laboratories represents a vast repository of historical knowledge, yet much of it remains inaccessible due to factors like limited metadata and inconsistent labeling. Traditional keyword-based search mechanisms are often insufficient for locating relevant documents efficiently, requiring extensive manual review to extract meaningful insights.

To address these challenges, a U.S. National Laboratory has implemented an AI-driven document processing platform that integrates named entity recognition (NER) and large language models (LLMs) on Amazon SageMaker AI. This solution improves the findability and accessibility of archival records by automating metadata enrichment, document classification, and summarization. By using Mixtral-8x7B for abstractive summarization and title generation, alongside a BERT-based NER model for structured metadata extraction, the system significantly improves the organization and retrieval of scanned documents.

Designed with a serverless, cost-optimized architecture, the platform provisions SageMaker endpoints dynamically, providing efficient resource utilization while maintaining scalability. The integration of modern natural language processing (NLP) and LLM technologies enhances metadata accuracy, enabling more precise search functionality and streamlined document management. This approach supports the broader goal of digital transformation, making sure that archival data can be effectively used for research, policy development, and institutional knowledge retention.

In this post, we discuss how you can build an AI-powered document processing platform with open source NER and LLMs on SageMaker.

Solution overview

The NER & LLM Gen AI Application is a document processing solution built on AWS that combines NER and LLMs to automate document analysis at scale. The system addresses the challenges of processing large volumes of textual data by using two key models: Mixtral-8x7B for text generation and summarization, and a BERT NER model for entity recognition.

The following diagram illustrates the solution architecture.

The architecture implements a serverless design with dynamically managed SageMaker endpoints that are created on demand and destroyed after use, optimizing performance and cost-efficiency. The application follows a modular structure with distinct components handling different aspects of document processing, including extractive summarization, abstractive summarization, title generation, and author extraction. These modular components can be removed, replaced, duplicated, or used as patterns for new pipelines, maximizing reusability.

The processing workflow begins when documents are detected in the Extracts Bucket, triggering a comparison against existing processed files to prevent redundant operations. The system then orchestrates the creation of necessary model endpoints, processes documents in batches for efficiency, and automatically cleans up resources upon completion. Multiple specialized Amazon Simple Storage Service (Amazon S3) buckets store the different types of outputs.

Solution components

Storage architecture

The application uses a multi-bucket Amazon S3 storage architecture designed for clarity, efficient processing tracking, and clear separation of document processing stages. Each bucket serves a specific purpose in the pipeline, providing organized data management and simplified access control. Amazon DynamoDB is used to track the processing of each document.

The bucket types are as follows:

  • Extracts – Source documents for processing
  • Extractive summary – Key sentence extractions
  • Abstractive summary – LLM-generated summaries
  • Generated titles – LLM-generated titles
  • Author information – Name extraction using NER
  • Model weights – ML model storage

SageMaker endpoints

The SageMaker endpoints in this application represent a dynamic, cost-optimized approach to machine learning (ML) model deployment. Rather than maintaining constantly running endpoints, the system creates them on demand when document processing begins and automatically stops them upon completion. Two primary endpoints are managed: one for the Mixtral-8x7B LLM, which handles text generation tasks including abstractive summarization and title generation, and another for the BERT-based NER model responsible for author extraction. This endpoint-based architecture decouples model inference from the rest of the processing, allowing independent scaling, versioning, and maintenance of each component. The decoupled nature of the endpoints also provides flexibility to update or replace individual models without impacting the broader system architecture.

The endpoint lifecycle is orchestrated through dedicated AWS Lambda functions that handle creation and deletion. When processing is triggered, endpoints are automatically initialized and model artifacts are downloaded from Amazon S3. The LLM endpoint is provisioned on ml.p4d.24xlarge (GPU) instances to provide sufficient computational power for the LLM operations. The NER endpoint is deployed on an ml.c5.9xlarge (CPU) instance, which is sufficient for this smaller model. To maximize cost-efficiency, the system processes documents in batches while the endpoints are active, allowing multiple documents to be processed during a single endpoint deployment cycle and maximizing the usage of the endpoints.
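
A minimal sketch of the creation and deletion handlers follows. The endpoint and configuration names are hypothetical, and the actual workflow polls endpoint status from Step Functions rather than blocking inside a Lambda function:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Hypothetical resource names for illustration
LLM_ENDPOINT_NAME = "mixtral-8x7b-batch"
LLM_ENDPOINT_CONFIG = "mixtral-8x7b-config"

def create_endpoint_handler(event, context):
    # Provision the LLM endpoint on demand before a batch run
    sagemaker_client.create_endpoint(
        EndpointName=LLM_ENDPOINT_NAME,
        EndpointConfigName=LLM_ENDPOINT_CONFIG,
    )
    return {"endpoint": LLM_ENDPOINT_NAME, "status": "Creating"}

def delete_endpoint_handler(event, context):
    # Tear the endpoint down after processing to stop instance billing
    sagemaker_client.delete_endpoint(EndpointName=LLM_ENDPOINT_NAME)
    return {"endpoint": LLM_ENDPOINT_NAME, "status": "Deleting"}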

For usage awareness, the endpoint management system includes notification mechanisms through Amazon Simple Notification Service (Amazon SNS). Users receive notifications when endpoints are destroyed, confirming that a large instance has been destroyed and is not idling. The entire endpoint lifecycle is integrated into the broader workflow through AWS Step Functions, providing coordinated processing across all components of the application.

Step Functions workflow

The following figure illustrates the Step Functions workflow.

The application implements a processing pipeline through AWS Step Functions, orchestrating a series of Lambda functions that handle distinct aspects of document analysis. Multiple documents are processed in batches while endpoints are active, maximizing resource utilization. When processing is complete, the workflow automatically triggers endpoint deletion, preventing unnecessary resource consumption.

The highly modular Lambda functions are designed for flexibility and extensibility, enabling their adaptation for diverse use cases beyond their default implementations. For example, the abstractive summarization can be reused to do QnA or other forms of generation, and the NER model can be used to recognize other entity types such as organizations or locations.

Logical flow

The document processing workflow orchestrates multiple stages of analysis that operate in both parallel and sequential patterns. The Step Functions state machine coordinates the movement of documents through extractive summarization, abstractive summarization, title generation, and author extraction processes. Each stage is managed as a discrete step, with clear input and output specifications, as illustrated in the following figure.

In the following sections, we look at each step of the logical flow in more detail.

Extractive summarization

The extractive summarization process employs the TextRank algorithm, powered by sumy and NLTK libraries, to identify and extract the most significant sentences from source documents. This approach treats sentences as nodes within a graph structure, where the importance of each sentence is determined by its relationships and connections to other sentences. The algorithm analyzes these interconnections to identify key sentences that best represent the document’s core content, functioning similarly to how an editor would select the most important passages from a text. This method preserves the original wording while reducing the document to its most essential components.
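
A minimal sketch of this step with the sumy and NLTK libraries might look like the following; the sentence count is an illustrative parameter:

import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

nltk.download("punkt", quiet=True)  # sentence tokenizer data used by sumy

def extractive_summary(document_text, sentence_count=10):
    # Rank sentences as graph nodes and keep the most central ones
    parser = PlaintextParser.from_string(document_text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    return " ".join(str(sentence) for sentence in summarizer(parser.document, sentence_count))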

Generate title

The title generation process uses the Mixtral-8x7B model but focuses on creating concise, descriptive titles that capture the document’s main theme. It uses the extractive summary as input to provide efficiency and focus on key content. The LLM is prompted to analyze the main topics and themes present in the summary and generate an appropriate title that effectively represents the document’s content. This approach makes sure that generated titles are both relevant and informative, providing users with a quick understanding of the document’s subject matter without needing to read the full text.

Abstractive summarization

Abstractive summarization also uses the Mixtral-8x7B LLM to generate entirely new text that captures the essence of the document. Unlike extractive summarization, this method doesn’t simply select existing sentences, but creates new content that paraphrases and restructures the information. The process takes the extractive summary as input, which helps reduce computation time and costs by focusing on the most relevant content. This approach results in summaries that read more naturally and can effectively condense complex information into concise, readable text.

Extract author

Author extraction employs a BERT NER model to identify and classify author names within documents. The process specifically focuses on the first 1,500 characters of each document, where author information typically appears. The system follows a three-stage process: first, it detects potential name tokens with confidence scoring; second, it assembles related tokens into complete names; and finally, it validates the assembled names to provide proper formatting and eliminate false positives. The model can recognize various entity types (PER, ORG, LOC, MISC) but is specifically tuned to identify person names in the context of document authorship.
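
The following is a minimal sketch of this stage, using a public BERT NER checkpoint (dslim/bert-base-NER) as a stand-in for the solution's model and an illustrative confidence threshold:

from transformers import pipeline

# Aggregation merges word-piece tokens into complete entity spans
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def extract_authors(document_text, min_score=0.85):
    # Author information typically appears in the first 1,500 characters
    header = document_text[:1500]
    return [
        entity["word"]
        for entity in ner(header)
        if entity["entity_group"] == "PER" and entity["score"] >= min_score
    ]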

Cost and Performance

The solution achieves remarkable throughput by processing 100,000 documents within a 12-hour window. Key architectural decisions drive both performance and cost optimization. By implementing extractive summarization as an initial step, the system reduces input tokens by 75-90% (depending on the size of the document), substantially decreasing the workload for downstream LLM processing. The implementation of a dedicated NER model for author extraction yields an additional 33% reduction in LLM calls by bypassing the need for the more resource-intensive language model. These strategic optimizations create a compound effect: they accelerate processing speeds while simultaneously reducing operational costs, establishing the platform as an efficient and cost-effective solution for enterprise-scale document processing. To estimate the cost of processing 100,000 documents, multiply 12 by the hourly cost of the ml.p4d.24xlarge instance in your AWS Region. Instance costs vary by Region and may change over time, so consult current pricing for accurate projections.

Deploy the solution

To deploy the solution, follow the instructions in the GitHub repo.

Clean up

Cleanup instructions are also provided in the GitHub repo.

Conclusion

The NER & LLM Gen AI Application represents an organizational advancement in automated document processing, using powerful language models in an efficient serverless architecture. Through its implementation of both extractive and abstractive summarization, named entity recognition, and title generation, the system demonstrates the practical application of modern AI technologies in handling complex document analysis tasks. The application’s modular design and flexible architecture enable organizations to adapt and extend its capabilities to meet their specific needs, while the careful management of AWS resources through dynamic endpoint creation and deletion maintains cost-effectiveness. As organizations continue to face growing demands for efficient document processing, this solution provides a scalable, maintainable and customizable framework for automating and streamlining these workflows.


About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Dr. Ian Lunsford is an Aerospace Cloud Consultant at AWS Professional Services. He integrates cloud services into aerospace applications. Additionally, Ian focuses on building AI/ML solutions using AWS services.

Max Rathmann is a Senior DevOps Consultant at Amazon Web Services, where she specializes in architecting cloud-native, server-less applications. She has a background in operationalizing AI/ML solutions and designing MLOps solutions with AWS Services.

Michael Massey is a Cloud Application Architect at Amazon Web Services, where he specializes in building frontend and backend cloud-native applications. He designs and implements scalable and highly-available solutions and architectures that help customers achieve their business goals.

Jeff Ryan is a DevOps Consultant at AWS Professional Services, specializing in AI/ML, automation, and cloud security implementations. He focuses on helping organizations leverage AWS services like Bedrock, Amazon Q, and SageMaker to build innovative solutions. His expertise spans MLOps, GenAI, serverless architectures, and Infrastructure as Code (IaC).

Dr. Brian Weston is a research manager at the Center for Applied Scientific Computing, where he is the AI/ML Lead for the Digital Twins for Additive Manufacturing Strategic Initiative, a project focused on building digital twins for certification and qualification of 3D printed components. He also holds a program liaison role between scientists and IT staff, where Weston champions the integration of cloud computing with digital engineering transformation, driving efficiency and innovation for mission science projects at the laboratory.

Ian Thompson is a Data Engineer at Enterprise Knowledge, specializing in graph application development and data catalog solutions. His experience includes designing and implementing graph architectures that improve data discovery and analytics across organizations. He is also the #1 Square Off player in the world.

Anna D’Angela is a Data Engineer at Enterprise Knowledge within the Semantic Engineering and Enterprise AI practice. She specializes in the design and implementation of knowledge graphs.

Read More

Protect sensitive data in RAG applications with Amazon Bedrock

Protect sensitive data in RAG applications with Amazon Bedrock

Retrieval Augmented Generation (RAG) applications have become increasingly popular due to their ability to enhance generative AI tasks with contextually relevant information. Implementing RAG-based applications requires careful attention to security, particularly when handling sensitive data. The protection of personally identifiable information (PII), protected health information (PHI), and confidential business data is crucial because this information flows through RAG systems. Failing to address these security considerations can lead to significant risks and potential data breaches. For healthcare organizations, financial institutions, and enterprises handling confidential information, these risks can result in regulatory compliance violations and breach of customer trust. See the OWASP Top 10 for Large Language Model Applications to learn more about the unique security risks associated with generative AI applications.

Developing a comprehensive threat model for your generative AI applications can help you identify potential vulnerabilities related to sensitive data leakage, prompt injections, unauthorized data access, and more. To assist in this effort, AWS provides a range of generative AI security strategies that you can use to create appropriate threat models.

Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give foundation models (FMs) and agents contextual information from your private data sources to deliver more relevant and accurate responses tailored to your specific needs. Additionally, with Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can redact sensitive information such as PII to protect privacy using Amazon Bedrock Guardrails.

RAG workflow: Converting data to actionable knowledge

RAG consists of two major steps:

  • Ingestion – Preprocessing unstructured data, which includes converting the data into text documents and splitting the documents into chunks. Document chunks are then encoded with an embedding model to convert them to document embeddings. These encoded document embeddings, along with the original document chunks in text, are then stored in a vector store, such as Amazon OpenSearch Service.
  • Augmented retrieval – At query time, the user’s query is first encoded with the same embedding model to convert the query into a query embedding. The generated query embedding is then used to perform a similarity search on the stored document embeddings to find and retrieve document chunks semantically similar to the query. After the document chunks are retrieved, the user prompt is augmented by passing the retrieved chunks as additional context, so that the text generation model can answer the user query using the retrieved context. If sensitive data isn’t sanitized before ingestion, it might be retrieved from the vector store and inadvertently leaked to unauthorized users as part of the model response.

The following diagram shows the architectural workflow of a RAG system, illustrating how a user’s query is processed through multiple stages to generate an informed response.

Bedrock Knowledge Base Flow

Solution overview

In this post, we present two architecture patterns for protecting sensitive data when building RAG-based applications using Amazon Bedrock Knowledge Bases: data redaction at the storage level and role-based access.

Data redaction at storage level – Identifying and redacting (or masking) sensitive data before storing them to the vector store (ingestion) using Amazon Bedrock Knowledge Bases. This zero-trust approach to data sensitivity reduces the risk of sensitive information being inadvertently disclosed to unauthorized users.

Role-based access to sensitive data – Controlling selective access to sensitive information based on user roles and permissions during retrieval. This approach is best in situations where sensitive data needs to be stored in the vector store, such as in healthcare settings with distinct user roles like administrators (doctors) and non-administrators (nurses or support personnel).

For all data stored in Amazon Bedrock, the AWS shared responsibility model applies.

Let’s dive in to understand how to implement the data redaction at storage level and role-based access architecture patterns effectively.

Scenario 1: Identify and redact sensitive data before ingesting into the vector store

The ingestion flow implements a four-step process to help protect sensitive data when building RAG applications with Amazon Bedrock:

  1. Source document processing – An AWS Lambda function monitors incoming text documents landing in a source Amazon Simple Storage Service (Amazon S3) bucket and triggers an Amazon Comprehend PII redaction job to identify and redact (or mask) sensitive data in the documents. An Amazon EventBridge rule invokes the Lambda function every 5 minutes. The document processing pipeline described here only processes text documents; to handle documents containing embedded images, you should implement additional preprocessing steps to extract and analyze images separately before ingestion.
  2. PII identification and redaction – The Amazon Comprehend PII redaction job analyzes the text content to identify and redact PII entities. For example, the job identifies and redacts sensitive data entities like name, email, address, and other financial PII entities.
  3. Deep security scanning – After redaction, documents move to another folder where Amazon Macie verifies redaction effectiveness and identifies any remaining sensitive data objects. Documents flagged by Macie go to a quarantine bucket for manual review, while cleared documents move to a redacted bucket ready for ingestion. For more details on data ingestion, see Sync your data with your Amazon Bedrock knowledge base.
  4. Secure knowledge base integration – Redacted documents are ingested into the knowledge base through a data ingestion job. In case of multi-modal content, for enhanced security, consider implementing:
    • A dedicated image extraction and processing pipeline.
    • Image analysis to detect and redact sensitive visual information.
    • Amazon Bedrock Guardrails to filter inappropriate image content during retrieval.

This multi-layered approach focuses on securing text content while highlighting the importance of implementing additional safeguards for image processing. Organizations should evaluate their multi-modal document requirements and extend the security framework accordingly.

Ingestion flow

The following illustration demonstrates a secure document processing pipeline for handling sensitive data before ingestion into Amazon Bedrock Knowledge Bases.

Scenario 1 - Ingestion Flow

The high-level steps are as follows:

  1. The document ingestion flow begins when documents containing sensitive data are uploaded to a monitored inputs folder in the source bucket. An EventBridge rule triggers a Lambda function (ComprehendLambda).
  2. The ComprehendLambda function monitors for new files in the inputs folder of the source bucket and moves landed files to a processing folder. It then launches an asynchronous Amazon Comprehend PII redaction analysis job and records the job ID and status in an Amazon DynamoDB JobTracking table for monitoring job completion. The Amazon Comprehend PII redaction job automatically redacts sensitive elements such as names, addresses, phone numbers, Social Security numbers, driver’s license IDs, and banking information, replacing each identified PII entity with a placeholder token for its entity type, such as [NAME] or [SSN]. The entities to mask can be configured using RedactionConfig; for more information, see Redacting PII entities with asynchronous jobs (API). The MaskMode in RedactionConfig is set to REPLACE_WITH_PII_ENTITY_TYPE instead of MASK, because many documents would contain the same MaskCharacter, which would degrade retrieval quality. After completion, the redacted files move to the for_macie_scan folder for secondary scanning (a minimal sketch of this API call follows the list).
  3. The secondary verification phase employs Macie for additional sensitive data detection on the redacted files. Another Lambda function (MacieLambda) monitors the completion of the Amazon Comprehend PII redaction job. When the job is complete, the function triggers a Macie one-time sensitive data detection job with files in the for_macie_scan folder.
  4. The final stage integrates with the Amazon Bedrock knowledge base. The findings from Macie determine the next steps: files with high severity ratings (3 or higher) are moved to a quarantine folder for human review by authorized personnel with appropriate permissions and access controls, whereas files with low severity ratings are moved to a designated redacted bucket, which then triggers a data ingestion job to the Amazon Bedrock knowledge base.
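
The following sketch shows the Amazon Comprehend call at the core of step 2. The job name, bucket paths, and entity list are illustrative, and the data access role is assumed to exist:

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.start_pii_entities_detection_job(
    JobName="redact-incoming-documents",  # hypothetical job name
    Mode="ONLY_REDACTION",
    RedactionConfig={
        "PiiEntityTypes": ["NAME", "ADDRESS", "PHONE", "SSN", "BANK_ACCOUNT_NUMBER"],
        # Replace each entity with a placeholder token such as [NAME] or [SSN]
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    },
    InputDataConfig={"S3Uri": "s3://source-bucket/processing/"},       # illustrative path
    OutputDataConfig={"S3Uri": "s3://source-bucket/for_macie_scan/"},  # illustrative path
    DataAccessRoleArn=COMPREHEND_DATA_ACCESS_ROLE_ARN,  # assumed to be defined
    LanguageCode="en",
)
job_id = response["JobId"]  # recorded in the DynamoDB JobTracking table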

This process helps prevent sensitive details from being exposed when the model generates responses based on retrieved data.

Augmented retrieval flow

The augmented retrieval flow diagram shows how user queries are processed securely. It illustrates the complete workflow from user authentication through Amazon Cognito to response generation with Amazon Bedrock, including guardrail interventions that help prevent policy violations in both inputs and outputs.

Scenario 1 - Retrieval Flow

The high-level steps are as follows:

  1. For our demo, we use a web application UI built using Streamlit. The web application launches with a login form with user name and password fields.
  2. The user enters the credentials and logs in. User credentials are authenticated using Amazon Cognito user pools. Amazon Cognito acts as our OpenID connect (OIDC) identity provider (IdP) to provide authentication and authorization services for this application. After authentication, Amazon Cognito generates and returns identity, access and refresh tokens in JSON web token (JWT) format back to the web application. Refer to Understanding user pool JSON web tokens (JWTs) for more information.
  3. After the user is authenticated, they are logged in to the web application, where an AI assistant UI is presented to the user. The user enters their query (prompt) in the assistant’s text box. The query is then forwarded using a REST API call to an Amazon API Gateway endpoint along with the access tokens in the header.
  4. API Gateway forwards the payload along with the claims included in the header to a conversation orchestrator Lambda function.
  5. The conversation orchestrator Lambda function processes the user prompt and model parameters received from the UI and calls the RetrieveAndGenerate API against the Amazon Bedrock knowledge base (a minimal sketch of this call follows the list). Input guardrails are first applied to this request to perform input validation on the user query.
    • The guardrail evaluates and applies predefined responsible AI policies using content filters, denied topic filters and word filters on user input. For more information on creating guardrail filters, see Create a guardrail.
    • If the predefined input guardrail policies are triggered on the user input, the guardrails intervene and return a preconfigured message like, “Sorry, your query violates our usage policy.”
    • Requests that don’t trigger a guardrail policy retrieve documents from the knowledge base and generate a response using the RetrieveAndGenerate API. Optionally, if users choose to run Retrieve separately, guardrails can also be applied at this stage. Guardrails during document retrieval can help block sensitive data returned from the vector store.
  6. During retrieval, Amazon Bedrock Knowledge Bases encodes the user query using the Amazon Titan Text v2 embeddings model to generate a query embedding.
  7. Amazon Bedrock Knowledge Bases performs a similarity search with the query embedding against the document embeddings in the OpenSearch Service vector store and retrieves top-k chunks. Optionally, post-retrieval, you can incorporate a reranking model to improve the retrieved results quality from the OpenSearch vector store. Refer to Improve the relevance of query responses with a reranker model in Amazon Bedrock for more details.
  8. Finally, the user prompt is augmented with the retrieved document chunks from the vector store as context and the final prompt is sent to an Amazon Bedrock foundation model (FM) for inference. Output guardrail policies are again applied post-response generation. If the predefined output guardrail policies are triggered, the model generates a predefined response like “Sorry, your query violates our usage policy.” If no policies are triggered, then the large language model (LLM) generated response is sent to the user.
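
The following is a minimal sketch of the RetrieveAndGenerate call from step 5 with a guardrail attached; the knowledge base, model, and guardrail identifiers are assumed to be configured elsewhere:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": user_query},  # the user prompt forwarded by API Gateway
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,  # assumed to be defined
            "modelArn": MODEL_ARN,                 # generation model ARN
            "generationConfiguration": {
                "guardrailConfiguration": {
                    "guardrailId": GUARDRAIL_ID,
                    "guardrailVersion": GUARDRAIL_VERSION,
                }
            },
        },
    },
)
answer = response["output"]["text"]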

To deploy Scenario 1, follow the instructions on GitHub.

Scenario 2: Implement role-based access to PII data during retrieval

In this scenario, we demonstrate a comprehensive security approach that combines role-based access control (RBAC) with intelligent PII guardrails for RAG applications. It integrates Amazon Bedrock with AWS identity services to automatically enforce security through different guardrail configurations for admin and non-admin users.

The solution uses the metadata filtering capabilities of Amazon Bedrock Knowledge Bases to dynamically filter documents during similarity searches using metadata attributes assigned before ingestion. For example, admin and non-admin metadata attributes are created and attached to relevant documents before the ingestion process. During retrieval, the system returns only the documents with metadata matching the user’s security role and permissions and applies the relevant guardrail policies to either mask or block sensitive data detected on the LLM output.
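
As a sketch, such a retrieval filter can be expressed as follows; the access_level metadata key is hypothetical and would be attached to documents before ingestion:

# Passed as the retrieval configuration to the Retrieve or RetrieveAndGenerate API
retrieval_configuration = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,
        # Return only chunks whose metadata matches the caller's role
        "filter": {"equals": {"key": "access_level", "value": user_role}},
    }
}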

This metadata-driven approach, combined with features like custom guardrails, real-time PII detection, masking, and comprehensive access logging creates a robust framework that maintains the security and utility of the RAG application while enforcing RBAC.

The following diagram illustrates how RBAC works with metadata filtering in the vector database.

Amazon Bedrock Knowledge Bases metadata filtering

For a detailed understanding of how metadata filtering works, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.

Augmented retrieval flow

The augmented retrieval flow diagram shows how user queries are processed securely based on role-based access.

Scenario 2 - Retrieval flow

The workflow consists of the following steps:

  1. The user is authenticated using an Amazon Cognito user pool, which generates a validation token after successful authentication.
  2. The user query is sent using an API call along with the authentication token through Amazon API Gateway.
  3. Amazon API Gateway forwards the payload and claims to an integration Lambda function.
  4. The Lambda function extracts the claims from the header, checks the user’s role, and determines whether to use an admin guardrail or a non-admin guardrail based on the access level.
  5. Next, the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is invoked along with the guardrail applied on the user input.
  6. Amazon Bedrock Knowledge Bases embeds the query using the Amazon Titan Text v2 embeddings model.
  7. Amazon Bedrock Knowledge Bases performs similarity searches on the OpenSearch Service vector database and retrieves relevant chunks (optionally, you can improve the relevance of query responses using a reranker model in the knowledge base).
  8. The user prompt is augmented with the retrieved context from the previous step and sent to the Amazon Bedrock FM for inference.
  9. Based on the user role, the LLM output is evaluated against defined Responsible AI policies using either admin or non-admin guardrails.
  10. Based on guardrail evaluation, the system either returns a “Sorry! Cannot Respond” message if the guardrail intervenes, or delivers an appropriate response with no masking on the output for admin users or sensitive data masked for non-admin users.

To deploy Scenario 2, follow the instructions on GitHub.

This security architecture combines Amazon Bedrock guardrails with granular access controls to automatically manage sensitive information exposure based on user permissions. The multi-layered approach makes sure organizations maintain security compliance while fully utilizing their knowledge base, proving security and functionality can coexist.

Customizing the solution

The solution offers several customization points to enhance its flexibility and adaptability:

  • Integration with external APIs – You can integrate existing PII detection and redaction solutions with this system. The Lambda function can be modified to use custom APIs for PHI or PII handling before calling the Amazon Bedrock Knowledge Bases API.
  • Multi-modal processing – Although the current solution focuses on text, it can be extended to handle images containing PII by incorporating image-to-text conversion and caption generation. For more information about using Amazon Bedrock for processing multi-modal content during ingestion, see Parsing options for your data source.
  • Custom guardrails – Organizations can implement additional specialized security measures tailored to their specific use cases.
  • Structured data handling – For queries involving structured data, the solution can be customized to include Amazon Redshift as a structured data store as opposed to OpenSearch Service. Data masking and redaction on Amazon Redshift can be achieved by applying dynamic data masking (DDM) policies, including fine-grained DDM policies like role-based access control and column-level policies using conditional dynamic data masking.
  • Agentic workflow integration – When incorporating an Amazon Bedrock knowledge base with an agentic workflow, additional safeguards can be implemented to protect sensitive data from external sources, such as API calls, tool use, agent action groups, session state, and long-term agentic memory.
  • Response streaming support – The current solution uses a REST API Gateway endpoint that doesn’t support streaming. For streaming capabilities, consider WebSocket APIs in API Gateway, Application Load Balancer (ALB), or custom solutions with chunked responses using client-side reassembly or long-polling techniques.

With these customization options, you can tailor the solution to your specific needs, providing a robust and flexible security framework for your RAG applications. This approach not only protects sensitive data but also maintains the utility and efficiency of the knowledge base, allowing users to interact with the system while automatically enforcing role-appropriate information access and PII handling.

Shared security responsibility: The customer’s role

At AWS, security is our top priority and security in the cloud is a shared responsibility between AWS and our customers. With AWS, you control your data by using AWS services and tools to determine where your data is stored, how it is secured, and who has access to it. Services such as AWS Identity and Access Management (IAM) provide robust mechanisms for securely controlling access to AWS services and resources.

To enhance your security posture further, services like AWS CloudTrail and Amazon Macie offer advanced compliance, detection, and auditing capabilities. When it comes to encryption, AWS CloudHSM and AWS Key Management Service (KMS) enable you to generate and manage encryption keys with confidence.

For organizations seeking to establish governance and maintain data residency controls, AWS Control Tower offers a comprehensive solution. For more information on Data protection and Privacy, refer to Data Protection and Privacy at AWS.

While our solution demonstrates the use of PII detection and redaction techniques, it does not provide an exhaustive list of all PII types or detection methods. As a customer, you bear the responsibility for implementing the appropriate PII detection types and redaction methods using AWS services, including Amazon Bedrock Guardrails and other open source libraries. The regular expressions configured in Amazon Bedrock Guardrails within this solution serve as a reference example only and do not cover all possible variations for detecting PII types. For instance, date of birth (DOB) formats can vary widely. Therefore, it falls on you to configure Amazon Bedrock Guardrails and policies to accurately detect the PII types relevant to your use case.

Amazon Bedrock maintains strict data privacy standards. The service does not store or log your prompts and completions, nor does it use them to train AWS models or share them with third parties. We implement this through our Model Deployment Account architecture: each AWS Region where Amazon Bedrock is available has a dedicated deployment account per model provider, managed exclusively by the Amazon Bedrock service team. Model providers have no access to these accounts. When a model is delivered to AWS, Amazon Bedrock performs a deep copy of the provider’s inference and training software into these controlled accounts for deployment, making sure that model providers cannot access Amazon Bedrock logs or customer prompts and completions.

Ultimately, while we provide the tools and infrastructure, the responsibility for securing your data using AWS services rests with you, the customer. This shared responsibility model makes sure that you have the flexibility and control to implement security measures that align with your unique requirements and compliance needs, while we maintain the security of the underlying cloud infrastructure. For comprehensive information about Amazon Bedrock security, please refer to the Amazon Bedrock Security documentation.

Conclusion

In this post, we explored two approaches for securing sensitive data in RAG applications using Amazon Bedrock. The first approach focused on identifying and redacting sensitive data before ingestion into an Amazon Bedrock knowledge base, and the second demonstrated a fine-grained RBAC pattern for managing access to sensitive information during retrieval. These solutions represent just two possible approaches among many for securing sensitive data in generative AI applications.

Security is a multi-layered concern that requires careful consideration across all aspects of your application architecture. Looking ahead, we plan to dive deeper into RBAC for sensitive data within structured data stores when used with Amazon Bedrock Knowledge Bases. This can provide additional granularity and control over data access patterns while maintaining security and compliance requirements. Securing sensitive data in RAG applications requires ongoing attention to evolving security best practices, regular auditing of access patterns, and continuous refinement of your security controls as your applications and requirements grow.

To enhance your understanding of Amazon Bedrock security implementation, explore these additional resources:

The complete source code and deployment instructions for these solutions are available in our GitHub repository.

We encourage you to explore the repository for detailed implementation guidance and customize the solutions based on your specific requirements using the customization points discussed earlier.


About the authors

Praveen Chamarthi brings exceptional expertise to his role as a Senior AI/ML Specialist at Amazon Web Services, with over two decades in the industry. His passion for Machine Learning and Generative AI, coupled with his specialization in ML inference on Amazon SageMaker and Amazon Bedrock, enables him to empower organizations across the Americas to scale and optimize their ML operations. When he’s not advancing ML workloads, Praveen can be found immersed in books or enjoying science fiction films. Connect with him on LinkedIn to follow his insights.

Srikanth Reddy is a Senior AI/ML Specialist with Amazon Web Services. He is responsible for providing deep, domain-specific expertise to enterprise customers, helping them use AWS AI and ML capabilities to their fullest potential. You can find him on LinkedIn.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.

Brandon Rooks Sr. is a Cloud Security Professional with 20+ years of experience in the IT and Cybersecurity field. Brandon joined AWS in 2019, where he dedicates himself to helping customers proactively enhance the security of their cloud applications and workloads. Brandon is a lifelong learner, and holds the CISSP, AWS Security Specialty, and AWS Solutions Architect Professional certifications. Outside of work, he cherishes moments with his family, engaging in various activities such as sports, gaming, music, volunteering, and traveling.

Vikash Garg is a Principal Engineer at Amazon Bedrock with almost 4 years of experience in building AI/ML services. He has a decade of experience in building large-scale systems. He now focuses on building Amazon Bedrock Guardrails. In his free time, he enjoys hiking and traveling.

Read More

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version now supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R1, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).

This release introduces significant performance improvements, expanded model compatibility with multimodality (that is, the ability to understand and analyze text-to-text, images-to-text, and text-to-images data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with the highest performance at scale.

What’s new?

LMI v15 brings several enhancements that improve throughput, latency, and usability:

  1. An async mode that directly integrates with vLLM’s AsyncLLMEngine for improved request handling. This mode creates a more efficient background loop that continuously processes incoming requests, enabling it to handle multiple concurrent requests and stream outputs with higher throughput than the previous Rolling-Batch implementation in v14.
  2. Support for the vLLM V1 engine, which delivers up to 111% higher throughput compared to the previous V0 engine for smaller models at high concurrency. This performance improvement comes from reduced CPU overhead, optimized execution paths, and more efficient resource utilization in the V1 architecture. LMI v15 supports both V1 and V0 engines, with V1 being the default. If you have a need to use V0, you can use the V0 engine by specifying VLLM_USE_V1=0. vLLM V1’s engine also comes with a core re-architecture of the serving engine with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and Flash Attention 3. For more information, see the vLLM Blog.
  3. Expanded API schema support with three flexible options to allow seamless integration with applications built on popular API patterns:
    1. Message format compatible with the OpenAI Chat Completions API.
    2. OpenAI Completions format.
    3. Text Generation Inference (TGI) schema to support backward compatibility with older models.
  4. Multimodal support, with enhanced capabilities for vision-language models including optimizations such as multimodal prefix caching
  5. Built-in support for function calling and tool calling, enabling sophisticated agent-based workflows.

Enhanced model support

LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to:

  • Llama 4 – Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E-Instruct
  • Gemma 3 – Google’s lightweight and efficient models, known for their strong performance despite smaller size
  • Qwen 2.5 – Alibaba’s advanced models including QwQ 2.5 and Qwen2-VL with multimodal capabilities
  • Mistral AI models – High-performance models from Mistral AI that offer efficient scaling and specialized capabilities
  • DeepSeek-R1/V3 – State-of-the-art reasoning models

Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.
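As a minimal sketch (values here are illustrative; the full deployment walkthrough appears later in this post), the environment-variable configuration for such a deployment looks like the following:

    # Illustrative environment configuration: swapping HF_MODEL_ID is enough
    # to target a different supported model with the same LMI v15 container.
    vllm_config = {
        "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    }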

Benchmarks

Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:

Model | Batch size | Instance type | LMI v14 throughput [tokens/s] (V0 engine) | LMI v15 throughput [tokens/s] (V1 engine) | Improvement
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | p4d.24xlarge | 1768 | 2198 | 24%
meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37%
mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111%

The accompanying figures plot throughput at various levels of concurrency for DeepSeek-R1 Distill Llama 70B, Llama 3.1 8B Instruct, and Mistral 7B.

The async engine in LMI v15 shows its strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that the V1 engine in async mode delivers between 24% and 111% higher throughput than LMI v14 using rolling batch for the models tested, at high concurrency with batch sizes of 64 and 128. Keep the following considerations in mind for optimal performance:

  • Higher batch sizes increase concurrency but come with a natural tradeoff in terms of latency
  • Batch sizes of 4 and 8 provide the best latency for most use cases
  • Batch sizes up to 64 and 128 achieve maximum throughput with acceptable latency trade-offs

API formats

LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.

  • Chat Completions – Message format is compatible with OpenAI Chat Completions API. Use this schema for tool calling, reasoning, and multimodal use cases. Here is a sample of the invocation with the Messages API:
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

  • OpenAI Completions format – The Completions API endpoint is no longer receiving updates:
    body = {
        "prompt": "Name popular places to visit in London?",
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

  • TGI – Supports backward compatibility with older models:
    body = {
        "inputs": "Name popular places to visit in London?",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.9,
        },
        "stream": True,
    }

Getting started with LMI v15

Getting started with LMI v15 is seamless, and you can deploy with LMI v15 in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments can be managed through SageMaker AI endpoints. To deploy models, you need to specify the Hugging Face model ID, instance type, and configuration options as environment variables.

For optimal performance, we recommend the following instances:

  • Llama 4 Scout: ml.p5.48xlarge
  • DeepSeek R1/V3: ml.p5e.48xlarge
  • Qwen 2.5 VL-32B: ml.g5.12xlarge
  • Qwen QwQ 32B: ml.g5.12xlarge
  • Mistral Large: ml.g6e.48xlarge
  • Gemma3-27B: ml.g5.12xlarge
  • Llama 3.3-70B: ml.p4d.24xlarge

To deploy with LMI v15, follow these steps:

  1. Clone the notebook to your Amazon SageMaker Studio notebook or to Visual Studio Code (VS Code). You can then run the notebook to do the initial setup and deploy the model from the Hugging Face repository to the SageMaker AI endpoint. We walk through the key blocks here.
  2. LMI v15 maintains the same configuration pattern as previous versions, using environment variables in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15.
    vllm_config = {
        "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
        "HF_TOKEN": "entertoken",
        "OPTION_MAX_MODEL_LEN": "250000",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1500",
        "SERVING_FAIL_FAST": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
    }

    • HF_MODEL_ID sets the model ID from Hugging Face. You can also download the model from Amazon Simple Storage Service (Amazon S3).
    • HF_TOKEN sets the token to download the model. This is required for gated models like Llama 4.
    • OPTION_MAX_MODEL_LEN sets the maximum model context length.
    • OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
    • OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
    • SERVING_FAIL_FAST=true allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs; we recommend setting this flag.
    • OPTION_ROLLING_BATCH=disable turns off the rolling batch implementation of LMI, which was the default in LMI v14. We recommend using async mode instead, because this latest implementation provides better performance.
    • OPTION_ASYNC_MODE=true enables async mode.
    • OPTION_ENTRYPOINT provides the entry point for vLLM’s async integration.
  3. Set the latest container (in this example we used 0.33.0-lmi15.0.0-cu128), AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images.
  4. Deploy the model to the endpoint using model.deploy().
    import sagemaker
    from sagemaker.model import Model

    # Execution role for the endpoint (assumes a SageMaker environment)
    role = sagemaker.get_execution_role()

    CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'
    REGION = 'us-east-1'
    # Construct container URI
    container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'
    
    # Select instance type
    instance_type = "ml.p5.48xlarge"
    
    model = Model(image_uri=container_uri,
                  role=role,
                  env=vllm_config)
    endpoint_name = sagemaker.utils.name_from_base("Llama-4")
    
    print(endpoint_name)
    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        container_startup_health_check_timeout=1800
    )

  5. Invoke the model. SageMaker inference provides two APIs to invoke the model: InvokeEndpoint and InvokeEndpointWithResponseStream. You can choose either option based on your needs.
    import json

    import boto3

    # Create SageMaker Runtime client
    smr_client = boto3.client('sagemaker-runtime')
    ## Add your endpoint here
    endpoint_name = ''
    
    # Invoke with messages format
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }
    
    # Invoke with endpoint streaming
    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
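    The streamed response arrives as an event stream of payload parts. A minimal sketch for draining it (assuming the chunked output produced by the chat completions schema shown earlier) looks like this:

    # Read the event stream returned by invoke_endpoint_with_response_stream.
    # Each PayloadPart carries the raw bytes of one streamed chunk.
    for event in resp["Body"]:
        part = event.get("PayloadPart")
        if part:
            print(part["Bytes"].decode("utf-8"), end="")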

To run multi-modal inference with Llama-4 Scout, see the notebook for the full code sample to run inference requests with images.

Conclusion

Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.

We encourage you to explore this release for deploying your generative AI models.

Check out the provided example notebooks to start deploying models with LMI v15.


About the authors

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS, he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Accuracy evaluation framework for Amazon Q Business – Part 2

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.

In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

  • Comprehensive evaluation workflow – This ready-to-deploy solution uses AWS CloudFormation stacks to set up an Amazon Q Business application, complete with user access, a custom UI for review and evaluation, and the supporting evaluation infrastructure
  • Lightweight AWS Lambda based evaluation – Designed for users with an existing Amazon Q Business application, this streamlined solution employs an AWS Lambda function to efficiently assess the application’s accuracy

By the end of this post, you will have a clear understanding of how to implement an evaluation framework that aligns with your specific needs with a detailed walkthrough, so your Amazon Q Business application delivers accurate and reliable results.

Challenges in evaluating Amazon Q Business

Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges due to its integration of retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss key metrics that need to be included for a RAG generative AI solution.

Context recall

Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.

For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

  • Expected: The United States is the third-largest country in the world by land area, covering approximately 9.8 million square kilometers. It has a diverse range of geographical features.
  • High context recall: The United States spans approximately 9.8 million square kilometers, making it the third-largest nation globally by land area. The country’s geography is incredibly diverse, featuring the Rocky Mountains stretching from New Mexico to Alaska, the Appalachian Mountains along the eastern states, the expansive Great Plains in the central region, and arid deserts like the Mojave in the southwest.
  • Low context recall: The United States features significant geographical landmarks. Additionally, the country is home to unique ecosystems like the Everglades in Florida, a vast network of wetlands.

The following diagram illustrates the context recall workflow.

Context precision

Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.

For example, “Why is Silicon Valley great for tech startups?” might give the following answers:

  • Ground truth answer: Silicon Valley is famous for fostering innovation and entrepreneurship in the technology sector.
  • High precision context: Many groundbreaking startups originate from Silicon Valley, benefiting from a culture that encourages innovation and risk-taking.
  • Low precision context: Silicon Valley experiences a Mediterranean climate, with mild, wet winters and warm, dry summers, contributing to its appeal as a place to live and work.

The following diagram illustrates the context precision workflow.

Answer relevancy

Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.

For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

  • High relevance answer: Amazon Q Business Service is a RAG Generative AI solution designed for enterprise use. Key features include a fully managed Generative AI solutions, integration with enterprise data sources, robust security protocols, and customizable virtual assistants. It benefits enterprise customers by enabling efficient information retrieval, automating customer support tasks, enhancing employee productivity through quick access to data, and providing insights through analytics on user interactions.
  • Low relevance answer: Amazon Q Business Service is part of Amazon’s suite of cloud services. Amazon also offers online shopping and streaming services.

The following diagram illustrates the answer relevancy workflow.

Truthfulness

Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.

For example, a user might ask “What is the capital of Canada?” They could get the following responses:

  • Context: Canada’s capital city is Ottawa, located in the province of Ontario. Ottawa is known for its historic Parliament Hill, the center of government, and the scenic Rideau Canal, a UNESCO World Heritage site
  • High truthfulness answer: The capital of Canada is Ottawa
  • Low truthfulness answer: The capital of Canada is Toronto

The following diagram illustrates the truthfulness workflow.
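These four metrics can be scored programmatically with the Ragas library. The following is a minimal sketch assuming a Ragas 0.1-style API; note that Ragas names the truthfulness metric faithfulness, and that an evaluator LLM and embeddings must be configured (for example, a Bedrock model wired in through LangChain; Ragas otherwise defaults to OpenAI):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,  # Ragas's name for the truthfulness metric
    )

    # One row per evaluated question: prompt, generated answer,
    # retrieved context passages, and the ground truth reference.
    rows = {
        "question": ["What is the capital of Canada?"],
        "answer": ["The capital of Canada is Ottawa."],
        "contexts": [["Canada's capital city is Ottawa, located in Ontario."]],
        "ground_truth": ["The capital of Canada is Ottawa."],
    }

    results = evaluate(
        Dataset.from_dict(rows),
        metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
    )
    print(results)  # per-metric scores between 0 and 1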

Evaluation methods

Deciding on who should conduct the evaluation can significantly impact results. Options include:

  • Human-in-the-Loop (HITL) – Human evaluators manually assess the accuracy and relevance of responses, offering nuanced insights that automated systems might miss. However, it is a slow process and difficult to scale.
  • LLM-aided evaluation – Automated methods, such as the Ragas framework, use language models to streamline the evaluation process. However, these might not fully capture the complexities of domain-specific knowledge.

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.

Solution overview

In this post, we explore two different solutions to provide you the details of an evaluation framework, so you can use it and adapt it for your own use case.

Solution 1: End-to-end evaluation solution

For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

  • User access and UI – Authenticated users interact with a frontend UI to upload datasets, review Ragas output, and provide human feedback
  • Evaluation solution infrastructure – Core components include:
    • Ragas scoring – Automated metrics provide an initial layer of evaluation
    • HITL review – Human evaluators refine Ragas scores through the UI, providing nuanced accuracy and reliability

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.

A quick video demo of this solution is shown below:

Solution architecture

The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

  1. User access and UI – Users authenticate through Amazon Cognito, and upon successful login, interact with a Streamlit-based custom UI. This frontend allows users to upload CSV datasets to Amazon Simple Storage Service (Amazon S3), review Ragas evaluation outputs, and provide human feedback for refinement. The application exchanges the Amazon Cognito token for an AWS IAM Identity Center token, granting scoped access to Amazon Q Business.
  2. UI infrastructure – The UI is hosted behind an Application Load Balancer, supported by Amazon Elastic Compute Cloud (Amazon EC2) instances running in an Auto Scaling group for high availability and scalability.
  3. Upload dataset and trigger evaluation – Users upload a CSV file containing queries and ground truth answers to Amazon S3, which triggers an evaluation process. A Lambda function reads the CSV, stores its content in a DynamoDB table, and initiates further processing through a DynamoDB stream.
  4. Consuming DynamoDB stream – A separate Lambda function processes new entries from the DynamoDB stream, and publishes messages to an Amazon Simple Queue Service (Amazon SQS) queue, which serves as a trigger for the evaluation Lambda function.
  5. Ragas scoring – The evaluation Lambda function consumes SQS messages, sending queries (prompts) to Amazon Q Business for generating answers. It then evaluates the prompt, ground truth, and generated answer using the Ragas evaluation framework. Ragas computes automated evaluation metrics such as context recall, context precision, answer relevancy, and truthfulness. The results are stored in DynamoDB and visualized in the UI.

  6. HITL review – Authenticated users can review and refine Ragas scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.
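A compressed sketch of what the evaluation Lambda function in step 5 might look like follows; the environment variables, table layout, and the run_ragas helper are illustrative assumptions, not the repository's exact code:

    import json
    import os

    import boto3

    dynamodb = boto3.resource("dynamodb")
    qbusiness = boto3.client("qbusiness")
    table = dynamodb.Table(os.environ["RESULTS_TABLE"])  # assumed env var

    def handler(event, context):
        # Each SQS record carries one prompt/ground-truth pair.
        for record in event["Records"]:
            item = json.loads(record["body"])
            # Ask Amazon Q Business to generate an answer for the prompt.
            answer = qbusiness.chat_sync(
                applicationId=os.environ["QBUSINESS_APP_ID"],  # assumed env var
                userMessage=item["prompt"],
            )["systemMessage"]
            # Score with Ragas (hypothetical helper wrapping the evaluate call
            # shown earlier), then persist the results for the UI.
            scores = run_ragas(item["prompt"], item["ground_truth"], answer)
            table.put_item(Item={**item, "answer": answer, **scores})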

This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same AWS Region.

Deploy the CloudFormation stack

Complete the following steps to deploy the CloudFormation stack:

  1. Clone the repository or download the files to your local computer.
  2. Unzip the downloaded file (if you used this option).
  3. Using your local computer command line, change directory into ./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution.
  4. Make sure the ./deploy.sh script is executable by running chmod 755 ./deploy.sh.
  5. Execute the CloudFormation deployment script provided as follows:
    ./deploy.sh -s [CFN_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. It takes approximately 15 minutes to complete the deployment, after which you will see a similar page to the following screenshot.

Add users to Amazon Q Business

You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.

Upload the evaluation dataset through the UI

In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.

This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset:

  • What are the index types of Amazon Q Business and the features of each?
  • I want to use Q Apps, which subscription tier is required to use Q Apps?
  • What is the file size limit for Amazon Q Business via file upload?
  • What data encryption does Amazon Q Business support?
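As an illustration, the file's structure looks like the following (the ground_truth values are abridged to placeholders here):

    prompt,ground_truth
    "What are the index types of Amazon Q Business and the features of each?","<ground truth answer>"
    "I want to use Q Apps, which subscription tier is required to use Q Apps?","<ground truth answer>"
    "What is the file size limit for Amazon Q Business via file upload?","<ground truth answer>"
    "What data encryption does Amazon Q Business support?","<ground truth answer>"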

To upload the evaluation dataset, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the evals stack that you already launched.
  3. On the Outputs tab, take note of the user name and password to log in to the UI application, and choose the UI URL.

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure Chatsync API call with Amazon Q Business.

  4. Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

  5. After you log in to the custom UI used for Amazon Q evaluation, choose Upload Dataset, then upload the dataset CSV file.

After the file is uploaded, the evaluation framework will send the prompt to Amazon Q Business to generate the answer, and then send the prompt, ground truth, and answer to Ragas to evaluate. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.

After about 7 minutes, the workflow will finish, and you should see the evaluation results for the first question.

Perform HITL evaluation

After the Lambda function has completed its execution, Ragas scoring will be shown in the custom UI. Now you can review the metric scores generated using Ragas (an LLM-aided evaluation method) and provide human feedback as an evaluator for further calibration. This human-in-the-loop calibration can further improve evaluation accuracy; the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.

Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, Amazon Q Business generated answers, ground truth, and context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

  • Answer relevancy – Measures relevancy of answers. Higher scores indicate better alignment with the user input, and lower scores are given if the response is incomplete or includes redundant information.
  • Truthfulness – Verifies factual accuracy by comparing responses to verified sources. Higher scores indicate a better consistency with verified sources.
  • Context precision – Assesses the relevance and conciseness of retrieved information. Higher scores indicate that the retrieved information closely matches the query intent, reducing irrelevant data.
  • Context recall – Measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.

For this question, all metrics showed Amazon Q Business achieved a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned with a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both question and answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis, showing a lower score than expected. It’s important to calibrate the Ragas evaluation judgment with a human in the loop. You should read the question and answer carefully, and make the necessary changes to the metric score to reflect the HITL analysis. Finally, the results will be updated in DynamoDB.

Lastly, save the metric score in the CSV file, and you can download and review the final metric scores.

Solution 2: Lambda based evaluation

If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

  • Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data
  • Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch

Both solutions provide results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides sample code to evaluate the Amazon Q Business application's responses. To use this solution, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the one in the end-to-end evaluation solution, running Ragas against a test set of questions and ground truth. This lightweight solution doesn't have a custom UI, but it can provide the result metrics (context recall, context precision, answer relevancy, truthfulness) for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.
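As an illustrative sketch (namespace, metric names, and sample values are assumptions), the computed metrics could be pushed to CloudWatch as custom metrics like this:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_scores(scores: dict, namespace: str = "QBusinessEvaluation"):
        # Publish each Ragas metric as a CloudWatch custom metric so it can
        # be graphed on a dashboard or alarmed on.
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[
                {"MetricName": name, "Value": value, "Unit": "None"}
                for name, value in scores.items()
            ],
        )

    # Sample values for illustration only
    publish_scores({
        "context_recall": 0.92,
        "context_precision": 0.88,
        "answer_relevancy": 0.95,
        "truthfulness": 0.97,
    })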

Using evaluation results to improve Amazon Q Business application accuracy

This section outlines strategies to enhance key evaluation metrics—context recall, context precision, answer relevance, and truthfulness—for a RAG solution in the context of Amazon Q Business.

Context recall

Let’s examine the following problems and troubleshooting tips:

  1. Aggressive query filtering – Overly strict search filters or metadata constraints might exclude relevant records. You should review the metadata filters or boosting settings applied in Amazon Q Business to make sure they don’t unnecessarily restrict results.
  2. Data source ingestion errors – Documents from certain data sources aren’t successfully ingested into Amazon Q Business. To address this, check the document sync history report in Amazon Q Business to confirm successful ingestion and resolve ingestion errors.

Context precision

Consider the following potential issues:

  • Over-retrieval of documents – Large top-K values might retrieve semi-related or off-topic passages, which the LLM might incorporate unnecessarily. To address this, refine metadata filters or apply boosting to improve passage relevance and reduce noise in the retrieved context.
  • Poor query specificity – Broad or poorly formed user queries can yield loosely related results. You should make sure user queries are clear and specific. Train users or implement query refinement mechanisms to optimize query quality.

Answer relevance

Consider the following troubleshooting methods:

  • Partial coverage – Retrieved context addresses parts of the question but fails to cover all aspects, especially in multi-part queries. To address this, decompose complex queries into sub-questions. Instruct the LLM or a dedicated module to retrieve and answer each sub-question before composing the final response. For example:
    • Break down the query into sub-questions.
    • Retrieve relevant passages for each sub-question.
    • Compose a final answer addressing each part.
  • Context/answer mismatch – The LLM might misinterpret retrieved passages, omit relevant information, or merge content incorrectly due to hallucination. You can use prompt engineering to guide the LLM more effectively. For example, for the original query “What are the top 3 reasons for X?” you can use the rewritten prompt “List the top 3 reasons for X clearly labeled as #1, #2, and #3, based strictly on the retrieved context.”

Truthfulness

Consider the following:

  • Stale or inaccurate data sources – Outdated or conflicting information in the knowledge corpus might lead to incorrect answers. To address this, compare the retrieved context with verified sources to confirm accuracy. Collaborate with subject matter experts (SMEs) to validate the data.
  • LLM hallucination – The model might fabricate or embellish details, even with accurate retrieved context. Although Amazon Q Business is a RAG generative AI solution that should significantly reduce hallucination, it's not possible to eliminate hallucination entirely. You can measure the frequency of low context precision answers to identify patterns and quantify the impact of hallucinations, gaining an aggregated view with the evaluation solution.

By systematically examining and addressing the root causes of low evaluation metrics, you can optimize your Amazon Q Business application. From document retrieval and ranking to prompt engineering and validation, these strategies will help enhance the effectiveness of your RAG solution.

Clean up

Don’t forget to go back to the CloudFormation console and delete the CloudFormation stack to delete the underlying infrastructure that you set up, to avoid additional costs on your AWS account.

Conclusion

In this post, we outlined two evaluation solutions for Amazon Q Business: a comprehensive evaluation workflow and a lightweight Lambda based evaluation. These approaches combine automated evaluation approaches such as Ragas with human-in-the-loop validation, providing reliable and accurate assessments.

By using our guidance on how to improve evaluation metrics, you can continuously optimize your Amazon Q Business application to meet enterprise needs. Whether you're using the end-to-end solution or the lightweight approach, these frameworks provide a scalable and efficient path to improve accuracy and relevance.

To learn more about Amazon Q Business and how to evaluate Amazon Q Business results, explore these hands-on workshops:


About the authors

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She is specialized in Generative AI, Applied Data Science and IoT architecture. Currently she is part of the Amazon Bedrock team, and a Gold member/mentor in Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges.

Amit Gupta is a Senior Q Business Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.

Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.

Ricardo Aldao is a Senior Partner Solutions Architect at AWS. He is a passionate AI/ML enthusiast who focuses on supporting partners in building generative AI solutions on AWS.

Use Amazon Bedrock Intelligent Prompt Routing for cost and latency benefits

In December, we announced the preview availability for Amazon Bedrock Intelligent Prompt Routing, which provides a single serverless endpoint to efficiently route requests between different foundation models within the same model family. To do this, Amazon Bedrock Intelligent Prompt Routing dynamically predicts the response quality of each model for a request and routes the request to the model it determines is most appropriate based on cost and response quality, as shown in the following figure.

Today, we’re happy to announce the general availability of Amazon Bedrock Intelligent Prompt Routing. Over the past several months, we drove several improvements in intelligent prompt routing based on customer feedback and extensive internal testing. Our goal is to enable you to set up automated, optimal routing between large language models (LLMs) through Amazon Bedrock Intelligent Prompt Routing and its deep understanding of model behaviors within each model family, which incorporates state-of-the-art methods for training routers for different sets of models, tasks and prompts.

In this blog post, we detail various highlights from our internal testing, how you can get started, and point out some caveats and best practices. We encourage you to incorporate Amazon Bedrock Intelligent Prompt Routing into your new and existing generative AI applications. Let’s dive in!

Highlights and improvements

Today, you can either use Amazon Bedrock Intelligent Prompt Routing with the default prompt routers provided by Amazon Bedrock or configure your own prompt routers to adjust performance linearly between the two candidate LLMs. Default prompt routers (pre-configured routing systems that map performance to the more performant of the two models while lowering costs by sending easier prompts to the cheaper model) are provided by Amazon Bedrock for each model family. These routers come with predefined settings and are designed to work out of the box with specific foundation models. They provide a straightforward, ready-to-use solution without needing to configure any routing settings. Customers who tested Amazon Bedrock Intelligent Prompt Routing in preview (thank you!) could choose models in the Anthropic and Meta families. Today, you can choose more models from within the Amazon Nova, Anthropic, and Meta families, including:

  • Anthropic’s Claude family: Haiku, Sonnet 3.5 v1, Haiku 3.5, Sonnet 3.5 v2
  • Llama family: Llama 3.1 8B, 70B, 3.2 11B, 90B, and 3.3 70B
  • Nova family: Nova Pro and Nova Lite

You can also configure your own prompt routers to define your own routing configurations tailored to specific needs and preferences. These are more suitable when you require more control over how to route your requests and which models to use. In GA, you can configure your own router by selecting any two models from the same model family and then configuring the response quality difference of your router.

Adding components before invoking the selected LLM with the original prompt can add overhead. We reduced the overhead of the added components by over 20%, to approximately 85 ms (P90). Because the router preferentially invokes the less expensive model while maintaining the same baseline accuracy in the task, you can expect an overall latency and cost benefit compared to always hitting the larger/more expensive model, despite the additional overhead. This is discussed further in the following benchmark results section.

We conducted several internal tests with proprietary and public data to evaluate Amazon Bedrock Intelligent Prompt Routing metrics. First, we used average response quality gain under cost constraints (ARQGC), a normalized (0–1) performance metric for measuring routing system quality for various cost constraints, referenced against a reward model, where 0.5 represents random routing and 1 represents optimal oracle routing performance. We also captured the cost savings with intelligent prompt routing relative to using the largest model in the family, and estimated latency benefit based on average recorded time to first token (TTFT) to showcase the advantages and report them in the following table.

Model family | Average ARQGC (router overall performance) | Cost savings (%) | Latency benefit (%)
Nova | 0.75 | 35% | 9.98%
Anthropic | 0.86 | 56% | 6.15%
Meta | 0.78 | 16% | 9.38%

Cost savings and latency benefit are measured when configuring the router to match the performance of the stronger model.

How to read this table?

It’s important to pause and understand these metrics. First, results shown in the preceding table are only meant for comparing against random routing within the family (that is, improvement in ARQGC over 0.5) and not across families. Second, the results are relevant only within the family of models and are different than other model benchmarks that you might be familiar with that are used to compare models. Third, because the real cost and price change frequently and are dependent on the input and output token counts, it’s challenging to compare the real cost. To solve this problem, we define the cost savings metric as the maximum cost saved compared to the strongest LLM cost for a router to achieve a certain level of response quality. Specifically, in the example shown in the table, there’s an average 35% cost savings using the Nova family router compared to using Nova Pro for all prompts without the router.

You can expect to see varying levels of benefit based on your use case. For example, in an internal test with hundreds of prompts, we achieved 60% cost savings using Amazon Bedrock Intelligent Prompt Routing with the Anthropic family, with the response quality matching that of Claude 3.5 Sonnet v2.

What is response quality difference?

The response quality difference measures the disparity between the responses of the fallback model and the other models. A smaller value indicates that the responses are similar. A higher value indicates a significant difference in the responses between the fallback model and the other models. The choice of what you use as a fallback model is important. When configuring a response quality difference of 10% with Anthropic’s Claude 3 Sonnet as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a 10% drop in the response quality from Claude 3 Sonnet. Conversely, if you use a less expensive model such as Claude 3 Haiku as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a more than 10% increase from Claude 3 Haiku.

In the following figure, you can see that the response quality difference is set at 10% with Haiku as the fallback model. If customers want to explore optimal configurations beyond the default settings described previously, they can experiment with different response quality difference thresholds, analyze the router’s response quality, cost, and latency on their development dataset, and select the configuration that best fits their application’s requirements.

When configuring your own prompt router, you can set the threshold for response quality difference as shown in the following image of the Configure prompt router page, under Response quality difference (%) in the Amazon Bedrock console. To do this by using APIs, see How to use intelligent prompt routing.

Benchmark results

When using different model pairings, the ability of the smaller model to service a larger share of input prompts can yield significant latency and cost benefits, depending on the model choice and the use case. For example, when comparing usage of Claude 3 Haiku and Claude 3.5 Haiku along with Claude 3.5 Sonnet, we observe the following with one of our internal datasets:

Case 1: Routing between Claude 3 Haiku and Claude 3.5 Sonnet V2: Cost savings of 48% while maintaining the same response quality as Claude 3.5 Sonnet v2

Case 2: Routing between Claude 3.5 Haiku and Claude 3.5 Sonnet V2: Cost savings of 56% while maintaining the same response quality as Claude 3.5 Sonnet v2

As you can see in case 1 and case 2, as model capabilities for less expensive models improve with respect to more expensive models in the same family (for example Claude 3 Haiku to 3.5 Haiku), you can expect more complex tasks to be reliably solved by them, therefore causing a higher percentage of routing to the less expensive model while still maintaining the same overall accuracy in the task.

We encourage you to test the effectiveness of Amazon Bedrock Intelligent Prompt Routing on your specialized task and domain, because results can vary. For example, when we tested Amazon Bedrock Intelligent Prompt Routing with open source and internal Retrieval Augmented Generation (RAG) datasets, we saw an average 63.6% cost savings, because a higher percentage (87%) of prompts was routed to Claude 3.5 Haiku while still maintaining the baseline accuracy of the larger/more expensive model alone (Claude 3.5 Sonnet v2 in the following figure), averaged across RAG datasets.

Getting started

You can get started using the AWS Management Console for Amazon Bedrock. As mentioned earlier, you can create your own router or use a default router:

Use the console to configure a router:

  1. In the Amazon Bedrock console, choose Prompt Routers in the navigation pane, and then choose Configure prompt router.
  2. You can then use a previously configured router or a default router in the console-based playground. For example, in the following figure, we attached a 10-K document from Amazon.com and asked a specific question about the cost of sales.
  3. Choose the router metrics icon (next to the refresh icon) to see which model the request was routed to. Because this is a nuanced question, Amazon Bedrock Intelligent Prompt Routing correctly routes to Claude 3.5 Sonnet V2 in this case, as shown in the following figure.

You can also use the AWS Command Line Interface (AWS CLI) or the API to configure and use a prompt router.

To use the AWS CLI or API to configure a router:

AWS CLI:

aws bedrock create-prompt-router \
    --prompt-router-name my-prompt-router \
    --models '[{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}, {"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}]' \
    --fallback-model '{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}' \
    --routing-criteria '{"responseQualityDifference": 0.5}'

Boto3 SDK:

import boto3

# The prompt router management APIs are on the Amazon Bedrock control plane client
client = boto3.client('bedrock')

response = client.create_prompt_router(
    promptRouterName='my-prompt-router',
    models=[
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
        },
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelB>'
        },
    ],
    description='string',
    routingCriteria={
        'responseQualityDifference':0.5
    },
    fallbackModel={
        'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
    },
    tags=[
        {
            'key': 'string',
            'value': 'string'
        },
    ]
)
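Once created, the router is invoked like any other model by passing its ARN as the model identifier. The following is a minimal sketch with the Converse API; the trace field exposing the routed model is returned by Amazon Bedrock when a prompt router is used:

    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime")

    result = bedrock_runtime.converse(
        modelId=response["promptRouterArn"],  # ARN returned by create_prompt_router
        messages=[
            {"role": "user", "content": [{"text": "Name popular places to visit in London."}]}
        ],
    )

    print(result["output"]["message"]["content"][0]["text"])
    # Inspect which underlying model the router selected for this request
    print(result.get("trace", {}).get("promptRouter", {}).get("invokedModelId"))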

Caveats and best practices

When using intelligent prompt routing in Amazon Bedrock, note that:

  • Amazon Bedrock Intelligent Prompt Routing is optimized for English prompts for typical chat assistant use cases. For use with other languages or customized use cases, conduct your own tests before implementing prompt routing in production applications or reach out to your AWS account team for help designing and conducting these tests.
  • You can select only two models to be part of the router (pairwise routing), with one of these two models being the fallback model. These two models have to be in the same AWS Region.
  • When starting with Amazon Bedrock Intelligent Prompt Routing, we recommend that you experiment using the default routers provided by Amazon Bedrock before trying to configure custom routers. After you’ve experimented with default routers, you can configure your own routers as needed for your use cases, evaluate the response quality in the playground, and use them for production application if they meet your requirements.
  • Amazon Bedrock Intelligent Prompt Routing can’t adjust routing decisions or responses based on application-specific performance data currently and might not always provide the most optimal routing for unique or specialized, domain-specific use cases. Contact your AWS account team for customization help on specific use cases.

Conclusion

In this post, we explored Amazon Bedrock Intelligent Prompt Routing, highlighting its ability to help optimize both response quality and cost by dynamically routing requests between different foundation models. Benchmark results demonstrate significant cost savings while maintaining high-quality responses, along with latency benefits, across model families. Whether you implement the pre-configured default routers or create custom configurations, Amazon Bedrock Intelligent Prompt Routing offers a powerful way to balance performance and efficiency in generative AI applications. As you implement this feature in your workflows, testing its effectiveness for your specific use cases is recommended to take full advantage of the flexibility it provides. To get started, see Understanding intelligent prompt routing in Amazon Bedrock.


About the authors

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Balasubramaniam Srinivasan is a Senior Applied Scientist at AWS, working on post-training methods for generative AI models. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and football (soccer).

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.

Haibo Ding is a senior applied scientist at Amazon Machine Learning Solutions Lab. He is broadly interested in Deep Learning and Natural Language Processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

How Infosys improved accessibility for Event Knowledge using Amazon Nova Pro, Amazon Bedrock and Amazon Elemental Media Services

This post is co-written with Saibal Samaddar, Tanushree Halder, and Lokesh Joshi from Infosys Consulting.

Critical insights and expertise are concentrated among thought leaders and experts across the globe. Language barriers often hinder the distribution and comprehension of this knowledge during crucial encounters. Workshops, conferences, and training sessions serve as platforms for collaboration and knowledge sharing, where the attendees can understand the information being conveyed in real-time and in their preferred language.

Infosys, a leading global IT services and consulting organization, used its digital expertise to tackle this challenge by pioneering Infosys Event AI, an innovative AI-based event assistant. Infosys Event AI is designed to make knowledge universally accessible, making sure that valuable insights are not lost and can be efficiently utilized by individuals and organizations across diverse industries, both during the event and after it has concluded. The absence of such a system hinders effective knowledge sharing and utilization, limiting the overall impact of events and workshops. By transforming ephemeral event content into a persistent and searchable knowledge asset, Infosys Event AI seeks to enhance knowledge utilization and impact.

Some of the challenges in capturing and accessing event knowledge include:

  • Knowledge from events and workshops is often lost due to inadequate capture methods, with traditional note-taking being incomplete and subjective.
  • Reviewing lengthy recordings to find specific information is time-consuming and inefficient, creating barriers to knowledge retention and sharing.
  • People who miss events face significant obstacles accessing the knowledge shared, impacting sectors like education, media, and public sector where information recall is crucial.

To address these challenges, Infosys partnered with Amazon Web Services (AWS) to develop Infosys Event AI to unlock the insights generated during events. In this post, we explain how Infosys built the solution using several AWS services, including Amazon Bedrock, Amazon Nova Pro, Amazon Transcribe, and AWS Elemental media services (MediaConnect and MediaLive).

Solution Architecture

In this section, we present an overview of Event AI, highlighting its key features and workflow. Event AI delivers these core functionalities, as illustrated in the architecture diagram that follows:

  1. Seamless live stream acquisition from on-premises sources
  2. Real-time transcription processing for speech-to-text conversion
  3. Post-event processing and knowledge base indexing for structured information retrieval
  4. Automated generation of session summaries and key insights to enhance accessibility
  5. AI-powered chat-based assistant for interactive Q&A and efficient knowledge retrieval from the event session

Solution walkthrough

Next, we break down each functionality in detail. The services used in the solution are granted least-privilege permissions through AWS Identity and Access Management (IAM) policies for security purposes.

Seamless live stream acquisition

The solution begins with an IP-enabled camera capturing the live event feed, as shown in the following section of the architecture diagram. This stream is securely and reliably transported to the cloud using the Secure Reliable Transport (SRT) protocol through MediaConnect. The ingested stream is then received and processed by MediaLive, which encodes the video in real time and generates the necessary outputs.

The workflow follows these steps:

  1. Use an IP-enabled camera or ground encoder to convert non-IP streams into IP streams and transmit them through SRT protocol to MediaConnect for live event ingestion.
  2. MediaConnect securely transmits the stream to MediaLive for processing.

Real-time transcription processing

To facilitate real-time accessibility, the system uses MediaLive to isolate audio from the live video stream. This audio-only stream is then forwarded to a real-time transcriber module. The real-time transcriber module, hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance, uses the Amazon Transcribe stream API to generate transcriptions with minimal latency. These real-time transcriptions are subsequently delivered to an on-premises web client through secure WebSocket connections. The following screenshot shows a brief demo based on a fictitious scenario to illustrate Event AI’s real-time streaming capability.

The workflow steps for this part of the solution follows these steps:

  1. MediaLive extracts the audio from the live stream and creates an audio-only stream, which it then sends to the real-time transcriber module running on an EC2 instance. MediaLive also extracts the audio-only output and stores it in an Amazon Simple Storage Service (Amazon S3) bucket, facilitating a subsequent postprocessing workflow.
  2. The real-time transcriber module receives the audio-only stream and employs the Amazon Transcribe stream API to produce real-time transcriptions with low latency.
  3. The real-time transcriber module uses a secure WebSocket to transmit the transcribed text.
  4. The on-premises web client receives the transcribed text through a secure WebSocket connection through Amazon CloudFront and displays it on the web client’s UI.

The following diagram shows the live-stream acquisition and real-time transcription workflow.
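To make the real-time transcriber module concrete, here is a minimal sketch using the open source amazon-transcribe streaming SDK for Python; audio sourcing and the WebSocket fan-out are omitted, and parameters such as the sample rate are assumptions:

    import asyncio

    from amazon_transcribe.client import TranscribeStreamingClient
    from amazon_transcribe.handlers import TranscriptResultStreamHandler
    from amazon_transcribe.model import TranscriptEvent

    class RelayHandler(TranscriptResultStreamHandler):
        async def handle_transcript_event(self, transcript_event: TranscriptEvent):
            # Forward each partial transcription to connected WebSocket clients (omitted).
            for result in transcript_event.transcript.results:
                for alternative in result.alternatives:
                    print(alternative.transcript)

    async def transcribe(audio_chunks):
        client = TranscribeStreamingClient(region="us-east-1")
        stream = await client.start_stream_transcription(
            language_code="en-US",
            media_sample_rate_hz=16000,  # assumed sample rate of the audio-only stream
            media_encoding="pcm",
        )

        async def send_audio():
            async for chunk in audio_chunks:
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
            await stream.input_stream.end_stream()

        await asyncio.gather(send_audio(), RelayHandler(stream.output_stream).handle_events())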

Post-event processing and knowledge base indexing

After the event concludes, recorded media and transcriptions are securely stored in Amazon S3 for further analysis. A serverless, event-driven workflow using Amazon EventBridge and AWS Lambda automates the post-event processing. Amazon Transcribe processes the recorded content to generate the final transcripts, which are then indexed and stored in an Amazon Bedrock knowledge base for seamless retrieval. Additionally, Amazon Nova Pro enables multilingual translation of the transcripts, providing global accessibility when needed. With its quality and speed, Amazon Nova Pro is ideally suited for this global use case.

The workflow for this part of the process follows these steps:

  1. After the event concludes, MediaLive sends a channel stopped notification to EventBridge
  2. A Lambda function, subscribed to the channel stopped event, triggers post-event transcription using Amazon Transcribe
  3. The transcribed content is processed and stored in an S3 bucket
  4. (Optional) Amazon Nova Pro translates transcripts into multiple languages for broader accessibility using Amazon Bedrock
  5. Amazon Transcribe generates a transcription complete event and sends it to EventBridge
  6. A Lambda function, subscribed to the transcription complete event, triggers the synchronization process with Amazon Bedrock Knowledge Bases
  7. The knowledge is then indexed and stored in Amazon Bedrock knowledge base for efficient retrieval
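As a sketch, the Lambda function in step 2 could start the batch transcription job like this (the environment variable and S3 layout are illustrative assumptions):

    import os
    import time

    import boto3

    transcribe = boto3.client("transcribe")

    def handler(event, context):
        # Triggered by the EventBridge rule matching the MediaLive
        # "channel stopped" state change event.
        bucket = os.environ["RECORDINGS_BUCKET"]  # assumed env var
        recording_key = "recordings/session-audio.m4a"  # assumed S3 layout
        transcribe.start_transcription_job(
            TranscriptionJobName=f"event-ai-{int(time.time())}",
            Media={"MediaFileUri": f"s3://{bucket}/{recording_key}"},
            MediaFormat="m4a",
            LanguageCode="en-US",
            OutputBucketName=bucket,  # final transcript lands back in S3
        )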

These steps are shown in the following diagram.

Automated generation of session summaries and key insights

To enhance user experience, the solution uses Amazon Bedrock to analyze the transcriptions and generate concise session summaries and key insights. These insights help users quickly understand the essence of the event without going through lengthy transcripts. The following screenshot shows Infosys Event AI’s summarization capability.

The workflow for this part of the solution follows these steps:

  1. Users authenticate to the web client portal using Amazon Cognito. Once authenticated, the user selects an option in the portal UI to view the summaries and key insights.
  2. The user request is delegated to the AI assistant module, where it fetches the complete transcript from the S3 bucket.
  3. The transcript undergoes processing by Amazon Nova Pro on Amazon Bedrock, guided by Amazon Bedrock Guardrails. In line with responsible AI policies, this process generates secure summaries and key insights that are safeguarded for the user.

AI-powered chat-based assistant

A key feature of this architecture is an AI-powered chat assistant, which is used to interactively query the event knowledge base. The chat assistant is powered by Amazon Bedrock and retrieves information from the Amazon OpenSearch Serverless index, enabling seamless access to session insights.

The workflow for this part of the solution follows these steps:

  1. Authenticated users engage with the chat assistant using natural language to request specific event details from the client web portal.
  2. The user prompt is directed to the AI assistant module for processing.
  3. The AI assistant module queries Amazon Bedrock Knowledge Bases for relevant answers.
  4. The retrieved content is processed by Amazon Nova Pro, guided by Amazon Bedrock Guardrails, to generate secure, grounded responses (a sketch follows the diagram below). The integration of Amazon Bedrock Guardrails promotes professional, respectful interactions by blocking undesirable and harmful content, in line with responsible AI policies.

The following demo shows Event AI’s Q&A capability.

The steps for automated generation of insights and AI-chat assistant are shown in the following diagram.
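The following is a minimal sketch of the assistant’s retrieval step, assuming the boto3 bedrock-agent-runtime RetrieveAndGenerate API with a placeholder knowledge base ID, placeholder guardrail identifiers, and an illustrative Amazon Nova Pro model ARN.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")


def answer_question(question: str) -> str:
    """Answer a natural-language question from the event knowledge base."""
    response = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
                "modelArn": (
                    "arn:aws:bedrock:us-east-1::"
                    "foundation-model/amazon.nova-pro-v1:0"  # illustrative
                ),
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": "YOUR_GUARDRAIL_ID",  # placeholder
                        "guardrailVersion": "1",
                    }
                },
            },
        },
    )
    return response["output"]["text"]
```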

Results and impact

Infosys Event AI was launched in February 2025 at a responsible AI conference in Bangalore, India, hosted by Infosys in partnership with the British High Commission. Usage during the event included the following:

  • Infosys Event AI was used by more than 800 conference attendees
  • It was used by around 230 people every minute during the event
  • The intelligent chat assistant was queried an average of 57 times every minute during the event
  • A total of more than 9,000 event session summaries were generated during the event

By using the solution, Infosys realized the following key benefits for its internal users and customers:

  • Enhanced knowledge retention – During events, Infosys Event AI was accessible from both mobile and laptop devices, providing an immersive participation experience for both online and in-person attendees.
  • Improved accessibility – Session knowledge became quickly accessible after the event through transcripts, summaries, and the intelligent chat assistant. The event information is readily available for attendees and for those who couldn’t attend. Furthermore, Infosys Event AI aggregates the session information from previous events, creating a knowledge archival system for information retrieval.
  • Increased engagement – The interactive chat assistant provides deeper engagement during the event sessions, which means users can ask specific questions and receive immediate, contextually relevant answers.
  • Time efficiency – Quick access to summaries and chat responses saves time compared to reviewing full session recordings or manual notes when seeking specific information.

Impacting multiple industries

Infosys is positioned to accelerate the adoption of Infosys Event AI across diverse industries:

  • AI-powered meeting management for enterprises – Businesses can use the system for generating meeting minutes, creating training documentation from workshops, and facilitating knowledge sharing within teams. Summaries provide quick recaps of meetings for executives, and transcripts offer detailed records for compliance and reference.
  • Improved transparency and accessibility in the public sector – Parliamentary debates, public hearings, and government briefings are made accessible to the general public through transcripts and summaries, improving transparency and citizen engagement. The platform enables searchable archives of parliamentary proceedings for researchers, policymakers, and the public, creating accessible records for historical reference.
  • Accelerated learning and knowledge retention in the education sector – Students can effectively review lectures, seminars, and workshops through transcripts and summaries, reinforcing learning and improving knowledge retention. The chat assistant allows for interactive learning and clarification of doubts, acting as a virtual teaching assistant. This is particularly useful in online and hybrid learning environments.
  • Improved media reporting and efficiency in the media industry – Journalists can use Infosys Event AI to rapidly transcribe press conferences, speeches, and interviews, accelerating news cycles and improving reporting accuracy. Summaries provide quick overviews of events, enabling faster news dissemination. The chat assistant facilitates quick fact-checking (with source citation) and information retrieval from event recordings.
  • Improved accessibility and inclusivity across industries – Real-time transcription provides accessibility for hearing-impaired individuals. Multilingual translation of event transcripts allows attendees to participate even when sessions aren’t delivered in their native language. This promotes inclusivity and wider participation in events for knowledge sharing.

Conclusion

In this post, we explored how Infosys developed Infosys Event AI to unlock the insights generated from events and conferences. Through its suite of features—including real-time transcription, intelligent summaries, and an interactive chat assistant—Infosys Event AI makes event knowledge accessible and provides an immersive engagement solution for the attendees, during and after the event.

Infosys is planning to offer the Infosys Event AI solution to their internal teams and global customers in two versions: as a multi-tenanted, software-as-a-service (SaaS) solution and as a single-deployment solution. Infosys is also adding capabilities to include an event catalogue, knowledge lake, and event archival system to make the event information accessible beyond the scope of the current event. By using AWS managed services, Infosys has made Event AI a readily available, interactive, immersive, and valuable resource for students, journalists, policymakers, enterprises, and the public sector. As organizations and institutions increasingly rely on events for knowledge dissemination, collaboration, and public engagement, Event AI is well positioned to unlock the full potential of these events.

Stay updated with new Amazon AI features and releases to advance your AI journey on AWS.


About the Authors

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a cloud architect with over 24 years of experience designing and developing enterprise-scale, distributed software systems. He specializes in generative AI and machine learning, with a focus on the data and feature engineering domain. He is an aspiring marathon runner, and his hobbies include hiking, bike riding, and spending time with his wife and two boys.

Maheshwaran G is a Specialist Solutions Architect working with media and entertainment companies in India, helping them accelerate growth in innovative ways using the power of cloud technologies. He is passionate about innovation and currently holds 8 USPTO and 8 IPO granted patents in diversified domains.

Saibal Samaddar is a Senior Principal Consultant at Infosys Consulting and heads the AI Transformation Consulting (AIX) practice in India. He has over 18 years of business consulting experience, including 11 years at PwC and KPMG, helping organizations drive strategic transformation by harnessing digital and AI technologies. Known as a visionary who can navigate complex transformations and make things happen, he has played a pivotal role in winning multiple new accounts for Infosys Consulting (IC).

Tanushree Halder is a Principal Consultant at Infosys Consulting and leads the CX and generative AI capability for AI Transformation Consulting (AIX). She has 11 years of experience working with clients on their transformation journeys. She has travelled to over 10 countries to provide AI advisory services to clients in BFSI, retail and logistics, hospitality, healthcare, and shared services.

Lokesh Joshi is a Consultant at Infosys Consulting. He has worked with multiple clients to strategize and integrate AI-based solutions for workflow enhancements. He has over 4 years of experience in AI/ML, generative AI development, full-stack development, and cloud services. He specializes in machine learning and data science, with a focus on deep learning and NLP. A fitness enthusiast, his hobbies include programming, hiking, and traveling.
