A generative AI prototype with Amazon Bedrock transforms life sciences and the genome analysis process

It takes biopharma companies over 10 years, at a cost of over $2 billion and with a failure rate of over 90%, to deliver a new drug to patients. The Market to Molecule (M2M) value stream process, which biopharma companies must apply to bring new drugs to patients, is resource-intensive, lengthy, and highly risky. Nine out of ten biopharma companies are AWS customers, and helping them streamline and transform the M2M processes can help deliver drugs to patients faster, reduce risk, and bring value to our customers.

Pharmaceutical companies are taking a new approach to drug discovery, looking for variants in the human genome and linking them to diseases. This genetic validation approach can improve the success ratio in the M2M value stream process by focusing on the root cause of disease and the gene variants.

As depicted in the following M2M value stream diagram, the Research process (and the Basic Research sub-process) is critical to downstream processes where linking the gene variant to a disease occurs, and is instrumental in defining the target molecule. This can be a critical step in expediting and reducing the cost of delivering a new drug to patients.

To transform the M2M value stream process, our customer has been working on associating genes with diseases by using their large dataset of over 2 million sequenced exomes (the protein-coding portions of genomes). To accomplish this, the customer’s clinical scientists have to navigate the enormous dataset using online genome browsers, a mechanical, data-first experience that doesn’t fully meet users’ needs. Starting with a search query to get results, the typical interactions of navigating levels, filtering, waiting, and repeating the search are time-consuming and tedious. Replacing the traditional genome browser UI with a conversational AI assistant can enhance the user experience in the clinical research process.

Generative AI is a promising next step in this evolution. As generative AI started to make a significant impact in healthcare and life sciences, this use case was primed for experimentation. In collaboration with the customer, AWS built a custom gene assistant in which scientists pose a question, or a series of questions, giving them more flexibility and agility in exploring the genome. Our customer aimed to save researchers countless hours of work with this new generative AI-enabled assistant. Identifying variants and their potential correlation with diseases can be done more efficiently with words rather than filters, settings, and buttons. A more streamlined research process can increase the likelihood of new breakthroughs.

This post explores deploying a text-to-SQL pipeline using generative AI models and Amazon Bedrock to ask natural language questions of a genomics database. We demonstrate how to implement an AI assistant web interface with AWS Amplify and explain the prompt engineering strategies adopted to generate the SQL queries. Finally, we present instructions to deploy the service in your own AWS account. Amazon Bedrock is a fully managed service that provides access to large language models (LLMs) and other foundation models (FMs) from leading AI companies through a single API, so developers can start using them quickly without managing infrastructure. We used AWS HealthOmics variant stores to store the Variant Call Format (VCF) files with omics data. A VCF file is typically the output of a bioinformatics pipeline; it encodes single nucleotide polymorphisms (SNPs) and other genetic variants, including structural variants. The format is further described on the 1000 Genomes project website. We used the AWS HealthOmics – End to End workshop to deploy the variant and annotation stores.

Although this post focuses on a text-to-SQL approach to an omics database, the generative AI approaches discussed here can be applied to a variety of complex schemas of relational databases.

Text-to-SQL for genomics data

Text-to-SQL is a natural language processing (NLP) task that automatically converts natural language text into SQL queries. This involves translating the written text into a structured representation and using it to generate an accurate SQL query that can run on a database. The task is challenging because human language is flexible, ambiguous, and context-dependent, whereas SQL is rigidly structured.

Before LLMs were applied to text-to-SQL, user queries had to be preprocessed to match specific templates, which were then used to rephrase the queries. This approach was use case-specific and required data preparation and manual work. With LLMs, the text-to-SQL task has undergone a major transformation. LLMs continue to show key performance improvements in generating valid SQL from natural language. Pre-trained on massive datasets, LLMs can identify the relationships between words in language and accurately predict the next ones to be used.

However, although LLMs have remarkable performance in many text-to-SQL problems, they have limitations that lead to hallucinations. This post describes the main approaches used to overcome these limitations.

There are two key strategies to achieve high accuracy in text-to-SQL services:

  • Prompt engineering – The prompt is structured to annotate different components, such as pointing to columns and schemas, and then instructing the model on which type of SQL to create. These annotations act as instructions that guide the model in formatting the SQL output correctly. For example, a prompt might describe specific table columns and instruct the model to use only those columns in the generated query. This approach allows for more control over the model’s output by explicitly specifying the desired structure and format of the SQL query.
  • Fine-tuning – You can start with a pre-trained model on a large general text corpus and then proceed with an instruction-based fine-tuning with labeled examples to improve the model’s performance on text-to-SQL tasks. This process adapts the model to the target task by directly training it on the end task, but it requires a substantial number of text-SQL examples.

This post focuses on the prompt engineering strategy for SQL generation. AWS customers deploy prompt engineering strategies first because they’re efficient in returning high-quality results and require a less complex infrastructure and process. For more details and best practices on when to follow each approach, refer to Best practices to build generative AI applications on AWS.

We experimented with prompt engineering using chain-of-thought and tree-of-thought approaches to improve the reasoning and SQL generation capabilities. The chain-of-thought prompting technique guides the LLMs to break down a problem into a series of intermediate steps or reasoning steps, explicitly expressing their thought process before arriving at a definitive answer or output.

Using prompts, we compelled the LLM to generate a series of statements about its own reasoning, articulating its thought process to produce accurate and understandable outputs. The tree-of-thought approach introduces structured branching into the reasoning process. Instead of a linear chain, we prompt the LLM to generate a tree-like structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process.

Solution overview

The following architecture depicts the solution and AWS services we used to accomplish the prototype.

The workflow consists of the following steps:

  1. A scientist submits a natural language question or request to a chat web application connected through Amplify and integrated with an AWS AppSync GraphQL API.
  2. The request is submitted to Amazon API Gateway, which transfers the request to an AWS Lambda function that contains the text-to-SQL implementation. We recommend implementing a second helper Lambda function to fetch variant data, gene names, or ClinVar-listed diseases, to simplify the user experience and facilitate the SQL generation process.
  3. The text-to-SQL Lambda function receives the natural language request, merges the input question with the prompt template, and submits it to Amazon Bedrock to generate the SQL (a minimal sketch of steps 3 and 4 follows this list).
    • Our implementation also adds a step to simplify the incoming history into a single request. We submit a request to Amazon Bedrock to transform the historical inputs from that user session into a simplified natural language request. This step is optional.
  4. With the generated SQL, the Lambda function submits the query to Amazon Athena to retrieve the genomic data from the Amazon Simple Storage Service (Amazon S3) bucket.
    • If successful, the Lambda function updates the user session stored in Amazon DynamoDB through an AWS AppSync request. That change will automatically appear on the UI that is subscribed to changes to the session table.
    • If an error occurs, the code attempts to regenerate the SQL query, passing the returned error as input and asking the model to fix it. The Lambda function then reruns the regenerated SQL against Athena and returns the result.
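
The following is a minimal sketch of steps 3 and 4, assuming a prompt template, an Athena database created from the HealthOmics variant store, and an S3 location for query results. The model ID, resource names, and retry logic are illustrative, and the implementation in the accompanying repository may differ.

import json
import os
import time

import boto3

bedrock = boto3.client("bedrock-runtime")
athena = boto3.client("athena")

# Illustrative names: replace with the values from your deployment
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
ATHENA_DATABASE = "variants_db"
ATHENA_OUTPUT = "s3://my-athena-results-bucket/text-to-sql/"
PROMPT_TEMPLATE = """You are an expert Athena SQL developer.
<schema>...table and column descriptions...</schema>
Generate a SQL query for the question below and return it inside <SQL_QUERY> tags.
Question: {question}"""


def generate_sql(question: str) -> str:
    """Merge the user question with the prompt template and ask the model for SQL."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": PROMPT_TEMPLATE.format(question=question)}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    # Keep only the SQL between the <SQL_QUERY> tags
    return text.split("<SQL_QUERY>")[1].split("</SQL_QUERY>")[0].strip()


def run_athena_query(sql: str) -> list:
    """Run the generated SQL against the variant store tables and wait for the result."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} finished in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


def lambda_handler(event, context):
    question = json.loads(event["body"])["question"]
    sql = generate_sql(question)
    try:
        rows = run_athena_query(sql)
    except RuntimeError as error:
        # On failure, ask the model to fix its own SQL and retry once
        sql = generate_sql(f"{question}\nThe previous query failed with: {error}\nPrevious SQL: {sql}")
        rows = run_athena_query(sql)
    return {"statusCode": 200, "body": json.dumps({"sql": sql, "rows": rows})}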

Generative AI approaches to text-to-SQL

We tested the following prompt-engineering strategies:

  • LLM SQL agents
  • LLM with Retrieval Augmented Generation (RAG) to detect tables and columns of interest
  • Prompt engineering with full description of tables and columns of interest
  • Prompt engineering with chain-of-thought and tree-of-thought approaches
  • Prompt engineering with a dynamic few-shot approach

We didn’t achieve good results with SQL agents. We experimented with LangChain SQL agents, but it was difficult for the agent to use contextual information from the dataset to generate accurate and syntactically correct SQL. A particular challenge with omics data is that certain columns are arrays of structs or maps. At the time of building this project, the agents couldn’t handle these nuances and failed to generate relevant SQL.

We also experimented with a RAG approach to retrieve the relevant tables and columns for a given user question, and then prompted the LLM to generate a SQL query using only those tables and columns. A motivation behind this experiment is that RAG can scale to hundreds or thousands of columns or tables. However, this approach also didn’t return good results: it surfaced too many irrelevant tables and columns for each SQL generation.

The next three approaches were successful, and we used them in combination to achieve the highest accuracy in generating syntactically correct SQL.

A first prompt idea we tested was to provide a full description of the main tables and columns to be used in the SQL generation for a given user question. The following example shows a snapshot of the prompt used to describe the 1000 Genomes variants tables. The goal of the prompt with database table and column descriptions is to teach the LLM how to use the schema to generate queries. We approached it as if teaching a new developer who will write queries against that database, with examples of SQL queries to extract the correct dataset, how to filter the data, and how to use only the most relevant columns.

<table>
       <table_name>
       variants
       </table_name>
       <table_description>
       This table contains information about genetic variants.
       </table_description>
       <column>
              <column_name>contigname</column_name>
              <column_description>
This column specifies the name of the contig (a contiguous sequence of DNA) or chromosome where the variant is located. It is typically prefixed with "chr". If the user asks for variants at chromosome 22, use `chr22` to access variants in this table.
              </column_description>
              <example_use>
                       select *
                       from variants
                       where contigname = 'chr22'
                      and start between 45509414 and 45509418;
              </example_use>
       </column>
       <column>
              <column_name>start</column_name>
              <column_description>
                      The start position of the variant on the chromosome. This should
                      be used to compose the primary key of the variant, along with the
                      following tables: `contigname`, `end`, `referenceallele`, `alternatealleles`.
              </column_description>
              <example_use>
                      SELECT * FROM variants WHERE start > 100000 and end < 200000;
              </example_use>
       </column>
</table>

The team also created a prompt that used the concept of chain-of-thought and its evolution, tree-of-thought, to improve the reasoning and SQL generation capabilities.

The chain-of-thought prompting technique encourages LLMs to break down a problem into a series of intermediate steps, explicitly expressing their thought process before arriving at a definitive answer or output. This approach takes inspiration from the way humans often break down problems into smaller, manageable parts.

Through the use of prompts, we compelled the LLM to generate a chain-of-thought, letting the LLM articulate its reasoning process and produce more accurate and understandable outputs. This technique has the potential to improve performance on tasks that require multi-step reasoning, such as SQL generation from open-ended natural language questions. This approach presented excellent results with the FM that we tested.

As a next step in our experimentation, we used the tree-of-thought technique to generate even better results than the chain-of-thought approach. The tree-of-thought approach introduces a more structured and branching approach to the reasoning process. Instead of a linear chain, we prompt the LLM to generate a tree-like structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process. The following example presents how we used these two approaches in the prompt template:

Imagine three different experts are answering this question. All experts will write down 1 step 
of their thinking, then share it with the group. Then all experts will go on to the next step, etc. 
If any expert realises they're wrong at any point then they leave. Each of the three experts should 
explain their thinking along with the generated SQL statement. Your final step is to review the 
generated SQL code for syntax errors. Pay close attention to any use of the UNNEST function - it 
MUST be immediately followed by 'AS t(unpacked)' rather than 'AS t' . If you find a syntax error 
with the generated SQL, produce a corrected version within <SQL_FIXED> tags. Only produce 
the <SQL_FIXED> code if you find a syntax problem in the <SQL_QUERY> tags.

Finally, we tested a few-shot and a dynamic few-shot approach. The few-shot approach is a prompting technique used in prompt engineering for LLMs. It involves providing the LLM with a few examples or demonstrations, along with the input prompt, to guide the model’s generation or output. In the few-shot setting, the prompt comprises the following:

  • An instruction or task description
  • A few examples or demonstrations of the desired output, given a specific input
  • The new input for which the LLM will generate an output

By exposing the LLM to these examples, the model better recognizes patterns and infers the underlying rules or mappings between the input and the desired output.

The dynamic few-shot approach extends the few-shot prompting technique by dynamically selecting or generating the examples used in the prompt based on the specific input or context. Instead of providing a fixed set of examples, the prompt generation process involves the following steps (a minimal sketch follows the list):

  • Analyzing the input or context
  • Creating embeddings of the examples and of the input, and retrieving or generating relevant examples or demonstrations tailored to the specific input by applying a semantic search
  • Constructing the prompt with the selected examples and the input
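
The following is a minimal sketch of this dynamic selection step, assuming a small in-memory library of curated question-to-SQL examples. The embedding model ID and the cosine-similarity search are illustrative; at scale, a vector database can be used instead.

import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"  # illustrative embedding model

# Curated question -> SQL examples (illustrative content)
EXAMPLES = [
    {"question": "List variants on chromosome 22 between 45509414 and 45509418",
     "sql": "SELECT * FROM variants WHERE contigname = 'chr22' AND start BETWEEN 45509414 AND 45509418;"},
    {"question": "Count variants per chromosome",
     "sql": "SELECT contigname, COUNT(*) FROM variants GROUP BY contigname;"},
    # ... more examples ...
]


def embed(text: str) -> np.ndarray:
    """Create an embedding for a piece of text with Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])


# Embed the example questions once, up front
EXAMPLE_VECTORS = [embed(example["question"]) for example in EXAMPLES]


def select_examples(question: str, k: int = 2) -> list:
    """Return the k examples whose questions are semantically closest to the input."""
    query_vector = embed(question)
    similarities = [
        float(np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector)))
        for vector in EXAMPLE_VECTORS
    ]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [EXAMPLES[i] for i in top_indices]


def build_prompt(question: str) -> str:
    """Construct the few-shot prompt with the selected examples and the new question."""
    shots = "\n\n".join(
        f"Question: {example['question']}\nSQL: {example['sql']}" for example in select_examples(question)
    )
    return f"Generate an Athena SQL query.\n\n{shots}\n\nQuestion: {question}\nSQL:"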

Conclusion

This post demonstrated how to implement a text-to-SQL solution to democratize access to omics data for users who aren’t data analytics specialists. The approach used AWS HealthOmics and Amazon Bedrock to generate SQL based on natural language queries. This approach has the potential to make omics data accessible to a much larger audience than is possible today.

The code is available in the accompanying GitHub repo. The deployment instructions for the HealthOmics variants and annotation store can be found in the AWS HealthOmics – End to End workshop. The deployment instructions for the text-to-SQL project are available in the README file.

We would like to acknowledge Thomaz Silva and Saeed Elnaj for their contributions to this blog. It couldn’t have been done without them.


About the Authors

Ganesh Raam Ramadurai is a Senior Technical Program Manager at Amazon Web Services (AWS), where he leads the PACE (Prototyping and Cloud Engineering) team. He specializes in delivering innovative, AI/ML and Generative AI-driven prototypes that help AWS customers explore emerging technologies and unlock real-world business value. With a strong focus on experimentation, scalability, and impact, Ganesh works at the intersection of strategy and engineering—accelerating customer innovation and enabling transformative outcomes across industries.

Jeff Harman is a Senior Prototyping Architect on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team, where he specializes in developing innovative solutions that leverage AWS’s cloud infrastructure to meet complex business needs. Jeff is a seasoned technology professional with over three decades of experience in software engineering, enterprise architecture, and cloud computing. Prior to his tenure at AWS, Jeff held various leadership roles at Webster Bank, including Vice President of Platform Architecture for Core Banking, Vice President of Enterprise Architecture, and Vice President of Application Architecture. During his time at Webster Bank, he was instrumental in driving digital transformation initiatives and enhancing the bank’s technological capabilities. He holds a Master of Science degree from the Rochester Institute of Technology, where he conducted research on creating a Java-based, location-independent desktop environment—a forward-thinking project that anticipated the growing need for remote computing solutions. Based in Unionville, Connecticut, Jeff continues to be a driving force in the field of cloud computing, applying his extensive experience to help organizations harness the full potential of AWS technologies.

Kosal Sen is a Design Technologist on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team. Kosal specializes in creating solutions that bridge the gap between technology and actual human needs. As an AWS Design Technologist, that means building prototypes on AWS cloud technologies, and ensuring they bring empathy and value into the real world. Kosal has extensive experience spanning design, consulting, software development, and user experience. Prior to AWS, Kosal held various roles where he combined technical skillsets with human-centered design principles across enterprise-scale projects.


Gemma 3 27B model now available on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

We are excited to announce the availability of Gemma 3 27B Instruct models through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, developers and data scientists can now deploy Gemma 3, a 27-billion-parameter language model, along with its specialized instruction-following versions, to help accelerate building, experimentation, and scalable deployment of generative AI solutions on AWS.

In this post, we show you how to get started with Gemma 3 27B Instruct on both Amazon Bedrock Marketplace and SageMaker JumpStart, and how to use the model’s powerful instruction-following capabilities in your applications.

Overview of Gemma 3 27B

Gemma 3 27B is a high-performance, open-weight, multimodal language model by Google designed to handle both text and image inputs with efficiency and contextual understanding. It introduces a redesigned attention architecture, enhanced multilingual support, and extended context capabilities. With its optimized memory usage and support for large input sequences, it is well-suited for complex reasoning tasks, long-form interactions, and vision-language applications. With 27 billion parameters and training on up to 6 trillion tokens of text, these models are optimized for tasks requiring advanced reasoning, multilingual capabilities, and instruction following. According to Google, Gemma3 27B Instruct models are ideal for developers, researchers, and businesses looking to build generative AI applications such as chatbots, virtual assistants, and automated content generation tools. The following are its key features:

  • Multimodal input – Processes text, images, and short videos for unified reasoning across modalities
  • Long context support – Handles up to 128,000 tokens, enabling seamless processing of long documents, conversations, and multimedia transcripts
  • Multilingual support – Offers out-of-the-box support for over 35 languages, with pre-training exposure to more than 140 languages in total
  • Function calling – Facilitates building agentic workflows by using natural‐language interfaces to APIs
  • Memory-efficient inference – Offers architectural updates that reduce KV-cache usage and introduce QK-norm for faster and more accurate outputs

Key use cases for Gemma3, as described by Google, include:

  • Q&A and summarization – Processing and condensing long documents or articles
  • Visual understanding – Image captioning, object identification, visual Q&A, and document understanding
  • Multilingual applications – Building AI assistants and tools across over 140 languages
  • Document processing – Analyzing multi-page articles or extracting information from large texts
  • Automated workflows – Using function calling to create AI agents that can interact with other systems

There are two primary methods for deploying Gemma 3 27B on AWS. The first approach involves using Amazon Bedrock Marketplace, which offers a streamlined way of accessing Amazon Bedrock APIs (Invoke and Converse) and tools such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, Amazon Bedrock Flows, Amazon Bedrock Guardrails, and model evaluation. The second approach is using SageMaker JumpStart, a machine learning (ML) hub with foundation models (FMs), built-in algorithms, and pre-built ML solutions. You can deploy pre-trained models using either the Amazon SageMaker console or the SDK.

Deploy Gemma 3 27B Instruct on Amazon Bedrock Marketplace

Amazon Bedrock Marketplace offers access to over 150 specialized FMs, including Gemma 3 27B Instruct.

Prerequisites

To try the Gemma 3 27B Instruct model using Amazon Bedrock Marketplace, you need the following:

  • An AWS account that will contain all your AWS resources
  • Access to accelerated instances (GPUs) for hosting the large language models (LLMs)

Deploy the model

To deploy the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, select Model catalog.
  2. Filter for Gemma as the provider and choose Gemma 3 27B Instruct.

Information about Gemma3’s features, costs, and setup instructions can be found on its model overview page. This resource includes integration examples, API documentation, and programming samples. The model excels at a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. You can also access deployment guidelines and license details to begin implementing Gemma3 into your projects.

  3. Review the model details, pricing, and deployment guidelines, and choose Deploy to start the deployment process.

  4. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters) or leave it as the default name that is pre-populated.
  5. For Number of instances, enter a number of instances (between 1–100).
  6. Select your preferred instance type, with GPU-powered options like ml.g5.48xlarge being particularly well-suited for running Gemma 3 efficiently.

Although default configurations are typically sufficient for basic needs, you have the option to customize security features such as virtual private cloud (VPC) networking, role-based permissions, and data encryption. These advanced settings might require adjustment for production environments to maintain compliance with your organization’s security protocols.

Prior to deploying Gemma 3, verify that your AWS account has sufficient quota allocation for ml.g5.48xlarge instances. A quota set to 0 will trigger deployment failures, as shown in the following screenshot.

To request a quota increase, open the AWS Service Quotas console and search for SageMaker. Locate ml.g5.48xlarge for endpoint usage and choose Request quota increase, then specify your required limit value.
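
As an alternative to the console, the following sketch checks the relevant SageMaker quota with the Service Quotas API and requests an increase if it is 0. The quota name match is an assumption and might need to be adjusted to your account's quota listing.

import boto3

quotas = boto3.client("service-quotas")

# Look through SageMaker quotas for the ml.g5.48xlarge endpoint-usage entry (name match is an assumption)
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.g5.48xlarge for endpoint usage" in quota["QuotaName"]:
            print(quota["QuotaName"], "current limit:", quota["Value"])
            if quota["Value"] == 0:
                # Request an increase before attempting deployment
                quotas.request_service_quota_increase(
                    ServiceCode="sagemaker",
                    QuotaCode=quota["QuotaCode"],
                    DesiredValue=1,
                )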

  7. While the deployment is in progress, you can choose Managed deployments in the navigation pane to monitor the deployment status.
  8. When deployment is complete, you can test Gemma 3’s capabilities directly in the Amazon Bedrock playground by selecting the managed deployment and choosing Open in playground.

You can now use the playground to interact with Gemma 3.

For detailed steps and example code for invoking the model using Amazon Bedrock APIs, refer to Submit prompts and generate response using the API and the following code:

import boto3
bedrock_runtime = boto3.client("bedrock-runtime")
endpoint_arn = "arn:aws:sagemaker:us-east-2:061519324070:endpoint/endpoint-quick-start-3t7kp"
response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[
        {
            "role": "user",
            "content": [{"text": "What is Amazon doing in the field of generative AI?"}]
        }
    ],
    inferenceConfig={
        "maxTokens": 256,
        "temperature": 0.1,
        "topP": 0.999
    }
)
print(response["output"]["message"]["content"][0]["text"])

Deploy Gemma 3 27B Instruct with SageMaker JumpStart

SageMaker JumpStart offers access to a broad selection of publicly available FMs. These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and can be isolated within your VPC. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.

There are two ways to deploy the Gemma 3 model using SageMaker JumpStart:

  • Through the user-friendly SageMaker JumpStart interface
  • Using the SageMaker Python SDK for programmatic deployment

We examine both deployment methods to help you determine which approach aligns best with your requirements.

Prerequisites

To try the Gemma 3 27B Instruct model in SageMaker JumpStart, you need the following prerequisites:

Deploy the model through the SageMaker JumpStart UI

SageMaker JumpStart provides a user-friendly interface for deploying pre-built ML models with just a few clicks. Through the SageMaker JumpStart UI, you can select, customize, and deploy a wide range of models for various tasks such as image classification, object detection, and natural language processing, without the need for extensive coding or ML expertise.

  1. On the SageMaker AI console, choose Studio in the navigation pane.
  2. First-time users will be prompted to create a domain.
  3. On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

  4. Search for Gemma 3 to view the Gemma 3 model card. Each model card shows key information, including:
    • Model name
    • Provider name
    • Task category (for example, Text Generation)
    • The Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

  5. Choose the model card to view the model details page.

The model details page includes the following information:

    • The model name and provider information
    • The Deploy button to deploy the model
    • About and Notebooks tabs with detailed information. The About tab includes important details, such as:
      • Model description
      • License information
      • Technical specifications
      • Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

  6. Choose Deploy to proceed with deployment.
  7. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters) or leave it as default.
  8. For Instance type, choose an instance type (default: ml.g5.48xlarge).
  9. For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

  10. Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
  11. Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.

Deploy the model programmatically using the SageMaker Python SDK

To use Gemma 3 with the SageMaker Python SDK, first make sure you have installed the SDK and set up your AWS permissions and environment correctly. The following is a code example showing how to programmatically deploy and run inference with Gemma 3:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker import Session, image_uris
import boto3
# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()
# Specify model parameters
model_id = "huggingface-vlm-gemma-3-27b-instruct"  # or "huggingface-llm-gemma-2b" for the smaller version
instance_type = "ml.g5.48xlarge"  # Choose appropriate instance based on your needs
# Create and deploy the model
model = JumpStartModel(
    model_id=model_id,
    role=role,
    instance_type=instance_type,
    model_version="*",  # Latest version
)
# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True  # Required for deploying foundation models
)

Run inference using the SageMaker API

With your Gemma 3 model successfully deployed as a SageMaker endpoint, you’re now ready to start making predictions. The SageMaker SDK provides a straightforward way to interact with your model endpoint for inference tasks. The following code demonstrates how to format your input and make API calls to the endpoint. The code handles both sending requests to the model and processing its responses, making it straightforward to integrate Gemma 3 into your applications.

import json
import boto3
# Initialize AWS session (ensure your AWS credentials are configured)
session = boto3.Session()
sagemaker_runtime = session.client("sagemaker-runtime")
# Define the SageMaker endpoint name (replace with your deployed endpoint name)
endpoint_name = "hf-vlm-gemma-3-27b-instruct-2025-05-07-18-09-16-221"

payload = {
    "inputs": "What is Amazon doing in the field of generative AI?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.1,
        "top_p": 0.9,
        "return_full_text": False
    }
}

# Run inference
try:
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )
    # Parse the response
    result = json.loads(response["Body"].read().decode("utf-8"))
    generated_text = result[0]["generated_text"].strip()
    print("Generated Response:")
    print(generated_text)
except Exception as e:
    print(f"Error during inference: {e}")

Clean up

To avoid incurring ongoing charges for AWS resources used during exploration of Gemma3 27B Instruct models, it’s important to clean up deployed endpoints and associated resources. Complete the following steps:

  1. Delete SageMaker endpoints:
    1. On the SageMaker console, in the navigation pane, choose Endpoints under Inference.
    2. Select the endpoint associated with the Gemma3 27B Instruct model (for example, gemma3-27b-instruct-endpoint).
    3. Choose Delete and confirm the deletion. This stops the endpoint and prevents further compute charges.
  2. Delete SageMaker models (if applicable):
    1. On the SageMaker console, choose Models under Inference.
    2. Select the model associated with your endpoint and choose Delete.
  3. Verify Amazon Bedrock Marketplace resources:
    1. On the Amazon Bedrock console, choose Model catalog in the navigation pane.
    2. Make sure no additional endpoints are running for the Gemma3 27B Instruct model deployed through Amazon Bedrock Marketplace.

Always verify that all endpoints are deleted after experimentation to optimize costs. Refer to the Amazon SageMaker documentation for additional guidance on managing resources.
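
If you prefer to clean up programmatically, the following is a minimal sketch using boto3. The endpoint name is a placeholder, and the sketch assumes a single production variant.

import boto3

sagemaker = boto3.client("sagemaker")
endpoint_name = "gemma3-27b-instruct-endpoint"  # replace with your endpoint name

# Look up the endpoint configuration and model behind the endpoint
config_name = sagemaker.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
model_name = sagemaker.describe_endpoint_config(EndpointConfigName=config_name)["ProductionVariants"][0]["ModelName"]

# Delete the endpoint, its configuration, and the model to stop all charges
sagemaker.delete_endpoint(EndpointName=endpoint_name)
sagemaker.delete_endpoint_config(EndpointConfigName=config_name)
sagemaker.delete_model(ModelName=model_name)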

Conclusion

The availability of Gemma3 27B Instruct models in Amazon Bedrock Marketplace and SageMaker JumpStart empowers developers, researchers, and businesses to build cutting-edge generative AI applications with ease. With their high performance, multilingual capabilities and efficient deployment on AWS infrastructure, these models are well-suited for a wide range of use cases, from conversational AI to code generation and content automation. By using the seamless discovery and deployment capabilities of SageMaker JumpStart and Amazon Bedrock Marketplace, you can accelerate your AI innovation while benefiting from the secure, scalable, and cost-effective AWS Cloud infrastructure.

We encourage you to explore the Gemma3 27B Instruct models today by visiting the SageMaker JumpStart console or Amazon Bedrock Marketplace. Deploy the model and experiment with sample prompts to meet your specific needs. For further learning, explore the AWS Machine Learning Blog, the SageMaker JumpStart GitHub repository, and the Amazon Bedrock documentation. Start building your next generative AI solution with Gemma3 27B Instruct models and unlock new possibilities with AWS!


About the Authors

Santosh Vallurupalli is a Sr. Solutions Architect at AWS. Santosh specializes in networking, containers, and migrations, and enjoys helping customers in their journey of cloud adoption and building cloud-based solutions for challenging issues. In his spare time, he likes traveling, watching Formula1, and watching The Office on repeat.

Aravind Singirikonda is an AI/ML Solutions Architect at AWS. He works with AWS customers in the healthcare and life sciences domain to provide guidance and technical assistance, helping them improve the value of their AI/ML solutions when using AWS.

Pawan Matta is a Sr. Solutions Architect at AWS. He works with AWS customers in the gaming industry and guides them to deploy highly scalable, performant architectures. His area of focus is management and governance. In his free time, he likes to play FIFA and watch cricket.

Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and GTM. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.


Building a multimodal RAG based application using Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases

Organizations today deal with vast amounts of unstructured data in various formats including documents, images, audio files, and video files. Often these documents are quite large, creating significant challenges such as slower processing times and increased storage costs. Extracting meaningful insights from these diverse formats in the past required complex processing pipelines and significant development effort. Before generative AI, organizations had to rely on multiple specialized tools, custom-built solutions, and extensive manual review processes, making it time-consuming and error-prone to process and analyze these documents at scale. Generative AI technologies are revolutionizing this landscape by offering powerful capabilities to automatically process, analyze, and extract insights from these diverse document formats, significantly reducing manual effort while improving accuracy and scalability.

With Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases, you can now build powerful multimodal RAG applications with minimal effort. Amazon Bedrock Data Automation provides automated workflows for efficiently processing various file formats at scale, while Amazon Bedrock Knowledge Bases creates a unified, searchable repository that can understand natural language queries. Together, they enable organizations to efficiently process, organize, and retrieve information from their multimodal content, transforming how they manage and use their unstructured data.

In this post, we walk through building a full-stack application that processes multimodal content using Amazon Bedrock Data Automation, stores the extracted information in an Amazon Bedrock knowledge base, and enables natural language querying through a RAG-based Q&A interface.

Real world use cases

The integration of Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases enables powerful solutions for processing large volumes of unstructured data across various industries such as:

  • In healthcare, organizations deal with extensive patient records including medical forms, diagnostic images, and consultation recordings. Amazon Bedrock Data Automation automatically extracts and structures this information, while Amazon Bedrock Knowledge Bases enables medical staff to use natural language queries like “What was the patient’s last blood pressure reading?” or “Show me the treatment history for diabetes patients.”
  • Financial institutions process thousands of documents daily, from loan applications to financial statements. Amazon Bedrock Data Automation extracts key financial metrics and compliance information, while Amazon Bedrock Knowledge Bases allows analysts to ask questions like “What are the risk factors mentioned in the latest quarterly reports?” or “Show me all loan applications with high credit scores.”
  • Legal firms handle vast case files with court documents, evidence photos, and witness testimonies. Amazon Bedrock Data Automation processes these diverse sources, and Amazon Bedrock Knowledge Bases lets lawyers query “What evidence was presented about the incident on March 15?” or “Find all witness statements mentioning the defendant.”
  • Media companies can use this integration for intelligent contextual ad placement. Amazon Bedrock Data Automation processes video content, subtitles, and audio to understand scene context, dialogue, and mood, while simultaneously analyzing advertising assets and campaign requirements. Amazon Bedrock Knowledge Bases then enables sophisticated queries to match ads with appropriate content moments, such as “Find scenes with positive outdoor activities for sports equipment ads” or “Identify segments discussing travel for tourism advertisements.” This intelligent contextual matching offers more relevant and effective ad placements while maintaining brand safety.

These examples demonstrate how the extraction capabilities of Amazon Bedrock Data Automation combined with the natural language querying of Amazon Bedrock Knowledge Bases can transform how organizations interact with their unstructured data.

Solution overview

This comprehensive solution demonstrates the advanced capabilities of Amazon Bedrock for processing and analyzing multimodal content (documents, images, audio files, and video files) through three key components: Amazon Bedrock Data Automation, Amazon Bedrock Knowledge Bases, and foundation models available through Amazon Bedrock. Users can upload various types of content including audio files, images, videos, or PDFs for automated processing and analysis.

When you upload content, Amazon Bedrock Data Automation processes it using either standard or custom blueprints to extract valuable insights. The extracted information is stored as JSON in an Amazon Simple Storage Service (Amazon S3) bucket, while job status is tracked through Amazon EventBridge and maintained in Amazon DynamoDB. The solution performs custom parsing of the extracted JSON to create knowledge base-compatible documents, which are then stored and indexed in Amazon Bedrock Knowledge Bases.

Through an intuitive user interface, the solution displays both the uploaded content and its extracted information. Users can interact with the processed data through a Retrieval Augmented Generation (RAG)-based Q&A system, powered by Amazon Bedrock foundation models. This integrated approach enables organizations to efficiently process, analyze, and derive insights from diverse content formats while using a robust and scalable infrastructure deployed using the AWS Cloud Development Kit (AWS CDK).

Architecture

Architecture diagram

The preceding architecture diagram illustrates the flow of the solution:

  1. Users interact with the frontend application, authenticating through Amazon Cognito
  2. API requests are handled by Amazon API Gateway and AWS Lambda functions
  3. Files are uploaded to an S3 bucket for processing
  4. Amazon Bedrock Data Automation processes the files and extracts information
  5. EventBridge manages the job status and triggers post-processing
  6. Job status is stored in DynamoDB and processed content is stored in Amazon S3
  7. A Lambda function parses the processed content and indexes it in Amazon Bedrock Knowledge Bases
  8. A RAG-based Q&A system uses Amazon Bedrock foundation models to answer user queries

Prerequisites

Backend

For the backend, you need to have the following prerequisites:

To use the Q&A feature, make sure that you enable access to the Amazon Bedrock foundation models that you’re planning to use, in the required AWS Regions.

  • For models in the dropdown list marked On demand, enable model access in the Region where you deployed this stack.
  • For models in the dropdown list marked CRIS, enable model access in every Region used by the system-defined inference profile (cross-Region inference). For instance, to use Amazon Nova Pro - CRIS US, make sure you enable access to the Amazon Nova Pro model in every Region used by this inference profile: US East (Virginia) us-east-1, US West (Oregon) us-west-2, and US East (Ohio) us-east-2. A minimal sketch of invoking a CRIS inference profile follows this list.
  • The models used in this solution include:
    • Anthropic’s Claude 3.5 Sonnet v2.0
    • Amazon Nova Pro v1.0
    • Anthropic’s Claude 3.7 Sonnet v1.0
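
For reference, the following minimal sketch invokes a cross-Region (CRIS) inference profile through the Converse API. The inference profile ID shown is an assumption and should be verified in the Amazon Bedrock console for your account.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# With CRIS, the modelId is the inference profile ID rather than the base model ID
response = bedrock_runtime.converse(
    modelId="us.amazon.nova-pro-v1:0",  # assumed cross-Region inference profile ID for Amazon Nova Pro
    messages=[{"role": "user", "content": [{"text": "Summarize this document in one sentence."}]}],
)
print(response["output"]["message"]["content"][0]["text"])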

Frontend

For the frontend, you need to have the following prerequisites:

  • Node/npm: v18.12.1
  • The deployed backend.
  • At least one user added to the appropriate Amazon Cognito user pool (required for authenticated API calls).

Everything you need is provided as open source code in our GitHub repository.

git clone https://github.com/aws-samples/generative-ai-cdk-constructs-samples.git

Deployment guide

This sample application codebase is organized into these key folders:

samples/bedrock-bda-media-solution

├── backend # Backend architecture CDK project
├── images # Images used for documentation
└── frontend # Frontend sample application

Deploy the backend

Use the following steps to deploy the backend AWS resources:

  • If you haven’t already done so, clone this repository:
    git clone https://github.com/aws-samples/generative-ai-cdk-constructs-samples.git

  • Enter the backend directory
    cd samples/multimodal-rag/backend

  • Create a virtualenv on MacOS and Linux:
    python3 -m venv .venv

  • Activate the virtualenv
    source .venv/bin/activate

  • After the virtualenv is activated, you can install the required dependencies.
    pip install -r requirements.txt

  • Bootstrap CDK. Bootstrapping is the process of preparing your AWS environment for use with the AWS CDK.
    cdk bootstrap

  • Run the AWS CDK Toolkit to deploy the backend stack with the runtime resources.
    cdk deploy

To help protect against unintended changes that affect your security posture, the AWS CDK Toolkit prompts you to approve security-related changes before deploying them. You need to answer yes to deploy the stack.

After the backend is deployed, you need to create a user. First, use the AWS CLI to locate the Amazon Cognito user pool ID:

$ aws cloudformation describe-stacks \
--stack-name BDAMediaSolutionBackendStack \
--query "Stacks[0].Outputs[?contains(OutputKey, 'UserPoolId')].OutputValue"

[
    "<region>_a1aaaA1Aa"
]

You can then go to the Amazon Cognito page in the AWS Management Console, search for the user pool, and add users.
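
Alternatively, the following minimal sketch creates and activates a user programmatically with boto3. The user pool ID, user name, and password are placeholders.

import boto3

cognito = boto3.client("cognito-idp")

user_pool_id = "<region>_a1aaaA1Aa"  # from the CloudFormation output above
username = "demo-user@example.com"   # placeholder user name

# Create the user and set a permanent password so no reset is required at first sign-in
cognito.admin_create_user(
    UserPoolId=user_pool_id,
    Username=username,
    UserAttributes=[{"Name": "email", "Value": username}, {"Name": "email_verified", "Value": "true"}],
    MessageAction="SUPPRESS",
)
cognito.admin_set_user_password(
    UserPoolId=user_pool_id,
    Username=username,
    Password="Choose-a-strong-password-1!",
    Permanent=True,
)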

Deploy the frontend

The repository provides a demo frontend application. With this, you can upload and review media files processed by the backend application. To deploy the UI, follow these steps:

  • Enter the frontend directory
    cd samples/multimodal-rag/frontend

  • Create a .env file by duplicating the included example.env and replace the property values with the values retrieved from the MainBackendStack outputs.
VITE_REGION_NAME=<BDAMediaSolutionBackendStack.RegionName>
VITE_COGNITO_USER_POOL_ID=<BDAMediaSolutionBackendStack.CognitoUserPoolId>
VITE_COGNITO_USER_POOL_CLIENT_ID=<BDAMediaSolutionBackendStack.CognitoUserPoolClientId>
VITE_COGNITO_IDENTITY_POOL_ID=<BDAMediaSolutionBackendStack.CognitoIdentityPoolId>
VITE_API_GATEWAY_REST_API_ENDPOINT=<BDAMediaSolutionBackendStack.ApiGatewayRestApiEndpoint>
VITE_APP_NAME="Bedrock BDA Multimodal Media Solution"
VITE_S3_BUCKET_NAME=<BDAMediaSolutionBackendStack.BDAInputBucket>

Alternatively, you can run the following script to automate the preceding step:

./generate-dev-env.sh
  • Install the dependencies
    npm install

  • Start the web application
    npm run dev

A URL like http://localhost:5173/ will be displayed, so you can open the web application from your browser. Sign in to the application with the user profile you created in Amazon Cognito.

Set up Amazon Bedrock Data Automation

Before processing files, you need to set up an Amazon Bedrock Data Automation project and configure extraction patterns. The solution provides a control plane interface, shown in the following figure, where you can:

  • View existing Amazon Bedrock Data Automation projects in your account
  • Create new projects and blueprints
  • Select the appropriate project for processing

Setup bda

For specific documentation on how Amazon Bedrock Data Automation works, see How Bedrock Data Automation works.

After deciding the project to use, select it from the dropdown list in the list projects operation card. The selected project will be used for file processing.

Process multimodal content

To begin, go to the home page of the frontend application, shown in the following screenshot, and choose Choose file near the top right corner. Select a file. A tooltip will appear when you hover over the button, displaying the file requirements supported by Amazon Bedrock Data Automation. The application supports various file types that Amazon Bedrock Data Automation can process:

  1. PDF files
  2. Images
  3. Audio files
  4. Video files

Process multimodal content

For ready-to-use sample files, see the back-end/samples folder.

When you upload a file

The following process is triggered when a file is uploaded:

  1. The file is stored in an S3 bucket
  2. An Amazon Bedrock Data Automation job is initiated through the backend API
  3. The job status is tracked and updated in DynamoDB
  4. Extracted information is made available through the UI after processing completes

BDA analysis results

The processing time varies depending on the size of the file. You can check the status of processing tasks by choosing the refresh button. After a job is completed, you can select the file name in the table on the Home page to access the file details.

You can access the job details that Amazon Bedrock Data Automation produced by navigating through the tabs on the right side of the screen. The Standard Output and Custom Output tabs provide details on the information extracted by Amazon Bedrock Data Automation.

Ask questions about your uploaded document

The Q&A tab provides a chatbot for asking questions about the processed documents. You can select an Amazon Bedrock foundation model from the dropdown list and ask a question. Currently, the following models are supported:

  • Anthropic’s Claude 3.5 Sonnet v2.0
  • Amazon Nova Pro v1.0
  • Anthropic’s Claude 3.7 Sonnet v1.0

In the following image, an Amazon Bedrock foundation model is used to ask questions against the Amazon Bedrock knowledge base. Each processed document has been ingested and stored in the vector store.

bda-qa
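
The following is a minimal sketch of this kind of knowledge base query using the Amazon Bedrock RetrieveAndGenerate API. The knowledge base ID and model ARN are placeholders, and the sample application may implement the call differently.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What safety topics are covered in the uploaded video?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "XXXXXXXXXX",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
        },
    },
)
print(response["output"]["text"])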

Clean up

Delete the stack to avoid unexpected charges.

  1. Remove the data from the S3 buckets created for this solution.
  2. Run cdk destroy.
  3. Delete the S3 buckets.
  4. Delete the logs created for this solution by the different services in Amazon CloudWatch Logs.

Conclusion

The integration of Amazon Bedrock Data Automation and Amazon Bedrock Knowledge Bases represents a significant leap forward in how organizations can process and derive value from their multimodal content. This solution not only demonstrates the technical implementation but also showcases the transformative potential of combining automated content processing with intelligent querying capabilities. By using the AWS serverless architecture and the power of foundation models, you can now build scalable, cost-effective solutions that turn your unstructured data into actionable insights.

At the time of writing, this solution is available in the following AWS Regions: US East (N. Virginia) and US West (Oregon).


About the authors

Lana Zhang is a Senior Solutions Architect in the AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. She’s dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases by adding business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, advertising, and marketing.

Alain Krok is a Senior Solutions Architect with a passion for emerging technologies. His experience includes designing and implementing IIoT solutions for the oil and gas industry and working on robotics projects. He enjoys pushing the limits and indulging in extreme sports when he’s not designing software.

Dinesh Sajwan is a Senior Prototyping Architect at AWS. He thrives on working with cutting-edge technologies and leverages his expertise to solve complex business challenges. His diverse technical background enables him to develop innovative solutions across various domains. When not exploring new technologies, he enjoys spending quality time with his family and indulging in binge-watching his favorite shows.


Tailoring foundation models for your business needs: A comprehensive guide to RAG, fine-tuning, and hybrid approaches

Foundation models (FMs) have revolutionized AI capabilities, but adapting them to specific business needs can be challenging. Organizations often struggle with balancing model performance, cost-efficiency, and the need for domain-specific knowledge. This post explores three powerful techniques for tailoring FMs to your unique requirements: Retrieval Augmented Generation (RAG), fine-tuning, and a hybrid approach combining both methods. We dive into the advantages, limitations, and ideal use cases for each strategy.

AWS provides a suite of services and features to simplify the implementation of these techniques. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Knowledge Bases provides native support for RAG, streamlining the process of enhancing model outputs with domain-specific information. Amazon Bedrock also offers native features for model customizations through continued pre-training and fine-tuning. In addition, you can use Amazon Bedrock Custom Model Import to bring and use your customized models alongside existing FMs through a single serverless, unified API. Use Amazon Bedrock Model Distillation to use smaller, faster, more cost-effective models that deliver use-case specific accuracy that is comparable to the most advanced models in Amazon Bedrock.

For this post, we used Amazon SageMaker AI for the fine-tuning and hybrid approaches to maintain more control over the fine-tuning script and to try different fine-tuning methods. In addition, we used Amazon Bedrock Knowledge Bases for the RAG approach, as shown in Figure 1.

To help you make informed decisions, we provide ready-to-use code in our GitHub repo, using these AWS services to experiment with RAG, fine-tuning, and hybrid approaches. You can evaluate their performance based on your specific use case and dataset, and use the approach that best fits to effectively customize FMs for your business needs.

Figure 1: Architecture diagram for RAG, fine-tuning and hybrid approaches

Retrieval Augmented Generation

RAG is a cost-effective way to enhance AI capabilities by connecting existing models to external knowledge sources. For example, an AI powered customer service chatbot using RAG can answer questions about current product features by first checking the product documentation knowledge base. If a customer asks a question, the system retrieves the specific details from the product knowledge base before composing its response, helping to make sure that the information is accurate and up-to-date.

A RAG approach gives AI models access to external knowledge sources for better responses and has two main steps: retrieval for finding the relevant information from connected data sources and generation using an FM to generate an answer based on the retrieved information.

Fine-tuning

Fine-tuning is a powerful way to customize FMs for specific tasks or domains using additional training data. In fine-tuning, you adjust the model’s parameters using a smaller, labeled dataset relevant to the target domain.

For example, to build an AI powered customer service chatbot, you can fine-tune an existing FM using your own dataset to handle questions about a company’s product features. By training the model on historical customer interactions and product specifications, the fine-tuned model learns the context and the company messaging tone to provide more accurate responses.

If the company launches a new product, the model should be fine-tuned again with new data to update its knowledge and maintain relevance. Fine-tuning helps make sure that the model can deliver precise, context-aware responses. However, it requires more computational resources and time compared to RAG, because the model itself needs to be retrained with the new data.

Hybrid approach

The hybrid approach combines the strengths of RAG and fine-tuning to deliver highly accurate, context-aware responses. Consider an example: a company frequently updates the features of its products. They want to customize their FM using internal data, but keeping the model updated with changes in the product catalog is challenging. Because product features change monthly, keeping the model up to date would be costly and time-consuming.

By adopting a hybrid approach, the company can reduce costs and improve efficiency. They can fine-tune the model every couple of months to keep it aligned with the company’s overall tone. Meanwhile, RAG can retrieve the latest product information from the company’s knowledge base, helping to make sure that responses are up-to-date. Fine-tuning the model also enhances RAG’s performance during the generation phase, leading to more coherent and contextually relevant responses. If you want to further improve the retrieval phase, you can customize the embedding model, use a different search algorithm, or explore other retrieval optimization techniques.

The following sections provide the background for dataset creation and the implementation of the three approaches.

Prerequisites

To deploy the solution, you need:

Dataset description

For the proof-of-concept, we created two synthetic datasets using Anthropic’s Claude 3 Sonnet on Amazon Bedrock.

Product catalog dataset

This dataset is your primary knowledge source in Amazon Bedrock. We created a product catalog that consists of 15 fictitious manufacturing products by prompting Anthropic’s Claude 3 Sonnet with example product catalogs. You should create your dataset in .txt format. The example dataset for this post has the following fields (see the generation sketch after this list):

  • Product names
  • Product descriptions
  • Safety instructions
  • Configuration manuals
  • Operation instructions
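
As a minimal sketch of this kind of synthetic data generation, the following call uses the Amazon Bedrock Converse API with Anthropic’s Claude 3 Sonnet. The prompt text, inference parameters, and output file name are illustrative assumptions, not the exact prompts used for this post.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = (
    "Generate a fictitious manufacturing product entry with a product name, "
    "product description, safety instructions, a configuration manual, and operation instructions."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.7},  # illustrative values
)

# Append the generated entry to the .txt product catalog
product_entry = response["output"]["message"]["content"][0]["text"]
with open("product_catalog.txt", "a") as f:
    f.write(product_entry + "\n\n")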

Training and test datasets

We use the same product catalog we created for the RAG approach as training data to run domain adaptation fine-tuning.

The test dataset consists of question-and-answer pairs about the product catalog dataset created earlier. We used the code in the Question-Answer Dataset section of the Jupyter notebook to generate the test dataset.

Implementation

We implemented three different approaches: RAG, fine-tuning, and hybrid. See the Readme file for instructions to deploy the whole solution.

RAG

The RAG approach uses Amazon Bedrock Knowledge Bases and consists of two main parts.

To set up the infrastructure:

  1. Update the config file with your required data (details in the Readme)
  2. Run the following commands in the infrastructure folder:
cd infrastructure
./prepare.sh
cdk bootstrap aws://<<ACCOUNT_ID>>/<<REGION>>
cdk synth
cdk deploy --all

Context retrieval and response generation:

  1. The system finds relevant information by searching the knowledge base with the user’s question
  2. It then sends both the user’s question and the retrieved information to the Meta Llama 3.1 8B LLM on Amazon Bedrock
  3. The LLM then generates a response based on the user’s question and the retrieved information, as shown in the sketch after this list
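
The following is a minimal sketch of this retrieve-and-generate flow using the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API with boto3. The knowledge base ID, Region, sample question, and Llama 3.1 8B model ARN are placeholder assumptions, not values from the repository.

import boto3

KB_ID = "YOUR_KNOWLEDGE_BASE_ID"  # placeholder
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/meta.llama3-1-8b-instruct-v1:0"  # assumed model ID

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What are the safety instructions for product X?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

# The service retrieves relevant chunks and generates a grounded answer in one call
print(response["output"]["text"])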

Fine-tuning

We used Amazon SageMaker AI JumpStart to fine-tune the Meta Llama 3.1 8B Instruct model using the domain adaptation method for 5 epochs. You can adjust the following parameters in the config.py file (a sketch of the JumpStart fine-tuning call follows the list):

  • Fine-tuning method: You can change the fine-tuning method in the config file; the default is domain_adaptation.
  • Number of epochs: Adjust the number of epochs in the config file according to your data size.
  • Fine-tuning template: Change the template based on your use case. The current one prompts the LLM to answer a customer question.
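
The following is a minimal sketch of the kind of SageMaker JumpStart call involved. The JumpStart model ID, S3 location, and hyperparameter names are assumptions based on the JumpStart Llama text-generation models and may differ from the repository’s exact configuration.

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Assumed JumpStart model ID and training data location; adjust to your account and Region
model_id = "meta-textgeneration-llama-3-1-8b-instruct"
train_data = "s3://your-bucket/product-catalog/train/"

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},  # required for Meta Llama models
    instance_type="ml.g5.12xlarge",
    hyperparameters={
        "instruction_tuned": "False",  # domain adaptation fine-tuning
        "epoch": "5",
    },
)
estimator.fit({"training": train_data})

# Deploy the fine-tuned model to a real-time endpoint (also used by the hybrid approach)
predictor = estimator.deploy(instance_type="ml.g5.12xlarge")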

Hybrid

The hybrid approach combines RAG and fine-tuning, and uses the following high-level steps:

  1. The most relevant context is retrieved from the knowledge base based on the user’s question
  2. The fine-tuned model generates an answer using the retrieved context

You can customize the prompt template in the config.py file.
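
A minimal sketch of these two steps might look like the following, assuming a hypothetical knowledge base ID, a SageMaker endpoint name for the fine-tuned model, and a JumpStart-style JSON payload; the actual prompt template lives in config.py and may differ.

import json
import boto3

kb_client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
sm_client = boto3.client("sagemaker-runtime", region_name="us-east-1")

def hybrid_answer(question, kb_id="YOUR_KNOWLEDGE_BASE_ID", endpoint_name="YOUR_FINETUNED_ENDPOINT"):
    # Step 1: retrieve the most relevant chunks from the knowledge base
    retrieval = kb_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
    )
    context = "\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])

    # Step 2: the fine-tuned model generates an answer from the retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    response = sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
    )
    return json.loads(response["Body"].read())

print(hybrid_answer("What are the safety instructions for product X?"))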

Evaluation

For this example, we use three evaluation metrics to measure performance. You can modify src/evaluation.py to implement your own metrics.

Each metric helps you understand different aspects of how well each of the approaches works:

  • BERTScore: BERTScore measures how similar the generated answers are to the reference answers using cosine similarity between contextual token embeddings. It calculates precision, recall, and F1; we used the F1 measure as the evaluation score.
  • LLM evaluator score: We use different language models from Amazon Bedrock to score the responses from the RAG, fine-tuning, and hybrid approaches. Each evaluator receives both the correct answer and the generated answer and gives a score between 0 and 1 (closer to 1 indicates higher similarity) for each generated answer. We then calculate the final score by averaging all of the evaluation scores. The process is shown in the following figure.

Figure 2: LLM evaluator method

  • Inference latency: Response times matter in applications like chatbots, so depending on your use case, this metric might be an important factor in your decision. For each approach, we averaged the time it took to receive a full response for each sample.
  • Cost analysis: To do a full cost analysis, we made the following assumptions:
    • We used one OpenSearch compute unit (OCU) for indexing and another for search, covering the document indexing and retrieval needed for RAG. See OpenSearch Serverless pricing for more details.
    • We assume an application that has 1,000 users, each making 10 requests per day with an average of 2,000 input tokens and 1,000 output tokens. See Amazon Bedrock pricing for more details.
    • We used ml.g5.12xlarge instance for fine-tuning and hosting the fine-tuned model. The fine-tuning job took 15 minutes to complete. See SageMaker AI pricing for more details.
    • For fine-tuning and the hybrid approach, we assume that the model instance is up 24/7, which might vary according to your use case.
    • The cost calculation is done for one month.

Based on those assumptions, the cost associated with each of the three approaches is calculated as follows:

  • For RAG:
    • OpenSearch Serverless monthly cost = cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
    • Amazon Bedrock monthly cost for Meta Llama 3.1 8B = 1,000 users * 10 requests * (price per input token * 2,000 + price per output token * 1,000) * 30 days
  • For fine-tuning:
    • Fine-tuning job cost = (number of minutes used for the fine-tuning job / 60) * hourly cost of an ml.g5.12xlarge instance
    • Hosting cost = hourly cost of an ml.g5.12xlarge instance * 24 hours * 30 days
  • For hybrid:
    • OpenSearch Serverless monthly cost = cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
    • Fine-tuning job cost = (number of minutes used for the fine-tuning job / 60) * hourly cost of an ml.g5.12xlarge instance
    • Hosting cost = hourly cost of an ml.g5.12xlarge instance * 24 hours * 30 days
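
As a concrete illustration, the following arithmetic reproduces the approximate RAG figure shown in the results table. The rates are assumptions for illustration (roughly $0.24 per OCU-hour for OpenSearch Serverless and roughly $0.00022 per 1,000 input and output tokens for Meta Llama 3.1 8B on demand); check the pricing pages for current values.

# Assumed rates for illustration only; see the AWS pricing pages for current values
ocu_per_hour = 0.24            # OpenSearch Serverless, USD per OCU-hour
price_per_1k_input = 0.00022   # Meta Llama 3.1 8B on Amazon Bedrock, USD
price_per_1k_output = 0.00022

opensearch_monthly = ocu_per_hour * 2 * 24 * 30  # 2 OCUs -> ~345.60 USD (~350 in the table)
requests_monthly = 1000 * 10 * 30                # users * requests per day * days
bedrock_monthly = requests_monthly * (2 * price_per_1k_input + 1 * price_per_1k_output)  # ~198 USD

print(f"RAG monthly cost ~= {opensearch_monthly + bedrock_monthly:.0f} USD")  # ~544 (~548 in the table)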

Results

You can find detailed evaluation results in two places in the code repository. The individual scores for each sample are in the JSON files under data/output, and a summary of the results is in summary_results.csv in the same folder.

The results shown in the following table show:

  • How each approach (RAG, fine-tuning, and hybrid) performs
  • Their scores from both BERTScore and LLM evaluators
  • The cost analysis for each method calculated for the US East region
Approach | Average BERTScore | Average LLM evaluator score | Average inference time (seconds) | Cost per month (US East region)
RAG | 0.8999 | 0.8200 | 8.336 | ~$350 + $198 ≈ $548
Fine-tuning | 0.8660 | 0.5556 | 4.159 | ~$1.77 + $5,105 ≈ $5,107
Hybrid | 0.8908 | 0.8556 | 17.700 | ~$350 + $1.77 + $5,105 ≈ $5,457

Note that the costs for both the fine-tuning and hybrid approaches can decrease significantly, depending on the traffic pattern, if you configure the SageMaker real-time inference endpoint to scale down to zero instances when not in use.

Clean up

Follow the cleanup section in the Readme file to avoid paying for unused resources.

Conclusion

In this post, we showed you how to implement and evaluate three powerful techniques for tailoring FMs to your business needs: RAG, fine-tuning, and a hybrid approach combining both methods. We provided ready-to-use code to help you experiment with these approaches and make informed decisions based on your specific use case and dataset.

The results in this example were specific to the dataset that we used. For that dataset, RAG outperformed fine-tuning and achieved comparable results to the hybrid approach with a lower cost, but fine-tuning led to the lowest latency. Your results will vary depending on your dataset.

We encourage you to test these approaches using our code as a starting point:

  1. Add your own datasets in the data folder
  2. Fill out the config.py file
  3. Follow the rest of the Readme instructions to run the full evaluation

About the Authors

Idil Yuksel is a Working Student Solutions Architect at AWS, pursuing her MSc. in Informatics with a focus on machine learning at the Technical University of Munich. She is passionate about exploring application areas of machine learning and natural language processing. Outside of work and studies, she enjoys spending time in nature and practicing yoga.

Karim Akhnoukh is a Senior Solutions Architect at AWS working with customers in the financial services and insurance industries in Germany. He is passionate about applying machine learning and generative AI to solve customers’ business challenges. Besides work, he enjoys playing sports, aimless walks, and good quality coffee.

Read More

NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation

Building effective agentic AI systems requires rethinking how technology interacts and delivers value across organizations.

Bartley Richardson, senior director of engineering and AI infrastructure at NVIDIA, joined the NVIDIA AI Podcast to discuss how enterprises can successfully deploy agentic AI systems.

“When I talk with people about agents and agentic AI, what I really want to say is automation,” Richardson said. “It is that next level of automation.”

Richardson explains that AI reasoning models play a critical role in these systems by “thinking out loud” and enabling better planning capabilities.

“Reasoning models have been trained and tuned in a very specific way to think — almost like thinking out loud,” Richardson said. “It’s kind of like when you’re brainstorming with your colleagues or family.”

What makes NVIDIA’s Llama Nemotron models distinctive is that they give users the ability to toggle reasoning on or off within the same model, optimizing for specific tasks.

Enterprise IT leaders must acknowledge the multi-vendor reality of modern environments, Richardson explained, saying organizations will have agent systems from various sources working together simultaneously.

“You’re going to have all these agents working together, and the trick is discovering how to let them all mesh together in a somewhat seamless way for your employees,” Richardson said.

To address this challenge, NVIDIA developed the AI-Q Blueprint for developing advanced agentic AI systems. Teams can build AI agents to automate complex tasks, break down operational silos and drive efficiency across industries. The blueprint uses the open-source NVIDIA Agent Intelligence (AIQ) toolkit to evaluate and profile agent workflows, making it easier to optimize and ensure interoperability among agents, tools and data sources.

“We have customers that optimize their tool-calling chains and get 15x speedups through their pipeline using AI-Q,” Richardson said.

He also emphasized the importance of maintaining realistic expectations that still provide significant business value.

“Agentic systems will make mistakes,” Richardson added. “But if it gets you 60%, 70%, 80% of the way there, that’s amazing.”

Time Stamps

1:15 – Defining agentic AI as the next evolution of enterprise automation.

4:06 – How reasoning models enhance agentic system capabilities.

12:41 – Enterprise considerations for implementing multi-vendor agent systems.

19:33 – Introduction to the NVIDIA Agent Intelligence toolkit for observability and traceability.

You Might Also Like… 

NVIDIA’s Rama Akkiraju on How AI Platform Architects Help Bridge Business Vision and Technical Execution

Enterprises are exploring AI to rethink problem-solving and business processes. These initiatives require the right infrastructure, such as AI factories, which allow businesses to convert data into tokens and outcomes. Rama Akkiraju, vice president of IT for AI and machine learning at NVIDIA, joined the AI Podcast to discuss how enterprises can build the right foundations for AI success, and the critical role of AI platform architects in designing and building AI infrastructure based on specific business needs.

Roboflow Helps Unlock Computer Vision for Every Kind of AI Builder

Roboflow’s mission is to make the world programmable through computer vision. By simplifying computer vision development, the company helps bridge the gap between AI and people looking to harness it. Cofounder and CEO Joseph Nelson discusses how Roboflow empowers users in manufacturing, healthcare and automotive to solve complex problems with visual AI.

NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises

Agentic AI enables developers to create intelligent multi-agent systems that reason, act and execute complex tasks with a degree of autonomy. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications.

Read More

How Rufus doubled their inference speed and handled Prime Day traffic with AWS AI chips and parallel decoding

Large language models (LLMs) have revolutionized the way we interact with technology, but their widespread adoption has been blocked by high inference latency, limited throughput, and high costs associated with text generation. These inefficiencies are particularly pronounced during high-demand events like Amazon Prime Day, where systems like Rufus—the Amazon AI-powered shopping assistant—must handle massive scale while adhering to strict latency and throughput requirements. Rufus is an AI-powered shopping assistant designed to help customers make informed purchasing decisions. Powered by LLMs, Rufus answers customer questions about a variety of shopping needs and products and simplifies the shopping experience, as shown in the following image.

Image: The Rufus AI shopping assistant in the Amazon mobile app, explaining the differences between trail and running shoes and showing curated product suggestions.

Rufus relies on many components to deliver its customer experience, including a foundation LLM (for response generation) and a query planner (QP) model for query classification and retrieval enhancement. The QP model parses customer questions to understand their intent, whether keyword-based or conversational natural language. QP is on the critical path for Rufus because Rufus cannot initiate token generation until QP provides its full output. Thus, reducing QP’s end-to-end text generation latency is a critical requirement for reducing the first chunk latency in Rufus, which refers to the time taken to generate and send the first response to a user request. Lowering this latency improves perceived responsiveness and overall user experience. This post focuses on how the QP model used draft-centric speculative decoding (SD)—also called parallel decoding—with AWS AI chips to meet the demands of Prime Day. By combining parallel decoding with AWS Trainium and Inferentia chips, Rufus achieved two times faster response times, a 50% reduction in inference costs, and seamless scalability during peak traffic.

Scaling LLMs for Prime Day

Prime Day is one of the most demanding events for the Amazon infrastructure, pushing systems to their limits. In 2024, Rufus faced an unprecedented engineering challenge: handling millions of queries per minute and generating billions of tokens in real time, all while maintaining a 300 ms latency SLA for QP tasks and minimizing power consumption. These demands required a fundamental rethinking of how LLMs are deployed at scale to conquer the cost and performance bottlenecks. The key challenges of Prime Day included:

  • Massive scale: Serving millions of tokens per minute to customers worldwide, with peak traffic surges that strain even the most robust systems.
  • Strict SLAs: Delivering real-time responsiveness with a hard latency limit of 300 ms, ensuring a seamless customer experience.
  • Cost efficiency: Minimizing the cost of serving LLMs at scale while reducing power consumption, a critical factor for sustainable and economical operations.

Traditional LLM text generation is inherently inefficient because of its sequential nature. Each token generation requires a full forward pass through the model, leading to high latency and underutilization of computational resources. While techniques like speculative decoding have been proposed to address these inefficiencies, their complexity and training overhead have limited their adoption.

AWS AI chips and parallel decoding

To overcome these challenges, Rufus adopted parallel decoding, a simple yet powerful technique for accelerating LLM generation. With parallel decoding, the sequential dependency is broken, making autoregressive generation faster. This approach introduces additional decoding heads to the base model, eliminating the need for a separate draft model to propose speculated tokens. These heads predict tokens for multiple future positions in parallel, before the preceding tokens are known, which significantly improves generation efficiency.
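
To make the idea concrete, the following toy sketch illustrates the generic propose-then-verify loop behind draft-head decoding. It is a simplification that omits the tree-based candidate selection described later; draft_heads and base_model are hypothetical stand-ins, not the Rufus or NxDI implementation.

def parallel_decode_step(base_model, draft_heads, prefix):
    # 1. Draft heads cheaply propose one token each for the next few positions, in parallel
    proposals = [head.predict(prefix) for head in draft_heads]

    # 2. One forward pass of the base model scores the prefix plus the proposals,
    #    yielding the token the base model itself would emit at each position
    verified = base_model.next_tokens(prefix, proposals)

    # 3. Accept proposals until the first disagreement; on a mismatch, fall back
    #    to the base model's token so at least one token is always produced
    accepted = []
    for proposed, actual in zip(proposals, verified):
        if proposed != actual:
            accepted.append(actual)
            break
        accepted.append(proposed)
    return prefix + accepted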

To accelerate the performance of parallel decoding for online inference, Rufus used a combination of AWS solutions: AWS Inferentia2 and AWS Trainium AI chips, Amazon Elastic Compute Cloud (Amazon EC2), and Application Load Balancer. In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.

To get the maximum efficiency of parallel decoding on AWS NeuronCores, we collaborated with the AWS Neuron team to add architectural support for parallel decoding to the NeuronX Distributed Inference (NxDI) framework for a batch size of one.

Rufus extended the base LLM with multiple decoding heads. These heads are small neural network layers trained on the base model’s learned representations to predict the next tokens in parallel. They are trained together with the original model, keeping the base model unchanged.

Because the tokens aren’t generated sequentially, they must be verified to make sure that all of the tokens fit together. To validate the tokens predicted by the draft heads, Rufus uses a tree-based attention mechanism to verify and integrate tokens. Each draft head produces several options for each position. These options are then organized into a tree-like structure to select the most promising combination. This allows multiple candidate tokens to be processed in parallel, reducing latency and increasing NeuronCore utilization. The following figure shows a sparse tree constructed using our calibration set, with a depth of four, indicating the involvement of four heads in the calculation process. Each node represents a token from a top-k prediction of a draft head, and the edges depict the connections between these nodes.

Figure: Sparse tree of candidate tokens produced by the four draft heads.

Results of using parallel decoding

By integrating parallel decoding with AWS AI chips and the NxDI framework, we doubled the speed of text generation compared to autoregressive decoding, making it an ideal solution for the high-demand environment of Prime Day. During Amazon Prime Day 2024, Rufus demonstrated the power of AWS AI chips with impressive performance metrics:

  • Two times faster generation: AWS AI chips, optimized for parallel decoding operations, doubled the token generation speed compared to traditional processors. This parallel processing capability allowed multiple future tokens to be predicted simultaneously, delivering real-time interactions for millions of customers.
  • 50% lower inference costs: The combination of purpose-built AWS AI chips and parallel decoding optimization eliminated redundant computations, cutting inference costs by half while maintaining response quality.
  • Simplified deployment: AWS AI chips efficiently powered the model’s parallel decoding heads, enabling simultaneous token prediction without the complexity of managing separate draft models. This architectural synergy simplified the deployment while delivering efficient inference at scale.
  • Seamless scalability: The combination handled peak traffic without compromising performance and response quality.

These advances not only enhanced the customer experience but also showcased the potential of the NxDI framework and the adaptability of AWS AI chips for optimizing large-scale LLM performance.

How to use parallel decoding on Trainium and Inferentia

The flexibility of NxDI combined with AWS Neuron chips makes it a powerful solution for LLM text generation in production. Whether you’re using Trainium or Inferentia for inference, NxDI provides a unified interface to implement parallel decoding optimizations. This integration reduces operational complexity and provides a straightforward path for organizations looking to deploy and scale their LLM applications efficiently.

You can explore parallel decoding techniques such as Medusa to accelerate your inference workflows on Inf2 or Trn1 instances. To get started, you’ll need a Medusa-compatible model (such as text-generation-inference/Mistral-7B-Instruct-v0.2-medusa) and a Medusa tree configuration. Enable Medusa by setting is_medusa=True, configuring your medusa_speculation_length, num_medusa_heads, and specifying your medusa_tree. When using the Hugging Face generate() API, set the assistant_model to your target model. Note that Medusa currently supports only a batch size of 1.

import json

# NeuronConfig is provided by the NxDI package; the exact import path can vary by Neuron SDK version
from neuronx_distributed_inference.models.config import NeuronConfig

def load_json_file(json_path):
    with open(json_path, "r") as f:
        return json.load(f)

# Load the Medusa tree that defines the candidate token paths for the draft heads
medusa_tree = load_json_file("medusa_mc_sim_7b_63.json")

neuron_config = NeuronConfig(
    is_medusa=True,                # enable Medusa-style parallel decoding
    medusa_speculation_length=64,  # number of speculated tokens per step
    num_medusa_heads=4,            # draft heads attached to the base model
    medusa_tree=medusa_tree,
)

Conclusion

Prime Day is a testament to the power of innovation to overcome technical challenges. By using AWS AI chips, Rufus not only met the stringent demands of Prime Day but also set a new standard for LLM efficiency. As LLMs continue to evolve, frameworks such as NxDI will play a crucial role in making them more accessible, scalable, and cost-effective. We’re excited to see how the community will build on the NxDI foundation and AWS AI chips to unlock new possibilities for LLM applications. Try it out today and experience the difference for yourself!

Acknowledgments

We extend our gratitude to the AWS Annapurna team responsible for AWS AI chips and framework development. Special thanks to the researchers and engineers whose contributions made this achievement possible. The improvements in latency, throughput, and cost efficiency achieved with parallel decoding compared to autoregressive decoding have set a new benchmark for LLM deployments at scale.


About the authors

Shruti Dubey is a Software Engineer on Amazon’s Core Search Team, where she optimizes LLM inference systems to make AI faster and more scalable. She’s passionate about Generative AI and loves turning cutting-edge research into real-world impact. Outside of work, you’ll find her running, reading, or trying to convince her dog that she’s the boss.

Shivangi Agarwal is an Applied Scientist on Amazon’s Prime Video team, where she focuses on optimizing LLM inference and developing intelligent ranking systems for Prime Videos using query-level signals. She’s driven by a passion for building efficient, scalable AI that delivers real-world impact. When she’s not working, you’ll likely find her catching a good movie, discovering new places, or keeping up with her adventurous 3-year-old kid.

Sukhdeep Singh Kharbanda is an Applied Science Manager at Amazon Core Search. In his current role, Sukhdeep is leading Amazon Inference team to build GenAI inference optimization solutions and inference system at scale for fast inference at low cost. Outside work, he enjoys playing with his kid and cooking different cuisines.

Rahul Goutam is an Applied Science Manager at Amazon Core Search, where he leads teams of scientists and engineers to build scalable AI solutions that power flexible and intuitive shopping experiences. When he’s off the clock, he enjoys hiking a trail or skiing down one.

Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing systems to reduce latency for ML inference. Outside work, he is exploring using Generative AI for building food recipes.

James Park is a Principal Machine Learning Specialist Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, experiences, and staying up to date with the latest technology trends.

Read More

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation

Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2…Apple Machine Learning Research

Interleaved Reasoning for Large Language Models via Reinforcement Learning

Long chain-of-thought (CoT) significantly enhances large language models’ (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps…Apple Machine Learning Research