Tool choice with Amazon Nova models

Tool choice with Amazon Nova models

In many generative AI applications, a large language model (LLM) like Amazon Nova is used to respond to a user query based on the model’s own knowledge or context that it is provided. However, as use cases have matured, the ability for a model to have access to tools or structures that would be inherently outside of the model’s frame of reference has become paramount. This could be APIs, code functions, or schemas and structures required by your end application. This capability has developed into what is referred to as tool use or function calling.

To add fine-grained control to how tools are used, we have released a feature for tool choice for Amazon Nova models. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place.

In this post, we discuss tool use and the new tool choice feature, with example use cases.

Tool use with Amazon Nova

To illustrate the concept of tool use, we can imagine a situation where we provide Amazon Nova access to a few different tools, such as a calculator or a weather API. Based on the user’s query, Amazon Nova will select the appropriate tool and tell you how to use it. For example, if a user asks “What is the weather in Seattle?” Amazon Nova will use the weather tool.

The following diagram illustrates an example workflow between an Amazon Nova model, its available tools, and related external resources.

Tool use at the core is the selection of the tool and its parameters. The responsibility to execute the external functionality is left to application or developer. After the tool is executed by the application, you can return the results to the model for the generation of the final response.

Let’s explore some examples in more detail. The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a weather API, and returning the response to the user.

The following diagram illustrates the workflow of an Amazon Nova model using a function call to access a calculator tool.

Tool choice with Amazon Nova

The toolChoice API parameter allows you to control when a tool is called. There are three supported options for this parameter:

  • Any – With tool choice Any, the model will select at least one of the available tools each time:
    {
       "toolChoice": {
            "any": {}
        }
    }

  • Tool – With tool choice Tool, the model will always use the requested tool:
    {
       "toolChoice": {
            "tool": {
                "name": "name_of_tool"
            }
        }
    }

  • Auto – Tool choice Auto is the default behavior and will leave the tool selection completely up to the model:
    {
       "toolChoice": {
            "auto": {}
        }
    }

A popular tactic to improve the reasoning capabilities of a model is to use chain of thought. When using the tool choice of auto, Amazon Nova will use chain of thought and the response of the model will include both the reasoning and the tool that was selected.

This behavior will differ depending on the use case. When tool or any are selected as the tool choice, Amazon Nova will output only the tools and not output chain of thought.

Use cases

In this section, we explore different use cases for tool choice.

Structured output/JSON mode

In certain scenarios, you might want Amazon Nova to use a specific tool to answer the user’s question, even if Amazon Nova believes it can provide a response without the use of a tool. A common use case for this approach is enforcing structured output/JSON mode. It’s often critical to have LLMs return structured output, because this enables downstream use cases to more effectively consume and process the generated outputs. In these instances, the tools employed don’t necessarily need to be client-side functions—they can be used whenever the model is required to return JSON output adhering to a predefined schema, thereby compelling Amazon Nova to use the specified tool.

When using tools for enforcing structured output, you provide a single tool with a descriptive JSON inputSchema. You specify the tool with {"tool" : {"name" : "Your tool name"}}. The model will pass the input to the tool, so the name of the tool and its description should be from the model’s perspective.

For example, consider a food website. When provided with a dish description, the website can extract the recipe details, such as cooking time, ingredients, dish name, and difficulty level, in order to facilitate user search and filtering capabilities. See the following example code:

import boto3
import json

tool_config = {
    "toolChoice": {
        "name": { "tool" : "extract_recipe"}
    },
    "tools": [
        {
            "toolSpec": {
                "name": "extract_recipe",
                "description": "Extract recipe for cooking instructions",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "recipe": {
                                "type": "object",
                                "properties": {
                                    "name": {
                                        "type": "string",
                                        "description": "Name of the recipe"
                                    },
                                    "description": {
                                        "type": "string",
                                        "description": "Brief description of the dish"
                                    },
                                    "prep_time": {
                                        "type": "integer",
                                        "description": "Preparation time in minutes"
                                    },
                                    "cook_time": {
                                        "type": "integer",
                                        "description": "Cooking time in minutes"
                                    },
                                    "servings": {
                                        "type": "integer",
                                        "description": "Number of servings"
                                    },
                                    "difficulty": {
                                        "type": "string",
                                        "enum": ["easy", "medium", "hard"],
                                        "description": "Difficulty level of the recipe"
                                    },
                                    "ingredients": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "item": {
                                                    "type": "string",
                                                    "description": "Name of ingredient"
                                                },
                                                "amount": {
                                                    "type": "number",
                                                    "description": "Quantity of ingredient"
                                                },
                                                "unit": {
                                                    "type": "string",
                                                    "description": "Unit of measurement"
                                                }
                                            },
                                            "required": ["item", "amount", "unit"]
                                        }
                                    },
                                    "instructions": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Step-by-step cooking instructions"
                                        }
                                    },
                                    "tags": {
                                        "type": "array",
                                        "items": {
                                            "type": "string",
                                            "description": "Categories or labels for the recipe"
                                        }
                                    }
                                },
                                "required": ["name", "ingredients", "instructions"]
                            }
                        },
                        "required": ["recipe"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)
print(json.dumps(response['output']['message']['content'][0][], indent=2))

We can provide a detailed description of a dish as text input:

Legend has it that this decadent chocolate lava cake was born out of a baking mistake in New York's Any Kitchen back in 1987, when chef John Doe pulled a chocolate sponge cake out of the oven too early, only to discover that the dessert world would never be the same. Today I'm sharing my foolproof version, refined over countless dinner parties. Picture a delicate chocolate cake that, when pierced with a fork, releases a stream of warm, velvety chocolate sauce – it's pure theater at the table. While it looks like a restaurant-worthy masterpiece, the beauty lies in its simplicity: just six ingredients (good quality dark chocolate, unsalted butter, eggs, sugar, flour, and a pinch of salt) transform into individual cakes in under 15 minutes. The secret? Precise timing is everything. Pull them from the oven a minute too late, and you'll miss that magical molten center; too early, and they'll be raw. But hit that sweet spot at exactly 12 minutes, when the edges are set but the center still wobbles slightly, and you've achieved dessert perfection. I love serving these straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream that slowly melts into the warm cake. The contrast of temperatures and textures – warm and cold, crisp and gooey – makes this simple dessert absolutely unforgettable.

We can force Amazon Nova to use the tool extract_recipe, which will generate a structured JSON output that adheres to the predefined schema provided as the tool input schema:

 {
  "toolUseId": "tooluse_4YT_DYwGQlicsNYMbWFGPA",
  "name": "extract_recipe",
  "input": {
    "recipe": {
      "name": "Decadent Chocolate Lava Cake",
      "description": "A delicate chocolate cake that releases a stream of warm, velvety chocolate sauce when pierced with a fork. It's pure theater at the table.",
      "difficulty": "medium",
      "ingredients": [
        {
          "item": "good quality dark chocolate",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "unsalted butter",
          "amount": 125,
          "unit": "g"
        },
        {
          "item": "eggs",
          "amount": 4,
          "unit": ""
        },
        {
          "item": "sugar",
          "amount": 100,
          "unit": "g"
        },
        {
          "item": "flour",
          "amount": 50,
          "unit": "g"
        },
        {
          "item": "salt",
          "amount": 0.5,
          "unit": "pinch"
        }
      ],
      "instructions": [
        "Preheat the oven to 200u00b0C (400u00b0F).",
        "Melt the chocolate and butter together in a heatproof bowl over a saucepan of simmering water.",
        "In a separate bowl, whisk the eggs and sugar until pale and creamy.",
        "Fold the melted chocolate mixture into the egg and sugar mixture.",
        "Sift the flour and salt into the mixture and gently fold until just combined.",
        "Divide the mixture among six ramekins and bake for 12 minutes.",
        "Serve straight from the oven, dusted with powdered sugar and topped with a small scoop of vanilla bean ice cream."
      ],
      "prep_time": 10,
      "cook_time": 12,
      "servings": 6,
      "tags": [
        "dessert",
        "chocolate",
        "cake"
      ]
    }
  }
}

API generation

Another common scenario is to require Amazon Nova to select a tool from the available options no matter the context of the user query. One example of this is with API endpoint selection. In this situation, we don’t know the specific tool to use, and we allow the model to choose between the ones available.

With the tool choice of any, you can make sure that the model will always use at least one of the available tools. Because of this, we provide a tool that can be used for when an API is not relevant. Another example would be to provide a tool that allows clarifying questions.

In this example, we provide the model with two different APIs, and an unsupported API tool that it will select based on the user query:

import boto3
import json

tool_config = {
    "toolChoice": {
        "any": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "get_all_products",
                "description": "API to retrieve multiple products with filtering and pagination options",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "sort_by": {
                                "type": "string",
                                "description": "Field to sort results by. One of: price, name, created_date, popularity",
                                "default": "created_date"
                            },
                            "sort_order": {
                                "type": "string",
                                "description": "Order of sorting (ascending or descending). One of: asc, desc",
                                "default": "desc"
                            },
                        },
                        "required": []
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "get_products_by_id",
                "description": "API to retrieve retail products based on search criteria",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "product_id": {
                                "type": "string",
                                "description": "Unique identifier of the product"
                            },
                        },
                        "required": ["product_id"]
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "unsupported_api",
                "description": "API to use when the user query does not relate to the other available APIs",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "reasoning": {
                                "type": "string",
                                "description": "The reasoning for why the user query did not have a valid API available"
                            },
                        },
                        "required": ["reasoning"]
                    }
                }
            }
        }
    ]
}


messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)

print(json.dumps(response['output']['message']['content'][0], indent=2))

A user input of “Can you get all of the available products?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_YCNbT0GwSAyjIYOuWnDhkw",
    "name": "get_all_products",
    "input": {}
  }
}

Whereas “Can you get my most recent orders?” would output the following:

{
  "toolUse": {
    "toolUseId": "tooluse_jpiZnrVcQDS1sAa-qPwIQw",
    "name": "unsupported_api",
    "input": {
      "reasoning": "The available tools do not support retrieving user orders. The user's request is for personal order information, which is not covered by the provided APIs."
    }
  }
}

Chat with search

The final option for tool choice is auto. This is the default behavior, so it is consistent with providing no tool choice at all.

Using this tool choice will allow the option of tool use or just text output. If the model selects a tool, there will be a tool block and text block. If the model responds with no tool, only a text block is returned. In the following example, we want to allow the model to respond to the user or call a tool if necessary:

import boto3
import json

tool_config = {
    "toolChoice": {
        "auto": {}
    },
    "tools": [
         {
            "toolSpec": {
                "name": "search",
                "description": "API that provides access to the internet",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "Query to search by",
                            },
                        },
                        "required": ["query"]
                    }
                }
            }
        }
    ]
}

messages = [{
    "role": "user",
    "content": [
        {"text": input_text},
    ]
}]

system = [{
    "text": "ou are a helpful chatbot. You can use a tool if necessary or respond to the user query"
}]

inf_params = {"topP": 1, "temperature": 1}

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields= {"inferenceConfig": { "topK": 1 } }
)


if (response["stopReason"] == "tool_use"):
    tool_use = next(
        block["toolUse"]
        for block in response["output"]["message"]["content"]
            if "toolUse" in block
    )
   print(json.dumps(tool_use, indent=2))
 else:
    pattern = r'<thinking>.*?</thinking>\n\n|<thinking>.*?</thinking>'
    text_response = response["output"]["message"]["content"][0]["text"]
    stripped_text = re.sub(pattern, '', text_response, flags=re.DOTALL)
    
    print(stripped_text)

A user input of “What is the weather in San Francisco?” would result in a tool call:

{
  "toolUseId": "tooluse_IwtBnbuuSoynn1qFiGtmHA",
  "name": "search",
  "input": {
    "query": "what is the weather in san francisco"
  }
}

Whereas asking the model a direct question like “How many months are in a year?” would respond with a text response to the user:

There are 12 months in a year.

Considerations

There are a few best practices that are required for tool calling with Nova models. The first is to use greedy decoding parameters. With Amazon Nova models, that requires setting a temperature, top p, and top k of 1. You can refer to the previous code examples for how to set these. Using greedy decoding parameters forces the models to produce deterministic responses and improves the success rate of tool calling.

The second consideration is the JSON schema you are using for the tool consideration. At the time of writing, Amazon Nova models support a limited subset of JSON schemas, so they might not be picked up as expected by the model. Common fields would be $def and $ref fields. Make sure that your schema has the following top-level fields set: type (must be object), properties, and required.

Lastly, for the most impact on the success of tool calling, you should optimize your tool configurations. Descriptions and names should be very clear. If there are nuances to when one tool should be called over the other, make sure to have that concisely included in the tool descriptions.

Conclusion

Using tool choice in tool calling workflows is a scalable way to control how a model invokes tools. Instead of relying on prompt engineering, tool choice forces the model to adhere to the settings in place. However, there are complexities to tool calling; for more information, refer to Tool use (function calling) with Amazon Nova, Tool calling systems, and Troubleshooting tool calls.

Explore how Amazon Nova models can enhance your generative AI use cases today.


About the Authors

Jean Farmer is a Generative AI Solutions Architect on the Amazon Artificial General Intelligence (AGI) team, specializing in agentic applications. Based in Seattle, Washington, she works at the intersection of autonomous AI systems and practical business solutions, helping to shape the future of AGI at Amazon.

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Lulu Wong is an AI UX designer on the Amazon Artificial General Intelligence (AGI) team. With a background in computer science, learning design, and user experience, she bridges the technical and user experience domains by shaping how AI systems interact with humans, refining model input-output behaviors, and creating resources to make AI products more accessible to users.

Read More

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Integrate generative AI capabilities into Microsoft Office using Amazon Bedrock

Generative AI is rapidly transforming the modern workplace, offering unprecedented capabilities that augment how we interact with text and data. At Amazon Web Services (AWS), we recognize that many of our customers rely on the familiar Microsoft Office suite of applications, including Word, Excel, and Outlook, as the backbone of their daily workflows. In this blog post, we showcase a powerful solution that seamlessly integrates AWS generative AI capabilities in the form of large language models (LLMs) based on Amazon Bedrock into the Office experience. By harnessing the latest advancements in generative AI, we empower employees to unlock new levels of efficiency and creativity within the tools they already use every day. Whether it’s drafting compelling text, analyzing complex datasets, or gaining more in-depth insights from information, integrating generative AI with Office suite transforms the way teams approach their essential work. Join us as we explore how your organization can leverage this transformative technology to drive innovation and boost employee productivity.

Solution overview


Figure 1: Solution architecture overview

The solution architecture in Figure 1 shows how Office applications interact with a serverless backend hosted on the AWS Cloud through an Add-In. This architecture allows users to leverage Amazon Bedrock’s generative AI capabilities directly from within the Office suite, enabling enhanced productivity and insights within their existing workflows.

Components deep-dive

Office Add-ins

Office Add-ins allow extending Office products with custom extensions built on standard web technologies. Using AWS, organizations can host and serve Office Add-ins for users worldwide with minimal infrastructure overhead.

An Office Add-in is composed of two elements:

The code snippet below demonstrates part of a function that could run whenever a user invokes the plugin, performing the following actions:

  1. Initiate a request to the generative AI backend, providing the user prompt and available context in the request body
  2. Integrate the results from the backend response into the Word document using Microsoft’s Office JavaScript APIs. Note that these APIs use objects as namespaces, alleviating the need for explicit imports. Instead, we use the globally available namespaces, such as Word, to directly access relevant APIs, as shown in following example snippet.
// Initiate backend request (optional context)
const response = await sendPrompt({ user_message: prompt, context: selectedContext });

// Modify Word content with responses from the Backend
await Word.run(async (context) => {
  let documentBody;

  // Target for the document modifications
  if (response.location === 'Replace') {
    documentBody = context.document.getSelection(); // active text selection
  } else {
    documentBody = context.document.body; // entire document body
  }

  // Markdown support for preserving original content layout
  // Dependencies used: React markdown
  const content = renderToString(<Markdown>{ response.content } < /Markdown>);
  const operation = documentBody.insertHtml(content, response.location);

  // set properties for the output content (font, size, color, etc.)
  operation.font.set({ name: 'Arial' });

  // flush changes to the Word document
  await context.sync();
});

Generative AI backend infrastructure

The AWS Cloud backend consists of three components:

  1. Amazon API Gateway acts as an entry point, receiving requests from the Office applications’ Add-in. API Gateway supports multiple mechanisms for controlling and managing access to an API.
  2. AWS Lambda handles the REST API integration, processing the requests and invoking the appropriate AWS services.
  3. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With Bedrock’s serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using the AWS tools without having to manage infrastructure.

LLM prompting

Amazon Bedrock allows you to choose from a wide selection of foundation models for prompting. Here, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock for completions. The system prompt we used in this example is as follows:

You are an office assistant helping humans to write text for their documents.

[When preparing the answer, take into account the following text: <text>{context}</text>]
Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
Then, develop your answer in the user’s language within the <response></response> tags.

In the prompt, we first give the LLM a persona, indicating that it is an office assistant helping humans. The second, optional line contains text that has been selected by the user in the document and is provided as context to the LLM. We specifically instruct the LLM to first mimic a step-by-step thought process for arriving at the answer (chain-of-thought reasoning), an effective measure of prompt-engineering to improve the output quality. Next, we instruct it to detect the user’s language from their question so we can later refer to it. Finally, we instruct the LLM to develop its answer using the previously detected user language within response tags, which are used as the final response. While here, we use the default configuration for inference parameters such as temperature, that can quickly be configured with every LLM prompt. The user input is then added as a user message to the prompt and sent via the Amazon Bedrock Messages API to the LLM.

Implementation details and demo setup in an AWS account

As a prerequisite, we need to make sure that we are working in an AWS Region with Amazon Bedrock support for the foundation model (here, we use Anthropic’s Claude 3.5 Sonnet). Also, access to the required relevant Amazon Bedrock foundation models needs to be added. For this demo setup, we describe the manual steps taken in the AWS console. If required, this setup can also be defined in Infrastructure as Code.

To set up the integration, follow these steps:

  1. Create an AWS Lambda function with Python runtime and below code to be the backend for the API. Make sure that we have Powertools for AWS Lambda (Python) available in our runtime, for example, by attaching aLambda layer to our function. Make sure that the Lambda function’s IAM role provides access to the required FM, for example:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
                ]
            }
        ]
    }
    

    The following code block shows a sample implementation for the REST API Lambda integration based on a Powertools for AWS Lambda (Python) REST API event handler:

    import json
    import re
    from typing import Optional
    
    import boto3
    from aws_lambda_powertools import Logger
    from aws_lambda_powertools.event_handler import APIGatewayRestResolver, CORSConfig
    from aws_lambda_powertools.logging import correlation_paths
    from aws_lambda_powertools.utilities.typing import LambdaContext
    from pydantic import BaseModel
    
    logger = Logger()
    app = APIGatewayRestResolver(
        enable_validation=True,
        cors=CORSConfig(allow_origin="http://localhost:3000"),  # for testing purposes
    )
    
    bedrock_runtime_client = boto3.client("bedrock-runtime")
    
    
    SYSTEM_PROMPT = """
    You are an office assistant helping humans to write text for their documents.
    
    {context}
    Before answering the question, think through it step-by-step within the <thinking></thinking> tags.
    Then, detect the user's language from their question and store it in the form of an ISO 639-1 code within the <user_language></user_language> tags.
    Then, develop your answer in the user's language in markdown format within the <response></response> tags.
    """
    
    class Query(BaseModel):
        user_message: str  # required
        context: Optional[str] = None  # optional
        max_tokens: int = 1000  # default value
        model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # default value
    
    def wrap_context(context: Optional[str]) -> str:
        if context is None:
            return ""
        else:
            return f"When preparing the answer take into account the following text: <text>{context}</text>"
    
    def parse_completion(completion: str) -> dict:
        response = {"completion": completion}
        try:
            tags = ["thinking", "user_language", "response"]
            tag_matches = re.finditer(
                f"<(?P<tag>{'|'.join(tags)})>(?P<content>.*?)</(?P=tag)>",
                completion,
                re.MULTILINE | re.DOTALL,
            )
            for match in tag_matches:
                response[match.group("tag")] = match.group("content").strip()
        except Exception:
            logger.exception("Unable to parse LLM response")
            response["response"] = completion
    
        return response
    
    
    @app.post("/query")
    def query(query: Query):
        bedrock_response = bedrock_runtime_client.invoke_model(
            modelId=query.model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": query.max_tokens,
                    "system": SYSTEM_PROMPT.format(context=wrap_context(query.context)),
                    "messages": [{"role": "user", "content": query.user_message}],
                }
            ),
        )
        response_body = json.loads(bedrock_response.get("body").read())
        logger.info("Received LLM response", response_body=response_body)
        response_text = response_body.get("content", [{}])[0].get(
            "text", "LLM did not respond with text"
        )
        return parse_completion(response_text)
    
    @logger.inject_lambda_context(correlation_id_path=correlation_paths.API_GATEWAY_REST)
    def lambda_handler(event: dict, context: LambdaContext) -> dict:
        return app.resolve(event, context)
    

  2. Create an API Gateway REST API with a Lambda proxy integration to expose the Lambda function via a REST API. You can follow this tutorial for creating a REST API for the Lambda function by using the API Gateway console. By creating a Lambda proxy integration with a proxy resource, we can route requests to the resources to the Lambda function. Follow the tutorial to deploy the API and take note of the API’s invoke URL. Make sure to configure adequate access control for the REST API.

We can now invoke and test our function via the API’s invoke URL. The following example uses curl to send a request (make sure to replace all placeholders in curly braces as required), and the response generated by the LLM:

$ curl --header "Authorization: {token}" 
     --header "Content-Type: application/json" 
     --request POST 
     --data '{"user_message": "Write a 2 sentence summary about AWS."}' 
     https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/query | jq .
{
 "completion": "<thinking>nTo summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.n</thinking>nn<user_language>en</user_language>nn<response>nnAWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management.nn</response>",
 "thinking": "To summarize AWS in 2 sentences:n1. AWS (Amazon Web Services) is a comprehensive cloud computing platform offering a wide range of services like computing power, database storage, content delivery, and more.n2. It allows organizations and individuals to access these services over the internet on a pay-as-you-go basis without needing to invest in on-premises infrastructure.",
 "user_language": "en",
 "response": "AWS (Amazon Web Services) is a cloud computing platform that offers a broad set of global services including computing, storage, databases, analytics, machine learning, and more. It enables companies of all sizes to access these services over the internet on a pay-as-you-go pricing model, eliminating the need for upfront capital expenditure or on-premises infrastructure management."
} 

If required, the created resources can be cleaned up by 1) deleting the API Gateway REST API, and 2) deleting the REST API Lambda function and associated IAM role.

Example use cases

To create an interactive experience, the Office Add-in integrates with the cloud back-end that implements conversational capabilities with support for additional context retrieved from the Office JavaScript API.

Next, we demonstrate two different use cases supported by the proposed solution, text generation and text refinement.

Text generation


Figure 2: Text generation use-case demo

In the demo in Figure 2, we show how the plug-in is prompting the LLM to produce a text from scratch. The user enters their query with some context into the Add-In text input area. Upon sending, the backend will prompt the LLM to generate respective text, and return it back to the frontend. From the Add-in, it is inserted into the Word document at the cursor position using the Office JavaScript API.

Text refinement


Figure 3: Text refinement use-case demo

In Figure 3, the user highlighted a text segment in the work area and entered a prompt into the Add-In text input area to rephrase the text segment. Again, the user input and highlighted text are processed by the backend and returned to the Add-In, thereby replacing the previously highlighted text.

Conclusion

This blog post showcases how the transformative power of generative AI can be incorporated into Office processes. We described an end-to-end sample of integrating Office products with an Add-in for text generation and manipulation with the power of LLMs. In our example, we used managed LLMs on Amazon Bedrock for text generation. The backend is hosted as a fully serverless application on the AWS cloud.

Text generation with LLMs in Office supports employees by streamlining their writing process and boosting productivity. Employees can leverage the power of generative AI to generate and edit high-quality content quickly, freeing up time for other tasks. Additionally, the integration with a familiar tool like Word provides a seamless user experience, minimizing disruptions to existing workflows.

To learn more about boosting productivity, building differentiated experiences, and innovating faster with AWS visit the Generative AI on AWS page.


About the Authors

Martin Maritsch is a Generative AI Architect at AWS ProServe focusing on Generative AI and MLOps. He helps enterprise customers to achieve business outcomes by unlocking the full potential of AI/ML services on the AWS Cloud.

Miguel Pestana is a Cloud Application Architect in the AWS Professional Services team with over 4 years of experience in the automotive industry delivering cloud native solutions. Outside of work Miguel enjoys spending its days at the beach or with a padel racket in one hand and a glass of sangria on the other.

Carlos Antonio Perea Gomez is a Builder with AWS Professional Services. He enables customers to become AWSome during their journey to the cloud. When not up in the cloud he enjoys scuba diving deep in the waters.

Read More

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

From innovation to impact: How AWS and NVIDIA enable real-world generative AI success

As we gather for NVIDIA GTC, organizations of all sizes are at a pivotal moment in their AI journey. The question is no longer whether to adopt generative AI, but how to move from promising pilots to production-ready systems that deliver real business value. The organizations that figure this out first will have a significant competitive advantage—and we’re already seeing compelling examples of what’s possible.

Consider Hippocratic AI’s work to develop AI-powered clinical assistants to support healthcare teams as doctors, nurses, and other clinicians face unprecedented levels of burnout. During a recent hurricane in Florida, their system called 100,000 patients in a day to check on medications and provide preventative healthcare guidance–the kind of coordinated outreach that would be nearly impossible to achieve manually. They aren’t just building another chatbot; they are reimagining healthcare delivery at scale.

Production-ready AI like this requires more than just cutting-edge models or powerful GPUs. In my decade working with customers’ data journeys, I’ve seen that an organization’s most valuable asset is its domain-specific data and expertise. And now leading our data and AI go-to-market, I hear customers consistently emphasize what they need to transform their domain advantage into AI success: infrastructure and services they can trust—with performance, cost-efficiency, security, and flexibility—all delivered at scale. When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale—a challenge that AWS has consistently solved for customers. As the world’s most comprehensive and broadly adopted cloud, our partnership with NVIDIA’s pioneering accelerated computing platform for generative AI amplifies this capability. It’s inspiring to see how, together, we’re enabling customers across industries to confidently move AI into production.

In this post, I will share some of these customers’ remarkable journeys, offering practical insights for any organization looking to harness the power of generative AI.

Transforming content creation with generative AI

Content creation represents one of the most visible and immediate applications of generative AI today. Adobe, a pioneer that has shaped creative workflows for over four decades, has moved with remarkable speed to integrate generative AI across its flagship products, helping millions of creators work in entirely new ways.

Adobe’s approach to generative AI infrastructure exemplifies what their VP of Generative AI, Alexandru Costin, calls an “AI superhighway”—a sophisticated technical foundation that enables rapid iteration of AI models and seamless integration into their creative applications. The success of their Firefly family of generative AI models, integrated across flagship products like Photoshop, demonstrates the power of this approach. For their AI training and inference workloads, Adobe uses NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) P5en (NVIDIA H200 GPUs), P5 (NVIDIA H100 GPUs), P4de (NVIDIA A100 GPUs), and G5 (NVIDIA A10G GPUs) instances. They also use NVIDIA software such as NVIDIA TensorRT and NVIDIA Triton Inference Server for faster, scalable inference. Adobe needed maximum flexibility to build their AI infrastructure, and AWS provided the complete stack of services needed—from Amazon FSx for Lustre for high-performance storage, to Amazon Elastic Kubernetes Service (Amazon EKS) for container orchestration, to Elastic Fabric Adapter (EFA) for high-throughput networking—to create a production environment that could reliably serve millions of creative professionals.

Key takeaway

If you’re building and managing your own AI pipelines, Adobe’s success highlights a critical insight: although GPU-accelerated compute often gets the spotlight in AI infrastructure discussions, what’s equally important is the NVIDIA software stack along with the foundation of orchestration, storage, and networking services that enable production-ready AI. Their results speak for themselves—Adobe achieved a 20-fold scale-up in model training while maintaining the enterprise-grade performance and reliability their customers expect.

Pioneering new AI applications from the ground up

Throughout my career, I’ve been particularly energized by startups that take on audacious challenges—those that aren’t just building incremental improvements but are fundamentally reimagining how things work. Perplexity exemplifies this spirit. They’ve taken on a technology most of us now take for granted: search. It’s the kind of ambitious mission that excites me, not just because of its bold vision, but because of the incredible technical challenges it presents. When you’re processing 340 million queries monthly and serving over 1,500 organizations, transforming search isn’t just about having great ideas—it’s about building robust, scalable systems that can deliver consistent performance in production.

Perplexity’s innovative approach earned them membership in both AWS Activate and NVIDIA Inception—flagship programs designed to accelerate startup innovation and success. These programs provided them with the resources, technical guidance, and support needed to build at scale. They were one of the early adopters of Amazon SageMaker HyperPod, and continue to use its distributed training capabilities to accelerate model training time by up to 40%. They use a highly optimized inference stack built with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to serve both their search application and pplx-api, their public API service that gives developers access to their proprietary models. The results speak for themselves—their inference stack achieves up to 3.1 times lower latency compared to other platforms. Both their training and inference workloads run on NVIDIA GPU-accelerated EC2 P5 instances, delivering the performance and reliability needed to operate at scale. To give their users even more flexibility, Perplexity complements their own models with services such as Amazon Bedrock, and provides access to additional state-of-the-art models in their API. Amazon Bedrock offers ease of use and reliability, which are crucial for their team—as they note, it allows them to effectively maintain the reliability and latency their product demands.

What I find particularly compelling about Perplexity’s journey is their commitment to technical excellence, exemplified by their work optimizing GPU memory transfer with EFA networking. The team achieved 97.1% of the theoretical maximum bandwidth of 3200 Gbps and open sourced their innovations, enabling other organizations to benefit from their learnings.

For those interested in the technical details, I encourage you to read their fascinating post Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS Sagemaker Hyperpod.

Key takeaway

For organizations with complex AI workloads and specific performance requirements, Perplexity’s approach offers a valuable lesson. Sometimes, the path to production-ready AI isn’t about choosing between self-hosted infrastructure and managed services—it’s about strategically combining both. This hybrid strategy can deliver both exceptional performance (evidenced by Perplexity’s 3.1 times lower latency) and the flexibility to evolve.

Transforming enterprise workflows with AI

Enterprise workflows represent the backbone of business operations—and they’re a crucial proving ground for AI’s ability to deliver immediate business value. ServiceNow, which terms itself the AI platform for business transformation, is rapidly integrating AI to reimagine core business processes at scale.

ServiceNow’s innovative AI solutions showcase their vision for enterprise-specific AI optimization. As Srinivas Sunkara, ServiceNow’s Vice President, explains, their approach focuses on deep AI integration with technology workflows, core business processes, and CRM systems—areas where traditional large language models (LLMs) often lack domain-specific knowledge. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. Their architecture combines high-performance FSx for Lustre storage with NVIDIA GPU clusters for training, and NVIDIA Triton Inference Server handles production deployment. This robust technology platform allows ServiceNow to focus on domain-specific AI development and customer value rather than infrastructure management.

Key takeaway

ServiceNow offers an important lesson about enterprise AI adoption: while foundation models (FMs) provide powerful general capabilities, the greatest business value often comes from optimizing models for specific enterprise use cases and workflows. In many cases, it’s precisely this deliberate specialization that transforms AI from an interesting technology into a true business accelerator.

Scaling AI across enterprise applications

Cisco’s Webex team’s journey with generative AI exemplifies how large organizations can methodically transform their applications while maintaining enterprise standards for reliability and efficiency. With a comprehensive suite of telecommunications applications serving customers globally, they needed an approach that would allow them to incorporate LLMs across their portfolio—from AI assistants to speech recognition—without compromising performance or increasing operational complexity.

The Webex team’s key insight was to separate their models from their applications. Previously, they had embedded AI models into the container images for applications running on Amazon EKS, but as their models grew in sophistication and size, this approach became increasingly inefficient. By migrating their LLMs to Amazon SageMaker AI and using NVIDIA Triton Inference Server, they created a clean architectural break between their relatively lean applications and the underlying models, which require more substantial compute resources. This separation allows applications and models to scale independently, significantly reducing development cycle time and increasing resource utilization. The team deployed dozens of models on SageMaker AI endpoints, using Triton Inference Server’s model concurrency capabilities to scale globally across AWS data centers.

The results validate Cisco’s methodical approach to AI transformation. By separating applications from models, their development teams can now fix bugs, perform tests, and add features to applications much faster, without having to manage large models in their workstation memory. The architecture also enables significant cost optimization—applications remain available during off-peak hours for reliability, and model endpoints can scale down when not needed, all without impacting application performance. Looking ahead, the team is evaluating Amazon Bedrock to further improve their price-performance, demonstrating how thoughtful architecture decisions create a foundation for continuous optimization.

Key takeaway

For enterprises with large application portfolios looking to integrate AI at scale, Cisco’s methodical approach offers an important lesson: separating LLMs from applications creates a cleaner architectural boundary that improves both development velocity and cost optimization. By treating models and applications as independent components, Cisco significantly improved development cycle time while reducing costs through more efficient resource utilization.

Building mission-critical AI for healthcare

Earlier, we highlighted how Hippocratic AI reached 100,000 patients during a crisis. Behind this achievement lies a story of rigorous engineering for safety and reliability—essential in healthcare where stakes are extraordinarily high.

Hippocratic AI’s approach to this challenge is both innovative and rigorous. They’ve developed what they call a “constellation architecture”—a sophisticated system of over 20 specialized models working in concert, each focused on specific safety aspects like prescription adherence, lab analysis, and over-the-counter medication guidance. This distributed approach to safety means they have to train multiple models, requiring management of significant computational resources. That’s why they use SageMaker HyperPod for their training infrastructure, using Amazon FSx and Amazon Simple Storage Service (Amazon S3) for high-speed storage access to NVIDIA GPUs, while Grafana and Prometheus provide the comprehensive monitoring needed to provide optimal GPU utilization. They build upon NVIDIA’s low-latency inference stack, and are enhancing conversational AI capabilities using NVIDIA Riva models for speech recognition and text-to-speech translation, and are also using NVIDIA NIM microservices to deploy these models. Given the sensitive nature of healthcare data and HIPAA compliance requirements, they’ve implemented a sophisticated multi-account, multi-cluster strategy on AWS—running production inference workloads with patient data on completely separate accounts and clusters from their development and training environments. This careful attention to both security and performance allows them to handle thousands of patient interactions while maintaining precise control over clinical safety and accuracy.

The impact of Hippocratic AI’s work extends far beyond technical achievements. Their AI-powered clinical assistants address critical healthcare workforce burnout by handling burdensome administrative tasks—from pre-operative preparation to post-discharge follow-ups. For example, during weather emergencies, their system can rapidly assess heat risks and coordinate transport for vulnerable patients—the kind of comprehensive care that would be too burdensome and resource-intensive to coordinate manually at scale.

Key takeaway

For organizations building AI solutions for complex, regulated, and high-stakes environments, Hippocratic AI’s constellation architecture reinforces what we’ve consistently emphasized: there’s rarely a one-size-fits-all model for every use case. Just as Amazon Bedrock offers a choice of models to meet diverse needs, Hippocratic AI’s approach of combining over 20 specialized models—each focused on specific safety aspects—demonstrates how a thoughtfully designed ensemble can achieve both precision and scale.

Conclusion

As the technology partners enabling these and countless other customer innovations, AWS and NVIDIA’s long-standing collaboration continues to evolve to meet the demands of the generative AI era. Our partnership, which began over 14 years ago with the world’s first GPU cloud instance, has grown to offer the industry’s widest range of NVIDIA accelerated computing solutions and software services for optimizing AI deployments. Through initiatives like Project Ceiba—one of the world’s fastest AI supercomputers hosted exclusively on AWS using NVIDIA DGX Cloud for NVIDIA’s own research and development use—we continue to push the boundaries of what’s possible.

As all the examples we’ve covered illustrate, it isn’t just about the technology we build together—it’s how organizations of all sizes are using these capabilities to transform their industries and create new possibilities. These stories ultimately reveal something more fundamental: when we make powerful AI capabilities accessible and reliable, people find remarkable ways to use them to solve meaningful problems. That’s the true promise of our partnership with NVIDIA—enabling innovators to create positive change at scale. I’m excited to continue inventing and partnering with NVIDIA and can’t wait to see what our mutual customers are going to do next.

Resources

Check out the following resources to learn more about our partnership with NVIDIA and generative AI on AWS:


About the Author

Rahul Pathak is Vice President Data and AI GTM at AWS, where he leads the global go-to-market and specialist teams who are helping customers create differentiated value with AWS’s AI and capabilities such as Amazon Bedrock, Amazon Q, Amazon SageMaker, and Amazon EC2 and Data Services such as Amaqzon S3, AWS Glue and Amazon Redshift. Rahul believes that generative AI will transform virtually every single customer experience and that data is a key differentiator for customers as they build AI applications. Prior to his current role, he was Vice President, Relational Database Engines where he led Amazon Aurora, Redshift, and DSQL . During his 13+ years at AWS, Rahul has been focused on launching, building, and growing managed database and analytics services, all aimed at making it easy for customers to get value from their data. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on analytics and the other on IP-geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Read More

Amazon Q Business now available in Europe (Ireland) AWS Region

Amazon Q Business now available in Europe (Ireland) AWS Region

Today, we are excited to announce that Amazon Q Business—a fully managed generative-AI powered assistant that you can configure to answer questions, provide summaries and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.

Since its launch, Amazon Q Business has been helping customers find information, gain insight, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will support customers across Ireland and the EU to transform how their employees work and access information, while maintaining data security and privacy requirements.

AWS customers and partners innovate using Amazon Q Business in Europe

Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.

Katya Dunets, the AWS Lead Sales Engineer for Adastra noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing

With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.

Get started with Amazon Q Business today

To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (for example, text, PDF, images, and tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers like Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Amazon Q Business and the data they have access to. Amazon Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge; this may include getting details, generating content, or surfacing insights. Users can access Amazon Q Business through multiple channels, including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) like Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.

Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.


About the Authors

Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.

Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business, based in Seattle.

Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.

Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Read More

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that enables seamless integration into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more.
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and much more. By using these algorithms, you can align language models to be safer, more harmless, and more helpful.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.

Prerequisites

Before running the training job, you deploy a SageMaker HyperPod cluster. To deploy the cluster, you first need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the CloudFormation template to go to your AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx for Lustre file system.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators; typically used for managing jobs, monitoring, and debugging.
      2. Controller-machine: The head node for the SageMaker HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
  5. Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Create the SageMaker HyperPod cluster.
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.
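
Instead of re-running the CLI command manually, you can also poll the cluster status programmatically. The following is a minimal sketch (not part of the original workshop steps) that uses the boto3 describe_cluster API; the cluster name ml-cluster and the polling interval are assumptions for illustration.

import time

import boto3

sm_client = boto3.client("sagemaker")

def wait_for_cluster(cluster_name: str, poll_seconds: int = 60) -> str:
    """Poll the SageMaker HyperPod cluster until it leaves the Creating state."""
    while True:
        status = sm_client.describe_cluster(ClusterName=cluster_name)["ClusterStatus"]
        print(f"Cluster {cluster_name}: {status}")
        if status != "Creating":
            return status
        time.sleep(poll_seconds)

# Proceed to the next step only when this prints InService
print(wait_for_cluster("ml-cluster"))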

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partition and nodes per partition
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10; verify that this version is installed on the head node before proceeding.
  2. You can use the following steps to set up NeMo-Run dependencies using a virtual environment. The steps create and activate a virtual environment, then execute the venv.sh script to install the dependencies. The dependencies being installed include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, showing the reduced_train_loss metric, which decreases over the training steps.
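
If you prefer to inspect the metric values directly instead of the TensorBoard UI, the following is a minimal sketch (an addition for illustration) that reads the downloaded event file with TensorBoard’s EventAccumulator and prints the reduced_train_loss values; the tag name is taken from the training logs described above and might differ in your run.

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point the accumulator at the directory containing the events.out.tfevents.* file
acc = EventAccumulator(".")
acc.Reload()

# List the available scalar tags to confirm the exact metric name
print(acc.Tags()["scalars"])

for event in acc.Scalars("reduced_train_loss"):
    print(f"step={event.step} loss={event.value:.4f}")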

Troubleshooting

  • Issue: Some of the nodes appear in a down or down* state. In the following example output, both nodes are shown in the down* state.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following output, the nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.

References


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Read More

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.
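
To illustrate how dynamic (Matryoshka) embedding sizes reduce the storage footprint, the following is a minimal sketch of the typical client-side pattern: truncate the returned vector to a smaller dimension and re-normalize it before indexing. The target dimension of 384 and the random stand-in vector are illustrative assumptions, not values from the model documentation.

import numpy as np

def truncate_embedding(embedding: list[float], dim: int = 384) -> np.ndarray:
    """Keep the first dim components of a Matryoshka-style embedding and L2-normalize the result."""
    vec = np.asarray(embedding[:dim], dtype=np.float32)
    return vec / np.linalg.norm(vec)

full = np.random.rand(2048).tolist()  # stand-in for an embedding returned by the endpoint
compact = truncate_embedding(full)
print(compact.shape)  # (384,) -- smaller vectors shrink the vector index and storage costs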

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through AWS Marketplace. If you are already subscribed, then you can move forward with choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

import boto3

# The SageMaker client and execution role are assumed to be available in your environment
sm = boto3.client("sagemaker")
role = "Specify your SageMaker execution role ARN here"  # placeholder

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying instance type; in this case, we are using an ml.g5.2xlarge instance type accelerated by NVIDIA A10G GPUs. Make sure you have the account-level service limit for using ml.g5.2xlarge for endpoint usage as one or more instances. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,  # Model download timeout in seconds
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,  # Health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

The endpoint deployment can take several minutes. Use the following code to monitor the deployment of the NIM microservice; it polls the endpoint status until it changes from Creating to InService:

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService

After you deploy the model, your endpoint is ready for inference. In the following section, we use a sample text to do an inference request. For inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

import boto3

# Runtime client for invoking the SageMaker endpoint
client = boto3.client("sagemaker-runtime")

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}

Batch inference example

When you have many documents, you can vectorize each of them with a for loop. This will often result in many requests. Alternatively, you can send requests consisting of batches of documents to reduce the number of requests to the API endpoint. We use the following example with a dataset of 10 documents. Let’s test the model with a number of documents in different languages:

documents = [
"El futuro de la computación cuántica en aplicaciones criptográficas.",
"L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
"Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
"L’evoluzione del cloud computing nello sviluppo di software aziendale.",
"Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
"Потенциал граничных вычислений для обработки данных в реальном времени.",
"评估人工智能在欺诈检测系统中的有效性。",
"倫理的なAIアルゴリズムの開発における課題と機会。",
"دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
"सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenate vectors into a single list; preserve the original index
    encoded_data.extend({"embedding": data[1]["embedding"], "index": data[0]} for
                        data in zip(range(i, i+batch_size), response["data"]))

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the relevance scores are returned as raw logits rather than normalized probabilities. Higher scores indicate higher relevance to the query, and lower (more negative) scores indicate lower relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g. 4 in your example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
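
If your application needs scores in a fixed range, one common pattern is to map the logits through a sigmoid and sort the documents by the resulting score. The following is a minimal sketch (an illustrative addition, reusing the output and documents variables from the example above) that prints the passages in ranked order:

import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for rank, entry in enumerate(output["rankings"], start=1):
    doc_text = documents[entry["index"]]["text"]
    score = sigmoid(entry["logit"])
    print(f"{rank:>2}. score={score:.3f} logit={entry['logit']:>10.4f} | {doc_text[:60]}")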

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

As generative AI adoption accelerates across enterprises, maintaining safe, responsible, and compliant AI interactions has never been more critical. Amazon Bedrock Guardrails provides configurable safeguards that help organizations build generative AI applications with industry-leading safety protections. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FMs), improving user experiences and standardizing safety controls across generative AI applications. Beyond Amazon Bedrock models, the service offers the flexible ApplyGuardrails API that enables you to assess text using your pre-configured guardrails without invoking FMs, allowing you to implement safety controls across generative AI applications—whether running on Amazon Bedrock or on other systems—at both input and output levels.
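
As a quick illustration of that standalone capability, the following is a minimal sketch that calls the ApplyGuardrail API on a piece of input text without invoking a foundation model; the guardrail identifier, version, and sample text are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="exampleguardrail",  # guardrail ID or ARN (placeholder)
    guardrailVersion="1",
    source="INPUT",  # evaluate a user prompt; use "OUTPUT" for model responses
    content=[{"text": {"text": "How do I return a defective item and get a refund?"}}],
)

# GUARDRAIL_INTERVENED means a policy matched; NONE means the text passed the checks
print(response["action"])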

Today, we’re announcing a significant enhancement to Amazon Bedrock Guardrails: AWS Identity and Access Management (IAM) policy-based enforcement. This powerful capability enables security and compliance teams to establish mandatory guardrails for every model inference call, making sure organizational safety policies are consistently enforced across AI interactions. This feature enhances AI governance by enabling centralized control over guardrail implementation.

Challenges with building generative AI applications

Organizations deploying generative AI face critical governance challenges: content appropriateness, where models might produce undesirable responses to problematic prompts; safety concerns, with potential generation of harmful content even from innocent prompts; privacy protection requirements for handling sensitive information; and consistent policy enforcement across AI deployments.

Perhaps most challenging is making sure that appropriate safeguards are applied consistently across AI interactions within an organization, regardless of which team or individual is developing or deploying applications.

Amazon Bedrock Guardrails capabilities

Amazon Bedrock Guardrails enables you to implement safeguards in generative AI applications customized to your specific use cases and responsible AI policies. Guardrails currently supports six types of policies:

  • Content filters – Configurable thresholds across six harmful categories: hate, insults, sexual, violence, misconduct, and prompt injections
  • Denied topics – Definition of specific topics to be avoided in the context of an application
  • Sensitive information filters – Detection and removal of personally identifiable information (PII) and custom regex entities to protect user privacy
  • Word filters – Blocking of specific words in generative AI applications, such as harmful words, profanity, or competitor names and products
  • Contextual grounding checks – Detection and filtering of hallucinations in model responses by verifying if the response is properly grounded in the provided reference source and relevant to the user query
  • Automated reasoning – Prevention of factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with known facts and aren’t based on fabricated or inconsistent data

Policy-based enforcement of guardrails

Security teams often have organizational requirements to enforce the use of Amazon Bedrock Guardrails for every inference call to Amazon Bedrock. To support this requirement, Amazon Bedrock Guardrails provides the new IAM condition key bedrock:GuardrailIdentifier, which can be used in IAM policies to enforce the use of a specific guardrail for model inference. The condition key can be applied to model inference APIs such as InvokeModel and InvokeModelWithResponseStream, as shown in the policy examples later in this post.

The following diagram illustrates the policy-based enforcement workflow.

If the guardrail configured in your IAM policy doesn’t match the guardrail specified in the request, the request will be rejected with an access denied exception, enforcing compliance with organizational policies.
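
From the caller’s perspective, enforcement looks like the following minimal sketch: the inference request must reference the guardrail named in the IAM policy condition, otherwise the call fails with an access denied error. The model ID, guardrail identifier, and version below are placeholders for illustration.

import json

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
})

try:
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
        body=body,
        guardrailIdentifier="exampleguardrail",  # must satisfy the bedrock:GuardrailIdentifier condition
        guardrailVersion="1",
    )
    print(json.loads(response["body"].read()))
except ClientError as err:
    if err.response["Error"]["Code"] == "AccessDeniedException":
        print("Request rejected: the guardrail does not match the IAM policy condition.")
    else:
        raise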

Policy examples

In this section, we present several policy examples demonstrating how to enforce guardrails for model inference.

Example 1: Enforce the use of a specific guardrail and its numeric version

The following example illustrates the enforcement of exampleguardrail and its numeric version 1 during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

The added explicit deny blocks the listed actions for any other GuardrailIdentifier and GuardrailVersion values, irrespective of other permissions the user might have.

Example 2: Enforce the use of a specific guardrail and its draft version

The following example illustrates the enforcement of exampleguardrail and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 3: Enforce the use of a specific guardrail and its numeric versions

The following example illustrates the enforcement of exampleguardrail and its numeric versions during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 4: Enforce the use of a specific guardrail and its versions, including the draft

The following example illustrates the enforcement of exampleguardrail and its versions, including the draft, during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 5: Enforce the use of a specific guardrail and version pair from a list of guardrail and version pairs

The following example illustrates the enforcement of exampleguardrail1 and its version 1, or exampleguardrail2 and its version 2, or exampleguardrail3 and its version 3 and its draft during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
            ]
        }
    ]
}

Known limitations

When implementing policy-based guardrail enforcement, be aware of these limitations:

  • At the time of this writing, Amazon Bedrock Guardrails doesn’t support resource-based policies for cross-account access.
  • If a user assumes a role that has a specific guardrail configured using the bedrock:GuardrailIdentifier condition key, the user can strategically use input tags to help avoid having guardrail checks applied to certain parts of their prompt. Input tags allow users to mark specific sections of text that should be processed by guardrails, leaving other sections unprocessed. For example, a user could intentionally leave sensitive or potentially harmful content outside of the tagged sections, preventing those portions from being evaluated against the guardrail policies. However, regardless of how the prompt is structured or tagged, the guardrail will still be fully applied to the model’s response.
  • If a user has a role configured with a specific guardrail requirement (using the bedrock:GuardrailIdentifier condition), they shouldn’t use that same role to access services like Amazon Bedrock Knowledge Bases RetrieveAndGenerate or Amazon Bedrock Agents InvokeAgent. These higher-level services work by making multiple InvokeModel calls behind the scenes on the user’s behalf. Although some of these calls might include the required guardrail, others don’t. When the system attempts to make these guardrail-free calls using a role that requires guardrails, it results in AccessDenied errors, breaking the functionality of these services. To help avoid this issue, organizations should separate permissions—using different roles for direct model access with guardrails versus access to these composite Amazon Bedrock services.

Conclusion

The new IAM policy-based guardrail enforcement in Amazon Bedrock represents a crucial advancement in AI governance as generative AI becomes integrated into business operations. By enabling centralized policy enforcement, security teams can maintain consistent safety controls across AI applications regardless of who develops or deploys them, effectively mitigating risks related to harmful content, privacy violations, and bias. This approach offers significant advantages: it scales efficiently as organizations expand their AI initiatives without creating administrative bottlenecks, helps prevent technical debt by standardizing safety implementations, and enhances the developer experience by allowing teams to focus on innovation rather than compliance mechanics.

This capability demonstrates organizational commitment to responsible AI practices through comprehensive monitoring and audit mechanisms. Organizations can use model invocation logging in Amazon Bedrock to capture complete request and response data in Amazon CloudWatch Logs or Amazon Simple Storage Service (Amazon S3) buckets, including specific guardrail trace documentation showing when and how content was filtered. Combined with AWS CloudTrail integration that records guardrail configurations and policy enforcement actions, businesses can confidently scale their generative AI initiatives with appropriate safety mechanisms protecting their brand, customers, and data—striking the essential balance between innovation and ethical responsibility needed to build trust in AI systems.

Get started today with Amazon Bedrock Guardrails and implement configurable safeguards that balance innovation with responsible AI governance across your organization.


About the Authors

Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries for natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. Firstly, LLMs don’t have access to enterprise databases, and the models need to be customized to understand the specific database of an enterprise. Additionally, the complexity increases due to the presence of synonyms for columns and internal metrics available.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application with RAG. We use Anthropic’s Claude 3.5 Sonnet to generate SQL queries, Amazon Titan Text Embeddings for text embedding, and Amazon Bedrock to access these models.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundational model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space, and it is usually performed by a machine learning (ML) model. The following diagram provides more details about embeddings.
  3. RAG – We use RAG to provide more context about table schemas, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes you can build powerful data apps using only Python.

The following diagram shows the solution architecture.


We need to provide the LLM with information about the enterprise-specific database. This makes sure that the model can correctly understand the database and generate responses tailored to the enterprise data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices. Each choice represents one of these four categories and is linked to an individual JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts the text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides input through the chat prompt, we use similarity search to find the relevant table metadata in the vector database for the user’s query. The user input is combined with the relevant table metadata and the prompt template and passed to the FM as a single input. The FM generates the SQL query based on this final input.

To evaluate the model’s accuracy and keep a record of inputs and outputs, we store every user input and model output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with your unique identifier (see the sketch after this list for one way to create it programmatically). Bucket names are globally unique across Amazon S3.
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain
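
As referenced in step 3, you can create the logging bucket programmatically instead of through the console. The following minimal sketch, assuming the us-east-1 Region and a hypothetical bucket name, uses boto3:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Bucket names are globally unique; replace the suffix with your own identifier
    bucket_name = "simplesql-logs-example1234"  # hypothetical name

    # In us-east-1, no CreateBucketConfiguration is needed; other Regions require
    # CreateBucketConfiguration={"LocationConstraint": "<region>"}
    s3.create_bucket(Bucket=bucket_name)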

Procedure

There are three main components in this solution:

  1. JSON files store the table schema and configure the LLM
  2. Vector indexing using Amazon Bedrock
  3. Streamlit for the front-end UI

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file named Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file named library.py by following these steps (a quick usage sketch follows the list):

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain.llms import Bedrock  # LLM wrapper for Amazon Bedrock (BedrockChat is used below)
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize the cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
            
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file
            # file_path="Table_Schema_RP.json",  # Uncomment to use a different file
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # Treat the content as text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide a SQL query using Amazon Redshift syntax that retrieves the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket. This code continues inside the if input_text: block from the previous step:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To visit the application using your browser, navigate to localhost (by default, Streamlit serves the app on port 8501, for example http://localhost:8501).

To visit the application using SageMaker, copy your notebook URL and replace default/lab in the URL with default/proxy/8501/. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL Query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions, and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error raised by the inference endpoint: an error occurred (AccessDeniedException) when calling the InvokeModel operation. You don’t have access to the model with the specified model ID.
Solution – Make sure you have access to the FMs used in this solution on Amazon Bedrock: Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to the following:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More

Unleash AI innovation with Amazon SageMaker HyperPod

Unleash AI innovation with Amazon SageMaker HyperPod

The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters. Customers also face the challenges of writing specialized code for distributed training, continuously optimizing models, addressing hardware issues, and keeping projects on track and within budget. To simplify this process, AWS introduced Amazon SageMaker HyperPod during AWS re:Invent 2023, and it has emerged as a pioneering solution, revolutionizing how companies approach AI development and deployment.

As Amazon CEO Andy Jassy recently shared, “One of the most exciting innovations we’ve introduced is SageMaker HyperPod. HyperPod accelerates the training of machine learning models by distributing and parallelizing workloads across numerous powerful processors like AWS’s Trainium chips or GPUs. HyperPod also constantly monitors your infrastructure for problems, automatically repairing them when detected. During repair, your work is automatically saved, ensuring seamless resumption. This innovation is widely adopted, with most SageMaker AI customers relying on HyperPod for their demanding training needs.”

In this post, we show how SageMaker HyperPod, and its new features introduced at AWS re:Invent 2024, is designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale and attractive price-performance.

Customers using SageMaker HyperPod

Leading startups like Writer, Luma AI, and Perplexity, as well as major enterprises such as Thomson Reuters and Salesforce, are accelerating model development with SageMaker HyperPod. Amazon itself used SageMaker HyperPod to train its new Amazon Nova models, significantly reducing training costs, enhancing infrastructure performance, and saving months of manual effort that would have otherwise been spent on cluster setup and end-to-end process management.

Today, more organizations are eager to fine-tune popular publicly available models or train their own specialized models to revolutionize their businesses and applications with generative AI. To support this demand, SageMaker HyperPod continues to evolve, introducing new innovations that make it straightforward, faster, and more cost-effective for customers to build, train, and deploy these models at scale.

Deep infrastructure control

SageMaker HyperPod offers persistent clusters with deep infrastructure control, enabling builders to securely connect using SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost), minimizing downtime during critical node replacements.

You can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), along with the libraries built on these tools, to enable flexible job scheduling and compute sharing. Integrating SageMaker HyperPod clusters with Slurm also allows the use of NVIDIA’s Enroot and Pyxis for efficient container scheduling in performant, unprivileged sandboxes. The underlying operating system and software stack are based on the Deep Learning AMI, preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. SageMaker HyperPod also is integrated with Amazon SageMaker AI distributed training libraries, optimized for AWS infrastructure, enabling automatic workload distribution across thousands of accelerators for efficient parallel training.

Builders can use built-in ML tools within SageMaker HyperPod to enhance model performance. For example, Amazon SageMaker with TensorBoard helps visualize model architecture and address convergence issues, as shown in the following screenshot. Integration with observability tools like Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, streamlining development time.

SageMaker HyperPod allows you to implement custom libraries and frameworks, enabling the service to be tailored to specific AI project needs. This level of personalization is essential in the rapidly evolving AI landscape, where innovation often requires experimenting with cutting-edge techniques and technologies. The adaptability of SageMaker HyperPod means that businesses are not constrained by infrastructure limitations, fostering creativity and technological advancement.

Intelligent resource management

As organizations increasingly provision large amounts of accelerated compute capacity for model training, they face challenges in effectively governing resource usage. These compute resources are both expensive and finite, making it crucial to prioritize critical model development tasks and avoid waste or underutilization. Without proper controls over task prioritization and resource allocation, some projects stall due to insufficient resources, while others leave resources underused. This creates a significant burden for administrators, who must constantly reallocate resources, and for data scientists, who struggle to maintain progress. These inefficiencies delay AI innovation and drive up costs.

SageMaker HyperPod addresses these challenges with its task governance capabilities, enabling you to maximize accelerator utilization for model training, fine-tuning, and inference. With just a few clicks, you can define task priorities and set limits on compute resource usage for teams. Once configured, SageMaker HyperPod automatically manages the task queue, making sure the most critical work receives the necessary resources. This reduction in operational overhead allows organizations to reallocate valuable human resources toward more innovative and strategic initiatives. This reduces model development costs by up to 40%.

For instance, if an inference task powering a customer-facing service requires urgent compute capacity but all resources are currently in use, SageMaker HyperPod reallocates underutilized or non-urgent resources to prioritize the critical task. Non-urgent tasks are automatically paused, checkpoints are saved to preserve progress, and these tasks resume seamlessly when resources become available. This makes sure you maximize your compute investments without compromising ongoing work.

As a fast-growing generative AI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team’s productivity, allowing them to bring new generative AI innovations to market faster than ever before.

At its core, SageMaker HyperPod represents a paradigm shift in AI infrastructure, moving beyond the traditional emphasis on raw computational power to focus on intelligent and adaptive resource management. By prioritizing optimized resource allocation, SageMaker HyperPod minimizes waste, maximizes efficiency, and accelerates innovation—all while reducing costs. This makes AI development more accessible and scalable for organizations of all sizes.

Get started faster with SageMaker HyperPod recipes

Many customers want to customize popular publicly available models, like Meta’s Llama and Mistral, for their specific use cases using their organization’s data. However, optimizing training performance often requires weeks of iterative testing—experimenting with algorithms, fine-tuning parameters, monitoring training impact, debugging issues, and benchmarking performance.

To simplify this process, SageMaker HyperPod now offers over 30 curated model training recipes for some of today’s most popular models, including DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral. These recipes enable you to get started in minutes by automating key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. This empowers users of all skill levels to achieve better price-performance for model training on AWS infrastructure from the outset, eliminating weeks of manual evaluation and testing.

You can browse the GitHub repo to explore available training recipes, customize parameters to fit your needs, and deploy in minutes. With a simple one-line change, you can seamlessly switch between GPU or AWS Trainium based instances to further optimize price-performance.

Researchers at Salesforce were looking for ways to quickly get started with foundation model (FM) training and fine-tuning, without having to worry about the infrastructure, or spend weeks optimizing their training stack for each new model. With SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Integrating Kubernetes with SageMaker HyperPod

Though the standalone capabilities of SageMaker HyperPod are impressive, its integration with Amazon EKS takes AI workloads to new levels of power and flexibility. Amazon EKS simplifies the deployment, scaling, and management of containerized applications, making it an ideal solution for orchestrating complex AI/ML infrastructure.

By running SageMaker HyperPod on Amazon EKS, organizations can use Kubernetes’s advanced scheduling and orchestration features to dynamically provision and manage compute resources for AI/ML workloads, providing optimal resource utilization and scalability.

“We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” says John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarization and classification.”

This integration also enhances fault tolerance and high availability. With self-healing capabilities, HyperPod automatically replaces failed nodes, maintaining workload continuity. Automated GPU health monitoring and seamless node replacement provide reliable execution of AI/ML workloads with minimal downtime, even during hardware failures.

Additionally, running SageMaker HyperPod on Amazon EKS enables efficient resource isolation and sharing using Kubernetes namespaces and resource quotas. Organizations can isolate different AI/ML workloads or teams while maximizing resource utilization across the cluster.
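
As a concrete illustration of namespace-level isolation on an EKS-backed cluster, the following minimal sketch, assuming the kubernetes Python client, a configured kubeconfig, and a hypothetical team-a namespace, caps the GPUs that one team’s workloads can request. This is generic Kubernetes tooling rather than a HyperPod-specific API.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (assumes access to the EKS cluster)
    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    # Cap the GPUs that workloads in the team-a namespace can request
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="team-a-gpu-quota", namespace="team-a"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.nvidia.com/gpu": "8",  # total GPUs requestable by team-a
                "limits.nvidia.com/gpu": "8",
            }
        ),
    )

    core_v1.create_namespaced_resource_quota(namespace="team-a", body=quota)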

Flexible training plans help meet timelines and budgets

Although infrastructure innovations help reduce costs and improve training efficiency, customers still face challenges in planning and managing the compute capacity needed to complete training tasks on time and within budget. To address this, AWS is introducing flexible training plans for SageMaker HyperPod.

With just a few clicks, you can specify your desired completion date and the maximum amount of compute resources needed. SageMaker HyperPod then helps acquire capacity and sets up clusters, saving teams weeks of preparation time. This eliminates much of the uncertainty customers encounter when acquiring large compute clusters for model development tasks.


SageMaker HyperPod training plans are now available in US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker pricing page.

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. SageMaker HyperPod flexible training plans made it straightforward for them to gain access to EC2 P5 instances.

Developers and data scientists at OpenBabylon, an AI company that customizes LLMs for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources to run large-scale experiments. Using the multi-node SageMaker HyperPod distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating the ability of SageMaker HyperPod to deliver complex projects on time and within budget.

Integrating training and inference infrastructures

A key focus area is integrating next-generation AI accelerators like the anticipated AWS Trainium2 release. These advanced accelerators promise unparalleled computational performance, offering 30–40% better price-performance than the current generation of GPU-based EC2 instances, significantly boosting AI model training and deployment efficiency and speed. This will be crucial for real-time applications and processing vast datasets simultaneously. The seamless accelerator integration with SageMaker HyperPod enables businesses to harness cutting-edge hardware advancements, driving AI initiatives forward.

Another pivotal aspect is that SageMaker HyperPod, through its integration with Amazon EKS, enables scalable inference solutions. As real-time data processing and decision-making demands grow, the SageMaker HyperPod architecture efficiently handles these requirements. This capability is essential across sectors like healthcare, finance, and autonomous systems, where timely, accurate AI inferences are critical. Offering scalable inference enables deploying high-performance AI models under varying workloads, enhancing operational effectiveness.

Moreover, integrating training and inference infrastructures represents a significant advancement, streamlining the AI lifecycle from development to deployment and providing optimal resource utilization throughout. Bridging this gap facilitates a cohesive, efficient workflow, reducing transition complexities from development to real-world applications. This holistic integration supports continuous learning and adaptation, which is key for next-generation, self-evolving AI models (continuously learning models, which possess the ability to adapt and refine themselves in real time based on their interactions with the environment).

SageMaker HyperPod uses established open source technologies, including MLflow integration through SageMaker, container orchestration through Amazon EKS, and Slurm workload management, providing users with familiar and proven tools for their ML workflows. By engaging the global AI community and encouraging knowledge sharing, SageMaker HyperPod continuously evolves, incorporating the latest research advancements. This collaborative approach helps SageMaker HyperPod remain at the forefront of AI technology, providing the tools to drive transformative change.

Conclusion

SageMaker HyperPod represents a fundamental change in AI infrastructure, offering a future-fit solution that empowers organizations to unlock the full potential of AI technologies. With its intelligent resource management, versatility, scalability, and forward-thinking design, SageMaker HyperPod enables businesses to accelerate innovation, reduce operational costs, and stay ahead of the curve in the rapidly evolving AI landscape.

Whether it’s optimizing the training of LLMs, processing complex datasets for medical imaging inference, or exploring novel AI architectures, SageMaker HyperPod provides a robust and flexible foundation for organizations to push the boundaries of what is possible in AI.

As AI continues to reshape industries and redefine what is possible, SageMaker HyperPod stands at the forefront, enabling organizations to navigate the complexities of AI workloads with unparalleled agility, efficiency, and innovation. With its commitment to continuous improvement, strategic partnerships, and alignment with emerging technologies, SageMaker HyperPod is poised to play a pivotal role in shaping the future of AI, empowering organizations to unlock new realms of possibility and drive transformative change.

Take the first step towards revolutionizing your AI initiatives by scheduling a consultation with our experts. Let us guide you through the process of harnessing the power of SageMaker HyperPod and unlock a world of possibilities for your business.


About the authors

Ilan Gleiser is a Principal GenAI Specialist on the AWS WWSO Frameworks team, focusing on developing scalable Artificial General Intelligence architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 20 blog posts and delivered more than 100 prototypes globally over the last 5 years. Ilan holds a Master’s degree in mathematical economics.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Shubha Kumbadakone is a Senior Manager on the AWS WWSO Frameworks team focusing on foundation model builders and self-managed machine learning, with an emphasis on open source software and tools. She has more than 19 years of experience in cloud infrastructure and machine learning and helps customers build their distributed training and inference at scale for their ML models on AWS. She also holds a patent on a caching algorithm for rapid resume from hibernation for mobile systems.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelor’s degree from the University of Virginia and is based in Boston, Massachusetts.

Revolutionizing clinical trials with the power of voice and AI

Revolutionizing clinical trials with the power of voice and AI

In the rapidly evolving healthcare landscape, patients often find themselves navigating a maze of complex medical information, seeking answers to their questions and concerns. However, accessing accurate and comprehensible information can be a daunting task, leading to confusion and frustration. This is where the integration of cutting-edge technologies, such as audio-to-text translation and large language models (LLMs), holds the potential to revolutionize the way patients receive, process, and act on vital medical information.

As the healthcare industry continues to embrace digital transformation, solutions that combine advanced technologies like audio-to-text translation and LLMs will become increasingly valuable in addressing key challenges, such as patient education, engagement, and empowerment. By taking advantage of these innovative technologies, healthcare providers can deliver more personalized, efficient, and effective care, ultimately improving patient outcomes and driving progress in the life sciences domain.

For instance, envision a voice-enabled virtual assistant that not only understands your spoken queries, but also transcribes them into text with remarkable accuracy. This transcription then serves as the input for a powerful LLM, which draws upon its vast knowledge base to provide personalized, context-aware responses tailored to your specific situation. This solution can transform the patient education experience, empowering individuals to make informed decisions about their healthcare journey.

In this post, we discuss possible use cases for combining speech recognition technology with LLMs, and how the solution can revolutionize clinical trials.

By combining speech recognition technology with LLMs, the solution can accurately transcribe a patient’s spoken queries into text, enabling the LLM to understand and analyze the context of the question. The LLM can then use its extensive knowledge base, which can be regularly updated with the latest medical research and clinical trial data, to provide relevant and trustworthy responses tailored to the patient’s specific situation.

Some of the potential benefits of this integrated approach are that patients can receive instant access to reliable information, empowering them to make more informed decisions about their healthcare. Additionally, the solution can help alleviate the burden on healthcare professionals by providing patients with a convenient and accessible source of information, freeing up valuable time for more critical tasks. Furthermore, the voice-enabled interface can enhance accessibility for patients with disabilities or those who prefer verbal communication, making sure that no one is left behind in the pursuit of better health outcomes.

Use cases overview

In this section, we discuss several possible use cases for this solution.

Use case 1: Audio-to-text translation and LLM integration for clinical trial patient interactions

In the domain of clinical trials, effective communication between patients and physicians is crucial for gathering accurate data, supporting patient adherence, and maintaining study integrity. This use case demonstrates how audio-to-text translation combined with LLM capabilities can streamline and enhance the process of capturing and analyzing patient-physician interactions during clinical trial visits and telemedicine sessions.

The process flow consists of the following steps:

  1. Audio capture – During patient visits or telemedicine sessions, the audio of the patient-physician interaction is recorded securely, with appropriate consent and privacy measures in place.
  2. Audio-to-text translation – The recorded audio is processed through an automatic speech recognition (ASR) system, which converts the audio into text transcripts. This step provides an accurate and efficient conversion of spoken words into a format suitable for further analysis.
  3. Text preprocessing – The transcribed text undergoes preprocessing steps, such as removing identifying information, formatting the data, and enforcing compliance with relevant data privacy regulations.
  4. LLM integration – The preprocessed text is fed into a powerful LLM tailored for the healthcare and life sciences (HCLS) domain. The LLM analyzes the text, identifying key information relevant to the clinical trial, such as patient symptoms, adverse events, medication adherence, and treatment responses.
  5. Intelligent insights and recommendations – Using its large knowledge base and advanced natural language processing (NLP) capabilities, the LLM provides intelligent insights and recommendations based on the analyzed patient-physician interaction. These insights can include:
    1. Potential adverse event detection and reporting.
    2. Identification of protocol deviations or non-compliance.
    3. Recommendations for personalized patient care or adjustments to treatment regimens.
    4. Extraction of relevant data points for electronic health records (EHRs) and clinical trial databases.
  6. Data integration and reporting – The extracted insights and recommendations are integrated into the relevant clinical trial management systems, EHRs, and reporting mechanisms. This streamlines the process of data collection, analysis, and decision-making for clinical trial stakeholders, including investigators, sponsors, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data accuracy – By accurately capturing and analyzing patient-physician interactions, this approach minimizes the risks of manual transcription errors and provides high-quality data for clinical trial analysis and decision-making.
  • Enhanced patient safety – The LLM’s ability to detect potential adverse events and protocol deviations can help identify and mitigate risks, improving patient safety and study integrity.
  • Personalized patient care – Using the LLM’s insights, physicians can provide personalized care recommendations, tailored treatment plans, and better manage patient adherence, leading to improved patient outcomes.
  • Streamlined data collection and analysis – Automating the process of extracting relevant data points from patient-physician interactions can significantly reduce the time and effort required for manual data entry and analysis, enabling more efficient clinical trial management.
  • Regulatory compliance – By integrating the extracted insights and recommendations into clinical trial management systems and EHRs, this approach facilitates compliance with regulatory requirements for data capture, adverse event reporting, and trial monitoring.

This use case demonstrates the potential of combining audio-to-text translation and LLM capabilities to enhance patient-physician communication, improve data quality, and support informed decision-making in the context of clinical trials. By using advanced technologies, this integrated approach can contribute to more efficient, effective, and patient-centric clinical research processes.
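
To make steps 2 and 3 of the preceding process flow more concrete, the following is a minimal sketch of how a recorded visit could be transcribed with PII redaction using Amazon Transcribe batch transcription. The bucket, key, and job names are illustrative, and the exact redaction settings would depend on your compliance requirements.

import boto3

transcribe = boto3.client("transcribe")

# Start a batch transcription job for one recorded visit and redact PII in the output.
# Bucket, key, and job names are illustrative.
transcribe.start_transcription_job(
    TranscriptionJobName="visit-interview-001",
    Media={"MediaFileUri": "s3://example-trial-bucket/visits/interview-001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="example-trial-bucket",
    OutputKey="transcripts/interview-001.json",
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
    },
)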

Use case 2: Intelligent site monitoring with audio-to-text translation and LLM capabilities

In the HCLS domain, site monitoring plays a crucial role in maintaining the integrity and compliance of clinical trials. Site monitors conduct on-site visits, interview personnel, and verify documentation to assess adherence to protocols and regulatory requirements. However, this process can be time-consuming and prone to errors, particularly when dealing with extensive audio recordings and voluminous documentation.

By integrating audio-to-text translation and LLM capabilities, we can streamline and enhance the site monitoring process, leading to improved efficiency, accuracy, and decision-making support.

The process flow consists of the following steps:

  1. Audio capture and transcription – During site visits, monitors record interviews with site personnel, capturing valuable insights and observations. These audio recordings are then converted into text using ASR and audio-to-text translation technologies.
  2. Document ingestion – Relevant site documents, such as patient records, consent forms, and protocol manuals, are digitized and ingested into the system.
  3. LLM-powered data analysis – The transcribed interviews and ingested documents are fed into a powerful LLM, which can understand and correlate the information from multiple sources. The LLM can identify key insights, potential issues, and areas of non-compliance by analyzing the content and context of the data.
  4. Case report form generation – Based on the LLM’s analysis, a comprehensive case report form (CRF) is generated, summarizing the site visit findings, identifying potential risks or deviations, and providing recommendations for corrective actions or improvements.
  5. Decision support and site selection – The CRFs and associated data can be further analyzed by the LLM to identify patterns, trends, and potential risks across multiple sites. This information can be used to support decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and data analysis processes, site monitors can save significant time and effort, allowing them to focus on more critical tasks and cover more sites within the same time frame.
  • Enhanced accuracy – LLMs can identify and correlate subtle patterns and nuances within the data, reducing the risk of overlooking critical information or making erroneous assumptions.
  • Comprehensive documentation – The generated CRFs provide a standardized and detailed record of site visits, facilitating better communication and collaboration among stakeholders.
  • Regulatory compliance – The LLM-powered analysis can help identify potential areas of non-compliance, enabling proactive measures to address issues and mitigate risks.
  • Informed decision-making – The insights derived from the LLM’s analysis can support data-driven decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

By combining audio-to-text translation and LLM capabilities, this integrated approach offers a powerful solution for intelligent site monitoring in the HCLS domain, supporting improved efficiency, accuracy, and decision-making while providing regulatory compliance and quality assurance.

Use case 3: Enhancing adverse event reporting in clinical trials with audio-to-text and LLMs

Clinical trials are crucial for evaluating the safety and efficacy of investigational drugs and therapies. Accurate and comprehensive adverse event reporting is essential for identifying potential risks and making informed decisions. By combining audio-to-text translation with LLM capabilities, we can streamline and augment the adverse event reporting process, leading to improved patient safety and more efficient clinical research.

The process flow consists of the following steps:

  1. Audio data collection – During clinical trial visits or follow-ups, audio recordings of patient-doctor interactions are captured, including detailed descriptions of adverse events or symptoms experienced by the participants. These audio recordings can be obtained through various channels, such as in-person visits, telemedicine consultations, or dedicated voice reporting systems.
  2. Audio-to-text transcription – The audio recordings are processed through an audio-to-text translation system, converting the spoken words into written text format. ASR and NLP techniques provide accurate transcription, accounting for factors like accents, background noise, and medical terminology.
  3. Text data integration – The transcribed text data is integrated with other sources of adverse event reporting, such as electronic case report forms (eCRFs), patient diaries, and medication logs. This comprehensive dataset provides a holistic view of the adverse events reported across multiple data sources.
  4. LLM analysis – The integrated dataset is fed into an LLM specifically trained on medical and clinical trial data. The LLM analyzes the textual data, identifying patterns, extracting relevant information, and generating insights related to adverse event occurrences, severity, and potential causal relationships.
  5. Intelligent reporting and decision support – The LLM generates detailed adverse event reports, highlighting key findings, trends, and potential safety signals. These reports can be presented to clinical trial teams, regulatory bodies, and safety monitoring committees, supporting informed decision-making processes. The LLM can also provide recommendations for further investigation, protocol modifications, or risk mitigation strategies based on the identified adverse event patterns.

The solution offers the following potential benefits:

  • Improved data capture – By using audio-to-text translation, valuable information from patient-doctor interactions can be captured and included in adverse event reporting, reducing the risk of missed or incomplete data.
  • Enhanced accuracy and completeness – The integration of multiple data sources, combined with the LLM’s analysis capabilities, provides a comprehensive and accurate understanding of adverse events, reducing the potential for errors or omissions.
  • Efficient data analysis – The LLM can rapidly process large volumes of textual data, identifying patterns and insights that might be difficult or time-consuming for human analysts to detect manually.
  • Timely decision support – Real-time adverse event reporting and analysis enable clinical trial teams to promptly identify and address potential safety concerns, mitigating risks and protecting participant well-being.
  • Regulatory compliance – Comprehensive adverse event reporting and detailed documentation facilitate compliance with regulatory requirements and support transparent communication with regulatory agencies.

By integrating audio-to-text translation with LLM capabilities, this approach addresses the critical need for accurate and timely adverse event reporting in clinical trials, ultimately enhancing patient safety, improving research efficiency, and supporting informed decision-making in the HCLS domain.
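
As a minimal sketch of the LLM analysis step (step 4 of the process flow above), the following example asks a Bedrock-hosted model to return reported adverse events as structured JSON using the Amazon Bedrock Converse API. The transcript text, prompt wording, and model ID are illustrative; the post does not prescribe a specific extraction prompt.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative transcript snippet from a trial interview.
transcript = "Patient reports mild headaches most mornings and some nausea after the evening dose."

# Ask the model to return reported adverse events as structured JSON.
response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{
        "role": "user",
        "content": [{
            "text": (
                "From the clinical trial interview transcript below, list every "
                "reported adverse event as a JSON array of objects with the keys "
                '"event" and "severity".\n\n'
                f"<transcript>{transcript}</transcript>"
            )
        }],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0},
)

print(response["output"]["message"]["content"][0]["text"])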

Use case 4: Audio-to-text and LLM integration for enhanced patient care

In the healthcare domain, effective communication and accurate data capture are crucial for providing personalized and high-quality care. By integrating audio-to-text translation capabilities with LLM technology, we can streamline processes and unlock valuable insights, ultimately improving patient outcomes.

The process flow consists of the following steps:

  1. Audio input collection – Caregivers or healthcare professionals can record audio updates on a patient’s condition, mood, or relevant observations using a secure and user-friendly interface. This could be done through mobile devices, dedicated recording stations, or during virtual consultations.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a speech-to-text engine, which converts the spoken words into text format. Advanced NLP techniques provide accurate transcription, handling accents, medical terminology, and background noise.
  3. Text processing and contextualization – The transcribed text is then fed into an LLM trained on various healthcare datasets, including medical literature, clinical guidelines, and deidentified patient records. The LLM processes the text, identifies key information, and extracts relevant context and insights.
  4. LLM-powered analysis and recommendations – Using its sizeable knowledge base and natural language understanding capabilities, the LLM can perform various tasks, such as:
    1. Identifying potential health concerns or risks based on the reported symptoms and observations.
    2. Suggesting personalized care plans or treatment options aligned with evidence-based practices.
    3. Providing recommendations for follow-up assessments, diagnostic tests, or specialist consultations.
    4. Flagging potential drug interactions or contraindications based on the patient’s medical history.
    5. Generating summaries or reports in a structured format for efficient documentation and communication.
  5. Integration with EHRs – The analyzed data and recommendations from the LLM can be seamlessly integrated into the patient’s EHR, providing a comprehensive and up-to-date medical profile. This enables healthcare professionals to access relevant information promptly and make informed decisions during consultations or treatment planning.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and analysis process, healthcare professionals can save time and focus on providing personalized care, rather than spending extensive hours on documentation and data entry.
  • Enhanced accuracy – ASR and NLP techniques provide accurate transcription, reducing errors and improving data quality.
  • Comprehensive patient insights – The LLM’s ability to process and contextualize unstructured audio data provides a more holistic understanding of the patient’s condition, enabling better-informed decision-making.
  • Personalized care plans – By using the LLM’s knowledge base and analytical capabilities, healthcare professionals can develop tailored care plans aligned with the patient’s specific needs and medical history.
  • Streamlined communication – Structured reports and summaries generated by the LLM facilitate efficient communication among healthcare teams, making sure everyone has access to the latest patient information.
  • Continuous learning and improvement – As more data is processed, the LLM can continuously learn and refine its recommendations, improving its performance over time.

By integrating audio-to-text translation and LLM capabilities, healthcare organizations can unlock new efficiencies, enhance patient-provider communication, and ultimately deliver superior care while staying at the forefront of technological advancements in the industry.

Use case 5: Audio-to-text translation and LLM integration for clinical trial protocol design

Efficient and accurate protocol design is crucial for successful study execution and regulatory compliance. By combining audio-to-text translation capabilities with the power of LLMs, we can streamline the protocol design process, using diverse data sources and AI-driven insights to create high-quality protocols in a timely manner.

The process flow consists of the following steps:

  1. Audio input collection – Clinical researchers, subject matter experts, and stakeholders provide audio inputs, such as recorded meetings, discussions, or interviews, related to the proposed clinical trial. These audio files can capture valuable insights, requirements, and domain-specific knowledge.
  2. Audio-to-text transcription – Using ASR technology, the audio inputs are converted into text transcripts with high accuracy. This step makes sure that valuable information is captured and transformed into a format suitable for further processing by LLMs.
  3. Data integration – Relevant data sources, such as previous clinical trial protocols, regulatory guidelines, scientific literature, and medical databases, are integrated into the workflow. These data sources provide contextual information and serve as a knowledge base for the LLM.
  4. LLM processing – The transcribed text, along with the integrated data sources, is fed into a powerful LLM. The LLM uses its knowledge base and NLP capabilities to analyze the inputs, identify key elements, and generate a draft clinical trial protocol.
  5. Protocol refinement and review – The draft protocol generated by the LLM is reviewed by clinical researchers, medical experts, and regulatory professionals. They provide feedback, make necessary modifications, and enforce compliance with relevant guidelines and best practices.
  6. Iterative improvement – As the AI system receives feedback and correlated outcomes from completed clinical trials, it continuously learns and refines its protocol design capabilities. This iterative process enables the LLM to become more accurate and efficient over time, leading to higher-quality protocol designs.

The solution offers the following potential benefits:

  • Efficiency – By automating the initial protocol design process, researchers can save valuable time and resources, allowing them to focus on more critical aspects of clinical trial execution.
  • Accuracy and consistency – LLMs can use vast amounts of data and domain-specific knowledge, reducing the risk of errors and providing consistency across protocols.
  • Knowledge integration – The ability to seamlessly integrate diverse data sources, including audio recordings, scientific literature, and regulatory guidelines, enhances the quality and comprehensiveness of the protocol design.
  • Continuous improvement – The iterative learning process allows the AI system to adapt and improve its protocol design capabilities based on real-world outcomes, leading to increasingly accurate and effective protocols over time.
  • Decision-making support – By providing well-structured and comprehensive protocols, the AI-driven approach enables better-informed decision-making for clinical researchers, sponsors, and regulatory bodies.

This integrated approach using audio-to-text translation and LLM capabilities has the potential to revolutionize the clinical trial protocol design process, ultimately contributing to more efficient and successful clinical trials, accelerating the development of life-saving treatments, and improving patient outcomes.

Use case 6: Voice-enabled clinical trial and disease information assistant

In the HCLS domain, effective communication and access to accurate information are crucial for patients, caregivers, and healthcare professionals. This use case demonstrates how audio-to-text translation combined with LLM capabilities can address these needs by providing an intelligent, voice-enabled assistant for clinical trial and disease information.

The process flow consists of the following steps:

  1. Audio input – The user, whether a patient, caregiver, or healthcare professional, can initiate the process by providing a voice query related to a specific disease or clinical trial. This could include questions about the disease itself, treatment options, ongoing trials, eligibility criteria, or other relevant information.
  2. Audio-to-text translation – The audio input is converted into text using state-of-the-art speech recognition technology. This step makes sure that the user’s query is accurately transcribed and ready for further processing by the LLM.
  3. Data integration – The system integrates various data sources, including clinical trial data, disease-specific information from reputable sources (such as PubMed or WebMD), and other relevant third-party resources. This comprehensive data integration makes sure that the LLM has access to a large knowledge base for generating accurate and comprehensive responses.
  4. LLM processing – The transcribed query is fed into the LLM, which uses its natural language understanding capabilities to comprehend the user’s intent and extract relevant information from the integrated data sources. The LLM can provide intelligent responses, insights, and recommendations based on the query and the available data.
  5. Response generation – The LLM generates a detailed, context-aware response addressing the user’s query. This response can be presented in various formats, such as text, audio (using text-to-speech technology), or a combination of both, depending on the user’s preferences and accessibility needs.
  6. Feedback and continuous improvement – The system can incorporate user feedback mechanisms to improve its performance over time. This feedback can be used to refine the LLM’s understanding, enhance the data integration process, and make sure that the system remains up to date with the latest clinical trial and disease information.

The solution offers the following potential benefits:

  • Improved access to information – By using voice input and NLP capabilities, the system empowers patients, caregivers, and healthcare professionals to access accurate and comprehensive information about diseases and clinical trials, regardless of their technical expertise or literacy levels.
  • Enhanced communication – The voice-enabled interface facilitates seamless communication between users and the system, enabling them to ask questions and receive responses in a conversational manner, mimicking human-to-human interaction.
  • Personalized insights – The LLM can provide personalized insights and recommendations based on the user’s specific query and context, enabling more informed decision-making and tailored support for individuals.
  • Time and efficiency gains – By automating the process of information retrieval and providing intelligent responses, the system can significantly reduce the time and effort required for healthcare professionals to manually search and synthesize information from multiple sources.
  • Improved patient engagement – By offering accessible and user-friendly access to disease and clinical trial information, the system can empower patients and caregivers to actively participate in their healthcare journey, fostering better engagement and understanding.

This use case highlights the potential of integrating audio-to-text translation with LLM capabilities to address real-world challenges in the HCLS domain. By using cutting-edge technologies, this solution can improve information accessibility, enhance communication, and support more informed decision-making for all stakeholders involved in clinical trials and disease management.
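
Step 5 of the process flow above mentions presenting responses as audio using text-to-speech. As one possible sketch, assuming Amazon Polly as the text-to-speech service (the post itself does not name one), an LLM answer could be converted to speech as follows; the answer text, voice, and file name are illustrative.

import boto3

polly = boto3.client("polly")

# Illustrative answer produced by the LLM in the previous step.
answer_text = (
    "Based on the trial's eligibility criteria, adults between 18 and 65 who have "
    "not previously received this class of treatment may qualify. Please confirm "
    "the details with the study coordinator."
)

# Convert the LLM's answer into speech so the assistant can reply by voice.
speech = polly.synthesize_speech(
    Text=answer_text,
    OutputFormat="mp3",
    VoiceId="Joanna",
)

with open("assistant_reply.mp3", "wb") as f:
    f.write(speech["AudioStream"].read())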

For demonstration purposes, we will focus on the following use case:

Use case overview: Patient reporting and analysis in clinical trials

In clinical trials, it’s crucial to gather accurate and comprehensive patient data to assess the safety and efficacy of investigational drugs or therapies. Traditional methods of collecting patient reports can be time-consuming, prone to errors, and might result in incomplete or inconsistent data. By combining audio-to-text translation with LLM capabilities, we can streamline the patient reporting process and unlock valuable insights to support decision-making.

The process flow consists of the following steps:

  1. Audio input – Patients participating in clinical trials can provide their updates, symptoms, and feedback through voice recordings using a mobile application or a dedicated recording device.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a cloud-based infrastructure, where they undergo automated transcription using ASR technology. The audio is converted into text, providing accurate and verbatim transcripts.
  3. Data consolidation – The transcribed patient reports are consolidated into a structured database, enabling efficient storage, retrieval, and analysis.
  4. LLM processing – The consolidated textual data is then processed by an LLM trained on biomedical and clinical trial data. The LLM can perform various tasks, including:
    1. Natural language processing – Extracting relevant information and identifying key symptoms, adverse events, or treatment responses from the patient reports.
    2. Sentiment analysis – Analyzing the emotional and psychological state of patients based on their language and tone, which can provide valuable insights into their overall well-being and treatment experience.
    3. Pattern recognition – Identifying recurring themes, trends, or anomalies across multiple patient reports, enabling early detection of potential safety concerns or efficacy signals.
    4. Knowledge extraction – Using the LLM’s understanding of biomedical concepts and clinical trial protocols to derive meaningful insights and recommendations from the patient data.
  5. Insights and reporting – The processed data and insights derived from the LLM are presented through interactive dashboards, visualizations, and reports. These outputs can be tailored to different stakeholders, such as clinical researchers, medical professionals, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data quality – By using audio-to-text transcription, the risk of errors and inconsistencies associated with manual data entry is minimized, providing high-quality patient data.
  • Time and cost-efficiency – Automated transcription and LLM-powered analysis can significantly reduce the time and resources required for data collection, processing, and analysis, leading to faster decision-making and cost savings.
  • Enhanced patient experience – Patients can provide their updates conveniently through voice recordings, reducing the burden of manual data entry and enabling more natural communication.
  • Comprehensive analysis – The combination of NLP, sentiment analysis, and pattern recognition capabilities offered by LLMs allows for a holistic understanding of patient experiences, treatment responses, and potential safety signals.
  • Regulatory compliance – Accurate and comprehensive patient data, coupled with robust analysis, can support compliance with regulatory requirements for clinical trial reporting and data documentation.

By integrating audio-to-text translation and LLM capabilities, clinical trial sponsors and research organizations can benefit from streamlined patient reporting, enhanced data quality, and powerful insights to support informed decision-making throughout the clinical development process.

Solution overview

The following diagram illustrates the solution architecture.

Solution overview: patient reporting and analysis in clinical trials

Key AWS services used in this solution include Amazon Simple Storage Service (Amazon S3), AWS HealthScribe, Amazon Transcribe, and Amazon Bedrock.

Prerequisites

This solution requires the following prerequisites:

Data samples

To illustrate the concept and provide a practical understanding, we have curated a collection of audio samples. These samples serve as representative examples, simulating site interviews conducted by researchers at clinical trial sites with patient participants.

The audio recordings offer a glimpse into the type of data typically encountered during such interviews. We encourage you to listen to these samples to gain a better appreciation of the data and its context.

These samples are for demonstration purposes only and don’t contain any real patient information or sensitive data. They are intended solely to provide a sample structure and format for the audio recordings used in this particular use case.

The following sample audio files are provided:

  • Site Interview 1
  • Site Interview 2
  • Site Interview 3
  • Site Interview 4
  • Site Interview 5

Prompt templates

Before deploying and running this solution, it’s essential to understand the input prompts and the anticipated output from the LLM. Although this is merely a sample, the potential outcomes can be expanded considerably by crafting more creative prompts.

We use the following input prompt template:

You are an expert medical research analyst for clinical trials of medicines.

You will be provided with a dictionary containing text transcriptions of clinical trial interviews conducted between patients and interviewers.

The dictionary keys represent the interview_id, and the values contain the interview transcripts.

<interview_transcripts>add_interview_transcripts</interview_transcripts>

Your task is to analyze all the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.
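
As a minimal sketch of how this template could be filled in and sent to Amazon Bedrock, the following example replaces the add_interview_transcripts placeholder with a dictionary of transcripts and invokes Anthropic’s Claude 3 Sonnet. The transcript contents are illustrative, and the request format assumes the Anthropic Messages API on Amazon Bedrock.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative dictionary of interview_id -> transcript text.
interview_transcripts = {
    "interview_1": "Interviewer: How have you felt since your last dose? Patient: ...",
    "interview_2": "Interviewer: Any side effects this week? Patient: ...",
}

prompt_template = (
    "You are an expert medical research analyst for clinical trials of medicines.\n"
    "You will be provided with a dictionary containing text transcriptions of clinical "
    "trial interviews conducted between patients and interviewers.\n"
    "The dictionary keys represent the interview_id, and the values contain the interview transcripts.\n"
    "<interview_transcripts>add_interview_transcripts</interview_transcripts>\n"
    "Your task is to analyze all the transcripts and generate a comprehensive report "
    "summarizing the key findings and conclusions from the clinical trial."
)

# Fill the placeholder with the serialized transcripts dictionary.
prompt = prompt_template.replace("add_interview_transcripts", json.dumps(interview_transcripts))

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}],
    }),
)

report = json.loads(response["body"].read())["content"][0]["text"]
print(report)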

The response from Amazon Bedrock will be similar to the following:

Based on the interview transcripts provided, here is a comprehensive report summarizing the key findings and conclusions from the clinical trial:

Introduction:

This report analyzes transcripts from interviews conducted with patients participating in a clinical trial for a new investigational drug. The interviews cover various aspects of the trial, including the informed consent process, randomization procedures, dosing schedules, follow-up visits, and patient experiences with potential side effects.

Key Findings:

1. Informed Consent Process:

– The informed consent process was thorough, with detailed explanations provided to patients about the trial’s procedures, potential risks, and benefits (Transcript 5).

– Patients were given ample time to review the consent documents, discuss them with family members, and have their questions addressed satisfactorily by the study team (Transcript 5).

– Overall, patients felt they fully understood the commitments and requirements of participating in the trial (Transcript 5).

2. Randomization and Blinding:

– Patients were randomized to either receive the investigational drug or a placebo, as part of a placebo-controlled study design (Transcript 2).

– The randomization process was adequately explained to patients, and they understood the rationale behind blinding, which is to prevent bias in the results (Transcript 2).

– Patients expressed acceptance of the possibility of receiving a placebo, recognizing its importance for the research (Transcript 2).

3. Dosing Schedule and Adherence:

– The dosing schedule involved taking the medication twice daily, in the morning and evening (Transcript 4).

– Some patients reported occasional difficulties in remembering the evening dose but implemented strategies like setting reminders on their phones to improve adherence (Transcript 4).

4. Follow-up Visits and Assessments:

– Follow-up visits were scheduled at specific intervals, such as 30 days, 3 months, and 6 months after the last dose (Transcripts 1 and 3).

– During these visits, various assessments were conducted, including blood tests, physical exams, ECGs, and evaluation of patient-reported outcomes like pain levels (Transcripts 1 and 3).

– Patients were informed that they would receive clinically significant findings from these assessments (Transcript 3).

5. Patient-Reported Side Effects:

– Some patients reported experiencing mild side effects, such as headaches, nausea, and joint pain improvement (Transcripts 3 and 4).

– The study team diligently documented and monitored these side effects, noting them in case report forms for further evaluation (Transcript 4).

6. Study Conduct and Communication:

– The study team provided 24/7 contact information, allowing patients to reach out with concerns between scheduled visits (Transcript 1).

– Patients were informed that they would receive information about the overall study results once available (Transcript 1).

– Patients were made aware of their ability to withdraw from the study at any time if they became uncomfortable (Transcript 2).

Conclusions:

Based on the interview transcripts, the clinical trial appears to have been conducted in a thorough and ethical manner, adhering to principles of informed consent, randomization, and blinding. Patients were adequately informed about the trial procedures, potential risks, and their rights as participants. The study team diligently monitored patient safety, documented adverse events, and maintained open communication channels. Overall, the transcripts suggest a well-managed clinical trial with a focus on patient safety, data integrity, and adherence to research protocols.

Deploy resources with AWS CloudFormation

To deploy the solution, use the provided AWS CloudFormation template.

Test the application

To test the application, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Locate your bucket starting with blog-hcls-assets-*.
  3. Navigate to the S3 prefix hcls-framework/samples-input-audio/. You will see sample audio files, which we reviewed earlier in this post.
  4. Select these files, and on the Actions menu, choose Copy.
  5. For Destination, choose Browse S3, navigate to the S3 path hcls-framework/input-audio/, and complete the copy (a programmatic alternative is shown after these steps).
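
Alternatively, the same copy can be done programmatically. The following is a minimal sketch using boto3; the bucket name is illustrative because your bucket name carries a random suffix.

import boto3

s3 = boto3.client("s3")
bucket = "blog-hcls-assets-example"  # replace with your actual bucket name

# Copy each sample audio object into the prefix that triggers the pipeline.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="hcls-framework/samples-input-audio/")
for obj in resp.get("Contents", []):
    dest_key = obj["Key"].replace("samples-input-audio", "input-audio", 1)
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": obj["Key"]},
        Key=dest_key,
    )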

Copying these sample files will trigger an S3 event that invokes the AWS Lambda function audio-to-text. To review the invocations of the Lambda function, open the AWS Lambda console, navigate to the audio-to-text function, and choose the Monitor tab, which contains detailed logs.
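
The post doesn’t include the audio-to-text function’s code, but a minimal sketch of such a handler might look like the following, assuming the audio files are MP3s transcribed with Amazon Transcribe and the results are written under hcls-framework/input-text/. The deployed function may differ in its details.

import boto3
from urllib.parse import unquote_plus

transcribe = boto3.client("transcribe")

def lambda_handler(event, context):
    # One S3 event record per audio file copied into hcls-framework/input-audio/.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        job_name = key.split("/")[-1].replace(".", "-")

        # Start a batch transcription job and write the result next to the other transcripts.
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp3",
            LanguageCode="en-US",
            OutputBucketName=bucket,
            OutputKey=f"hcls-framework/input-text/{job_name}.json",
        )
    return {"statusCode": 200}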

Review AWS Lambda execution logs

You can review the status of the Amazon Transcribe jobs on the Amazon Transcribe console.

At this point, the interview transcripts are ready. They should be available in Amazon S3 under the prefix hcls-framework/input-text/.

You can download a sample file and review its contents. The file is JSON, with the text transcript available under the key transcripts, along with other metadata.
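
A small sketch of pulling the transcript text out of one of these files is shown below; the structure assumed here is the standard Amazon Transcribe batch output, and the bucket and key names are illustrative.

import json
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(
    Bucket="blog-hcls-assets-example",  # replace with your actual bucket name
    Key="hcls-framework/input-text/site-interview-1-mp3.json",  # illustrative key
)
doc = json.loads(obj["Body"].read())

# Amazon Transcribe places the full transcript under results.transcripts.
transcript_text = doc["results"]["transcripts"][0]["transcript"]
print(transcript_text[:200])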

Now let’s run Anthropic’s Claude 3 Sonnet using the Lambda function hcls_clinical_trial_analysis to analyze the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

  1. On the Lambda console, navigate to the function named hcls_clinical_trial_analysis.
  2. Choose Test.
  3. If the console prompts you to create a new test event, do so with default or no input to the test event.
  4. Run the test event.

To review the output, open the Lambda console, navigate to the function named hcls_clinical_trial_analysis, and on the Monitor tab, choose View CloudWatch Logs. In the logs, you will see your comprehensive report on the clinical trial.

So far, we have completed a process involving:

  • Collecting audio interviews from clinical trials
  • Transcribing the audio to text
  • Compiling transcripts into a dictionary
  • Using Amazon Bedrock (Anthropic’s Claude 3 Sonnet) to generate a comprehensive summary

Although we focused on summarization, this approach can be extended to other applications such as sentiment analysis, extracting key learnings, identifying common complaints, and more.

Summary

Healthcare patients often find themselves in need of reliable information about their conditions, clinical trials, or treatment options. However, accessing accurate and up-to-date medical knowledge can be a daunting task. Our innovative solution integrates cutting-edge audio-to-text translation and LLM capabilities to revolutionize how patients receive vital healthcare information. By using speech recognition technology, we can accurately transcribe patients’ spoken queries, allowing our LLM to comprehend the context and provide personalized, evidence-based responses tailored to their specific needs. This empowers patients to make informed decisions, enhances accessibility for those with disabilities or preferences for verbal communication, and alleviates the workload on healthcare professionals, ultimately improving patient outcomes and driving progress in the HCLS domain.

Take charge of your healthcare journey with our innovative voice-enabled virtual assistant. Empower yourself with accurate and personalized information by simply asking your questions aloud. Our cutting-edge solution integrates speech recognition and advanced language models to provide reliable, context-aware responses tailored to your specific needs. Embrace the future of healthcare today and experience the convenience of instantaneous access to vital medical information.


About the Authors

Vrinda Dabke leads AWS Professional Services North America Delivery. Prior to joining AWS, Vrinda held a variety of leadership roles in Fortune 100 companies like UnitedHealth Group, The Hartford, Aetna, and Pfizer. Her work has focused on the areas of business intelligence, analytics, and AI/ML. She is a motivational people leader with experience in leading and managing high-performing global teams in complex matrix organizations.

Kannan Raman leads the North America Delivery for the AWS Professional Services Healthcare and Life Sciences practice. He has over 24 years of healthcare and life sciences experience and provides thought leadership in digital transformation. He works with C-level customer executives to help them with their digital transformation agenda.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.
