Intelligent healthcare assistants: Empowering stakeholders with personalized support and data-driven insights

Large language models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like text with remarkable accuracy. However, despite their impressive language capabilities, LLMs are inherently limited by their training data: their knowledge is static and fixed at training time, which becomes problematic in dynamic and constantly evolving domains like healthcare.

The healthcare industry is a complex, ever-changing landscape with a vast and rapidly growing body of knowledge. Medical research, clinical practices, and treatment guidelines are constantly being updated, rendering even the most advanced LLMs quickly outdated. Additionally, patient data, including electronic health records (EHRs), diagnostic reports, and medical histories, are highly personalized and unique to each individual. Relying solely on an LLM’s pre-trained knowledge is insufficient for providing accurate and personalized healthcare recommendations.

Furthermore, healthcare decisions often require integrating information from multiple sources, such as medical literature, clinical databases, and patient records. LLMs lack the ability to seamlessly access and synthesize data from these diverse and distributed sources. This limits their potential to provide comprehensive and well-informed insights for healthcare applications.

Overcoming these challenges is crucial for using the full potential of LLMs in the healthcare domain. Patients, healthcare providers, and researchers require intelligent agents that can provide up-to-date, personalized, and context-aware support, drawing from the latest medical knowledge and individual patient data.

Enter LLM function calling, a powerful capability that addresses these challenges by allowing LLMs to interact with external functions or APIs, enabling them to access and use additional data sources or computational capabilities beyond their pre-trained knowledge. By combining the language understanding and generation abilities of LLMs with external data sources and services, LLM function calling opens up a world of possibilities for intelligent healthcare agents.

In this blog post, we will explore how Mistral LLM on Amazon Bedrock can address these challenges and enable the development of intelligent healthcare agents with LLM function calling capabilities, while maintaining robust data security and privacy through Amazon Bedrock Guardrails.

Healthcare agents equipped with LLM function calling can serve as intelligent assistants for various stakeholders, including patients, healthcare providers, and researchers. They can assist patients by answering medical questions, interpreting test results, and providing personalized health advice based on their medical history and current conditions. For healthcare providers, these agents can help with tasks such as summarizing patient records, suggesting potential diagnoses or treatment plans, and staying up to date with the latest medical research. Additionally, researchers can use LLM function calling to analyze vast amounts of scientific literature, identify patterns and insights, and accelerate discoveries in areas such as drug development or disease prevention.

Benefits of LLM function calling

LLM function calling offers several advantages for enterprise applications, including enhanced decision-making, improved efficiency, personalized experiences, and scalability. By combining the language understanding capabilities of LLMs with external data sources and computational resources, enterprises can make more informed and data-driven decisions, automate and streamline various tasks, provide tailored recommendations and experiences for individual users or customers, and handle large volumes of data and process multiple requests concurrently.

Potential use cases for LLM function calling in the healthcare domain include patient triage, medical question answering, and personalized treatment recommendations. LLM-powered agents can assist in triaging patients by analyzing their symptoms, medical history, and risk factors, and providing initial assessments or recommendations for seeking appropriate care. Patients and healthcare providers can receive accurate and up-to-date answers to medical questions by using LLMs’ ability to understand natural language queries and access relevant medical knowledge from various data sources. Additionally, by integrating with electronic health records (EHRs) and clinical decision support systems, LLM function calling can provide personalized treatment recommendations tailored to individual patients’ medical histories, conditions, and preferences.

Amazon Bedrock supports a variety of foundation models. In this post, we will be exploring how to perform function calling using Mistral from Amazon Bedrock. Mistral supports function calling, which allows agents to invoke external functions or APIs from within a conversation flow. This capability enables agents to retrieve data, perform calculations, or use external services to enhance their conversational abilities. Function calling in Mistral is achieved through the use of specific function call blocks that define the external function to be invoked and handle the response or output.
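As a minimal sketch of what a tool definition and a tool-call check might look like, the following uses the `toolConfig` request format of the Amazon Bedrock Converse API. The tool name `get_member_coverage`, its schema, and the helper function names are illustrative assumptions and are not taken from the sample repository; the model ID passed in by the caller should be checked against the models enabled in your account.

```python
# Hypothetical tool definition: the name, description, and schema are
# illustrative assumptions, not part of the original sample.
COVERAGE_TOOL = {
    "toolSpec": {
        "name": "get_member_coverage",
        "description": "Look up insurance coverage details for a member.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "member_id": {
                        "type": "string",
                        "description": "Member identifier",
                    }
                },
                "required": ["member_id"],
            }
        },
    }
}


def build_converse_request(model_id, user_text, tools):
    """Assemble keyword arguments for a Converse API call with tools attached."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "toolConfig": {"tools": list(tools)},
    }


def extract_tool_uses(response):
    """Return the toolUse blocks (if any) from a Converse-style response."""
    content = response.get("output", {}).get("message", {}).get("content", [])
    return [block["toolUse"] for block in content if "toolUse" in block]


def call_model(request):
    """Send the request to Amazon Bedrock (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("bedrock-runtime")
    return client.converse(**request)
```

When the model decides that a tool is needed, the response carries one or more `toolUse` blocks with the tool name and its input; the application is responsible for running the function and sending the result back.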

Solution overview

LLM function calling typically involves integrating an LLM with an external API or function that provides access to additional data sources or computational capabilities. The LLM acts as an interface, processing natural language inputs and generating responses based on its pre-trained knowledge and the information obtained from the external functions or APIs. The architecture typically consists of the LLM, a function or API integration layer, and external data sources and services.

Healthcare agents can integrate LLMs and call external functions or APIs through a series of steps: natural language input processing, self-correction, chain of thought, function or API calling through an integration layer, data integration and processing, and persona adoption. The agent receives natural language input, processes it through the LLM, calls relevant external functions or APIs if additional data or computations are required, combines the LLM’s output with the external data or results, and provides a comprehensive response to the user.
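The flow described above can be sketched as a small loop. This is an illustration under stated assumptions rather than the implementation from the sample repository: `call_model` is a hypothetical callable that takes the message list and returns an assistant message in the Converse message shape, and `tools` is a hypothetical mapping from tool names to Python functions.

```python
def run_agent_turn(user_text, call_model, tools, max_steps=5):
    """Run one user turn, executing any tools the model asks for.

    call_model: hypothetical callable(messages) -> assistant message dict
    tools: hypothetical mapping of tool name -> Python function
    """
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append(reply)
        tool_uses = [b["toolUse"] for b in reply["content"] if "toolUse" in b]
        if not tool_uses:
            # No tool requested: the text blocks are the final answer.
            return "".join(b.get("text", "") for b in reply["content"])
        results = []
        for use in tool_uses:
            handler = tools.get(use["name"])
            if handler is None:
                result = {"error": "unknown tool: " + use["name"]}
            else:
                result = handler(**use["input"])
            results.append(
                {
                    "toolResult": {
                        "toolUseId": use["toolUseId"],
                        "content": [{"json": result}],
                    }
                }
            )
        # Tool results go back to the model as a user message.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("model did not finish within max_steps")
```

Capping the number of iterations with `max_steps` is a design choice in this sketch that keeps a misbehaving model from looping indefinitely.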

Figure: High-level architecture of the healthcare assistant

The architecture for the Healthcare Agent is shown in the preceding figure and is as follows:

  1. Consumers interact with the system through Amazon API Gateway.
  2. AWS Lambda orchestrator, along with tool configuration and prompts, handles orchestration and invokes the Mistral model on Amazon Bedrock.
  3. Agent function calling allows agents to invoke Lambda functions to retrieve data, perform computations, or use external services.
  4. Functions such as insurance, claims, and pre-filled Lambda functions handle specific tasks.
  5. Conversation history is stored, a member database (MemberDB) holds member information, and a knowledge base holds the static documents used by the agent.
  6. AWS CloudTrail, AWS Identity and Access Management (IAM), and Amazon CloudWatch handle data security.
  7. AWS Glue, Amazon SageMaker, and Amazon Simple Storage Service (Amazon S3) facilitate data processing.
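To make steps 3 and 4 more concrete, the following sketches how a Lambda function might route a tool request to task-specific handlers. The event shape (`tool` and `arguments` keys), the handler names, and the returned data are assumptions made for illustration; the sample repository may organize this differently.

```python
def get_insurance(member_id):
    """Hypothetical handler: a real one would query MemberDB."""
    return {"member_id": member_id, "plan": "example-plan"}


def get_claims(member_id):
    """Hypothetical handler: a real one would query a claims system."""
    return {"member_id": member_id, "claims": []}


# Maps the tool name selected by the model to the function that implements it.
TOOL_HANDLERS = {
    "insurance": get_insurance,
    "claims": get_claims,
}


def lambda_handler(event, context):
    """Route a tool request (tool name plus arguments) to its handler."""
    tool = event.get("tool")
    arguments = event.get("arguments", {})
    handler = TOOL_HANDLERS.get(tool)
    if handler is None:
        return {"status": "error", "message": "unknown tool: " + str(tool)}
    try:
        return {"status": "ok", "result": handler(**arguments)}
    except TypeError as exc:
        # Raised when the arguments do not match the handler's signature.
        return {"status": "error", "message": str(exc)}
```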

Sample code that uses function calling with the Mistral LLM can be found at mistral-on-aws.

Security and privacy considerations

Data privacy and security are of utmost importance in the healthcare sector because of the sensitive nature of protected health information (PHI) and the potential consequences of data breaches or unauthorized access. Compliance with regulations such as HIPAA and GDPR is crucial for healthcare organizations handling patient data. To maintain robust data protection and regulatory compliance, healthcare organizations can use Amazon Bedrock Guardrails, a comprehensive set of security and privacy controls provided by Amazon Web Services (AWS).

Amazon Bedrock Guardrails offers a multi-layered approach to data security, including encryption at rest and in transit, access controls, audit logging, ground truth validation and incident response mechanisms. It also provides advanced security features such as data residency controls, which allow organizations to specify the geographic regions where their data can be stored and processed, maintaining compliance with local data privacy laws.

When using LLM function calling in the healthcare domain, it’s essential to implement robust security measures and follow best practices for handling sensitive patient information. Amazon Bedrock Guardrails can play a crucial role in this regard by helping to provide a secure foundation for deploying and operating healthcare applications and services that use LLM capabilities.

Some key security measures enabled by Amazon Bedrock Guardrails are:

  • Data encryption: Patient data processed by LLM functions can be encrypted at rest and in transit, making sure that sensitive information remains secure even in the event of unauthorized access or data breaches.
  • Access controls: Amazon Bedrock Guardrails enables granular access controls, allowing healthcare organizations to define and enforce strict permissions for who can access, modify, or process patient data through LLM functions.
  • Secure data storage: Patient data can be stored in secure, encrypted storage services such as Amazon S3 or Amazon Elastic File System (Amazon EFS), making sure that sensitive information remains protected even when at rest.
  • Anonymization and pseudonymization: Healthcare organizations can use Amazon Bedrock Guardrails to implement data anonymization and pseudonymization techniques, making sure that patient data used for training or testing LLM models doesn’t contain personally identifiable information (PII).
  • Audit logging and monitoring: Comprehensive audit logging and monitoring capabilities provided by Amazon Bedrock Guardrails enable healthcare organizations to track and monitor all access and usage of patient data by LLM functions, enabling timely detection and response to potential security incidents.
  • Regular security audits and assessments: Amazon Bedrock Guardrails facilitates regular security audits and assessments, making sure that the healthcare organization’s data protection measures remain up-to-date and effective in the face of evolving security threats and regulatory requirements.
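As an illustration of the pseudonymization idea in the list above, the following sketch replaces identifying fields with keyed hashes before a record is used for testing or analysis. This is a generic technique implemented with the Python standard library, not an Amazon Bedrock Guardrails API, and the field names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest that stands in for the original value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record, secret_key, id_fields=("member_id", "name", "email")):
    """Replace identifying fields (hypothetical names) and keep the rest as-is."""
    return {
        key: pseudonymize(str(value), secret_key) if key in id_fields else value
        for key, value in record.items()
    }
```

Because the digest is keyed, the same identifier always maps to the same pseudonym, so records can still be joined, while the mapping cannot be recomputed without the secret key. The key itself should be stored in a managed secret store rather than in code.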

By using Amazon Bedrock Guardrails, healthcare organizations can confidently deploy LLM function calling in their applications and services, maintaining robust data security, privacy protection, and regulatory compliance while enabling the transformative benefits of AI-powered healthcare assistants.

Case studies and real-world examples

3M Health Information Systems is collaborating with AWS to accelerate AI innovation in clinical documentation by using AWS machine learning (ML) services, compute power, and LLM capabilities. This collaboration aims to enhance 3M’s natural language processing (NLP) and ambient clinical voice technologies, enabling intelligent healthcare agents to capture and document patient encounters more efficiently and accurately. These agents, powered by LLMs, can understand and process natural language inputs from healthcare providers, such as spoken notes or queries, and use LLM function calling to access and integrate relevant medical data from EHRs, knowledge bases, and other data sources. By combining 3M’s domain expertise with AWS ML and LLM capabilities, the companies can improve clinical documentation workflows, reduce administrative burdens for healthcare providers, and ultimately enhance patient care through more accurate and comprehensive documentation.

GE Healthcare developed Edison, a secure intelligence solution running on AWS, to ingest and analyze data from medical devices and hospital information systems. This solution uses AWS analytics, ML, and Internet of Things (IoT) services to generate insights and analytics that can be delivered through intelligent healthcare agents powered by LLMs. These agents, equipped with LLM function calling capabilities, can seamlessly access and integrate the insights and analytics generated by Edison, enabling them to assist healthcare providers in improving operational efficiency, enhancing patient outcomes, and supporting the development of new smart medical devices. By using LLM function calling to retrieve and process relevant data from Edison, the agents can provide healthcare providers with data-driven recommendations and personalized support, ultimately enabling better patient care and more effective healthcare delivery.

Future trends and developments

Future advancements in LLM function calling for healthcare might include more advanced natural language processing capabilities, such as improved context understanding, multi-turn conversational abilities, and better handling of ambiguity and nuances in medical language. Additionally, the integration of LLM models with other AI technologies, such as computer vision and speech recognition, could enable multimodal interactions and analysis of various medical data formats.

Emerging technologies such as multimodal models, which can process and generate text, images, and other data formats simultaneously, could enhance LLM function calling in healthcare by enabling more comprehensive analysis and visualization of medical data. Personalized language models, trained on individual patient data, could provide even more tailored and accurate responses. Federated learning techniques, which allow model training on decentralized data while preserving privacy, could address data-sharing challenges in healthcare.

These advancements and emerging technologies could shape the future of healthcare agents by making them more intelligent, adaptive, and personalized. Agents could seamlessly integrate multimodal data, such as medical images and lab reports, into their analysis and recommendations. They could also continuously learn and adapt to individual patients’ preferences and health conditions, providing truly personalized care. Additionally, federated learning could enable collaborative model development while maintaining data privacy, fostering innovation and knowledge sharing across healthcare organizations.

Conclusion

LLM function calling has the potential to revolutionize the healthcare industry by enabling intelligent agents that can understand natural language, access and integrate various data sources, and provide personalized recommendations and insights. By combining the language understanding capabilities of LLMs with external data sources and computational resources, healthcare organizations can enhance decision-making, improve operational efficiency, and deliver superior patient experiences. However, addressing data privacy and security concerns is crucial for the successful adoption of this technology in the healthcare domain.

As the healthcare industry continues to embrace digital transformation, we encourage readers to explore and experiment with LLM function calling in their respective domains. By using this technology, healthcare organizations can unlock new possibilities for improving patient care, advancing medical research, and streamlining operations. With a focus on innovation, collaboration, and responsible implementation, the healthcare industry can harness the power of LLM function calling to create a more efficient, personalized, and data-driven future. AWS can help organizations use LLM function calling and build intelligent healthcare assistants through its AI/ML services, including Amazon Bedrock, Amazon Lex, and Lambda, while maintaining robust security and compliance using Amazon Bedrock Guardrails. To learn more, see AWS for Healthcare & Life Sciences.


About the Authors

Laks Sundararajan is a seasoned Enterprise Architect helping companies reset, transform and modernize their IT, digital, cloud, data and insight strategies. A proven leader with significant expertise around Generative AI, Digital, Cloud and Data/Analytics Transformation, Laks is a Sr. Solutions Architect with Healthcare and Life Sciences (HCLS).

Subha Venugopal is a Senior Solutions Architect at AWS with over 15 years of experience in the technology and healthcare sectors. Specializing in digital transformation, platform modernization, and AI/ML, she leads AWS Healthcare and Life Sciences initiatives. Subha is dedicated to enabling equitable healthcare access and is passionate about mentoring the next generation of professionals.

Getting started with computer use in Amazon Bedrock Agents

Computer use is a breakthrough capability from Anthropic that allows foundation models (FMs) to visually perceive and interpret digital interfaces. This capability enables Anthropic’s Claude models to identify what’s on a screen, understand the context of UI elements, and recognize actions that should be performed such as clicking buttons, typing text, scrolling, and navigating between applications. However, the model itself doesn’t execute these actions—it requires an orchestration layer to safely implement the supported actions.

Today, we’re announcing computer use support within Amazon Bedrock Agents using Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet models on Amazon Bedrock. This integration brings Anthropic’s visual perception capabilities as a managed tool within Amazon Bedrock Agents, providing you with a secure, traceable, and managed way to implement computer use automation in your workflows.

Organizations across industries struggle with automating repetitive tasks that span multiple applications and systems of record. Whether processing invoices, updating customer records, or managing human resource (HR) documents, these workflows often require employees to manually transfer information between different systems – a process that’s time-consuming, error-prone, and difficult to scale.

Traditional automation approaches require custom API integrations for each application, creating significant development overhead. Computer use capabilities change this paradigm by allowing machines to perceive existing interfaces just as humans do.

In this post, we create a computer use agent demo that provides the critical orchestration layer that transforms computer use from a perception capability into actionable automation. Without this orchestration layer, computer use would only identify potential actions without executing them. The computer use agent demo powered by Amazon Bedrock Agents provides the following benefits:

  • Secure execution environment – Execution of computer use tools in a sandbox environment with limited access to the AWS ecosystem and the web. Note that Amazon Bedrock Agents does not currently provide a sandbox environment; this demo supplies its own.
  • Comprehensive logging – Ability to track each action and interaction for auditing and debugging
  • Detailed tracing capabilities – Visibility into each step of the automated workflow
  • Simplified testing and experimentation – Reduced risk when working with this experimental capability through managed controls
  • Seamless orchestration – Coordination of complex workflows across multiple systems without custom code

This integration combines Anthropic’s perceptual understanding of digital interfaces with the orchestration capabilities of Amazon Bedrock Agents, creating a powerful agent for automating complex workflows across applications. Rather than build custom integrations for each system, developers can now create agents that perceive and interact with existing interfaces in a managed, secure way.

With computer use, Amazon Bedrock Agents can automate tasks through basic GUI actions and built-in Linux commands. For example, your agent could take screenshots, create and edit text files, and run built-in Linux commands. Using Amazon Bedrock Agents and compatible Anthropic Claude models, you can use the following action groups:

  • Computer tool – Enables interactions with user interfaces (clicking, typing, scrolling)
  • Text editor tool – Provides capabilities to edit and manipulate files
  • Bash – Allows execution of built-in Linux commands

Solution overview

An example computer use workflow consists of the following steps:

  1. Create an Amazon Bedrock agent and use natural language to describe what the agent should do and how it should interact with users, for example: “You are computer use agent capable of using Firefox web browser for web search.”
  2. Add the computer use action groups supported by Amazon Bedrock Agents to your agent using the CreateAgentActionGroup API.
  3. Invoke the agent with a user query that requires computer use tools, for example, “What is Amazon Bedrock, can you search the web?”
  4. The Amazon Bedrock agent uses the tool definitions at its disposal and decides to use the computer action group to take a screenshot of the environment. Using the return control capability of Amazon Bedrock Agents, the agent then responds with the tool or tools that it wants to execute. The return control capability is required for using computer use with Amazon Bedrock Agents.
  5. The workflow parses the agent response and executes the tool returned in a sandbox environment. The output is given back to the Amazon Bedrock agent for further processing.
  6. The Amazon Bedrock agent continues to respond with tools at its disposal until the task is complete.
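Steps 4 and 5 can be sketched as two small helpers: one that reads the tools the agent wants to run from a return control event, and one that packages the tool outputs for the next agent invocation. The event and result shapes follow the return of control format of Amazon Bedrock Agents as understood here and should be verified against the current API reference; the helper names are illustrative and are not taken from the GitHub repository.

```python
def parse_return_control(event):
    """Extract the invocation ID and requested tool calls from one stream event."""
    return_control = event["returnControl"]
    calls = []
    for item in return_control.get("invocationInputs", []):
        function_input = item.get("functionInvocationInput")
        if function_input is None:
            continue  # skip input types this sketch does not handle
        calls.append(
            {
                "actionGroup": function_input["actionGroup"],
                "function": function_input.get("function", ""),
                "parameters": {
                    p["name"]: p["value"]
                    for p in function_input.get("parameters", [])
                },
            }
        )
    return return_control["invocationId"], calls


def build_session_state(invocation_id, calls, outputs):
    """Package tool outputs (one text output per call) for the next invocation."""
    results = []
    for call, output in zip(calls, outputs):
        results.append(
            {
                "functionResult": {
                    "actionGroup": call["actionGroup"],
                    "function": call["function"],
                    "responseBody": {"TEXT": {"body": output}},
                }
            }
        )
    return {
        "invocationId": invocation_id,
        "returnControlInvocationResults": results,
    }
```

The dictionary returned by `build_session_state` would be passed as the `sessionState` argument of the next `invoke_agent` call, and the loop repeats until the agent responds without requesting another tool.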

You can recreate this example in the us-west-2 AWS Region with the AWS Cloud Development Kit (AWS CDK) by following the instructions in the GitHub repository. This demo deploys a containerized application using AWS Fargate across two Availability Zones in the us-west-2 Region. The infrastructure operates within a virtual private cloud (VPC) containing public subnets in each Availability Zone, with an internet gateway providing external connectivity. The architecture is complemented by essential supporting services, including AWS Key Management Service (AWS KMS) for security and Amazon CloudWatch for monitoring, creating a resilient, serverless container environment that alleviates the need to manage underlying infrastructure while maintaining robust security and high availability.

The following diagram illustrates the solution architecture.

At the core of our solution are two Fargate containers managed through Amazon Elastic Container Service (Amazon ECS), each protected by its own security group. The first is our orchestration container, which not only handles the communication between Amazon Bedrock Agents and end users, but also orchestrates the workflow that enables tool execution. The second is our environment container, which serves as a secure sandbox where the Amazon Bedrock agent can safely run its computer use tools. The environment container has limited access to the rest of the ecosystem and the internet. We utilize service discovery to connect Amazon ECS services with DNS names.

The orchestration container includes the following components:

  • Streamlit UI – The Streamlit UI that facilitates interaction between the end user and computer use agent
  • Return control loop – The workflow responsible for parsing the tools that the agent wants to execute and returning the output of these tools

The environment container includes the following components:

  • UI and pre-installed applications – A lightweight UI and pre-installed Linux applications like Firefox that can be used to complete the user’s tasks
  • Tool implementation – Code that can execute computer use tools in the environment, such as “screenshot” or “double-click”
  • Quart (RESTful) JSON API – A Quart-based API that the orchestration container calls to execute tools in the sandbox environment
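A minimal sketch of such a tool-execution API is shown below. The `/execute` route, the payload shape (`tool` and `input` keys), and the tool registry are assumptions for illustration and do not reflect the actual endpoints in the GitHub repository. Quart is imported lazily so that the dispatch logic can be exercised without the web framework installed.

```python
import subprocess


def run_bash(command, timeout=30):
    """Run a shell command inside the sandbox and capture its output."""
    try:
        completed = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"error": "command timed out"}
    return {
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "returncode": completed.returncode,
    }


# Hypothetical registry: maps tool names to the functions that implement them.
TOOLS = {"bash": run_bash}


def execute_tool(name, tool_input):
    """Dispatch a tool request to its implementation."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": "unknown tool: " + str(name)}
    try:
        return handler(**tool_input)
    except TypeError as exc:
        return {"error": str(exc)}


def create_app():
    """Build the Quart app that exposes execute_tool over HTTP."""
    from quart import Quart, jsonify, request

    app = Quart(__name__)

    @app.route("/execute", methods=["POST"])
    async def execute():
        payload = await request.get_json()
        result = execute_tool(payload.get("tool", ""), payload.get("input", {}))
        status = 400 if "error" in result else 200
        return jsonify(result), status

    return app
```

Keeping the dispatch logic in plain functions, separate from the HTTP layer, makes it straightforward to test the tools without starting a server.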

The following diagram illustrates these components.

Prerequisites

  1. AWS Command Line Interface (AWS CLI). Follow the instructions here, and make sure to set up credentials by following the instructions here.
  2. Python 3.11 or later.
  3. Node.js 14.15.0 or later.
  4. AWS CDK CLI. Follow the instructions here.
  5. Model access enabled for Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet.
  6. Boto3 version 1.37.10 or later.

Create an Amazon Bedrock agent with computer use

You can use the following code sample to create a simple Amazon Bedrock agent with computer, bash, and text editor action groups. It is crucial to provide a compatible action group signature when using Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet, as shown in the following table.

Model                               Action group signatures
Anthropic’s Claude 3.5 Sonnet V2    computer_20241022, text_editor_20241022, bash_20241022
Anthropic’s Claude 3.7 Sonnet       computer_20250124, text_editor_20250124, bash_20250124

import boto3
import time

# Step 1: Create the bedrock agent client

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")

# Step 2: Create an agent

# agent_role_arn must be set beforehand to the ARN of the Amazon Bedrock agent execution role
create_agent_response = bedrock_agent.create_agent(
    agentResourceRoleArn=agent_role_arn,
    agentName="computeruse",
    description=(
        "Example agent for computer use. This agent should only operate on "
        "sandbox environments with limited privileges."
    ),
    foundationModel="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    instruction=(
        "You are computer use agent capable of using Firefox "
        "web browser for web search."
    ),
)

time.sleep(30) # wait for agent to be created

# Step 3.1: Create and attach computer action group

bedrock_agent.create_agent_action_group(
    actionGroupName="ComputerActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Computer",
    parentActionGroupSignatureParams={
        "type": "computer_20250124",
        "display_height_px": "768",
        "display_width_px": "1024",
        "display_number": "1",
    },
)

# Step 3.2: Create and attach bash action group

bedrock_agent.create_agent_action_group(
    actionGroupName="BashActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Bash",
    parentActionGroupSignatureParams={
        "type": "bash_20250124",
    },
)

# Step 3.3: Create and attach text editor action group

bedrock_agent.create_agent_action_group(
    actionGroupName="TextEditorActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.TextEditor",
    parentActionGroupSignatureParams={
        "type": "text_editor_20250124",
    },
)

# Step 3.4 Create Weather Action Group

bedrock_agent.create_agent_action_group(
        actionGroupName="WeatherActionGroup",
        agentId=create_agent_response["agent"]["agentId"],
        agentVersion="DRAFT",
        actionGroupExecutor = {
            'customControl': 'RETURN_CONTROL',
        },
        functionSchema = {
            'functions': [
                {
                    "name": "get_current_weather",
                    "description": "Get the current weather in a given location.",
                    "parameters": {
                        "location": {
                            "type": "string",
                            "description": "The city, e.g., San Francisco",
                            "required": True,
                        },
                        "unit": {
                            "type": "string",
                            "description": 'The unit to use, e.g., fahrenheit or celsius. Defaults to "fahrenheit"',
                            "required": False,
                        },
                    },
                    "requireConfirmation": "DISABLED",
                }
            ]
        },
)
time.sleep(10)
# Step 4: Prepare agent

bedrock_agent.prepare_agent(agentId=create_agent_response["agent"]["agentId"])
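The fixed `time.sleep` calls in the preceding sample are simple, but they can wait too long or not long enough. As an alternative sketch, the following polls the agent status through the `get_agent` API until it reaches a target state. The helper name, the timeout values, and the injected `sleep` parameter are assumptions made for illustration.

```python
import time


def wait_for_agent_status(client, agent_id, target_statuses,
                          timeout_s=120, poll_s=5, sleep=time.sleep):
    """Poll get_agent until the agent reaches one of the target statuses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status in target_statuses:
            return status
        if status == "FAILED":
            raise RuntimeError("agent " + agent_id + " entered FAILED status")
        if time.monotonic() >= deadline:
            raise TimeoutError("agent " + agent_id + " still in status " + status)
        sleep(poll_s)
```

For example, you might call `wait_for_agent_status(bedrock_agent, agent_id, {"NOT_PREPARED"})` after `create_agent` and `wait_for_agent_status(bedrock_agent, agent_id, {"PREPARED"})` after `prepare_agent`.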

Example use case

In this post, we demonstrate an example where we use Amazon Bedrock Agents with the computer use capability to complete a web form. In the example, the computer use agent can also switch Firefox tabs to interact with a customer relationship management (CRM) agent to get the required information to complete the form. Although this example uses a sample CRM application as the system of record, the same approach works with Salesforce, SAP, Workday, or other systems of record with the appropriate authentication frameworks in place.

In the demonstrated use case, you can observe how well the Amazon Bedrock agent performed with computer use tools. Our implementation completed the customer ID, customer name, and email by visually examining the Excel data. However, for the overview, it decided to select the cell and copy the data, because the information wasn’t completely visible on the screen. Finally, the CRM agent was used to get additional information on the customer.

Best practices

The following are some ways you can improve the performance for your use case:

Considerations

The computer use feature is made available to you as a beta service as defined in the AWS Service Terms. It is subject to your agreement with AWS and the AWS Service Terms, and the applicable model EULA. Computer use poses unique risks that are distinct from standard API features or chat interfaces. These risks are heightened when using the computer use feature to interact with the internet. To minimize risks, consider taking precautions such as:

  • Operate computer use functionality in a dedicated virtual machine or container with minimal privileges to minimize direct system exploits or accidents
  • To help prevent information theft, avoid giving the computer use API access to sensitive accounts or data
  • Limit the computer use API’s internet access to required domains to reduce exposure to malicious content
  • To enforce proper oversight, keep a human in the loop for sensitive tasks (such as making decisions that could have meaningful real-world consequences) and for anything requiring affirmative consent (such as accepting cookies, executing financial transactions, or agreeing to terms of service)
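As one illustration of limiting internet access to required domains, the following sketch checks a URL against an allowlist before a tool is permitted to open it. The domain list and function name are hypothetical, and an application-level check like this complements network-level controls (such as security groups or egress filtering) rather than replacing them.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: replace with the domains your workflow requires.
ALLOWED_DOMAINS = {"example.com", "docs.aws.amazon.com"}


def is_url_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL uses HTTP(S) and its host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Accept an exact match or a subdomain of an allowed domain.
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Matching on the parsed hostname, rather than searching the raw URL string, avoids being fooled by look-alike hosts such as `example.com.evil.net`.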

Any content that you enable Anthropic’s Claude to see or access can potentially override instructions or cause the model to make mistakes or perform unintended actions. Taking proper precautions, such as isolating Anthropic’s Claude from sensitive surfaces, is essential – including to avoid risks related to prompt injection. Before enabling or requesting permissions necessary to enable computer use features in your own products, inform end users of any relevant risks, and obtain their consent as appropriate.

Clean up

When you are done using this solution, make sure to clean up all the resources. Follow the instructions in the provided GitHub repository.

Conclusion

Organizations across industries face significant challenges with cross-application workflows that traditionally require manual data entry or complex custom integrations. The integration of Anthropic’s computer use capability with Amazon Bedrock Agents represents a transformative approach to these challenges.

By using Amazon Bedrock Agents as the orchestration layer, organizations can alleviate the need for custom API development for each application, benefit from comprehensive logging and tracing capabilities essential for enterprise deployment, and implement automation solutions quickly.

As you begin exploring computer use with Amazon Bedrock Agents, consider workflows in your organization that could benefit from this approach. From invoice processing to customer onboarding, HR documentation to compliance reporting, the potential applications are vast and transformative.

We’re excited to see how you will use Amazon Bedrock Agents with the computer use capability to securely streamline operations and reimagine business processes through AI-driven automation.


About the Authors

Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.

Maira Ladeira Tanke is a Tech Lead for Agentic workloads in Amazon Bedrock at AWS, where she enables customers on their journey to develop autonomous AI systems. With over 10 years of experience in AI/ML, Maira partners with enterprise customers to accelerate the adoption of agentic applications using Amazon Bedrock, helping organizations harness the power of foundation models to drive innovation and business transformation. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Adarsh Srikanth is a Software Development Engineer at Amazon Bedrock, where he develops AI agent services. He holds a master’s degree in computer science from USC and brings three years of industry experience to his role. He spends his free time exploring national parks, discovering new hiking trails, and playing various racquet sports.

Abishek Kumar is a Senior Software Engineer at Amazon, bringing over 6 years of valuable experience across both retail and AWS organizations. He has demonstrated expertise in developing generative AI and machine learning solutions, specifically contributing to key AWS services including SageMaker Autopilot, SageMaker Canvas, and AWS Bedrock Agents. Throughout his career, Abishek has shown passion for solving complex problems and architecting large-scale systems that serve millions of customers worldwide. When not immersed in technology, he enjoys exploring nature through hiking and traveling adventures with his wife.

Krishna Gourishetti is a Senior Software Engineer for the Bedrock Agents team in AWS. He is passionate about building scalable software solutions that solve customer problems. In his free time, Krishna loves to go on hikes.

Evaluating RAG applications with Amazon Bedrock knowledge base evaluation

Organizations building and deploying AI applications, particularly those using large language models (LLMs) with Retrieval Augmented Generation (RAG) systems, face a significant challenge: how to evaluate AI outputs effectively throughout the application lifecycle. As these AI technologies become more sophisticated and widely adopted, maintaining consistent quality and performance becomes increasingly complex.

Traditional AI evaluation approaches have significant limitations. Human evaluation, although thorough, is time-consuming and expensive at scale. Although automated metrics are fast and cost-effective, they can only evaluate the correctness of an AI response, without capturing other evaluation dimensions or providing explanations of why an answer is problematic. Furthermore, traditional automated evaluation metrics typically require ground truth data, which for many AI applications is difficult to obtain. Especially for those involving open-ended generation or retrieval augmented systems, defining a single “correct” answer is practically impossible. Finally, metrics such as ROUGE and F1 can be fooled by shallow linguistic similarities (word overlap) between the ground truth and the LLM response, even when the actual meaning is very different. These challenges make it difficult for organizations to maintain consistent quality standards across their AI applications, particularly for generative AI outputs.

Amazon Bedrock has recently launched two new capabilities to address these evaluation challenges: LLM-as-a-judge (LLMaaJ) under Amazon Bedrock Evaluations and a brand new RAG evaluation tool for Amazon Bedrock Knowledge Bases. Both features rely on the same LLM-as-a-judge technology under the hood, with slight differences depending on if a model or a RAG application built with Amazon Bedrock Knowledge Bases is being evaluated. These evaluation features combine the speed of automated methods with human-like nuanced understanding, enabling organizations to:

  • Assess AI model outputs across various tasks and contexts
  • Evaluate multiple dimensions of AI performance simultaneously
  • Systematically assess both retrieval and generation quality in RAG systems
  • Scale evaluations across thousands of responses while maintaining quality standards

These capabilities integrate seamlessly into the AI development lifecycle, empowering organizations to improve model and application quality, promote responsible AI practices, and make data-driven decisions about model selection and application deployment.

This post focuses on RAG evaluation with Amazon Bedrock Knowledge Bases, provides a guide to set up the feature, discusses nuances to consider as you evaluate your prompts and responses, and finally discusses best practices. By the end of this post, you will understand how the latest Amazon Bedrock evaluation features can streamline your approach to AI quality assurance, enabling more efficient and confident development of RAG applications.

Key features

Before diving into the implementation details, we examine the key features that make the capabilities of RAG evaluation on Amazon Bedrock Knowledge Bases particularly powerful. The key features are:

  1. Amazon Bedrock Evaluations
    • Evaluate Amazon Bedrock Knowledge Bases directly within the service
    • Systematically evaluate both retrieval and generation quality in RAG systems to change knowledge base build-time parameters or runtime parameters
  2. Comprehensive, understandable, and actionable evaluation metrics
    • Retrieval metrics: Assess context relevance and coverage using an LLM as a judge
    • Generation quality metrics: Measure correctness, faithfulness (to detect hallucinations), completeness, and more
    • Provide natural language explanations for each score in the output and on the console
    • Compare results across multiple evaluation jobs for both retrieval and generation
    • Normalize metric scores to a range of 0 to 1
  3. Scalable and efficient assessment
    • Scale evaluation across thousands of responses
    • Reduce costs compared to manual evaluation while maintaining high quality standards
  4. Flexible evaluation framework
    • Support both ground truth and reference-free evaluations
    • Select from a variety of metrics for evaluation
    • Evaluate fine-tuned or distilled models on Amazon Bedrock
    • Choose from a selection of evaluator models
  5. Model selection and comparison
    • Compare evaluation jobs across different generating models
    • Facilitate data-driven optimization of model performance
  6. Responsible AI integration
    • Incorporate built-in responsible AI metrics such as harmfulness, answer refusal, and stereotyping
    • Seamlessly integrate with Amazon Bedrock Guardrails

These features enable organizations to comprehensively assess AI performance, promote responsible AI development, and make informed decisions about model selection and optimization throughout the AI application lifecycle. Now that we’ve explained the key features, we examine how these capabilities come together in a practical implementation.

Feature overview

The Amazon Bedrock Knowledge Bases RAG evaluation feature provides a comprehensive, end-to-end solution for assessing and optimizing RAG applications. This automated process uses the power of LLMs to evaluate both retrieval and generation quality, offering insights that can significantly improve your AI applications.

The workflow is as follows, as shown moving from left to right in the following architecture diagram:

  1. Prompt dataset – Prepared set of prompts, optionally including ground truth responses
  2. JSONL file – Prompt dataset converted to JSONL format for the evaluation job
  3. Amazon Simple Storage Service (Amazon S3) bucket – Storage for the prepared JSONL file
  4. Amazon Bedrock Knowledge Bases RAG evaluation job – Core component that processes the data, integrating with Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
  5. Automated report generation – Produces a comprehensive report with detailed metrics and insights at individual prompt or conversation level
  6. Report analysis – Review of the report to derive actionable insights for RAG system optimization

Designing holistic RAG evaluations: Balancing cost, quality, and speed

RAG system evaluation requires a balanced approach that considers three key aspects: cost, speed, and quality. Although Amazon Bedrock Evaluations primarily focuses on quality metrics, understanding all three components helps create a comprehensive evaluation strategy. The following diagram shows how these components interact and feed into a comprehensive evaluation strategy, and the next sections examine each component in detail.

Cost and speed considerations

The efficiency of RAG systems depends on model selection and usage patterns. Costs are primarily driven by data retrieval and token consumption during retrieval and generation, and speed depends on model size and complexity as well as prompt and context size. For applications that require high-performance content generation with lower latency and cost, model distillation can be an effective way to create the generator model: it produces smaller, faster models that maintain the quality of larger models for specific use cases.

Quality assessment framework

Amazon Bedrock knowledge base evaluation provides comprehensive insights through various quality dimensions:

  • Technical quality through metrics such as context relevance and faithfulness
  • Business alignment through correctness and completeness scores
  • User experience through helpfulness and logical coherence measurements
  • Responsible AI through built-in metrics such as harmfulness, stereotyping, and answer refusal

Establishing baseline understanding

Begin your evaluation process by choosing default configurations in your knowledge base (vector or graph database), such as default chunking strategies, embedding models, and prompt templates. These are just some of the possible options. This approach establishes a baseline performance, helping you understand your RAG system’s current effectiveness across available evaluation metrics before optimization. Next, create a diverse evaluation dataset. Make sure this dataset contains a diverse set of queries and knowledge sources that accurately reflect your use case. The diversity of this dataset will provide a comprehensive view of your RAG application performance in production.

Iterative improvement process

Understanding how different components affect these metrics enables informed decisions about:

  • Knowledge base configuration (chunking strategy or embedding size or model) and inference parameter refinement
  • Retrieval strategy modifications (semantic or hybrid search)
  • Prompt engineering refinements
  • Model selection and inference parameter configuration
  • Choice between different vector stores including graph databases

Continuous evaluation and improvement

Implement a systematic approach to ongoing evaluation:

  • Schedule regular offline evaluation cycles aligned with knowledge base updates
  • Track metric trends over time to identify areas for improvement
  • Use insights to guide knowledge base refinements and generator model customization and selection
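
One way to track metric trends between evaluation cycles is to compare the average score of each metric across two runs and flag the ones that dropped. The following is a minimal sketch; the function name `find_regressions`, the tolerance, and the scores are illustrative assumptions (the numbers are made up, not results from a real job).

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return metrics whose average score dropped by more than `tolerance`.

    Both arguments map a metric name to its average score (0 to 1).
    """
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is not None and old_score - new_score > tolerance:
            regressions[metric] = round(new_score - old_score, 4)
    return regressions


# Made-up averages from two hypothetical evaluation cycles.
baseline_run = {"Correctness": 0.90, "Completeness": 0.92, "Faithfulness": 0.88}
candidate_run = {"Correctness": 0.91, "Completeness": 0.85, "Faithfulness": 0.87}

regressions = find_regressions(baseline_run, candidate_run)
print(regressions)
```

A check like this can run after each scheduled evaluation cycle so that a drop in a single dimension is noticed before it reaches production.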

Prerequisites

To use the knowledge base evaluation feature, make sure that you have satisfied the following requirements:

  • An active AWS account.
  • Selected evaluator and generator models enabled in Amazon Bedrock. You can confirm that the models are enabled for your account on the Model access page of the Amazon Bedrock console.
  • Confirm the AWS Regions where the model is available and quotas.
  • Complete the knowledge base evaluation prerequisites related to AWS Identity and Access Management (IAM) creation and add permissions for an S3 bucket to access and write output data.
  • Have an Amazon Bedrock knowledge base created and sync your data such that it’s ready to be used by a knowledge base evaluation job.
  • If you’re using a custom model instead of an on-demand model for your generator model, make sure you have sufficient quota for running a Provisioned Throughput during inference. Go to the Service Quotas console and check the following quotas:
    • Model units no-commitment Provisioned Throughputs across custom models
    • Model units per provisioned model for [your custom model name]
    • Both fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.

Prepare input dataset

To prepare your dataset for a knowledge base evaluation job, you need to follow two important steps:

  1. Dataset requirements:
    1. Maximum 1,000 conversations per evaluation job (1 conversation is contained in the conversationTurns key in the dataset format)
    2. Maximum 5 turns (prompts) per conversation
    3. File must use JSONL format (.jsonl extension)
    4. Each line must be a valid JSON object and complete prompt
    5. Stored in an S3 bucket with CORS enabled
  2. Use the format that matches your evaluation type:
    1. Retrieve only evaluation jobs:

Special note: On March 20, 2025, the referenceContexts key will change to referenceResponses. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages/chunks retrieved from the Knowledge Base.

{
    "conversationTurns": [{
        ## required for Context Coverage metric
        "referenceContexts": [{
            "content": [{
                "text": "This is reference retrieved context"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
    2. Retrieve and generate evaluation jobs:
{
    "conversationTurns": [{
        ## optional
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference response used as ground truth"
            }]
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
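
As a minimal sketch of producing such a file, the following Python helper writes one conversation per line and enforces the limits listed in the dataset requirements. The function name `build_rag_dataset` and the sample prompts are illustrative assumptions, not part of the Amazon Bedrock SDK. The `##` annotations in the examples above are explanatory only and should not appear in the actual file, because each line must be valid JSON.

```python
import json

MAX_CONVERSATIONS = 1000  # per evaluation job, from the dataset requirements
MAX_TURNS = 5             # prompts per conversation


def build_rag_dataset(conversations):
    """Return JSONL text for a retrieve-and-generate evaluation job.

    `conversations` is a list of conversations; each conversation is a list
    of (prompt, reference_response_or_None) tuples. Hypothetical helper.
    """
    if len(conversations) > MAX_CONVERSATIONS:
        raise ValueError("too many conversations for one evaluation job")
    lines = []
    for turns in conversations:
        if not 1 <= len(turns) <= MAX_TURNS:
            raise ValueError("each conversation needs 1 to 5 turns")
        conversation_turns = []
        for prompt, reference in turns:
            turn = {"prompt": {"content": [{"text": prompt}]}}
            if reference is not None:  # ground truth is optional
                turn["referenceResponses"] = [{"content": [{"text": reference}]}]
            conversation_turns.append(turn)
        # json.dumps emits a single line, so each conversation is one JSONL row.
        lines.append(json.dumps({"conversationTurns": conversation_turns}))
    return "\n".join(lines) + "\n"


sample = [
    [("What is Amazon Bedrock?", "A managed service for foundation models.")],
    [("This is a prompt", None)],
]
jsonl_text = build_rag_dataset(sample)
print(jsonl_text)
```

Save the returned text with a .jsonl extension and upload it to the S3 bucket that the evaluation job reads from.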

Start a knowledge base RAG evaluation job using the console

Amazon Bedrock Evaluations provides you with an option to run an evaluation job through a guided user interface on the console. To start an evaluation job through the console, follow these steps:

  1. On the Amazon Bedrock console, under Inference and Assessment in the navigation pane, choose Evaluations and then choose Knowledge Bases.
  2. Choose Create, as shown in the following screenshot.
  3. Give an Evaluation name, a Description, and choose an Evaluator model, as shown in the following screenshot. This model will be used as a judge to evaluate the response of the RAG application.
  4. Choose the knowledge base and the evaluation type, as shown in the following screenshot. Choose Retrieval only if you want to evaluate only the retrieval component and Retrieval and response generation if you want to evaluate the end-to-end retrieval and response generation. Select a model, which will be used for generating responses in this evaluation job.
  5. (Optional) To change inference parameters, choose configurations. You can update or experiment with different values of temperature, top-P, update knowledge base prompt templates, associate guardrails, update search strategy, and configure numbers of chunks retrieved. The following screenshot shows the Configurations screen.
  6. Choose the Metrics you would like to use to evaluate the RAG application, as shown in the following screenshot.
  7. Provide the S3 URI, as shown in step 3, for evaluation data and for evaluation results. You can use the Browse S3 option.
  8. Select a service (IAM) role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, the knowledge base in the job, and the models being used in the job. You can also create a new IAM role in the evaluation setup and the service will automatically give the role the proper permissions for the job.
  9. Choose Create.
  10. You will be able to check the evaluation job In Progress status on the Knowledge Base evaluations screen, as shown in the following screenshot.
  11. Wait for the job to be complete. This could be 10–15 minutes for a small job or a few hours for a large job with hundreds of long prompts and all metrics selected. When the evaluation job has been completed, the status will show as Completed, as shown in the following screenshot.
  12. When it’s complete, select the job, and you’ll be able to observe the details of the job. The following screenshot is the Metric summary.
  13. You should also observe a directory with the evaluation job name in the Amazon S3 path. You can find the output S3 path from your job results page in the evaluation summary section.
  14. You can compare two evaluation jobs to gain insights about how different configurations or selections are performing. You can view a radar chart comparing performance metrics between two RAG evaluation jobs, making it simple to visualize relative strengths and weaknesses across different dimensions, as shown in the following screenshot.

On the Evaluation details tab, examine score distributions through histograms for each evaluation metric, showing average scores and percentage differences. Hover over the histogram bars to check the number of conversations in each score range, helping identify patterns in performance, as shown in the following screenshots.

Start a knowledge base evaluation job using Python SDK and APIs

To use the Python SDK for creating a knowledge base evaluation job, follow these steps. First, set up the required configurations:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure your knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

For retrieval-only evaluation, create a job that focuses on assessing the quality of retrieved contexts:

retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

For a complete evaluation of both retrieval and generation, use this configuration:

retrieve_generate_job=bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

To monitor the progress of your evaluation job, use the following code:

# Depending on the job type, use the ARN of the job you created to monitor it and take any downstream actions.
evaluation_job_arn = retrieval_job['jobArn']  # retrieval-only job
# evaluation_job_arn = retrieve_generate_job['jobArn']  # retrieve-and-generate job

response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn
)
)
print(f"Job Status: {response['status']}")
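
A single call returns only the current status, and a job can run from minutes to hours. The following sketch polls until the job reaches a terminal state. The helper name `wait_for_evaluation_job`, the polling interval, and the set of terminal status values are assumptions; confirm the status values against the GetEvaluationJob API reference before relying on them.

```python
import time

# Assumed terminal status values; verify against the API reference.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(get_status, poll_seconds=60, timeout_seconds=4 * 3600):
    """Poll `get_status()` until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# With the client from the earlier snippet, the callable could be:
#   lambda: bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)["status"]
# Here a stub stands in for the API so the sketch runs without AWS credentials.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_evaluation_job(lambda: next(fake_statuses), poll_seconds=0)
print(f"Final status: {final_status}")
```

Passing the status lookup as a callable keeps the loop independent of the client, which also makes it straightforward to test.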

Interpreting results

After your evaluation jobs are completed, Amazon Bedrock RAG evaluation provides a detailed comparative dashboard across the evaluation dimensions.

The evaluation dashboard includes comprehensive metrics, but we focus on one example, the completeness histogram shown below. This visualization represents how well responses cover all aspects of the questions asked. In our example, the distribution is heavily concentrated at the high end (left-skewed), with an average score of 0.921. The majority of responses (15) scored above 0.9, while a small number fell in the 0.5-0.8 range. This type of distribution helps quickly identify if your RAG system has consistent performance or if there are specific cases needing attention.

Selecting specific score ranges in the histogram reveals detailed conversation analyses. For each conversation, you can examine the input prompt, generated response, number of retrieved chunks, ground truth comparison, and most importantly, the detailed score explanation from the evaluator model.

Consider this example response that scored 0.75 for the question, “What are some risks associated with Amazon’s expansion?” Although the generated response provided a structured analysis of operational, competitive, and financial risks, the evaluator model identified missing elements around IP infringement and foreign exchange risks compared to the ground truth. This detailed explanation helps in understanding not just what’s missing, but why the response received its specific score.

This granular analysis is crucial for systematic improvement of your RAG pipeline. By understanding patterns in lower-performing responses and specific areas where context retrieval or generation needs improvement, you can make targeted optimizations to your system—whether that’s adjusting retrieval parameters, refining prompts, or modifying knowledge base configurations.
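
The same kind of analysis can be approximated offline once per-conversation scores are available. The following is a minimal sketch that computes the average, a histogram over the 0 to 1 range, and the list of low-scoring conversations. The record layout (`prompt` and `score` keys), the threshold, and the sample values are hypothetical and do not reflect the actual schema of the evaluation output files.

```python
from statistics import mean


def summarize_scores(records, threshold=0.8, bins=10):
    """Return the mean score, a histogram over [0, 1], and low-scoring records."""
    scores = [record["score"] for record in records]
    histogram = [0] * bins
    for score in scores:
        # A score of exactly 1.0 is counted in the last bin.
        index = min(int(score * bins), bins - 1)
        histogram[index] += 1
    low = [record for record in records if record["score"] < threshold]
    return mean(scores), histogram, low


# Hypothetical per-conversation completeness scores.
records = [
    {"prompt": "Question A", "score": 0.95},
    {"prompt": "Question B", "score": 1.0},
    {"prompt": "Question C", "score": 0.75},
    {"prompt": "Question D", "score": 0.5},
]
average, histogram, low_scoring = summarize_scores(records)
print(f"Average: {average:.3f}")
print(f"Histogram: {histogram}")
print(f"Needs review: {[r['prompt'] for r in low_scoring]}")
```

The low-scoring list is the natural starting point for reading the evaluator model's explanations and deciding whether retrieval or generation needs attention.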

Best practices for implementation

These best practices help build a solid foundation for your RAG evaluation strategy:

  1. Design your evaluation strategy carefully, using representative test datasets that reflect your production scenarios and user patterns. If you have large workloads greater than 1,000 prompts per batch, optimize your workload by employing techniques such as stratified sampling to promote diversity and representativeness within your constraints such as time to completion and costs associated with evaluation.
  2. Schedule periodic batch evaluations aligned with your knowledge base updates and content refreshes because this feature supports batch analysis rather than real-time monitoring.
  3. Balance metrics with business objectives by selecting evaluation dimensions that directly impact your application’s success criteria.
  4. Use evaluation insights to systematically improve your knowledge base content and retrieval settings through iterative refinement.
  5. Maintain clear documentation of evaluation jobs, including the metrics selected and improvements implemented based on results. The job creation configuration settings in your results pages can help keep a historical record here.
  6. Optimize your evaluation batch size and frequency based on application needs and resource constraints to promote cost-effective quality assurance.
  7. Structure your evaluation framework to accommodate growing knowledge bases, incorporating both technical metrics and business KPIs in your assessment criteria.
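
The stratified sampling mentioned in the first practice can be sketched as follows: group the prompts by a label, then draw from each group in proportion to its size until the 1,000-prompt limit is reached. The function name `stratified_sample` and the `topic` field are illustrative assumptions; any attribute that captures the diversity of your workload could serve as the stratification key.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(prompts, key, max_total=1000, seed=0):
    """Pick up to `max_total` prompts, keeping each group's share roughly proportional.

    `prompts` is a list of dicts and `key` names the field to stratify on.
    """
    if len(prompts) <= max_total:
        return list(prompts)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for item in prompts:
        groups[item[key]].append(item)
    sample = []
    for members in groups.values():
        # Proportional allocation (rounded down), with at least one prompt per group.
        quota = max(1, max_total * len(members) // len(prompts))
        sample.extend(rng.sample(members, min(quota, len(members))))
    # Guard against exceeding the limit when many tiny groups each get one slot.
    return sample[:max_total]


# Hypothetical workload of 3,000 prompts across three topics.
prompts = (
    [{"text": f"billing question {i}", "topic": "billing"} for i in range(2400)]
    + [{"text": f"shipping question {i}", "topic": "shipping"} for i in range(450)]
    + [{"text": f"returns question {i}", "topic": "returns"} for i in range(150)]
)
sample = stratified_sample(prompts, key="topic")
counts = Counter(item["topic"] for item in sample)
print(len(sample), dict(counts))
```

Because quotas are rounded down, the sample can come out slightly under the limit for group sizes that do not divide evenly.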

To help you dive deeper into the scientific validation of these practices, we’ll be publishing a technical deep-dive post that explores detailed case studies using public datasets and internal AWS validation studies. This upcoming post will examine how our evaluation framework performs across different scenarios and demonstrate its correlation with human judgments across various evaluation dimensions. Stay tuned as we explore the research and validation that powers Amazon Bedrock Evaluations.

Conclusion

Amazon Bedrock knowledge base RAG evaluation enables organizations to confidently deploy and maintain high-quality RAG applications by providing comprehensive, automated assessment of both retrieval and generation components. By combining the benefits of managed evaluation with the nuanced understanding of human assessment, this feature allows organizations to scale their AI quality assurance efficiently while maintaining high standards. Organizations can make data-driven decisions about their RAG implementations, optimize their knowledge bases, and follow responsible AI practices through seamless integration with Amazon Bedrock Guardrails.

Whether you’re building customer service solutions, technical documentation systems, or enterprise knowledge base RAG, Amazon Bedrock Evaluations provides the tools needed to deliver reliable, accurate, and trustworthy AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets. You can find it on our GitHub repository.

We encourage you to explore these capabilities in the Amazon Bedrock console and discover how systematic evaluation can enhance your RAG applications.


About the Authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Ayan Ray is a Senior Generative AI Partner Solutions Architect at AWS, where he collaborates with ISV partners to develop integrated Generative AI solutions that combine AWS services with AWS partner products. With over a decade of experience in Artificial Intelligence and Machine Learning, Ayan has previously held technology leadership roles at AI startups before joining AWS. Based in the San Francisco Bay Area, he enjoys playing tennis and gardening in his free time.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Evangelia Spiliopoulou is an Applied Scientist in the AWS Bedrock Evaluation group, where the goal is to develop novel methodologies and tools to assist automatic evaluation of LLMs. Her overall work focuses on Natural Language Processing (NLP) research and developing NLP applications for AWS customers, including LLM Evaluations, RAG, and improving reasoning for LLMs. Prior to Amazon, Evangelia completed her Ph.D. at Language Technologies Institute, Carnegie Mellon University.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock

This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.

Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.

With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.

This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers. These are insights that were previously available only to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.

Solution overview

GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:

Input: Fruit by the Foot Starburst

Output: color -> multi-colored, material -> candy, category -> snacks, product_line -> Fruit by the Foot,…

GoDaddy used an out-of-the-box Meta Llama 2 model to generate the product categories for six million products, where each product is identified by an SKU. The generated categories were often incomplete or mislabeled. Moreover, employing an LLM for individual product categorization proved costly. GoDaddy therefore sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.

This solution uses the following components to categorize products more accurately and efficiently:

The key steps are illustrated in the following figure:

  1. A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts. It generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
  2. The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
  3. The Amazon Bedrock endpoint performs the following tasks:
    1. It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
    2. It writes the output to another S3 location.
  4. The second Lambda function performs the following tasks:
    1. It monitors the batch processing job on Amazon Bedrock.
    2. It shuts down the endpoint when processing is complete.

The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.

We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.

The distribution of the number of words or tokens per SKU shows only mild outliers, which makes the data suitable for bundling many products into a single prompt and potentially getting more efficient model responses.

The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.

In the following sections, we look at the key components of the solution in more detail.

Batch inference

We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:

Request: POST /model-invocation-job HTTP/1.1

Content-type: application/json
{
  "clientRequestToken": "string",
  "inputDataConfig": {
    "s3InputDataConfig": {
      "s3Uri": "string",
      "s3InputFormat": "JSONL"
    }
   },
  "jobName": "string",
  "modelId": "string",
  "outputDataConfig": {
    "s3OutputDataConfig": {
      "s3Uri": "string"
    }
  },
  "roleArn": "string",
  "tags": [{
  "key": "string",
  "value": "string"
  }]
}

Response
HTTP/1.1 200 Content-type: application/json
{
  "jobArn": "string"
}
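As a rough illustration of the request above, the following Python sketch assembles the same fields and submits them through the AWS SDK. It is a minimal sketch, not the code used in this solution: the helper names build_batch_job_request and submit_batch_job, the default Region, and the example ARNs and S3 paths are assumptions made for illustration. The SDK call is isolated in its own function because it needs boto3 and valid AWS credentials.

```python
import uuid


def build_batch_job_request(job_name, model_id, role_arn, input_s3_uri, output_s3_uri):
    """Assemble keyword arguments that mirror the CreateModelInvocationJob body."""
    return {
        "clientRequestToken": str(uuid.uuid4()),  # idempotency token
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,
        "inputDataConfig": {
            "s3InputDataConfig": {"s3Uri": input_s3_uri, "s3InputFormat": "JSONL"}
        },
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }


def submit_batch_job(request, region_name="us-east-1"):
    """Submit the job and return its ARN (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("bedrock", region_name=region_name)
    response = client.create_model_invocation_job(**request)
    return response["jobArn"]


# Hypothetical values, for illustration only.
request = build_batch_job_request(
    job_name="job47",
    model_id="anthropic.claude-instant-v1",
    role_arn="arn:aws:iam::123456789012:role/example-batch-role",
    input_s3_uri="s3://example-input-bucket/products.jsonl",
    output_s3_uri="s3://example-output-bucket/results/",
)
```

Keeping the request builder separate from the SDK call makes it possible to inspect or unit test the payload before any job is created.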

We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

  • Submitted – The job is marked Submitted when the JSON file is ready to be processed by Amazon Bedrock for inference.
  • InProgress – The job is marked InProgress when Amazon Bedrock starts processing the JSON file.
  • Failed – The job is marked Failed if there was an error while processing. The error can be written into the JSON file as a part of modelOutput. If it was a 4xx error, it’s written in the metadata of the Job.
  • Completed – The job is marked Completed when the output JSON file is generated for the input JSON file and has been uploaded to the S3 output path submitted as a part of the CreateModelInvocationJob in outputDataConfig.
  • Stopped – The job is marked Stopped when a StopModelInvocationJob API is called on a job that is InProgress. A terminal state job (Completed or Failed) can’t be stopped using StopModelInvocationJob.

The following is example code for the GetModelInvocationJob API:

GET /model-invocation-job/jobIdentifier HTTP/1.1

Response:
{
  'ResponseMetadata': {
    'RequestId': '081afa52-189f-4e83-a3f9-aa0918d902f4',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
       'date': 'Tue, 09 Jan 2024 17:00:16 GMT',
       'content-type': 'application/json',
       'content-length': '690',
       'connection': 'keep-alive',
       'x-amzn-requestid': '081afa52-189f-4e83-a3f9-aa0918d902f4'
      },
     'RetryAttempts': 0
   },
  'jobArn': 'arn:aws:bedrock:<region>:<account-id>:model-invocation-job/<id>',
  'jobName': 'job47',
  'modelId': 'arn:aws:bedrock:<region>::foundation-model/anthropic.claude-instant-v1:2',
  'status': 'Submitted',
  'submitTime': datetime.datetime(2024, 1, 8, 21, 44, 38, 611000, tzinfo=tzlocal()),
  'lastModifiedTime': datetime.datetime(2024, 1, 8, 23, 5, 47, 169000, tzinfo=tzlocal()),
  'inputDataConfig': {'s3InputDataConfig': {'s3Uri': <path to input jsonl file>}},
  'outputDataConfig': {'s3OutputDataConfig': {'s3Uri': <path to output jsonl.out file>}}
}
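The monitoring step performed by the second Lambda function can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions: wait_for_job is a hypothetical helper, the terminal statuses are taken from the list above, and the polling interval and limit are arbitrary. The status source is passed in as a callable so the loop can be exercised without AWS access.

```python
import time

# Terminal statuses, as described in the job lifecycle above.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status.

    get_status is any zero-argument callable returning the job status string.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")


# Stand-in status source for illustration; a real one would call the API.
fake_statuses = iter(["Submitted", "InProgress", "InProgress", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

With boto3, the callable could be something like `lambda: client.get_model_invocation_job(jobIdentifier=job_arn)["status"]`, where client is a Bedrock client and job_arn is the value returned at job creation.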

When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files:

  • json.out – The following code shows an example of the format:
{
   "processedRecordCount":<number>,
   "successRecordCount":<number>,
   "errorRecordCount":<number>,
   "inputTokenCount":<number>,
   "outputTokenCount":<number>
}
  • <file_name>.jsonl.out – The following screenshot shows an example of the code, containing the successfully processed records. The modelOutput field contains a list of categories for a given product name in JSON format.
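A first pass over the .jsonl.out file might look like the following sketch. It assumes each line is a JSON object that carries a modelOutput key when the record was processed; records without that key are set aside for inspection. The helper name split_batch_output and the sample lines are hypothetical, and the exact layout of modelOutput depends on the model, so treat the field handling as an assumption to verify against a real output file.

```python
import json


def split_batch_output(lines):
    """Return (model_outputs, records_without_output) from .jsonl.out lines."""
    outputs, without_output = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "modelOutput" in record:
            outputs.append(record["modelOutput"])
        else:
            without_output.append(record)
    return outputs, without_output


# Hypothetical sample content, for illustration only.
sample_lines = [
    '{"modelInput": {"prompt": "p1"}, "modelOutput": {"completion": "{}"}}',
    "",
    '{"modelInput": {"prompt": "p2"}}',
]
outputs, without_output = split_batch_output(sample_lines)
```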

We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.

from typing import List, Optional

from pydantic import BaseModel, Field

class CCData(BaseModel):
   product_name: Optional[str] = Field(default=None, description="product name, which will be given as input")
   brand: Optional[str] = Field(default=None, description="Brand of the product inferred from the product name")
   color: Optional[str] = Field(default=None, description="Color of the product inferred from the product name")
   material: Optional[str] = Field(default=None, description="Material of the product inferred from the product name")
   price: Optional[str] = Field(default=None, description="Price of the product inferred from the product name")
   category: Optional[str] = Field(default=None, description="Category of the product inferred from the product name")
   sub_category: Optional[str] = Field(default=None, description="Sub-category of the product inferred from the product name")
   product_line: Optional[str] = Field(default=None, description="Product Line of the product inferred from the product name")
   gender: Optional[str] = Field(default=None, description="Gender of the product inferred from the product name")
   year_of_first_sale: Optional[str] = Field(default=None, description="Year of first sale of the product inferred from the product name")
   season: Optional[str] = Field(default=None, description="Season of the product inferred from the product name")

class List_of_CCData(BaseModel): 
   list_of_dict: List[CCData]

We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
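To make the JSON-to-CSV step concrete, the following sketch converts one list_of_dict payload into CSV text using only the Python standard library. It is a simplified stand-in, not the LangChain PydanticOutputParser and OutputFixingParser pipeline described above: it does no schema validation or repair. The function name list_of_dict_to_csv is hypothetical, and the field list is copied from the CCData class.

```python
import csv
import io
import json

# Field names copied from the CCData schema.
FIELDS = [
    "product_name", "brand", "color", "material", "price", "category",
    "sub_category", "product_line", "gender", "year_of_first_sale", "season",
]


def list_of_dict_to_csv(model_text):
    """Convert a '{"list_of_dict": [...]}' JSON string into CSV text."""
    payload = json.loads(model_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in payload["list_of_dict"]:
        # Unknown keys are dropped; absent or null fields become empty cells.
        writer.writerow({name: item.get(name) or "" for name in FIELDS})
    return buffer.getvalue()


sample = json.dumps({"list_of_dict": [{
    "product_name": "Fruit by the Foot Starburst",
    "color": "multi-colored",
    "category": "snacks",
}]})
csv_text = list_of_dict_to_csv(sample)
```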

Prompt engineering

Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to efficiently use LLMs for diverse applications. Essentially, prompt engineering is about effectively interacting with an LLM. The most effective prompt engineering strategy varies with the specific task and data, which in this case are data card generation and GoDaddy SKUs.

Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.

In this post, we divide the prompt engineering solutions into two steps: output generation and format parsing.

Output generation

The following are best practices and considerations for output generation:

  • Provide simple, clear and complete instructions – This is the general guideline for prompt engineering work.
  • Use separator characters consistently – In this use case, we use the newline character \n.
  • Deal with default output values such as missing – For this use case, we don’t want special values such as N/A or missing, so we put multiple instructions in line, aiming to exclude the default or missing values.
  • Use few-shot prompting – Also termed in-context learning, few-shot prompting involves providing a handful of examples, which can be beneficial in helping LLMs understand the output requirements more effectively. In this use case, 0–10 in-context examples were tested for both Llama 2 and Anthropic’s Claude models.
  • Use packing techniques – We combined multiple SKU and product names into one LLM query, so that some prompt instructions can be shared across different SKUs for cost and latency optimization. In this use case, 1–10 packing numbers were tested for both Llama 2 and Anthropic’s Claude models.
  • Test for good generalization – You should keep a hold-out test set and correct responses to check if your prompt modifications generalize.
  • Use additional techniques for Anthropic’s Claude model families – We incorporated the following techniques:
    • Enclosing examples in XML tags:
<example>
H: <question> The list of product names is:
{few_shot_product_name} </question>
A: <response> The category information generated with absolutely no missing value, in JSON format is:
{few_shot_field} </response>
</example>
    • Using the Human and Assistant annotations:
\n\nHuman:
...
...
\n\nAssistant:
    • Guiding the assistant prompt:
\n\nAssistant: Here are the answer with NO missing, unknown, null, or N/A values (in JSON format):
  • Use additional techniques for Llama model families – For Llama 2 model families, you can enclose examples in [INST] tags:
[INST]
If the list of product names is:
{few_shot_product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):

{few_shot_field}

[INST]
If the list of product names is:
{product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
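The packing and few-shot practices above can be sketched as two small helpers. This is an illustrative sketch, not the prompt templates used in the project: the function names pack and build_claude_prompt and the sample inputs are hypothetical, while the XML example wrapper and the Human/Assistant wording are copied from the snippets above.

```python
def pack(items, n):
    """Split items into consecutive chunks of at most n items (n-packing)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(0, len(items), n)]


def build_claude_prompt(instructions, product_names, few_shot_examples=()):
    """Build one Human/Assistant prompt that covers a pack of product names.

    few_shot_examples is a sequence of (product_names_text, fields_json) pairs.
    """
    blocks = [instructions]
    for names_text, fields_json in few_shot_examples:
        blocks.append(
            "<example>\n"
            "H: <question> The list of product names is:\n"
            f"{names_text} </question>\n"
            "A: <response> The category information generated with absolutely "
            "no missing value, in JSON format is:\n"
            f"{fields_json} </response>\n"
            "</example>"
        )
    # Product names in a pack are separated by the newline character.
    blocks.append("The list of product names is:\n" + "\n".join(product_names))
    return (
        "\n\nHuman: " + "\n".join(blocks)
        + "\n\nAssistant: Here are the answer with NO missing, unknown, null, "
        "or N/A values (in JSON format):"
    )


names = ["a", "b", "c", "d", "e", "f", "g"]
packs = pack(names, 5)
prompt = build_claude_prompt("Categorize each product.", packs[0], [("ex", "{}")])
```

Because the instructions and examples are emitted once per pack rather than once per product, a larger pack size spreads that fixed prompt cost over more SKUs.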

Format parsing

The following are best practices and considerations for format parsing:

  • Refine the prompt with modifiers – Refinement of task instructions typically involves altering the instruction, task, or question part of the prompt. The effectiveness of these techniques varies based on the task and data. Some beneficial strategies in this use case include:
    • Role assumption – Ask the model to assume it’s playing a role. For example:

You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

    • Prompt specificity – Being very specific and providing detailed instructions to the model can help generate better responses for the required task. For example:

EVERY category information needs to be filled based on BOTH product name AND your best guess. If you forget to generate any category information, leave it as missing or N/A, then an innocent people will die.

  • Output format description – We provided the JSON format instructions through a JSON string directly, as well as through the few-shot examples indirectly.
  • Pay attention to few-shot example formatting – The LLMs (Anthropic’s Claude and Llama) are sensitive to subtle formatting differences. Parsing time was significantly improved after several iterations on few-shot examples formatting. The final solution is as follows:
few_shot_field='{"list_of_dict"' +
':[' +
', \n'.join([true_df.iloc[i].to_json() for i in range(num_few_shot)]) +
']}'
  • Use additional techniques for Anthropic’s Claude model families – For the Anthropic’s Claude model, we instructed it to format the output in JSON format:
{
    "list_of_dict": [{
        "some_category": "your_generated_answer",
        "another_category": "your_generated_answer",
    },
    {
        <category information for the 2st product name, in json format>
    },
    {
        <category information for the 3st product name, in json format>
    },
// ... {additional product information, in json format} ...
    }]
}
  • Use additional techniques for Llama 2 model families – For the Llama 2 model, we instructed it to format the output in JSON format as follows:

Format your output in the JSON format (ensure to escape special character):
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"list_of_dict": {"title": "List Of Dict", "type": "array", "items": {"$ref": "#/definitions/CCData"}}}, "required": ["list_of_dict"], "definitions": {"CCData": {"title": "CCData", "type": "object", "properties": {"product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"}, "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"}, "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"}, "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"}, "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"}, "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"}, "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"}, "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"}, "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"}, "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"}, "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}}}}}

Models and parameters

We used the following prompting parameters:

  • Number of packings – 1, 5, 10
  • Number of in-context examples – 0, 2, 5, 10
  • Format instruction – JSON format pseudo example (shorter length), JSON format full example (longer length)

For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:

{
    "temperature": 0.1,
    "top_p": 0.9,
    "max_gen_len": 2048,
}

For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:

{
   "temperature": 0.1,
   "top_k": 250,
   "top_p": 1,
   "max_tokens_to_sample": 4096,
   "stop_sequences": ["\n\nHuman:"],
   "anthropic_version": "bedrock-2023-05-31"
}
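To show how these parameters could travel with each prompt in a batch input file, the following sketch serializes prompts into JSONL. It is a sketch under assumptions: the recordId and modelInput layout reflects our understanding of the Amazon Bedrock batch input format and should be verified against the current documentation, and the helper name to_batch_jsonl and the record ID scheme are hypothetical.

```python
import json

# Parameters copied from the Anthropic's Claude configuration above.
CLAUDE_PARAMS = {
    "temperature": 0.1,
    "top_k": 250,
    "top_p": 1,
    "max_tokens_to_sample": 4096,
    "stop_sequences": ["\n\nHuman:"],
    "anthropic_version": "bedrock-2023-05-31",
}


def to_batch_jsonl(prompts, params=CLAUDE_PARAMS):
    """Serialize prompts into JSONL text, one record per line."""
    lines = []
    for index, prompt in enumerate(prompts):
        record = {
            "recordId": f"rec-{index:06d}",  # assumed ID scheme
            "modelInput": {"prompt": prompt, **params},
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


jsonl_text = to_batch_jsonl([
    "\n\nHuman: first pack\n\nAssistant:",
    "\n\nHuman: second pack\n\nAssistant:",
])
```

json.dumps escapes the newlines inside each prompt, so every record stays on a single line as JSONL requires.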

The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.

Evaluations

The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics:

  • Content coverage – Measures portions of missing values in the output generation step.
  • Parsing coverage – Measures portions of missing samples in the format parsing step:
    • Parsing recall on product name – An exact match serves as a lower bound for parsing completeness (parsing coverage is the upper bound for parsing completeness) because in some cases, two virtually identical product names need to be normalized and transformed to be an exact match (for example, “Nike Air Jordan” and “nike. air Jordon”).
    • Parsing precision on product name – For an exact match, we use a similar metric to parsing recall, but use precision instead of recall.
  • Final coverage – Measures portions of missing values in both output generation and format parsing steps.
  • Human evaluation – Focuses on holistic quality evaluation such as accuracy, relevance, and comprehensiveness (richness) of the text generation.
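The programmatic metrics above can be approximated with a few lines of Python. This sketch reflects one plausible reading of the definitions, not the project's evaluation code: coverage is taken as the share of non-missing cells, and parsing recall and precision are computed on exact product name matches. The helper names and the set of strings treated as missing are assumptions.

```python
# Strings treated as missing values (an assumed list).
MISSING = {"", "n/a", "na", "none", "null", "missing", "unknown"}


def is_missing(value):
    """Return True for None or a placeholder such as N/A."""
    return value is None or str(value).strip().lower() in MISSING


def content_coverage(records, fields):
    """Fraction of (record, field) cells that hold a non-missing value."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(1 for r in records for f in fields if not is_missing(r.get(f)))
    return filled / total


def parsing_recall_precision(input_names, parsed_names):
    """Exact-match recall and precision of parsed product names."""
    expected, got = set(input_names), set(parsed_names)
    matched = expected & got
    recall = len(matched) / len(expected) if expected else 0.0
    precision = len(matched) / len(got) if got else 0.0
    return recall, precision


records = [
    {"color": "red", "material": "N/A"},
    {"color": "blue", "material": "cotton"},
]
coverage = content_coverage(records, ["color", "material"])
recall, precision = parsing_recall_precision(
    ["Nike Air Jordan", "Fruit by the Foot"],
    ["nike. air Jordon", "Fruit by the Foot"],
)
```

The second call shows why exact match is only a lower bound: the unnormalized name counts as a miss even though a person would treat it as the same product.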

Results

The following are the approximate sample input and output lengths under some best performing settings:

  • Input length for Llama 2 model family – 2,068 tokens for 10-shot, 1,585 tokens for 5-shot, 1,319 tokens for 2-shot
  • Input length for Anthropic’s Claude model family – 1,314 tokens for 10-shot, 831 tokens for 5-shot, 566 tokens for 2-shot, 359 tokens for zero-shot
  • Output length with 5-packing – Approximately 500 tokens

Quantitative results

The following table summarizes our consolidated quantitative results.

  • To be concise, the table contains only some of our final recommendations for each model type.
  • The metrics used are latency and accuracy.
  • The best model and results are highlighted in green color and in bold font.
Batch process service for all rows: Amazon Bedrock batch inference. Latency columns 3–6; accuracy (programmatic evaluation, coverage) columns 7–8.

| Model | Prompt | Batch latency, 5 packing (test set = 20) | Batch latency, 5 packing (test set = 5k) | GoDaddy rqmt @ 5k | Near-real-time latency (1 packing) | Recall on parsing exact match | Final content coverage |
|---|---|---|---|---|---|---|---|
| Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20=3.6s | 92.60% | 53.90% |
| Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20=7.8s | 98.30% | 61.50% |
| Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 44.8/20=2.24s | 98.50% | 96.80% |
| Claude-v1 (instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20=2.6s | 99% | 84.40% |
| Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20=5.2s | 99% | 84.40% |
| Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20=5.2s | 99.40% | 90.10% |

The following tables summarize the scaling effect in batch inference.

  • When scaling from 5,000 to 100,000 samples, only eight times more computation time was needed.
  • Performing categorization with individual LLM calls for each product would have increased the inference time for 100,000 products by approximately 40 times compared to the batch processing method.
  • The accuracy in coverage remained stable, and cost scaled approximately linearly.
| Batch process service | Model | Prompt | Batch latency, 5 packing (test set = 20) | Batch latency, 5 packing (test set = 5k) | GoDaddy rqmt @ 5k | Batch latency, 5 packing (test set = 100k) | Near-real-time latency (1 packing) |
|---|---|---|---|---|---|---|---|
| Amazon Bedrock batch | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20=2.24s |
| Amazon Bedrock batch | Anthropic’s Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20=5.2s |

| Batch process service | Near-real-time latency (1 packing) | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k) |
|---|---|---|---|---|---|
| Amazon Bedrock batch | 44.8/20=2.24s | 98.50% | 98.40% | 96.80% | 96.50% |
| Amazon Bedrock batch | 104/20=5.2s | 99% | 98.80% | 84.40% | 97% |
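The scaling claims can be checked with a short calculation using the Claude-v1 (instant) zero-shot figures from the tables above. The only assumption added here is that individual calls would run strictly one after another at the measured near-real-time latency, with no concurrency.

```python
# Figures copied from the tables above (Claude-v1 instant, zero-shot, template6).
BATCH_SECONDS_5K = 723
BATCH_SECONDS_100K = 5733
SECONDS_PER_PRODUCT_ONLINE = 44.8 / 20  # 1 packing on the 20-sample test set

# Growth in workload versus growth in batch processing time.
sample_growth = 100_000 / 5_000
time_growth = BATCH_SECONDS_100K / BATCH_SECONDS_5K

# Estimated time for 100,000 sequential single-product calls.
online_seconds_100k = SECONDS_PER_PRODUCT_ONLINE * 100_000
online_vs_batch = online_seconds_100k / BATCH_SECONDS_100K

print(f"samples grew {sample_growth:.0f}x, batch time grew {time_growth:.1f}x")
print(f"sequential calls would take about {online_vs_batch:.0f}x the batch time")
```

A 20-fold increase in samples costs roughly 8 times the batch time, and sequential calls come out at roughly 40 times the batch time, which is consistent with the bullets above.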

The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 and fits up to around 20 packing. Anthropic’s Claude has a higher limit. We tested on 20 ground truth samples for 1, 5, and 10 packing and selected results from all model and prompt templates. The scaling effect on latency was more obvious in the Anthropic’s Claude model family than Llama 2. Anthropic’s Claude had better generalizability than Llama 2 when extending the packing numbers in output.

We tried only few-shot prompts with Llama 2 models, which showed improved accuracy over zero-shot.

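| Batch process service | Model | Prompt | Latency, npack = 1 (test set = 20) | Latency, npack = 5 (test set = 20) | Latency, npack = 10 (test set = 20) | Final coverage, npack = 1 | Final coverage, npack = 5 | Final coverage, npack = 10 |
|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90% |
| Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100% |
| Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50% |
| Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80% |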

Qualitative results

We noted the following qualitative results:

  • Human evaluation – The categories generated were evaluated qualitatively by GoDaddy SMEs. The categories were found to be of good quality.
  • Learnings – We used an LLM in two separate calls: output generation and format parsing. We observed the following:
    • For this use case, Llama 2 didn’t perform well in format parsing but was relatively capable in output generation. To be consistent and make a fair comparison, we required the same LLM in both calls: both steps invoke either llama2-13b-chat-v1 or anthropic.claude-instant-v1. However, GoDaddy chose Llama 2 as the LLM for category generation. For this use case, we found that using Llama 2 for output generation only and Anthropic’s Claude for format parsing was suitable because of Llama 2’s relatively lower model capability.
    • Format parsing is improved through prompt engineering (JSON format instruction is critical) to reduce the latency. For example, with Anthropic’s Claude-Instant on a 20-test set and averaging multiple prompt templates, the latency can be reduced by approximately 77% (from 90 seconds to 20 seconds). This directly eliminates the necessity of using a JSON fine-tuned version of the LLM.
  • Llama2 – We observed the following:
    • Llama2-13b and Llama2-70b models both need the full instruction as format_instruction() in zero-shot prompts.
    • Llama2-13b seems to be worse in content coverage and formatting (for example, it can’t correctly escape the \" character), which can incur significant parsing time and cost and also degrade accuracy.
    • Llama 2 shows clear performance drops and instability when the packing number varies among 1, 5, and 10, indicating poorer generalizability compared to the Anthropic’s Claude model family.
  • Anthropic’s Claude – We observed the following:
    • Anthropic’s Claude-Instant and Claude-v2, regardless of using zero-shot or few-shot prompting, need only partial format instruction instead of the full instruction format_instruction(). It shortens the input length, and is therefore more cost-effective. It also shows Anthropic’s Claude’s better capability in following instructions.
    • Anthropic’s Claude generalizes well when varying packing numbers among 1, 5, and 10.

Business takeaways

We had the following key business takeaways:

  • Improved latency – Our solution inferences 5,000 products in 12 minutes, which is 80% faster than GoDaddy’s needs (5,000 products in 1 hour). Using batch inference in Amazon Bedrock demonstrates efficient batch processing capabilities and anticipates further scalability with AWS planning to deploy more cloud instances. The expansion will lead to increased time and cost savings.
  • More cost-effectiveness – The solution built by the Generative AI Innovation Center using Anthropic’s Claude-Instant is 8% more affordable than the existing proposal using Llama2-13b while also providing 79% more coverage.
  • Enhanced accuracy – The deliverable produces 97% category coverage on both the 5,000 and 100,000 hold-out test sets, exceeding GoDaddy’s requirement of 90%. The comprehensive framework is able to facilitate future iterative improvements over the current model parameters and prompt templates.
  • Qualitative assessment – The generated categories are of satisfactory quality according to human evaluation by GoDaddy SMEs.
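The latency and coverage percentages can be reproduced approximately from the quantitative results. The sketch below assumes the comparison is between the 5,000-product requirement (3,600 seconds) and the measured Claude-Instant zero-shot batch time (723 seconds), and between the final content coverage of Llama2-13b 5-shot (53.90%) and Claude-Instant zero-shot (96.80%). That pairing is our reading of the tables, not something the takeaways state explicitly, and the helper names are hypothetical.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return 100.0 * (after - before) / before


# Seconds for 5,000 products: requirement versus measured batch latency.
latency_saving = percent_reduction(3600, 723)

# Final content coverage: Llama2-13b 5-shot versus Claude-Instant zero-shot.
coverage_gain = percent_increase(53.90, 96.80)

print(f"latency reduced by {latency_saving:.1f}%")
print(f"coverage increased by {coverage_gain:.1f}%")
```

Both values land close to the 80% and 79% quoted above, with small differences attributable to rounding.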

Technical takeaways

We had the following key technical takeaways:

  • The solution features both batch inference and near real-time inference (2 seconds per product) capability and multiple backend LLM selections.
  • Anthropic’s Claude-Instant with zero-shot is the clear winner:
    • It was best in latency, cost, and accuracy on the 5,000 hold-out test set.
    • It showed better generalizability to higher packing numbers (number of SKUs in one query), with potentially more cost and latency improvement.
  • Iteration on prompt templates shows improvement on all these models, suggesting that good prompt engineering is a practical approach for the categorization generation task.
  • Input-wise, increasing to 10-shot may further improve performance, as observed in small-scale science experiments, but also increase the cost by around 30%. Therefore, we tested at most 5-shot in large-scale batch experiments.
  • Output-wise, increasing to 10-packing or even 20-packing (Anthropic’s Claude only; Llama 2 has 2,048 output length limit) might further improve latency and cost (because more SKUs can share the same input instructions).
  • For this use case, we saw Anthropic’s Claude model family having better accuracy and generalizability, for example:
    • Final category coverage performance was better with Anthropic’s Claude-Instant.
    • When increasing packing numbers from 1, 5, to 10, Anthropic’s Claude-Instant showed improvement in latency and stable accuracy in comparison to Llama 2.
    • To achieve the final categories for the use case, we noticed that Anthropic’s Claude required a shorter prompt input to follow the instruction and had a longer output length limit for a higher packing number.

Next steps for GoDaddy

The following are the recommendations that the GoDaddy team is considering as a part of future steps:

  • Dataset enhancement – Aggregate a larger set of ground truth examples and expand programmatic evaluation to better monitor and refine the model’s performance. On a related note, if the product names can be normalized by domain knowledge, the cleaner input is also helpful for better LLM responses. For example, the product name ”<product_name> Power t-shirt, ladyfit vest or hoodie” can prompt the LLM to respond for multiple SKUs, instead of one SKU (similarly, “<product_name> – $5 or $10 or $20 or $50 or $100”).
  • Human evaluation – Increase human evaluations to provide higher generation quality and alignment with desired outcomes.
  • Fine-tuning – Consider fine-tuning as a potential strategy for enhancing category generation when a more extensive training dataset becomes available.
  • Prompt engineering – Explore automatic prompt engineering techniques to enhance category generation, particularly when additional training data becomes available.
  • Few-shot learning – Investigate techniques such as dynamic few-shot selection and crafting in-context examples based on the model’s parameter knowledge to enhance the LLMs’ few-shot learning capabilities.
  • Knowledge integration – Improve the model’s output by connecting LLMs to a knowledge base (internal or external database) and enabling it to incorporate more relevant information. This can help to reduce LLM hallucinations and enhance relevance in responses.

Conclusion

In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic’s Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve the categorization with LLMs and found that the Anthropic’s Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated from the recommended approaches.

If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.

Security Best Practices

About the Authors

Vishal Singh is a Data Engineering leader at the Data and Analytics team of GoDaddy. His key focus area is building data products and generating insights from them by applying data engineering tools along with generative AI.

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interests include generative models and sequential data modeling.

Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Benchmarking customized models on Amazon Bedrock using LLMPerf and LiteLLM

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant portion of the effort, often requiring 30% of project time because engineers must optimize instance types and configure serving parameters through careful testing. This process can be both complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.

Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment. This makes sure that deployments are performant and cost effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: when there are no invocations for 5 minutes, the model scales to zero. You pay only for what you use in 5-minute increments. It also handles scaling up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations looking to use custom models on Amazon Bedrock, providing simplicity and cost-efficiency.

Before deploying these models in production, it’s crucial to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.

This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using popular open source tools: LLMPerf and LiteLLM. It comes with a notebook that provides step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.

Prerequisites

This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using open source tools LLMPerf and LiteLLM for performance benchmarking

To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray Clients and analyzing their responses. A key advantage of LLMPerf is its wide support of foundation model APIs. This includes LiteLLM, which supports all models available on Amazon Bedrock.

Setting up your custom model invocation with LiteLLM

LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.

To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.

The provider_route indicates the LiteLLM implementation of request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template using the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.

The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
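
As a minimal sketch of the format described above, the following helper assembles the model string from a provider route and a model ARN. The function name build_litellm_model and the ARN value are hypothetical placeholders for illustration; they are not taken from the notebook.

```python
def build_litellm_model(provider_route: str, model_arn: str) -> str:
    """Return a LiteLLM model string in the bedrock/provider_route/model_arn format."""
    if not provider_route or not model_arn:
        raise ValueError("provider_route and model_arn are both required")
    return f"bedrock/{provider_route}/{model_arn}"


# Hypothetical ARN for illustration only; use the ARN of your own imported model.
example_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123example"

# The same imported model addressed through two different provider routes.
deepseek_model = build_litellm_model("deepseek_r1", example_arn)
llama_model = build_litellm_model("llama", example_arn)
print(deepseek_model)
print(llama_model)
```

Only the provider route changes between the two strings, so switching chat templates does not require re-importing the model.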

For example, the following script invokes the custom model using the DeepSeek R1 chat template.

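import time
from litellm import completion

# Replace with the ARN of your imported model (console or ListImportedModels).
model_id = "<your-imported-model-arn>"

prompt = """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M

Calculate the company's operating margin for 2023. Please reason step by step."""

# Retry while the model scales up from zero, but give up after a few attempts
# so that a permanent error (for example, a wrong ARN) does not loop forever.
for attempt in range(5):
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[
                {"role": "user", "content": prompt},
                # Prefill the assistant turn so the model opens its reasoning block.
                {"role": "assistant", "content": "<think>"},
            ],
            max_tokens=4096,
        )
        print(response["choices"][0]["message"]["content"])
        break
    except Exception as err:
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(60)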

After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.

Configuring a token benchmark test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray Clients, and allows for simulation of various load scenarios and concurrent request patterns. At the same time, each client will collect performance metrics during the requests, including latency, throughput, and error rates.

Two critical performance metrics are latency and throughput:

  • Latency refers to the time it takes for a single request to be processed.
  • Throughput measures the number of tokens that are generated per second.

Selecting the right configuration to serve FMs typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it’s still crucial to verify your deployment’s latency and throughput.

Start by configuring token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as:

  • LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
  • Model: Define the route, API, and model ARN to invoke similarly to the previous section.
  • Mean/standard deviation of input tokens: Parameters to use in the probability distribution from which the number of input tokens will be sampled.
  • Mean/standard deviation of output tokens: Parameters to use in the probability distribution from which the number of output tokens will be sampled.
  • Number of concurrent requests: The number of users that the application is likely to support when in use.
  • Number of completed requests: The total number of requests to send to the LLM API in the test.

The following command shows an example of how to run the benchmarking test against the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.

python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens {mean_input_tokens} \
--stddev-input-tokens {stddev_input_tokens} \
--mean-output-tokens {mean_output_tokens} \
--stddev-output-tokens {stddev_output_tokens} \
--max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \
--timeout 1800 \
--num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \
--results-dir "${{LLM_PERF_OUTPUT}}" \
--llm-api litellm \
--additional-sampling-params '{{}}'

At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.

Scale to zero and cold-start latency

Because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you first need to make a request to make sure that there is at least one active model copy. If you receive an error indicating that the model isn’t ready, wait between approximately 10 seconds and 1 minute for Amazon Bedrock to prepare at least one active model copy. When it’s ready, run a test invocation again, and then proceed with benchmarking.

Example scenario for DeepSeek-R1-Distill-Llama-8B

Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust parameters for token count for prompts and completions. For example:

  • Number of clients: 2
  • Mean input token count: 500
  • Standard deviation input token count: 25
  • Mean output token count: 1000
  • Standard deviation output token count: 100
  • Number of requests per client: 50

This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of results of aggregate metrics:

inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
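
As a quick sanity check, the aggregate figures in the summary can be cross-checked against each other. The snippet below is only a sketch that copies values from the output above; the variable names are illustrative, and the per-request throughput is a rough ratio of means rather than the exact statistic LLMPerf reports.

```python
# Values copied from the summary above.
completed_requests = 100
completed_requests_per_minute = 11.20784995697034
mean_output_tokens = 1113.51
mean_end_to_end_latency_s = 10.406450886010116
overall_output_throughput = 208.0008834264341  # tokens per second, all clients

# Test duration implied by the request rate.
duration_min = completed_requests / completed_requests_per_minute

# Test duration implied by total output tokens and overall throughput.
total_output_tokens = mean_output_tokens * completed_requests
duration_from_tokens_min = total_output_tokens / overall_output_throughput / 60

# Rough per-request throughput: mean output tokens divided by mean latency.
approx_request_throughput = mean_output_tokens / mean_end_to_end_latency_s

print(f"Duration from request rate: {duration_min:.1f} min")
print(f"Duration from token totals: {duration_from_tokens_min:.1f} min")
print(f"Approximate per-request throughput: {approx_request_throughput:.1f} tokens/s")
```

Both duration estimates agree, and the per-request figure is close to the reported mean of request_output_throughput_token_per_s. The overall output throughput is roughly twice the per-request value, which is what two concurrent clients would be expected to produce.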

In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports like the following histograms for time to first token and token throughput.

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.

In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data will assist in estimating costs, because billing is based on the number of active model copies at a given time.
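
To illustrate how ModelCopy observations could feed a cost estimate, the following sketch sums active model copies across 5-minute billing windows. The sample readings and the price per copy per minute are made-up placeholders, not actual AWS rates, and the function name estimate_cost is hypothetical; consult the Amazon Bedrock pricing page for real figures.

```python
from typing import Sequence

BILLING_WINDOW_MINUTES = 5  # the post states that billing is in 5-minute increments


def estimate_cost(copies_per_window: Sequence[int], price_per_copy_minute: float) -> float:
    """Estimate cost from the number of active model copies in each 5-minute window."""
    if price_per_copy_minute < 0:
        raise ValueError("price_per_copy_minute must be non-negative")
    copy_minutes = sum(copies_per_window) * BILLING_WINDOW_MINUTES
    return copy_minutes * price_per_copy_minute


# Hypothetical ModelCopy readings, one per 5-minute window:
# idle, scale up during the load test, then scale back down to zero.
samples = [0, 1, 2, 2, 1, 0]

# Hypothetical price; not an actual AWS rate.
cost = estimate_cost(samples, price_per_copy_minute=0.10)
print(f"Estimated cost: ${cost:.2f}")
```

Windows with zero active copies contribute nothing to the total, which reflects the scale-to-zero behavior described earlier.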

Conclusion

Although Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.

To learn more, try the example notebook with your custom model.

About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in Amazon Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.

Creating asynchronous AI agents with Amazon Bedrock

The integration of generative AI agents into business processes is poised to accelerate as organizations recognize the untapped potential of these technologies. Advancements in multimodal artificial intelligence (AI), where agents can understand and generate not just text but also images, audio, and video, will further broaden their applications. This post discusses agentic AI-driven architecture and ways of implementing it.

The emergence of generative AI agents in recent years has contributed to the transformation of the AI landscape, driven by advances in large language models (LLMs) and natural language processing (NLP). Companies like Anthropic, Cohere, and Amazon have made significant strides in developing powerful language models capable of understanding and generating human-like content across multiple modalities, revolutionizing how businesses integrate and utilize artificial intelligence in their processes.

These AI agents have demonstrated remarkable versatility, being able to perform tasks ranging from creative writing and code generation to data analysis and decision support. Their ability to engage in intelligent conversations, provide context-aware responses, and adapt to diverse domains has revolutionized how businesses approach problem-solving, customer service, and knowledge dissemination.

One of the most significant impacts of generative AI agents has been their potential to augment human capabilities through both synchronous and asynchronous patterns. In synchronous orchestration, just like in traditional process automation, a supervisor agent orchestrates the multi-agent collaboration, maintaining a high-level view of the entire process while actively directing the flow of information and tasks. This approach allows businesses to offload repetitive and time-consuming tasks in a controlled, predictable manner.

Alternatively, asynchronous choreography follows an event-driven pattern where agents operate autonomously, triggered by events or state changes in the system. In this model, agents publish events or messages that other agents can subscribe to, creating a workflow that emerges from their collective behavior. These patterns have proven particularly valuable in enhancing customer experiences, where agents can provide round-the-clock support, resolve issues promptly, and deliver personalized recommendations through either orchestrated or event-driven interactions, leading to increased customer satisfaction and loyalty.

Agentic AI architecture

Agentic AI architecture shifts process automation from traditional autonomous agents toward agents that draw on the capabilities of AI, with the purpose of imitating cognitive abilities and enhancing the actions those agents can take. This architecture can enable businesses to streamline operations, enhance decision-making processes, and automate complex tasks in new ways.

Much like traditional business process automation through technology, agentic AI architecture is the design of AI systems that resolve complex problems with limited or indirect human intervention. These systems are composed of multiple AI agents that converse with each other or execute complex tasks through a series of choreographed or orchestrated processes. This approach empowers AI systems to exhibit goal-directed behavior, learn from experience, and adapt to changing environments.

The difference between a single agent invocation and a multi-agent collaboration lies in the complexity and the number of agents involved in the process.

When you interact with a digital assistant like Alexa, you’re typically engaging with a single agent, also known as a conversational agent. This agent processes your request, such as setting a timer or checking the weather, and provides a response without needing to consult other agents.

Now, imagine expanding this interaction to include multiple agents working together. Let’s start with a simple travel booking scenario:

Your interaction begins with telling a travel planning agent about your desired trip. In this first step, the AI model, in this case an LLM, is acting as an interpreter and user experience interface between your natural language input and the structured information needed by the travel planning system. It’s processing your request, which might be a complex statement like “I want to plan a week-long beach vacation in Hawaii for my family of four next month,” and extracting key details such as the destination, duration, number of travelers, and approximate dates.

The LLM is also likely to infer additional relevant information that wasn’t explicitly stated, such as the need for family-friendly accommodations or activities. It might ask follow-up questions to clarify ambiguous points or gather more specific preferences. Essentially, the LLM is transforming your casual, conversational input into a structured set of travel requirements that can be used by the specialized booking agents in the subsequent steps of the workflow.

This initial interaction sets the foundation for the entire multi-agent workflow, making sure that the travel planning agent has a clear understanding of your needs before engaging other specialized agents.

By adding another agent, the flight booking agent, the travel planning agent can call upon it to find suitable flights. The travel planning agent needs to provide the flight booking agent with relevant information (dates, destinations), and wait for and process the flight booking agent’s response, to incorporate the flight options into its overall plan.

Now, let’s add another agent to the workflow: a hotel booking agent to support finding accommodations. With this addition, the travel planning agent must also communicate with the hotel booking agent, make sure that the hotel dates align with the flight dates, and update the overall plan to include both flight and hotel options.

As we continue to add agents, such as a car rental agent or a local activities agent, each new addition receives relevant information from the travel planning agent, performs its specific task, and returns its results to be incorporated into the overall plan. The travel planning agent acts not only as the user experience interface, but also as a coordinator, deciding when to involve each specialized agent and how to combine their inputs into a cohesive travel plan.

This multi-agent workflow allows for more complex tasks to be accomplished by taking advantage of the specific capabilities of each agent. The system remains flexible, because agents can be added or removed based on the specific needs of each request, without requiring significant changes to the existing agents and with minimal change to the overall workflow.

For more on the benefits of breaking tasks into agents, see How task decomposition and smaller LLMs can make AI more affordable.

Process automation with agentic AI architecture

The preceding scenario, just like in traditional process automation, follows a common orchestration pattern, where the multi-agent collaboration is orchestrated by a supervisor agent. The supervisor agent acts like a conductor leading an orchestra, telling each instrument when to play and how to harmonize with others. For this approach, Amazon Bedrock Agents enables generative AI applications to execute multi-step tasks orchestrated by an agent, and supports multi-agent collaboration to solve complex tasks. This is done by designating an Amazon Bedrock agent as a supervisor agent and associating one or more collaborator agents with the supervisor. For more details, read about creating and configuring Amazon Bedrock Agents and see Use multi-agent collaboration with Amazon Bedrock Agents.

The following diagram illustrates the supervisor agent methodology.

Supervisor agent methodology

Following traditional process automation patterns, the other end of the spectrum from synchronous orchestration is asynchronous choreography: an asynchronous event-driven multi-agent workflow. In this approach, there is no central orchestrating agent (supervisor). Agents operate autonomously: actions are triggered by events or changes in a system’s state, and agents publish events or messages that other agents can subscribe to. The workflow emerges from the collective behavior of the agents reacting to events asynchronously. It’s more like a jazz improvisation, where each musician responds to what others are playing without a conductor. The following diagram illustrates this event-driven workflow.

Event-driven workflow methodology

The event-driven pattern in asynchronous systems operates without predefined workflows, creating a dynamic and potentially chaotic processing environment. While agents subscribe to and publish messages through a central event hub, the flow of processing is determined organically by the message requirements and the available subscribed agents. Although the resulting pattern may resemble a structured workflow when visualized, it’s important to understand that this is emergent behavior rather than orchestrated design. The absence of centralized workflow definitions means that message processing occurs naturally based on publication timing and agent availability, creating a fluid and adaptable system that can evolve with changing requirements.
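
The publish and subscribe behavior described above can be sketched with a toy in-memory event hub. This is only an illustration under simplifying assumptions: the class EventHub, the event type names, and the agent functions are hypothetical, and delivery here is synchronous, whereas a real system would deliver events asynchronously through a managed event bus.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventHub:
    """Toy in-memory event hub: agents subscribe to event types and publish new events."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver to whichever agents are subscribed; no central workflow definition.
        for handler in list(self._subscribers[event_type]):
            handler(payload)


hub = EventHub()
log: List[str] = []


def flight_agent(event: dict) -> None:
    log.append(f"flight booked to {event['destination']}")
    # Publishing a follow-up event is what lets other agents react.
    hub.publish("flight.booked", event)


def hotel_agent(event: dict) -> None:
    log.append(f"hotel booked in {event['destination']}")


hub.subscribe("trip.requested", flight_agent)
hub.subscribe("flight.booked", hotel_agent)
hub.publish("trip.requested", {"destination": "Hawaii"})
print(log)
```

No component defines the sequence of flight then hotel; the order emerges from which agent subscribes to which event, which is the emergent behavior the paragraph above describes.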

The choice between synchronous orchestration and asynchronous event-driven patterns fundamentally shapes how agentic AI systems operate and scale. Synchronous orchestration, with its supervisor agent approach, provides precise control and predictability, making it ideal for complex processes requiring strict oversight and sequential execution. This pattern excels in scenarios where the workflow needs to be tightly managed, audited, and debugged. However, it can create bottlenecks as all operations must pass through the supervisor agent. Conversely, asynchronous event-driven systems offer greater flexibility and scalability through their distributed nature. By allowing agents to operate independently and react to events in real-time, these systems can handle dynamic scenarios and adapt to changing requirements more readily. While this approach may introduce more complexity in tracking and debugging workflows, it excels in scenarios requiring high scalability, fault tolerance, and adaptive behavior. The decision between these patterns often depends on the specific requirements of the system, balancing the need for control and predictability against the benefits of flexibility and scalability.

Getting the best of both patterns

You can use a single agent to route messages to other agents based on the context of the event data (message) at runtime, with no prior knowledge of the downstream agents, without having to rely on each agent subscribing to an event hub. This is traditionally known as the message broker or event broker pattern, which for the purpose of this article we will call an agent broker pattern, to represent brokering of messages to AI agents. The agent broker pattern is a hybrid approach that combines elements of both centralized synchronous orchestration and distributed asynchronous event-driven systems.

The key to this pattern is that a single agent acts as a central hub for message distribution but doesn’t control the entire workflow. The broker agent determines where to send each message based on its content or metadata, making routing decisions at runtime. The processing agents are decoupled from each other and from the message source, only interacting with the broker to receive messages. The agent broker pattern is different from the supervisor pattern: a supervisor awaits a response from collaborating agents, whereas the broker routes a message to an agent and doesn’t await a response. The following diagram illustrates the agent broker methodology.

Agent broker methodology

Following an agent broker pattern, the system is still fundamentally event-driven, with actions triggered by the arrival of messages. New agents can be added to handle specific types of messages without changing the overall system architecture. How to implement this pattern is explained later in this post.

This pattern is often used in enterprise messaging systems, microservices architectures, and complex event processing systems. It provides a balance between the structure of orchestrated workflows and the flexibility of pure event-driven systems.

Agentic architecture with the Amazon Bedrock Converse API

Traditionally, we might have had to sacrifice some flexibility in the broker pattern by having to update the routing logic in the broker when adding processes (agents) to the architecture. This is, however, not the case when using the Amazon Bedrock Converse API. With the Converse API, we can call a tool to complete an Amazon Bedrock model response. The only change is that the additional agent is registered in configuration stored outside of the broker.

To let a model use a tool to complete a response for a message, the message and the definitions for one or more tools (agents) are sent to the model. If the model determines that one of the tools can help generate a response, it returns a request to use the tool.
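
A minimal sketch of what sending tool definitions can look like: the helper below turns a list of agent descriptions into the toolConfig structure that the Converse API accepts. The helper name build_tool_config, the agent entries, and the single-field input schema are assumptions made for illustration; the commented-out call shows where the structure would be passed.

```python
from typing import Dict, List


def build_tool_config(agents: List[Dict[str, str]]) -> dict:
    """Turn agent descriptions into a Converse API toolConfig structure."""
    tools = []
    for agent in agents:
        tools.append({
            "toolSpec": {
                "name": agent["name"],
                "description": agent["description"],
                # Assumed minimal schema: the broker passes the raw message through.
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"message": {"type": "string"}},
                        "required": ["message"],
                    }
                },
            }
        })
    return {"tools": tools}


# Hypothetical agent registry entries.
agents = [
    {"name": "flight_booking_agent", "description": "Finds and books flights."},
    {"name": "hotel_booking_agent", "description": "Finds and books hotels."},
]
tool_config = build_tool_config(agents)

# With AWS credentials and model access configured, the request would look like:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(
#     modelId="<tools-compatible-model-id>",
#     messages=[{"role": "user", "content": [{"text": "Book me a flight to Sydney"}]}],
#     toolConfig=tool_config,
# )
print([tool["toolSpec"]["name"] for tool in tool_config["tools"]])
```

Because the tool list is built from data, adding an agent means adding an entry to the registry rather than changing routing code.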

AWS AppConfig, a capability of AWS Systems Manager, is used to store each of the agents’ tool context data as a single configuration in a managed data store, to be sent to the Converse API tool request. AWS Lambda acts as the message broker: it receives all messages and sends requests to the Converse API with the tool context stored in AWS AppConfig. Because agents are ‘registered’ as ‘tool context’ in the configuration that Lambda reads at runtime (when an event message is received), additional agents can be added to the system without having to update the routing logic. For more information about when to use AWS AppConfig, see AWS AppConfig use cases.

Implementing the agent broker pattern

The following diagram demonstrates how Amazon EventBridge and Lambda act as a central message broker, with the Amazon Bedrock Converse API to let a model use a tool in a conversation to dynamically route messages to appropriate AI agents.

Agent broker architecture

Messages sent to EventBridge are routed through an EventBridge rule to Lambda. There are three tasks the EventBridge Lambda function performs as the agent broker:

  1. Query AWS AppConfig for all agents’ tool context. An agent tool context is a description of the agent’s capability along with the Amazon Resource Name (ARN) or URL of the agent’s message ingress.
  2. Provide the agent tool context along with the inbound event message to the Amazon Bedrock LLM through the Converse API; in this example, using an Amazon Bedrock tools-compatible LLM. The LLM, using the Converse API, compares the event message context against the agent tool context and responds to the requesting Lambda function with the recommended tool or tools that should be used to process the message.
  3. Receive the response from the Converse API request containing one or more tools that should be called to process the event message, and hand the event message to the ingress of the recommended tools.
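
The third task above can be sketched as follows: read the tool requests out of a Converse-style response and hand the event to each recommended agent’s ingress without waiting for a reply. The function names, the ingress URL, and the hand-written sample response are hypothetical; the send callable stands in for whatever delivery mechanism the ingress uses, such as a queue.

```python
from typing import Callable, Dict, List


def select_tools(converse_response: dict) -> List[str]:
    """Return the names of tools the model asked to use in a Converse response."""
    content = converse_response["output"]["message"]["content"]
    return [block["toolUse"]["name"] for block in content if "toolUse" in block]


def broker(event: dict, response: dict, ingress_by_tool: Dict[str, str],
           send: Callable[[str, dict], None]) -> List[str]:
    """Hand the event to the ingress of each recommended agent; do not await replies."""
    routed = []
    for tool_name in select_tools(response):
        ingress = ingress_by_tool.get(tool_name)
        if ingress is None:
            continue  # the model named a tool that is not registered; skip it
        send(ingress, event)
        routed.append(ingress)
    return routed


# Hand-written response in the shape the Converse API uses for tool requests.
sample_response = {
    "output": {"message": {"role": "assistant", "content": [
        {"text": "Routing to the flight agent."},
        {"toolUse": {"toolUseId": "tooluse_example", "name": "flight_booking_agent",
                     "input": {"message": "Book a flight to Sydney"}}},
    ]}},
    "stopReason": "tool_use",
}

# Hypothetical ingress registry, as it might be loaded from configuration.
ingress_by_tool = {"flight_booking_agent": "https://sqs.example.com/flight-agent-queue"}

sent = []
routed = broker({"detail": "Book a flight to Sydney"}, sample_response,
                ingress_by_tool, lambda target, msg: sent.append((target, msg)))
print(routed)
```

Injecting the send function keeps the routing logic independent of the transport, so the same broker code can target a queue, an HTTP endpoint, or a test double.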

In this example, the architecture demonstrates brokering messages asynchronously to an Amazon SageMaker based agent, an Amazon Bedrock agent, and an external third-party agent, all from the same agent broker.

Although the brokering Lambda function could connect directly to the SageMaker or Amazon Bedrock agent API, the architecture provides for adaptability and scalability in message throughput, allowing messages from the agent broker to be queued, in this example with Amazon Simple Queue Service (Amazon SQS), and processed according to the capability of the receiving agent. For adaptability, the Lambda function subscribed to the agent ingress queue provides additional system prompts (pre-prompting of the LLM for specific tool context), message formatting, and the functions required for the expected input and output of the agent request.

To add new agents to the system, the only integration requirements are to update the AWS AppConfig configuration with the new agent tool context (a description of the agent’s capability and ingress endpoint), and to make sure the brokering Lambda function has permissions to write to the agent ingress endpoint.

Agents can be added to the system without rewriting the Lambda function or performing an integration that requires downtime, allowing the new agent to be used on the next instantiation of the brokering Lambda function.

Implementing the supervisor pattern with an agent broker

Building upon the agent broker pattern, the architecture can be extended to handle more complex, stateful interactions. Although the broker pattern effectively uses AWS AppConfig and Amazon Bedrock’s Converse API tool use capability for dynamic routing, its unidirectional nature has limitations. Events flow in and are distributed to agents, but complex scenarios like travel booking require maintaining context across multiple agent interactions. This is where the supervisor pattern provides additional capabilities without compromising the flexible routing we achieved with the broker pattern.

Consider the travel booking example, which has the broker agent and several task-based agents that events are pushed to. When processing a request like “Book a 3-night trip to Sydney from Melbourne during the first week of September for 2 people”, we encounter several challenges. Although this statement contains clear intent, it lacks critical details that the agents might need, such as:

  1. Specific travel dates
  2. Accommodation preferences and room configurations

The broker pattern alone can’t effectively manage these information gaps while maintaining context between agent interactions. This is where adding the capability of a supervisor to the broker agent provides:

  • Contextual awareness between events and agent invocations
  • Bi-directional information flow capabilities

The following diagram illustrates the supervisor pattern workflow.

Supervisor pattern architecture

When a new event enters the system, the workflow initiates the following steps:

  1. The event is assigned a unique identifier for tracking
  2. The supervisor performs the following actions:
    • Evaluates which agents to invoke (brokering)
    • Creates a new state record with the identifier and timestamp
    • Provides this contextual information to the selected agents along with their invocation parameters
  3. Agents process their tasks and emit ‘task completion’ events back to EventBridge
  4. The supervisor performs the following actions:
    • Collects and processes completed events
    • Evaluates the combined results and context
    • Determines if additional agent invocations are needed
    • Continues this cycle until all necessary actions are completed
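The cycle above can be sketched as an in-memory simulation. This is a sketch under stated assumptions rather than the authors' implementation: `plan_agents` and `run_agent` are hypothetical stand-ins for the brokering call and the agents, and a plain dictionary stands in for the state record that a real system would persist and update from EventBridge events.

```python
import time
import uuid


def plan_agents(state):
    """Stand-in for the brokering step: pick agents whose results are missing."""
    needed = ["flights", "hotels"]  # hypothetical agent names
    return [name for name in needed if name not in state["results"]]


def run_agent(name, context):
    """Stand-in for an agent; a real one would emit a task completion event."""
    return {
        "event_id": context["event_id"],
        "agent": name,
        "result": f"{name} handled for {context['request']}",
    }


def supervise(request, max_rounds=5):
    # Step 1: assign a unique identifier and create the state record.
    state = {
        "event_id": str(uuid.uuid4()),
        "created_at": time.time(),
        "request": request,
        "results": {},
    }
    for _ in range(max_rounds):
        # Steps 2 and 4: decide which agents, if any, still need to run.
        pending = plan_agents(state)
        if not pending:
            break
        # Step 3: agents process their tasks and report back with the identifier.
        for name in pending:
            completion = run_agent(name, state)
            state["results"][completion["agent"]] = completion["result"]
    return state


final_state = supervise("3 nights in Sydney")
print(sorted(final_state["results"]))
```

The `max_rounds` bound is a design choice for the sketch: it keeps the loop from running indefinitely if an agent never reports completion.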

This pattern handles scenarios where agents might return varying results or request additional information. The supervisor can either:

  • Derive missing information from other agent responses
  • Request additional information from the source
  • Coordinate with other agents to resolve information gaps

To handle information gaps without architectural modifications, we can introduce an answers agent to the existing system. This agent operates within the same framework as other agents, but specializes in context resolution. When agents report incomplete information or require clarification, the answers agent can:

  • Process queries about missing information
  • Emit task completion events with enhanced context
  • Allow the supervisor to resume workflow execution with newly available information, the same way that it would after another agent emits its task-completion event.

This enhancement enables complex, multi-step workflows while maintaining the system’s scalability and flexibility. The supervisor can manage dependencies between agents, handle partial completions, and make sure that the necessary information is gathered before finalizing tasks.

Implementation considerations:

Implementing the supervisor pattern on top of the existing broker agent architecture combines the advantages of the broker pattern with the complex state management of orchestration. State management can be handled through Amazon DynamoDB, while EventBridge continues to handle event routing and AWS AppConfig continues to hold the agent configuration. The Amazon Bedrock Converse API continues to play a crucial role in agent selection, but now with added context from the supervisor’s state management. This allows you to preserve the dynamic routing capabilities we established with the broker pattern while adding the sophisticated workflow management needed for complex, multi-step processes.

Conclusion

Agentic AI architecture, powered by Amazon Bedrock and AWS services, represents a leap forward in the evolution of automated AI systems. By combining the flexibility of event-driven systems with the power of generative AI, this architecture enables businesses to create more adaptive, scalable, and intelligent automated processes. The agent broker pattern offers a robust solution for dynamically routing complex tasks to specialized AI agents, and the agent supervisor pattern extends these capabilities to handle sophisticated, context-aware workflows.

These patterns take advantage of the strengths of the Amazon Bedrock Converse API, Lambda, EventBridge, and AWS AppConfig to create a flexible and extensible system. The broker pattern excels at dynamic routing and seamless agent integration, while the supervisor pattern adds crucial state management and contextual awareness for complex, multi-step processes. Together, they provide a comprehensive framework for building sophisticated AI systems that can handle both simple routing and complex, stateful interactions.

This architecture not only streamlines operations, but also opens new possibilities for innovation and efficiency across various industries. Whether implementing simple task routing or orchestrating complex workflows requiring maintained context, organizations can build scalable, maintainable AI systems that evolve with their needs while maintaining operational stability.

To get started with an agentic AI architecture, consider the following next steps:

  • Explore Amazon Bedrock – If you haven’t already, sign up for Amazon Bedrock and experiment with its powerful generative AI models and APIs. Familiarize yourself with the Converse API and its tool use capabilities.
  • Prototype your own agent broker – Use the architecture outlined in this post as a starting point to build a proof-of-concept agent broker system tailored to your organization’s needs. Start small with a few specialized agents and gradually expand.
  • Identify use cases – Analyze your current business processes to identify areas where an agentic AI architecture could drive significant improvements. Consider complex, multi-step tasks that could benefit from AI assistance.
  • Stay informed – Keep up with the latest developments in AI and cloud technologies. AWS regularly updates its offerings, so stay tuned for new features that could enhance your agentic AI systems.
  • Collaborate and share – Join AI and cloud computing communities to share your experiences and learn from others. Consider contributing to open-source projects or writing about your implementation to help advance the field.
  • Invest in training – Make sure your team has the necessary skills to work with these advanced AI technologies. Consider AWS training and certification programs to build expertise in your organization.

By embracing an agentic AI architecture, you’re not just optimizing your current processes – you’re positioning your organization at the forefront of the AI revolution. Start your journey today and unlock the full potential of AI-driven automation for your business.


About the Authors

Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large scale complex integration and event driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.

Joshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.

Sara van de Moosdijk, simply known as Moose, is an AI/ML Specialist Solution Architect at AWS. She helps AWS customers and partners build and scale AI/ML solutions through technical enablement, support, and architectural guidance. Moose spends her free time figuring out how to fit more books in her overflowing bookcase.

How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in, text and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection can support over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face along with the Optimum Neuron cache will transparently supply a compiled model when available. If you’re using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.

You can deploy TGI as a docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

  1. Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel' 
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

  2. Create a file named docker-compose.yaml with the following content:
version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} #only needed for gated models
    volumes:
      - $PWD:/data #can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"
  3. Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

  4. To confirm that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
  5. To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
#"Tell me how to open an AWS account"
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
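If you prefer to test from Python instead of curl, the same request can be sketched with the standard library. This is a minimal sketch that assumes the container from the docker-compose.yaml file is listening on port 8081 of the local machine and that the endpoint returns a JSON body. The helper names are illustrative.

```python
import json
import urllib.request

TGI_URL = "http://127.0.0.1:8081/generate"  # port taken from docker-compose.yaml


def build_generate_request(prompt, max_new_tokens=60, url=TGI_URL):
    """Build the same POST request that the curl commands send."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=60):
    """Send the request; requires the TGI container to be running locally."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
example = build_generate_request("Tell me about AWS.")
print(example.get_method(), example.full_url)
```

With the container running, calling `generate("Tell me about AWS.")` sends the prompt and returns the parsed response.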

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

  1. From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

How to deploy the model on Amazon SageMaker

How to find the code you'll need to deploy the model using AWS Inferentia and Trainium

  2. Copy the example code into a SageMaker notebook, then choose Run.
  3. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
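After the notebook session ends, the `predictor` object is no longer available, but the endpoint can still be called by name. The following sketch assumes you call it through the SageMaker Runtime `invoke_endpoint` operation in boto3. The endpoint name shown is a placeholder, and you can read the real one from `predictor.endpoint_name` while the notebook is still open.

```python
import json


def build_invoke_args(endpoint_name, prompt, max_new_tokens=128):
    """Assemble keyword arguments for SageMaker Runtime invoke_endpoint."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(body),
    }


def invoke(endpoint_name, prompt):
    """Call the endpoint; needs AWS credentials and a deployed endpoint."""
    import boto3  # imported here so the helper above works without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invoke_args(endpoint_name, prompt))
    return json.loads(response["Body"].read())


# "qwen25-7b-instruct-demo" is a placeholder endpoint name.
args = build_invoke_args("qwen25-7b-instruct-demo", "What is the capital of France?")
print(args["ContentType"])
```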

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.

Delete a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.

Revolutionizing customer service: MaestroQA’s integration with Amazon Bedrock for actionable insight

This post is cowritten with Harrison Hunter, CTO and co-founder of MaestroQA.

MaestroQA augments call center operations by empowering the quality assurance (QA) process and customer feedback analysis to increase customer satisfaction and drive operational efficiencies. They assist with operations such as QA reporting, coaching, workflow automations, and root cause analysis.

In this post, we dive deeper into one of MaestroQA’s key features, conversation analytics, which uses Amazon Bedrock to help support teams uncover customer concerns, address points of friction, adapt support workflows, and identify areas for coaching. We discuss the unique challenges MaestroQA overcame and how they use AWS to build new features, drive customer insights, and improve operational efficiency.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The opportunity for open-ended conversation analysis at enterprise scale

MaestroQA serves a diverse clientele across various industries, including ecommerce, marketplaces, healthcare, talent acquisition, insurance, and fintech. All of these customers have a common challenge: the need to analyze a high volume of interactions with their customers. Analyzing these customer interactions is crucial to improving their product, improving their customer support, providing customer satisfaction, and identifying key industry signals. However, customer interaction data such as call center recordings, chat messages, and emails are highly unstructured and require advanced processing techniques in order to accurately and automatically extract insights.

When customers receive incoming calls at their call centers, MaestroQA employs its proprietary transcription technology, built by enhancing open source transcription models, to transcribe the conversations. After the data is transcribed, MaestroQA uses technology they have developed in combination with AWS services such as Amazon Comprehend to run various types of analysis on the customer interaction data. For example, MaestroQA offers sentiment analysis for customers to identify the sentiment of their end customer during the support interaction, enabling MaestroQA’s customers to sort their interactions and manually inspect the best or worst interactions. MaestroQA also offers a logic/keyword-based rules engine for classifying customer interactions based on other factors such as timing or process steps including metrics like Average Handle Time (AHT), compliance or process checks, and SLA adherence.

MaestroQA’s customers love these analysis features because they allow them to continuously improve the quality of their support and identify areas where they can improve their product to better satisfy their end customers. However, they were also interested in more advanced analysis, such as asking open-ended questions like “How many times did the customer ask for an escalation?” MaestroQA’s existing rules engine couldn’t always answer these types of queries because end-users could ask for the same outcome in many different ways. For example, “Can I speak to your manager?” and “I would like to speak to someone higher up” don’t share the same keywords, but are both asking for an escalation. MaestroQA needed a way to accurately classify customer interactions based on open-ended questions.
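The limitation of keyword matching can be illustrated with a toy rule. This is not MaestroQA's rules engine. The keyword list and function name are assumptions chosen only to show why two requests for the same outcome can be classified differently.

```python
ESCALATION_KEYWORDS = ("manager", "supervisor", "escalate")  # illustrative rule


def keyword_rule_matches(utterance, keywords=ESCALATION_KEYWORDS):
    """Return True when any keyword appears in the utterance."""
    text = utterance.lower()
    return any(keyword in text for keyword in keywords)


utterances = [
    "Can I speak to your manager?",
    "I would like to speak to someone higher up",
]
for utterance in utterances:
    print(keyword_rule_matches(utterance), "-", utterance)
```

The first utterance matches and the second does not, even though both ask for an escalation. Extending the keyword list helps only until the next paraphrase appears, which is the gap that open-ended classification is meant to close.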

MaestroQA faced an additional hurdle: the immense scale of customer interactions their clients manage. With clients handling anywhere from thousands to millions of customer engagements monthly, there was a pressing need for comprehensive analysis of support team performance across this vast volume of interactions. Consequently, MaestroQA had to develop a solution capable of scaling to meet their clients’ extensive needs.

To start developing this product, MaestroQA first rolled out a product called AskAI. AskAI allowed customers to run open-ended questions on a targeted list of up to 1,000 conversations. For example, a customer might use MaestroQA’s filters to find customer interactions in Oregon within the past two months and then run a root cause analysis query such as “What are customers frustrated about in Oregon?” to find churn risk anecdotes. Their customers really liked this feature and surprised MaestroQA with the breadth of use cases they covered, including analyzing marketing campaigns, service issues, and product opportunities. Customers started to request the ability to run this type of analysis across all of their transcripts, which could number in the millions, so they could quantify the impact of what they were seeing and find instances of important issues.

Solution overview

MaestroQA decided to use Amazon Bedrock to address their customers’ need for advanced analysis of customer interaction transcripts. Amazon Bedrock’s broad choice of FMs from leading AI companies, along with its scalability and security features, made it an ideal solution for MaestroQA.

MaestroQA integrated Amazon Bedrock into their existing architecture using Amazon Elastic Container Service (Amazon ECS). The customer interaction transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket.

The following architecture diagram demonstrates the request flow for AskAI. When a customer submits an analysis request through MaestroQA’s web application, an ECS cluster retrieves the relevant transcripts from Amazon S3, cleans the transcripts and formats them into a prompt, sends the prompt to Amazon Bedrock for analysis using the customer’s selected FM, and stores the results in a database hosted in Amazon Elastic Compute Cloud (Amazon EC2), where they can be retrieved by MaestroQA’s frontend web application.
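The clean, format, and send steps of this flow might look like the following sketch. It is not MaestroQA's code: the prompt layout, the whitespace cleanup, and the helper names are assumptions, and the model ID is a placeholder. The request and response shapes follow the Amazon Bedrock Converse API.

```python
def build_converse_request(model_id, question, transcripts, max_tokens=512):
    """Format transcripts and a question into Converse API arguments."""
    # Collapse runs of whitespace and drop transcripts that are empty.
    cleaned = [" ".join(t.split()) for t in transcripts if t.strip()]
    numbered = "\n\n".join(
        f"Transcript {i}:\n{text}" for i, text in enumerate(cleaned, start=1)
    )
    prompt = f"{numbered}\n\nQuestion: {question}"
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.0},
    }


def ask(model_id, question, transcripts):
    """Send the request; needs AWS credentials and model access in Bedrock."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("bedrock-runtime")
    response = client.converse(
        **build_converse_request(model_id, question, transcripts)
    )
    return response["output"]["message"]["content"][0]["text"]


request = build_converse_request(
    "example-model-id",  # placeholder, not a real model identifier
    "What are customers frustrated about?",
    ["Agent: Hello.\n  Customer:   My order is late.", "   "],
)
print(request["messages"][0]["content"][0]["text"])
```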

Solution architecture

MaestroQA offers their customers the flexibility to choose from multiple FMs available through Amazon Bedrock, including Anthropic’s Claude 3.5 Sonnet, Anthropic’s Claude 3 Haiku, Mistral AI’s Mistral 7B and Mixtral 8x7B, Cohere’s Command R and R+, and Meta’s Llama 3.1 models. This allows customers to select the model that best suits their specific use case and requirements.

The following screenshot shows how the AskAI feature allows MaestroQA’s customers to use the wide variety of FMs available on Amazon Bedrock to ask open-ended questions such as “What are some of the common issues in these tickets?” and generate useful insights from customer service interactions.

Product screenshot

To handle the high volume of customer interaction transcripts and provide low-latency responses, MaestroQA takes advantage of the cross-Region inference capabilities of Amazon Bedrock. Originally, they were doing the load balancing themselves, distributing requests between available AWS US Regions (us-east-1, us-west-2, and so on) and available EU Regions (eu-west-3, eu-central-1, and so on) for their North American and European customers, respectively. Now, the cross-Region inference capability of Amazon Bedrock enables MaestroQA to achieve twice the throughput compared to single-Region inference, a critical factor in scaling their solution to accommodate more customers. MaestroQA’s team no longer has to spend time and effort to predict their demand fluctuations, which is especially key when usage increases for their ecommerce customers around the holiday season. Cross-Region inference dynamically routes traffic across multiple Regions, providing optimal availability for each request and smoother performance during these high-usage periods. MaestroQA monitors this setup’s performance and reliability using Amazon CloudWatch.
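For contrast, the manual load balancing that cross-Region inference replaced can be pictured as a simple round-robin over a fixed Region list. The sketch below is illustrative only and is not MaestroQA's implementation. The class name is hypothetical, and the Region lists contain only the Regions named in the text.

```python
import itertools

US_REGIONS = ["us-east-1", "us-west-2"]
EU_REGIONS = ["eu-west-3", "eu-central-1"]


class RegionRotator:
    """Hand out Regions in round-robin order for one customer geography."""

    def __init__(self, regions):
        self._cycle = itertools.cycle(regions)

    def next_region(self):
        return next(self._cycle)


us_rotator = RegionRotator(US_REGIONS)
picked = [us_rotator.next_region() for _ in range(4)]
print(picked)  # ['us-east-1', 'us-west-2', 'us-east-1', 'us-west-2']
```

A rotation like this is blind to per-Region capacity and demand spikes, which is why it requires demand forecasting. With cross-Region inference, the application targets an inference profile instead of a single Region and leaves the routing decision to the service.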

Benefits: How Amazon Bedrock added value

Amazon Bedrock has enabled MaestroQA to innovate faster and gain a competitive advantage by offering their customers powerful generative AI features for analyzing customer interaction transcripts. With Amazon Bedrock, MaestroQA can now provide their customers with the ability to run open-ended queries across millions of transcripts, unlocking valuable insights that were previously inaccessible.

The broad choice of FMs available through Amazon Bedrock allows MaestroQA to cater to their customers’ diverse needs and preferences. Customers can select the model that best aligns with their specific use case, finding the right balance between performance and price.

The scalability and cross-Region inference capabilities of Amazon Bedrock enable MaestroQA to handle high volumes of customer interaction transcripts while maintaining low latency, regardless of their customers’ geographical locations.

MaestroQA takes advantage of the robust security features and ethical AI practices of Amazon Bedrock to bolster customer confidence. These measures make sure that client data remains secure during processing and isn’t used for model training by third-party providers. Additionally, Amazon Bedrock availability in Europe, coupled with its geographic control capabilities, allows MaestroQA to seamlessly extend AI services to European customers. This expansion is achieved without introducing additional complexities, thereby maintaining operational efficiency while adhering to Regional data regulations.

The adoption of Amazon Bedrock proved to be a game changer for MaestroQA’s compact development team. Its serverless architecture allowed the team to rapidly prototype and refine their application without the burden of managing complex hardware infrastructure. This shift enabled MaestroQA to channel their efforts into optimizing application performance rather than grappling with resource allocation. Moreover, Amazon Bedrock offers seamless compatibility with their existing AWS environment, allowing for a smooth integration process and further streamlining their development workflow. MaestroQA was able to use their existing authentication process with AWS Identity and Access Management (IAM) to securely authenticate their application to invoke large language models (LLMs) within Amazon Bedrock. They were also able to use the familiar AWS SDK to quickly and effortlessly integrate Amazon Bedrock into their application.

Overall, by using Amazon Bedrock, MaestroQA is able to provide their customers with a powerful and flexible solution for extracting valuable insights from their customer interaction data, driving continuous improvement in their products and support processes.

Success metrics

The early results have been remarkable.

A lending company uses MaestroQA to detect compliance risks on 100% of their conversations. Before, agents would raise internal escalations if a consumer complained about the loan or expressed being in a vulnerable state. However, this process was manual and error prone, and the lending company would miss many of these risks. Now, they are able to detect compliance risks with almost 100% accuracy.

A medical device company, which is required to report device issues to the FDA, no longer relies solely on agents to internally report customer-reported issues, but uses this service to analyze all of their conversations to make sure all complaints are flagged.

An education company has been able to replace their manual survey scores with an automated customer sentiment score that increased their sample size from 15% to 100% of conversations.

The best is yet to come.

Conclusion

Using AWS, MaestroQA was able to innovate faster and gain a competitive advantage. Companies from different industries such as financial services, healthcare and life sciences, and EdTech all share the common desire to provide better customer services for their clients. MaestroQA was able to enable them to do that by quickly pivoting to offer powerful generative AI features that solved tangible business problems and enhanced overall compliance.

Check out MaestroQA’s feature AskAI and their LLM-powered AI Classifiers if you’re interested in better understanding your customer conversations and survey scores. For more about Amazon Bedrock, see Get started with Amazon Bedrock and learn about features such as cross-Region inference to help scale your generative AI features globally.


About the Authors

Carole Suarez is a Senior Solutions Architect at AWS, where she helps guide startups through their cloud journey. Carole specializes in data engineering and holds an array of AWS certifications on a variety of topics including analytics, AI, and security. She is passionate about learning languages and is fluent in English, French, and Tagalog.

Ben Gruher is a Generative AI Solutions Architect at AWS, focusing on startup customers. Ben graduated from Seattle University where he obtained bachelor’s and master’s degrees in Computer Science and Data Science.

Harrison Hunter is the CTO and co-founder of MaestroQA where he leads the engineering and product teams. Prior to MaestroQA, Harrison studied computer science and AI at MIT.

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

DeepSeek-R1, developed by AI startup DeepSeek AI, is an advanced large language model (LLM) distinguished by its innovative, multi-stage training process. Instead of relying solely on traditional pre-training and fine-tuning, DeepSeek-R1 integrates reinforcement learning to achieve more refined outputs. The model employs a chain-of-thought (CoT) approach that systematically breaks down complex queries into clear, logical steps. Additionally, it uses NVIDIA’s parallel thread execution (PTX) constructs to boost training efficiency, and a combined framework of supervised fine-tuning (SFT) and group relative policy optimization (GRPO) makes sure its results are both transparent and interpretable.

In this post, we demonstrate how to optimize hosting DeepSeek-R1 distilled models with Hugging Face Text Generation Inference (TGI) on Amazon SageMaker AI.

Model Variants

The current DeepSeek model collection consists of the following models:

  • DeepSeek-V3 – An LLM that uses a Mixture-of-Experts (MoE) architecture. MoE models like DeepSeek-V3 and Mixtral replace the standard feed-forward neural network in transformers with a set of parallel sub-networks called experts. These experts are selectively activated for each input, allowing the model to efficiently scale to a much larger size without a corresponding increase in computational cost. For example, DeepSeek-V3 is a 671-billion-parameter model, but only 37 billion parameters (approximately 5%) are activated during the output of each token. DeepSeek-V3-Base is the base model from which the R1 variants are derived.
  • DeepSeek-R1-Zero – A fine-tuned variant of DeepSeek-V3 based on using reinforcement learning to guide CoT reasoning capabilities, without any SFT done prior. According to the DeepSeek R1 paper, DeepSeek-R1-Zero excelled at reasoning behaviors but encountered challenges with readability and language mixing.
  • DeepSeek-R1 – Another fine-tuned variant of DeepSeek-V3-Base, built similarly to DeepSeek-R1-Zero, but with a multi-step training pipeline. DeepSeek-R1 starts with a small amount of cold-start data prior to the GRPO process. It also incorporates SFT data through rejection sampling, combined with supervised data generated from DeepSeek-V3 to retrain DeepSeek-V3-base. After this, the retrained model goes through another round of RL, resulting in the DeepSeek-R1 model checkpoint.
  • DeepSeek-R1-Distill – Variants of Alibaba’s Qwen and Meta’s Llama based on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. The distilled model variants are a result of fine-tuning Qwen or Llama models through knowledge distillation, where DeepSeek-R1 acts as the teacher and Qwen or Llama as the student. These models retain their existing architecture while gaining additional reasoning capabilities through a distillation process. They are exclusively fine-tuned using SFT and don’t incorporate any RL techniques.
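
To make the sparse-activation idea behind MoE concrete, the following is a minimal sketch of generic top-k expert routing in plain Python. It is not DeepSeek-V3's actual router, which this post does not describe; the function names, the toy experts, and the gate scores are all hypothetical and chosen only for illustration.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical gate scores for one token; experts 1 and 3 score highest.
output, used_experts = moe_layer(10.0, experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)

# Share of DeepSeek-V3 parameters active per token, from the figures above.
active_fraction = 37 / 671

print(used_experts, output, f"{active_fraction:.1%}")
```

Only two of the four toy experts run for this input, which is the same principle that lets DeepSeek-V3 activate 37 billion of its 671 billion parameters per token.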

The following figure illustrates the performance of DeepSeek-R1 compared to other state-of-the-art models on standard benchmark tests, such as MATH-500, MMLU, and more.

Hugging Face Text Generation Inference (TGI)

Hugging Face Text Generation Inference (TGI) is a high-performance, production-ready inference framework optimized for deploying and serving large language models (LLMs) efficiently. It is designed to handle the demanding computational and latency requirements of state-of-the-art transformer models, including Llama, Falcon, Mistral, Mixtral, and GPT variants. For a full list, refer to the TGI supported models documentation.

Amazon SageMaker AI provides a managed way to deploy TGI-optimized models, offering deep integration with Hugging Face’s inference stack for scalable and cost-efficient LLM deployment. To learn more about Hugging Face TGI support on Amazon SageMaker AI, refer to the announcement post and the documentation on deploying models to Amazon SageMaker AI.

Key Optimizations in Hugging Face TGI

Hugging Face TGI is built to address the challenges associated with deploying large-scale text generation models, such as inference latency, token throughput, and memory constraints.

The benefits of using TGI include:

  • Tensor parallelism – Splits large models across multiple GPUs for efficient memory utilization and computation
  • Continuous batching – Dynamically batches multiple inference requests to maximize token throughput and reduce latency
  • Quantization – Lowers memory usage and computational cost by converting model weights to lower-precision formats such as INT8 or FP8
  • Speculative decoding – Uses a smaller draft model to speed up token prediction while maintaining accuracy
  • Key-value cache optimization – Reduces redundant computations for faster response times in long-form text generation
  • Token streaming – Streams tokens in real time for low-latency applications like chatbots and virtual assistants
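
As a rough illustration of why continuous batching raises throughput, the following toy simulation compares it with static batching. It is a deliberately simplified model, not TGI's scheduler: each request is reduced to a number of decode steps, prefill cost is ignored, and the function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A waiting request is admitted as soon as a running one finishes."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished requests leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical output lengths (in decode steps) for six requests.
lengths = [2, 8, 2, 8, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

In this toy case the short requests no longer wait for the long ones in their batch, so the same work completes in fewer decode steps.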

Runtime TGI Arguments

TGI containers support runtime configurations that provide greater control over LLM deployments. These configurations allow you to adjust settings such as quantization, model parallel size (tensor parallel size), maximum tokens, data type (dtype), and more using container environment variables.

Notable runtime parameters influencing your model deployment include:

  1. HF_MODEL_ID : This parameter specifies the identifier of the model to load, which can be a model ID from the Hugging Face Hub (e.g., meta-llama/Llama-3.2-11B-Vision-Instruct) or an Amazon Simple Storage Service (Amazon S3) URI containing the model files.
  2. HF_TOKEN : This parameter provides the access token required to download gated models from the Hugging Face Hub, such as Llama or Mistral.
  3. SM_NUM_GPUS : This parameter specifies the number of GPUs to use for model inference, allowing the model to be sharded across multiple GPUs for improved performance.
  4. MAX_CONCURRENT_REQUESTS : This parameter controls the maximum number of concurrent requests that the server can handle, effectively managing the load and ensuring optimal performance.
  5. DTYPE : This parameter sets the data type for the model weights during loading, with options like float16 or bfloat16, influencing the model’s memory consumption and computational performance.
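
The parameters above are passed to the container as string-valued environment variables. The sketch below assembles such a dictionary and adds a back-of-the-envelope estimate of weight memory per GPU, to show how DTYPE and SM_NUM_GPUS interact. The helper names are hypothetical, the parameter counts are the nominal model sizes, and the estimate covers weights only (no KV cache, activations, or framework overhead).

```python
# Bytes per parameter for the DTYPE options mentioned above.
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2}


def weight_gb_per_gpu(params_billion, dtype, num_gpus):
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype] / num_gpus


def build_tgi_env(model_id, num_gpus, dtype="bfloat16",
                  max_concurrent_requests=None, hf_token=None):
    """Assemble a container environment; values are strings, as env vars require."""
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
        "DTYPE": dtype,
    }
    if max_concurrent_requests is not None:
        env["MAX_CONCURRENT_REQUESTS"] = str(max_concurrent_requests)
    if hf_token:
        env["HF_TOKEN"] = hf_token  # only needed for gated models
    return env


env_8b = build_tgi_env("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", num_gpus=1)
gb_8b = weight_gb_per_gpu(8, "bfloat16", num_gpus=1)    # nominal 8B parameters
gb_70b = weight_gb_per_gpu(70, "bfloat16", num_gpus=8)  # nominal 70B, sharded 8 ways
print(env_8b, gb_8b, gb_70b)
```

A dictionary like `env_8b` can be passed as the `env` argument of `HuggingFaceModel`. The estimate suggests why a 70B model in bfloat16 needs to be sharded across several GPUs, while an 8B model can fit on one.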

There are additional optional runtime parameters that are already pre-optimized in TGI containers to maximize performance on host hardware. However, you can modify them to exercise greater control over your LLM inference performance:

  1. MAX_TOTAL_TOKENS : This parameter sets the upper limit on the combined number of input and output tokens a deployment can handle per request, effectively defining the “memory budget” for client interactions.
  2. MAX_INPUT_TOKENS : This parameter specifies the maximum number of tokens allowed in the input prompt of each request, controlling the length of user inputs to manage memory usage and ensure efficient processing.
  3. MAX_BATCH_PREFILL_TOKENS : This parameter caps the total number of tokens processed during the prefill stage across all batched requests, a phase that is both memory-intensive and compute-bound, thereby optimizing resource utilization and preventing out-of-memory errors.
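
The three token limits constrain each other, and the "memory budget" wording can be made tangible with a KV cache estimate. The sketch below is illustrative only: the consistency rules reflect our understanding of how the TGI launcher validates these settings, the helper names are hypothetical, and the architecture numbers (32 layers, 8 key-value heads, head dimension 128) are assumed values for a Llama-3.1-8B-style model, not figures from this post.

```python
def validate_token_limits(max_input_tokens, max_total_tokens, max_batch_prefill_tokens):
    """Return a list of problems; an empty list means the limits are consistent."""
    problems = []
    if max_input_tokens >= max_total_tokens:
        problems.append("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_prefill_tokens < max_input_tokens:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_TOKENS")
    return problems


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Limits sized for the medium-length test: 3,072 input + 256 output tokens.
limits = {"max_input_tokens": 3072, "max_total_tokens": 3328,
          "max_batch_prefill_tokens": 4096}
problems = validate_token_limits(**limits)

# Assumed Llama-3.1-8B-style shape, 16-bit cache values.
per_request_mib = kv_cache_bytes(3328, layers=32, kv_heads=8, head_dim=128) / 2**20
print(problems, per_request_mib)
```

Under these assumptions, one request at the full 3,328-token budget holds roughly 416 MiB of KV cache, which is why raising MAX_TOTAL_TOKENS reduces how many requests fit in a batch.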

For a complete list of runtime configurations, please refer to text-generation-launcher arguments.

DeepSeek Deployment Patterns with TGI on Amazon SageMaker AI

Amazon SageMaker AI offers a simple and streamlined approach to deploy DeepSeek-R1 models with just a few lines of code. Additionally, SageMaker endpoints support automatic load balancing and autoscaling, enabling your LLM deployment to scale dynamically based on incoming requests. During non-peak hours, the endpoint can scale down to zero, optimizing resource usage and cost efficiency.
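
As one possible way to enable the autoscaling described above, the sketch below registers an endpoint variant with Application Auto Scaling and attaches a target tracking policy using boto3. The capacity bounds, target value, cooldowns, and policy name are illustrative assumptions, and the sketch keeps a minimum of one instance; scaling to zero needs additional setup that is not covered here.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=4,
                               target_invocations_per_instance=10.0):
    """Build the two Application Auto Scaling requests for an endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,  # illustrative value
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    }
    return target, policy


def apply_autoscaling(endpoint_name):
    """Send both requests; requires AWS credentials and an InService endpoint."""
    import boto3

    client = boto3.client("application-autoscaling")
    target, policy = build_autoscaling_requests(endpoint_name)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


target, policy = build_autoscaling_requests("deepseek-r1-llama-8b-endpoint")
print(target["ResourceId"])
```

The right target value depends on model size, instance type, and sequence lengths, so it should come from your own load tests.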

The table below summarizes all DeepSeek-R1 models available on the Hugging Face Hub, as uploaded by the original model provider, DeepSeek.

| Model | # Total Params | # Activated Params | Context Length | Download |
|---|---|---|---|---|
| DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 deepseek-ai/DeepSeek-R1-Zero |
| DeepSeek-R1 | 671B | 37B | 128K | 🤗 deepseek-ai/DeepSeek-R1 |

DeepSeek AI also offers distilled versions of its DeepSeek-R1 model to provide more efficient alternatives for various applications.

| Model | Base Model | Download |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-70B |

There are two ways to deploy LLMs, such as DeepSeek-R1 and its distilled variants, on Amazon SageMaker:

Option 1: Direct Deployment from Hugging Face Hub

The easiest way to host DeepSeek-R1 in your AWS account is by deploying it (along with its distilled variants) using TGI containers. These containers simplify deployment through straightforward runtime environment specifications. The architecture diagram below shows a direct download from the Hugging Face Hub, ensuring seamless integration with Amazon SageMaker.

The following code shows how to deploy the DeepSeek-R1-Distill-Llama-8B model to a SageMaker endpoint, directly from the Hugging Face Hub.

import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel,
    get_huggingface_llm_image_uri,
)

role = sagemaker.get_execution_role()
session = sagemaker.Session()

# Select a TGI container image (version 3.x)
deploy_image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="3.0.1",
)

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        # ... other runtime environment variables
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model",  # optional
)

pretrained_tgi_predictor = deepseek_tgi_model.deploy(
    endpoint_name="deepseek-r1-llama-8b-endpoint",  # optional
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False,  # set to True to block until the endpoint is InService
)

To deploy other distilled models, simply update the HF_MODEL_ID to any of the DeepSeek distilled model variants, such as deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, or deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
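
Once the endpoint is InService, it can be invoked with a JSON payload in the TGI format (`inputs` plus optional `parameters`). The following is a minimal sketch using the boto3 SageMaker runtime client; the helper names are hypothetical and the sampling values are illustrative assumptions, not settings used in this post.

```python
import json


def build_tgi_payload(prompt, max_new_tokens=256, temperature=0.6, top_p=0.95):
    """Build a TGI-style request body; the sampling values are illustrative."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "do_sample": True,
        },
    }


def invoke(endpoint_name, prompt):
    """Call the endpoint; requires AWS credentials and a running endpoint."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_tgi_payload(prompt)),
    )
    return json.loads(response["Body"].read())


payload = build_tgi_payload("What is 17 * 24? Think step by step.")
print(json.dumps(payload, indent=2))
```

With the predictor object returned by `deploy`, the same payload can also be sent with `pretrained_tgi_predictor.predict(payload)`.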

Option 2: Deployment from a Private S3 Bucket

To deploy models privately within your AWS account, upload the DeepSeek-R1 model weights to an S3 bucket and set HF_MODEL_ID to the corresponding S3 bucket prefix. TGI will then retrieve and deploy the model weights from S3, eliminating the need for internet downloads during each deployment. This approach reduces model loading latency by keeping the weights closer to your SageMaker endpoints and enables your organization’s security teams to perform vulnerability scans before deployment. SageMaker endpoints also support auto-scaling, allowing DeepSeek-R1 to scale horizontally based on incoming request volume while seamlessly integrating with elastic load balancing.

Deploying a DeepSeek-R1 distilled model variant from S3 follows the same process as option 1, with one key difference: HF_MODEL_ID points to the S3 bucket prefix instead of the Hugging Face Hub. Before deployment, you must first download the model weights from the Hugging Face Hub and upload them to your S3 bucket.

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "s3://my-model-bucket/path/to/model",
        # ... other runtime environment variables
    },
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"],
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model-s3",  # optional
)
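
The download-and-upload step that precedes this deployment could be sketched as follows, using `snapshot_download` from the `huggingface_hub` library and the boto3 S3 client. The helper names, the local directory, and the choice to skip hidden files are assumptions made for illustration; the bucket and prefix are placeholders matching the example above.

```python
from pathlib import Path


def plan_s3_uploads(local_dir, prefix):
    """Map every non-hidden file under local_dir to an S3 key under prefix."""
    root = Path(local_dir)
    plan = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and hidden entries such as download metadata folders.
        if not path.is_file() or any(part.startswith(".") for part in relative.parts):
            continue
        plan.append((str(path), f"{prefix.strip('/')}/{relative.as_posix()}"))
    return plan


def download_and_upload(repo_id, bucket, prefix, local_dir="model_weights"):
    """Download weights from the Hub, then copy them to S3 (needs network access)."""
    import boto3
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    s3 = boto3.client("s3")
    for local_path, key in plan_s3_uploads(local_dir, prefix):
        s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{prefix.strip('/')}"


# Example call (placeholders; not executed here):
# download_and_upload("deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
#                     "my-model-bucket", "path/to/model")
```

The returned URI is the value to use for HF_MODEL_ID. The endpoint's execution role also needs read access to that bucket prefix.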

Deployment Best Practices

The following are some best practices to consider when deploying DeepSeek-R1 models on SageMaker AI:

  • Deploy within a private VPC – It’s recommended to deploy your LLM endpoints inside a virtual private cloud (VPC) and behind a private subnet, preferably with no egress. See the following code:
deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        # ... other runtime environment variables
    },
    role=role,
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"],
    },
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model",  # optional
)
  • Implement guardrails for safety and compliance – Always apply guardrails to validate incoming and outgoing model responses for safety, bias, and toxicity. You can use Amazon Bedrock Guardrails to enforce these protections on your SageMaker endpoint responses.
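
One way to apply the guardrail recommendation to a SageMaker endpoint is the Amazon Bedrock `ApplyGuardrail` API, which evaluates text independently of where the model is hosted. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, and the guardrail ID and version are placeholders for a guardrail you have already created.

```python
def build_guardrail_request(text, guardrail_id, guardrail_version, source):
    """Build an ApplyGuardrail request; source is "INPUT" (prompt) or "OUTPUT" (response)."""
    if source not in ("INPUT", "OUTPUT"):
        raise ValueError("source must be 'INPUT' or 'OUTPUT'")
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": guardrail_version,
        "source": source,
        "content": [{"text": {"text": text}}],
    }


def is_blocked(response):
    """True when the guardrail intervened on the evaluated text."""
    return response.get("action") == "GUARDRAIL_INTERVENED"


def check_text(text, guardrail_id, guardrail_version, source):
    """Evaluate text against a guardrail; requires AWS credentials."""
    import boto3

    client = boto3.client("bedrock-runtime")
    request = build_guardrail_request(text, guardrail_id, guardrail_version, source)
    return is_blocked(client.apply_guardrail(**request))


# Placeholder guardrail ID and version, for illustration only.
request = build_guardrail_request("Tell me about tensor parallelism.",
                                  "<guardrail-id>", "1", "INPUT")
print(request["source"], is_blocked({"action": "NONE"}))
```

A typical flow would call `check_text` with source "INPUT" before invoking the endpoint and with source "OUTPUT" on the generated text before returning it to the user.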

Inference Performance Evaluation

This section presents examples of the inference performance of DeepSeek-R1 distilled variants on Amazon SageMaker AI. Evaluating LLM performance across key metrics—end-to-end latency, throughput, and resource efficiency—is crucial for ensuring responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly enhances user experience, system reliability, and deployment feasibility at scale.

All DeepSeek-R1 Qwen (1.5B, 7B, 14B, 32B) and Llama (8B, 70B) variants are evaluated against four key performance metrics:

  1. End-to-End Latency
  2. Throughput (Tokens per Second)
  3. Time to First Token
  4. Inter-Token Latency
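
To show how these four metrics relate, the sketch below derives them from the send time of a request and the arrival times of its streamed tokens. The definitions are common conventions rather than the exact formulas used in this evaluation (in particular, throughput is taken here as output tokens divided by end-to-end latency), and the function name and timestamps are hypothetical.

```python
def summarize_request(start_time, token_times):
    """Compute latency metrics from a request's start and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    time_to_first_token = token_times[0] - start_time
    end_to_end_latency = token_times[-1] - start_time
    # Gaps between consecutive tokens; averaged to get inter-token latency.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token_latency = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "end_to_end_latency_s": end_to_end_latency,
        "throughput_tokens_per_s": len(token_times) / end_to_end_latency,
        "time_to_first_token_s": time_to_first_token,
        "inter_token_latency_s": inter_token_latency,
    }


# Hypothetical trace: request sent at t=0.0, five tokens streamed back.
metrics = summarize_request(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

In this trace, the first token dominates the latency (0.5 s), after which tokens arrive every 0.25 s; long prompts mainly affect the former, and long outputs the latter.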

Please note that the main purpose of this performance evaluation is to give you an indication of the relative performance of distilled R1 models on different hardware. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and input/output sequence lengths.

If you are interested in running this evaluation job inside your own account, refer to our code on GitHub.

Scenarios

We tested the following scenarios:

  • Tokens – Two input token lengths were used to evaluate the performance of DeepSeek-R1 distilled variants hosted on SageMaker endpoints. Each test was executed 100 times, with concurrency set to 1, and the average values across key performance metrics were recorded. All models were run with dtype=bfloat16.
    • Short-length test – 512 input tokens, 256 output tokens.
    • Medium-length test – 3,072 input tokens, 256 output tokens.
  • Hardware – Several instance families were tested, including p4d (NVIDIA A100), g5 (NVIDIA A10G), g6 (NVIDIA L4), and g6e (NVIDIA L40S), each equipped with 1, 4, or 8 GPUs per instance. For additional details regarding pricing and instance specifications, refer to Amazon SageMaker AI pricing. In the following table, a green cell indicates a model was tested on a specific instance type, and a red cell indicates the model wasn’t tested, either because the instance type or size was excessive for the model or because the model couldn’t run due to insufficient GPU memory.

Box Plots

In the following sections, we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers, using a box for the middle 50% of the data with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness.
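
The quantities a box plot draws can be computed directly. The sketch below uses Python's `statistics` module and the common 1.5 × IQR rule for flagging outliers; the post does not state which outlier rule its plots use, so that rule, the function name, and the sample latencies are assumptions for illustration.

```python
import statistics


def box_plot_summary(values):
    """Return the median, quartiles, whisker ends, and outliers of a sample."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inliers[0],    # smallest non-outlier value
        "whisker_high": inliers[-1],  # largest non-outlier value
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical end-to-end latencies in milliseconds, with one slow request.
summary = box_plot_summary([10, 11, 12, 13, 14, 15, 16, 17, 50])
print(summary)
```

Here the single 50 ms request is reported as an outlier instead of stretching the whisker, which is why box plots remain readable when a few requests are unusually slow.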

DeepSeek-R1-Distill-Qwen-1.5B

A single GPU instance is sufficient to host one or multiple concurrent DeepSeek-R1-Distill-Qwen-1.5B serving workers on TGI. In this test, a single worker was deployed, and performance was evaluated across the four outlined metrics. Results show that the ml.g5.xlarge outperforms the ml.g6.xlarge across all metrics.

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge. The ml.g6e.xlarge instance performed the best, followed by ml.g5.2xlarge, ml.g5.xlarge, and the ml.g6 instances.

DeepSeek-R1-Distill-Llama-8B

Similar to the 7B variant, DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge, with ml.g6e.xlarge demonstrating the highest performance among all instances.

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-14B was tested on ml.g5.12xlarge, ml.g6.12xlarge, ml.g6e.2xlarge, and ml.g6e.xlarge. The ml.g5.12xlarge exhibited the highest performance, followed by ml.g6.12xlarge. Although at a lower performance profile, DeepSeek-R1-Distill-Qwen-14B can also be deployed on the single-GPU g6e instances because of their larger GPU memory capacity.

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B requires more than 48GB of memory, making ml.g5.12xlarge, ml.g6.12xlarge, and ml.g6e.12xlarge suitable for performance comparison. In this test, ml.g6e.12xlarge delivered the highest performance, followed by ml.g5.12xlarge, with ml.g6.12xlarge ranking third.

DeepSeek-R1-Distill-Llama-70B

DeepSeek-R1-Distill-Llama-70B was tested on ml.g5.48xlarge, ml.g6.48xlarge, ml.g6e.12xlarge, ml.g6e.48xlarge, and ml.p4d.24xlarge. The best performance was observed on ml.p4d.24xlarge, followed by ml.g6e.48xlarge, ml.g6e.12xlarge, ml.g5.48xlarge, and finally ml.g6.48xlarge.

Clean Up

To avoid incurring cost after completing your evaluation, ensure you delete the endpoints you created earlier.

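import boto3

region = "<region>"  # replace with the AWS Region that hosts the endpoint
endpoint_name = "deepseek-r1-llama-8b-endpoint"  # name used in the deployment example

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client("sagemaker", region_name=region)

# Delete the endpoint.
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)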
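
Deleting an endpoint leaves its endpoint configuration and model resources in the account. If you also want to remove those, a sketch along the following lines could be used; the function name is hypothetical, and it assumes the configuration and models are not shared with other endpoints.

```python
def delete_endpoint_resources(client, endpoint_name):
    """Delete an endpoint together with its endpoint configuration and models."""
    endpoint = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = client.describe_endpoint_config(EndpointConfigName=config_name)
    # Collect model names before anything is deleted.
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "endpoint_config": config_name,
            "models": model_names}


# Example call (requires AWS credentials; not executed here):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), "deepseek-r1-llama-8b-endpoint")
```

The client is passed in as an argument so the same function works with any Region or session you have already configured.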

Conclusion

In this post, you learned about the current versions of DeepSeek models and how to use the Hugging Face TGI containers to reduce the deployment of DeepSeek-R1 distilled models (or any other LLM) on Amazon SageMaker AI to just a few lines of code. You also learned how to deploy models directly from the Hugging Face Hub for quick experimentation and from a private S3 bucket to provide enhanced security and model deployment performance.

Finally, you saw an extensive evaluation of all DeepSeek-R1 distilled models across four key inference performance metrics using 13 different NVIDIA accelerator instance types. This analysis provides valuable insights to help you select the optimal instance type for your DeepSeek-R1 deployment. All code used to analyze the DeepSeek-R1 distilled model variants is available on GitHub.


About the Authors

Pranav Murthy is a Worldwide Technical Lead and Sr. GenAI Data Scientist at AWS. He helps customers build, train, deploy, evaluate, and monitor Machine Learning (ML), Deep Learning (DL), and Generative AI (GenAI) workloads on Amazon SageMaker. Pranav specializes in multimodal architectures, with deep expertise in computer vision (CV) and natural language processing (NLP). Previously, he worked in the semiconductor industry, developing AI/ML models to optimize semiconductor processes using state-of-the-art techniques. In his free time, he enjoys playing chess, training models to play chess, and traveling. You can find Pranav on LinkedIn.

Simon Pagezy is a Cloud Partnership Manager at Hugging Face, dedicated to making cutting-edge machine learning accessible through open source and open science. With a background in AI/ML consulting at AWS, he helps organizations leverage the Hugging Face ecosystem on their platform of choice.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Exploring creative possibilities: A visual guide to Amazon Nova Canvas

Compelling AI-generated images start with well-crafted prompts. In this follow-up to our Amazon Nova Canvas Prompt Engineering Guide, we showcase a curated gallery of visuals generated by Nova Canvas—categorized by real-world use cases—from marketing and product visualization to concept art and design exploration.

Each image is paired with the prompt and parameters that generated it, providing a practical starting point for your own AI-driven creativity. Whether you’re crafting specific types of images, optimizing workflows, or simply seeking inspiration, this guide will help you unlock the full potential of Amazon Nova Canvas.

Solution overview

Getting started with Nova Canvas is straightforward. You can access the model through the Image Playground on the AWS Management Console for Amazon Bedrock, or through APIs. For detailed setup instructions, including account requirements and necessary permissions, visit our documentation on Creative content generation with Amazon Nova. Our previous post on prompt engineering best practices provides comprehensive guidance on crafting effective prompts.

A visual guide to Amazon Nova Canvas

In this gallery, we showcase a diverse range of images and the prompts used to generate them, highlighting how Amazon Nova Canvas adapts to various use cases—from marketing and product design to storytelling and concept art.

All images that follow were generated using Nova Canvas at a 1280x720px resolution with a CFG scale of 6.5, seed of 0, and the Premium setting for image quality. This resolution also matches the image dimensions expected by Nova Reel, allowing you to take these images into Amazon Nova Reel to experiment with video generation.
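
For readers who prefer the API over the Image Playground, the settings above could be expressed as a request body along the following lines. This is a sketch under assumptions: the field names and the model ID `amazon.nova-canvas-v1:0` reflect our understanding of the Nova Canvas request schema on Amazon Bedrock and should be checked against the current documentation, and the helper names are hypothetical.

```python
import json


def build_nova_canvas_request(prompt, width=1280, height=720,
                              cfg_scale=6.5, seed=0, quality="premium"):
    """Build a text-to-image request using the settings listed above."""
    return {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "width": width,
            "height": height,
            "cfgScale": cfg_scale,
            "seed": seed,
            "quality": quality,
        },
    }


def generate_image(prompt, model_id="amazon.nova-canvas-v1:0"):
    """Invoke the model and return image bytes; requires AWS credentials and model access."""
    import base64

    import boto3

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps(build_nova_canvas_request(prompt)),
    )
    payload = json.loads(response["body"].read())
    return base64.b64decode(payload["images"][0])


request_body = build_nova_canvas_request(
    "Dramatic wide-angle shot of a rugged mountain range at sunset"
)
print(json.dumps(request_body, indent=2))
```

Keeping the seed fixed makes results repeatable for a given prompt, which helps when comparing small prompt changes.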

Landscapes

Overhead perspective of winding river delta, capturing intricate branching waterways and sediment patterns. Soft morning light revealing subtle color gradations between water and land. Revealing landscape’s hidden fluid dynamics from bird’s-eye view.
Sparse arctic tundra landscape at twilight, expansive white terrain with isolated rock formations silhouetted against a deep blue sky. Low-contrast black and white composition capturing the infinite horizon, with subtle purple hues in the shadows. Ultra-wide-angle perspective emphasizing the vastness of negative space and geological simplicity.
Wide-angle aerial shot of patchwork agricultural terrain at golden hour, with long shadows accentuating the texture and topography of the land. Emphasis on the interplay of light and shadow across the geometric field divisions.
Dynamic drone perspective of a dramatic shoreline at golden hour, capturing long shadows cast by towering sea stacks and coastal cliffs. Hyper-detailed imagery showcasing the interplay of warm sunlight on rocky textures and the cool, foamy edges of incoming tides.
Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point.
Wide-angle capture of a hidden beach cove, surrounded by towering cliffs, with a shipwreck partially visible in the shallow waters.

Character portraits

A profile view of a weathered fisherman, silhouetted against a pastel dawn sky. The rim lighting outlines the shape of his beard and the texture of his knit cap. Rendered with high contrast to emphasize the rugged contours of his face and the determined set of his jaw.
A weathered fisherman with a thick gray beard and a knit cap, framed against the backdrop of a misty harbor at dawn. The image captures him in a medium shot, revealing more of his rugged attire. Cool, blue tones dominate the scene, contrasting with the warm highlights on his face.
An intimate portrait of a seasoned fisherman, his face filling the frame. His thick gray beard is flecked with sea spray, and his knit cap is pulled low over his brow. The warm glow of sunset bathes his weathered features in golden light, softening the lines of his face while still preserving the character earned through years at sea. His eyes reflect the calm waters of the harbor behind him.
A seaside cafe at sunrise, with a seasoned barista’s silhouette visible through the window. Their kind smile is illuminated by the warm glow of the rising sun, creating a serene atmosphere. The image has a dreamy, soft-focus quality with pastel hues.
A dynamic profile shot of a barista in motion, captured mid-conversation with a customer. Their smile is genuine and inviting, with laugh lines accentuating their seasoned experience. The cafe’s interior is rendered in soft bokeh, maintaining the cinematic feel with a shallow depth of field.
A front-facing portrait of an experienced barista, their welcoming smile framed by the sleek espresso machine. The background bustles with blurred cafe activity, while the focus remains sharp on the barista’s friendly demeanor. The lighting is contrasty, enhancing the cinematic mood.

Fashion photography

A model with sharp cheekbones and platinum pixie cut in a distressed leather bomber jacket stands amid red smoke in an abandoned subway tunnel. Wide-angle lens, emphasizing tunnel’s converging lines, strobed lighting creating a sense of motion.
A model with sharp cheekbones and platinum pixie cut wears a distressed leather bomber jacket, posed against a stark white cyclorama. Low-key lighting creates deep shadows, emphasizing the contours of her face. Shot from a slightly lower angle with a medium format camera, highlighting the jacket’s texture.
Close-up portrait of a model with defined cheekbones and a platinum pixie cut, emerging from an infinity pool while wearing a wet distressed leather bomber jacket. Shot from a low angle with a tilt-shift lens, blurring the background for a dreamy fashion magazine aesthetic.
A model with sharp cheekbones and platinum pixie cut is wearing a distressed leather bomber jacket, caught mid-laugh at a backstage fashion show. Black and white photojournalistic style, natural lighting.
Side profile of a model with defined cheekbones and a platinum pixie cut, standing still amidst the chaos of Chinatown at midnight. The distressed leather bomber jacket contrasts with the blurred neon lights in the background, creating a sense of urban solitude.

Product photography

A flat lay featuring a premium matte metal water bottle with bamboo accents, placed on a textured linen cloth. Eco-friendly items like a cork notebook, a sprig of eucalyptus, and a reusable straw are arranged around it. Soft, natural lighting casts gentle shadows, emphasizing the bottle’s matte finish and bamboo details. The background is an earthy tone like beige or light gray, creating a harmonious and sustainable composition.
Angled perspective of the premium water bottle with bamboo elements, positioned on a natural jute rug. Surrounding it are earth-friendly items: a canvas tote bag, a stack of recycled paper notebooks, and a terracotta planter with air-purifying plants. Warm, golden hour lighting casts long shadows, emphasizing textures and creating a cozy, sustainable atmosphere. The scene evokes a sense of eco-conscious home or office living.
An overhead view of the water bottle’s bamboo cap, partially unscrewed to reveal the threaded metal neck. Soft, even lighting illuminates the entire scene, showcasing the natural variations in the bamboo’s color and grain. The bottle’s matte metal body extends out of frame, creating a minimalist composition that draws attention to the sustainable materials and precision engineering.
An angled view of a premium matte metal water bottle with bamboo accents, showcasing its sleek profile. The background features a soft blur of a serene mountain lake. Golden hour sunlight casts a warm glow on the bottle’s surface, highlighting its texture. Captured with a shallow depth of field for product emphasis.
A pair of premium over-ear headphones with a matte black finish and gold accents, arranged in a flat lay on a clean white background. Organic leaves for accents. small notepad, pencils, and a carrying case are neatly placed beside the headphones, creating a symmetrical and balanced composition. Bright, diffused lighting eliminates shadows, emphasizing the sleek design without distractions. A shadowless, crisp aesthetic.
An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality.
An extreme macro shot focusing on the junction where the leather ear cushion meets the metallic housing of premium over-ear headphones. Sharp details reveal the precise stitching and material textures, while selective focus isolates this area against a softly blurred, dark background, showcasing the product’s premium construction.
An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting casts long shadows, accentuating the curves of the headband and the depth of the ear cups against a minimalist white background.
A dynamic composition of premium over-ear headphones floating in space, with the headband and ear cups slightly separated to showcase individual components. Rim lighting outlines each piece, while a gradient background adds depth and sophistication.
A smiling student holding up her smartphone, displaying a green matte screen for easy image replacement, in a classroom setting.
Overhead view of a young man typing on a laptop with a green matte screen, surrounded by work materials on a wooden table.

Food photography

Monochromatic macarons arranged in precise geometric pattern. Strong shadow play. Architectural lighting. Minimal composition.
A pyramid of macarons in ombre pastels, arranged on a matte black slate surface. Dramatic side lighting from left. Close-up view highlighting texture of macaron shells. Garnished with edible gold leaf accents. Shot at f/2 aperture for shallow depth of field.
Disassembled macaron parts in zero-g chamber. Textured cookie halves, viscous filling streams, and scattered almond slivers drifting. High-contrast lighting with subtle shadows on off-white. Wide-angle shot showcasing full dispersal pattern.

Architectural design

A white cubic house with floor-to-ceiling windows, interior view from living room. Double-height space, floating steel staircase, polished concrete floors. Late afternoon sunbeams streaming across minimal furnishings. Ultra-wide architectural lens.
A white cubic house with floor-to-ceiling windows, kitchen and dining space. Monolithic marble island, integrated appliances, dramatic shadows from skylight above. Shot from a low angle with a wide-angle lens, emphasizing the height and openness of the space, late afternoon golden hour light streaming in.
An angular white modernist house featuring expansive glass walls, photographed for Architectural Digest’s cover. Misty morning atmosphere, elongated infinity pool creating a mirror image, three-quarter aerial view, lush coastal vegetation framing the scene.
A white cubic house with floor-to-ceiling windows presented as detailed architectural blueprints. Site plan view showing landscaping and property boundaries, technical annotations, blue background with white lines, precise measurements and zoning specifications visible.
A white cubic house with floor-to-ceiling windows in precise isometric projection. X-ray style rendering revealing internal framework, electrical wiring, and plumbing systems. Technical cross-hatching on load-bearing elements and foundation.

Concept art

A stylized digital painting of a bustling plaza in a futuristic eco-city, with soft impressionistic brushstrokes. Crystalline towers frame the scene, while suspended gardens create a canopy overhead. Holographic displays and eco-friendly vehicles add life to the foreground. Dreamlike and atmospheric, with glowing highlights in sapphire and rose gold.
A stylized digital painting of an elevated park in a futuristic eco-city, viewed from a high angle, with soft impressionistic brushstrokes. Crystalline towers peek through a canopy of trees, while winding elevated walkways connect floating garden platforms. People relax in harmony with nature. Dreamlike and atmospheric, with glowing highlights in jade and amber.
Concept art of a floating garden platform in a futuristic city, viewed from below. Translucent roots and hanging vines intertwine with advanced technology, creating a mesmerizing canopy. Soft bioluminescent lights pulse through the vegetation, casting ethereal patterns on the ocean’s surface. A gradient of deep purples and blues dominates the twilight sky.
An enchanted castle atop a misty cliff at sunrise, warm golden light bathing the ivy-covered spires. A wide-angle view capturing a flock of birds soaring past the tallest tower, set against a dramatic sky with streaks of orange and pink. Mystical ambiance and dynamic composition.
A magical castle rising from morning fog on a rugged cliff face, bathed in cool blue twilight. A low-angle shot showcasing the castle’s imposing silhouette against a star-filled sky, with a crescent moon peeking through wispy clouds. Mysterious mood and vertical composition emphasizing height.
An enchanted fortress clinging to a mist-shrouded cliff, caught in the moment between night and day. A panoramic view from below, revealing the castle’s reflection in a tranquil lake at the base of the cliff. Ethereal pink and purple hues in the sky, with a V-formation of birds flying towards the castle. Serene atmosphere and balanced symmetry.

Illustration

Japanese ink wash painting of a cute baby dragon with pearlescent mint-green scales and tiny wings curled up in a nest made of cherry blossom petals. Delicate brushstrokes, emphasis on negative space.
Art nouveau-inspired composition centered on an endearing dragon hatchling with gleaming mint-green scales. Sinuous morning glory stems and blossoms intertwine around the subject, creating a harmonious balance. Soft, dreamy pastels and characteristic decorative elements frame the scene.
Watercolor scene of a cute baby dragon with pearlescent mint-green scales crouched at the edge of a garden puddle, tiny wings raised. Soft pastel flowers and foliage frame the composition. Loose, wet-on-wet technique for a dreamy atmosphere, with sunlight glinting off ripples in the puddle.
A playful, hand-sculpted claymation-style baby dragon with pearlescent mint scales and tiny wings, sitting on a puffy marshmallow cloud. Its soft, rounded features and expressive googly eyes give it a lively, mischievous personality as it giggles and flaps its stubby wings, trying to take flight in a candy-colored sky.
A whimsical, animated-style render of a baby dragon with pearlescent mint scales nestled in a bed of oversized, bioluminescent flowers. The floating island garden is bathed in the warm glow of sunset, with fireflies twinkling like stars. Dynamic lighting accentuates the dragon’s playful expression.

Graphic design

A set of minimalist icons for a health tracking app. Dual-line design with 1.5px stroke weight on solid backgrounds. Each icon uses teal for the primary line and a lighter shade for the secondary line, with ample negative space. Icons maintain consistent 64x64px dimensions with centered compositions. Clean, professional aesthetic suitable for both light and dark modes.
Stylized art deco icons for fitness tracking. Geometric abstractions of health symbols with gold accents. Balanced designs incorporating circles, triangles, and zigzag motifs. Clean and sophisticated.
Set of charming wellness icons for digital health tracker. Organic, hand-drawn aesthetic with soft, curvy lines. Uplifting color combination of lemon yellow and fuchsia pink. Subtle size variations among icons for a dynamic, handcrafted feel.
Lush greenery tapestry in 16:9 panoramic view. Detailed monstera leaves overlap in foreground, giving way to intricate ferns and tendrils. Emerald and sage watercolor washes create atmospheric depth. Foliage density decreases towards center, suggesting an enchanted forest clearing.
Modern botanical line drawing in 16:9 widescreen. Forest green single-weight outlines of stylized foliage. Negative space concentrated in the center for optimal text placement. Geometric simplification of natural elements with a focus on curves and arcs.
3D sculptural typography spelling out “BRAVE” with each letter made from a different material, arranged in a dynamic composition.
Experimental typographic interpretation of “BRAVE” using abstract, interconnected geometric shapes that flow and blend organically. Hyper-detailed textures reminiscent of fractals and natural patterns create a mesmerizing, otherworldly appearance with sharp contrast.
A dreamy photograph overlaid with delicate pen-and-ink drawings, blending reality and fantasy to reveal hidden magic in ordinary moments.
Surreal digital collage blending organic and technological elements in a futuristic style.
Abstract figures emerging from digital screens, gradient color transitions, mixed textures, dynamic composition, conceptual narrative style.
Abstract humanoid forms materializing from multiple digital displays, vibrant color gradients flowing between screens, contrasting smooth and pixelated textures, asymmetrical layout with visual tension, surreal storytelling aesthetic.
Abstract figures emerging from digital screens, glitch art aesthetic with RGB color shifts, fragmented pixel clusters, high contrast scanlines, deep shadows cast by volumetric lighting.
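
Any of the prompts above can also be submitted programmatically. The following is a minimal sketch, not the authors' code, of sending a prompt to Nova Canvas through the Amazon Bedrock runtime API with boto3. The model identifier, Region, image dimensions, and the helper names build_text_to_image_request and generate_images are assumptions made for illustration. Confirm the model ID and supported parameters in the Amazon Nova documentation before relying on them.

```python
import base64
import json

# Assumed model identifier; confirm it in the Amazon Bedrock console.
MODEL_ID = "amazon.nova-canvas-v1:0"


def build_text_to_image_request(prompt, width=1024, height=1024,
                                seed=0, number_of_images=1):
    """Build a text-to-image request body (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {
            "numberOfImages": number_of_images,
            "width": width,
            "height": height,
            "seed": seed,
        },
    }


def generate_images(prompt, region="us-east-1", **config):
    """Invoke the model and return the generated images as raw bytes.

    Requires boto3, AWS credentials, and model access in the chosen Region
    (the Region shown here is an assumption).
    """
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(build_text_to_image_request(prompt, **config)),
    )
    payload = json.loads(response["body"].read())
    # The response is assumed to carry base64-encoded images under "images".
    return [base64.b64decode(image) for image in payload["images"]]
```

For a widescreen prompt such as the 16:9 botanical examples, a call like generate_images(prompt, width=1280, height=720) would request a matching aspect ratio, assuming those dimensions are supported. Fixing the seed makes it easier to compare how wording changes alone affect the result.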

Conclusion

The examples showcased here are just the beginning of what’s possible with Amazon Nova Canvas. For even greater control, you can guide generations with reference images, use custom color palettes, or make precise edits, such as swapping backgrounds or refining details, with simple inputs. Plus, with built-in safeguards such as watermarking and content moderation, Nova Canvas offers a responsible and secure creative experience. Whether you’re a professional creator, a marketing team, or an innovator with a vision, Nova Canvas provides the tools to bring your ideas to life.

We invite you to explore these possibilities yourself and discover how Nova Canvas can transform your creative process. Stay tuned for our next installment, where we’ll dive into the exciting world of video generation with Amazon Nova Reel.

Ready to start creating? Visit the Amazon Bedrock console today and bring your ideas to life with Nova Canvas. For more information about features, specifications, and additional examples, explore our documentation on creative content generation with Amazon Nova.


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world-class design. As a Sr. Solutions Architect within Amazon AGI, he influences the development of Amazon’s first-party generative AI models. Kris is passionate about empowering users and creators of all types with generative AI tools and knowledge.

Sanju Sunny is a Generative AI Design Technologist with AWS Prototyping & Cloud Engineering (PACE), specializing in strategy, engineering, and customer experience. He collaborates with customers across diverse industries, leveraging Amazon’s customer-obsessed innovation mechanisms to rapidly conceptualize, validate, and prototype innovative products, services, and experiences.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.
