Deploy a serverless web application to edit images using Amazon Bedrock

Generative AI adoption across industries is transforming many types of applications, including image editing. Image editing is used in sectors such as graphic design, marketing, and social media, where users typically rely on specialized tools, and building a custom solution for this task can be complex. However, by combining a few AWS services, you can quickly deploy a serverless solution to edit images. This approach gives your teams access to image editing foundation models (FMs) through Amazon Bedrock.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that’s best suited for your use case. Amazon Bedrock is serverless, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage infrastructure.

Amazon Titan Image Generator G1 is an AI FM available with Amazon Bedrock that allows you to generate an image from text, or upload and edit your own image. Some of the key features we focus on include inpainting and outpainting.

This post introduces a solution that simplifies the deployment of a web application for image editing using AWS serverless services. We use AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon Bedrock with the Amazon Titan Image Generator G1 model to build an application to edit images using prompts. We cover the inner workings of the solution to help you understand the function of each service and how they are connected to give you a complete solution. At the time of writing this post, Amazon Titan Image Generator G1 comes in two versions; for this post, we use version 2.

Solution overview

The following diagram provides an overview of the solution and highlights its key components. The architecture uses Amazon Cognito for user authentication and Amplify as the hosting environment for the frontend application. A combination of API Gateway and a Lambda function provides the backend services, and Amazon Bedrock provides access to the FM so users can edit images using prompts.

[Diagram: Solution overview]

Prerequisites

You must have the following in place to complete the solution in this post:

Deploy solution resources using AWS CloudFormation

When you run the AWS CloudFormation template, the following resources are deployed:

  • Amazon Cognito resources:
  • Lambda resources:
    • Function: <Stack name>-ImageEditBackend-<auto-generated>
  • AWS Identity and Access Management (IAM) resources:
    • IAM role: <Stack name>-ImageEditBackendRole-<auto-generated>
    • IAM inline policy: AmazonBedrockAccess (this policy allows Lambda to invoke Amazon Bedrock FM amazon.titan-image-generator-v2:0)
  • API Gateway resources:
    • REST API: ImageEditingAppBackendAPI
    • Methods:
      • OPTIONS – Added header mapping for CORS
      • POST – Lambda integration
    • Authorization: Through Amazon Cognito using CognitoAuthorizer

After you deploy the CloudFormation template, copy the following from the Outputs tab to be used during the deployment of Amplify:

  • userPoolId
  • userPoolClientId
  • invokeUrl

[Screenshot: CloudFormation stack outputs]

Deploy the Amplify application

You have to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

  1. Download the frontend code from the GitHub repo.
  2. Unzip the downloaded file and navigate to the folder.
  3. In the js folder, find the config.js file and replace the values of XYZ for userPoolId, userPoolClientId, and invokeUrl with the values you collected from the CloudFormation stack outputs. Set the region value based on the Region where you’re deploying the solution.

The following is an example config.js file:

window._config = {
    cognito: {
        userPoolId: 'XYZ', // e.g. us-west-2_uXboG5pAb
        userPoolClientId: 'XYZ', // e.g. 25ddkmj4v6hfsfvruhpfi7n4hv
        region: 'XYZ' // e.g. us-west-2
    },
    api: {
        invokeUrl: 'XYZ' // e.g. https://rc7nyt4tql.execute-api.us-west-2.amazonaws.com/prod
    }
};

[Screenshot: Extract the files and update the config.js file]

  4. Select all the files and compress them as shown in the following screenshot.

Make sure you zip the contents and not the top-level folder. For example, if your build output generates a folder named AWS-Amplify-Code, navigate into that folder, select all the contents, and zip them.

[Screenshot: Create a new .zip file]

  5. Use the new .zip file to manually deploy the application in Amplify.

After it’s deployed, you will receive a domain that you can use in later steps to access the application.

[Screenshot: Create a new app in AWS Amplify]

  6. Create a test user in the Amazon Cognito user pool.

An email address is required for this user because you will need to mark the email address as verified.

[Screenshot: Create a user in Amazon Cognito]

  7. Return to the Amplify page and use the domain it automatically generated to access the application.

Use Amazon Cognito for user authentication

Amazon Cognito is an identity platform that you can use to authenticate and authorize users. We use Amazon Cognito in our solution to verify the user before they can use the image editing application.

Upon accessing the Image Editing Tool URL, you will be prompted to sign in with a previously created test user. For first-time sign-ins, users will be asked to update their password. After this process, the user’s credentials are validated against the records stored in the user pool. If the credentials match, Amazon Cognito will issue a JSON Web Token (JWT). In the API payload to be sent section of the page, you will notice that the Authorization field has been updated with the newly issued JWT.
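To see what this authentication flow looks like outside the browser, the following Python sketch uses boto3 to sign in the test user and retrieve the ID token (the JWT). This is an illustration, not code from the solution; the client ID and credentials are placeholders, and the USER_PASSWORD_AUTH flow must be enabled on the app client for it to work.

import boto3

# Placeholders: use the userPoolClientId from the CloudFormation stack outputs
CLIENT_ID = "25ddkmj4v6hfsfvruhpfi7n4hv"
REGION = "us-west-2"

cognito = boto3.client("cognito-idp", region_name=REGION)

# Authenticate the test user (requires USER_PASSWORD_AUTH enabled on the app client)
response = cognito.initiate_auth(
    ClientId=CLIENT_ID,
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "testuser", "PASSWORD": "YourPassword123!"},
)

# The ID token is the JWT the frontend places in the Authorization field
id_token = response["AuthenticationResult"]["IdToken"]
print(id_token[:40], "...")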

Use Lambda for backend code and Amazon Bedrock for generative AI function

The backend code is hosted on Lambda and invoked by user requests routed through API Gateway. The Lambda function processes the request payload and forwards it to Amazon Bedrock. The reply from Amazon Bedrock follows the same route back to the frontend.
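The following minimal Lambda handler sketches this flow. It assumes the frontend sends a request body that Amazon Bedrock can consume directly and omits the validation logic; it illustrates the pattern rather than reproducing the exact function deployed by the CloudFormation template.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
MODEL_ID = "amazon.titan-image-generator-v2:0"


def lambda_handler(event, context):
    # With Lambda proxy integration, API Gateway delivers the POST body as a string
    payload = json.loads(event["body"])

    # Forward the payload to Amazon Bedrock
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps(payload),
    )
    result = json.loads(response["body"].read())

    # result["images"] contains the generated images as base64 strings
    return {
        "statusCode": 200,
        "headers": {"Access-Control-Allow-Origin": "*"},  # CORS for the Amplify frontend
        "body": json.dumps(result),
    }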

Use API Gateway for API management

API Gateway streamlines API management, allowing developers to deploy, maintain, monitor, secure, and scale their APIs effortlessly. In our use case, API Gateway serves as the orchestrator for the application logic and provides throttling to manage the load to the backend. Without API Gateway, you would need to use the JavaScript SDK in the frontend to interact directly with the Amazon Bedrock API, bringing more work to the frontend.

Use Amplify for frontend code

Amplify offers a development environment for building secure, scalable mobile and web applications. It allows developers to focus on their code rather than worrying about the underlying infrastructure. Amplify also integrates with many Git providers. For this solution, we manually upload our frontend code using the method outlined earlier in this post.

Image editing tool walkthrough

Navigate to the URL provided after you created the application in Amplify and sign in. On your first login attempt, you'll be asked to reset your password.

[Screenshot: Application sign-in page]

As you follow the steps for this tool, you will notice the API Payload to be Sent section on the right side updating dynamically, reflecting the details mentioned in the corresponding steps that follow.

Step 1: Create a mask on your image

To create a mask on your image, choose a file (JPEG, JPG, or PNG).

After the image is loaded, the frontend converts the file into base64 and the base_image value is updated.

As you select the portion of the image you want to edit, a mask is created and the mask value is updated with a new base64 value. You can also use the stroke size option to adjust the area you're selecting.

You now have the original image and the mask image encoded in base64. (The Amazon Titan Image Generator G1 model requires the inputs to be in base64 encoding.)

[Screenshot: Choose a file and create a mask]
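If you want to reproduce this encoding step outside the browser, the following Python sketch shows one way to base64-encode an image and its mask; the file names are placeholders.

import base64


def encode_image(path):
    # Read an image file and return its base64-encoded string
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


base_image = encode_image("driveway.png")       # placeholder file names
mask_image = encode_image("driveway_mask.png")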

Step 2: Write a prompt and set your options

Write a prompt that describes what you want to do with the image. For this example, we enter Make the driveway clear and empty. This is reflected in the prompt on the right.

You can choose from the following image editing options: inpainting and outpainting. The value for mode is updated depending on your selection.

  • Use inpainting to remove masked elements and replace them with background pixels
  • Use outpainting to extend the pixels of the masked image to the image boundaries
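The frontend payload uses its own field names (such as mode), which the backend maps to the request schema that Amazon Titan Image Generator expects. Based on that published schema, and reusing the base64 strings from the earlier sketch, the Bedrock-side request for inpainting should look roughly like the following; outpainting instead uses outPaintingParams with an additional outPaintingMode field, and the values shown are illustrative.

payload = {
    "taskType": "INPAINTING",  # or "OUTPAINTING"
    "inPaintingParams": {
        "image": base_image,      # base64-encoded original image
        "maskImage": mask_image,  # base64-encoded mask
        "text": "Make the driveway clear and empty",
    },
    "imageGenerationConfig": {
        "numberOfImages": 2,  # the app renders two outputs
        "quality": "standard",
        "cfgScale": 8.0,
    },
}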

Choose Send to API to send the payload to API Gateway. This action invokes the Lambda function, which validates the received payload. If validation succeeds, the Lambda function invokes the Amazon Bedrock API for further processing.

The Amazon Bedrock API generates two image outputs in base64 format, which are transmitted back to the frontend application and rendered as visual images.

[Screenshot: Prompt and image editing options]

Step 3: View and download the result

The following screenshot shows the results of our test. You can download the results or provide an updated prompt to get a new output.

[Screenshot: Viewing and downloading the results]

Testing and troubleshooting

When you initiate the Send to API action, the system performs a validation check. If required information is missing or incorrect, it will display an error notification. For instance, if you attempt to send an image to the API without providing a prompt, an error message will appear on the right side of the interface, alerting you to the missing input, as shown in the following screenshot.

[Screenshot: Error notification for missing input]

Clean up

If you decide to discontinue using the Image Editing Tool, you can follow these steps to remove the Image Editing Tool, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

  1. Delete the CloudFormation stack:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during the deployment process (you assigned a name to it).
    3. Select the stack and choose Delete.
  2. Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion

In this post, we explored a sample solution that you can use to deploy an image editing application by using AWS serverless services and generative AI services. We used Amazon Bedrock and an Amazon Titan FM that allows you to edit images by using prompts. By adopting this solution, you gain the advantage of using AWS managed services, so you don’t have to maintain the underlying infrastructure. Get started today by deploying this sample solution.

Additional resources

To learn more about Amazon Bedrock, see the following resources:

To learn more about the Amazon Titan Image Generator G1 model, see the following resources:


About the Authors

Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He enjoys helping customers in the travel and hospitality industry to design, implement, and support cloud infrastructure. With a passion for networking services and years of experience, he helps customers adopt various AWS networking services. Outside of work, Salman enjoys photography, traveling, and watching his favorite sports teams.

Sergio Barraza is a Senior Enterprise Support Lead at AWS, helping energy customers design and optimize cloud solutions. With a passion for software development, he guides energy customers through AWS service adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.

Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.

Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.


Brilliant words, brilliant writing: Using AWS AI chips to quickly deploy Meta Llama 3-powered applications

Many organizations are building generative AI applications powered by large language models (LLMs) to boost productivity and build differentiated experiences. These LLMs are large and complex, and deploying them requires powerful computing resources and results in high inference costs. For businesses and researchers with limited resources, the high inference costs of generative AI models can be a barrier to enter the market, so more efficient and cost-effective solutions are needed. Most generative AI use cases involve human interaction, which requires AI accelerators that can deliver real-time response rates with low latency. At the same time, the pace of innovation in generative AI is increasing, and it's becoming more challenging for developers and researchers to quickly evaluate and adopt new models to keep pace with the market.

One way to get started with LLMs such as Llama and Mistral is by using Amazon Bedrock. However, customers who want to deploy LLMs in their own self-managed workflows for greater control and flexibility over the underlying resources can run these LLMs optimized on AWS Inferentia2-powered Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. In this blog post, we will show how to use an Amazon EC2 Inf2 instance to cost-effectively deploy multiple industry-leading LLMs on AWS Inferentia2, a purpose-built AWS AI chip, so customers can quickly test the models and expose an API interface that supports performance benchmarking and downstream application calls.

Model introduction

There are many popular open source LLMs to choose from, and for this blog post, we will review three different use cases based on model expertise using Meta-Llama-3-8B-Instruct, Mistral-7B-instruct-v0.2, and CodeLlama-7b-instruct-hf.

| Model name | Release company | Number of parameters | Release time | Model capabilities |
| --- | --- | --- | --- | --- |
| Meta-Llama-3-8B-Instruct | Meta | 8 billion | April 2024 | Language understanding, translation, code generation, inference, chat |
| Mistral-7B-Instruct-v0.2 | Mistral AI | 7.3 billion | March 2024 | Language understanding, translation, code generation, inference, chat |
| CodeLlama-7b-Instruct-hf | Meta | 7 billion | August 2023 | Code generation, code completion, chat |

Meta-Llama-3-8B-Instruct is a popular language model released by Meta AI in April 2024. The Llama 3 model has improved pre-training, instant comprehension, output generation, coding, inference, and math skills. The Meta AI team says that Llama 3 has the potential to be the initiator of a new wave of innovation in AI. The Llama 3 model is available in two publicly released versions, 8B and 70B. At the time of writing, Llama 3.1 instruction-tuned models are available in 8B, 70B, and 405B versions. In this blog post, we will use the Meta-Llama-3-8B-Instruct model, but the same process can be followed for Llama 3.1 models.

Mistral-7B-instruct-v0.2, released by Mistral AI in March 2024, marks a major milestone in the development of publicly available foundation models. With its impressive performance, efficient architecture, and wide range of features, Mistral 7B v0.2 sets a new standard for user-friendly and powerful AI tools. The model excels at tasks ranging from natural language processing to coding, making it an invaluable resource for researchers, developers, and businesses. In this blog post, we will use the Mistral-7B-instruct-v0.2 model, but the same process can be followed for the Mistral-7B-instruct-v0.3 model.

CodeLlama-7b-instruct-hf is a collection of models published by Meta AI. It is an LLM that uses text prompts to generate code. Code Llama is aimed at code tasks, making developers’ workflow faster and more efficient and lowering the learning threshold for coders. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more powerful and well-documented software.

Solution architecture

The solution uses a client-server architecture. The client uses the HuggingFace Chat UI to provide a chat page that can be accessed on a PC or mobile device. Server-side model inference uses Hugging Face's Text Generation Inference, an efficient LLM inference framework that runs in a Docker container. We pre-compiled the model using Hugging Face's Optimum Neuron and uploaded the compilation results to the Hugging Face Hub. We have also added a model switching mechanism to the HuggingFace Chat UI to control the loading of different models in the Text Generation Inference container through a scheduler.

Solution highlights

  1. All components are deployed on a single-chip Inf2 instance (inf2.xl or inf2.8xl), so users can experience the effects of multiple models on one instance.
  2. With the client-server architecture, users can flexibly replace either the client or the server side according to their actual needs. For example, the model can be deployed in Amazon SageMaker, and the frontend Chat UI can be deployed on the Node server. To facilitate the demonstration, we deployed both the front and back ends on the same Inf2 server.
  3. Using a publicly available framework, users can customize frontend pages or models according to their own needs.
  4. Using an API interface for Text Generation Inference facilitates quick access for users using the API.
  5. Deployment using AWS CloudFormation, suitable for all types of businesses and developers within the enterprise.

Main components

The following are the main components of the solution.

Hugging Face Optimum Neuron

Optimum Neuron is an interface between the HuggingFace Transformers library and the AWS Neuron SDK. It provides a set of tools for model loading, training, and inference on single- and multi-accelerator setups across different downstream tasks. In this article, we mainly used Optimum Neuron's export interface. To deploy a HuggingFace Transformers model on Neuron devices, the model needs to be compiled and exported to a serialized format before inference is performed. The export interface compiles the model ahead of time (AOT) using the Neuron compiler (neuronx-cc) and converts it into a serialized and optimized TorchScript module. This is shown in the following figure.

During the compilation process, we introduced a tensor parallelism mechanism to split the weights, data, and computations between the two NeuronCores. For more compilation parameters, see Export a model to Inferentia.
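As a sketch of what this pre-compilation step can look like with the Optimum Neuron Python API (the model ID, input shapes, and output path here are illustrative assumptions, not values from this solution):

from optimum.neuron import NeuronModelForCausalLM

# Export (compile) the model ahead of time for Inferentia2.
# num_cores=2 splits the weights across the two NeuronCores of one chip,
# matching the tensor parallelism described above.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model ID
    export=True,
    batch_size=1,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="fp16",
)

# Save the compiled artifacts; these can then be uploaded to the Hugging Face Hub
model.save_pretrained("./llama-3-8b-neuron")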

Hugging Face’s Text Generation Inference (TGI)

Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. TGI provides high performance text generation services for the most popular publicly available foundation LLMs. Its main features are:

  • Simple launcher that provides inference services for many LLMs
  • Supports both generate and stream interfaces
  • Token stream using server-sent events (SSE)
  • Supports AWS Inferentia, Trainium, NVIDIA GPUs and other accelerators

HuggingFace Chat UI

HuggingFace Chat UI is an open-source chat tool built with SvelteKit that can be deployed to Cloudflare, Netlify, Node, and so on. It has the following main features:

  • Page can be customized
  • Conversation records can be stored, and chat records are stored in MongoDB
  • Supports operation on PC and mobile terminals
  • The backend can connect to Text Generation Inference and supports API interfaces such as Anthropic, Amazon SageMaker, and Cohere
  • Compatible with various publicly available foundation models (Llama series, Mistral/Mixtral series, Falcon, and so on)

Thanks to the page customization capabilities of the Hugging Chat UI, we’ve added a model switching function, so users can switch between different models on the same EC2 Inf2 instance.

Solution deployment

  1. Before deploying the solution, make sure you have an inf2.xl or inf2.8xl usage quota in the us-east-1 (Virginia) or us-west-2 (Oregon) AWS Region. See the reference link for how to apply for a quota.
  2. Sign in to the AWS Management Console and switch the Region to us-east-1 (Virginia) or us-west-2 (Oregon) in the upper right corner of the console page.
  3. Enter CloudFormation in the service search box and choose Create stack.
  4. Select Choose an existing template, and then select Amazon S3 URL.
  5. If you plan to use an existing virtual private cloud (VPC), use the steps in a; if you plan to create a new VPC to deploy, use the steps in b.
    1. Use an existing VPC.
      1. Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_default_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
      2. Stack name: Enter the stack name.
      3. InstanceType: Select inf2.xl (lower cost) or inf2.8xl (better performance).
      4. KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
      5. VpcId: Select VPC.
      6. PublicSubnetId: Select a public subnet.
      7. VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
      8. Choose Next, then Next again. Choose Submit.
    2. Create a new VPC.
      1. Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_new_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
      2. Stack name: Enter the stack name.
      3. InstanceType: Select inf2.xl or inf2.8xl.
      4. KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
      5. VpcId: Leave as New.
      6. PublicSubnetId: Leave as New.
      7. VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
      8. Choose Next, and then Next again. Then choose Submit.
  6. After creating the stack, wait for the resources to be created and started (about 15 minutes). After the stack status displays CREATE_COMPLETE, choose Outputs, and open the URL that appears as the value for the key Public endpoint for the web server (close all VPN connections and firewall programs).

User interface

After the solution is deployed, users can access the preceding URL on a PC or mobile phone. On the page, the Llama3-8B model will be loaded by default. Users can switch models in the menu settings by selecting the model name to activate in the model list and choosing Activate. Switching models requires reloading the new model into the Inferentia2 accelerator memory, which takes about 1 minute. During this process, users can check the loading status of the new model by choosing Retrieve model status. If the status is Available, the new model has been successfully loaded.

The effects of the different models are shown in the following figure:

The following figure shows the solution in a browser on a PC:

API interface and performance testing

The solution uses a Text Generation Inference server, which supports the /generate and /generate_stream interfaces and uses port 8080 by default. You can make API calls by replacing <IP> in the following commands with the IP address of the instance deployed previously.

The /generate interface is used to return all responses to the client at once after generating all tokens on the server side.

curl <IP>:8080/generate \
    -X POST \
    -d '{"inputs": "Calculate the distance from Beijing to Shanghai"}' \
    -H 'Content-Type: application/json'

/generate_stream is used to reduce waiting delays and enhance the user experience by returning tokens one by one when the model output is relatively long.

curl <IP>:8080/generate_stream \
    -X POST \
    -d '{"inputs": "Write an essay on the mental health of elementary school students with no more than 300 words."}' \
    -H 'Content-Type: application/json'

Here is sample code that calls the /generate interface using the requests library in Python.

import requests

url = "http://<IP>:8080/generate"
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Calculate the distance from Beijing to Shanghai",
    "parameters": {"max_new_tokens": 200},
}

response = requests.post(url, headers=headers, json=data)
print(response.text)
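For the /generate_stream interface, a similar sketch can consume the server-sent events token by token; the parsing below assumes TGI's standard data: framing for SSE messages.

import json

import requests

url = "http://<IP>:8080/generate_stream"
data = {
    "inputs": "Write an essay on the mental health of elementary school students "
              "with no more than 300 words.",
    "parameters": {"max_new_tokens": 300},
}

# stream=True keeps the connection open so tokens arrive as they are generated
with requests.post(url, json=data, stream=True) as response:
    for line in response.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            # each SSE event carries the newly generated token
            print(event["token"]["text"], end="", flush=True)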

Summary

In this blog post, we introduced methods and examples of deploying popular LLMs on AWS AI chips, so that users can quickly experience the productivity improvements provided by LLMs. The models deployed on Inf2 instances have been validated by multiple users and scenarios, showing strong performance and wide applicability. AWS is continuously expanding its application scenarios and features to provide users with efficient and economical computing capabilities. See Inf2 Inference Performance to check the types and list of models supported on the Inferentia2 chip. Contact us to give feedback on your needs or ask questions about deploying LLMs on AWS AI chips.

References


About the authors

Zheng Zhang is a technical expert for Amazon Web Services machine learning products, focusing on Amazon Web Services-based accelerated computing and GPU instances. He has rich experience in large-scale model training and inference acceleration in machine learning.

Bingyang Huang is a Go-To-Market Specialist of Accelerated Computing on the GCR SSO GenAI team. She has experience deploying AI accelerators in customers' production environments. Outside of work, she enjoys watching films and exploring good food.

Tian Shi is a Senior Solution Architect at Amazon Web Services. He has rich experience in cloud computing, data analysis, and machine learning and is currently dedicated to research and practice in the fields of data science, machine learning, and serverless. His translations include Machine Learning as a Service, DevOps Practices Based on Kubernetes, Practical Kubernetes Microservices, Prometheus Monitoring Practice, and CoreDNS Study Guide in the Cloud Native Era.

Chuan Xie is a Senior Solution Architect at Amazon Web Services Generative AI, responsible for the design, implementation, and optimization of generative artificial intelligence solutions based on the Amazon Cloud. River has many years of production and research experience in the communications, ecommerce, and internet industries, and rich practical experience in data science, recommendation systems, LLM RAG, and more. He holds multiple AI-related product technology invention patents.


Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 2

In Part 1 of this series, we explored best practices for creating accurate and reliable agents using Amazon Bedrock Agents. Amazon Bedrock Agents help you accelerate generative AI application development by orchestrating multistep tasks. Agents use the reasoning capability of foundation models (FMs) to create a plan that decomposes the problem into multiple steps. The model is augmented with the developer-provided instruction to create an orchestration plan and then carry out the plan. The agent can use company APIs and external knowledge through Retrieval Augmented Generation (RAG).

In this second part, we dive into the architectural considerations and development lifecycle practices that can help you build robust, scalable, and secure intelligent agents. Whether you are just starting to explore the world of conversational AI or looking to optimize your existing agent deployments, this comprehensive guide can provide valuable long-term insights and practical tips to help you achieve your goals.

Enable comprehensive logging and observability

From the outset of your agent development journey, you should implement thorough logging and observability practices. This is crucial for debugging, auditing, and troubleshooting your agents. The first step to achieve comprehensive logging is to enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account.

Amazon Bedrock Agents also provides you with traces, a detailed overview of the steps being orchestrated by the agents, the underlying prompts invoking the FM, the references being returned from the knowledge bases, and code being generated by the agent. Trace events are streamed in real time, which allows you to customize UX cues to keep the end-user informed about the progress of their request. You can log your agent’s traces and use them to track and troubleshoot your agents.

When moving agent applications to production, it’s a best practice to set up a monitoring workflow to continuously analyze your logs. You can do so by either creating a custom solution or using an open source solution such as Bedrock-ICYM.

Use infrastructure as code

Just as you would with any other software development project, you should use infrastructure as code (IaC) frameworks to facilitate iterative and reliable deployment. This lets you create repeatable and production-ready agents that can be readily reproduced, tested, and monitored. Amazon Bedrock Agents allows you to write IaC code with AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform. We also recommend that you get started using our Agent Blueprints construct. We provide blueprint templates of the most common capabilities of Amazon Bedrock Agents, which can be deployed and updated with a single AWS CDK command.

When creating agents that use action groups, you can specify your function definitions as a JSON object to the agent or provide an API schema in the OpenAPI schema format. If you already have an OpenAPI schema for your application, the best practice is to start with it. Make sure the functions have proper natural language descriptions, because your agent will use them to understand when to use each function. If you’re starting with no existing schema, the simplest way to provide tool metadata for your agent is to use simple JSON function definitions. Either way, you can use the Amazon Bedrock console to quickly create a default AWS Lambda function to get started implementing your actions or tools.
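As a hedged illustration of the JSON function definition approach (the agent ID, Lambda ARN, and function details below are placeholders, not values from this post), creating an action group with boto3 might look like the following:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Illustrative function definition for an HR assistant action group
response = bedrock_agent.create_agent_action_group(
    agentId="AGENT_ID",  # placeholder
    agentVersion="DRAFT",
    actionGroupName="time-off-actions",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-east-1:111122223333:function:time-off"
    },
    functionSchema={
        "functions": [
            {
                "name": "get_available_time_off",
                # the natural language description the agent uses to pick this tool
                "description": "Returns the number of vacation days the employee has left.",
                "parameters": {
                    "employee_id": {
                        "type": "string",
                        "description": "The employee's unique identifier.",
                        "required": True,
                    }
                },
            }
        ]
    },
)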

After you start to scale the development of agents, you should consider the reusability of the agent’s components. Using IaC will allow you to have predefined guardrails using Amazon Bedrock Guardrails, knowledge bases using Amazon Bedrock Knowledge Bases, and action groups that are reused over multiple agents.

Building agents that run tasks requires function definitions and Lambda functions. Another best practice is to use generative AI to accelerate the development and maintenance of this code. You can do so directly with the invoke model functionality in Amazon Bedrock, using the Amazon Q Developer support or even by creating an AWS PartyRock application that creates a framework of your Lambda function based on your action group metadata. You can directly generate the IaC required for creating your agents with function definitions and Lambda connections using generative AI. Independently of the approach selected, creating a test pipeline that validates and runs the IaC will help you optimize your agent solutions.

Use SessionState for additional agent context

You can use SessionState to provide additional context to your agent. You can pass information that is only available to the Lambda function in the action groups using SessionAttribute and information that should be available to your prompt as SessionPromptAttribute. For example, if you want to pass a user authentication token for your action to use, it’s best placed as a SessionAttribute. If you want to pass information that the large language model (LLM) needs to reason about, such as the current date and timestamp to define relative dates, it’s best placed as a SessionPromptAttribute. This lets your agent infer things like the number of days before your next payment due date or how many hours it has been since you placed your order using the reasoning capabilities of the underlying LLM.
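A minimal sketch of passing both kinds of attributes when invoking an agent with boto3 follows; the IDs and attribute values are placeholders.

import uuid

import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.invoke_agent(
    agentId="AGENT_ID",  # placeholders
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="How many days until my next payment due date?",
    sessionState={
        # visible only to the Lambda functions in action groups
        "sessionAttributes": {"authToken": "PLACEHOLDER_TOKEN"},
        # injected into the prompt so the LLM can reason about it
        "promptSessionAttributes": {"currentDate": "2024-10-15"},
    },
)

# the agent response is streamed back as chunk events
for event in response["completion"]:
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"), end="")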

Optimize model selection for cost and performance

A key part of the agent building process is to select the underlying FM for your agent (or for each sub-agent). Experiment with available FMs to select the best one for your application based on cost, latency, and accuracy requirements. Implement automated testing pipelines to collect evaluation metrics, enabling data-driven decisions on model selection. This approach allows you to use faster, cheaper models like Anthropic’s Claude 3 Haiku on Amazon Bedrock for simple agents, and more complex applications can use more advanced models like Anthropic’s Claude 3.5 Sonnet or Anthropic’s Claude 3 Opus.

Implement robust testing frameworks

Automating the evaluation of your agent, or any generative AI-powered system, can accelerate the development process and make sure you provide your customers with the best possible solution. You should evaluate on multiple dimensions, including cost, latency, and accuracy of your agents. Use frameworks like Agent Evaluation to assess agent behavior against predefined criteria. By using the Amazon Bedrock agent versioning and alias features, you can unlock A/B testing as part of your deployment stages. You should define different aspects of agent behavior, such as formal or informal HR assistant tone, that can be tested with a subset of your user group. You can then make different agent versions available for each group during initial deployments and evaluate the agent behavior for each group. Amazon Bedrock Agents has built-in versioning capabilities to help you with this key part of testing. The following figure shows how the HR agent can be updated after a testing and evaluation phase to create a new alias pointing to the selected version of the agent for the model invocation.

Use LLMs for test case generation

You can use LLMs to generate test cases based on expected use cases for your agent. As a best practice, you should select a different LLM to generate data than the one that is powering your agent. This approach can significantly accelerate the building of comprehensive test suites, providing thorough coverage of potential scenarios. For example, you could use the following prompt to create test cases for an HR assistant agent that helps employees booking holidays:

Generate the conversation back and forth between an employee and an employee 
assistant agent. The employee is trying to reserve time off. 
The agent has access to functions for checking the employee's available time off, 
booking and updating time off, and sending notifications that a new time off booking 
has been completed. Here's a sample conversation between an employee and an employee 
assistant agent for booking time off. Your conversation should have at least 3 
interactions between the agent and the employee. The employee starts by saying hello.

Design robust confirmation and security mechanisms

Implement robust confirmation mechanisms for critical actions in your agent’s workflow. Clearly state in your instructions that the agent should ask for user confirmation before running certain functions, especially those that modify data or perform sensitive operations. This step helps move beyond proof of concept or prototype stages, verifying that your agent operates reliably in production environments. For instance, the following instruction tells your agent to confirm that a vacation request action should be run before updating the database for the user:

You are an HR agent, helping employees … [other instructions removed for brevity]

Before creating, editing or deleting a time-off request, ask for user confirmation
for your actions. Include sufficient information with that ask to be clear about
the action that will be taken. DO NOT provide the function name itself but rather focus
on the actions being executed using natural language.

You can also use the requireConfirmation field for function schema definitions or the x-requireConfirmation field for API schema definitions during the creation of a new action to enable the Amazon Bedrock Agents built-in functionality for user confirmation requests before invoking an action in an action group.

Implement flexible authorization and encryption

You should provide customer managed keys to encrypt your agent’s resources, and confirm that your AWS Identity and Access Management (IAM) permissions follow the least privilege approach, limiting your agent to only have access to required resources and actions. When implementing action groups, take advantage of the sessionAttributes parameter of your sessionState to provide information about your user roles and permissions so that your action can implement fine-grained permissions (see the following sample code). Another best practice is to use the knowledgeBaseConfigurations parameter of the sessionState to provide extra configurations to your knowledge base, such as the user group defining the documents that a user should have access to through knowledge base metadata filtering.
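As a sketch of this pattern, the following Lambda handler reads role information from sessionAttributes before running an action. The event fields follow the documented Amazon Bedrock Agents format for function-based action groups; the userRole attribute, the permission check, and the run_action helper are illustrative assumptions.

def run_action(function_name, parameters):
    # Hypothetical dispatcher for the real business logic
    return f"Executed {function_name}"


def lambda_handler(event, context):
    # Bedrock Agents passes session attributes with each action invocation
    session_attrs = event.get("sessionAttributes") or {}
    user_role = session_attrs.get("userRole", "employee")  # illustrative attribute

    function = event["function"]

    # Illustrative fine-grained permission check
    if function == "delete_time_off" and user_role != "manager":
        body = "You are not authorized to delete time-off requests."
    else:
        body = run_action(function, event.get("parameters", []))

    # Response shape expected for function-based action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": function,
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
        "sessionAttributes": session_attrs,
    }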

Integrate responsible AI practices

When developing generative AI applications, you should apply responsible AI practices to create systems in an ethical, transparent, and accountable manner. Amazon Bedrock features help you develop your responsible AI practices in a scalable manner. When creating agents, you should implement Amazon Bedrock Guardrails to avoid sensitive topics, filter user input and agent output from harmful content, and redact sensitive information to protect user privacy. You can create organization-level guardrails that can be reused across multiple generative AI applications, thereby preserving consistent responsible AI practices. After you create a guardrail, you can associate it with your agent using the Amazon Bedrock Agents built-in guardrails connection (see the following sample code).
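As a sketch, the following boto3 call associates an existing guardrail with an agent. The IDs are placeholders; note that update_agent replaces the agent configuration, so the agent's current name, role, and model must be supplied along with the guardrail.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.update_agent(
    agentId="AGENT_ID",  # placeholders
    agentName="hr-assistant",
    agentResourceRoleArn="arn:aws:iam::111122223333:role/BedrockAgentExecutionRole",
    foundationModel="anthropic.claude-3-haiku-20240307-v1:0",
    guardrailConfiguration={
        "guardrailIdentifier": "GUARDRAIL_ID",
        "guardrailVersion": "1",
    },
)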

Build a reusable actions catalog and scale gradually

After the successful deployment of your first agent, you can plan to reuse common functionalities, such as action groups, knowledge bases, and guardrails, for other applications. Amazon Bedrock Agents support the creation of agents manually using the AWS Management Console, using code with the SDKs available for the agent API, or using IaC with CloudFormation templates, the AWS CDK, or Terraform templates. To reuse functionality, the best practice is to create and deploy them using IaC and reuse the components across applications. The following figure shows an example of the reusability of a utilities action group across two agents: an HR assistant and a banking assistant.

Follow a crawl-walk-run methodology when scaling agent usage

The final best practice that we would like to highlight is to follow the crawl-walk-run methodology. Start with an internal application (crawl), followed with applications made available for a smaller, controlled set of external users (walk), and finally scale your applications to all customers (run) and eventually use multi-agent collaboration. This approach helps you build reliable agents that support mission-critical business operations, while minimizing risks associated with the rollout of new technology. The following figure illustrates this process.

Conclusion

By following these architectural and development lifecycle best practices, you’ll be well-equipped to create robust, scalable, and secure agents that can effectively serve your users and integrate seamlessly with your existing systems.

For examples to get started, check out the Amazon Bedrock samples repository. To learn more about Amazon Bedrock Agents, get started with the Amazon Bedrock Workshop and the standalone Amazon Bedrock Agents Workshop, which provides a deeper dive. Additionally, check out the service introduction video from AWS re:Invent 2023.


About the Authors

Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Navneet Sabbineni is a Software Development Manager at AWS Bedrock. With over 9 years of industry experience as a software developer and manager, he has worked on building and maintaining scalable distributed services for AWS, including generative AI services like Amazon Bedrock Agents and conversational AI services like Amazon Lex. Outside of work, he enjoys traveling and exploring the Pacific Northwest with his family and friends.

Monica Sunkara is a Senior Applied Scientist at AWS, where she works on Amazon Bedrock Agents. With over 10 years of industry experience, including 6 years at AWS, Monica has contributed to various AI and ML initiatives such as Alexa Speech Recognition, Amazon Transcribe, and Amazon Lex ASR. Her work spans speech recognition, natural language processing, and large language models. Recently, she worked on adding function calling capabilities to Amazon Titan text models. Monica holds a degree from Cornell University, where she conducted research on object localization under the supervision of Prof. Andrew Gordon Wilson before joining Amazon in 2018.


NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, NVIDIA and Deloitte are developing AI agents using NVIDIA AI to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

These virtual teammates can have natural, human-like conversations with patients, answer a wide range of questions and provide supporting guidance prior to preadmission appointments at hospitals.

Working with NVIDIA, Deloitte has developed Frontline AI Teammate for use in settings like hospitals, where the digital avatar can have practical conversations — in any language — that give the end user, such as a patient, instant answers to pressing questions.

Powered by the NVIDIA AI Enterprise software platform, Frontline AI Teammate includes avatars, generative AI and large language models.

“Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital humans can provide lifelike interactions that can enhance experiences for doctors and patients.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications. NIM includes an easily adaptable NIM Agent Blueprint developers can use to create interactive, AI-driven avatars that are ideal for telehealth — as well as NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient and more accessible to patients.

In addition to NIM microservices, James uses NVIDIA ACE and ElevenLabs digital human technologies to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Deloitte’s Frontline AI Teammate, powered by the NVIDIA AI Enterprise platform and built on Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed within the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

The avatar uses NVIDIA Riva for fluid, multilingual communication, helping ensure no patient is left behind due to language barriers. It’s also equipped with the NeMo Megatron-Turing 530B large language model for accurate understanding and processing of patient data. These advanced capabilities can make clinical visits less intimidating, especially for patients who may feel uneasy about medical environments.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of pre-operative information. Typically, they have only one preadmission appointment, many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may prevent them from asking all the necessary questions during these brief interactions.

This can result in patients arriving unprepared for their preadmission appointments, lacking knowledge about the appointment’s purpose, duration, location and necessary documents, and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte technologies, to provide more consistent, accurate and continuous access to information.

With the digital teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Post-surgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing of the digital teammate conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital. “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint.


Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint

Read More

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, NVIDIA and Deloitte are developing AI agents using NVIDIA AI to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

These virtual teammates can have natural, human-like conversations with patients, answer a wide range of questions and provide supporting guidance prior to preadmission appointments at hospitals.

Working with NVIDIA, Deloitte has developed Frontline AI Teammate for use in settings like hospitals, where the digital avatar can have practical conversations — in any language — that give the end user, such as a patient, instant answers to pressing questions.

Powered by the NVIDIA AI Enterprise software platform, Frontline AI Teammate includes avatars, generative AI and large language models.

”Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital humans can provide lifelike interactions that can enhance experiences for doctors and patients.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications. NIM includes an easily adaptable NIM Agent Blueprint developers can use to create interactive, AI-driven avatars that are ideal for telehealth — as well as NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient and more accessible to patients.

In addition to NIM microservices, James uses NVIDIA ACE and ElevenLabs digital human technologies to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Deloitte’s Frontline AI Teammate, powered by the NVIDIA AI Enterprise platform and built on Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed within the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

The avatar uses NVIDIA Riva for fluid, multilingual communication, helping ensure no patient is left behind due to language barriers. It’s also equipped with the NeMo Megatron-Turing 530B large language model for accurate understanding and processing of patient data. These advanced capabilities can make clinical visits less intimidating, especially for patients who may feel uneasy about medical environments.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of pre-operative information. Typically, they have only one preadmission appointment, many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may prevent them from asking all the necessary questions during these brief interactions.

This can result in patients arriving unprepared for their preadmission appointments, lacking knowledge about the appointment’s purpose, duration, location and necessary documents, and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte technologies, to provide more consistent, accurate and continuous access to information.

With the digital teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Post-surgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing of the digital teammate conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital.  “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint

Read More

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, NVIDIA and Deloitte are developing AI agents using NVIDIA AI to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

These virtual teammates can have natural, human-like conversations with patients, answer a wide range of questions and provide supporting guidance prior to preadmission appointments at hospitals.

This demo shows one virtual representative in action, answering patient questions:

Working with NVIDIA, Deloitte has developed Frontline AI Teammate for use in settings like hospitals, where the digital avatar can have practical conversations — in any language — that give the end user, such as a patient, instant answers to pressing questions.

Powered by the NVIDIA AI Enterprise software platform, Frontline AI Teammate includes avatars, generative AI and large language models.

”Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital humans can provide lifelike interactions that can enhance experiences for doctors and patients.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications. NIM includes an easily adaptable NIM Agent Blueprint developers can use to create interactive, AI-driven avatars that are ideal for telehealth — as well as NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient and more accessible to patients.

In addition to NIM microservices, James uses NVIDIA ACE and ElevenLabs digital human technologies to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Deloitte’s Frontline AI Teammate, powered by the NVIDIA AI Enterprise platform and built on Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed within the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

The avatar uses NVIDIA Riva for fluid, multilingual communication, helping ensure no patient is left behind due to language barriers. It’s also equipped with the NeMo Megatron-Turing 530B large language model for accurate understanding and processing of patient data. These advanced capabilities can make clinical visits less intimidating, especially for patients who may feel uneasy about medical environments.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of pre-operative information. Typically, they have only one preadmission appointment, many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may prevent them from asking all the necessary questions during these brief interactions.

This can result in patients arriving unprepared for their preadmission appointments, lacking knowledge about the appointment’s purpose, duration, location and necessary documents, and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte technologies, to provide more consistent, accurate and continuous access to information.

With the digital teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Post-surgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing of the digital teammate conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital.  “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint

Read More

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, NVIDIA and Deloitte are developing AI agents using NVIDIA AI to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

These virtual teammates can have natural, human-like conversations with patients, answer a wide range of questions and provide supporting guidance prior to preadmission appointments at hospitals.

This demo shows one virtual representative in action, answering patient questions:

Working with NVIDIA, Deloitte has developed Frontline AI Teammate for use in settings like hospitals, where the digital avatar can have practical conversations — in any language — that give the end user, such as a patient, instant answers to pressing questions.

Powered by the NVIDIA AI Enterprise software platform, Frontline AI Teammate includes avatars, generative AI and large language models.

”Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital humans can provide lifelike interactions that can enhance experiences for doctors and patients.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications. NIM includes an easily adaptable NIM Agent Blueprint developers can use to create interactive, AI-driven avatars that are ideal for telehealth — as well as NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient and more accessible to patients.

In addition to NIM microservices, James uses NVIDIA ACE and ElevenLabs digital human technologies to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Deloitte’s Frontline AI Teammate, powered by the NVIDIA AI Enterprise platform and built on Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed within the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

The avatar uses NVIDIA Riva for fluid, multilingual communication, helping ensure no patient is left behind due to language barriers. It’s also equipped with the NeMo Megatron-Turing 530B large language model for accurate understanding and processing of patient data. These advanced capabilities can make clinical visits less intimidating, especially for patients who may feel uneasy about medical environments.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of pre-operative information. Typically, they have only one preadmission appointment, many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may prevent them from asking all the necessary questions during these brief interactions.

This can result in patients arriving unprepared for their preadmission appointments, lacking knowledge about the appointment’s purpose, duration, location and necessary documents, and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte technologies, to provide more consistent, accurate and continuous access to information.

With the digital teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Post-surgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing of the digital teammate conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital.  “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint

Read More

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

NVIDIA Works With Deloitte to Deploy Digital AI Agents for Healthcare

Ahead of a visit to the hospital for a surgical procedure, patients often have plenty of questions about what to expect — and can be plenty nervous.

To help minimize presurgery jitters, NVIDIA and Deloitte are developing AI agents using NVIDIA AI to bring the next generation of digital, frontline teammates to patients before they even step foot inside the hospital.

These virtual teammates can have natural, human-like conversations with patients, answer a wide range of questions and provide supporting guidance prior to preadmission appointments at hospitals.

This demo shows one virtual representative in action, answering patient questions:

Working with NVIDIA, Deloitte has developed Frontline AI Teammate for use in settings like hospitals, where the digital avatar can have practical conversations — in any language — that give the end user, such as a patient, instant answers to pressing questions.

Powered by the NVIDIA AI Enterprise software platform, Frontline AI Teammate includes avatars, generative AI and large language models.

”Avatar-based conversational AI agents offer an incredible opportunity to reduce the productivity paradox that our healthcare system faces with digitization,” said Niraj Dalmia, partner at Deloitte Canada. “It could possibly be the complementary innovation that reduces administrative burden, complements our healthcare human resources to free up capacity and helps solve for patient experience challenges.”

Next-Gen Technologies Powering Digital Humans

Digital humans can provide lifelike interactions that can enhance experiences for doctors and patients.

Developers can tap into NVIDIA NIM microservices, which streamline the path for developing AI-powered applications and moving AI models into production, to craft digital humans for healthcare industry applications. NIM includes an easily adaptable NIM Agent Blueprint developers can use to create interactive, AI-driven avatars that are ideal for telehealth — as well as NVIDIA NeMo Retriever, an industry-leading embedding, retrieval and re-ranking model that allows for fast responses based on up-to-date healthcare data.

Customizable digital humans — like James, an interactive demo developed by NVIDIA — can handle tasks such as scheduling appointments, filling out intake forms and answering questions about upcoming health services. This can make healthcare services more efficient and more accessible to patients.

In addition to NIM microservices, James uses NVIDIA ACE and ElevenLabs digital human technologies to provide natural, low-latency responses.

NVIDIA ACE is a suite of AI, graphics and simulation technologies for bringing digital humans to life. It can integrate every aspect of a digital human into healthcare applications — from speech and translation abilities capable of understanding diverse accents and languages, to realistic animations of facial and body movements.

Deloitte’s Frontline AI Teammate, powered by the NVIDIA AI Enterprise platform and built on Deloitte’s Conversational AI Framework, is designed to deliver human-to-machine experiences in healthcare settings. Developed within the NVIDIA Omniverse platform, Deloitte’s lifelike avatar can respond to complex, domain-specific questions that are pivotal in healthcare delivery.

The avatar uses NVIDIA Riva for fluid, multilingual communication, helping ensure no patient is left behind due to language barriers. It’s also equipped with the NeMo Megatron-Turing 530B large language model for accurate understanding and processing of patient data. These advanced capabilities can make clinical visits less intimidating, especially for patients who may feel uneasy about medical environments.

Personalized Experiences for Hospital Patients

Patients can get overwhelmed with the amount of pre-operative information. Typically, they have only one preadmission appointment, many weeks before the surgery, which can leave them with lingering questions and escalating concerns. The stress of a serious diagnosis may prevent them from asking all the necessary questions during these brief interactions.

This can result in patients arriving unprepared for their preadmission appointments, lacking knowledge about the appointment’s purpose, duration, location and necessary documents, and potentially leading to delays or even rescheduling of their surgeries.

To enhance patient preparation and reduce pre-procedure anxiety, The Ottawa Hospital is using AI agents, powered by NVIDIA and Deloitte technologies, to provide more consistent, accurate and continuous access to information.

With the digital teammate, patients can experience benefits including:

  • 24/7 access to the digital teammate using a smartphone, tablet or home computer.
  • Reliable, preapproved answers to detailed questions, including information around anesthesia or the procedure itself.
  • Post-surgery consultation to resolve any questions about the recovery process, potentially improving treatment adherence and health outcomes.

In user acceptance testing of the digital teammate conducted this summer, a majority of the testers noted that responses provided were clear, relevant and met the needs of the given interaction.

“The Frontline AI Teammate offers a novel and innovative solution to help combat our health human resource crisis — it has the potential to reduce the administrative burden, giving back time to healthcare providers to provide the quality care our population deserves and expects from The Ottawa Hospital,” said Mathieu LeBreton, digital experience lead at The Ottawa Hospital.  “The opportunity to explore these technologies is well-timed, given the planning of the New Campus Development, a new hospital project in Ottawa. Proper identification of the problems we are trying to solve is imperative to ensure this is done responsibly and transparently.”

Deloitte is working with other hospitals and healthcare institutions to deploy digital agents. A patient-facing pilot with Ottawa Hospital is expected to go live by the end of the year.

Developers can get started by accessing the digital human NIM Agent Blueprint

Read More