November 2024 – Page 19

Amazon Bedrock Prompt Management is now available in GA

Today we are announcing the general availability of Amazon Bedrock Prompt Management, with new features that provide enhanced options for configuring your prompts and enabling seamless integration for invoking them in your generative AI applications.

Amazon Bedrock Prompt Management simplifies the creation, evaluation, versioning, and sharing of prompts to help developers and prompt engineers get better responses from foundation models (FMs) for their use cases. In this post, we explore the key capabilities of Amazon Bedrock Prompt Management and show examples of how to use these tools to help optimize prompt performance and outputs for your specific use cases.

New features in Amazon Bedrock Prompt Management

Amazon Bedrock Prompt Management offers new capabilities that simplify the process of building generative AI applications:

Structured prompts – Define system instructions, tools, and additional messages when building your prompts
Converse and InvokeModel API integration – Invoke your cataloged prompts directly from the Amazon Bedrock Converse and InvokeModel API calls

To showcase the new additions, let’s walk through an example of building a prompt that summarizes financial documents.

Create a new prompt

Complete the following steps to create a new prompt:

On the Amazon Bedrock console, in the navigation pane, under Builder tools, choose Prompt management.
Choose Create prompt.
Provide a name and description, and choose Create.

Build the prompt

Use the prompt builder to customize your prompt:

For System instructions, define the model’s role. For this example, we enter the following:
You are an expert financial analyst with years of experience in summarizing complex financial documents. Your task is to provide clear, concise, and accurate summaries of financial reports.
Add the text prompt in the User message box.

You can create variables by enclosing a name with double curly braces. You can later pass values for these variables at invocation time, which are injected into your prompt template. For this post, we use the following prompt:

Summarize the following financial document for {{company_name}} with ticker symbol {{ticker_symbol}}:
Please provide a brief summary that includes
1.	Overall financial performance
2.	Key numbers (revenue, profit, etc.)
3.	Important changes or trends
4.	Main points from each section
5.	Any future outlook mentioned
6.	Current Stock price
Keep it concise and easy to understand. Use bullet points if needed.
Document content: {{document_content}}

Configure tools in the Tools setting section for function calling.

You can define tools with names, descriptions, and input schemas to enable the model to interact with external functions and expand its capabilities. Provide a JSON schema that includes the tool information.

When using function calling, an LLM doesn’t directly use tools; instead, it indicates the tool and parameters needed to use it. Users must implement the logic to invoke tools based on the model’s requests and feed results back to the model. Refer to Use a tool to complete an Amazon Bedrock model response to learn more.

Choose Save to save your settings.

Compare prompt variants

You can create and compare multiple versions of your prompt to find the best one for your use case. This process is manual and customizable.

Choose Compare variants.
The original variant is already populated. You can manually add new variants by specifying the number you want to create.
For each new variant, you can customize the user message, system instruction, tools configuration, and additional messages.
You can create different variants for different models. Choose Select model to choose the specific FM for testing each variant.
Choose Run all to compare outputs from all prompt variants across the selected models.
If a variant performs better than the original, you can choose Replace original prompt to update your prompt.
On the Prompt builder page, choose Create version to save the updated prompt.

This approach allows you to fine-tune your prompts for specific models or use cases and makes it straightforward to test and improve your results.

Invoke the prompt

To invoke the prompt from your applications, you can now include the prompt identifier and version as part of the Amazon Bedrock Converse API call. The following code is an example using the AWS SDK for Python (Boto3):

import boto3

# Set up the Bedrock client
bedrock = boto3.client('bedrock-runtime')

# Example API call
response = bedrock.converse(
    modelId=<<insert prompt arn>>,
    promptVariables = '{ "company_name": { "text" : "<<insert company name>>"},"ticker_symbol": {"text" : "<<insert ticker symbol>>"},"document_content": {"text" : "<<Insert document content>>"}}'
)

# Print the response	
response_body = json.loads(bedrock_response.get('body').read())
print(response_body)

We have passed the prompt Amazon Resource Name (ARN) in the model ID parameter and prompt variables as a separate parameter, and Amazon Bedrock directly loads our prompt version from our prompt management library to run the invocation without latency overheads. This approach simplifies the workflow by enabling direct prompt invocation through the Converse or InvokeModel APIs, eliminating manual retrieval and formatting. It also allows teams to reuse and share prompts and track different versions.

For more information on using these features, including necessary permissions, see the documentation.

You can also invoke the prompts in other ways:

Use the Amazon Bedrock prompt flow. Include any prompt version from Amazon Bedrock Prompt Management using the prompt node in your flow.
Use the Amazon Bedrock SDK and the get_prompt This allows you to integrate the prompt information into your code application or, if preferred, using open source frameworks such as LangChain or LlamaIndex. Refer to Integrate Amazon Bedrock Prompt Management in LangChain applications for more details.

Now available

Amazon Bedrock Prompt Management is now generally available in the US East (N. Virginia), US West (Oregon), Europe (Paris), Europe (Ireland) , Europe (Frankfurt), Europe (London), South America (Sao Paulo), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and Canada (Central) AWS Regions. For pricing information, see Amazon Bedrock Pricing.

Conclusion

The general availability of Amazon Bedrock Prompt Management introduces powerful capabilities that enhance the development of generative AI applications. By providing a centralized platform to create, customize, and manage prompts, developers can streamline their workflows and work towards improving prompt performance. The ability to define system instructions, configure tools, and compare prompt variants empowers teams to craft effective prompts tailored to their specific use cases. With seamless integration into the Amazon Bedrock Converse API and support for popular frameworks, organizations can now effortlessly build and deploy AI solutions that are more likely to generate relevant output.

About the Authors

Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on computer vision use cases and helping accelerate EMEA enterprises on their ML and generative AI journeys with Amazon SageMaker and Amazon Bedrock.

Ignacio Sánchez is a Spatial and AI/ML Specialist Solutions Architect at AWS. He combines his skills in extended reality and AI to help businesses improve how people interact with technology, making it accessible and more enjoyable for end-users.

Jensen Huang to Discuss AI’s Future with Masayoshi Son at AI Summit Japan

NVIDIA founder and CEO Jensen Huang will join SoftBank Group Chairman and CEO Masayoshi Son in a fireside chat at NVIDIA AI Summit Japan to discuss the transformative role of AI and more..

Taking place on November 12-13, the invite-only event at The Prince Park Tower in Tokyo’s Minato district will gather industry leaders to explore advancements in generative AI, robotics and industrial digitalization.

Call to action: Tickets for the event are sold out, but tune in via livestream or watch on-demand sessions.

Over 50 sessions and live demos will showcase innovations from NVIDIA and its partners, covering everything from large language models, known as LLMs, to AI-powered robotics and digital twins.

Huang and Son will discuss AI’s transformative role and efforts driving the AI field.

Son has invested in companies around the world that show potential for AI-driven growth through SoftBank Vision Funds. Huang has steered NVIDIA’s rise to a global leader in AI and accelerated computing.

One major topic: Japan’s AI infrastructure initiative, supported by NVIDIA and local firms. This investment is central to the country’s AI ambitions.

Leaders from METI and experts like Shunsuke Aoki from Turing Inc. will dig into how sovereign AI fosters innovation and strengthens Japan’s technological independence.

On Wednesday, November 13, two key sessions will offer deeper insights into Japan’s AI journey:

The Present and Future of Generative AI in Japan: Professor Yutaka Matsuo of the University of Tokyo will explore the advances of generative AI and its impact on policy and business strategy. Expect discussions on the opportunities and challenges Japan faces as it pushes forward with AI innovations.
Sovereign AI and Its Role in Japan’s Future: A panel of four experts will dive into the concept of sovereign AI. Speakers like Takuya Watanabe of METI and Hironobu Tamba of SoftBank will discuss how sovereign AI can accelerate business strategies and strengthen Japan’s technological independence.

These sessions highlight how Japan is positioning itself at the forefront of AI development. Practical insights into the next wave of AI innovation and policy are on the agenda.

Experts from Sakana AI, Sony, Tokyo Science University and Yaskawa Electric will be among those presenting breakthroughs across sectors like healthcare, robotics and data centers.

The summit will also feature hands-on workshops, including a full-day session on Tuesday, November 12, titled “Building RAG Agents with LLM.”

Led by NVIDIA experts, this workshop will offer practical experience in developing retrieval-augmented generation, or RAG, agents using large-scale language models.

With its mix of forward-looking discussions and real-world applications, the NVIDIA AI Summit Tokyo will highlight Japan’s ongoing advancements in AI and its contributions to the global AI landscape.

Tune in to the fireside chat between Son and Huang via livestream or watch on-demand sessions.

How Zalando optimized large-scale inference and streamlined ML operations on Amazon SageMaker

This post is cowritten with Mones Raslan, Ravi Sharma and Adele Gouttes from Zalando.

Zalando SE is one of Europe’s largest ecommerce fashion retailers with around 50 million active customers. Zalando faces the challenge of regular (weekly or daily) discount steering for more than 1 million products, also referred to as markdown pricing. Markdown pricing is a pricing approach that adjusts prices over time and is a common strategy to maximize revenue from goods that have a limited lifespan or are subject to seasonal demand (Sul 2023).

Because many items are ordered ahead of season and not replenished afterwards, businesses have an interest in selling the products evenly throughout the season. The main rationale is to avoid overstock and understock situations. An overstock situation would lead to high costs after the season ends, and an understock situation would lead to lost sales because customers would choose to buy at competitors.

To address this issue, discount steering is an effective approach because it influences item-level demand and therefore stock levels.

The markdown pricing algorithmic solution Zalando relies on is a forecast-then-optimize approach (Kunz et al. 2023 and Streeck et al. 2024). A high-level description of the markdown pricing algorithm solution can be broken down into four steps:

Discount-dependent forecast – Using past data, forecast future discount-dependent quantities that are relevant for determining the future profit of an item. The following are important metrics that need to be forecasted:
- 1. Demand – How many items will be sold in the next X weeks for different discounts?
  2. Return rate – What share of sold items will be returned by the customer?
  3. Return time – When will a returned item reappear in the warehouse so that it can be sold again?
  4. Fulfillment costs – How much will shipping and returning an item cost?
  5. Residual value – At what price can an item be realistically sold after the end of the season?
Determine an optimal discount – Use the forecasts from Step 1 as input to maximize profit as a function of discount, which is subject to business and stock constraints. Concrete details can be found in Streeck et al. 2024.
Recommendations – Discount recommendations determined in Step 2 are incorporated into the shop or overwritten by pricing managers.
Data collection – Updated shop prices lead to updated demand. The new information is used to enhance the training sets used in Step 1 for forecasting discounts.

The following diagram illustrates this workflow.

The focus of this post is on Step 1, creating a discount-dependent forecast. Depending on the complexity of the problem and the structure of underlying data, the predictive models at Zalando range from simple statistical averages, over tree-based models to a Transformer-based deep learning architecture (Kunz et al. 2023).

Regardless of the models used, they all include data preprocessing, training, and inference over several billions of records containing weekly data spanning multiple years and markets to produce forecasts. Operating such large-scale forecasting requires resilient, reusable, reproducible, and automated machine learning (ML) workflows with fast experimentation and continuous improvements.

In this post, we present the implementation and orchestration of the forecast model’s training and inference. The solution was built in a recent collaboration between AWS Professional Services, under which Well-Architected machine learning design principles were followed.

The result of the collaboration is a blueprint that is being reused for similar use cases within Zalando.

Motivation for streamlined ML operations and large-scale inference

As mentioned earlier, discount steering of more than a million items every week requires generating a large amount of forecast records (approximately 10 billion). Effective discount steering calls for continuous improvement of forecasting accuracy.

To improve forecasting accuracy, all involved ML models need to be retrained, and predictions need to be produced weekly, and in some cases daily.

Given the amount of data and nature of ML models in question, training and inference takes from several hours to multiple days. Any error in the process represents risks in terms of operational costs and opportunity costs because Zalando’s commercial pricing team expects results according to defined service level objectives (SLOs).

If an ML model training or inference fails in any given week, an ML model with outdated data is used to generate the forecast records. This has a direct impact on revenue for Zalando because the forecasts and discounts are less accurate when using outdated data.

In this context, our motivation for streamlining ML operations (MLOps) can be summarized as follows:

Speed up experimentation and evaluation, and enable rapid prototyping and provide sufficient time to meet SLOs
Design the architecture in a templated approach with the objective of supporting multiple model training and inference, providing a unified ML infrastructure and enabling automated integration for training and inference
Provide scalability to accommodate different types of forecasting models (also supporting GPU) and growing datasets
Make end-to-end ML pipelines and experimentation repeatable, fault-tolerant, and traceable

To achieve these objectives, we explored several distributed computing tools.

During our analysis phase, we discovered two key factors that influenced our choice of distributed computing tool. First, our input datasets were stored in the columnar Parquet format, spread across multiple partitions. Second, the required inference operations exhibited embarrassingly parallel characteristics, meaning they could be run independently without necessitating inter-node communication. These factors guided our decision-making process for selecting the most suitable distributed computing tool.

We explored multiple big data processing solutions and decided to use an Amazon SageMaker Processing job for the following reasons:

It’s highly configurable, with support of pre-built images, custom cluster requirements, and containers. This makes it straightforward to manage and scale with no overhead of inter-node communication.
Amazon SageMaker supports effortless experimentation with Amazon SageMaker Studio.
SageMaker Processing integrates seamlessly with AWS Identity and Access Management (IAM), Amazon Simple Storage Service (Amazon S3), AWS Step Functions, and other AWS services.
SageMaker Processing supports the option to upgrade to GPUs with minimal change in the architecture.
SageMaker Processing unifies our training and inference architecture, enabling us to use inference architecture for model backtesting.

We also explored other tools, but preferred SageMaker Processing jobs for the following reasons:

Apache Spark on Amazon EMR – Due to the inference operations displaying embarrassingly parallel characteristics and not requiring inter-node communication, we decided against using Spark on Amazon EMR, which involved additional overhead for inter-node communication.
SageMaker batch transform jobs – Batch transform jobs have a hard limit of 100 MB payload size, which couldn’t accommodate the dataset partitions. This proved to be a limiting factor for running batch inference on it.

Solution overview

Large-scale inference requires a scalable inference and scalable training solution.

We approached this by designing an architecture with an event-driven principle in mind that enabled us to build ML workflows for training and inference using infrastructure as code (IaC). At the same time, we incorporated continuous integration and delivery (CI/CD) processes, automated testing, and model versioning into the solution. Because applied scientists need to iterate and experiment, we created a flexible experimentation environment very close to the production one.

The following high-level architecture diagram shows the ML solution deployed on AWS, which is now used by Zalando’s forecasting team to run pricing forecasting models.

The architecture consists of the following components:

Sunrise – Sunrise is Zalando’s internal CI/CD tool, which automates the deployment of the ML solution in an AWS environment.
AWS Step Functions – AWS Step Functions orchestrates the entire ML workflow, coordinating various stages such as model training, versioning, and inference. Step Functions can seamlessly integrate with AWS services such as SageMaker, AWS Lambda, and Amazon S3.
Data store – S3 buckets serve as the data store, holding input and output data as well as model artifacts.
Model registry – Amazon SageMaker Model Registry provides a centralized repository for organizing, versioning, and tracking models.
Logging and monitoring – Amazon CloudWatch handles logging and monitoring, forwarding the metrics to Zalando’s internal alerting tool for further analysis and notifications.

To orchestrate multiple steps within the training and inference pipelines, we used Zflow, a Python-based SDK developed by Zalando that uses the AWS Cloud Development Kit (AWS CDK) to create Step Functions workflows. It uses SageMaker training jobs for model training, processing jobs for batch inference, and the model registry for model versioning.

All the components are declared using Zflow and are deployed using CI/CD (Sunrise) to build reusable end-to-end ML workflows, while integrating with AWS services.

The reusable ML workflow allows experimentation and productionization of different models. This enables the separation of the model orchestration and business logic, allowing data scientists and applied scientists to focus on the business logic and use these predefined ML workflows.

A fully automated production workflow

The MLOps lifecycle starts with ingesting the training data in the S3 buckets. On the arrival of data, Amazon EventBridge invokes the training workflow (containing SageMaker training jobs). Upon completion of the training job, a new model is created and stored in SageMaker Model Registry.

To maintain quality control, the team verifies the model properties against the predetermined requirements. If the model meets the criteria, it’s approved for inference. After a model is approved, the inference pipeline will point to the latest approved version of that model group.

When inference data is ingested on Amazon S3, EventBridge automatically runs the inference pipeline.

This automated workflow streamlines the entire process, from data ingestion to inference, reducing manual interventions and minimizing the risk of errors. By using AWS services such as Amazon S3, EventBridge, SageMaker, and Step Functions, we were able to orchestrate the end-to-end MLOps lifecycle efficiently and reliably.

Seamless integration of experiments

To allow for effortless model experimentation, we created SageMaker notebooks that use the Amazon SageMaker SDK to launch SageMaker training and processing jobs. The notebooks use the same Docker images (SageMaker Studio notebook kernels) as the ones used in CI/CD workflows all the way to production. With these notebooks, applied scientists can bring their own code and connect to different data sources, while also experimenting with different instance sizes by scaling up or down computation and memory requirements. The experimentation setup reflects the production workflows.

Conclusion

In this post, we described how MLOps, in collaboration between Zalando and AWS Professional Services, were streamlined with the objective of improving discount steering at Zalando.

MLOps best practices implemented for forecast model training and inference has provided Zalando a flexible and scalable architecture with reduced engineering complexity.

The implemented architecture enables Zalando’s team to conduct large-scale inference, with frequent experimentation and decreased risks of missing weekly SLOs.

Templatization and automation is expected to provide engineers with weekly savings of 3–4 hours per ML model in operations and maintenance tasks. Furthermore, the transition from data science experimentation into model productionization has been streamlined.

To learn more about ML streamlining, experimentation, and scalability, refer to the following blog posts:

References

Eleanor, L., R. Brian, K. Jalaj, and D. A. Little. 2022. “Promotheus: An End-to-End Machine Learning Framework for Optimizing Markdown in Online Fashion E-commerce.” arXiv. https://arxiv.org/abs/2207.01137.
Kunz, M., S. Birr, M. Raslan, L. Ma, Z. Li, A. Gouttes, M. Koren, et al. 2023. “Deep Learning based Forecasting: a case study from the online fashion industry.” In Forecasting with Artificial Intelligence: Theory and Applications (Switzerland), 2023.
Streeck, R., T. Gellert, A. Schmitt, A. Dipkaya, V. Fux, T. Januschowski, and T. Berthold. 2024. “Tricks from the Trade for Large-Scale Markdown Pricing: Heuristic Cut Generation for Lagrangian Decomposition.” arXiv. https://arxiv.org/abs/2404.02996#.
Sul, Inki. 2023. “Customer-centric Pricing: Maximizing Revenue Through Understanding Customer Behavior.” The University of Texas at Dallas. https://utd-ir.tdl.org/items/a2b9fde1-aa17-4544-a16e-c5a266882dda.

About the Authors

Mones Raslan is an Applied Scientist at Zalando’s Pricing Platform with a background in applied mathematics. His work encompasses the development of business-relevant and scalable forecasting models, stretching from prototyping to deployment. In his spare time, Mones enjoys operatic singing and scuba diving.

Ravi Sharma is a Senior Software Engineer at Zalando’s Pricing Platform, bringing experience across diverse domains such as football betting, radio astronomy, healthcare, and ecommerce. His broad technical expertise enables him to deliver robust and scalable solutions consistently. Outside work, he enjoys nature hikes, table tennis, and badminton.

Adele Gouttes is a Senior Applied Scientist, with experience in machine learning, time series forecasting, and causal inference. She has experience developing products end to end, from the initial discussions with stakeholders to production, and creating technical roadmaps for cross-functional teams. Adele plays music and enjoys gardening.

Irem Gokcek is a Data Architect on the AWS Professional Services team, with expertise spanning both analytics and AI/ML. She has worked with customers from various industries, such as retail, automotive, manufacturing, and finance, to build scalable data architectures and generate valuable insights from the data. In her free time, she is passionate about swimming and painting.

Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also a cycling enthusiast.

Junaid Baba, a Senior DevOps Consultant with AWS Professional Services, has expertise in machine learning, generative AI operations, and cloud-centered architectures. He applies these skills to design scalable solutions for clients in the global retail and financial services sectors. In his spare time, Junaid spends quality time with his family and finds joy in hiking adventures.

Luis Bustamante is a Senior Engagement Manager within AWS Professional Services. He helps customers accelerate their journey to the cloud through expertise in digital transformation, cloud migration, and IT remote delivery. He enjoys traveling and reading about historical events.

Viktor Malesevic is a Senior Machine Learning Engineer within AWS Professional Services, leading teams to build advanced machine learning solutions in the cloud. He’s passionate about making AI impactful, overseeing the entire process from modeling to production. In his spare time, he enjoys surfing, cycling, and traveling.

Unleashing Stability AI’s most advanced text-to-image models for media, marketing and advertising: Revolutionizing creative workflows

To stay competitive, media, advertising, and entertainment enterprises need to stay abreast of recent dramatic technological developments. Generative AI has emerged as a game-changer, offering unprecedented opportunities for creative professionals to push boundaries and unlock new realms of possibility. At the forefront of this revolution is Stability AI’s family of cutting-edge text-to-image AI models. These models promise to transform the way we approach visual content creation, empowering large media, advertising, and entertainment organizations to tackle real-world business use cases with efficiency and creativity.

This technical post explores how these organizations can use the power of Stability AI to streamline workflows, enhance creative processes, and unleash a new era of advertising campaigning and visual storytelling.

Overview

Amazon Bedrock recently launched three new models by Stability AI: Stable Image Ultra, Stable Diffusion 3 Large, and Stable Image Core. These advanced models greatly improve performance in multisubject prompts, image quality, and typography and can be used to rapidly generate high-quality visuals for a wide range of use cases across marketing, advertising, media, entertainment, retail, and more. One of the key improvements of these models compared to Stable Diffusion XL (SDXL) (one of Stability AI’s older models) is text quality in generated images, with fewer errors in spelling and typography thanks to its innovative Diffusion Transformer architecture.

By learning the intricate relationships between visual and textual data, these models can generate highly detailed and coherent images from simple text prompts. The improved architecture combines the strengths of various deep learning techniques, including transformer encoders for text understanding, convolutional neural networks (CNNs) for efficient image processing, and attention mechanisms for capturing long-range dependencies and fine-grained details. The new family of models available on Amazon Bedrock are mentioned in the table below:

Features	Stable Image Core	SD3 Large 1.0	Stable Image Ultra 1.0
Parameters	2.6 billion	8 billion	8 billion
Input	Text	Text or Image	Text
Typography	Versatility and readability across different sizes and applications	Tailored for large-scale display	Tailored for large-scale display
Visual Aesthetics	Good rendering, not as detail oriented	Highly realistic with finer attention to detail	Photorealistic image output
Best Fit	Fast and affordable rapid concepting and ideating	Content creation in media, entertainment, retail	High-quality content at speed for media, retail

To evaluate the capabilities of these models, we tested a variety of prompts ranging from simple object descriptions to complex scene compositions. The experiments revealed that, although SDXL excelled at rendering common objects and scenes accurately, these newer models from Stability AI demonstrated improved performance on more nuanced and imaginative prompts. The new models better understand and visually express abstract concepts, stylized artistic renditions, and creative blends of disparate elements.

Stable Image Core is a newer, more affordable and faster version of SDXL. It’s based on the same diffusion architecture as SDXL. In comparison, Stable Diffusion 3 Large and Stable Image Ultra are based on the new diffusion transformer architectures, making them much better at typography.

Expanded training data of the SD3 base model—which is used for both Stable Diffusion 3 Large and Stable Image Ultra—has endowed it with stronger multimodal reasoning and world knowledge compared to SDXL. Some key improvements we observed from the prompt experimentation are the following:

Prompt adherence – These models excel at following complex and detailed prompts, particularly in surreal scenes, making sure that the generated images closely match the specified instructions. Stable Diffusion 3 Large and Stable Image Ultra work the best with natural language.
Text Rendering: Unlike SDXL, which may struggle with incorporating text into images, these newer models effectively generate and integrate text, enhancing the overall coherence of the visuals.
Complex Scene Handling: The new models demonstrate a improved ability to create intricate and detailed scenes, showcasing a better grasp of surreal elements as it understands them in your prompts.
Photorealism: The images produced by these models are more lifelike, with improved handling of textures, lighting, and shadows, making them visually striking.
Visual Aesthetics: The overall visual appeal is enhanced, making them more engaging and attractive.
Multimodal Capabilities: The new models can process various input types beyond just text, allowing for more context-aware image generation.
Scalability: The new architecture of these models supports handling larger datasets and generating higher-resolution images effectively.
Advanced Architecture: The SD3 base model (used for Stable Diffusion 3 Large and Stable Image Ultra) utilizes a new diffusion transformer combined with flow matching, which enhances its performance in generating high-quality images.

The table below showcases the comparison in image generation between the models available on Amazon Bedrock.

Image Generation Comparison – Stability AI Models

Real-world use cases for media, advertising, and entertainment

In the world of media, marketing, and entertainment, concept art and storyboarding are essential for visualizing ideas and communicating creative visions. Stability AI’s models can revolutionize this process by generating high-quality concept art and storyboard frames based on textual descriptions, enabling rapid iteration and exploration of ideas.

Ideation and iteration

Advertising agencies and marketing teams can leverage these models to generate visually stunning and attention-grabbing assets for their campaigns. From product shots to lifestyle imagery, these models can produce a wide range of visuals tailored to specific brand identities and target audiences. In film and television, these models can be a powerful tool for set design and virtual production. By generating realistic environments and backdrops based on textual descriptions, production teams can quickly visualize and iterate on set designs, reducing the need for physical mockups and saving time and resources.

Character design

Character design is a crucial aspect of storytelling in media and entertainment. These models can assist artists and designers in generating unique and compelling character concepts, enabling them to explore a wide range of visual styles and aesthetics.

Social media marketing asset generation

Social media has become a vital marketing channel for media, advertising, and entertainment organizations. Stability AI’s latest models can be leveraged to generate engaging visual content, such as memes, graphics, and promotional materials, tailored to specific social media domains and target audiences.

Stability AI’s capabilities in advertising and marketing campaigns

To showcase the power of Stability AI’s text-to-image models in creating compelling advertising and marketing assets, we walk through a demonstration using a Jupyter notebook that combines large language models (LLMs) and Stable Diffusion 3 Large for end-to-end campaign creation. We demonstrate how to produce generated images for a brand called Young Generational Shoes (YGS), evaluate brand consistency and message effectiveness, use the LLM to analyze images and suggest improvements, and refine prompts based on feedback to generate new iterations. By combining LLM-generated campaign ideas with this model’s advanced image generation capabilities, agencies can rapidly produce high-quality, tailored visual assets that resonate with their target audience. The notebook provides a practical, hands-on example of how these cutting-edge AI tools can be integrated into real-world advertising workflows, potentially saving time and resources while enhancing creative output.

The recorded version of the demo is available here:

Prerequisites

This notebook is designed to run on AWS, leveraging Amazon Bedrock for both the LLM and Stability AI model access. Make sure you have the following set up before moving forward:

An AWS account
An Amazon SageMaker domain
A SageMaker domain user profile

To access Stability AI’s Stable Image Ultra text to image model, request access through the Amazon Bedrock console. For instructions, see Manage access to Amazon Bedrock foundation models. For instructions on how to deploy this sample, refer to the GitHub repo. Use the us-west-2 Region to run this demo.

Setting up the demo

We will be using the Stable Image Ultra for the purposes of this demo. You can use one of the other available models from Stability AI on Bedrock to run through your version of the notebook.

# Amazon Bedrock Model ID used throughout this notebook
# Model IDs: https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html#model-ids-arns
MODEL_ID = "stability.stable-image-ultra-v1:0"

This following function call essentially acts as a wrapper around the Amazon Bedrock API, simplifying the process of generating images using Stability AI’s models. It handles the API call, response parsing, and image decoding, providing a straightforward way to generate images from text prompts using these advanced AI models.

def generate_image_from_text(model_id, body):
    """
    Generate an image using SD3 on demand.
    Args:
        model_id (str): The model ID to use.
        body (str) : The request body to use.
    Returns:
        image_bytes (bytes): The image generated by the model.
    """

    logger.info("Generating image with SD3 model %s", model_id)

    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
    
    response = bedrock.invoke_model(modelId=model_id,body=body)
    response_body= json.loads(response["body"].read())
    image_data = base64.b64decode(response_body.get("images")[0]

    logger.info("Successfully generated image with the SD3 model %s", model_id)
    return image_data

Generating creative ad campaigns with multiple models

The demo begins by using an LLM to generate creative ad campaign ideas and follows these steps

Define your product or service and target audience
Prompt the LLM to create multiple ad campaign concepts
The LLM generates diverse ideas, considering factors such as brand identity, audience demographics, and current trends

This process allows for a wide range of creative concepts tailored to your specific marketing needs. The following is the sample prompt we used in the notebook:

You are a seasoned veteran in the advertising industry with a wealth of experience
in creating captivating and impactful campaigns. Your task is to generate five
different creative advertising concepts for our new line of shoes under the brand
"YGS". Our product range includes running shoes, soccer shoes, and training shoes.

Our target audience is the young generation, a demographic known for their energy,
trendiness, and desire to express their individuality.

Each advertising concept should seamlessly incorporate the following elements: 

1. The specific type of shoe (running, soccer, tennis, hiking or training) and 
its intended usage. 
2. A vivid description of the colors and unique features that make our
shoes stand out. 
3. A compelling scenario that vividly illustrates when and where these shoes would
be worn, capturing the essence of the active lifestyle our target audience embraces. 

Your concepts should be fresh, engaging, and resonate with the youthful spirit
of our target market. Creativity, originality, and a deep understanding of
our audience's aspirations and passions should shine through in your advertising
ideas. Remember, the goal is to craft compelling narratives that not only showcase
our product's features but also tap into the emotions and desires of the
young generation, inspiring them to embrace our brand as an extension of
their vibrant lifestyles. 

The output format should follow below Json format: 
[ { "concept": "xxx", "Description": "xxx", "Scenario": "xxx" }, 
{ "concept": "xxx", "Description": "xxx", "Scenario": "xxx" } ... ]"

Prompt engineering for visual assets

Once you have campaign concepts, the next step is to craft effective prompts for SD3 Ultra 1.0. This involves using Anthropic’s Claude Sonnet 3.5 on Amazon Bedrock to transform campaign ideas into detailed image prompts, refining these prompts to include specific visual elements, styles, and compositions, and iterating on them to make sure that they capture the essence of the campaign. This process helps create precise instructions to generate visuals that align closely with the campaign’s objectives.

 """You are an expert to use stable diffusion model to generate shoes ad posters.
 Please user below content to generate the positive and negative prompt for stable
 diffusion model:
 - "Concept": {Concept}
 - "Description": {Description}
 - "Scenario": {Scenario}
 
 Output format shoud be Json format as below:
  [
     {
        "positive_prompt": "xxx"
     }
  ]
 Please add this to the positive prompt: text 'YGS' on the Shoes as a logo."""

Generating ad posters with Stable Image Ultra

With well-crafted prompts, Stable Image Ultra can now create stunning visual assets. The process involves entering the refined prompts into the model through the Amazon Bedrock API, adjusting parameters such as image size, number of inference steps, and guidance scale for optimal results and generating multiple variations to provide a range of options for the campaign. This approach allows for the creation of diverse, high-quality visuals that can be fine-tuned to help meet specific campaign requirements. Here are some posters generated by Stable Image Ultra:

Note:

The images generated could be different because your results depend on the parameters and their values, including the following:

The cfg_scale, which determines how strictly the diffusion process adheres to the prompt text
The height and width of the image in pixels
The number of diffusion steps to run
The random noise seed (which, if provided, makes the resulting generated image deterministic)
The sampler used for the diffusion process to denoise the generation
The array of text prompts used for generation
The weight assigned to each prompt

These parameters allow for fine-tuning and customization of the image generation process, resulting in diverse outputs based on their specific configuration.

Clean up

To avoid charges, you must stop the active SageMaker notebook instances. For instructions, refer to Clean up Amazon Sagemaker notebook instance resources.

Conclusion

Stability AI’s new family of models represents a significant milestone in the field of generative AI, offering media, advertising, and entertainment organizations a powerful tool to streamline creative workflows and unlock new realms of visual expression. By using Stability AI’s capabilities, organizations can tackle real-world business use cases, from concept art and storyboarding to advertising campaigns and content creation. However, it’s essential to proceed with a responsible and ethical mindset, addressing potential biases, respecting intellectual property rights, and mitigating the risks of misuse. By embracing the capabilities of these models while navigating their limitations and ethical considerations, creative professionals can push the boundaries of what’s possible in the world of visual content creation. To get started, check out Stability AI models in Amazon Bedrock.

As the field of generative AI continues to evolve rapidly, we can expect even more exciting developments and innovations from Stability AI and other industry leaders. Stay tuned for further advancements that will shape the creative landscape and empower artists, designers, and content creators in unprecedented ways.

About the authors

Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while ensuring resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.

Boshi Huang is a Senior Applied Scientist in Generative AI at Amazon Web Services, where he collaborates with customers to develop and implement generative AI solutions. Boshi’s research focuses on advancing the field of generative AI through automatic prompt engineering, adversarial attack and defense mechanisms, inference acceleration, and developing methods for responsible and reliable visual content generation.

Identification of Hazardous Areas for Priority Landmine Clearance: AI for Humanitarian Mine Action

TL;DR: Landmines pose a persistent threat and hinder development in over 70 war-affected countries. Humanitarian demining aims to clear contaminated areas, but progress is slow: at the current pace, it will take 1,100 years to fully demine the planet. In close collaboration with the UN and local NGOs, we co-develop an interpretable predictive tool for landmine contamination to identify hazardous clusters under geographic and budget constraints, experimentally reducing false alarms and clearance time by half. The system is being tested in Afghanistan and Colombia, where it has already led to the discovery of new landmines.

Anti-personnel landmines are explosive devices hidden in the ground designed to explode by proximity or contact and with the capacity to kill, disable or cause harm to humans (Fig. 1). The mere threat of landmine contamination in a territory not only endangers the physical well-being of affected populations but also results in a loss of forest areas, reduction of productive land, exacerbation of social vulnerability, delay of infrastructure development, and damage of natural, physical, and social capital. Due to such negative consequences, in 1997 most countries signed the Ottawa Treaty committing themselves to stop the manufacture, commercialization, and use of landmines. Likewise, the countries that had historically used these explosive devices during armed conflicts undertook to clear the contaminated territories. Despite ongoing efforts, landmines continue to be used in conflicts worldwide, posing a persistent threat to humanity and hindering the development of war-affected communities in over 70 countries, impacting more than 60 million people and causing nearly 7,000 casualties every year.

Figure 1. Example of a landmine found in Colombia.

Humanitarian mine action operations seek to clear conflict-affected regions of remaining landmines so that communities can safely reland their territories. However, demining operations are laborious and costly due to vast areas that need surveying and the limited monetary and human resources available: at the current rate, it will take about 1,100 years to clear the planet of all remaining landmines, underscoring the urgent need for innovative evidence-based approaches to make demining operations more efficient and safer. In this context, we co-designed the RELand system (Risk Estimation of Landmines), in partnership with the United Nations Mine Action Service and local demining organizations, to efficiently identify hazardous areas for priority landmine clearance. RELand is currently being tested in Colombia, where it has already led to the discovery of three new landmines in a newly prioritized area, potentially saving civilian lives. We have also tailored and deployed the system in Afghanistan, and we are preparing for its deployment in war-torn territories globally, in partnership with UNMAS and UNOPS.

RELand: Risk Estimation of Landmines via Interpretable Invariant Risk Minimization

RELand is a holistic pipeline to identify priority hazard areas to support non-technical surveys in humanitarian demining operations. Theses initial surveys are currently carried out by human experts who evaluate the possible presence of landmines based on available information and that provided by the residents. Since landmines are not used randomly but under war logic, Machine Learning can potentially help with these surveys by analyzing historical events and their correlation to relevant features. However, identifying landmine contamination has been scarcely studied in the literature, and poses three main challenges: noisy labels, geographic dependence, and sparse predicted risk scores. We address the challenges of landmine risk estimation by enhancing existing datasets with rich relevant features, constructing a novel, robust, and interpretable ML model that outperforms standard and new baselines, and identifying cohesive hazard clusters under geographic and budgetary constraints. Finally, the results are delivered through a web application developed with key mine action stakeholders. The major components of RELand are illustrated in Fig. 2. Notably, our approach is the first public pipeline of its kind that can be easily adapted for use in demining workflows globally.

Figure 2. Integration of RELand system into the humanitarian demining pipeline. Current non-technical surveys (grey) are based on the visual inspection of data in geospatial information systems and human expert analyses including local community surveys and domain knowledge. RELand (yellow dashed box) serves as an additional toolbox that contains three major components: dataset enhancement based on existing public geospatial datasets (red), risk modeling with machine learning methods (blue), and interactive web interface (green).

The first component of the system, Dataset Enhancement, integrates different sources of information to construct a dataset for landmine presence with rich relevant features based on geographic information, socio-demographic variables, remnants of war indicators, and historical landmine events. We introduce several new features which prove useful to identify hazard areas and to rule out false alarms. We also argue how labels should be assigned to predict the results of humanitarian demining operations, rectifying the definition of labels used in previous literature.

For the Risk Modeling component, we designed a novel interpretable deep learning tabular model extending TabNet. We propose to minimize the Invariant Risk Minimization (IRM), which enables the model to be robust to distribution shifts and invariant to diverse deployment environments. Intuitively, we define an “easy” environment as one where landmines are found close to past events or grid cells with no historical landmines nearby have indeed negative labels. In contrast, a “hard” environment is one where despite there being some historical events there are no new landmines (and resources are going to be used inefficiently) or new landmines found far away from previous events (and likely missed by baseline methods leading to a latent risk to humans). Formally, let us denote an environment by (e = (X^e, Y^e)) and let (w) be a dummy scalar classifier. Then the IRM loss is composed by an ERM cross-entropy term that encourages prediction accuracy, and a regularization term that forces (f_theta) to be simultaneously optimal across all environments (E). Our landmine risk estimator (f_{theta}(X)) is penalized for applying the distance-existence rule in “easy” environments to “hard” ones, and therefore generalizes well on both environments.

$$IRM(theta) = min_{theta} sumlimits_{e in E} ell_{text{CE}}(f_theta(X^e), Y^e) + lambda cdot ||nabla_{w|w=1} ell_{text{CE}}(w( f_theta(X^e)), Y^e)||^2$$

However, our partner demining organizations quickly emphasized the need for interpretable models, as they must explain to communities why certain areas are prioritized for clearance or not. Therefore, as the first step towards the interpretation of landmine risk estimators, we utilize SparseMax layers to generate global feature importance for our model. SparseMax (SM) is an activation function that normalizes the input vector to sparse probabilities (like a LASSO regularization), and is shown at the top of Fig. 3. Finally, we leverage the sequential design in TabNet to form decision blocks that are summed together and passed into an aggregation FC layer as the final prediction. This sequential design resembles additive modeling in Gradient Boosting Machines and ResNet skip connection mechanism. Initial blocks capture the main correlation in the dataset, and the following blocks can use the rest of the features to learn the residuals to fit the function better. Our final architecthure is show in Figure 3.

Figure 3. RELand architecture with interpretation branch that generates sparse feature masks on the top, and decision blocks at the bottom aggregated before the final FC layer.

To validate the proposed system, we simulate different scenarios in which the RELand system could be deployed in mine clearance operations using real data from Colombia. We use a block cross-validation approach, where the hold-out set corresponds to all cells in a municipality, to account for the geographical nature of current demining operations. In addition, since false negatives represent a higher cost in terms of human lives, we use the Height and Reverse Height (rHeight) metrics of how well a ranking is generated, in the sense that positive cells should be ranked higher than negative cells. Intuitively, models with better predictions for top-ranked regions can speed-up land clearing operations. Given a predicted risk score, Height refers to the number of positive cells ranked below a negative cell, and rHeight is the number of negative cells ranked above a positive one. An ideal classifier minimizes both of these metrics and perfectly rank positive cells above negative cells. Formally,

$$ Height(X_n) = sumlimits_{i = 1}^{P}mathbb{1}(widehat{f}(X_text{p})_i leq widehat{f}(X_text{n})), $$

$$rHeight(X_p) = sumlimits_{j = 1}^{N}mathbb{1}(widehat{f}(X_text{p}) leq widehat{f}(X_text{n})_j) $$

where (P) and (N) are the total counts of positive and negative labels, respectively, and (widehat{f}(X_text{p})) ((widehat{f}(X_text{n}))) is the predicted probability when the ground truth of (X_i) ((X_j)) is positive (negative).

Table 1 presents the result of the experimental validation comparing the proposed methodology with current practices, focusing mainly on historical landmine reports, and two previous ML models proposed in the literature. RELand consistently outperforms the benchmark models on all relevant metrics. Furthermore, Table 1 shows that the proposed method reduces the mean-rHeight by almost half compared to previous approaches. Intuitively, if we were to sequentially clear a region according to the generated risk score ranking, this metric tells us the average number of negative cells we would need to visit before the region is completely cleared. This measures how efficiently we could demine a geographic region of interest: RELand reduces the false alarms and the time required for landmine clearance by half.

Model	ROC (↑)	PR (↑)	mean-Height (↓)	mean-rHeight (↓)
LR-single (current)	86.35 (11.54)	17.07 (10.76)	3.06 (3.19)	226.79 (211.23)
LR-geo (2019, 2016)	67.62 (18.58)	5.37 (8.00)	8.09 (6.93)	573.36 (440.71)
SVM-geo (2019)	48.61 (18.09)	1.73 (1.82)	15.26 (15.66)	821.26 (729.12)
RELand (ours)	92.90 (4.43)	29.03 (22.11)	2.17 (2.48)	132.03 (133.50)

Table1 . Validation results in Colombia. Each entry is the mean (std) performance on validation folds following the block cross-validation rule. RELand is our interpretable IRM model. Full experimental results and ablation studies are available in our paper.

Hazard Cluster Identification as a Quadratic Knapsack Problem

Building a reliable prediction model to estimate landmine contamination risk is a crucial first step in data-driven prioritization of land clearance operations. However, integrating the risk maps generated by machine learning models into demining workflows requires considering the additional geographical and budgetary constraints that mine action organizations face in their ground operations. For instance, demining organizations often operate under limited budgets, allowing them to clear only a fraction of the total area under study while also covering the costs associated with mobilizing equipment and teams across the region (e.g., metal detectors, sniffing dogs, and human deminers). Moreover, if multiple regions are to be demined, there must be a secure path connecting these regions to ensure the safe movement of such demining teams. Humanitarian demining organizations need to maximize the land released back to local communities while navigating these challenges.

We propose to find which cells to prioritize for mine clearance by using a Quadratic Knapsack Problem (QKP), whose optimal solution naturally results in the identification of cohesive hazard clusters due to rewarding the program for prioritizing nearby grid cells. Formally, we use the risk scores (r_i) estimated by our trained deep learning model to compute proxies for the benefit of demining candidate grid cell (i) with centroid ((x_i,y_i)). Then, define the reward matrix (U) that captures the (additional) benefit of prioritizing both grid cells (i) and (j) as

$$u_{ij} = sqrt{r_i r_j}expleft(-lambda ||s_i – s_j||_{h}right),$$

where (||cdot||_{h}) is the standard Haversine distance, and (lambda) controls for the exponential decay of the spatial distance between two locations (s_i = (x_i, y_i)) and (s_j = (x_j, y_j)). For example, selecting a grid cell (i) for mine clearance results in a direct benefit of (u_{ii} = r_i). Note that, in our formulation, riskier cells yield greater rewards. This results in the following binary QKP with variables (z_i in {0,1}), for (iin [n]), which indicate if a grid cell (i) is selected for demining. Then, the total reward is given by (z^{T}Uz), which is maximized subject to a given budget (C in mathbb{R}_{+}) and demining costs (w_i):

$$ max_{z in mathbb{R}^n} ~ z^{T}Uz $$

$$s.t. quad sum_{i=1}^n w_i z_i leq C, quad z_i in {0, 1} quad forall i in [n].$$

Our approach rewards for geographic cohesion, ultimately finding more useful hazard clusters than a greedy solution that prioritizes the (C) grid cells with the largest estimated risk scores (Fig. 4). Moreover, our approach also incorporates realistic budget constraints, unlike standard spatial statistical approaches for geographic clustering such as Moran Local I and LISA.

Figure 4. Hazardous areas identified by RELand in our field test in Colombia. (a) Estimated risk scores from our trained DL model , (b) greedy risk clusters subject to budget constraints, and (c) QKP cohesive risk clusters with geographic pairwise interactions. Three landmines (panel (c), in white) have been found so far in one of the prioritized areas.

Tangible Impact of RELand

We are currently conducting a field study in Colombia, in partnership with the United Nations Mine Action Service and the Colombian Campaign to Ban Landmines, in two municipalities recently selected for humanitarian demining that have not been previously surveyed. We applied RELand to these regions to (i) build the enhanced dataset with rich geographic features, (ii) generate landmine contamination risk estimates by using the trained DL model, and (iii) use the predicted risk scores to identify priority hazard clusters with the QKP formulation. We worked together with the field teams of our partner NGO in Colombia to validate the hazard clusters identified by the system and to create an initial demining plan in the assigned regions. Crucially, the proposed methodology (Fig. 4c) identifies useful cohesive hazard clusters under realistic budgetary constraints. These hazard regions are more useful for demining prioritization than the sparse raw risk scores (Fig. 4a) and the greedy risk clusters (Fig. 4b), which lead to excessive mobilization of demining teams and equipment. Overall, the risk maps generated are in line with what is expected by human experts in humanitarian demining in Colombia. To date, three landmines have been found in one priority area, saving human lives. Moreover, in collaboration with UNOPS and MAPA, we have tailored and deployed the system in Afghanistan, identifying 81 hazardous areas for prioritized demining interventions, positively impacting over 4 million people across the country.

We expect to have the full results of our demining field tests within 6 months to provide a real-world validation of RELand’s capabilities in ground operations. Based on the initial positive feedback, we believe the system can support critical parts of the initial planning of humanitarian mine action, making demining operations more efficient and safer. We are actively working with UNMAS, UNOPS, and local NGOs to refine the system in its three components and prepare it for deployment in war-torn territories globally.

Aknowledgments

RELand was developed in collaboration with Cindy Zeng (UIUC), Anna Wang (CMU), Didier Alvarado (UNMAS Colombia), Francisco Moreno (CCBL), Hoda Heidari (CMU), and Fei Fang (CMU). Special thanks to UNOPS and MAPA for their partnership in our Afghanistan field tests. All errors remain mine.

References

Dulce Rubio, M., Zeng, S., Wang, Q., Alvarado, D., Moreno Rivera, F., Heidari, H., & Fang, F. (2024). RELand: Risk Estimation of Landmines via Interpretable Invariant Risk Minimization. ACM Journal on Computing and Sustainable Societies, 2(2), pp. 1-29. https://doi.org/10.1145/3648437.
Dulce Rubio, M. (2024). Identification of Hazard Clusters for Priority Landmine Clearance as a Quadratic Knapsack Problem. Doing Good with Good OR Competition, INFORMS Annual Meeting.
Collins, R., Fragniere, L., & Dulce Rubio, M. (2024). Advancements In Mine Action: Enhancing Remote Reporting And Analysis Through Innovative Technologies. The Journal of Conventional Weapons Destruction, 28(3), 7.

Build a multi-tenant generative AI environment for your enterprise on AWS

While organizations continue to discover the powerful applications of generative AI, adoption is often slowed down by team silos and bespoke workflows. To move faster, enterprises need robust operating models and a holistic approach that simplifies the generative AI lifecycle. In the first part of the series, we showed how AI administrators can build a generative AI software as a service (SaaS) gateway to provide access to foundation models (FMs) on Amazon Bedrock to different lines of business (LOBs). In this second part, we expand the solution and show to further accelerate innovation by centralizing common Generative AI components. We also dive deeper into access patterns, governance, responsible AI, observability, and common solution designs like Retrieval Augmented Generation.

Our solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. It also uses a number of other AWS services such as Amazon API Gateway, AWS Lambda, and Amazon SageMaker.

Architecting a multi-tenant generative AI environment on AWS

A multi-tenant, generative AI solution for your enterprise needs to address the unique requirements of generative AI workloads and responsible AI governance while maintaining adherence to corporate policies, tenant and data isolation, access management, and cost control. As a result, building such a solution is often a significant undertaking for IT teams.

In this post, we discuss the key design considerations and present a reference architecture that:

Accelerates generative AI adoption through quick experimentation, unified model access, and reusability of common generative AI components
Offers tenants the flexibility to choose the optimal design and technical implementation for their use case
Implements centralized governance, guardrails, and controls
Allows for tracking and auditing model usage and cost per tenant, line of business (LOB), or FM provider

Solution overview

The proposed solution consists of two parts:

The generative AI gateway and
The tenant

The following diagram illustrates an overview of the solution.

Generative AI gateway

Shared components lie in this part. Shared components refer to the functionality and features shared by all tenants. Each component in the previous diagram can be implemented as a microservice and is multi-tenant in nature, meaning it stores details related to each tenant, uniquely represented by a tenant_id. Some components are categorized in groups based on the type of functionality they exhibit.

The standalone components are:

The HTTPS endpoint is the entry point to the gateway. Interactions with the shared services goes through this HTTPS endpoint. This is the only entry point of the solution.
The orchestrator is responsible for receiving the requests forwarded by the HTTPS endpoint and invoking relevant microservices, based on the task at hand. This in itself is a microservice, inspired the Orchestrator Saga pattern in microservices.
The generative AI playground is a UI provided to tenants where they can run their one-time experiments, chat with several FMs, and manually test capabilities such as guardrails or model evaluation for exploration purposes.

The component groups are as follows.

Core services is primarily targeted to the environment administrator. It contains services used to onboard, manage, and operate the environment, for example, to onboard and off-board tenants, users, and models, assign quotas to different tenants, and authentication and authorization microservices. It also contains observability components for cost tracking, budgeting, auditing, logging, etc.
Generative AI model components contain microservices for foundation and custom model invocation operations. These microservices abstract communication to FMs served through Amazon Bedrock, Amazon SageMaker, or a third-party model provider.
Generative AI components provide functionalities needed to build a generative AI application. Capabilities such as prompt caching, prompt chaining, agents, or hybrid search are part of these microservices.
Responsible AI components promote the safe and responsible development of AI across tenants. They include features such as guardrails, red teaming, and model evaluation.

Tenant

This part represents the tenants using the AI gateway capabilities. Each tenant has different requirements and needs and their own application stack. They can integrate their application with the generative AI gateway to embed generative AI capabilities in their application. The environment Admin has access to the generative AI gateway and interacts with the core services.

Solution walkthrough

The following sections examine each part of the solution in more depth.

HTTPS endpoint

This serves as the entry point for the generative AI gateway. Incoming requests to the gateway go through this point. There are different approaches you can follow when designing the endpoint:

REST API endpoint – You can set up a REST API endpoint using services such as API Gateway where you can apply all authentication, authorization, and throttling mechanisms. API Gateway is serverless and hence automatically scales with traffic.
WebSockets – For long-running connections, you can use WebSockets instead of a REST interface. This implementation overcomes timeout limitations in synchronous REST requests. A WebSockets implementation keeps the connection open for multiturn or long-running conversations. API Gateway also provides a WebSocket API.
Load balancer – Another option is to use a load balancer that exposes an HTTPS endpoint and routes the request to the orchestrator. You can use AWS services such as Application Load Balancer to implement this approach. The advantage of using Application Load Balancer is that it can seamlessly route the request to virtually any managed, serverless or self-hosted component and can also scale well.

Tenants and access patterns

Tenants, such as LOBs or teams, use the shared services to access APIs and integrate generative AI capabilities into their applications. They can also use the playground UI to assess the suitability of generative AI for their specific use case before diving into full-fledged application development.

Here you also have the data sources, processing pipelines, vector stores, and data governance mechanisms that allow tenants to securely discover, access, andthe data they need for their specific use case. At this point, you need to consider the use case and data isolation requirements. Some applications may need to access data with personal identifiable information (PII) while others may rely on noncritical data. You also need to consider the operational characteristics and noisy neighbor risks.

Take Retrieval Augmented Generation (RAG) as an example. Depending on the use case and data isolation requirements, tenants can have a pooled knowledge base or a siloed one and implement item-level isolation or resource level isolation for the data respectively. Tenants can select data from the data sources they have access to, choose the right chunking strategy for their application, use the shared generative AI FMs for converting the data into embeddings, and store the embeddings in their vector store.

To answer user questions in real time, tenants can implement caching mechanisms to reduce latency and costs for frequent queries. Additionally, they can implement custom logic to retrieve information about previous sessions, the state of the interaction, and information specific to the end user. To generate the final response, they can again access the models and re-ranking functionality available through the gateway.

The following diagram illustrates a potential implementation of a chat-based assistant application with this approach. The tenant application uses FMs available through the generative AI gateway and its own vector store to provide personalized, relevant responses to the end user.

Shared services

The following section describes the shared services groups.

Model components

The goal of this component group is to expose a unified API to tenants for accessing underlying models irrespective of where these are hosted. It abstracts invocation details and accelerates application development. It consists of one or more components depending on the number of FM providers and number and types of custom models used. These components are illustrated in the following diagram.

In terms of how to offer FMs to your tenants, with AWS you have several options:

Amazon Bedrock is a fully managed service that offers a choice of FMs from AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. It’s serverless so you don’t have to manage the infrastructure. You can also bring your own customized models and deploy them to Amazon Bedrock for supported architectures.
SageMaker JumpStart is a machine learning (ML) hub that provides a wide range of publicly available and proprietary FMs from providers such as AI21 Labs, Cohere, Hugging Face, Meta, and Stability AI, which you can deploy to SageMaker endpoints in your own AWS account.
SageMaker offers SageMaker endpoints for inference where you can deploy a publicly available model, such as models from HuggingFace, or your own model.
You can also deploy models on AWS compute using container services such as Amazon Elastic Kubernetes Service (Amazon EKS) or self-managed approaches.

With AWS PrivateLink, you can create a private connection between your virtual private cloud (VPC) and Amazon Bedrock and SageMaker endpoints.

Generative AI application components

This group contains components linked to the unique requirements of generative AI applications. They’re illustrated in the following figure.

Prompt catalog – Crafting effective prompts is important for guiding large language models (LLMs) to generate the desired outputs. Prompt engineering is typically an iterative process, and teams experiment with different techniques and prompt structures until they reach their target outcomes. Having a centralized prompt catalog is essential for storing, versioning, tracking, and sharing prompts. It also lets you automate your evaluation process in your pre-production environments. When a new prompt is added to the catalog, it triggers the evaluation pipeline. If it leads to better performance, your existing default prompt in the application is overridden with the new one. When you use Amazon Bedrock, Amazon Bedrock Prompt Management allows you to create and save your own prompts so you can save time by applying the same prompt to different workflows. Alternatively, you can use Amazon DynamoDB, a serverless, fully managed NoSQL database, to store your prompts.
Prompt chaining – Generative AI developers often use prompt chaining techniques to break complex tasks into subtasks before sending them to an LLM. A centralized service that exposes APIs for common prompt-chaining architectures to your tenants can accelerate development. You can use AWS Step Functions to orchestrate the chaining workflows and Amazon EventBridge to listen to task completion events and trigger the next step. Refer to Perform AI prompt-chaining with Amazon Bedrock for more details.
Agent – Tenants also often employ autonomous agents to complete complex tasks. Such agents orchestrate interactions between models, data sources, APIs, and applications. The agents component allows them to create, manage, access, and share agent implementations. On AWS, you can use the fully managed Amazon Bedrock Agents or tools of your choice such as LangChain agents or LlamaIndex agents.
Re-ranker – In the RAG design, a search in internal company data often returns multiple candidate outputs. A re-ranker, such as a Cohere Rerank 2 model, helps identify the best candidates based on predefined criteria. If your tenants prefer to use the capabilities of managed services such as Amazon OpenSearch Service or Amazon Kendra, this component isn’t needed.
Hybrid search – In RAG, you may also optionally want to implement and expose different templates for performing hybrid search that help improve the quality of the retrieved documents. This logic sits in a hybrid search component. If you use managed services such as Amazon OpenSearch Service, this component is also not required.

Responsible AI components

This group contains key components for Responsible AI, as shown in the following diagram.

Guardrails – Guardrails help you implement safeguards in addition to the FM built-in protections. They can be applied as generic defaults for users in your organization or can be specific to each use case. You can use Amazon Bedrock Guardrails to implement such safeguards based on your application requirements and responsible AI policies. With Amazon Bedrock Guardrails, you can block undesirable topics, filter harmful content, and redact or block sensitive information such as PII and custom regular expression to protect privacy. Additionally, contextual grounding checks can help detect hallucinations in model responses based on a reference source and a user query. The ApplyGuardrail API can evaluate input prompts and model responses for FMs on Amazon Bedrock, custom FMs, and third-party FMs, enabling centralized governance across your generative AI applications.
Red teaming – Red teaming helps reveal model limitations that can cause bad user experiences or enable malicious intentions. LLMs can be vulnerable to security and privacy attacks such as backdoor attacks, poisoning attacks, prompt injection, jailbreaking, PII leakage attacks, membership inference attacks or gradient leakage attacks. You can set up a test application and a red team with your own employees or automate it against a known set of vulnerabilities. For example, you can test the application with known jailbreaking datasets such as these You can use the results to tailor your Amazon Bedrock Guardrails to block undesirable topics, filter harmful content, and redact or block sensitive information.
Human in the loop – The human-in-the-loop approach is the process of collecting human inputs across the ML lifecycle to improve the accuracy and relevancy of models. Humans can perform a variety of tasks, from data generation and annotation to model review, customization, and evaluation. With SageMaker Ground Truth, you have a self-service offering and an AWS managed In the self-service offering, your data annotators, content creators, and prompt engineers (in-house, vendor-managed, or using the public crowd) can use the low-code UI to accelerate human-in-the-loop tasks. The AWS managed offering (SageMaker Ground Truth Plus) designs and customizes an end-to-end workflow and provides a skilled AWS managed team that is trained on specific tasks and meets your data quality, security, and compliance requirements. With model evaluation in Amazon Bedrock, you can set up FM evaluation jobs that use human workers to evaluate the responses from multiple models and compare them with a ground truth response. You can set up different methods including thumbs up or down, 5-point Likert scales, binary choice buttons, or ordinal ranking.
Model evaluation – Model evaluation allows you to compare model outputs and choose the model best suited for downstream generative AI applications. You can use automatic model evaluations, human-in-the-loop evaluations or both. Model evaluation in Amazon Bedrock allows you to set up automatic evaluation jobs and evaluation jobs that use human workers. You can choose existing datasets or provide your own custom prompt dataset. With Amazon SageMaker Clarify, you can evaluate FMs from Amazon SageMaker JumpStart. You can set up model evaluation for different tasks such as text generation, summarization, classification, and question and answering, across different dimensions including prompt stereotyping, toxicity, factual knowledge, semantic robustness, and accuracy. Finally, you can build your own evaluation pipelines and use tools such as fmeval.
Model monitoring – The model monitoring service allows tenants to evaluate model performance against predefined metrics. A model monitoring solution gathers request and response data, runs evaluation jobs to calculate performance metrics against preset baselines, saves the outputs, and sends an alert in case of issues.

If you use Amazon Bedrock, you can enable model invocation logging to collect input and output data and use Amazon Bedrock evaluation to run model evaluation jobs. Alternatively, you can use AWS Lambda and implement your own logic, or use open source tools such as fmeval. In SageMaker, you can enable data capture for your SageMaker real-time endpoint and use SageMaker Clarify to run the model evaluation jobs or implement your own evaluation logic. Both Amazon Bedrock and SageMaker integrate with SageMaker Ground Truth, which helps you gather ground truth data and human feedback for model responses. AWS Step Functions can help you orchestrate the end-to-end monitoring workflow.

Core services

Core services represent a collection of administrative and management components or modules. These components are designed to provide oversight, control, and governance over various aspects of the system’s operation, resource management, user and tenant administration, and model management. These are illustrated in the following diagram.

Tenant management and identity

Tenant management is a crucial aspect of multi-tenant systems, where a single instance of an application or environment serves multiple tenants or customers, each with their own isolated and secure environment. The tenant management component is responsible for managing and administering these tenants within the system.

Tenant onboarding and provisioning – This helps with creating a repeatable onboarding process for new tenants. It involves creating tenant-specific environments, allocating resources, and configuring access controls based on the tenant’s requirements.
Tenant configuration and customization – Many multi-tenant systems allow tenants to customize certain aspects of the application or environment to suit their specific needs. The tenant management component may provide interfaces or tools for tenants to configure settings, branding, workflows, or other customizable features within their isolated environments.
Tenant monitoring and reporting – This component is directly linked to the monitor and metering component and reports on tenant-specific usage, performance, and resource consumption. It can provide insights into tenant activity, identify potential issues, and facilitate capacity planning and resource allocation for each tenant.
Tenant billing and subscription management – In solutions with different pricing models or subscription plans, the tenant management component can handle billing and subscription management for each tenant based on their usage, resource consumption, or contracted service levels.

In the proposed solution, you also need an authorization flow that establishes the identity of the user making the request. With AWS IAM Identity Center, you can create or connect workforce users and centrally manage their access across their AWS accounts and applications. With Amazon Cognito, you can authenticate and authorize users from the built-in user directory, from your enterprise directory, and from other consumer identity providers. AWS Identity and Access Management (IAM) provides fine-grained access control. You can use IAM to specify who can access which FMs and resources to maintain least privilege permissions.

For example, in one common scenario with Cognito that accesses resources with API Gateway and Lambda with a user pool. In the following diagram, when your user signs in to an Amazon Cognito user pool, your application receives JSON Web Tokens (JWTs). You can use groups in a user pool to control permissions with API Gateway by mapping group membership to IAM roles. You can submit your user pool tokens with a request to API Gateway for verification by an Amazon Cognito authorizer Lambda function. For more information, see Using API Gateway with Amazon Cognito user pools.

It is recommended that you don’t use API keys for authentication or authorization to control access to your APIs. Instead, use an IAM role, a Lambda authorizer, or an Amazon Cognito user pool.

Model onboarding

A key aspect of the generative AI gateway is allowing controlled access to foundation and custom models across tenants. For FMs available through Amazon Bedrock, the model onboarding component maintains an allowlist of approved models that tenants can access. You can use a service such as Amazon DynamoDB to track allowlisted models. Similarly, for custom models deployed on Amazon SageMaker, the component tracks which tenants have access to which model versions through entries in the DynamoDB registry table.

To enforce access control, you can use AWS Lambda authorizers with Amazon API Gateway. When a tenant application calls the model invocation API, the Lambda authorizer verifies the tenant’s identity and checks if they have permission to access the requested model based on the DynamoDB registry table. If access is permitted, temporary credentials are issued, which scope down the tenant’s permissions to just the allowed model(s). This prevents tenants from accessing models they shouldn’t have access to. The authorizer logic can be customized based on an organization’s model access policies and governance requirements.

This approach supports model end of life. By managing the model from the allowlist in the DynamoDB registry table for all or selected tenants, models not included aren’t usable automatically, with no further code changes required in the solution.

Model registry

A model registry helps manage and track different versions of custom models. Services such as Amazon SageMaker Model Registry and Amazon DynamoDB help track available models, associated generated model artifacts, and lineage. A model registry offers the following:

Version control – To track different versions of the generative AI models.
Model lineage and provenance – To track the lineage and provenance of each model version, including information about the training data, hyperparameters, model architecture, and other relevant metadata that describes the model’s origin and characteristics.
Model deployment and rollback – To facilitate the deployment and usage of new model versions into production environments and the rollback to previous versions if necessary. This makes sure that models can be updated or reverted seamlessly without disrupting the system’s operation.
Model governance and compliance – To verify that model versions are properly documented, audited, and conform to relevant policies or regulations. This is particularly useful in regulated industries or environments with strict compliance requirements.

Observability

Observability is crucial for monitoring the health of your application, troubleshooting issues, usage of FMs, and optimizing performance and costs.

Logging and monitoring

Amazon CloudWatch is a powerful monitoring and observability service that allows you to collect and analyze logs from your application components, including API Gateway, Amazon Bedrock, Amazon SageMaker, and custom services. Using CloudWatch to capture tenant identity in the logs across the whole stack helps you gain insights into the performance and health of your generative AI gateway down to the tenant level and proactively identify and resolve issues before they escalate. You can also set up alarms to get notified in case of unexpected behavior. Both Amazon SageMaker and Amazon Bedrock are integrated with AWS CloudTrail.

Metering

Metering helps collect, aggregate, and analyze operational and usage data and performance metrics from different parts of the solution. In systems that offer pay-per-use or subscription-based models, metering is crucial for accurately measuring and reporting resource consumption for billing purposes across the different tenants.

In this solution, you need to track the usage of FMs to effectively manage costs and optimize resource utilization. Collecting information related to the models used, number of tokens provided as input, tokens generated as output, AWS Region used, and applying tags related to the team helps you streamline the cost allocation and billing processes. You can log structured data during interactions with the FMs and collect this usage information. The following diagram shows an implementation where the Lambda function logs per tenant information in Amazon CloudWatch and invokes Amazon Bedrock. The invocation generates an AWS CloudTrail event.

Auditing

You can use an AWS Lambda function to aggregate the data from Amazon CloudWatch and store it in S3 buckets for long-term storage and further analysis. Amazon S3 provides a highly durable, scalable, and cost-effective object storage solution, making it an ideal choice for storing large volumes of data. For implementation details, refer to part 1 of this series, Build an internal SaaS service with cost and usage tracking for foundation models on Amazon Bedrock.

Once the data is in Amazon S3, you can use AWS analytics services such as Amazon Athena, AWS Glue Data Catalog, and Amazon QuickSight to uncover patterns in the cost and usage data, generate reports, visualize trends, and make informed decisions about resource allocation, budget forecasting, and cost optimization strategies. With AWS Glue Data Catalog, a centralized metadata repository, and Amazon Athena, an interactive query service, you can run one-time SQL queries directly on the data stored in Amazon S3. The following example describes usage and cost per model per tenant in Athena.

Scaling across the enterprise

The following are some design considerations for when you scale this solution across hundreds of LOBs and teams within an organization.

Account limits – So far, we have discussed how to deploy the gateway solution in a single AWS account. As teams rapidly onboard to the gateway and expand their usage of LLMs, this might result in various components hitting their AWS account limits and can quickly become a bottleneck. We recommend deploying the generative AI gateway to more than one AWS accounts where each AWS account corresponds to one LOB. The reasoning behind this suggestion is, generally, the LOBs in large enterprises are quite autonomous and can each have tens to hundreds of teams. In addition, they may have strict data privacy policies which restricts them from sharing the data with other LOBs. In addition to this account, each LOB may have their non-prod AWS account as well where this gateway solution is deployed for testing and integration purposes.
Production and non-production workloads – In most cases, tenant teams will want to use this gateway across their development, test, and production environments. Although it largely depends on an organization’s operating model, our recommendation is to have a dedicated development, test, and production environment for the gateway as well, so the teams can experiment freely without overloading the production gateway or polluting it with non-production data. This offers the additional benefit that you can set the limits for non-production gateways lower than those in production.
Handling RAG data components – For implementing RAG solutions, we suggest keeping all the data-related components on the tenant’s end. Every tenant will have their own data constraints, update cycle, format, terminologies, and permission groups. Assigning the responsibility of managing data sources to the gateway may hinder scalability because the gateway can’t accommodate the unique requirements of each tenant’s data sources and most likely will end up serving the lowest common denominator. Hence, we recommend having the data sources and related components managed on the tenant’s side.
Avoid reinventing the wheel – With this solution, you can build and manage your own components for model evaluation, guardrails, prompt catalogue, monitoring, and more. Services such as Amazon Bedrock provide the capabilities you need to build generative AI applications with security, privacy, and responsible AI right from the start. Our recommendation is to take a balanced approach and, wherever possible, use AWS native capabilities to reduce operational costs.
Keeping the generative AI gateway thin – Our suggestion is to keep this gateway thin in terms of storing business logic. The gateway shouldn’t add any business rules for any specific tenant and should avoid storing any kind of tenant specific data apart from operational data already discussed in the post.

Conclusion

A generative AI multi-tenant architecture helps you maintain security, governance, and cost controls while scaling the use of generative AI across multiple use cases and teams. In this post, we presented a reference multi-tenant architecture to help you accelerate generative AI adoption. We showed how to standardize common generative AI components and how to expose them as shared services. The proposed architecture also addressed key aspects of governance, security, observability, and responsible AI. Finally, we discussed key considerations when scaling this architecture to hundreds of teams.

If you want to read more about this topic, check out also the following resources:

Let us know what you think in the comments section!

About the authors

Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models and create scalable generative AI and machine learning solutions using AWS services.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale Generative AI and Machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise include: Machine Learning end to end, Machine Learning Industrialization, and Generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations

Vikesh Pandey is a Principal Generative AI/ML Solutions architect, specialising in financial services where he helps financial customers build and scale Generative AI/ML platforms and solution which scales to hundreds to even thousands of users. In his spare time, Vikesh likes to write on various blog forums and build legos with his kid.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Enhance customer support with Amazon Bedrock Agents by integrating enterprise data APIs

Generative AI has transformed customer support, offering businesses the ability to respond faster, more accurately, and with greater personalization. AI agents, powered by large language models (LLMs), can analyze complex customer inquiries, access multiple data sources, and deliver relevant, detailed responses.

In this post, we guide you through integrating Amazon Bedrock Agents with enterprise data APIs to create more personalized and effective customer support experiences. Although the principles discussed are applicable across various industries, we use an automotive parts retailer as our primary example throughout this post.

By the end of this post, you’ll have a clear understanding of how to do the following:

Use Amazon Bedrock Agents to create intelligent, context-aware customer support bots
Integrate enterprise data sources, such as inventory management and catalog systems, with agents using AWS Lambda
Build customized chat interfaces using the Amazon Bedrock Agents API
Implement a solution that can instantly cross-reference product specifications with catalogs, check real-time inventory, and provide detailed information to the end-user

Solution overview

To illustrate the potential of this technology, consider an automotive parts retailer. In this industry, finding the right components can be challenging, because it often involves navigating extensive catalogs and complex compatibility requirements. An automotive retailer might use inventory management APIs to track stock levels and catalog APIs for vehicle compatibility and specifications. Access to car manuals and technical documentation helps the agent provide additional context for curated guidance, enhancing the quality of customer interactions.

The solution presented in this post takes approximately 15–30 minutes to deploy and consists of the following key components:

Amazon OpenSearch Service Serverless maintains three indexes: the inventory index, the compatible parts index, and the owner manuals index. These indexes enable efficient searching and retrieval of part data and vehicle information, providing quick and accurate results.
Amazon Bedrock Agents coordinates interactions between foundation models (FMs), knowledge bases, and user conversations. The agents also automatically call APIs to perform actions and access knowledge bases to provide additional information.
Amazon Bedrock Knowledge Bases enables you to use Retrieval Augmented Generation (RAG), a technique that enhances responses from LLMs by incorporating information from a data store. By setting up a knowledge base with your data sources, your application can query it to provide answers, either through direct quotes from the sources or through naturally generated responses based on the query results.
A web application serves as the frontend interface where users can initiate parts lookup requests.

Ingestion flow

The ingestion flow prepares and stores the necessary data for the AI agent to access. The following diagram illustrates how it works.

The workflow includes the following steps:

Documents (owner manuals) are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
Amazon Bedrock Knowledge Bases ingests these documents:
1. The knowledge base is configured to use the S3 bucket as a data source.
2. The data source is synchronized and the knowledge base detects new, modified, or deleted documents in the S3 bucket and updates accordingly.
3. The documents are chunked into smaller segments for more effective processing. This solution uses fixed-size chunking, where you can configure the desired chunk size by specifying the number of tokens per chunk and an overlap percentage.
Each chunk is embedded by using an embedding model such as Cohere Embed on Amazon Bedrock to create vector representations (embeddings) of the text.
The embeddings are stored in the Amazon OpenSearch Service owner manuals index. OpenSearch Service is used as the vector store for efficient similarity searching. The embeddings, along with metadata about the source documents, are indexed for quick retrieval.

User interaction flow

The following diagram illustrates the user interaction flow.

A user interacts with the Car Parts Agent through a web application interface. They can ask questions like “What wiper blades fit a 2021 Honda CR-V?” or ”Tell me about part number 76622-T0A-A01.”
The web application sends the user’s query to the Amazon Bedrock agent using the InvokeAgent API. The agent, using Anthropic’s Claude 3 Sonnet, interprets the user’s query and determines the best course of action through chain-of-thought (CoT) reasoning. At this stage, the agent employs guardrails to make sure it stays within its defined scope and capabilities. Through a runtime process that includes preprocessing and postprocessing steps, the agent categorizes the user’s input. This allows it to handle out-of-scope questions or potentially harmful inputs appropriately, without attempting to answer beyond its capabilities or knowledge base. The agent then analyzes the query to extract key information such as vehicle details, part numbers, or general automotive topics. If the query is within scope, the agent proceeds; if not, it provides a response indicating it can’t assist with that particular request.
For general inquiries, the agent consults its knowledge base in Amazon Bedrock, which includes information from various car manuals. This allows the agent to provide context and general information about car parts and systems.
For specific part inquiries, the agent consults the action groups available to the agent and invokes the correct action (API) to retrieve relevant information. This invocation happens when the agent determines that it needs to run a specific action based on the user input.
1. The Lambda function runs the database query against the appropriate OpenSearch Service indexes, searching for exact matches or using fuzzy matching for partial information. It can access the inventory index for specific part details or the compatible parts index for compatibility information.
2. The Lambda function processes the OpenSearch Service results and formats them for the Amazon Bedrock agent.
The Amazon Bedrock agent takes the formatted results and generates a human-readable response, combining database information with its general knowledge to provide comprehensive answers.

The following diagram illustrates the workflow of the agent.

This diagram illustrates the agent’s workflow from user query to response generation, integrating knowledge base and API data to provide comprehensive answers and handle follow-up questions.

Developer tools

The solution also uses the following developer tools:

AWS Powertools for Lambda – This is a suite of utilities for Lambda functions that generates OpenAPI schemas from your Lambda function code. It provides annotations for business logic, descriptions, and parameter validations, automatically producing JSON-serialized OpenAPI schemas for use with Amazon Bedrock Agents.
AWS Generative AI Constructs Library – This is an open source extension of the AWS Cloud Development Kit (AWS CDK) that offers multi-service, well-architected patterns for quickly defining generative AI solutions. It provides constructs to help developers build generative AI applications using pattern-based definitions for your infrastructure.

Prerequisites

You should have the following prerequisites:

An AWS account with the appropriate AWS Identity and Access Management (IAM) permissions to create Amazon Bedrock agents and knowledge bases, Lambda functions, and IAM roles
Access to the following models hosted on Amazon Bedrock:
- Anthropic’s Claude 3 Sonnet (model ID: anthropic.claude-3-sonnet-20240229-v1:0)
- Cohere Embed English v3 (model ID: cohere.embed-english-v3)
A local development environment with the following:
- The AWS Command Line Interface (AWS CLI) installed and configured with appropriate permissions.
- Python 3.9 or later
- Node.js v20.x or later
- The AWS CDK CLI installed

Deploy the solution

The following steps outline the process to deploying the solution using the AWS CDK. The complete source code for this solution is available in the GitHub repository.

Open your terminal and run the following commands to clone the GitHub repository to your local machine:

git clone https://github.com/aws-samples/bedrock-agent-carpart-lookup.git
cd bedrock-agent-carpart-lookup

Create and activate a Python virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows, use .venvScriptsactivate

Install the required Python packages:
```
pip install -r requirements.txt
```
Use the AWS CDK CLI to deploy the solution:
```
cdk deploy
```

During deployment, you may be prompted to approve IAM role creations and security changes. Review and approve these if you’re comfortable with the permissions. After deployment, the AWS CDK CLI will output the web application URL. Make note of this URL (as shown in following screenshot) to access and test the agent.

After you deploy the solution, you can verify the created resources on the Amazon Bedrock console. On the Agents page, you’ll notice a new agent called car-parts-agent.

Effective agent instructions are crucial for optimizing the performance of AI-powered assistants. A well-structured set of instructions should encompass several key components:

Agent role – Define the assistant’s purpose, such as serving as a Car Parts Assistant that helps users find compatible parts and automotive information
Agent actions – Outline primary tasks, such as identifying parts based on vehicle details, verifying compatibility, and providing technical specifications
Agent guidelines – Establish rules for interaction, prioritizing accuracy and safety, clearly stating uncertainties, and using actions for searches
Agent guardrails – Implement limits to make sure the agent operates safely and effectively, using relevant automotive knowledge to enhance user support

For example, the agent we deployed has been preconfigured with the following instruction:

You are an Car Parts Assistant, helping users find compatible parts and providing automotive information. Your main tasks are: Part Identification: Find specific parts based on vehicle details (make, model, year). Assist with partial information. Compatibility Checks: Verify if parts are compatible with given vehicles. Technical Info: Provide part specifications, features, and explain component functions. Use database functions for searches and compatibility checks. Supplement with automotive knowledge for comprehensive help. Your goal is to assist effectively while ensuring users make informed decisions about their vehicle parts. Always prioritize accuracy and safety. State uncertainties clearly.

□ Role

□ Actions

□ Guidelines

□ Guardrails

The agent has two main components:

Action group – An action group named CarpartsApi is created, and the actions it can perform are defined using an OpenAPI schema. Optionally, you can use Powertools for AWS Lambda to simplify the process of generating the OpenAPI schema. For more information, refer to the PowerTools documentation on Amazon Bedrock Agents. The OpenAPI schema used by this agent can be viewed on the following GitHub repo. The action group is then associated with a Lambda function containing the business logic for these actions.
Knowledge base – This repository enhances the agent’s responses using RAG in Amazon Bedrock. It contains information from car manuals and technical documentation. When associating a knowledge base with an agent, you can optionally provide a description on how the agent can use the knowledge base. For this demo, we use the following description for the knowledge base:

This knowledge base contains manuals and technical documentation about various car makes from manufacturers such as Honda, Tesla, Ford, Subaru, Kia, Toyota etc.

□ Instructions

The agent employs CoT reasoning to process user queries, analyzing input against its instructions and evaluating actions based the OpenAPI provided and knowledge base description. When required information is missing, as determined by the OpenAPI schema’s specifications, the agent formulates questions to elicit necessary data from the user. This analysis and information gathering leads to a logical sequence of steps, including API calls and knowledge base queries. The resulting observation enhances the prompt for the FM, which then determines and runs the most effective actions.

For this post, we use the AWS CDK and the AWS Generative AI Constructs Library to create the Amazon Bedrock agent. This approach enables version-controlled, reproducible infrastructure as code (IaC). Alternatively, you can create agents using the AWS CLI or AWS Management Console. For more details, see create and configure agent manually.

Test and validate the solution

After you deploy the solution, you can test the agent either on the Amazon Bedrock console or through the application URL noted earlier. In the application, you can ask questions like “What wiper blades fit a 2021 Honda CR-V?” or “Tell me about part number 76622-T0A-A01.”

To illustrate the Amazon Bedrock agent’s reasoning capabilities, let’s examine the agent’s response to a user question “What wiper blades fit a 2021 Honda CR-V?” The workflow would consist of the following steps:

The agent analyzes the query, identifying key elements: Honda (make), CR-V (model), 2021 (year), and wiper blades (part category).
It then reasons about which API to use, selecting the get_compatible_parts function as the most appropriate for vehicle-specific part queries.
The agent maps the query elements to the function’s parameters, demonstrating its understanding of how to translate natural language to API calls.
The agent calls the carPartsApi/get_compatible_parts action, which queries the OpenSearch Service index for compatible parts. When an action group using a Lambda function is invoked, Amazon Bedrock sends a Lambda input event using a general format.
Upon receiving a response, the agent interprets the data, identifying relevant wiper blade options. The agent formulates a natural language response, recommending specific wiper blades. Similarly, Amazon Bedrock expects a response from your Lambda function to match a pre-defined format.

We get the following natural language response:

Based on the compatible parts search for a 2021 Honda CR-V in the “Wipers” category, here are the recommended wiper blade options:

Wiper Blade Set (Part #76622-TLA-A11) – This is a genuine Honda wiper blade set that includes the driver and passenger side wiper blades for the CR-V.

Rear Wiper Blade (Part #76630-TLA-A01) – This is the rear wiper blade specifically for the 2021 CR-V.

You can also consider individual wiper blades like the Driver Side Wiper Blade (Part #76622-T0A-A01) or the Passenger Side Wiper Arm (Part #76630-T0A-A01) if you only need to replace one side.

This process showcases how Amazon Bedrock agents can intelligently integrate enterprise data APIs with AI-driven reasoning to provide accurate, context-aware responses to customer queries.

Clean up

To avoid future costs and clean up resources after you’re done exploring the solution, delete the resources you created by running the following command from your terminal (from the project directory):

cdk destroy

Key considerations

When implementing Amazon Bedrock Agents, consider the following factors to facilitate optimal performance and scalability:

Agent design – Follow these recommendations when designing your agent:
- Keep instructions focused and clear, with specific responsibilities for the agent
- For complex use cases, consider multiple specialized agents rather than overloading a single one
- Explore different FMs to find the best fit for your needs, considering both behavior and cost
Action management – Consider the following recommendations for action management:
- Define actions carefully, including only those that the agent should reliably perform
- Use clear, descriptive names for actions to help the agent determine their relevance
- Avoid overlapping actions to prevent confusion and conflicts during operation
Testing – Make sure your testing includes the following steps:
- Establish clear testing protocols
- Identify common use case inputs and set accuracy targets
- Define edge case inputs and agree on acceptable accuracy levels
- Determine out-of-domain inputs where the agent should not respond
- Automate tests and run them with system changes to verify consistency and reliability
Performance optimization – Consider the following performance optimizations:
- Break down complex operations into smaller actions to enhance response time and error handling
- Implement a “fail fast” principle for invalid queries, allowing more time for complex tasks
Security and compliance – Use Amazon Bedrock Guardrails to prevent the agent from generating harmful content or making unauthorized actions
Cost management – Monitor usage-based pricing for token processing and storage, facilitating efficient resource allocation and cost management

Conclusion

Integrating enterprise data APIs with Amazon Bedrock Agents offers a powerful solution for streamlining customer support, as demonstrated in the automotive parts industry. This AI-driven approach enables rapid, accurate responses to complex queries, seamlessly integrates multiple data sources, and reduces staff workload while enhancing customer experience through context-aware interactions.

The solution discussed in this post can elevate customer support across various industries. By using Amazon Bedrock agents, organizations can create more efficient, accurate, and satisfying support experiences tailored to their specific needs. To explore how AI agents can transform your own support operations, refer to Automate tasks in your application using conversational agents.

About the Authors

Deepak Kovvuri is a Senior Solutions Architect supporting Automotive and Manufacturing Customers at AWS in the US Northeast. He has over 6 years of experience in helping customers architecting a DevOps strategy for their cloud workloads. Deepak specializes in CI/CD, Systems Administration, Infrastructure as Code and Container Services. He holds an Masters in Computer Engineering from University of Illinois at Chicago.

Kingston Bosco is a Senior Solutions Architect for Global Strategic Partners at AWS. He designs and implements solutions that optimize DevOps workflows, automate cloud operations, and improve infrastructure management for customers. He holds a Master’s in Information Systems. In his free time, he enjoys hiking with his dogs and playing soccer.

Welcome to GeForce NOW Performance: Priority Members Get Instant Upgrade

This GFN Thursday, the GeForce NOW Priority membership is getting enhancements and a fresh name to go along with it. The new Performance membership offers more GeForce-powered premium gaming — at no change in the monthly membership cost.

Gamers having a hard time deciding between the Performance and Ultimate memberships can take them both for a spin with a Day Pass, now 25% off for a limited time. Day Passes give access to 24 continuous hours of powerful cloud gaming.

In addition, seven new games are available this week, joining the over 2,000 games in the GeForce NOW library.

Time for a Glow Up

The Performance membership keeps all the same great gaming benefits and now provides members with an enhanced streaming experience at no additional cost.

Performance membership on GeForce NOW — *Say hello to the Performance membership.*

Performance members can stream at up to 1440p — an increase from the previous 1080p resolution — and experience games in immersive, ultrawide resolutions. They can also save their in-game graphics settings across streaming sessions, including for NVIDIA RTX features in supported titles.

All current Priority members are automatically upgraded to Performance and can take advantage of the upgraded streaming experience today.

Performance members will connect to GeForce RTX-powered gaming rigs for up to 1440p resolution. Ultimate members continue to receive the top streaming experience: connecting to GeForce RTX 4080-powered gaming rigs with up to 4K resolution and 120 frames per second, or 1080p and 240 fps in Competitive mode for games with support for NVIDIA Reflex technology.

Gamers playing on the free tier will now see they’re streaming from basic rigs, with varying specs that offer entry-level cloud gaming and are optimized for capacity.

Account portal on GeForce NOW — *Time to play.*

At the start of next year, GeForce NOW will roll out a 100-hour monthly playtime allowance to continue providing exceptional quality and speed — as well as shorter queue times — for Performance and Ultimate members. This ample limit comfortably accommodates 94% of members, who typically enjoy the service well within this timeframe. Members can check out how much time they’ve spent in the cloud through their account portal (see screenshot example above).

Up to 15 hours of unused playtime will automatically roll over to the next month for members, and additional hours can be purchased at $2.99 for 15 additional hours of Performance, or $5.99 for 15 additional Ultimate hours.

Loyal Member Benefit

To thank the GFN community for joining the cloud gaming revolution, GeForce NOW is offering active paid members as of Dec. 31, 2024, the ability to continue with unlimited playtime for a full year until January 2026.

New members can lock in this feature by signing up for GeForce NOW before Dec. 31, 2024. As long as a member’s account remains uninterrupted and in good standing, they’ll continue to receive unlimited playtime for all of 2025.

Don’t Pass This Up

For those looking to try out the new premium benefits and all Performance and Ultimate memberships have to offer, Day Passes are 25% off for a limited time.

Whether with the newly named Performance Day Pass at $2.99 or the Ultimate Day Pass at $5.99, members can unlock 24 hours of uninterrupted access to powerful NVIDIA GeForce RTX-powered cloud gaming servers.

Another new GeForce NOW feature lets users apply the value of their most recently purchased Day Pass toward any monthly membership if they sign up within 48 hours of the completion of their Day Pass.

Day Pass Sale on GeForce NOW — *Quarter the price, full day of fun.*

Dive into a vast library of over 2,000 games with enhanced graphics, including NVIDIA RTX features like ray tracing and DLSS. With the Ultimate Day Pass, snag a taste of GeForce NOW’s highest-performing membership tier and enjoy up to 4K resolution 120 fps or 1080p 240 fps across nearly any device. It’s an ideal way to experience elevated GeForce gaming in the cloud.

Thrilling New Games

Members can look for the following games available to stream in the cloud this week:

Planet Coaster 2 (New release on Steam, Nov. 6)
Teenage Mutant Ninja Turtles: Splintered Fate (New release on Steam, Nov. 6)
Empire of the Ants (New release on Steam, Nov. 7)
Unrailed 2: Back on Track (New release on Steam, Nov. 7)
TCG Card Shop Simulator (Steam)
StarCraft II (Xbox, available on PC Game Pass, Nov. 5. Members need to enable access.)
StarCraft Remastered (Xbox, available on PC Game Pass, Nov. 5. Members need to enable access.)

What are you planning to play this weekend? Let us know on X or in the comments below.

Unleash the power of generative AI with Amazon Q Business: How CCoEs can scale cloud governance best practices and drive innovation

This post is co-written with Steven Craig from Hearst.

To maintain their competitive edge, organizations are constantly seeking ways to accelerate cloud adoption, streamline processes, and drive innovation. However, Cloud Center of Excellence (CCoE) teams often can be perceived as bottlenecks to organizational transformation due to limited resources and overwhelming demand for their support.

In this post, we share how Hearst, one of the nation’s largest global, diversified information, services, and media companies, overcame these challenges by creating a self-service generative AI conversational assistant for business units seeking guidance from their CCoE. With Amazon Q Business, Hearst’s CCoE team built a solution to scale cloud best practices by providing employees across multiple business units self-service access to a centralized collection of documents and information. This freed up the CCoE to focus their time on high-value tasks by reducing repetitive requests from each business unit.

Readers will learn the key design decisions, benefits achieved, and lessons learned from Hearst’s innovative CCoE team. This solution can serve as a valuable reference for other organizations looking to scale their cloud governance and enable their CCoE teams to drive greater impact.

The challenge: Enabling self-service cloud governance at scale

Hearst undertook a comprehensive governance transformation for their Amazon Web Services (AWS) infrastructure. The CCoE implemented AWS Organizations across a substantial number of business units. These business units then used AWS best practice guidance from the CCoE by deploying landing zones with AWS Control Tower, managing resource configuration with AWS Config, and reporting the efficacy of controls with AWS Audit Manager. As individual business units sought guidance on adhering to the AWS recommended best practices, the CCoE created written directives and enablement materials to facilitate the scaled adoption across Hearst.

The existing CCoE model had several obstacles slowing adoption by business units:

Extreme demand – The CCoE team was becoming a bottleneck, unable to keep up with the growing demand for their expertise and guidance. The team was stretched thin, and the traditional approach of relying on human experts to address every question was impeding the pace of cloud adoption for the organization.
Limited scalability – As the volume of requests increased, the CCoE team couldn’t disseminate updated directives quickly enough. Manually reviewing each request across multiple business units wasn’t sustainable.
Inconsistent governance – Without a standardized, self-service mechanism to access the CCoE teams’ expertise and disseminate guidance on new policies, compliance practices, or governance controls, it was difficult to maintain consistency based on the CCoE best practices across each business unit.

To address these challenges, Hearst’s CCoE team recognized the need to quickly create a scalable, self-service application that could empower the business units with more access to updated CCoE best practices and patterns to follow.

Overview of solution

To enable self-service cloud governance at scale, Hearst’s CCoE team decided to use the power of generative AI with Amazon Q Business to build a conversational assistant. The following diagram shows the solution architecture:

The key steps Hearst took to implement Amazon Q Business were:

Application deployment and authentication – First, the CCoE team deployed Amazon Q Business and integrated AWS IAM Identity Center with their existing identity provider (using Okta in this case) to seamlessly manage user access and permissions between their existing identity provider and Amazon Q Business.
Data source curation and authorization – The CCoE team created several Amazon Simple Storage Service (Amazon S3) buckets to store their curated content, including cloud governance best practices, patterns, and guidance. They set up a general bucket for all users and specific buckets tailored to each business unit’s needs. User authorization for documents within the individual S3 buckets were controlled through access control lists (ACLs). You add access control information to a document in an Amazon S3 data source using a metadata file associated with the document. This made sure end users would only receive responses from documents they were authorized to view. With the Amazon Q Business S3 connector, the CCoE team was able to sync and index their data in just a few clicks.
User access management – With the data source and access controls in place, the CCoE team then set up user access on a business unit by business unit basis, considering various security, compliance, and custom requirements. As a result, the CCoE could deliver a personalized experience to each business unit.
User interface development – To provide a user-friendly experience, Hearst built a custom web interface so employees could interact with the Amazon Q Business assistant through a familiar and intuitive interface. This encouraged widespread adoption and self-service among the business units.
Rollout and continuous improvement – Finally, the CCoE team shared the web experience with the various business units, empowering employees to access the guidance and best practices they needed through natural language interactions. Going forward, the team enriched the knowledge base (S3 buckets) and implemented a feedback loop to facilitate continuous improvement of the solution.

For Hearst’s CCoE team, Amazon Q Business was the quickest way to use generative AI on AWS, with minimal risk and less upfront technical complexity.

Speed to value was an important advantage because it allowed the CCoE to get these powerful generative AI capabilities into the hands of employees as quickly as possible, unlocking new levels of scalability, efficiency, and innovation for cloud governance consistency across the organization.
This strategic decision to use a managed service at the application layer, such as Amazon Q Business, enabled the CCoE to deliver tangible value for the business units in a matter of weeks. By opting for the expedited path to using generative AI on AWS, Hearst was never bogged down in the technical complexities of developing and managing their own generative AI application.

The results: Decreased support requests and increased cloud governance consistency

By using Amazon Q Business, Hearst’s CCoE team achieved remarkable results in empowering self-service cloud governance across the organization. The initial impact was immediate—within the first month, the CCoE team saw a 70% reduction in the volume of requests for guidance and support from the various business units. This freed up the team to focus on higher-value initiatives instead of getting bogged down in repetitive, routine requests. The following month, the number of requests for CCoE support dropped by 76%, demonstrating the power of a self-service assistant with Amazon Q Business. The benefits went beyond just reduced request volume. The CCoE team also saw a significant improvement in the consistency and quality of cloud governance practices across Hearst, enhancing the organization’s overall cloud security, compliance posture, and cloud adoption.

Conclusion

Cloud governance is a critical set of rules, processes, and reports that guide organizations to follow best practices across their IT estate. For Hearst, the CCoE team sets the tone and cloud governance standards that each business unit follows. The implementation of Amazon Q Business allowed Hearst’s CCoE team to scale the governance and security that support business units depend on through a generative AI assistant. By disseminating best practices and guidance across the organization, the CCoE team freed up resources to focus on strategic initiatives, while employees gained access to a self-service application, reducing the burden on the central team. If your CCoE team is looking to scale its impact and enable your workforce, consider using the power of conversational AI through services like Amazon Q Business, which can position your team as a strategic enabler of cloud transformation.

Listen to Steven Craig share how Hearst leveraged Amazon Q Business to scale the Cloud Center of Excellence

Reading References:

Getting started with Amazon Q
The Business Value of AWS Cloud Governance Services IDC white paper.

About the Authors

Steven Craig is a Sr. Director, Cloud Center of Excellence. He oversees Cloud Economics, Cloud Enablement, and Cloud Governance for all Hearst-owned companies. Previously, as VP Product Strategy and Ops at Innova Solutions, he was instrumental in migrating applications to public cloud platforms and creating IT Operations Managed Service offerings. His leadership and technical solutions were key in achieving sequential AWS Managed Services Provider certifications. Steven has been AWS Professionally certified for over 8 years.

Oleg Chugaev is a Principal Solutions Architect and Serverless evangelist with 20+ years in IT, holding multiple AWS certifications. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences.

Rohit Chaudhari is a Senior Customer Solutions Manager with over 15 years of diverse tech experience. His background spans customer success, product management, digital transformation coaching, engineering, and consulting. At AWS, Rohit serves as a trusted advisor for customers to work backwards from their business goals, accelerate their journey to the cloud, and implement innovative solutions.

Al Destefano is a Generative AI Specialist at AWS based in New York City. Leveraging his AI/ML domain expertise, Al develops and executes global go-to-market strategies that drive transformative results for AWS customers at scale. He specializes in helping enterprise customers harness the power of Amazon Q, a generative AI-powered assistant, to overcome complex challenges and unlock new business opportunities.

Integrate foundation models into your code with Amazon Bedrock

The rise of large language models (LLMs) and foundation models (FMs) has revolutionized the field of natural language processing (NLP) and artificial intelligence (AI). These powerful models, trained on vast amounts of data, can generate human-like text, answer questions, and even engage in creative writing tasks. However, training and deploying such models from scratch is a complex and resource-intensive process, often requiring specialized expertise and significant computational resources.

Enter Amazon Bedrock, a fully managed service that provides developers with seamless access to cutting-edge FMs through simple APIs. Amazon Bedrock streamlines the integration of state-of-the-art generative AI capabilities for developers, offering pre-trained models that can be customized and deployed without the need for extensive model training from scratch. Amazon maintains the flexibility for model customization while simplifying the process, making it straightforward for developers to use cutting-edge generative AI technologies in their applications. With Amazon Bedrock, you can integrate advanced NLP features, such as language understanding, text generation, and question answering, into your applications.

In this post, we explore how to integrate Amazon Bedrock FMs into your code base, enabling you to build powerful AI-driven applications with ease. We guide you through the process of setting up the environment, creating the Amazon Bedrock client, prompting and wrapping code, invoking the models, and using various models and streaming invocations. By the end of this post, you’ll have the knowledge and tools to harness the power of Amazon Bedrock FMs, accelerating your product development timelines and empowering your applications with advanced AI capabilities.

Solution overview

Amazon Bedrock provides a simple and efficient way to use powerful FMs through APIs, without the need for training custom models. For this post, we run the code in a Jupyter notebook within VS Code and use Python. The process of integrating Amazon Bedrock into your code base involves the following steps:

Set up your development environment by importing the necessary dependencies and creating an Amazon Bedrock client. This client will serve as the entry point for interacting with Amazon Bedrock FMs.
After the Amazon Bedrock client is set up, you can define prompts or code snippets that will be used to interact with the FMs. These prompts can include natural language instructions or code snippets that the model will process and generate output based on.
With the prompts defined, you can invoke the Amazon Bedrock FM by passing the prompts to the client. Amazon Bedrock supports various models, each with its own strengths and capabilities, allowing you to choose the most suitable model for your use case.
Depending on the model and the prompts provided, Amazon Bedrock will generate output, which can include natural language text, code snippets, or a combination of both. You can then process and integrate this output into your application as needed.
For certain models and use cases, Amazon Bedrock supports streaming invocations, which allow you to interact with the model in real time. This can be particularly useful for conversational AI or interactive applications where you need to exchange multiple prompts and responses with the model.

Throughout this post, we provide detailed code examples and explanations for each step, helping you seamlessly integrate Amazon Bedrock FMs into your code base. By using these powerful models, you can enhance your applications with advanced NLP capabilities, accelerate your development process, and deliver innovative solutions to your users.

Prerequisites

Before you dive into the integration process, make sure you have the following prerequisites in place:

AWS account – You’ll need an AWS account to access and use Amazon Bedrock. If you don’t have one, you can create a new account.
Development environment – Set up an integrated development environment (IDE) with your preferred coding language and tools. You can interact with Amazon Bedrock using AWS SDKs available in Python, Java, Node.js, and more.
AWS credentials – Configure your AWS credentials in your development environment to authenticate with AWS services. You can find instructions on how to do this in the AWS documentation for your chosen SDK. We walk through a Python example in this post.

With these prerequisites in place, you’re ready to start integrating Amazon Bedrock FMs into your code.

In your IDE, create a new file. For this example, we use a Jupyter notebook (Kernel: Python 3.12.0).

In the following sections, we demonstrate how to implement the solution in a Jupyter notebook.

Set up the environment

To begin, import the necessary dependencies for interacting with Amazon Bedrock. The following is an example of how you can do this in Python.

First step is to import boto3 and json:

import boto3, json

Next, create an instance of the Amazon Bedrock client. This client will serve as the entry point for interacting with the FMs. The following is a code example of how to create the client:

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

Define prompts and code snippets

With the Amazon Bedrock client set up, define prompts and code snippets that will be used to interact with the FMs. These prompts can include natural language instructions or code snippets that the model will process and generate output based on.

In this example, we asked the model, “Hello, who are you?”.

To send the prompt to the API endpoint, you need some keyword arguments to pass in. You can get these arguments from the Amazon Bedrock console.

On the Amazon Bedrock console, choose Base models in the navigation pane.

Select Titan Text G1 – Express.

Choose the model name (Titan Text G1 – Express) and go to the API request.

Copy the API request:

{
"modelId": "amazon.titan-text-express-v1",
"contentType": "application/json",
"accept": "application/json",
"body": "{"inputText":"this is where you place your input text","textGenerationConfig":{"maxTokenCount":8192,"stopSequences":[],"temperature":0,"topP":1}}"
}

Insert this code in the Jupyter notebook with the following minor modifications:
- We post the API requests to keyword arguments (kwargs).
- The next change is on the prompt. We will replace ”this is where you place your input text” by ”Hello, who are you?”
Print the keyword arguments:

kwargs = {
 "modelId": "amazon.titan-text-express-v1",
 "contentType": "application/json",
 "accept": "application/json",
 "body": "{"inputText":"Hello, who are you?","textGenerationConfig":{"maxTokenCount":8192,"stopSequences":[],"temperature":0,"topP":1}}"
}
print(kwargs)

This should give you the following output:

{'modelId': 'amazon.titan-text-express-v1', 'contentType': 'application/json', 'accept': 'application/json', 'body': '{"inputText":"Hello, who are you?","textGenerationConfig":{"maxTokenCount":8192,"stopSequences":[],"temperature":0,"topP":1}}'}

Invoke the model

With the prompt defined, you can now invoke the Amazon Bedrock FM.

Pass the prompt to the client:

response = bedrock_runtime.invoke_model(**kwargs)
response

This will invoke the Amazon Bedrock model with the provided prompt and print the generated streaming body object response.

{'ResponseMetadata': {'RequestId': '3cfe2718-b018-4a50-94e3-59e2080c75a3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 18 Oct 2024 11:30:14 GMT', 'content-type': 'application/json', 'content-length': '255', 'connection': 'keep-alive', 'x-amzn-requestid': '3cfe2718-b018-4a50-94e3-59e2080c75a3', 'x-amzn-bedrock-invocation-latency': '1980', 'x-amzn-bedrock-output-token-count': '37', 'x-amzn-bedrock-input-token-count': '6'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody at 0x105e8e7a0>}

The preceding Amazon Bedrock runtime invoke model will work for the FM you choose to invoke.

Unpack the JSON string as follows:

response_body = json.loads(response.get('body').read())
response_body

You should get a response as follows (this is the response we got from the Titan Text G1 – Express model for the prompt we supplied).

{'inputTextTokenCount': 6, 'results': [{'tokenCount': 37, 'outputText': 'nI am Amazon Titan, a large language model built by AWS. It is designed to assist you with tasks and answer any questions you may have. How may I help you?', 'completionReason': 'FINISH'}]}

Experiment with different models

Amazon Bedrock offers various FMs, each with its own strengths and capabilities. You can specify which model you want to use by passing the model_name parameter when creating the Amazon Bedrock client.

Like the previous Titan Text G1 – Express example, get the API request from the Amazon Bedrock console. This time, we use Anthropic’s Claude on Amazon Bedrock.

{
"modelId": "anthropic.claude-v2",
"contentType": "application/json",
"accept": "*/*",
"body": "{"prompt":"\n\nHuman: Hello world\n\nAssistant:","max_tokens_to_sample":300,"temperature":0.5,"top_k":250,"top_p":1,"stop_sequences":["\n\nHuman:"],"anthropic_version":"bedrock-2023-05-31"}"
}

Anthropic’s Claude accepts the prompt in a different way (\n\nHuman:), so the API request on the Amazon Bedrock console provides the prompt in the way that Anthropic’s Claude can accept.

Edit the API request and put it in the keyword argument:

kwargs = {
  "modelId": "anthropic.claude-v2",
  "contentType": "application/json",
  "accept": "*/*",
  "body": "{"prompt":"\n\nHuman: we have received some text without any context.\nWe will need to label the text with a title so that others can quickly see what the text is about \n\nHere is the text between these <text></text> XML tags\n\n<text>\nToday I sent to the beach and saw a whale. I ate an ice-cream and swam in the sea\n</text>\n\nProvide title between <title></title> XML tags\n\nAssistant:","max_tokens_to_sample":300,"temperature":0.5,"top_k":250,"top_p":1,"stop_sequences":["\n\nHuman:"],"anthropic_version":"bedrock-2023-05-31"}"
}
print(kwargs)

You should get the following response:

{'modelId': 'anthropic.claude-v2', 'contentType': 'application/json', 'accept': '*/*', 'body': '{"prompt":"\n\nHuman: we have received some text without any context.\nWe will need to label the text with a title so that others can quickly see what the text is about \n\nHere is the text between these <text></text> XML tags\n\n<text>\nToday I sent to the beach and saw a whale. I ate an ice-cream and swam in the sea\n</text>\n\nProvide title between <title></title> XML tags\n\nAssistant:","max_tokens_to_sample":300,"temperature":0.5,"top_k":250,"top_p":1,"stop_sequences":["\n\nHuman:"],"anthropic_version":"bedrock-2023-05-31"}'}

With the prompt defined, you can now invoke the Amazon Bedrock FM by passing the prompt to the client:

response = bedrock_runtime.invoke_model(**kwargs)
response

You should get the following output:

{'ResponseMetadata': {'RequestId': '72d2b1c7-cbc8-42ed-9098-2b4eb41cd14e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Thu, 17 Oct 2024 15:07:23 GMT', 'content-type': 'application/json', 'content-length': '121', 'connection': 'keep-alive', 'x-amzn-requestid': '72d2b1c7-cbc8-42ed-9098-2b4eb41cd14e', 'x-amzn-bedrock-invocation-latency': '538', 'x-amzn-bedrock-output-token-count': '15', 'x-amzn-bedrock-input-token-count': '100'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody at 0x1200b5990>}

Unpack the JSON string as follows:

response_body = json.loads(response.get('body').read())
response_body

This results in the following output on the title for the given text.

{'type': 'completion', 'completion': ' <title>A Day at the Beach</title>', 'stop_reason': 'stop_sequence', 'stop': 'nnHuman:'}

Print the completion:

completion = response_body.get('completion')
completion

Because the response is returned in the XML tags as you defined, you can consume the response and display it to the client.

' <title>A Day at the Beach</title>'

Invoke model with streaming code

For certain models and use cases, Amazon Bedrock supports streaming invocations, which allow you to interact with the model in real time. This can be particularly useful for conversational AI or interactive applications where you need to exchange multiple prompts and responses with the model. For example, if you’re asking the FM for an article or story, you might want to stream the output of the generated content.

Import the dependencies and create the Amazon Bedrock client:

import boto3, json
bedrock_runtime = boto3.client(
service_name='bedrock-runtime',
region_name='us-east-1'
)

Define the prompt as follows:

prompt = "write an article about fictional planet Foobar"

Edit the API request and put it in keyword argument as before:
We use the API request of the claude-v2 model.

kwargs = {
  "modelId": "anthropic.claude-v2",
  "contentType": "application/json",
  "accept": "*/*",
  "body": "{"prompt":"\n\nHuman: " + prompt + "\nAssistant:","max_tokens_to_sample":300,"temperature":0.5,"top_k":250,"top_p":1,"stop_sequences":["\n\nHuman:"],"anthropic_version":"bedrock-2023-05-31"}"
}

You can now invoke the Amazon Bedrock FM by passing the prompt to the client:
We use invoke_model_with_response_stream instead of invoke_model.

response = bedrock_runtime.invoke_model_with_response_stream(**kwargs)

stream = response.get('body')
if stream:
    for event in stream:
        chunk = event.get('chunk')
        if chunk:
            print(json.loads(chunk.get('bytes')).get('completion'), end="")

You get a response like the following as streaming output:

Here is a draft article about the fictional planet Foobar: Exploring the Mysteries of Planet Foobar Far off in a distant solar system lies the mysterious planet Foobar. This strange world has confounded scientists and explorers for centuries with its bizarre environments and alien lifeforms. Foobar is slightly larger than Earth and orbits a small, dim red star. From space, the planet appears rusty orange due to its sandy deserts and red rock formations. While the planet looks barren and dry at first glance, it actually contains a diverse array of ecosystems. The poles of Foobar are covered in icy tundra, home to resilient lichen-like plants and furry, six-legged mammals. Moving towards the equator, the tundra slowly gives way to rocky badlands dotted with scrubby vegetation. This arid zone contains ancient dried up riverbeds that point to a once lush environment. The heart of Foobar is dominated by expansive deserts of fine, deep red sand. These deserts experience scorching heat during the day but drop to freezing temperatures at night. Hardy cactus-like plants manage to thrive in this harsh landscape alongside tough reptilian creatures. Oases rich with palm-like trees can occasionally be found tucked away in hidden canyons. Scattered throughout Foobar are pockets of tropical jungles thriving along rivers and wetlands.

Conclusion

In this post, we showed how to integrate Amazon Bedrock FMs into your code base. With Amazon Bedrock, you can use state-of-the-art generative AI capabilities without the need for training custom models, accelerating your development process and enabling you to build powerful applications with advanced NLP features.

Whether you’re building a conversational AI assistant, a code generation tool, or another application that requires NLP capabilities, Amazon Bedrock provides a simple and efficient solution. By using the power of FMs through Amazon Bedrock APIs, you can focus on building innovative solutions and delivering value to your users, without worrying about the underlying complexities of language models.

As you continue to explore and integrate Amazon Bedrock into your projects, remember to stay up to date with the latest updates and features offered by the service. Additionally, consider exploring other AWS services and tools that can complement and enhance your AI-driven applications, such as Amazon SageMaker for machine learning model training and deployment, or Amazon Lex for building conversational interfaces.

To further explore the capabilities of Amazon Bedrock, refer to the following resources:

Share and learn with our generative AI community at community.aws.

Happy coding and building with Amazon Bedrock!

About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customer guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

YaduKishore Tatavarthi is a Senior Partner Solutions Architect at Amazon Web Services, supporting customers and partners worldwide. For the past 20 years, he has been helping customers build enterprise data strategies, advising them on Generative AI, cloud implementations, migrations, reference architecture creation, data modeling best practices, and data lake/warehouse architectures.

New features in Amazon Bedrock Prompt Management

Create a new prompt

Build the prompt

Compare prompt variants

Invoke the prompt

Now available

Conclusion

About the Authors

Motivation for streamlined ML operations and large-scale inference

Solution overview

A fully automated production workflow

Seamless integration of experiments

Conclusion

References

About the Authors

Overview

Real-world use cases for media, advertising, and entertainment

Ideation and iteration

Character design

Social media marketing asset generation

Stability AI’s capabilities in advertising and marketing campaigns

Prerequisites

Setting up the demo

Generating creative ad campaigns with multiple models

Prompt engineering for visual assets

Generating ad posters with Stable Image Ultra

Clean up

Conclusion

About the authors

RELand: Risk Estimation of Landmines via Interpretable Invariant Risk Minimization

Hazard Cluster Identification as a Quadratic Knapsack Problem

Tangible Impact of RELand

Aknowledgments

References

Architecting a multi-tenant generative AI environment on AWS

Solution overview

Generative AI gateway

Tenant

Solution walkthrough

HTTPS endpoint

Tenants and access patterns

Shared services

Model components

Generative AI application components

Responsible AI components

Core services

Scaling across the enterprise

Conclusion

About the authors

Solution overview

Ingestion flow

User interaction flow

Developer tools

Prerequisites

Deploy the solution

Test and validate the solution

Clean up

Key considerations

Conclusion

About the Authors

Time for a Glow Up

Loyal Member Benefit

Don’t Pass This Up

Thrilling New Games

The challenge: Enabling self-service cloud governance at scale

Overview of solution

The results: Decreased support requests and increased cloud governance consistency

Conclusion

Reading References:

About the Authors

Solution overview

Prerequisites

Set up the environment

Define prompts and code snippets

Invoke the model

Experiment with different models

Invoke model with streaming code

Conclusion

About the Authors

Navigation