Research Focus: Week of February 19, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Vertically Autoscaling Monolithic Applications with CaaSPER: Scalable Container-as-a-Service Performance Enhanced Resizing Algorithm for the Cloud

Kubernetes is a prominent open-source platform for managing cloud applications, including stateful databases, which keep track of changes and transactions involving the underlying data. These monolithic applications often must rely on vertical resource scaling instead of horizontal scale-out, adjusting CPU cores to match load fluctuations. However, an analysis of database-as-a-service (DBaaS) offerings at Microsoft revealed that many customers consistently over-provision resources for peak workloads, neglecting opportunities to optimize their cloud resource consumption by scaling down. Existing vertical autoscaling tools lack the ability to minimize resource slack and respond promptly to throttling, leading to increased costs and impacting crucial metrics, such as throughput and availability.

In a recent paper: Vertically Autoscaling Monolithic Applications with CaaSPER: Scalable Container-as-a-Service Performance Enhanced Resizing Algorithm for the Cloud, researchers from Microsoft propose CaaSPER, a vertical autoscaling algorithm that blends reactive and proactive strategies to address this challenge. By dynamically adjusting CPU resources, CaaSPER minimizes resource slack, maintains optimal CPU utilization, and reduces throttling. Importantly, customers have the flexibility to prioritize either cost savings or high performance. Extensive testing demonstrates that CaaSPER effectively reduces throttling and keeps CPU utilization within target levels. CaaSPER is designed to be application-agnostic and platform-agnostic, with potential for extension to other applications and resources requiring vertical autoscaling.

Microsoft Research Podcast

Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko

Spencer Fowers and Kwame Darko break down how the technology behind Holoportation, and the telecommunication device being built around it, brings patients and doctors together when being in the same room isn’t an easy option, and they discuss the potential impact of the work.


Improved Scene Landmark Detection for Camera Localization

Camera localization is a fundamental component commonly used in computer vision, robotics, augmented reality, and virtual reality applications for estimating the precise 3D position and orientation of a camera-enabled device within a scene. Localization techniques that use image-based retrieval, visual feature matching, and 3D structure-based pose estimation are generally accurate, but they require high storage, are often slow, and are not privacy-preserving. Researchers from Microsoft and external colleagues recently proposed an alternate learning-based localization method based on scene landmark detection (SLD) to address these limitations. It involves training a convolutional neural network to detect a few predetermined, salient, scene-specific 3D points or landmarks and computing camera pose from the associated 2D–3D correspondences. Although SLD outperformed existing learning-based approaches, it was notably less accurate than 3D structure-based methods.

In a recent paper: Improved Scene Landmark Detection for Camera Localization, researchers from Microsoft show that the accuracy gap was due to insufficient model capacity and noisy labels during training. To mitigate the capacity issue, they propose splitting the landmarks into subgroups and training a separate network for each subgroup. To generate better training labels, they propose using dense reconstructions to estimate accurate visibility of scene landmarks. Finally, they present a compact neural network architecture to improve memory efficiency. This approach is as accurate as state-of-the-art structure-based methods on the INDOOR-6 dataset, but it runs significantly faster and uses less storage.


ESUS: Aligning and Simplifying SUS for Enterprise Applications

Over many years, researchers have developed standard questionnaires to evaluate usability and present a single score that represents a product’s overall level of ease of use. These evaluations are highly valuable for researchers studying human-computer interaction (HCI) and user experience (UX). One of the most notable questionnaires is the System Usability Scale (SUS). However, since the SUS was introduced in 1986, products and services have undergone monumental advances in technology, while HCI and UX research practices have matured considerably. These changes are also true in enterprise environments.

In a recent article: ESUS: Aligning and Simplifying SUS for Enterprise Applications, researchers from Microsoft present preliminary evidence showing the effectiveness of a new usability questionnaire with three advantages for enterprise applications over the original 10-item SUS questionnaire. The Enterprise System Usability Scale (ESUS) offers better measurement of usability for technical products and services; reduced questionnaire items; and alignment with enterprise environments. Results indicate that the ESUS strongly correlates with user satisfaction, similar to the SUS.

Shining Brighter Together: Google’s Gemma Optimized to Run on NVIDIA GPUs

NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create the Gemini models — with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud and on PCs with NVIDIA RTX GPUs.

This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.

Developers can also run Gemma on NVIDIA GPUs in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU and soon, NVIDIA’s H200 Tensor Core GPUs — featuring 141GB of HBM3e memory at 4.8 terabytes per second — which Google will deploy this year.

Enterprise developers can additionally take advantage of NVIDIA’s rich ecosystem of tools — including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM — to fine-tune Gemma and deploy the optimized model in their production application.

Learn more about how TensorRT-LLM is revving up inference for Gemma, along with additional information for developers. This includes several model checkpoints of Gemma and the FP8-quantized version of the model, all optimized with TensorRT-LLM.

Experience Gemma 2B and Gemma 7B directly from your browser on the NVIDIA AI Playground.

Gemma Coming to Chat With RTX

Adding support for Gemma soon is Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs.

Chat with RTX lets users personalize a chatbot with their own data by easily connecting local files on a PC to a large language model.

Since the model runs locally, it provides results fast, and user data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.

Streamline diarization using AI as an assistive technology: ZOO Digital’s story

ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world’s best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance.

Typical localization workflows require manual speaker diarization, wherein an audio stream is segmented based on the identity of the speaker. This time-consuming process must be completed before content can be dubbed into another language. With manual methods, a 30-minute episode can take 1 to 3 hours to localize. Through automation, ZOO Digital aims to achieve localization in under 30 minutes.

In this post, we discuss deploying scalable machine learning (ML) models for diarizing media content using Amazon SageMaker, with a focus on the WhisperX model.

Background

ZOO Digital’s vision is to provide a faster turnaround of localized content. This goal is bottlenecked by the labor-intensive nature of the exercise, compounded by the small workforce of skilled people who can localize content manually. ZOO Digital works with over 11,000 freelancers and localized over 600 million words in 2022 alone. However, the supply of skilled people is being outstripped by the increasing demand for content, requiring automation to assist with localization workflows.

With an aim to accelerate the localization of content workflows through machine learning, ZOO Digital engaged AWS Prototyping, an investment program by AWS to co-build workloads with customers. The engagement focused on delivering a functional solution for the localization process, while providing hands-on training to ZOO Digital developers on SageMaker, Amazon Transcribe, and Amazon Translate.

Customer challenge

After a title (a movie or an episode of a TV series) has been transcribed, a speaker must be assigned to each segment of speech so that each part can be matched to the voice artist cast to play that character. This process is called speaker diarization. ZOO Digital faces the challenge of diarizing content at scale while remaining economically viable.

Solution overview

In this prototype, we stored the original media files in a specified Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket was configured to emit an event when new files are detected within it, triggering an AWS Lambda function. For instructions on configuring this trigger, refer to the tutorial Using an Amazon S3 trigger to invoke a Lambda function. Subsequently, the Lambda function invoked the SageMaker endpoint for inference using the Boto3 SageMaker Runtime client.
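The following is a minimal sketch of such a Lambda handler (the endpoint name here is a placeholder, and the prototype’s actual function may differ); it extracts the new object’s location from the S3 event and submits it to the asynchronous endpoint:

import urllib.parse

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "whisperx-async-endpoint"  # placeholder endpoint name


def lambda_handler(event, context):
    # Pull the bucket and key of the newly uploaded media file from the S3 event notification
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Asynchronous endpoints accept an S3 input location rather than an inline payload
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName=ENDPOINT_NAME,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/octet-stream",
    )

    # The inference output is written to the S3 output path configured on the endpoint
    return {"outputLocation": response["OutputLocation"]}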

The WhisperX model, based on OpenAI’s Whisper, performs transcriptions and diarization for media assets. It’s built upon the Faster Whisper reimplementation, offering up to four times faster transcription with improved word-level timestamp alignment compared to Whisper. Additionally, it introduces speaker diarization, not present in the original Whisper model. WhisperX utilizes the Whisper model for transcriptions, the Wav2Vec2 model to enhance timestamp alignment (ensuring synchronization of transcribed text with audio timestamps), and the pyannote model for diarization. FFmpeg is used for loading audio from source media, supporting various media formats. The transparent and modular model architecture allows flexibility, because each component of the model can be swapped out as needed in the future. However, it’s essential to note that WhisperX lacks full management features and isn’t an enterprise-level product. Without maintenance and support, it may not be suitable for production deployment.

In this collaboration, we deployed and evaluated WhisperX on SageMaker, using an asynchronous inference endpoint to host the model. SageMaker asynchronous endpoints support upload sizes up to 1 GB and incorporate auto scaling features that efficiently mitigate traffic spikes and save costs during off-peak times. Asynchronous endpoints are particularly well-suited for processing large files, such as movies and TV series in our use case.
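As an illustration of how that auto scaling can be configured (a sketch only, with a placeholder endpoint name and illustrative thresholds rather than the prototype’s actual values), a target-tracking policy on the backlog size per instance lets the endpoint scale in to zero instances when idle:

import boto3

autoscaling = boto3.client("application-autoscaling")
endpoint_name = "whisperx-async-endpoint"  # placeholder endpoint name
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target; MinCapacity=0 allows scale-to-zero
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Scale based on the number of queued asynchronous requests per instance
autoscaling.put_scaling_policy(
    PolicyName="whisperx-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)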

The following diagram illustrates the core elements of the experiments we conducted in this collaboration.

In the following sections, we delve into the details of deploying the WhisperX model on SageMaker, and evaluate the diarization performance.

Download the model and its components

WhisperX is a system that includes multiple models for transcription, forced alignment, and diarization. For smooth SageMaker operation without the need to fetch model artifacts during inference, it’s essential to pre-download all model artifacts. These artifacts are then loaded into the SageMaker serving container during initialization. Because these models aren’t directly accessible, we offer descriptions and sample code from the WhisperX source, providing instructions on downloading the model and its components.

WhisperX uses six models:

  • The Faster Whisper transcription model (large-v2)
  • The voice activity detection (VAD) segmentation model
  • The Wav2Vec2 alignment model
  • The pyannote speaker diarization pipeline
  • The speaker embedding model (speechbrain spkrec-ecapa-voxceleb) used by the diarization pipeline
  • The pyannote segmentation model used by the diarization pipeline

Most of these models can be obtained from Hugging Face using the huggingface_hub library. We use the following download_hf_model() function to retrieve these model artifacts. An access token from Hugging Face, generated after accepting the user agreements for the pyannote segmentation and speaker diarization models, is required:

import huggingface_hub
import yaml
import torchaudio
import urllib.request
import os

CONTAINER_MODEL_DIR = "/opt/ml/model"
WHISPERX_MODEL = "guillaumekln/faster-whisper-large-v2"
VAD_MODEL_URL = "https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin"
WAV2VEC2_MODEL = "WAV2VEC2_ASR_BASE_960H"
DIARIZATION_MODEL = "pyannote/speaker-diarization"

def download_hf_model(model_name: str, hf_token: str, local_model_dir: str) -> str:
    """
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to
    :param model_name: HuggingFace model name (and an optional version, appended with @[version])
    :param hf_token: HuggingFace access token authorized to access the requested model
    :param local_model_dir: The local directory to download the model to
    :return: The subdirectory within local_model_dir that the model is downloaded to
    """
    model_subdir = model_name.split('@')[0]
    huggingface_hub.snapshot_download(model_subdir, token=hf_token, local_dir=f"{local_model_dir}/{model_subdir}", local_dir_use_symlinks=False)
    return model_subdir

The VAD model is fetched from Amazon S3, and the Wav2Vec2 model is retrieved from the torchaudio.pipelines module. Based on the following code, we can retrieve all the models’ artifacts, including those from Hugging Face, and save them to the specified local model directory:

def fetch_models(hf_token: str, local_model_dir="./models"):
    """
    Fetches all required models to run WhisperX locally without downloading models every time 
    :param hf_token: A huggingface access token to download the models
    :param local_model_dir: The directory to download the models to
    """
    # Fetch Faster Whisper's Large V2 model from HuggingFace
    download_hf_model(model_name=WHISPERX_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Fetch WhisperX's VAD Segmentation model from S3
    vad_model_dir = "whisperx/vad"
    if not os.path.exists(f"{local_model_dir}/{vad_model_dir}"):
        os.makedirs(f"{local_model_dir}/{vad_model_dir}")

    urllib.request.urlretrieve(VAD_MODEL_URL, f"{local_model_dir}/{vad_model_dir}/pytorch_model.bin")

    # Fetch the Wav2Vec2 alignment model
    torchaudio.pipelines.__dict__[WAV2VEC2_MODEL].get_model(dl_kwargs={"model_dir": f"{local_model_dir}/wav2vec2/"})

    # Fetch pyannote's Speaker Diarization model from HuggingFace
    download_hf_model(model_name=DIARIZATION_MODEL,
                      hf_token=hf_token,
                      local_model_dir=local_model_dir)

    # Read in the Speaker Diarization model config to fetch models and update with their local paths
    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'r') as file:
        diarization_config = yaml.safe_load(file)

    embedding_model = diarization_config['pipeline']['params']['embedding']
    embedding_model_dir = download_hf_model(model_name=embedding_model,
                                            hf_token=hf_token,
                                            local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['embedding'] = f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}"

    segmentation_model = diarization_config['pipeline']['params']['segmentation']
    segmentation_model_dir = download_hf_model(model_name=segmentation_model,
                                               hf_token=hf_token,
                                               local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['segmentation'] = f"{CONTAINER_MODEL_DIR}/{segmentation_model_dir}/pytorch_model.bin"

    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'w') as file:
        yaml.safe_dump(diarization_config, file)

    # Read in the Speaker Embedding model config to update it with its local path
    speechbrain_hyperparams_path = f"{local_model_dir}/{embedding_model_dir}/hyperparams.yaml"
    with open(speechbrain_hyperparams_path, 'r') as file:
        speechbrain_hyperparams = file.read()

    speechbrain_hyperparams = speechbrain_hyperparams.replace(embedding_model_dir, f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}")

    with open(speechbrain_hyperparams_path, 'w') as file:
        file.write(speechbrain_hyperparams)

Select the appropriate AWS Deep Learning Container for serving the model

After the model artifacts are saved using the preceding sample code, you can choose pre-built AWS Deep Learning Containers (DLCs) from the following GitHub repo. When selecting the Docker image, consider the following settings: framework (Hugging Face), task (inference), Python version, and hardware (for example, GPU). We recommend the following image, which comes with the necessary system packages (such as ffmpeg) pre-installed: 763104351884.dkr.ecr.[REGION].amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04. Remember to replace [REGION] with the AWS Region you are using.

For other required Python packages, create a requirements.txt file with a list of packages and their versions. These packages will be installed when the AWS DLC is built. The following are the additional packages needed to host the WhisperX model on SageMaker:

faster-whisper==0.7.1 
git+https://github.com/m-bain/whisperx.git@1b092de19a1878a8f138f665b1467ca21b076e7e 
ffmpeg-python

Create an inference script to load the models and run inference

Next, we create a custom inference.py script to outline how the WhisperX model and its components are loaded into the container and how the inference process should be run. The script contains two functions: model_fn and transform_fn. The model_fn function is invoked to load the models from their respective locations. Subsequently, these models are passed to the transform_fn function during inference, where transcription, alignment, and diarization processes are performed. The following is a code sample for inference.py:

import io
import json
import logging
import tempfile
import time

import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def model_fn(model_dir: str) -> dict:
    """
    Deserialize and return the models
    """
    logging.info("Loading WhisperX model")
    model = whisperx.load_model(whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
                                device=DEVICE,
                                language="en",
                                compute_type="float16",
                                vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})

    logging.info("Loading alignment model")
    align_model, metadata = whisperx.load_align_model(language_code="en",
                                                      device=DEVICE,
                                                      model_name="WAV2VEC2_ASR_BASE_960H",
                                                      model_dir=f"{model_dir}/wav2vec2")

    logging.info("Loading diarization model")
    diarization_model = whisperx.DiarizationPipeline(model_name=f"{model_dir}/pyannote/speaker-diarization/config.yaml",
                                                     device=DEVICE)

    return {
        'model': model,
        'align_model': align_model,
        'metadata': metadata,
        'diarization_model': diarization_model
    }

def transform_fn(model: dict, request_body: bytes, request_content_type: str, response_content_type="application/json") -> (str, str):
    """
    Load in audio from the request, transcribe and diarize, and return JSON output
    """

    # Start a timer so that we can log how long inference takes
    start_time = time.time()

    # Unpack the models
    whisperx_model = model['model']
    align_model = model['align_model']
    metadata = model['metadata']
    diarization_model = model['diarization_model']

    # Load the media file (the request_body as bytes) into a temporary file, then use WhisperX to load the audio from it
    logging.info("Loading audio")
    with io.BytesIO(request_body) as file:
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(file.read())
        audio = whisperx.load_audio(tfile.name)

    # Run transcription
    logging.info("Transcribing audio")
    result = whisperx_model.transcribe(audio, batch_size=16)

    # Align the outputs for better timings
    logging.info("Aligning outputs")
    result = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE, return_char_alignments=False)

    # Run diarization
    logging.info("Running diarization")
    diarize_segments = diarization_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Calculate the time it took to perform the transcription and diarization
    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f"Transcription and Diarization took {int(elapsed_time)} seconds")

    # Return the results to be stored in S3
    return json.dumps(result), response_content_type

Within the model’s directory, ensure that inference.py and the requirements.txt file are present in a code subdirectory. The models directory should resemble the following:

models
├── code
│   ├── inference.py
│   └── requirements.txt
├── guillaumekln
│   └── faster-whisper-large-v2
├── pyannote
│   ├── segmentation
│   │   └── ...
│   └── speaker-diarization
│       └── ...
├── speechbrain
│   └── spkrec-ecapa-voxceleb
│       └── ...
├── wav2vec2
│   └── ...
└── whisperx
    └── vad
        └── ...

Create a tarball of the models

After you create the models and code directories, you can use the following command lines to compress the model into a tarball (.tar.gz file) and upload it to Amazon S3. At the time of writing, using the faster-whisper Large V2 model, the resulting tarball representing the SageMaker model is 3 GB in size. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.

# Create a tarball from the 'models' directory, which contains the code subdirectory and model artifacts
tar cvzf model.tar.gz -C models/ .
# Upload the model to S3
aws s3 cp model.tar.gz s3://<target_bucket> 

Create a SageMaker model and deploy an endpoint with an asynchronous predictor

Now you can create the SageMaker model, endpoint config, and asynchronous endpoint with AsyncPredictor using the model tarball created in the previous step. For instructions, refer to Create an Asynchronous Inference Endpoint.
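The following is a minimal sketch of this step with the SageMaker Python SDK (the instance type, bucket paths, and output location are placeholders; refer to the linked instructions for the full procedure):

import sagemaker
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.model import Model

role = sagemaker.get_execution_role()

whisperx_model = Model(
    image_uri="<huggingface-pytorch-inference DLC URI>",  # the DLC selected earlier
    model_data="s3://<target_bucket>/model.tar.gz",       # the tarball uploaded earlier
    role=role,
)

# deploy() returns an AsyncPredictor when an async_inference_config is supplied
async_predictor = whisperx_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # placeholder GPU instance type
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://<target_bucket>/async-output/",
    ),
)

# Submit a media file that is already in S3; results are written to output_path
response = async_predictor.predict_async(
    input_path="s3://<target_bucket>/media/episode1.mp4"
)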

Evaluate diarization performance

To assess the diarization performance of the WhisperX model in various scenarios, we selected three episodes each from two English titles: one drama title consisting of 30-minute episodes, and one documentary title consisting of 45-minute episodes. We utilized pyannote’s metrics toolkit, pyannote.metrics, to calculate the diarization error rate (DER). In the evaluation, manually transcribed and diarized transcripts provided by ZOO served as the ground truth.

We defined the DER as follows:
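
DER = (FA + Miss + Error) / Total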

Total is the length of the ground truth video. FA (False Alarm) is the length of segments that are considered as speech in predictions, but not in ground truth. Miss is the length of segments that are considered as speech in ground truth, but not in prediction. Error, also called Confusion, is the length of segments that are assigned to different speakers in prediction and ground truth. All the units are measured in seconds. The typical values for DER can vary depending on the specific application, dataset, and the quality of the diarization system. Note that DER can be larger than 1.0. A lower DER is better.

To be able to calculate the DER for a piece of media, a ground truth diarization is required as well as the WhisperX transcribed and diarized outputs. These must be parsed and result in lists of tuples containing a speaker label, speech segment start time, and speech segment end time for each segment of speech in the media. The speaker labels don’t need to match between the WhisperX and ground truth diarizations. The results are based mostly on the time of the segments. pyannote.metrics takes these tuples of ground truth diarizations and output diarizations (referred to in the pyannote.metrics documentation as reference and hypothesis) to calculate the DER. The following table summarizes our results.
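The following is a minimal sketch of this computation with pyannote.metrics, using made-up segments purely for illustration:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground truth (reference) diarization: one labeled speaker per speech segment
reference = Annotation()
reference[Segment(0.0, 12.5)] = "SPEAKER_A"
reference[Segment(13.0, 20.0)] = "SPEAKER_B"

# WhisperX output (hypothesis); speaker labels don't need to match the reference
hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "spk_0"
hypothesis[Segment(12.0, 20.0)] = "spk_1"

# DER aggregates false alarm, missed speech, and speaker confusion over the total speech time
metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.3f}")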

Video Type     DER    Correct   Miss     Error    False Alarm
Drama          0.738  44.80%    21.80%   33.30%   18.70%
Documentary    1.29   94.50%    5.30%    0.20%    123.40%
Average        0.901  71.40%    13.50%   15.10%   61.50%

These results reveal a significant performance difference between the drama and documentary titles, with the model achieving notably better results (using DER as an aggregate metric) for the drama episodes compared to the documentary title. A closer analysis of the titles provides insights into potential factors contributing to this performance gap. One key factor could be the frequent presence of background music overlapping with speech in the documentary title. Although preprocessing media to enhance diarization accuracy, such as removing background noise to isolate speech, was beyond the scope of this prototype, it opens avenues for future work that could potentially enhance the performance of WhisperX.

Conclusion

In this post, we explored the collaborative partnership between AWS and ZOO Digital, employing machine learning techniques with SageMaker and the WhisperX model to enhance the diarization workflow. The AWS team played a pivotal role in assisting ZOO in prototyping, evaluating, and understanding the effective deployment of custom ML models, specifically designed for diarization. This included incorporating auto scaling for scalability using SageMaker.

Harnessing AI for diarization will lead to substantial savings in both cost and time when generating localized content for ZOO. By aiding transcribers in swiftly and precisely identifying and labeling speakers, this technology addresses the traditionally time-consuming and error-prone nature of the task. The conventional process often involves multiple passes through the video and additional quality control steps to minimize errors. The adoption of AI for diarization enables a more targeted and efficient approach, thereby increasing productivity within a shorter timeframe.

We’ve outlined key steps to deploy the WhisperX model on the SageMaker asynchronous endpoint, and encourage you to try it yourself using the provided code. For further insights into ZOO Digital’s services and technology, visit ZOO Digital’s official site. For details on deploying the OpenAI Whisper model on SageMaker and various inference options, refer to Host the Whisper Model on Amazon SageMaker: exploring inference options. Feel free to share your thoughts in the comments.


About the Authors

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest encompass Deep Learning, with a focus on GenAI, Computer Vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.

Ethan Cumberland is an AI Research Engineer at ZOO Digital, where he works on using AI and Machine Learning as assistive technologies to improve workflows in speech, language, and localisation. He has a background in software engineering and research in the security and policing domain, focusing on extracting structured information from the web and leveraging open-source ML models for analysing and enriching collected data.

Gaurav Kaila leads the AWS Prototyping team for UK & Ireland. His team works with customers across diverse industries to ideate & co-develop business critical workloads with a mandate to accelerate adoption of AWS services.

Keyframer: Empowering Animation Design using Large Language Models

Large language models (LLMs) have the potential to impact a wide range of creative domains, as exemplified in popular text-to-image generators like DALL·E and Midjourney. However, the application of LLMs to motion-based visual design has not yet been explored and presents novel challenges, such as how users might effectively describe motion in natural language. Further, many existing generative design tools lack support for iterative refinement of designs beyond prompt engineering. In this paper, we present Keyframer, a design tool that leverages the code generation capabilities of LLMs to… (Apple Machine Learning Research)

Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints

Amazon SageMaker multi-model endpoints (MMEs) are a fully managed capability of SageMaker inference that allows you to deploy thousands of models on a single endpoint. Previously, MMEs allocated a fixed amount of CPU computing power to each model statically, regardless of the model’s traffic load, using Multi Model Server (MMS) as the model server. In this post, we discuss a solution in which an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This solution enables you to use the underlying compute of MMEs more efficiently and save costs.

MMEs dynamically load and unload models based on incoming traffic to the endpoint. When utilizing MMS as the model server, MMEs allocate a fixed number of model workers for each model. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.

However, this can lead to a few issues when your traffic pattern is variable. Let’s say you have a single model, or a few models, receiving a large amount of traffic. You can configure MMS to allocate a high number of workers for these models, but because this is a static configuration, it gets applied to every model behind the MME. This leads to a large number of workers consuming hardware compute, even for the idle models. The opposite problem occurs if you set a small number of workers: the popular models don’t have enough workers at the model server level, so not enough hardware is allocated behind the endpoint for them. The main issue is that it’s difficult to remain traffic pattern agnostic if you can’t dynamically scale your workers at the model server level to allocate the necessary amount of compute.

The solution we discuss in this post uses DJLServing as the model server, which can help mitigate some of these issues by enabling per-model scaling and allowing MMEs to be traffic pattern agnostic.

MME architecture

SageMaker MMEs enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. Each instance is designed to load and serve multiple models up to its memory and CPU/GPU capacity. With this architecture, a software as a service (SaaS) business can break the linearly increasing cost of hosting multiple models and achieve reuse of infrastructure consistent with the multi-tenancy model applied elsewhere in the application stack. The following diagram illustrates this architecture.

A SageMaker MME dynamically loads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It’s automatically loaded on demand, whereas frequently accessed models are retained in memory and invoked with consistently low latency.

Behind each MME are model hosting instances, as depicted in the following diagram. These instances load and evict multiple models to and from memory based on the traffic patterns to the models.

SageMaker continues to route inference requests for a model to the instance where the model is already loaded such that the requests are served from a cached model copy (see the following diagram, which shows the request path for the first prediction request vs. the cached prediction request path). However, if the model receives many invocation requests, and there are additional instances for the MME, SageMaker routes some requests to another instance to accommodate the increase. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet.
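For example, a target-tracking policy on the predefined invocations-per-instance metric can be registered as follows (a sketch only; the endpoint and variant names are placeholders and the target value is illustrative):

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/<mme-endpoint-name>/variant/<variant-name>"  # placeholders

# Register the MME's production variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=20,
)

# Add instances when the average invocations per instance exceed the target value
autoscaling.put_scaling_policy(
    PolicyName="mme-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # illustrative invocations per minute per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)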

Model server overview

A model server is a software component that provides a runtime environment for deploying and serving machine learning (ML) models. It acts as an interface between the trained models and client applications that want to make predictions using those models.

The primary purpose of a model server is to allow effortless integration and efficient deployment of ML models into production systems. Instead of embedding the model directly into an application or a specific framework, the model server provides a centralized platform where multiple models can be deployed, managed, and served.

Model servers typically offer the following functionalities:

  • Model loading – The server loads the trained ML models into memory, making them ready for serving predictions.
  • Inference API – The server exposes an API that allows client applications to send input data and receive predictions from the deployed models.
  • Scaling – Model servers are designed to handle concurrent requests from multiple clients. They provide mechanisms for parallel processing and managing resources efficiently to ensure high throughput and low latency.
  • Integration with backend engines – Model servers have integrations with backend frameworks like DeepSpeed and FasterTransformer to partition large models and run highly optimized inference.

DJL architecture

DJL Serving is an open source, high performance, universal model server. DJL Serving is built on top of DJL, a deep learning library written in the Java programming language. It can take a deep learning model, several models, or workflows and make them available through an HTTP endpoint. DJL Serving supports deploying models from multiple frameworks like PyTorch, TensorFlow, Apache MXNet, ONNX, TensorRT, Hugging Face Transformers, DeepSpeed, FasterTransformer, and more.

DJL Serving offers many features that allow you to deploy your models with high performance:

  • Ease of use – DJL Serving can serve most models out of the box. Just bring the model artifacts, and DJL Serving can host them.
  • Multiple device and accelerator support – DJL Serving supports deploying models on CPU, GPU, and AWS Inferentia.
  • Performance – DJL Serving runs multithreaded inference in a single JVM to boost throughput.
  • Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
  • Auto scaling – DJL Serving will automatically scale workers up and down based on the traffic load.
  • Multi-engine support – DJL Serving can simultaneously host models using different frameworks (such as PyTorch and TensorFlow).
  • Ensemble and workflow models – DJL Serving supports deploying complex workflows composed of multiple models, and runs parts of the workflow on CPU and parts on GPU. Models within a workflow can use different frameworks.

In particular, the auto scaling feature of DJL Serving makes it straightforward to ensure the models are scaled appropriately for the incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (CPU cores, GPU devices). You can set lower and upper bounds for each model to make sure that a minimum traffic level can always be served, and that a single model doesn’t consume all available resources.

DJL Serving uses a Netty frontend on top of backend worker thread pools. The frontend uses a single Netty setup with multiple HttpRequestHandlers. Different request handlers will provide support for the Inference API, Management API, or other APIs available from various plugins.

The backend is based around the WorkLoadManager (WLM) module. The WLM takes care of multiple worker threads for each model along with the batching and request routing to them. When multiple models are served, WLM checks the inference request queue size of each model first. If the queue size is greater than two times a model’s batch size, WLM scales up the number of workers assigned to that model.

Solution overview

The implementation of DJL with an MME differs from the default MMS setup. For DJL Serving with an MME, we compress the following files in the model.tar.gz format that SageMaker Inference is expecting:

  • model.joblib – For this implementation, we directly package the serialized model artifact into the tarball. In this case, we are working with a .joblib file, so we provide that file in our tarball for our inference script to read. If the artifact is too large, you can also push it to Amazon S3 and point towards that in the serving configuration you define for DJL.
  • serving.properties – Here you can configure any model server-related environment variables. The power of DJL here is that you can configure minWorkers and maxWorkers for each model tarball. This allows for each model to scale up and down at the model server level. For instance, if a singular model is receiving the majority of the traffic for an MME, the model server will scale the workers up dynamically. In this example, we don’t configure these variables and let DJL determine the necessary number of workers depending on our traffic pattern.
  • model.py – This is the inference script for any custom preprocessing or postprocessing you would like to implement. The model.py expects your logic to be encapsulated in a handle method by default.
  • requirements.txt (optional) – By default, DJL comes installed with PyTorch, but any additional dependencies you need can be pushed here.

For this example, we showcase the power of DJL with an MME by taking a sample SKLearn model. We run a training job with this model and then create 1,000 copies of this model artifact to back our MME. We then showcase how DJL can dynamically scale to handle any type of traffic pattern that your MME may receive. This can include an even distribution of traffic across all models or even a few popular models receiving the majority of the traffic. You can find all the code in the following GitHub repo.

Prerequisites

For this example, we use a SageMaker notebook instance with a conda_python3 kernel and ml.c5.xlarge instance. To perform the load tests, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or a larger SageMaker notebook instance. In this example, we scale to over a thousand transactions per second (TPS), so we suggest testing on a heavier EC2 instance such as an ml.c5.18xlarge so that you have more compute to work with.

Create a model artifact

We first need to create our model artifact and data that we use in this example. For this case, we generate some artificial data with NumPy and train using an SKLearn linear regression model with the following code snippet:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import joblib

# Generate dummy data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

# Create serialized model artifact
model_filename = "model.joblib"
joblib.dump(model, model_filename)

After you run the preceding code, you should have a model.joblib file created in your local environment.

Pull the DJL Docker image

The Docker image djl-inference:0.23.0-cpu-full-v1.0 is our DJL serving container used in this example. You can adjust the following URL depending on your Region:

inference_image_uri = "474422712127.dkr.ecr.us-east-1.amazonaws.com/djl-serving-cpu:latest"

Optionally, you can also use this image as a base image and extend it to build your own Docker image on Amazon Elastic Container Registry (Amazon ECR) with any other dependencies you need.

Create the model file

First, we create a file called serving.properties. This instructs DJL Serving to use the Python engine. We also set the max_idle_time of a worker to 600 seconds, so that the number of workers per model is scaled down less aggressively. We don’t set the minWorkers and maxWorkers options; instead, we let DJL dynamically compute the number of workers needed based on the traffic each model receives. The serving.properties file is shown as follows. To see the complete list of configuration options, refer to Engine Configuration.

engine=Python
max_idle_time=600

Next, we create our model.py file, which defines the model loading and inference logic. For MMEs, each model.py file is specific to a model. Models are stored in their own paths under the model store (usually /opt/ml/model/), and each model is loaded from its own directory under that path. The full model.py example in this demo can be seen in the GitHub repo.
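As a sketch of what such a handler can look like for the SKLearn model in this example (assuming a JSON request body and that model.joblib sits next to model.py in the extracted model directory), consider the following:

import os

import joblib
import numpy as np
from djl_python import Input, Output

# Resolve the artifact path relative to this script inside the extracted model directory
MODEL_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "model.joblib")
model = None


def handle(inputs: Input) -> Output:
    """Entry point invoked by the DJL Serving Python engine for each request."""
    global model
    if model is None:
        # Lazily load the artifact the first time this model worker is invoked
        model = joblib.load(MODEL_PATH)

    if inputs.is_empty():
        # DJL Serving sends an empty request at model load time as a warm-up
        return None

    # Expect a JSON body such as {"inputs": [[0.5], [0.8]]}
    data = np.array(inputs.get_as_json()["inputs"])
    predictions = model.predict(data)
    return Output().add_as_json({"predictions": predictions.tolist()})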

We create a model.tar.gz file that includes our model (model.joblib), model.py, and serving.properties:

#Build tar file with model data + inference code, replace this cell with your model.joblib
bashCommand = "tar -cvpzf model.tar.gz model.joblib requirements.txt model.py serving.properties"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

For demonstration purposes, we make 1,000 copies of the same model.tar.gz file to represent the large number of models to be hosted. In production, you need to create a model.tar.gz file for each of your models.

Lastly, we upload these models to Amazon S3.
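The following sketch shows one way to produce and upload those copies (the bucket and prefix are placeholders; the object names follow the sklearn-<n>.tar.gz convention used by the load test scripts later in this post):

import boto3

s3 = boto3.client("s3")
bucket = "<target_bucket>"  # placeholder bucket name
prefix = "mme-artifacts"    # placeholder prefix

# Upload the artifact once, then copy it server-side under 1,000 different keys
s3.upload_file("model.tar.gz", bucket, f"{prefix}/sklearn-0.tar.gz")
for i in range(1, 1000):
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": f"{prefix}/sklearn-0.tar.gz"},
        Key=f"{prefix}/sklearn-{i}.tar.gz",
    )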

Create a SageMaker model

We now create a SageMaker model. We use the ECR image defined earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure Mode as MultiModel. This tells DJLServing that we’re creating an MME.

mme_model_name = "sklearn-djl-mme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + mme_model_name)

create_model_response = sm_client.create_model(
    ModelName=mme_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Mode": "MultiModel",
        "ModelDataUrl": mme_artifacts,
    },
)

Create a SageMaker endpoint

In this demo, we use 20 ml.c5d.18xlarge instances to scale to a TPS in the thousands range. Make sure to get a limit increase on your instance type, if necessary, to achieve the TPS you are targeting.

mme_epc_name = "sklearn-djl-mme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=mme_epc_name,
    ProductionVariants=[
        {
            "VariantName": "sklearnvariant",
            "ModelName": mme_model_name,
            "InstanceType": "ml.c5d.18xlarge",
            "InitialInstanceCount": 20,
        },
    ],
)
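The endpoint itself is then created from this configuration; a minimal sketch follows:

mme_endpoint_name = "sklearn-djl-mme-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=mme_endpoint_name,
    EndpointConfigName=mme_epc_name,
)

# Wait until the endpoint is in service before running the load tests
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=mme_endpoint_name)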

Load testing

At the time of writing, the SageMaker in-house load testing tool Amazon SageMaker Inference Recommender doesn’t natively support testing for MMEs. Therefore, we use the open source Python tool Locust. Locust is straightforward to set up and can track metrics such as TPS and end-to-end latency. For a full understanding of how to set it up with SageMaker, see Best practices for load testing Amazon SageMaker real-time inference endpoints.

In this use case, we have three different traffic patterns we want to simulate with MMEs, so we have the following three Python scripts that align with each pattern. Our goal here is to prove that, regardless of what our traffic pattern is, we can achieve the same target TPS and scale appropriately.

We can specify a weight in our Locust script to assign traffic across different portions of our models. For instance, with our single hot model, we implement two methods as follows:

# popular model
def sendPopular(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()
    try:
        # All traffic from this method targets the single popular model
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type,
            TargetModel="sklearn-0.tar.gz",
        )
        response_body = response["Body"].read()
    except Exception as e:
        # Record the failure so Locust reports it in the test statistics
        request_meta["exception"] = e
    # Report timing and any error to Locust via the standard request event
    request_meta["response_time"] = (time.perf_counter() - start_perf_counter) * 1000
    events.request.fire(**request_meta)

# rest of the models
def sendRest(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()
    try:
        # Spread the remaining traffic randomly across the other models
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type,
            TargetModel=f"sklearn-{random.randint(1, 989)}.tar.gz",
        )
        response_body = response["Body"].read()
    except Exception as e:
        request_meta["exception"] = e
    request_meta["response_time"] = (time.perf_counter() - start_perf_counter) * 1000
    events.request.fire(**request_meta)

We can then assign a certain weight to each method, which is when a certain method receives a specific percentage of the traffic:

# assign weights to models
class MyUser(BotoUser):

    # 90% of traffic to singular model
    @task(9)
    def send_request(self):
        self.client.sendPopular()

    @task
    def send_request_major(self):
        self.client.sendRest()

For 20 ml.c5d.18xlarge instances, we see the following invocation metrics on the Amazon CloudWatch console. These values remain fairly consistent across all three traffic patterns. To understand CloudWatch metrics for SageMaker real-time inference and MMEs better, refer to SageMaker Endpoint Invocation Metrics.

You can find the rest of the Locust scripts in the locust-utils directory in the GitHub repository.

Summary

In this post, we discussed how an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This newly launched feature is available in all AWS Regions where SageMaker is available. Note that at the time of announcement, only CPU instances are supported. To learn more, refer to Supported algorithms, frameworks, and instances.


About the Authors

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers world-wide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Use Amazon Titan models for image generation, editing, and searching

Amazon Bedrock provides a broad range of high-performing foundation models from Amazon and other leading AI companies, including Anthropic, AI21, Meta, Cohere, and Stability AI, and covers a wide range of use cases, including text and image generation, searching, chat, reasoning and acting agents, and more. The new Amazon Titan Image Generator model allows content creators to quickly generate high-quality, realistic images using simple English text prompts. The advanced AI model understands complex instructions with multiple objects and returns studio-quality images suitable for advertising, ecommerce, and entertainment. Key features include the ability to refine images by iterating on prompts, automatic background editing, and generating multiple variations of the same scene. Creators can also customize the model with their own data to output on-brand images in a specific style. Importantly, Titan Image Generator has in-built safeguards, like invisible watermarks on all AI-generated images, to encourage responsible use and mitigate the spread of disinformation. This innovative technology makes producing custom images in large volume for any industry more accessible and efficient.

The new Amazon Titan Multimodal Embeddings model helps build more accurate search and recommendations by understanding text, images, or both. It converts images and English text into semantic vectors, capturing meaning and relationships in your data. You can combine text and images like product descriptions and photos to identify items more effectively. The vectors power speedy, accurate search experiences. Titan Multimodal Embeddings is flexible in vector dimensions, enabling optimization for performance needs. An asynchronous API and Amazon OpenSearch Service connector make it easy to integrate the model into your neural search applications.
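As a brief illustration (a sketch only; the image path is a placeholder and the request fields follow the Titan Multimodal Embeddings request format), you can generate an embedding for an image and its caption as follows:

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Encode the image to send alongside an optional text description
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

body = json.dumps(
    {
        "inputText": "two dogs walking down an urban street",
        "inputImage": input_image,
        "embeddingConfig": {"outputEmbeddingLength": 1024},  # 256, 384, or 1024
    }
)

response = bedrock_runtime.invoke_model(
    body=body,
    modelId="amazon.titan-embed-image-v1",
    accept="application/json",
    contentType="application/json",
)

# The response contains a single semantic vector for the combined inputs
embedding = json.loads(response.get("body").read())["embedding"]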

In this post, we walk through how to use the Titan Image Generator and Titan Multimodal Embeddings models via the AWS Python SDK.

Image generation and editing

In this section, we demonstrate the basic coding patterns for using the AWS SDK to generate new images and perform AI-powered edits on existing images. Code examples are provided in Python, and JavaScript (Node.js) is also available in this GitHub repository.

Before you can write scripts that use the Amazon Bedrock API, you need to install the appropriate version of the AWS SDK in your environment. For Python scripts, you can use the AWS SDK for Python (Boto3). Python users may also want to install the Pillow module, which facilitates image operations like loading and saving images. For setup instructions, refer to the GitHub repository.

Additionally, enable access to the Amazon Titan Image Generator and Titan Multimodal Embeddings models. For more information, refer to Model access.

Helper functions

The following function sets up the Amazon Bedrock Boto3 runtime client and generates images by taking payloads of different configurations (which we discuss later in this post):

import boto3
import json, base64, io
from random import randint
from PIL import Image

bedrock_runtime_client = boto3.client("bedrock-runtime")


def titan_image(
    payload: dict,
    num_image: int = 2,
    cfg: float = 10.0,
    seed: int = None,
    modelId: str = "amazon.titan-image-generator-v1",
) -> list:
    #   ImageGenerationConfig Options:
    #   - numberOfImages: Number of images to be generated
    #   - quality: Quality of generated images, can be standard or premium
    #   - height: Height of output image(s)
    #   - width: Width of output image(s)
    #   - cfgScale: Scale for classifier-free guidance
    #   - seed: The seed to use for reproducibility
    seed = seed if seed is not None else randint(0, 214783647)
    body = json.dumps(
        {
            **payload,
            "imageGenerationConfig": {
                "numberOfImages": num_image,  # Range: 1 to 5
                "quality": "premium",  # Options: standard/premium
                "height": 1024,  # Supported height list above
                "width": 1024,  # Supported width list above
                "cfgScale": cfg,  # Range: 1.0 (exclusive) to 10.0
                "seed": seed,  # Range: 0 to 214783647
            },
        }
    )

    response = bedrock_runtime_client.invoke_model(
        body=body,
        modelId=modelId,
        accept="application/json",
        contentType="application/json",
    )

    response_body = json.loads(response.get("body").read())
    images = [
        Image.open(io.BytesIO(base64.b64decode(base64_image)))
        for base64_image in response_body.get("images")
    ]
    return images
        

Generate images from text

Scripts that generate a new image from a text prompt follow this implementation pattern:

  1. Configure a text prompt and optional negative text prompt.
  2. Use the BedrockRuntime client to invoke the Titan Image Generator model.
  3. Parse and decode the response.
  4. Save the resulting images to disk.

Text-to-image

The following is a typical image generation script for the Titan Image Generator model:

# Text Variation
# textToImageParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don't want in image
images = titan_image(
    {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "negativeText": "cars",  # Optional
        },
    }
)

This will produce images similar to the following.

[Example output: two generated images of two dogs walking down an urban street]
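To complete the final step of the pattern, you can save the returned PIL images to disk, for example (file names are illustrative):

# Save each generated image to the working directory
for i, img in enumerate(images):
    img.save(f"text_to_image_{i}.png", format="png")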

Image variants

Image variation provides a way to generate subtle variants of an existing image. The following code snippet uses one of the images generated in the previous example to create variant images:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Image Variation
# ImageVariationParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don't want in image
#   images: base64 string representation of the input image, only 1 is supported
images = titan_image(
    {
        "taskType": "IMAGE_VARIATION",
        "imageVariationParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "images": [input_image],  # One image is required
            "negativeText": "cars",  # Optional
        },
    },
)

This will produce images similar to the following.

[Example output: the original image alongside two generated variants of the dogs walking down a street]

Edit an existing image

The Titan Image Generator model allows you to add, remove, or replace elements or areas within an existing image. You specify which area to affect by providing one of the following:

  • Mask image – A mask image is a binary image in which the 0-value pixels represent the area you want to affect and the 255-value pixels represent the area that should remain unchanged.
  • Mask prompt – A mask prompt is a natural language description of the elements you want to affect; the model converts it into a mask using an in-house text-to-segmentation model.

For more information, refer to Prompt Engineering Guidelines.
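If you don't already have a mask image, you can create one programmatically. The following minimal sketch uses Pillow to build a mask that marks a rectangular region for editing; the image size, rectangle coordinates, and output file name are placeholders:

from PIL import Image, ImageDraw

# 255 = keep unchanged, 0 = area the model is allowed to change
mask = Image.new("L", (1024, 1024), 255)
draw = ImageDraw.Draw(mask)
draw.rectangle((100, 400, 500, 900), fill=0)
mask.save("mask.png", format="png")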

Scripts that apply an edit to an image follow this implementation pattern:

  1. Load the image to be edited from disk.
  2. Convert the image to a base64-encoded string.
  3. Configure the mask through one of the following methods:
    1. Load a mask image from disk, encoding it as base64 and setting it as the maskImage parameter.
    2. Set the maskText parameter to a text description of the elements to affect.
  4. Specify the new content to be generated using one of the following options:
    1. To add or replace an element, set the text parameter to a description of the new content.
    2. To remove an element, omit the text parameter completely.
  5. Use the BedrockRuntime client to invoke the Titan Image Generator model.
  6. Parse and decode the response.
  7. Save the resulting images to disk.

Object editing: Inpainting with a mask image

The following is a typical image editing script for the Titan Image Generator model using maskImage. We take one of the images generated earlier and provide a mask image, where 0-value pixels are rendered as black and 255-value pixels as white. We also replace one of the dogs in the image with a cat using a text prompt.

with open("<YOUR_MASK_IMAGE_FILE_PATH>", "rb") as image_file:
    mask_image = base64.b64encode(image_file.read()).decode("utf8")

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_ORIGINAL_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Inpainting
# inPaintingParams Options:
#   text: prompt to guide inpainting
#   negativeText: prompts to guide the model on what you don't want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "text": "a cat",  # Optional
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskImage": mask_image,
        },
    },
    num_image=3,
)

This will produce images similar to the following.

[Example output: the original image, the mask image, and an edited image in which one dog is replaced with a cat]

Object removal: Inpainting with a mask prompt

In another example, we use maskPrompt to specify an object to edit in the image taken from the earlier steps. Because the text prompt is omitted, the object is removed:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskPrompt": "white dog",  # One of "maskImage" or "maskPrompt" is required
        },
    },
)

This will produce images similar to the following.

[Example output: the original image and an edited image with the white dog removed]

Background editing: Outpainting

Outpainting is useful when you want to replace the background of an image. You can also extend the bounds of an image for a zoom-out effect. In the following example script, we use maskPrompt to specify which object to keep; you can also use maskImage. The parameter outPaintingMode specifies whether to allow modification of the pixels inside the mask. If set as DEFAULT, pixels inside of the mask are allowed to be modified so that the reconstructed image will be consistent overall. This option is recommended if the maskImage provided doesn’t represent the object with pixel-level precision. If set as PRECISE, the modification of pixels inside of the mask is prevented. This option is recommended if using a maskPrompt or a maskImage that represents the object with pixel-level precision.

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# OutPaintingParams Options:
#   text: prompt to guide outpainting
#   negativeText: prompts to guide the model on what you don't want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask
#   outPaintingMode: DEFAULT | PRECISE
images = titan_image(
    {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": "forest",  # Required
            "image": input_image,  # Required
            "maskPrompt": "dogs",  # One of "maskImage" or "maskPrompt" is required
            "outPaintingMode": "PRECISE",  # One of "PRECISE" or "DEFAULT"
        },
    },
    num_image=3,
)

This will produce images similar to the following.

[Example output: the original image outpainted with “beach” and “forest” backgrounds]

In addition, the effects of different values for outPaintingMode, with a maskImage that doesn’t outline the object with pixel-level precision, are as follows.

[Example output: the same image with a rough mask over the left dog, outpainted with “forest”, comparing the DEFAULT and PRECISE modes]
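To reproduce this comparison yourself, you can run the same outpainting request twice and change only outPaintingMode, reusing the input_image and mask_image variables from the earlier examples (a sketch; output file names are illustrative):

for mode in ("DEFAULT", "PRECISE"):
    images = titan_image(
        {
            "taskType": "OUTPAINTING",
            "outPaintingParams": {
                "text": "forest",
                "image": input_image,
                "maskImage": mask_image,
                "outPaintingMode": mode,
            },
        },
        num_image=1,
    )
    images[0].save(f"outpainting_{mode.lower()}.png", format="png")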

This section has given you an overview of the operations you can perform with the Titan Image Generator model. Specifically, these scripts demonstrate text-to-image, image variation, inpainting, and outpainting tasks. You can adapt these patterns for your own applications by referring to the parameter details for these task types in the Amazon Titan Image Generator documentation.

Multimodal embedding and searching

You can use the Amazon Titan Multimodal Embeddings model for enterprise tasks such as image search and similarity-based recommendation, and it has built-in mitigation that helps reduce bias in search results. The model offers multiple embedding dimension sizes so you can trade off latency and accuracy for different needs, and it can be customized with a simple API to adapt to your own data while preserving data security and privacy. Amazon Titan Multimodal Embeddings is provided as simple APIs for real-time or asynchronous batch transform search and recommendation applications, and it can be connected to different vector databases, including Amazon OpenSearch Service.

Helper functions

The following function converts an image, and optionally text, into multimodal embeddings:

def titan_multimodal_embedding(
    image_path: str = None,  # maximum 2048 x 2048 pixels
    description: str = None,  # English only and max input tokens 128
    dimension: int = 1024,  # 1,024 (default), 384, 256
    model_id: str = "amazon.titan-embed-image-v1",
):
    payload_body = {}
    embedding_config: dict = {"embeddingConfig": {"outputEmbeddingLength": dimension}}

    # You can specify either text or image or both
    if image_path:
        # Maximum image size supported is 2048 x 2048 pixels
        with open(image_path, "rb") as image_file:
            payload_body["inputImage"] = base64.b64encode(image_file.read()).decode(
                "utf8"
            )
    if description:
        payload_body["inputText"] = description

    assert payload_body, "please provide either an image and/or a text description"
    print("n".join(payload_body.keys()))

    response = bedrock_runtime_client.invoke_model(
        body=json.dumps({**payload_body, **embedding_config}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )

    return json.loads(response.get("body").read())
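For example, you might embed an image together with a short text description; the file name and description below are illustrative:

embedding = titan_multimodal_embedding(
    image_path="red_cotton_tshirt.png",   # hypothetical file name
    description="a red cotton t-shirt",
    dimension=384,
)["embedding"]
print(len(embedding))  # expected to equal the requested dimension (384)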

The following function returns the indices of the stored multimodal embeddings that are most similar to a query embedding. Note that in practice, you would typically use a managed vector database, such as OpenSearch Service; the following example is for illustration purposes:

from scipy.spatial.distance import cdist
import numpy as np

def search(query_emb: np.ndarray, indexes: np.ndarray, top_k: int = 1):
    # Cosine distance between the query embedding and every indexed embedding
    dist = cdist(query_emb, indexes, metric="cosine")
    # Return the indices of the top_k closest embeddings and their distances
    return dist.argsort(axis=-1)[0, :top_k], np.sort(dist, axis=-1)[0, :top_k]
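As a quick sanity check, you can call search on small toy vectors (purely illustrative values):

indexes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
query = np.array([[0.9, 0.1, 0.0]])
idx, dist = search(query, indexes, top_k=1)
print(idx)  # [0] -> the first indexed vector is the closest match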

Synthetic dataset

For illustration purposes, we use Anthropic’s Claude 2.1 model in Amazon Bedrock to randomly generate seven different products, each with three variants, using the following prompt:

Generate a list of 7 items description for an online e-commerce shop, each comes with 3 variants of color or type. All with separate full sentence description.

The following is the list of returned outputs:

1. T-shirt
- A red cotton t-shirt with a crew neck and short sleeves.
- A blue cotton t-shirt with a v-neck and short sleeves.
- A black polyester t-shirt with a scoop neck and cap sleeves.

2. Jeans
- Classic blue relaxed fit denim jeans with a mid-rise waist.
- Black skinny fit denim jeans with a high-rise waist and ripped details at the knees.
- Stonewash straight leg denim jeans with a standard waist and front pockets.

3. Sneakers
- White leather low-top sneakers with an almond toe cap and thick rubber outsole.
- Gray mesh high-top sneakers with neon green laces and a padded ankle collar.
- Tan suede mid-top sneakers with a round toe and ivory rubber cupsole.

4. Backpack
- A purple nylon backpack with padded shoulder straps, front zipper pocket and laptop sleeve.
- A gray canvas backpack with brown leather trims, side water bottle pockets and drawstring top closure.
- A black leather backpack with multiple interior pockets, top carry handle and adjustable padded straps.

5. Smartwatch
- A silver stainless steel smartwatch with heart rate monitor, GPS tracker and sleep analysis.
- A space gray aluminum smartwatch with step counter, phone notifications and calendar syncing.
- A rose gold smartwatch with activity tracking, music controls and customizable watch faces.

6. Coffee maker
- A 12-cup programmable coffee maker in brushed steel with removable water tank and keep warm plate.
- A compact 5-cup single serve coffee maker in matt black with travel mug auto-dispensing feature.
- A retro style stovetop percolator coffee pot in speckled enamel with stay-cool handle and glass knob lid.

7. Yoga mat
- A teal 4mm thick yoga mat made of natural tree rubber with moisture-wicking microfiber top.
- A purple 6mm thick yoga mat made of eco-friendly TPE material with integrated carrying strap.
- A patterned 5mm thick yoga mat made of PVC-free material with towel cover included.

Assign the above response to the variable response_cat. Then we use the Titan Image Generator model to create product images for each item:

import re

def extract_text(input_string):
    pattern = r"- (.*?)($|n)"
    matches = re.findall(pattern, input_string)
    extracted_texts = [match[0] for match in matches]
    return extracted_texts

product_description = extract_text(response_cat)

titles = []
for prompt in product_description:
    images = titan_image(
        {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {
                "text": prompt,  # Required
            },
        },
        num_image=1,
    )
    title = "_".join(prompt.split()[:4]).lower()
    titles.append(title)
    images[0].save(f"{title}.png", format="png")

All the generated images can be found in the appendix at the end of this post.

Multimodal dataset indexing

Use the following code for multimodal dataset indexing:

multimodal_embeddings = []
for image_filename, description in zip(titles, product_description):
    embedding = titan_multimodal_embedding(f"{image_filename}.png", dimension=1024)["embedding"]
    multimodal_embeddings.append(embedding)
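Because this index lives only in memory, you might persist it between runs, for example with NumPy (the file name is arbitrary):

np.save("product_embeddings.npy", np.array(multimodal_embeddings))
# Later: multimodal_embeddings = np.load("product_embeddings.npy")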

Multimodal searching

Use the following code for multimodal searching:

query_prompt = "<YOUR_QUERY_TEXT>"
query_embedding = titan_multimodal_embedding(description=query_prompt, dimension=1024)["embedding"]
# If searching via Image
# query_image_filename = "<YOUR_QUERY_IMAGE>"
# query_emb = titan_multimodal_embedding(image_path=query_image_filename, dimension=1024)["embedding"]
idx_returned, dist = search(np.array(query_embedding)[None], np.array(multimodal_embeddings))
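To map the returned indices back to the generated product images, you can look them up in the titles list created earlier (a small sketch):

for idx in idx_returned:
    print(f"Closest match: {titles[idx]}.png")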

The following are some search results.

[Example results: product images returned for the queries “sneaker”, “white sneaker”, “leather backpack”, and “purple backpack”]

Conclusion

This post introduced the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings models. Titan Image Generator enables you to create custom, high-quality images from text prompts. Key features include iterating on prompts, automatic background editing, and data customization, and safeguards such as invisible watermarks encourage responsible use. Titan Multimodal Embeddings converts text, images, or both into semantic vectors to power accurate search and recommendations.

We then provided Python code samples for these services, demonstrating how to generate images from text prompts and iterate on them; edit existing images by adding, removing, or replacing elements specified by mask images or mask prompts; create multimodal embeddings from text, images, or both; and search for the multimodal embeddings most similar to a query. We also indexed and searched a synthetic e-commerce dataset using Titan Multimodal Embeddings. The aim of this post is to help developers start using these new AI services in their applications, and the code patterns can serve as templates for custom implementations.

All the code is available on the GitHub repository. For more information, refer to the Amazon Bedrock User Guide.


About the Authors

Rohit Mittal is a Principal Product Manager at Amazon AI building multi-modal foundation models. He recently led the launch of the Amazon Titan Image Generator model as part of the Amazon Bedrock service. Experienced in AI/ML, NLP, and search, he is interested in building products that solve customer pain points with innovative technology.

Dr. Ashwin Swaminathan is a Computer Vision and Machine Learning researcher, engineer, and manager with 12+ years of industry experience and 5+ years of academic research experience. He has strong fundamentals and a proven ability to quickly gain knowledge and contribute to new and emerging areas.

Dr. Yusheng Xie is a Principal Applied Scientist at Amazon AGI. His work focuses on building multi-modal foundation models. Before joining AGI, he led various multi-modal AI development efforts at AWS, such as Amazon Titan Image Generator and Amazon Textract Queries.

Dr. Hao Yang is a Principal Applied Scientist at Amazon. His main research interests are object detection and learning with limited annotations. Outside work, Hao enjoys watching films, photography, and outdoor activities.

Dr. Davide Modolo is an Applied Science Manager at Amazon AGI, working on building large multimodal foundational models. Before joining Amazon AGI, he was a manager/lead for 7 years in AWS AI Labs (Amazon Bedrock and Amazon Rekognition). Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.

Dr. Baichuan Sun is currently serving as a Sr. AI/ML Solutions Architect at AWS, focusing on generative AI, and he applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends.

Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML-related services such as SageMaker and Bedrock. He is a SageMaker Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI-powered projects.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. In his role as Senior Product Manager, Kris helps design and build AWS services to power Media & Entertainment, Gaming, and Spatial Computing.


Appendix

In the following sections, we demonstrate challenging sample use cases like text insertion, hands, and reflections to highlight the capabilities of the Titan Image Generator model. We also include the sample output images produced in earlier examples.

Text

The Titan Image Generator model excels at complex workflows like inserting readable text into images. This example demonstrates Titan’s ability to clearly render uppercase and lowercase letters in a consistent style within an image.

Example prompts: “a corgi wearing a baseball cap with text ‘genai’” and “a happy boy giving a thumbs up, wearing a tshirt with text ‘generative AI’”

Hands

The Titan Image Generator model also has the ability to generate detailed AI images. The following images show realistic hands and fingers with visible detail, going beyond more basic AI image generation that may lack such specificity. Notice the precise depiction of the pose and anatomy.

Example prompts: “a person’s hand viewed from above” and “a close look at a person’s hands holding a coffee mug”

Mirror

The images generated by the Titan Image Generator model spatially arrange objects and accurately reflect mirror effects, as demonstrated in the following examples.

Example prompts: “A cute fluffy white cat stands on its hind legs, peering curiously into an ornate golden mirror. In the reflection the cat sees itself” and “beautiful sky lake with reflections on the water”

Synthetic product images

The following are the product images generated earlier in this post for the Titan Multimodal Embeddings model.

[Generated product images: three variants each of the t-shirt, jeans, sneakers, backpack, smartwatch, coffee maker, and yoga mat described earlier]

Build a contextual chatbot application using Knowledge Bases for Amazon Bedrock

Modern chatbots can serve as digital agents, providing a new avenue for delivering 24/7 customer service and support across many industries. Their popularity stems from the ability to respond to customer inquiries in real time and handle multiple queries simultaneously in different languages. Chatbots also offer valuable data-driven insights into customer behavior while scaling effortlessly as the user base grows; therefore, they present a cost-effective solution for engaging customers. Chatbots use the advanced natural language capabilities of large language models (LLMs) to respond to customer questions. They can understand conversational language and respond naturally. However, chatbots that merely answer basic questions have limited utility. To become trusted advisors, chatbots need to provide thoughtful, tailored responses.

One way to enable more contextual conversations is by linking the chatbot to internal knowledge bases and information systems. Integrating proprietary enterprise data from internal knowledge bases enables chatbots to contextualize their responses to each user’s individual needs and interests. For example, a chatbot could suggest products that match a shopper’s preferences and past purchases, explain details in language adapted to the user’s level of expertise, or provide account support by accessing the customer’s specific records. The ability to intelligently incorporate information, understand natural language, and provide customized replies in a conversational flow allows chatbots to deliver real business value across diverse use cases.

The popular architecture pattern of Retrieval Augmented Generation (RAG) is often used to augment user query context and responses. RAG combines the capabilities of LLMs with grounding in facts and real-world knowledge that comes from retrieving relevant texts and passages from a corpus of data. These retrieved texts are then used to inform and ground the output, reducing hallucination and improving relevance.

In this post, we illustrate contextually enhancing a chatbot by using Knowledge Bases for Amazon Bedrock, a fully managed serverless service. The Knowledge Bases for Amazon Bedrock integration allows our chatbot to provide more relevant, personalized responses by linking user queries to related information data points. Internally, Amazon Bedrock uses embeddings stored in a vector database to augment user query context at runtime and enable a managed RAG architecture solution. We use the Amazon letters to shareholders dataset to develop this solution.

Retrieval Augmented Generation

RAG is an approach to natural language generation that incorporates information retrieval into the generation process. RAG architecture involves two key workflows: data preprocessing through ingestion, and text generation using enhanced context.

The data ingestion workflow uses an embedding model to create vector representations of text. Documents are split into manageable chunks, and an embedding is created for each chunk and stored as an index in a vector database. Embeddings are also created for user questions at query time. The text generation workflow takes a question’s embedding vector and uses it to retrieve the most similar document chunks based on vector similarity. It then augments the prompt with these relevant chunks to generate an answer using the LLM. For more details, refer to the Primer on Retrieval Augmented Generation, Embeddings, and Vector Databases section in Preview – Connect Foundation Models to Your Company Data Sources with Agents for Amazon Bedrock.
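Conceptually, the text generation workflow boils down to something like the following sketch; this is pure illustration with assumed embed, vector_index, and llm interfaces, and Knowledge Bases for Amazon Bedrock manages all of it for you:

def answer(question, embed, vector_index, llm, top_k=4):
    # 1. Create an embedding vector for the user question
    question_embedding = embed(question)
    # 2. Retrieve the most similar document chunks by vector similarity
    chunks = vector_index.search(question_embedding, top_k=top_k)
    # 3. Augment the prompt with the retrieved chunks and generate an answer
    prompt = (
        "Answer the question using only the following context:\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    return llm(prompt)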

The following diagram illustrates the high-level RAG architecture.

High level retrieval augmented generation architecture

Although the RAG architecture has many advantages, it involves multiple components, including a database, retrieval mechanism, prompt, and generative model. Managing these interdependent parts can introduce complexities in system development and deployment. The integration of retrieval and generation also requires additional engineering effort and computational resources. Some open source libraries provide wrappers to reduce this overhead; however, changes to libraries can introduce errors and add additional overhead of versioning. Even with open source libraries, significant effort is required to write code, determine optimal chunk size, generate embeddings, and more. This setup work alone can take weeks depending on data volume.

Therefore, a managed solution that handles these undifferentiated tasks could streamline and accelerate the process of implementing and managing RAG applications.

Knowledge Bases for Amazon Bedrock

Knowledge Bases for Amazon Bedrock is a serverless option to build powerful conversational AI systems using RAG. It offers fully managed data ingestion and text generation workflows.

For data ingestion, it handles creating, storing, managing, and updating text embeddings of document data in the vector database automatically. It splits the documents into manageable chunks for efficient retrieval. The chunks are then converted to embeddings and written to a vector index, while allowing you to see the source documents when answering a question.

For text generation, Amazon Bedrock provides the RetrieveAndGenerate API to create embeddings of user queries, and retrieves relevant chunks from the vector database to generate accurate responses. It also supports source attribution and short-term memory needed for RAG applications.
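For illustration, a direct call to this API from Python might look like the following sketch; the knowledge base ID and model ARN are placeholders, and you should confirm the exact request shape in the Amazon Bedrock documentation:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

rag_configuration = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "<YOUR_KNOWLEDGE_BASE_ID>",
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-instant-v1",
    },
}

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is AWS year-over-year revenue in 2022?"},
    retrieveAndGenerateConfiguration=rag_configuration,
)
print(response["output"]["text"])  # generated answer grounded in the retrieved chunks
print(response["sessionId"])       # reuse this ID to continue the conversation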

This enables you to focus on your core business applications and removes the undifferentiated heavy lifting.

Solution overview

The solution presented in this post uses a chatbot created using a Streamlit application and includes the following AWS services: Amazon Bedrock (with Knowledge Bases for Amazon Bedrock), AWS Lambda, Amazon S3, Amazon OpenSearch Serverless, and AWS CloudFormation.

The following diagram is a common solution architecture pattern you can use to integrate any chatbot application to Knowledge Bases for Amazon Bedrock.

Common architecture pattern for Knowledge Bases for Amazon Bedrock

This architecture includes the following steps:

  1. A user interacts with the Streamlit chatbot interface and submits a query in natural language.
  2. This triggers a Lambda function, which invokes the Knowledge Bases RetrieveAndGenerate API. Internally, Knowledge Bases uses an Amazon Titan embedding model to convert the user query to a vector and finds chunks that are semantically similar to the user query. The user prompt is then augmented with the chunks retrieved from the knowledge base. The prompt, along with this additional context, is then sent to an LLM for response generation. In this solution, we use Anthropic Claude Instant as our LLM to generate user responses using additional context. Note that this solution is supported in Regions where Anthropic Claude on Amazon Bedrock is available.
  3. A contextually relevant response is sent back to the chatbot application and user.

Prerequisites

Amazon Bedrock users need to request access to foundation models before they are available for use. This is a one-time action and takes less than a minute. For this solution, you’ll need to enable access to the Titan Embeddings G1 – Text and Claude Instant – v1.2 models in Amazon Bedrock. For more information, refer to Model access.

Clone the GitHub repo

The solution presented in this post is available in the following GitHub repo. You need to clone the GitHub repository to your local machine. Open a terminal window and run the following command. Note this is one single git clone command.

git clone --depth 2 --filter=blob:none --no-checkout https://github.com/aws-samples/amazon-bedrock-samples && cd amazon-bedrock-samples && git checkout main rag-solutions/contextual-chatbot-using-knowledgebase

Upload your knowledge dataset to Amazon S3

We download the dataset for our knowledge base and upload it into an S3 bucket. This dataset will feed and power the knowledge base. Complete the following steps:

  1. Navigate to the Annual reports, proxies and shareholder letters data repository and download the last few years of Amazon shareholder letters.
  2. On the Amazon S3 console, choose Buckets in the navigation pane.
  3. Choose Create bucket.
  4. Name the bucket knowledgebase-<your-awsaccount-number>.
  5. Leave all other bucket settings as default and choose Create.
  6. Navigate to the knowledgebase-<your-awsaccount-number> bucket.
  7. Choose Create folder and name it dataset.
  8. Leave all other folder settings as default and choose Create.
  9. Navigate back to the bucket home and choose Create folder to create a new folder and name it lambdalayer.
  10. Leave all other settings as default and choose Create.
  11. Navigate to the dataset folder.
  12. Upload the annual reports, proxies and shareholder letters dataset files you downloaded earlier to this bucket and choose Upload.
  13. Navigate to the lambdalayer folder.
  14. Upload the knowledgebase-lambdalayer.zip file available under the /lambda/layer folder in the GitHub repo you cloned earlier and choose Upload. You will use this Lambda layer code later to create the Lambda function.


Create a knowledge base

In this step, we create a knowledge base using the Amazon shareholder letters dataset we uploaded to our S3 bucket in the previous step.

  1. On the Amazon Bedrock console, under Orchestration in the navigation pane, choose Knowledge base.
  2. Choose Create knowledge base.
  3. In the Knowledge base details section, enter a name and optional description.
  4. In the IAM permissions section, select Create and use a new service role and enter a name for the role.
  5. Add tags as needed.
  6. Choose Next.
  7. Leave Data source name as the default name.
  8. For S3 URI, choose Browse S3 and choose the S3 bucket knowledgebase-<your-awsaccount-number>/dataset/. You need to point to the bucket and dataset folder you created in the previous steps.
  9. In the Advanced settings section, leave the default values (if you want, you can change the default chunking strategy and specify the chunk size and overlay in percentage).
  10. Choose Next.
  11. For Embeddings model, select Titan Embeddings G1 – Text.
  12. For Vector database, you can either select Quick create a new vector store or Choose a vector store you have created. Note that to use a vector store of your choice, you need to have one preconfigured. Four vector engine types are currently supported: the vector engine for Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, and Redis Enterprise Cloud. For this post, we select Quick create a new vector store, which by default creates a new OpenSearch Serverless vector store in your account.
  13. Choose Next.
  14. On the Review and create page, review all the information, or choose Previous to modify any options.
  15. Choose Create knowledge base. The knowledge base creation process begins and the status shows In progress. It will take a few minutes to create the vector store and knowledge base. Don’t navigate away from the page; otherwise, creation will fail.
  16. When the knowledge base status is in the Ready state, note down the knowledge base ID. You will use it in the next steps to configure the Lambda function.
  17. Now that the knowledge base is ready, we need to sync our Amazon shareholder letters data to it. In the Data Source section of the knowledge base details page, choose Sync to trigger the data ingestion process from the S3 bucket to the knowledge base.

This sync process splits the document files into smaller chunks of the chunk size specified earlier, generates vector embeddings using the selected text embedding model, and stores them in the vector store managed by Knowledge Bases for Amazon Bedrock.


When the dataset sync is complete, the status of the data source will change to the Ready state. Note that if you add any additional documents to the S3 data folder, you need to re-sync the knowledge base.


Congratulations, your knowledge base is ready.

Note that you can also use Knowledge Bases for Amazon Bedrock service APIs and the AWS Command Line Interface (AWS CLI) to programmatically create a knowledge base. You will need to run various sections of the Jupyter notebook provided under the /notebook folder in the GitHub repo.

Create a Lambda function

This Lambda function is deployed using an AWS CloudFormation template available in the GitHub repo under the /cfn folder. The template requires two parameters: the S3 bucket name and the knowledge base ID.

  1. On the AWS CloudFormation service home page, choose Create stack to create a new stack.
  2. Select Template is ready for Prepare template.
  3. Select Upload the template file for Template source.
  4. Choose Choose file, navigate to the GitHub repo you cloned earlier, and choose the .yaml file under the /cfn folder.
  5. Choose Next.
  6. For Stack name, enter a name.
  7. In the Parameters section, enter the knowledge base ID and S3 bucket name you noted down earlier.
  8. Choose Next.
  9. Leave all default options as is, choose Next, and choose Submit.
  10. Verify that the CloudFormation template ran successfully, and there are no errors.

Congratulations, you have created a Lambda function, related roles, and policies successfully.

Test the contextual chatbot application

To test your chatbot application, complete the following steps:

  1. Open a new terminal or a command line window on your machine.
  2. Run the following command to install the AWS SDK for Python (Boto3). Boto3 makes it straightforward to integrate a Python application, library, or script with AWS services.
    pip install boto3

  3. Run the following command to install and set up a local Python development environment to run the Streamlit application:
    pip install streamlit

  4. Navigate to the /streamlit folder in the GitHub repository folder you cloned earlier.
  5. Run the following command to instantiate the chatbot application:
    python -m streamlit run chatbot.py

This should open a web-based chat application powered by Streamlit in your default web browser.

  1. Use this Streamlit chatbot application to post natural language questions to start the conversations powered by Knowledge Bases for Amazon Bedrock.

When you submit a prompt, the Streamlit app triggers the Lambda function, which invokes the Knowledge Bases RetrieveAndGenerate API to search and generate responses.

The following are some sample questions and the related knowledge base responses. Try some of these questions as prompts.

Q: What is Amazon doing in the field of generative AI?
A: Amazon has been working on their own large language models (LLMs) for generative AI and believes it will transform and improve every customer experience. They plan to continue investing substantially in these models across all their consumer, seller, brand, and creator experiences.

Q: What is AWS year-over-year revenue in 2022?
A: AWS revenue grew 29% year-over-year in 2022 on a $62 billion revenue base.

Q: How many days has Amazon asked employees to come to work in office?
A: Amazon has asked corporate employees to come back to office at least three days a week beginning May 2022.

Q: By what percentage did AWS revenue grow year-over-year in 2022?
A: AWS had a 29% year-over-year (‘YoY’) revenue in 2022.

Q: Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver according to the passage?
A: In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors.

Q: Which was the first inference chip launched by AWS according to the passage?
A: AWS launched their first inference chips (“Inferentia”) in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense.

Q: According to the context, in what year did Amazon’s annual revenue increase from $245B to $434B?
A: Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022.

Q: Tell me again what was the revenue in 2019?
A: Amazon’s revenue in 2019 was $245 billion.

Q: and, 2021?
A: Amazon’s revenue in 2021 was $469.8 billion, an increase of 22% over 2020.

Q: And, remind me again when was the first inference chip was launched?
A: Amazon’s first inference chip was Inferentia, which launched in 2019.

During the first call to the Lambda function, the RetrieveAndGenerate API returns a sessionId, which is then passed by the Streamlit app along with the subsequent user prompt as an input to the RetrieveAndGenerate API to continue the conversation in the same session. The RetrieveAndGenerate API manages the short-term memory and uses the chat history as long as the same sessionId is passed as an input in the successive calls.
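For example, reusing the bedrock_agent_runtime client and rag_configuration from the earlier sketch, a follow-up turn in the same conversation might look like this:

follow_up = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "and, 2021?"},
    retrieveAndGenerateConfiguration=rag_configuration,
    sessionId=response["sessionId"],  # continue the same conversation
)
print(follow_up["output"]["text"])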

Congratulations, you have successfully created and tested a chatbot application using Knowledge Bases for Amazon Bedrock.

Clean up

Failing to delete resources such as the S3 bucket, OpenSearch Serverless collection, and knowledge base will incur charges. To clean up these resources, delete the CloudFormation stack, delete the S3 bucket (including any document folders and files stored in that bucket), delete the OpenSearch Serverless collection, delete the knowledge base, and delete any roles, policies, and permissions that you created earlier.

Conclusion

In this post, we provided an overview of contextual chatbots and explained why they’re important. We described the complexities involved in data ingestion and text generation workflows for a RAG architecture. We then introduced how Knowledge Bases for Amazon Bedrock creates a fully managed serverless RAG system, including a vector store. Finally, we provided a solution architecture and sample code in a GitHub repo to retrieve and generate contextual responses for a chatbot application using a knowledge base.

By explaining the value of contextual chatbots, the challenges of RAG systems, and how Knowledge Bases for Amazon Bedrock addresses those challenges, this post aimed to showcase how Amazon Bedrock enables you to build sophisticated conversational AI applications with minimal effort.

For more information, see the Amazon Bedrock Developer Guide and Knowledge Base APIs.


About the Authors

Manish Chugh is a Principal Solutions Architect at AWS based in San Francisco, CA. He specializes in machine learning and generative AI. He works with organizations ranging from large enterprises to early-stage startups on problems related to machine learning. His role involves helping these organizations architect scalable, secure, and cost-effective workloads on AWS. He regularly presents at AWS conferences and other partner events. Outside of work, he enjoys hiking on East Bay trails, road biking, and watching (and playing) cricket.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and gives prescriptive guidance to achieve their objectives with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling, and hiking.
