At Your Microservice: NVIDIA Smooths Businesses’ Journey to Generative AI

NVIDIA’s AI platform is available to any forward-thinking business — and it’s easier to use than ever.

Launched today, NVIDIA AI Enterprise 5.0 includes NVIDIA microservices, downloadable software containers for deploying generative AI applications and accelerated computing. It’s available from leading cloud service providers, system builders and software vendors — and it’s in use at customers such as Uber.

“Our adoption of NVIDIA AI Enterprise inference software is important for meeting the high performance our users expect,” said Albert Greenberg, vice president of platform engineering at Uber. “Uber prides itself on being at the forefront of adopting and using the latest, most advanced AI innovations to deliver a customer service platform that sets the industry standard for effectiveness and excellence.”

Microservices Speed App Development

Developers are turning to microservices as an efficient way to build modern enterprise applications at a global scale. Working from a browser, they use cloud APIs, or application programming interfaces, to compose apps that can run on systems and serve users worldwide.

NVIDIA AI Enterprise 5.0 now includes a wide range of microservices — NVIDIA NIM for deploying AI models in production, and the NVIDIA CUDA-X collection of microservices, which includes NVIDIA cuOpt.

NIM microservices optimize inference for dozens of popular AI models from NVIDIA and its partner ecosystem.

Powered by NVIDIA inference software — including Triton Inference Server, TensorRT, and TensorRT-LLM — NIM slashes deployment times from weeks to minutes. It provides security and manageability based on industry standards as well as compatibility with enterprise-grade management tools.

NVIDIA cuOpt is a GPU-accelerated AI microservice that’s set world records for route optimization and can empower dynamic decision-making that reduces cost, time and carbon footprint. It’s one of the CUDA-X microservices that help industries put AI into production.

More capabilities are in the works. For example, NVIDIA RAG LLM operator — now in early access and described in more detail here — will move co-pilots and other generative AI applications that use retrieval-augmented generation from pilot to production without rewriting any code.

NVIDIA microservices are being adopted by leading application and cybersecurity platform providers including CrowdStrike, SAP and ServiceNow.

More Tools and Features

Three other updates in version 5.0 are worth noting.

The platform now packs NVIDIA AI Workbench, a developer toolkit for quickly downloading, customizing, and running generative AI projects. The software is now generally available and supported with an NVIDIA AI Enterprise license.

Version 5.0 also now supports Red Hat OpenStack Platform, the environment most Fortune 500 companies use for creating private and public cloud services. Maintained by Red Hat, it provides developers a familiar option for building virtual computing environments. IBM Consulting will help customers deploy these new capabilities.

In addition, version 5.0 expands support to cover a wide range of the latest NVIDIA GPUs, networking hardware and virtualization software.

Available to Run Anywhere

The enhanced NVIDIA AI platform is easier to access than ever.

NIM and CUDA-X microservices and all the 5.0 features will be available soon on the AWS, Google Cloud, Microsoft Azure and Oracle Cloud marketplaces.

For those who prefer to run code in their own data centers, VMware Private AI Foundation with NVIDIA will support the software, so it can be deployed in the virtualized data centers of Broadcom’s customers.

Companies have the option of running NVIDIA AI Enterprise on Red Hat OpenShift, allowing them to deploy on bare-metal or virtualized environments. It’s also supported on Canonical’s Charmed Kubernetes as well as Ubuntu.

In addition, the AI platform will be part of the software available on HPE ProLiant servers from Hewlett Packard Enterprise (HPE). HPE’s enterprise computing solution for generative AI handles inference and model fine-tuning using NVIDIA AI Enterprise.

Anyscale, Dataiku and DataRobot — three leading providers of software for managing machine learning operations — will also support NIM on their platforms. They join an NVIDIA ecosystem of hundreds of MLOps partners, including Microsoft Azure Machine Learning, Dataloop AI, Domino Data Lab and Weights & Biases.

However they access it, NVIDIA AI Enterprise 5.0 users can benefit from software that’s secure, production-ready and optimized for performance. It can be flexibly deployed for applications in the data center, the cloud, on workstations or at the network’s edge.

NVIDIA AI Enterprise is available through leading system providers, including Cisco, Dell Technologies, HP, HPE, Lenovo and Supermicro.

Hear Success Stories at GTC

Users will share their experiences with the software at NVIDIA GTC, a global AI conference, running March 18-21 at the San Jose Convention Center.

For example, ServiceNow chief digital information officer Chris Bedi will speak on a panel about harnessing generative AI’s potential. In a separate talk, ServiceNow vice president of AI Products Jeremy Barnes will discuss using NVIDIA AI Enterprise to maximize developer productivity.

Executives from BlackRock, Medtronic, SAP and Uber will discuss their work in finance, healthcare, enterprise software, and business operations using the NVIDIA AI platform.

In addition, executives from ControlExpert, a Germany-based global application provider for car insurance companies, will share how they developed an AI-powered claims management solution using NVIDIA AI Enterprise software.

They’re among a growing set of companies that benefit from NVIDIA’s work evaluating hundreds of internal and external generative AI projects — all integrated into a single package that’s been tested for stability and security.

And get the full picture from NVIDIA CEO and founder Jensen Huang in his GTC keynote.

See notice regarding software product information. 

Read More

Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices

NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker.

NIM, part of the NVIDIA AI Enterprise software platform listed on AWS marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment or use NIM tools to create your own containers.

In this post, we provide a high-level introduction to NIM and show how you can use it with SageMaker.

An introduction to NVIDIA NIM

NIM provides optimized and pre-generated engines for a variety of popular models for inference. These microservices support a variety of LLMs, such as Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B, out of the box using pre-built NVIDIA TensorRT engines tailored for specific NVIDIA GPUs for maximum performance and utilization. These models are curated with the optimal hyperparameters for model-hosting performance, so you can deploy applications with ease.

If your model is not in NVIDIA’s set of curated models, NIM offers essential utilities such as the Model Repo Generator, which facilitates the creation of a TensorRT-LLM-accelerated engine and a NIM-format model directory through a straightforward YAML file. Furthermore, an integrated community backend of vLLM provides support for cutting-edge models and emerging features that may not have been seamlessly integrated into the TensorRT-LLM-optimized stack.

In addition to creating optimized LLMs for inference, NIM provides advanced hosting technologies such as optimized scheduling techniques like in-flight batching, which can break down the overall text generation process for an LLM into multiple iterations on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the NIM runtime immediately evicts finished sequences from the batch. The runtime then begins running new requests while other requests are still in flight, making the best use of your compute instances and GPUs.

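To make the idea of iteration-level scheduling concrete, the following toy Python sketch shows a scheduler that evicts finished sequences and admits queued requests at every decoding step. It illustrates the concept only; it is not NIM's or TensorRT-LLM's actual implementation, and the function and variable names are our own.

from collections import deque

def inflight_batching(request_lengths, max_batch_size=4):
    """Toy simulation of iteration-level (in-flight) batching.

    request_lengths: number of tokens each request still needs to generate.
    Each loop iteration produces one token per active request; finished
    requests are evicted immediately and queued requests take their slots.
    """
    queue = deque(enumerate(request_lengths))
    active = {}  # request id -> remaining tokens to generate
    steps = 0
    while queue or active:
        # Admit new requests into any free batch slots.
        while queue and len(active) < max_batch_size:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decoding iteration for the whole batch.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # evict the finished sequence right away
                del active[rid]
        steps += 1
    return steps

print(inflight_batching([3, 10, 2, 7, 5]))  # total decoding iterations for this workload
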
Deploying NIM on SageMaker

NIM integrates with SageMaker, allowing you to host your LLMs with performance and cost optimization while benefiting from the capabilities of SageMaker. When you use NIM on SageMaker, you can use capabilities such as scaling out the number of instances to host your model, performing blue/green deployments, and evaluating workloads using shadow testing—all with best-in-class observability and monitoring with Amazon CloudWatch.

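As a rough sketch of what hosting a NIM container on a SageMaker endpoint can look like with the SageMaker Python SDK, consider the following. The container image URI, environment variable, instance type, and request payload are placeholders rather than documented values; consult the NIM documentation and the upcoming in-depth guide for the exact settings.

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder: pull the real NIM container URI from your AWS Marketplace subscription.
nim_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/<nim-llm-container>:<tag>"

model = Model(
    image_uri=nim_image_uri,
    role=role,
    env={"NGC_API_KEY": "<your-ngc-api-key>"},  # assumed environment variable
    sagemaker_session=session,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # an NVIDIA GPU instance
    endpoint_name="nim-llm-endpoint",
)

predictor = Predictor(
    endpoint_name="nim-llm-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
# The request schema below is an assumption; check the NIM model card for the actual payload.
print(predictor.predict({"prompt": "Write a haiku about GPUs.", "max_tokens": 128}))

Once the endpoint is up, it behaves like any other SageMaker endpoint, so auto scaling, blue/green deployments, shadow testing, and CloudWatch monitoring apply in the usual way.
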
Conclusion

Using NIM to deploy optimized LLMs can be a great option for both performance and cost, and it helps make deploying LLMs effortless. In the future, NIM will also allow for Parameter-Efficient Fine-Tuning (PEFT) customization methods like LoRA and P-tuning, and it plans to broaden its LLM support through the Triton Inference Server, TensorRT-LLM, and vLLM backends.

We encourage you to learn more about NVIDIA microservices and how to deploy your LLMs using SageMaker and try out the benefits available to you. NIM is available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace.

In the near future, we will post an in-depth guide for NIM on SageMaker.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Harish Tummalacherla is a Software Engineer with the Deep Learning Performance team at SageMaker. He works on performance engineering for serving large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling and ski mountaineering.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Read More

MELON: Reconstructing 3D objects from images with unknown poses

A person’s prior experience and understanding of the world generally enables them to easily infer what an object looks like in whole, even if only looking at a few 2D pictures of it. Yet the capacity for a computer to reconstruct the shape of an object in 3D given only a few images has remained a difficult algorithmic problem for years. This fundamental computer vision task has applications ranging from the creation of e-commerce 3D models to autonomous vehicle navigation.

A key part of the problem is how to determine the exact positions from which images were taken, known as pose inference. If camera poses are known, a range of successful techniques — such as neural radiance fields (NeRF) or 3D Gaussian Splatting — can reconstruct an object in 3D. But if these poses are not available, then we face a difficult “chicken and egg” problem where we could determine the poses if we knew the 3D object, but we can’t reconstruct the 3D object until we know the camera poses. The problem is made harder by pseudo-symmetries — i.e., many objects look similar when viewed from different angles. For example, square objects like a chair tend to look similar every 90° rotation. Pseudo-symmetries of an object can be revealed by rendering it on a turntable from various angles and plotting its photometric self-similarity map.

Self-Similarity map of a toy truck model. Left: The model is rendered on a turntable from various azimuthal angles, θ. Right: The average L2 RGB similarity of a rendering from θ with that of θ*. The pseudo-similarities are indicated by the dashed red lines.

The diagram above only visualizes one dimension of rotation. It becomes even more complex (and difficult to visualize) when introducing more degrees of freedom. Pseudo-symmetries make the problem ill-posed, with naïve approaches often converging to local minima. In practice, such an approach might mistake the back view as the front view of an object, because they share a similar silhouette. Previous techniques (such as BARF or SAMURAI) side-step this problem by relying on an initial pose estimate that starts close to the global minima. But how can we approach this if those aren’t available?

Methods such as GNeRF and VMRF leverage generative adversarial networks (GANs) to overcome the problem. These techniques have the ability to artificially “amplify” a limited number of training views, aiding reconstruction. GAN techniques, however, often have complex, sometimes unstable, training processes, making robust and reliable convergence difficult to achieve in practice. A range of other successful methods, such as SparsePose or RUST, can infer poses from a limited number of views, but require pre-training on a large dataset of posed images, which aren’t always available, and can suffer from “domain-gap” issues when inferring poses for different types of images.

In “MELON: NeRF with Unposed Images in SO(3)”, spotlighted at 3DV 2024, we present a technique that can determine object-centric camera poses entirely from scratch while reconstructing the object in 3D. MELON (Modulo Equivalent Latent Optimization of NeRF) is one of the first techniques that can do this without initial pose camera estimates, complex training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. We demonstrate that MELON can reconstruct a NeRF from unposed images with state-of-the-art accuracy while requiring as few as 4–6 images of an object.

MELON

We leverage two key techniques to aid convergence of this ill-posed problem. The first is a very lightweight, dynamically trained convolutional neural network (CNN) encoder that regresses camera poses from training images. We pass a downscaled training image to a four-layer CNN that infers the camera pose. This CNN is initialized from noise and requires no pre-training. Its capacity is so small that it forces similar-looking images to similar poses, providing an implicit regularization that greatly aids convergence.

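As a minimal sketch of what such an encoder could look like (PyTorch), the snippet below uses a four-layer CNN that maps a downscaled image to a pose; the layer widths, input resolution, and two-parameter pose output are our own assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class TinyPoseEncoder(nn.Module):
    """Lightweight CNN that regresses a camera pose from a downscaled RGB image."""

    def __init__(self, pose_dim: int = 2):  # e.g., azimuth and elevation
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 4x4
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 2x2
        )
        self.head = nn.Linear(64 * 2 * 2, pose_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images)       # images: (B, 3, 32, 32)
        return self.head(x.flatten(1))  # (B, pose_dim)
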
The second technique is a modulo loss that simultaneously considers pseudo-symmetries of an object. We render the object from a fixed set of viewpoints for each training image, backpropagating the loss only through the view that best fits the training image. This effectively considers the plausibility of multiple views for each image. In practice, we find N=2 views (viewing an object from the other side) is all that’s required in most cases, but we sometimes get better results with N=4 for square objects.

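The modulo loss can be sketched as follows: render N candidate views per training image, compute a per-view photometric L2 loss, and backpropagate only through the best-fitting view. In this sketch, render_view is a stand-in for the NeRF renderer, and generating candidates by shifting the predicted azimuth by 360°/N is our assumption about how the duplicated views are formed.

import torch

def modulo_loss(render_view, target_images, base_poses, n_views: int = 2):
    """Keep only the best of N pseudo-symmetric candidate views per image."""
    per_view_losses = []
    for k in range(n_views):
        candidate = base_poses.clone()
        candidate[:, 0] = candidate[:, 0] + 2 * torch.pi * k / n_views  # shift azimuth
        pred = render_view(candidate)  # (B, H, W, 3), rendered by the NeRF
        per_image = ((pred - target_images) ** 2).mean(dim=(1, 2, 3))
        per_view_losses.append(per_image)
    stacked = torch.stack(per_view_losses, dim=0)  # (N, B)
    best, _ = stacked.min(dim=0)  # gradients flow only through the best view
    return best.mean()
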
These two techniques are integrated into standard NeRF training, except that instead of fixed camera poses, poses are inferred by the CNN and duplicated by the modulo loss. Photometric gradients back-propagate through the best-fitting cameras into the CNN. We observe that cameras generally converge quickly to globally optimal poses (see animation below). After training of the neural field, MELON can synthesize novel views using standard NeRF rendering methods.

We simplify the problem by using the NeRF-Synthetic dataset, a popular benchmark for NeRF research and common in the pose-inference literature. This synthetic dataset has cameras at precisely fixed distances and a consistent “up” orientation, requiring us to infer only the polar coordinates of the camera. This is the same as an object at the center of a globe with a camera always pointing at it, moving along the surface. We then only need the latitude and longitude (2 degrees of freedom) to specify the camera pose.

MELON uses a dynamically trained lightweight CNN encoder that predicts a pose for each image. Predicted poses are replicated by the modulo loss, which only penalizes the smallest L2 distance from the ground truth color. At evaluation time, the neural field can be used to generate novel views.

Results

We compute two key metrics to evaluate MELON’s performance on the NeRF Synthetic dataset. The error in orientation between the ground truth and inferred poses can be quantified as a single angular error that we average across all training images, the pose error. We then test the accuracy of MELON’s rendered objects from novel views by measuring the peak signal-to-noise ratio (PSNR) against held out test views. We see that MELON quickly converges to the approximate poses of most cameras within the first 1,000 steps of training, and achieves a competitive PSNR of 27.5 dB after 50k steps.

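The two metrics are straightforward to compute; the following NumPy sketch is our own illustration rather than the paper's evaluation code, and it assumes unit-length view directions and image values in [0, 1].

import numpy as np

def mean_angular_error_deg(pred_dirs, gt_dirs):
    """Average angle (degrees) between predicted and ground-truth viewing directions."""
    pred = pred_dirs / np.linalg.norm(pred_dirs, axis=-1, keepdims=True)
    gt = gt_dirs / np.linalg.norm(gt_dirs, axis=-1, keepdims=True)
    cosines = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cosines)).mean()

def psnr(rendered, reference):
    """Peak signal-to-noise ratio for images with values in [0, 1]."""
    mse = np.mean((rendered - reference) ** 2)
    return -10.0 * np.log10(mse)
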
Convergence of MELON on a toy truck model during optimization. Left: Rendering of the NeRF. Right: Polar plot of predicted (blue x), and ground truth (red dot) cameras.

MELON achieves similar results for other scenes in the NeRF Synthetic dataset.

Reconstruction quality comparison between ground-truth (GT) and MELON on NeRF-Synthetic scenes after 100k training steps.

Noisy images

MELON also works well when performing novel view synthesis from extremely noisy, unposed images. We add varying amounts, σ, of white Gaussian noise to the training images. For example, the object in σ=1.0 below is impossible to make out, yet MELON can determine the pose and generate novel views of the object.

Novel view synthesis from noisy unposed 128×128 images. Top: Example of noise level present in training views. Bottom: Reconstructed model from noisy training views and mean angular pose error.

This perhaps shouldn’t be too surprising, given that techniques like RawNeRF have demonstrated NeRF’s excellent de-noising capabilities with known camera poses. The fact that MELON works for noisy images of unknown camera poses so robustly was unexpected.

Conclusion

We present MELON, a technique that can determine object-centric camera poses to reconstruct objects in 3D without the need for approximate pose initializations, complex GAN training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. Though we only demonstrated MELON on synthetic images, we are adapting our technique to work in real-world conditions. See the paper and MELON site to learn more.

Acknowledgements

We would like to thank our paper co-authors Axel Levy, Matan Sela, and Gordon Wetzstein, as well as Florian Schroff and Hartwig Adam for continuous help in building this technology. We also thank Matthew Brown, Ricardo Martin-Brualla and Frederic Poitevin for their helpful feedback on the paper draft. We also acknowledge the use of the computational resources at the SLAC Shared Scientific Data Facility (SDF).

Read More

Fine-tune Code Llama on Amazon SageMaker JumpStart

Today, we are excited to announce the capability to fine-tune Code Llama models by Meta using Amazon SageMaker JumpStart. The Code Llama family of large language models (LLMs) is a collection of pre-trained and fine-tuned code generation models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned Code Llama models provide better accuracy and explainability over the base Code Llama models, as evident from testing against the HumanEval and MBPP datasets. You can fine-tune and deploy Code Llama models with SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK. Fine-tuning of Llama models is based on the scripts provided in the llama-recipes GitHub repo from Meta, using PyTorch FSDP, PEFT/LoRA, and Int8 quantization techniques.

In this post, we walk through how to fine-tune Code Llama pre-trained models via SageMaker JumpStart through a one-click UI and SDK experience available in the following GitHub repository.

What is SageMaker JumpStart

With SageMaker JumpStart, machine learning (ML) practitioners can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

What is Code Llama

Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets and sampling more data from that same dataset for longer. Code Llama features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (for example, “Write me a function that outputs the Fibonacci sequence”). You can also use it for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (JavaScript), C#, Bash, and more.

Why fine-tune Code Llama models

Meta published Code Llama performance benchmarks on HumanEval and MBPP for common coding languages such as Python, Java, and JavaScript. The Code Llama Python models demonstrated varying performance on HumanEval across different coding languages and tasks, ranging from 38% for the 7B Python model to 57% for the 70B Python model. In addition, Code Llama models fine-tuned on the SQL programming language have shown better results, as evident in SQL evaluation benchmarks. These published benchmarks highlight the potential benefits of fine-tuning Code Llama models, enabling better performance, customization, and adaptation to specific coding domains and tasks.

No-code fine-tuning via the SageMaker Studio UI

To start fine-tuning your Llama models using SageMaker Studio, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.

You will find listings of over 350 models, ranging from open source to proprietary models.

  1. Search for Code Llama models.

If you don’t see Code Llama models, you can update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps. You can also find other model variants by choosing Explore all Code Generation Models or searching for Code Llama in the search box.

SageMaker JumpStart currently supports instruction fine-tuning for Code Llama models. The following screenshot shows the fine-tuning page for the Code Llama 2 70B model.

  1. For Training dataset location, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning.
  2. Set your deployment configuration, hyperparameters, and security settings for fine-tuning.
  3. Choose Train to start the fine-tuning job on a SageMaker ML instance.

We discuss the dataset format you need to prepare for instruction fine-tuning in the next section.

  1. After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart.

The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

Fine-tune via the SageMaker Python SDK

In this section, we demonstrate how to fine-tune Code Llama models using the SageMaker Python SDK on an instruction-formatted dataset. Specifically, the model is fine-tuned for a set of natural language processing (NLP) tasks described using instructions. This helps improve the model’s performance for unseen tasks with zero-shot prompts.

Complete the following steps to complete your fine-tuning job. You can get the entire fine-tuning code from the GitHub repository.

First, let’s look at the dataset format required for the instruction fine-tuning. The training data should be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a data sample. All training data must be in a single folder. However, it can be saved in multiple .jsonl files. The following is a sample in JSON lines format:

{
	'system_prompt': 'a chat',
	'question': 'Please focus on the efficiency of this problem and provide code in python:\nYou are given two strings `s` and `t` consisting of only lowercase English letters.\n\nReturn _the minimum number of characters that need to be appended to the end of_ `s` _so that_ `t` _becomes a **subsequence** of_ `s`.\n\nA **subsequence** is a string that can be derived from another string by deleting some or no characters without changing the order of the remaining characters.\n\n**Example 1:**\n\n**Input:** s = "coaching ", t = "coding "\n**Output:** 4\n**Explanation:** Append the characters "ding " to the end of s so that s = "coachingding ".\nNow, t is a subsequence of s ( "**co**aching**ding** ").\nIt can be shown that appending any 3 characters to the end of s will never make t a subsequence.\n\n**Example 2:**\n\n**Input:** s = "abcde ", t = "a "\n**Output:** 0\n**Explanation:** t is already a subsequence of s ( "**a**bcde ").\n\n**Example 3:**\n\n**Input:** s = "z ", t = "abcde "\n**Output:** 5\n**Explanation:** Append the characters "abcde " to the end of s so that s = "zabcde ".\nNow, t is a subsequence of s ( "z**abcde** ").\nIt can be shown that appending any 4 characters to the end of s will never make t a subsequence.\n\n**Constraints:**\n\n* `1 <= s.length, t.length <= 105`\n* `s` and `t` consist only of lowercase English letters.',
	'response': '### Explanation\n The algorithm maintains two pointers, `i` and `j`, to iterate through the given strings `s` and `t`. The pointer `i` iterates through the string `s` and the pointer `j` iterates through the string `t`. While iterating through `s`, if we find a character that matches the current character in `t`, we increment the `j` pointer. This is because the matched character can be a part of the subsequence we are looking for. The iteration stops when either `i` reaches the end of `s` or `j` reaches the end of `t`.\n\nAt the end, we subtract the index `j` from the length of `t`. This gives us the minimum number of characters that need to be appended to the end of `s` so that `t` becomes a subsequence of `s`.\n\n### Pseudocode:\n```\nMIN-ADDITION(s, t)\n s_len = LENGTH(s)\n t_len = LENGTH(t)\n i = 1\n j = 1\n WHILE i <= s_len AND j <= t_len\n IF s[i] == t[j]\n j = j + 1\n ENDIF\n i = i + 1\n ENDWHILE\n RETURN t_len - j + 1\nENDMIN-ADDITION\n```\n\n### Code\n```python\ndef min_addition(s: str, t: str) -> int:\n s_len = len(s)\n t_len = len(t)\n i, j = 0, 0\n while i < s_len and j < t_len:\n if s[i] == t[j]:\n j += 1\n i += 1\n return t_len - j\n```\n'
}

The training folder can contain a template.json file describing the input and output formats. The following is an example template:

{
    "prompt": "{system_prompt} ### Input: {question}",
    "completion": "{response}"
}

To match the template, each sample in the JSON lines files must include system_prompt, question, and response fields. In this demonstration, we use the Dolphin Coder dataset from Hugging Face.

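As a quick illustration of how the template maps onto a single sample, the following sketch applies template.json to the first line of a training file; the local file names are hypothetical, and the exact concatenation performed by the training script may differ.

import json

with open("template.json") as f:       # hypothetical local copy of the template
    template = json.load(f)

with open("train.jsonl") as f:         # hypothetical local copy of the training data
    sample = json.loads(f.readline())  # one training example

prompt = template["prompt"].format(
    system_prompt=sample["system_prompt"], question=sample["question"]
)
completion = template["completion"].format(response=sample["response"])
print(prompt)
print(completion)
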
After you prepare the dataset and upload it to the S3 bucket, you can start fine-tuning using the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "meta-textgeneration-llama-codellama-7b" 
model_version = "*"
train_data_location = f"s3://{your_own_bucket_hosting_training_data}/" # training data in s3 bucket

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters= hyperparameters,
    environment={
        "accept_eula": "false"
    },  # please change `accept_eula` to be `true` to accept EULA.
)

estimator.fit({"training": train_data_location})

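The estimator above references a hyperparameters dictionary that isn’t shown. A minimal example of what it might contain, drawing on the supported hyperparameters described later in this post (the specific values are illustrative), is the following:

# Illustrative values only; see the supported hyperparameters section for details.
hyperparameters = {
    "instruction_tuned": "True",      # use instruction fine-tuning
    "epoch": "5",
    "learning_rate": "0.0001",
    "per_device_train_batch_size": "4",
    "max_input_length": "1024",
    "lora_r": "8",
    "lora_alpha": "32",
    "lora_dropout": "0.05",
    "enable_fsdp": "True",            # not compatible with int8_quantization
}
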
You can deploy the fine-tuned model directly from the estimator, as shown in the following code. For details, see the notebook in the GitHub repository.

finetuned_predictor = estimator.deploy()

Fine-tuning techniques

Language models such as Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly high CUDA memory. Furthermore, training these models can be very slow due to the size of the model. Therefore, for efficient fine-tuning, we use the following optimizations:

  • Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for large models. With this method, you freeze the whole model and add only a small set of adjustable parameters or layers to it. For instance, instead of training all 7 billion parameters of Llama 2 7B, you can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because you only need to store gradients, optimizer states, and other training-related information for that 1% of the parameters, and it also reduces training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models. (A minimal configuration sketch follows this list.)
  • Int8 quantization – Even with optimizations such as LoRA, models such as Llama 70B are still too big to train. To decrease the memory footprint during training, you can use Int8 quantization. Quantization typically reduces the precision of the floating-point data types. Although this decreases the memory required to store model weights, it can degrade performance due to loss of information. Int8 quantization uses only a quarter of the precision but doesn’t incur a degradation of performance because it doesn’t simply drop the bits; it rounds the data from one type to another. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
  • Fully Sharded Data Parallel (FSDP) – This is a type of data-parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance via communication and computation overlapping during training.

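To make the LoRA idea concrete, here is a minimal, hedged sketch using the Hugging Face peft library. The base model ID and target modules are illustrative choices, not the exact settings SageMaker JumpStart uses under the hood; the rank, alpha, and dropout values mirror the defaults listed in the hyperparameters section below.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; loading it downloads the full pre-trained weights.
base_model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

lora_config = LoraConfig(
    r=8,                # rank of the low-rank update matrices (lora_r)
    lora_alpha=32,      # scaling factor (lora_alpha)
    lora_dropout=0.05,  # lora_dropout
    target_modules=["q_proj", "v_proj"],  # which projections get adapters (an assumption)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
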
The following table summarizes the details of each model with different settings.

Model | Default Setting | LoRA + FSDP | LoRA + No FSDP | Int8 Quantization + LoRA + No FSDP
Code Llama 2 7B | LoRA + FSDP | Yes | Yes | Yes
Code Llama 2 13B | LoRA + FSDP | Yes | Yes | Yes
Code Llama 2 34B | Int8 + LoRA + No FSDP | No | No | Yes
Code Llama 2 70B | Int8 + LoRA + No FSDP | No | No | Yes

Fine-tuning of Llama models is based on scripts provided by the following GitHub repo.

Supported hyperparameters for training

Code Llama 2 fine-tuning supports a number of hyperparameters, each of which can impact the memory requirement, training speed, and performance of the fine-tuned model:

  • epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
  • learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default is 1e-4.
  • instruction_tuned – Whether to instruction-train the model or not. Must be True or False. Default is False.
  • per_device_train_batch_size – The batch size per GPU core/CPU for training. Must be a positive integer. Default is 4.
  • per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default is 1.
  • max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
  • max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
  • max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default is -1.
  • validation_split_ratio – If the validation channel is none, the ratio of the train-validation split from the training data. Must be between 0 and 1. Default is 0.2.
  • train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default is 0.
  • preprocessing_num_workers – The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default is None.
  • lora_r – Lora R. Must be a positive integer. Default is 8.
  • lora_alpha – Lora Alpha. Must be a positive integer. Default is 32.
  • lora_dropout – Lora Dropout. Must be a positive float between 0 and 1. Default is 0.05.
  • int8_quantization – If True, the model is loaded with 8-bit precision for training. Default for 7B and 13B is False. Default for 70B is True.
  • enable_fsdp – If True, training uses FSDP. Default for 7B and 13B is True. Default for 70B is False. Note that int8_quantization is not supported with FSDP.

When choosing the hyperparameters, consider the following:

  • Setting int8_quantization=True decreases the memory requirement and leads to faster training.
  • Decreasing per_device_train_batch_size and max_input_length reduces the memory requirement and therefore can be run on smaller instances. However, setting very low values may increase the training time.
  • If you’re not using Int8 quantization (int8_quantization=False), use FSDP (enable_fsdp=True) for faster and more efficient training.

Supported instance types for training

The following table summarizes the supported instance types for training different models.

Model | Default Instance Type | Supported Instance Types
Code Llama 2 7B | ml.g5.12xlarge | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge
Code Llama 2 13B | ml.g5.12xlarge | ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge
Code Llama 2 70B | ml.g5.48xlarge | ml.g5.48xlarge, ml.p4d.24xlarge

When choosing the instance type, consider the following:

  • G5 instances provide the most efficient training among the instance types supported. Therefore, if you have G5 instances available, you should use them.
  • Training time largely depends on the number of GPUs and the amount of CUDA memory available. Therefore, training on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) takes roughly the same time, so you can use the cheaper instance for training (ml.g5.2xlarge).
  • When using p3 instances, training will be done with 32-bit precision because bfloat16 is not supported on these instances. Therefore, the training job will consume double the amount of CUDA memory when training on p3 instances compared to g5 instances.

To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.

Evaluation

Evaluation is an important step to assess the performance of fine-tuned models. We present both qualitative and quantitative evaluations to show the improvement of fine-tuned models over non-fine-tuned ones. In the qualitative evaluation, we show an example response from both the fine-tuned and non-fine-tuned models. In the quantitative evaluation, we use HumanEval, a test suite developed by OpenAI, to test the ability to produce correct and accurate Python code. The HumanEval repository is under an MIT license. We fine-tuned the Python variants of all Code Llama model sizes (Code Llama Python 7B, 13B, 34B, and 70B) on the Dolphin Coder dataset, and present the evaluation results in the following sections.

Qualitative evaluation

With your fine-tuned model deployed, you can start using the endpoint to generate code. In the following example, we present responses from both the base and fine-tuned Code Llama 34B Python variants on a test sample from the Dolphin Coder dataset:

import json

alpaca_format = """{system_prompt}

### Input:
{question}

### Response:\n"""

system="You are a coding AI"
prompt = """Please explain the reasoning behind the following solution and 
provide code in python: Given a string `s`, return the longest palindromic 
substring in `s`. **Example 1:** **Input:** s = "babad " **Output:** "bab " 
**Explanation:** "aba " is also a valid answer. **Example 2:** 
**Input:** s = "cbbd " **Output:** "bb " 
**Constraints:** * `1 <= s.length <= 1000` * `s` consist of only digits 
and English letters."""

payload = json.dumps({
    "inputText": alpaca_format.format(system_prompt=system, question=prompt),
    "textGenerationConfig": {
        "maxTokenCount": 2048,
        "temperature": 0.2
    }
})

finetuned_response = finetuned_predictor.predict(payload)
finetuned_response["generated_text"]

The fine-tuned Code Llama model, in addition to providing the code for the preceding query, generates a detailed explanation of the approach and pseudocode.

Code Llama 34b Python Non-Fine-Tuned Response:

### Code
def longestPalindrome(s):
    # Fill this in.
    longest = ""
    for i in range(len(s)):
        for j in range(len(s), i, -1):
            if len(longest) >= j-i:
                break
            elif s[i:j] == s[i:j][::-1]:
                longest = s[i:j]
                break
    return longest

print(longestPalindrome("abaxyzzyxf"))
# axyzzyx
###Reference:
- [https://www.geeksforgeeks.org/longest-palindrome-substring-set-1/]
(https://www.geeksforgeeks.org/longest-palindrome-substring-set-1/)

Code Llama 34B Python Fine-Tuned Response

### Explanation
The algorithm uses dynamic programming to find the longest palindromic substring. 
It initializes a 2D boolean table `dp` with the same dimensions as the input string `s`. 
The `dp[i][j]` value is set to `true` if the substring from index `i` to index `j` is a palindrome.

The algorithm then iterates through the string in a nested loop, starting from the first character to the last. 
It checks if the substring from index `i` to index `j` is a palindrome. 
If it is, it sets the `dp[i][j]` value to `true`. 
If the substring is of length 2 or 3, it sets the `dp[i][j]` value to `true` as well.

After the nested loop, the algorithm iterates through the `dp` table to find the longest palindromic substring. 
It keeps track of the maximum length and the starting and ending indices of the substring.

Finally, the algorithm returns the substring from the starting index to the ending index.

### Code
```python
def longestPalindrome(s: str) -> str:
    n = len(s)
    dp = [[False] * n for _ in range(n)]

    for i in range(n):
        dp[i][i] = True
        if i + 1 < n:
            dp[i][i + 1] = s[i] == s[i + 1]

    max_len = 1
    start = 0
    end = 0

    for i in range(n - 1):
        for j in range(i + 1, n):
            if j - i == 1 or dp[i + 1][j - 1]:
                dp[i][j] = s[i] == s[j]
                if j - i + 1 > max_len:
                    max_len = j - i + 1
                    start = i
                    end = j

    return s[start:end + 1]
```

Ground Truth

### Explanation 
1. Initialize `start` and `maxLength` for result substring. 
2. Iterate through the given string `s` using the index `i`. 
3. For each index `i`, create two pointers `l` and `r` starting at `i`. 
4. Check if there's a consecutive sequence of identical characters, increment the right pointer `r` until the end of the sequence is reached. 
5. Update the index `i` to the current value of `r`. 
6. Expand the pointers `l` and `r` outwards to find the longest palindromic substring, checking that characters on both sides are equal. 
7. If the current length of the substring is greater than `maxLength`, update `start` and `maxLength`. 
8. Return the longest palindromic substring using the `start` and `maxLength`. 

### Code 
```python 
def longest_palindromic_substring(s: str) -> str:
    n = len(s)
    if n == 0:
        return ""
    
    start, max_length = 0, 1
    for i in range(n):
        l, r = i, i
        while r < n - 1 and s[r] == s[r + 1]:
            r += 1
            i = r
        while l > 0 and r < n - 1 and s[l - 1] == s[r + 1]:
            l -= 1
            r += 1
        length = r - l + 1
        if length > max_length:
            start, max_length = l, length
    return s[start:start + max_length]
```

Interestingly, our fine-tuned version of Code Llama 34B Python provides a dynamic programming-based solution to the longest palindromic substring, which is different from the solution provided in the ground truth for the selected test example. Our fine-tuned model reasons about and explains the dynamic programming-based solution in detail. On the other hand, the non-fine-tuned model hallucinates potential outputs right after the print statement (shown in its response above) because the output axyzzyx is not the longest palindrome in the given string. In terms of time complexity, the dynamic programming solution is generally better than the initial approach: it runs in O(n^2), where n is the length of the input string, whereas the non-fine-tuned model’s solution checks each candidate substring with a reversal comparison inside nested loops and can approach O(n^3) in the worst case.

This looks promising! Remember, we only fine-tuned the Code Llama Python variant with 10% of the Dolphin Coder dataset. There is a lot more to explore!

Despite the thorough explanation in the response, we still need to examine the correctness of the Python code provided in the solution. Next, we use the HumanEval evaluation framework to run integration tests on the generated responses from Code Llama to systematically examine their quality.

Quantitative evaluation with HumanEval

HumanEval is an evaluation harness for assessing an LLM’s problem-solving capabilities on Python-based coding problems, as described in the paper Evaluating Large Language Models Trained on Code. Specifically, it consists of 164 original Python-based programming problems that assess a language model’s ability to generate code based on provided information like the function signature, docstring, body, and unit tests.

We send each Python-based programming question to a Code Llama model deployed on a SageMaker endpoint to get k responses. Next, we run each of the k responses against the integration tests in the HumanEval repository. If any of the k responses passes the integration tests, we count that test case as a success; otherwise, it counts as a failure. We then calculate the ratio of successful cases as the final evaluation score, named pass@k. Following standard practice, we set k to 1 in our evaluation, so we generate only one response per question and test whether it passes the integration tests.

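With k set to 1, the pass@k computation described above reduces to a simple ratio; the following sketch (our own, not the official HumanEval estimator for larger k) makes that explicit.

def pass_at_1(per_task_passed):
    """Fraction of problems whose single generated sample passes its integration tests.

    per_task_passed: list of booleans, one per HumanEval problem.
    """
    return sum(per_task_passed) / len(per_task_passed)

print(pass_at_1([True, True, False, True]))  # 3 of 4 problems solved -> 0.75
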
The following is a sample code to use HumanEval repository. You can access the dataset and generate a single response using a SageMaker endpoint. For details, see the notebook in the GitHub repository.

%pip install human_eval
import json
from human_eval.evaluation import evaluate_functional_correctness
from human_eval.data import write_jsonl, read_problems
from tqdm import tqdm
problems = read_problems()
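# generate_one_completion(prompt) is defined in the accompanying notebook; it invokes
# the SageMaker endpoint hosting the Code Llama model to produce a single completion.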

num_samples_per_task = 1 # value k: number of responses for each question
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

evaluate_functional_correctness('./samples.jsonl')

The following table shows the improvements of the fine-tuned Code Llama Python models over the non-fine-tuned models across different model sizes. To ensure correctness, we also deploy the non-fine-tuned Code Llama models on SageMaker endpoints and run them through the HumanEval evaluations. The pass@1 numbers (the first row in the following table) match the numbers reported in the Code Llama research paper. The inference parameters are consistently set as "parameters": {"max_new_tokens": 384, "temperature": 0.2}.

As we can see from the results, all the fine-tuned Code Llama Python variants show a significant improvement over the non-fine-tuned models. In particular, Code Llama Python 70B outperforms the non-fine-tuned model by approximately 12%.

Model | 7B Python | 13B Python | 34B | 34B Python | 70B Python
Pre-trained model performance (pass@1) | 38.4 | 43.3 | 48.8 | 53.7 | 57.3
Fine-tuned model performance (pass@1) | 45.12 | 45.12 | 59.1 | 61.5 | 69.5

Now you can try fine-tuning Code Llama models on your own dataset.

Clean up

If you decide that you no longer want to keep the SageMaker endpoint running, you can delete it using AWS SDK for Python (Boto3), AWS Command Line Interface (AWS CLI), or SageMaker console. For more information, see Delete Endpoints and Resources. Additionally, you can shut down the SageMaker Studio resources that are no longer required.

Conclusion

In this post, we discussed fine-tuning Meta’s Code Llama 2 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning techniques, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out. As the results of fine-tuning the Code Llama Python models on the Dolphin Coder dataset show, fine-tuning improves code generation performance compared to the non-fine-tuned models. As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.


About the Authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Vishaal Yalamanchali is a Startup Solutions Architect working with early-stage generative AI, robotics, and autonomous vehicle companies. Vishaal works with his customers to deliver cutting-edge ML solutions and is personally interested in reinforcement learning, LLM evaluation, and code generation. Prior to AWS, Vishaal was an undergraduate at UCI, focused on bioinformatics and intelligent systems.

Meenakshisundaram Thandavarayan works for AWS as an AI/ ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive businesses to new ways of working through innovation, incubation, and democratization.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Read More

Exploring how context, culture, and character matter in avatar research

This research paper was presented at the IEEE VR Workshop Series on Animation in Virtual and Augmented Environments (opens in new tab) (ANIVAE 2024), the premier series on 3D content creation for simulated training in extended reality.

Face-to-face communication is changing, moving beyond physical interaction to include video conferencing and AR/VR platforms, where the participants are represented by avatars. Sophisticated avatars, animated through motion tracking, can realistically portray their human counterparts, but they can also suffer from noise, such as jitter and distortion, reducing their realism. Advances in motion-capture technology aim to reduce such issues, but they come with higher development costs and require additional time due to the need for advanced components. While some noise is inevitable, it’s important to determine acceptable types and levels to efficiently develop and introduce AR/VR devices and avatars to the market. Additionally, understanding how noise impacts avatar-based communication is essential for creating more inclusive avatars that accurately represent diverse cultures and abilities, enhancing the user experience.

In our paper, “Ecological Validity and the Evaluation of Avatar Facial Animation Noise,” presented at ANIVAE 2024, we explore the challenge of evaluating avatar noise without a standardized approach. Traditional methods, which present participants with isolated facial animation noise to gauge perception thresholds, fall short of reflecting real-life avatar interactions. Our approach emphasizes ecological validity—the extent to which experiments mimic real-world conditions—as central in assessing avatar noise. We discovered this significantly influences participants’ response to avatars, highlighting the impact of context on noise perception. Our goal is to improve avatar acceptance, inclusivity, and communication by developing noise evaluation methods that better represent actual experiences. 

Seeing the big picture  

To set up our study, we animated two avatars using motion capture, as depicted in Figure 1 (A). We recorded the performance of two professional actors enacting a scene between an architect and a client discussing home renovations and examining a 3D model of the proposed design. We used two proprietary characters for the avatars, whose faces were animated with 91 expression blendshapes. This allowed for a broad range of facial expressions and subtle variations in emotions, contributing to a more realistic animation. To examine different dynamics, we created six variations of the scene, changing the characters’ gender, role, and whether they agreed on the renovation plan.

Figure 1: A. Motion capture of a social interaction scenario for the experiment. B. The motion capture was remapped to stylized avatars. C. Participants experienced the scene wearing a HoloLens 2 and responded to questions on a tablet app. D. The avatars’ facial features were degraded with different types of animation noises of varying severity.

Fifty-six participants engaged in two experiments to evaluate the impact of noise on avatar facial animation. The first experiment had low ecological validity. Participants viewed fragmented clips of dialogue through a Microsoft HoloLens 2 device and used a slider to adjust any noise to an acceptable level. The second experiment featured high ecological validity, showing the scene in its full social context. Here, participants used a HoloLens 2 to judge the noise in facial expressions as either “appropriate” or “inappropriate” for the conversation. In contrast to the first experiment, this method considered the social aspects of context, culture, and character. 

Results indicate that noise was less distracting when participants viewed the scene in its entirety, revealing a greater tolerance for noise in high ecological validity scenarios. Isolated clips, on the other hand, led to greater annoyance with facial animation noise, suggesting the importance of social context over hyper-realistic animation. 

Cultural observations showed that noise perception was influenced by implicit cultural norms, particularly around gender roles and agreement levels. For example, in the second experiment, where participants viewed the conversation within its greater social context (high ecological validity), noise was deemed “appropriate” when the female architect agreed with the male client and “inappropriate” when she disagreed, revealing potential gender biases not observed in reversed gender roles. These findings emphasize the importance of applying high ecological validity in experiments to uncover socio-cultural influences on avatar perception. They also underscore the need to carefully consider context and cultural dynamics in avatar design. 

Finally, we explored the character trait of empathy. Participants with lower empathy scores were more critical of noise in context-rich scenarios. This indicates that experiments focusing solely on low ecological validity might overlook important insights on how empathy influences responses to avatar facial animation noise.

Avatars need to be studied in realistic situations 

When people communicate, they engage in a complex process influenced by environment, cultural background, and the nonverbal cues they perceive and interpret. By prioritizing high ecological validity in studies on avatar perception, researchers can uncover these socio-cultural influences and trust that their findings are relevant and applicable to real-life interactions within digital spaces. 

Our research examines how different combinations of demographic characteristics change the way people react to avatars, and we hope to encourage more inclusivity in avatar design. It’s essential to have an established set of guidelines to achieve this goal, and this work is one step in that direction. While our study’s scope is limited, its methodology can be applied broadly across different devices and settings.

Acknowledgements

We would like to thank Ken Jakubzak, James Clemoes, Cornelia Treptow, Michaela Porubanova, Kerry Read, Daniel McDuff, Marina Kuznetsova and Mathew Lamb for their research collaboration. We would also like to thank Shawn Bruner for providing the characters for the study and Panagiotis Giannakopoulos for leading the animation and motion capture pipelines.

The post Exploring how context, culture, and character matter in avatar research appeared first on Microsoft Research.

Read More

Transform one-on-one customer interactions: Build speech-capable order processing agents with AWS and generative AI

Transform one-on-one customer interactions: Build speech-capable order processing agents with AWS and generative AI

In today’s landscape of one-on-one customer interactions for placing orders, the prevailing practice continues to rely on human attendants, even in settings like drive-thru coffee shops and fast-food establishments. This traditional approach poses several challenges: it heavily depends on manual processes, struggles to efficiently scale with increasing customer demands, introduces the potential for human errors, and operates within specific hours of availability. Additionally, in competitive markets, businesses adhering solely to manual processes might find it challenging to deliver efficient and competitive service. Despite technological advancements, the human-centric model remains deeply ingrained in order processing, leading to these limitations.

The prospect of utilizing technology for one-on-one order processing assistance has been available for some time. However, existing solutions can often fall into two categories: rule-based systems that demand substantial time and effort for setup and upkeep, or rigid systems that lack the flexibility required for human-like interactions with customers. As a result, businesses and organizations face challenges in swiftly and efficiently implementing such solutions. Fortunately, with the advent of generative AI and large language models (LLMs), it’s now possible to create automated systems that can handle natural language efficiently, and with an accelerated on-ramping timeline.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. In addition to Amazon Bedrock, you can use other AWS services like Amazon SageMaker JumpStart and Amazon Lex to create fully automated and easily adaptable generative AI order processing agents.

In this post, we show you how to build a speech-capable order processing agent using Amazon Lex, Amazon Bedrock, and AWS Lambda.

Solution overview

The following diagram illustrates our solution architecture.

The workflow consists of the following steps:

  1. A customer places the order using Amazon Lex.
  2. The Amazon Lex bot interprets the customer’s intents and triggers a DialogCodeHook.
  3. A Lambda function pulls the appropriate prompt template from the Lambda layer and formats model prompts by adding the customer input in the associated prompt template.
  4. The RequestValidation prompt checks the order against the menu. If the customer asks for something that isn’t on the menu, it lets them know via Amazon Lex and offers recommendations. The prompt also performs a preliminary validation for order completeness.
  5. The ObjectCreator prompt converts the natural language requests into a data structure (JSON format).
  6. The customer validator Lambda function verifies the required attributes for the order and confirms if all necessary information is present to process the order.
  7. A customer Lambda function takes the data structure as an input for processing the order and passes the order total back to the orchestrating Lambda function.
  8. The orchestrating Lambda function calls the Amazon Bedrock LLM endpoint to generate a final order summary including the order total from the customer database system (for example, Amazon DynamoDB).
  9. The order summary is communicated back to the customer via Amazon Lex. After the customer confirms the order, the order will be processed.
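
To make steps 5–8 concrete, the data structure produced by the ObjectCreator prompt (and consumed by the helper Lambda functions) is a small JSON object keyed by item index, in the format used by the prompt templates later in this post. For example:

# Example order payload; each entry has an item, a size, and a list of add-ons
order = {
    "1": {"item": "latte", "size": "medium", "addOns": []},
    "2": {"item": "americano", "size": "small", "addOns": ["sugar"]}
}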

Prerequisites

This post assumes that you have an active AWS account and familiarity with the following concepts and services:

Also, in order to access Amazon Bedrock from the Lambda functions, you need to make sure the Lambda runtime has the following libraries:

  • boto3>=1.28.57
  • awscli>=1.29.57
  • botocore>=1.31.57

This can be done with a Lambda layer or by using a specific AMI with the required libraries.

Furthermore, these libraries are required when calling the Amazon Bedrock API from Amazon SageMaker Studio. This can be done by running a cell with the following code:

%pip install --no-build-isolation --force-reinstall \
"boto3>=1.28.57" \
"awscli>=1.29.57" \
"botocore>=1.31.57"

Finally, you create the following policy and later attach it to any role accessing Amazon Bedrock:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "bedrock:*",
            "Resource": "*"
        }
    ]
}
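
If you prefer to do this programmatically, one option (a sketch with placeholder names; the console works just as well) is to create the policy above as a managed policy and attach it to the target role with Boto3:

import json
import boto3

iam = boto3.client('iam')

bedrock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Statement1", "Effect": "Allow", "Action": "bedrock:*", "Resource": "*"}
    ]
}

# Create the managed policy and attach it to a role (placeholder role name)
policy = iam.create_policy(PolicyName='BedrockAccessForOrderAgent', PolicyDocument=json.dumps(bedrock_policy))
iam.attach_role_policy(RoleName='your-lambda-execution-role', PolicyArn=policy['Policy']['Arn'])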

Create a DynamoDB table

In our specific scenario, we’ve created a DynamoDB table as our customer database system, but you could also use Amazon Relational Database Service (Amazon RDS). Complete the following steps to provision your DynamoDB table (or customize the settings as needed for your use case):

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.

  1. For Table name, enter a name (for example, ItemDetails).
  2. For Partition key, enter a key (for this post, we use Item).
  3. For Sort key, enter a key (for this post, we use Size).
  4. Choose Create table.
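
If you prefer to script the table creation instead of using the console, an equivalent definition with Boto3 (using on-demand capacity, which also keeps costs low as noted later in this post) looks like this:

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='ItemDetails',
    AttributeDefinitions=[
        {'AttributeName': 'Item', 'AttributeType': 'S'},
        {'AttributeName': 'Size', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'Item', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'Size', 'KeyType': 'RANGE'},  # sort key
    ],
    BillingMode='PAY_PER_REQUEST',
)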

Now you can load the data into the DynamoDB table. For this post, we use a CSV file. You can load the data to the DynamoDB table using Python code in a SageMaker notebook.

First, we need to set up a profile named dev.

  1. Open a new terminal in SageMaker Studio and run the following command:
aws configure --profile dev

This command will prompt you to enter your AWS access key ID, secret access key, default AWS Region, and output format.

  1. Return to the SageMaker notebook and write Python code to set up a connection to DynamoDB using the Boto3 library. This code snippet creates a session using a specific AWS profile named dev and then creates a DynamoDB resource using that session. The following is the code sample to load the data:
%pip install boto3
import boto3
import csv

# Create a session using a profile named 'dev'
session = boto3.Session(profile_name='dev')

# Create a DynamoDB resource using the session
dynamodb = session.resource('dynamodb')

# Specify your DynamoDB table name
table_name = 'your_table_name'
table = dynamodb.Table(table_name)

# Specify the path to your CSV file
csv_file_path = 'path/to/your/file.csv'

# Read CSV file and put items into DynamoDB
with open(csv_file_path, 'r', encoding='utf-8-sig') as csvfile:
    csvreader = csv.reader(csvfile)
    
    # Skip the header row
    next(csvreader, None)

    for row in csvreader:
        # Extract values from the CSV row
        item = {
            'Item': row[0],  # Adjust the index based on your CSV structure
            'Size': row[1],
            'Price': row[2]
        }
        
        # Put item into DynamoDB
        response = table.put_item(Item=item)
        
        print(f"Item added: {response}")
print(f"CSV data has been loaded into the DynamoDB table: {table_name}")

Alternatively, you can use NoSQL Workbench or other tools to quickly load the data to your DynamoDB table.

The following is a screenshot after the sample data is inserted into the table.

Create templates in a SageMaker notebook using the Amazon Bedrock invocation API

To create our prompt template for this use case, we use Amazon Bedrock. You can access Amazon Bedrock from the AWS Management Console and via API invocations. In our case, we access Amazon Bedrock via API from the convenience of a SageMaker Studio notebook to create not only our prompt template, but our complete API invocation code that we can later use on our Lambda function.

  1. On the SageMaker console, access an existing SageMaker Studio domain or create a new one to access Amazon Bedrock from a SageMaker notebook.

  1. After you create the SageMaker domain and user, choose the user, then choose Launch and select Studio. This will open a JupyterLab environment.
  2. When the JupyterLab environment is ready, open a new notebook and begin importing the necessary libraries.

There are many FMs available via the Amazon Bedrock Python SDK. In this case, we use Claude V2, a powerful foundation model developed by Anthropic.

The order processing agent needs a few different templates. This can change depending on the use case, but we have designed a general workflow that can apply to multiple settings. For this use case, the Amazon Bedrock LLM template will accomplish the following:

  • Validate the customer intent
  • Validate the request
  • Create the order data structure
  • Pass a summary of the order to the customer
  1. To invoke the model, create a bedrock-runtime object from Boto3.

import boto3
import json

#Model API request parameters
modelId = 'anthropic.claude-v2' # change this to use a different version from the model provider
accept = 'application/json'
contentType = 'application/json'

bedrock = boto3.client(service_name='bedrock-runtime')

Let’s start by working on the intent validator prompt template. This is an iterative process, but thanks to Anthropic’s prompt engineering guide, you can quickly create a prompt that can accomplish the task.

  1. Create the first prompt template along with a utility function that will help prepare the body for the API invocations.

The following is the code for prompt_template_intent_validator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to identify the intent that the human wants to accomplish and respond appropriately. The valid intents are: Greeting,Place Order, Complain, Speak to Someone. Always put your response to the Human within the Response tags. Also add an XML tag to your output identifying the human intent.\nHere are some examples:\n<example><Conversation> H: hi there.\n\nA: Hi, how can I help you today?\n\nH: Yes. I would like a medium mocha please</Conversation>\n\nA:<intent>Place Order</intent><Response>\nGot it.</Response></example>\n<example><Conversation> H: hello\n\nA: Hi, how can I help you today?\n\nH: my coffee does not taste well can you please re-make it?</Conversation>\n\nA:<intent>Complain</intent><Response>\nOh, I am sorry to hear that. Let me get someone to help you.</Response></example>\n<example><Conversation> H: hi\n\nA: Hi, how can I help you today?\n\nH: I would like to speak to someone else please</Conversation>\n\nA:<intent>Speak to Someone</intent><Response>\nSure, let me get someone to help you.</Response></example>\n<example><Conversation> H: howdy\n\nA: Hi, how can I help you today?\n\nH:can I get a large americano with sugar and 2 mochas with no whipped cream</Conversation>\n\nA:<intent>Place Order</intent><Response>\nSure thing! Please give me a moment.</Response></example>\n<example><Conversation> H: hi\n\n</Conversation>\n\nA:<intent>Greeting</intent><Response>\nHi there, how can I help you today?</Response></example>\n</instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 1, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"


  1. Save this template into a file in order to upload to Amazon S3 and call from the Lambda function when needed. Save the templates as JSON serialized strings in a text file. The previous screenshot shows the code sample to accomplish this as well.
  2. Repeat the same steps with the other templates.
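
Saving a template as a serialized string and uploading it to Amazon S3 (step 1 above) takes only a few lines; the bucket name below is a placeholder.

import boto3

intent_validator_template = '...'  # the JSON-serialized prompt string shown above

with open('prompt_template_intent_validator.txt', 'w') as f:
    f.write(intent_validator_template)

s3 = boto3.client('s3')
s3.upload_file('prompt_template_intent_validator.txt', 'your-bucket-name', 'prompt_template_intent_validator.txt')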

The following are some screenshots of the other templates and the results when calling Amazon Bedrock with some of them.

The following is the code for prompt_template_request_validator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the context do the following steps: 1. verify that the items in the input are valid. If customer provided an invalid item, recommend replacing it with a valid one. 2. verify that the customer has provided all the information marked as required. If the customer missed a required information, ask the customer for that information. 3. When the order is complete, provide a summary of the order and ask for confirmation always using this phrase: 'is this correct?' 4. If the customer confirms the order, Do not ask for confirmation again, just say the phrase inside the brackets [Great, Give me a moment while I try to process your order]</instructions>\n<context>\nThe VALID MENU ITEMS are: [latte, frappe, mocha, espresso, cappuccino, romano, americano].\nThe VALID OPTIONS are: [splenda, stevia, raw sugar, honey, whipped cream, sugar, oat milk, soy milk, regular milk, skimmed milk, whole milk, 2 percent milk, almond milk].\nThe required information is: size. Size can be: small, medium, large.\nHere are some examples: <example>H: I would like a medium latte with 1 Splenda and a small romano with no sugar please.\n\nA: <Validation>:\nThe Human is ordering a medium latte with one splenda. Latte is a valid menu item and splenda is a valid option. The Human is also ordering a small romano with no sugar. Romano is a valid menu item.</Validation>\n<Response>\nOk, I got: \n\t-Medium Latte with 1 Splenda and.\n\t-Small Romano with no Sugar.\nIs this correct?</Response>\n\nH: yep.\n\nA:\n<Response>\nGreat, Give me a moment while I try to process your order</example>\n\n<example>H: I would like a cappuccino and a mocha please.\n\nA: <Validation>:\nThe Human is ordering a cappuccino and a mocha. Both are valid menu items. The Human did not provide the size for the cappuccino. The human did not provide the size for the mocha. I will ask the Human for the required missing information.</Validation>\n<Response>\nSure thing, but can you please let me know the size for the Cappuccino and the size for the Mocha? We have Small, Medium, or Large.</Response></example>\n\n<example>H: I would like a small cappuccino and a large lemonade please.\n\nA: <Validation>:\nThe Human is ordering a small cappuccino and a large lemonade. Cappuccino is a valid menu item. Lemonade is not a valid menu item. I will suggest the Human a replacement from our valid menu items.</Validation>\n<Response>\nSorry, we don't have Lemonades, would you like to order something else instead? Perhaps a Frappe or a Latte?</Response></example>\n\n<example>H: Can I get a medium frappuccino with sugar please?\n\nA: <Validation>:\n The Human is ordering a Frappuccino. Frappuccino is not a valid menu item. I will suggest a replacement from the valid menu items in my context.</Validation>\n<Response>\nI am so sorry, but Frappuccino is not in our menu, do you want a frappe or a cappuccino instead? perhaps something else?</Response></example>\n\n<example>H: I want two large americanos and a small latte please.\n\nA: <Validation>:\n The Human is ordering 2 Large Americanos, and a Small Latte. Americano is a valid menu item. 
Latte is a valid menu item.</Validation>\n<Response>\nOk, I got: \n\t-2 Large Americanos and.\n\t-Small Latte.\nIs this correct?</Response>\n\nH: looks correct, yes.\n\nA:\n<Response>\nGreat, Give me a moment while I try to process your order.</Response></example>\n\n</Context>\n\nPlease complete this request according to the instructions and examples provided above:<request>REPLACEME</request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"

The following is our response from Amazon Bedrock using this template.

The following is the code for prompt_template_object_creator.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to create a json object in Response with the appropriate attributes.\nHere are some examples:\n<example><Conversation> H: I want a latte.\n\nA:\nCan I have the size?\n\nH: Medium.\n\nA: So, a medium latte.\nIs this Correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"latte\",\"size\":\"medium\",\"addOns\":[]}}</Response></example>\n<example><Conversation> H: I want a large frappe and 2 small americanos with sugar.\n\nA: Okay, let me confirm:\n\n1 large frappe\n\n2 small americanos with sugar\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"frappe\",\"size\":\"large\",\"addOns\":[]},\"2\":{\"item\":\"americano\",\"size\":\"small\",\"addOns\":[\"sugar\"]},\"3\":{\"item\":\"americano\",\"size\":\"small\",\"addOns\":[\"sugar\"]}}</Response>\n</example>\n<example><Conversation> H: I want a medium americano.\n\nA: Okay, let me confirm:\n\n1 medium americano\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"americano\",\"size\":\"medium\",\"addOns\":[]}}</Response></example>\n<example><Conversation> H: I want a large latte with oatmilk.\n\nA: Okay, let me confirm:\n\nLarge latte with oatmilk\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"latte\",\"size\":\"large\",\"addOns\":[\"oatmilk\"]}}</Response></example>\n<example><Conversation> H: I want a small mocha with no whipped cream please.\n\nA: Okay, let me confirm:\n\nSmall mocha with no whipped cream\n\nIs this correct?\n\nH: Yes.</Conversation>\n\nA:<Response>{\"1\":{\"item\":\"mocha\",\"size\":\"small\",\"addOns\":[\"no whipped cream\"]}}</Response>\n\n</example></instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:"]}"


The following is the code for prompt_template_order_summary.txt:

"{"prompt": "Human: I will give you some instructions to complete my request.\n<instructions>Given the Conversation between Human and Assistant, you need to create a summary of the order with bullet points and include the order total.\nHere are some examples:\n<example><Conversation> H: I want a large frappe and 2 small americanos with sugar.\n\nA: Okay, let me confirm:\n\n1 large frappe\n\n2 small americanos with sugar\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>10.50</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\n1 large frappe\n\n2 small americanos with sugar.\nYour Order total is $10.50</Response></example>\n<example><Conversation> H: I want a medium americano.\n\nA: Okay, let me confirm:\n\n1 medium americano\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>3.50</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\n1 medium americano.\nYour Order total is $3.50</Response></example>\n<example><Conversation> H: I want a large latte with oat milk.\n\nA: Okay, let me confirm:\n\nLarge latte with oat milk\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>6.75</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\nLarge latte with oat milk.\nYour Order total is $6.75</Response></example>\n<example><Conversation> H: I want a small mocha with no whipped cream please.\n\nA: Okay, let me confirm:\n\nSmall mocha with no whipped cream\n\nIs this correct?\n\nH: Yes.</Conversation>\n\n<OrderTotal>4.25</OrderTotal>\n\nA:<Response>\nHere is a summary of your order along with the total:\n\nSmall mocha with no whipped cream.\nYour Order total is $6.75</Response>\n\n</example>\n</instructions>\n\nPlease complete this request according to the instructions and examples provided above:<request><Conversation>REPLACEME</Conversation>\n\n<OrderTotal>REPLACETOTAL</OrderTotal></request>\n\nAssistant:\n", "max_tokens_to_sample": 250, "temperature": 0.3, "top_k": 250, "top_p": 0.75, "stop_sequences": ["\n\nHuman:", "\n\nhuman:", "\n\nCustomer:", "\n\ncustomer:", "[Conversation]"]}"


As you can see, we have used our prompt templates to validate menu items, identify missing required information, create a data structure, and summarize the order. The foundation models available on Amazon Bedrock are very powerful, so you could accomplish even more tasks via these templates.

You have completed engineering the prompts and saved the templates to text files. You can now begin creating the Amazon Lex bot and the associated Lambda functions.

Create a Lambda layer with the prompt templates

Complete the following steps to create your Lambda layer:

  1. In SageMaker Studio, create a new folder with a subfolder named python.
  2. Copy your prompt files to the python folder.

  1. You can add the ZIP library to your notebook instance by running the following command.
!conda install -y -c conda-forge zip

  1. Now, run the following command to create the ZIP file for uploading to the Lambda layer.
!zip -r prompt_templates_layer.zip prompt_templates_layer/.

  1. After you create the ZIP file, you can download the file. Go to Lambda, create a new layer by uploading the file directly or by uploading to Amazon S3 first.
  2. Then attach this new layer to the orchestration Lambda function.

Now your prompt template files are locally stored in your Lambda runtime environment. This will speed up the process during your bot runs.
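
For reference, the layer archive should contain a layout along these lines. Note that the top-level folder name becomes part of the /opt path that the orchestration Lambda function reads from later in this post (/opt/order_processing_agent_prompt_templates/python/), so keep the two consistent.

order_processing_agent_prompt_templates/
└── python/
    ├── prompt_template_intent_validator.txt
    ├── prompt_template_request_validator.txt
    ├── prompt_template_object_creator.txt
    └── prompt_template_order_summary.txt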

Create a Lambda layer with the required libraries

Complete the following steps to create your Lambda layer with the required libraries:

  1. Open an AWS Cloud9 instance environment, create a folder with a subfolder called python.
  2. Open a terminal inside the python folder.
  3. Run the following commands from the terminal:
pip install "boto3>=1.28.57" -t .
pip install "awscli>=1.29.57" -t .
pip install "botocore>=1.31.57" -t .
  1. Run cd .. and position yourself inside your new folder where you also have the python subfolder.
  2. Run the following command:
zip -r lambda-layer.zip python/
  1. After you create the ZIP file, you can download the file. Go to Lambda, create a new layer by uploading the file directly or by uploading to Amazon S3 first.
  2. Then attach this new layer to the orchestration Lambda function.

Create the bot in Amazon Lex v2

For this use case, we build an Amazon Lex bot that can provide an input/output interface for the architecture in order to call Amazon Bedrock using voice or text from any interface. Because the LLM will handle the conversation piece of this order processing agent, and Lambda will orchestrate the workflow, you can create a bot with three intents and no slots.

  1. On the Amazon Lex console, create a new bot with the method Create a blank bot.

Now you can add an intent with any appropriate initial utterance for the end-users to start the conversation with the bot. We use simple greetings and add an initial bot response so end-users can provide their requests. When creating the bot, make sure to use a Lambda code hook with the intents; this will trigger a Lambda function that will orchestrate the workflow between the customer, Amazon Lex, and the LLM.

  1. Add your first intent, which triggers the workflow and uses the intent validation prompt template to call Amazon Bedrock and identify what the customer is trying to accomplish. Add a few simple utterances for end-users to start a conversation.

You don’t need to use any slots or initial responses in any of the bot intents. In fact, you don’t need to add utterances to the second or third intents. That is because the LLM will guide Lambda throughout the process.

  1. Add a confirmation prompt. You can customize this message in the Lambda function later.

  1. Under Code hooks, select Use a Lambda function for initialization and validation.

  1. Create a second intent with no utterance and no initial response. This is the PlaceOrder intent.

When the LLM identifies that the customer is trying to place an order, the Lambda function will trigger this intent and validate the customer request against the menu, and make sure that no required information is missing. Remember that all of this is on the prompt templates, so you can adapt this workflow for any use case by changing the prompt templates.

  1. Don’t add any slots, but add a confirmation prompt and decline response.

  1. Select Use a Lambda function for initialization and validation.

  1. Create a third intent named ProcessOrder with no sample utterances and no slots.
  2. Add an initial response, a confirmation prompt, and a decline response.

After the LLM has validated the customer request, the Lambda function triggers the third and last intent to process the order. Here, Lambda will use the object creator template to generate the order JSON data structure to query the DynamoDB table, and then use the order summary template to summarize the whole order along with the total so Amazon Lex can pass it to the customer.

  1. Select Use a Lambda function for initialization and validation. This can use any Lambda function to process the order after the customer has given the final confirmation.

  1. After you create all three intents, go to the Visual builder for the ValidateIntent, add a go-to intent step, and connect the output of the positive confirmation to that step.
  2. After you add the go-to intent, edit it and choose the PlaceOrder intent as the intent name.

  1. Similarly, go to the Visual builder for the PlaceOrder intent and connect the output of the positive confirmation to the ProcessOrder go-to intent. No editing is required for the ProcessOrder intent.
  2. You now need to create the Lambda function that orchestrates Amazon Lex and calls the DynamoDB table, as detailed in the following section.

Create a Lambda function to orchestrate the Amazon Lex bot

You can now build the Lambda function that orchestrates the Amazon Lex bot and workflow. Complete the following steps:

  1. Create a Lambda function with the standard execution policy and let Lambda create a role for you.
  2. In the code window of your function, add a few utility functions that will help: format the prompts by adding the lex context to the template, call the Amazon Bedrock LLM API, extract the desired text from the responses, and more. See the following code:
import json
import re
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

bedrock = boto3.client(service_name='bedrock-runtime')
def CreatingCustomPromptFromLambdaLayer(object_key,replace_items):
   
    folder_path = '/opt/order_processing_agent_prompt_templates/python/'
    try:
        file_path = folder_path + object_key
        with open(file_path, "r") as file1:
            raw_template = file1.read()
            # Modify the template with the custom input prompt
            #template['inputs'][0].insert(1, {"role": "user", "content": '### Input:n' + user_request})
            for key,value in replace_items.items():
                value = json.dumps(json.dumps(value).replace('"','')).replace('"','')
                raw_template = raw_template.replace(key,value)
            modified_prompt = raw_template

            return modified_prompt
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'An error occurred: {str(e)}'
        }
def CreatingCustomPrompt(object_key,replace_items):
    logger.debug('replace_items is: {}'.format(replace_items))
    #retrieve user request from intent_request
    #we first prompt the model with current order
    
    bucket_name = 'your-bucket-name'
    
    #object_key = 'prompt_template_order_processing.txt'
    try:
        s3 = boto3.client('s3')
        # Retrieve the existing template from S3
        response = s3.get_object(Bucket=bucket_name, Key=object_key)
        raw_template = response['Body'].read().decode('utf-8')
        raw_template = json.loads(raw_template)
        logger.debug('raw template is {}'.format(raw_template))
        #template_json = json.loads(raw_template)
        #logger.debug('template_json is {}'.format(template_json))
        #template = json.dumps(template_json)
        #logger.debug('template is {}'.format(template))

        # Modify the template with the custom input prompt
        #template['inputs'][0].insert(1, {"role": "user", "content": '### Input:n' + user_request})
        for key,value in replace_items.items():
            raw_template = raw_template.replace(key,value)
            logger.debug("Replacing: {} nwith: {}".format(key,value))
        modified_prompt = json.dumps(raw_template)
        logger.debug("Modified template: {}".format(modified_prompt))
        logger.debug("Modified template type is: {}".format(print(type(modified_prompt))))
        
        #modified_template_json = json.loads(modified_prompt)
        #logger.debug("Modified template json: {}".format(modified_template_json))
        
        return modified_prompt
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'An error occurred: {str(e)}'
        }
    
def validate_intent(intent_request):
    logger.debug('starting validate_intent: {}'.format(intent_request))
    #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
    #Preparing validation prompt by adding context to prompt template
    object_key = 'prompt_template_intent_validator.txt'
    #replace_items = {"REPLACEME":full_context}
    #replace_items = {"REPLACEME":dialog_context}
    replace_items = {"REPLACEME":dialog_context}
    #validation_prompt = CreatingCustomPrompt(object_key,replace_items)
    validation_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)

    #Prompting model for request validation
    intent_validation_completion = prompt_bedrock(validation_prompt)
    intent_validation_completion = re.sub(r'["]','',intent_validation_completion)

    #extracting response from response completion and removing some special characters
    validation_response = extract_response(intent_validation_completion)
    validation_intent = extract_intent(intent_validation_completion)
    
    

    #business logic depending on intents
    if validation_intent == 'Place Order':
        return validate_request(intent_request)
    elif validation_intent in ['Complain','Speak to Someone']:
        ##adding session attributes to keep current context
        full_context = full_context + '\n\n' + intent_validation_completion
        dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
        intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
        intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
        intent_request['sessionState']['sessionAttributes']['customerIntent'] = validation_intent
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close',validation_response)
    if validation_intent == 'Greeting':
        ##adding session attributes to keep current context
        full_context = full_context + '\n\n' + intent_validation_completion
        dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
        intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
        intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
        intent_request['sessionState']['sessionAttributes']['customerIntent'] = validation_intent
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'InProgress','ConfirmIntent',validation_response)

def validate_request(intent_request):
    logger.debug('starting validate_request: {}'.format(intent_request))
    #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
   
    #Preparing validation prompt by adding context to prompt template
    object_key = 'prompt_template_request_validator.txt'
    replace_items = {"REPLACEME":dialog_context}
    #validation_prompt = CreatingCustomPrompt(object_key,replace_items)
    validation_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)

    #Prompting model for request validation
    request_validation_completion = prompt_bedrock(validation_prompt)
    request_validation_completion = re.sub(r'["]','',request_validation_completion)

    #extracting response from response completion and removing some special characters
    validation_response = extract_response(request_validation_completion)

    ##adding session attributes to keep current context
    full_context = full_context + '\n\n' + request_validation_completion
    dialog_context = dialog_context + '\n\nAssistant: ' + validation_response
    intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
    intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
    
    return close(intent_request['sessionState']['sessionAttributes'],'PlaceOrder','InProgress','ConfirmIntent',validation_response)
    
def process_order(intent_request):
    logger.debug('starting process_order: {}'.format(intent_request))

     #retrieve user request from intent_request
    user_request = 'Human: ' + intent_request['inputTranscript'].lower()
    #getting current context variable
    current_session_attributes =  intent_request['sessionState']['sessionAttributes']
    if len(current_session_attributes) > 0:
        full_context = current_session_attributes['fullContext'] + '\n\n' + user_request
        dialog_context = current_session_attributes['dialogContext'] + '\n\n' + user_request
    else:
        full_context = user_request
        dialog_context = user_request
    #   Preparing object creator prompt by adding context to prompt template
    object_key = 'prompt_template_object_creator.txt'
    replace_items = {"REPLACEME":dialog_context}
    #object_creator_prompt = CreatingCustomPrompt(object_key,replace_items)
    object_creator_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)
    #Prompting model for object creation
    object_creation_completion = prompt_bedrock(object_creator_prompt)
    #extracting response from response completion
    object_creation_response = extract_response(object_creation_completion)
    inputParams = json.loads(object_creation_response)
    inputParams = json.dumps(json.dumps(inputParams))
    logger.debug('inputParams is: {}'.format(inputParams))
    client = boto3.client('lambda')
    response = client.invoke(FunctionName = 'arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-validator',InvocationType = 'RequestResponse',Payload = inputParams)
    responseFromChild = json.load(response['Payload'])
    validationResult = responseFromChild['statusCode']
    if validationResult == 205:
        order_validation_error = responseFromChild['validator_response']
        return close(intent_request['sessionState']['sessionAttributes'],'PlaceOrder','InProgress','ConfirmIntent',order_validation_error)
    #invokes Order Processing lambda to query DynamoDB table and returns order total
    response = client.invoke(FunctionName = 'arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-processing',InvocationType = 'RequestResponse',Payload = inputParams)
    responseFromChild = json.load(response['Payload'])
    orderTotal = responseFromChild['body']
    ###Prompting the model to summarize the order along with order total
    object_key = 'prompt_template_order_summary.txt'
    replace_items = {"REPLACEME":dialog_context,"REPLACETOTAL":orderTotal}
    #order_summary_prompt = CreatingCustomPrompt(object_key,replace_items)
    order_summary_prompt = CreatingCustomPromptFromLambdaLayer(object_key,replace_items)
    order_summary_completion = prompt_bedrock(order_summary_prompt)
    #extracting response from response completion
    order_summary_response = extract_response(order_summary_completion)  
    order_summary_response = order_summary_response + '. Shall I finalize processing your order?'
    ##adding session attributes to keep current context
    full_context = full_context + '\n\n' + order_summary_completion
    dialog_context = dialog_context + '\n\nAssistant: ' + order_summary_response
    intent_request['sessionState']['sessionAttributes']['fullContext'] = full_context
    intent_request['sessionState']['sessionAttributes']['dialogContext'] = dialog_context
    return close(intent_request['sessionState']['sessionAttributes'],'ProcessOrder','InProgress','ConfirmIntent',order_summary_response)
    

""" --- Main handler and Workflow functions --- """

def lambda_handler(event, context):
    """
    Route the incoming request based on intent.
    The JSON body of the request is provided in the event slot.
    """
    logger.debug('event is: {}'.format(event))

    return dispatch(event)

def dispatch(intent_request):
    """
    Called when the user specifies an intent for this bot. If intent is not valid then returns error name
    """
    logger.debug('intent_request is: {}'.format(intent_request))
    intent_name = intent_request['sessionState']['intent']['name']
    confirmation_state = intent_request['sessionState']['intent']['confirmationState']
    # Dispatch to your bot's intent handlers
    if intent_name == 'ValidateIntent' and confirmation_state == 'None':
        return validate_intent(intent_request)
    if intent_name == 'PlaceOrder' and confirmation_state == 'None':
        return validate_request(intent_request)
    elif intent_name == 'PlaceOrder' and confirmation_state == 'Confirmed':
        return process_order(intent_request)
    elif intent_name == 'PlaceOrder' and confirmation_state == 'Denied':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Got it. Let me know if I can help you with something else.')
    elif intent_name == 'PlaceOrder' and confirmation_state not in ['Denied','Confirmed','None']:
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Sorry. I am having trouble completing the request. Let me get someone to help you.')
        logger.debug('exiting intent {} here'.format(intent_request['sessionState']['intent']['name']))
    elif intent_name == 'ProcessOrder' and confirmation_state == 'None':
        return validate_request(intent_request)
    elif intent_name == 'ProcessOrder' and confirmation_state == 'Confirmed':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Perfect! Your order has been processed. Please proceed to payment.')
    elif intent_name == 'ProcessOrder' and confirmation_state == 'Denied':
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Got it. Let me know if I can help you with something else.')
    elif intent_name == 'ProcessOrder' and confirmation_state not in ['Denied','Confirmed','None']:
        return close(intent_request['sessionState']['sessionAttributes'],intent_request['sessionState']['intent']['name'],'Fulfilled','Close','Sorry. I am having trouble completing the request. Let me get someone to help you.')
        logger.debug('exiting intent {} here'.format(intent_request['sessionState']['intent']['name']))
    raise Exception('Intent with name ' + intent_name + ' not supported')
    
def prompt_bedrock(formatted_template):
    logger.debug('prompt bedrock input is: {}'.format(formatted_template))
    body = json.loads(formatted_template)

    modelId = 'anthropic.claude-v2' # change this to use a different version from the model provider
    accept = 'application/json'
    contentType = 'application/json'

    response = bedrock.invoke_model(body=json.dumps(body), modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    response_completion = response_body.get('completion')
    logger.debug('response is: {}'.format(response_completion))

    #print_ww(response_body.get('completion'))
    #print(response_body.get('results')[0].get('outputText'))
    return response_completion

#function to extract text between the <Response> and </Response> tags within model completion
def extract_response(response_completion):
    
    if '<Response>' in response_completion:
        customer_response = response_completion.replace('<Response>','||').replace('</Response>','').split('||')[1]
        
        logger.debug('modified response is: {}'.format(response_completion))

        return customer_response
    else:
        
        logger.debug('modified response is: {}'.format(response_completion))

        return response_completion
        
#function to extract text between the <intent> and </intent> tags within model completion
def extract_intent(response_completion):
    if '<intent>' in response_completion:
        customer_intent = response_completion.replace('<intent>','||').replace('</intent>','||').split('||')[1]
        return customer_intent
    else:
        return response_completion
        
def close(session_attributes, intent, fulfillment_state, action_type, message):
    #This function prepares the response in the appropriate format for Lex V2

    response = {
        "sessionState": {
            "sessionAttributes": session_attributes,
            "dialogAction": {
                "type": action_type
            },
            "intent": {
                "name": intent,
                "state": fulfillment_state
            }
        },
        "messages": [{
            "contentType": "PlainText",
            "content": message
        }]
    }
    return response
  1. Attach the Lambda layer you created earlier to this function.
  2. Additionally, attach the layer containing the prompt templates you created.
  3. In the Lambda execution role, attach the policy to access Amazon Bedrock, which was created earlier.

The Lambda execution role should have the following permissions.
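
The permissions screenshot isn’t reproduced here, but based on what the function above does, the execution role needs the Amazon Bedrock policy created earlier, permission to invoke the two helper functions, and the usual CloudWatch Logs permissions. A minimal inline-policy sketch (role name and ARNs are placeholders to replace with your own) could be attached as follows:

import json
import boto3

iam = boto3.client('iam')

# Placeholder names and ARNs; replace <AccountNumber> and the role name with your own
orchestration_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "bedrock:InvokeModel", "Resource": "*"},
        {"Effect": "Allow", "Action": "lambda:InvokeFunction", "Resource": [
            "arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-validator",
            "arn:aws:lambda:us-east-1:<AccountNumber>:function:aws-blog-order-processing"
        ]}
    ]
}

iam.put_role_policy(
    RoleName='your-orchestration-lambda-role',
    PolicyName='OrderProcessingAgentPermissions',
    PolicyDocument=json.dumps(orchestration_policy)
)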

Attach the Orchestration Lambda function to the Amazon Lex bot

  1. After you create the function in the previous section, return to the Amazon Lex console and navigate to your bot.
  2. Under Languages in the navigation pane, choose English.
  3. For Source, choose your order processing bot.
  4. For Lambda function version or alias, choose $LATEST.
  5. Choose Save.

Create assisting Lambda functions

Complete the following steps to create additional Lambda functions:

  1. Create a Lambda function to query the DynamoDB table that you created earlier:
import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
# Initialize the DynamoDB client
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your-table-name')

def calculate_grand_total(input_data):
    # Initialize the total price
    total_price = 0
    
    try:
        # Loop through each item in the input JSON
        for item_id, item_data in input_data.items():
            item_name = item_data['item'].lower()  # Convert item name to lowercase
            item_size = item_data['size'].lower()  # Convert item size to lowercase
            
            # Query the DynamoDB table for the item based on Item and Size
            response = table.get_item(
                Key={'Item': item_name,
                    'Size': item_size}
            )
            
            # Check if the item was found in the table
            if 'Item' in response:
                item = response['Item']
                price = float(item['Price'])
                total_price += price  # Add the item's price to the total
    
        return total_price
    except Exception as e:
        raise Exception('An error occurred: {}'.format(str(e)))

def lambda_handler(event, context):
    try:
       
        # Parse the input JSON from the Lambda event
        input_json = json.loads(event)

        # Calculate the grand total
        grand_total = calculate_grand_total(input_json)
    
        # Return the grand total in the response
        return {'statusCode': 200,'body': json.dumps(grand_total)}
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps('An error occurred: {}'.format(str(e)))
        }
  1. Navigate to the Configuration tab in the Lambda function and choose Permissions.
  2. Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.

  1. Navigate to the IAM execution role for this Lambda function and add a policy to access the DynamoDB table.

  1. Create another Lambda function to validate if all required attributes were passed from the customer. In the following example, we validate if the size attribute is captured for an order:
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def lambda_handler(event, context):
    # Define customer orders from the input event
    customer_orders = json.loads(event)

    # Initialize a list to collect error messages
    order_errors = {}
    missing_size = []
    error_messages = []
    # Iterate through each order in customer_orders
    for order_id, order in customer_orders.items():
        if "size" not in order or order["size"] == "":
            missing_size.append(order['item'])
            order_errors['size'] = missing_size
    if order_errors:
        items_missing_size = order_errors['size']
        error_message = f"could you please provide the size for the following items: {', '.join(items_missing_size)}?"
        error_messages.append(error_message)

    # Prepare the response message
    if error_messages:
        response_message = "n".join(error_messages)
        return {
        'statusCode': 205,
        'validator_response': response_message
            }   
    else:
        response_message = "Order is validated successfully"
        return {
        'statusCode': 200,
        'validator_response': response_message
        }
  1. Navigate to the Configuration tab in the Lambda function and choose Permissions.
  2. Attach a resource-based policy statement allowing the order processing Lambda function to invoke this function.

Test the solution

Now we can test the solution with example orders that customers place via Amazon Lex.

For our first example, the customer asked for a frappuccino, which is not on the menu. The model validates the request with the help of the order validator template and suggests recommendations based on the menu. After the customer confirms their order, they are notified of the order total and order summary. The order will be processed based on the customer’s final confirmation.

In our next example, the customer orders a large cappuccino and then changes the size from large to medium. The model captures all necessary changes and requests the customer to confirm the order. The model presents the order total and order summary, and processes the order based on the customer’s final confirmation.

For our final example, the customer placed an order for multiple items and the size is missing for a couple of items. The model and Lambda function will verify if all required attributes are present to process the order and then ask the customer to provide the missing information. After the customer provides the missing information (in this case, the size of the coffee), they’re shown the order total and order summary. The order will be processed based on the customer’s final confirmation.

LLM limitations

LLM outputs are stochastic by nature, which means that the results from our LLM can vary in format or even include untruthful content (hallucinations). Therefore, developers need to rely on good error-handling logic throughout their code in order to handle these scenarios and avoid a degraded end-user experience.
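
For example, a minimal guard around the ObjectCreator output (reusing the logger pattern from the orchestration Lambda above; the fallback behavior here is only illustrative) could look like this:

import json
import logging

logger = logging.getLogger()

def safe_parse_order(object_creation_response):
    # Parse the JSON order produced by the ObjectCreator prompt, tolerating malformed model output
    try:
        return json.loads(object_creation_response)
    except (json.JSONDecodeError, TypeError) as e:
        logger.error('Model returned malformed order JSON: %s (%s)', object_creation_response, e)
        return None

# In process_order, a None result could trigger a re-prompt of the model or a clarification
# message back to the customer instead of failing the whole request.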

Clean up

If you no longer need this solution, you can delete the following resources:

  • Lambda functions
  • Amazon Lex bot
  • DynamoDB table
  • S3 bucket

Additionally, shut down the SageMaker Studio instance if the application is no longer required.

Cost assessment

For pricing information for the main services used by this solution, see the following:

Note that you can use Claude v2 without the need for provisioning, so overall costs remain at a minimum. To further reduce costs, you can configure the DynamoDB table with the on-demand setting.

Conclusion

This post demonstrated how to build a speech-enabled AI order processing agent using Amazon Lex, Amazon Bedrock, and other AWS services. We showed how prompt engineering with a powerful generative AI model like Claude can enable robust natural language understanding and conversation flows for order processing without the need for extensive training data.

The solution architecture uses serverless components like Lambda, Amazon S3, and DynamoDB to enable a flexible and scalable implementation. Storing the prompt templates in Amazon S3 allows you to customize the solution for different use cases.

Next steps could include expanding the agent’s capabilities to handle a wider range of customer requests and edge cases. The prompt templates provide a way to iteratively improve the agent’s skills. Additional customizations could involve integrating the order data with backend systems like inventory, CRM, or POS. Lastly, the agent could be made available across various customer touchpoints like mobile apps, drive-thru, kiosks, and more using the multi-channel capabilities of Amazon Lex.

To learn more, refer to the following related resources:


About the Authors

Moumita Dutta is a Partner Solution Architect at Amazon Web Services. In her role, she collaborates closely with partners to develop scalable and reusable assets that streamline cloud deployments and enhance operational efficiency. She is a member of AI/ML community and a Generative AI expert at AWS. In her leisure, she enjoys gardening and cycling.

Fernando Lammoglia is a Partner Solutions Architect at Amazon Web Services, working closely with AWS partners in spearheading the development and adoption of cutting-edge AI solutions across business units. A strategic leader with expertise in cloud architecture, generative AI, machine learning, and data analytics. He specializes in executing go-to-market strategies and delivering impactful AI solutions aligned with organizational goals. On his free time he loves to spend time with his family and travel to other countries.

Mitul Patel is a Senior Solution Architect at Amazon Web Services. In his role as a cloud technology enabler, he works with customers to understand their goals and challenges, and provides prescriptive guidance to achieve their objective with AWS offerings. He is a member of AI/ML community and a Generative AI ambassador at AWS. In his free time, he enjoys hiking and playing soccer.

Read More

HEAL: A framework for health equity assessment of machine learning performance

HEAL: A framework for health equity assessment of machine learning performance

Health equity is a major societal concern worldwide, with disparities that have many causes. These include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in diagnostic technology. In dermatology for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and how they impact health equity.

Health equity is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from equality. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not fairness as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities.

Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue).

In “Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study”, published in The Lancet eClinicalMedicine, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably. In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? This goal anchors on the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic).

The health equity framework (HEAL)

The HEAL framework proposes a 4-step process to estimate the likelihood that an ML-based health technology performs equitably:

  1. Identify factors associated with health inequities and define tool performance metrics,
  2. Identify and quantify pre-existing health disparities,
  3. Measure the performance of the tool for each subpopulation,
  4. Measure the likelihood that the tool prioritizes performance with respect to health disparities.

The final step’s output is termed the HEAL metric, which quantifies how anticorrelated the ML model’s performance is with health disparities. In other words, does the model perform better with populations that have the worst health outcomes?

This 4-step process is designed to inform improvements for making ML model performance more equitable, and is meant to be iterative and re-evaluated on a regular basis. For example, the availability of health outcomes data in step (2) can inform the choice of demographic factors and brackets in step (1), and the framework can be applied again with new datasets, models and populations.

Framework for Health Equity Assessment of machine Learning performance (HEAL). Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all.

With this work, we take a step towards encouraging explicit assessment of the health equity considerations of AI technologies, and encourage prioritization of efforts during model development to reduce health inequities for subpopulations exposed to structural inequities that can precipitate disparate outcomes. We should note that the present framework does not model causal relationships and, therefore, cannot quantify the actual impact a new technology will have on reducing health outcome disparities. However, the HEAL metric may help identify opportunities for improvement, where the current performance is not prioritized with respect to pre-existing health disparities.

Case study on a dermatology model

As an illustrative case study, we applied the framework to a dermatology model, which utilizes a convolutional neural network similar to that described in prior work. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions.

Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. Evaluation of the model is done using a top-3 agreement metric, which quantifies how often the top 3 output conditions match the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings.
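
To make the computation concrete, the following is a minimal, hypothetical sketch, not the paper’s estimator (which is defined precisely in the publication): it bootstraps cases and counts how often per-subgroup top-3 agreement is anticorrelated, in the rank sense, with a per-subgroup health-outcome burden such as YLLs. The record fields and toy numbers below are illustrative assumptions.

# Illustrative sketch only -- not the HEAL authors' implementation.
# Assumes per-case records with a subgroup label and a top-3 agreement flag,
# plus a per-subgroup health-outcome burden (e.g., YLLs per 100,000).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def heal_style_likelihood(records, burden_by_group, n_boot=1000):
    """Estimate how often per-subgroup top-3 agreement is rank-anticorrelated
    with pre-existing health-outcome burden, under bootstrap resampling."""
    groups = sorted(burden_by_group)
    burden = np.array([burden_by_group[g] for g in groups])
    anticorrelated = 0
    for _ in range(n_boot):
        # Resample cases with replacement, then recompute per-group agreement.
        sample = [records[i] for i in rng.integers(0, len(records), len(records))]
        agreement = np.array([
            np.mean([r["top3_correct"] for r in sample if r["group"] == g] or [np.nan])
            for g in groups
        ])
        rho, _ = spearmanr(agreement, burden)
        if rho < 0:  # better performance where the outcome burden is higher
            anticorrelated += 1
    return anticorrelated / n_boot

# Hypothetical toy data: a few cases per subgroup.
records = [
    {"group": "A", "top3_correct": 1}, {"group": "A", "top3_correct": 1},
    {"group": "B", "top3_correct": 1}, {"group": "B", "top3_correct": 0},
    {"group": "C", "top3_correct": 0}, {"group": "C", "top3_correct": 1},
]
burden_by_group = {"A": 900.0, "B": 450.0, "C": 300.0}  # e.g., YLLs per 100,000
print(f"HEAL-style likelihood estimate: {heal_style_likelihood(records, burden_by_group):.2f}")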

We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric. The dataset consisted of “store-and-forward” cases from patients of 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. To quantify pre-existing health outcomes for each subgroup we relied on measurements from public databases endorsed by the World Health Organization, such as Years of Life Lost (YLLs) and Disability-Adjusted Life Years (DALYs; years of life lost plus years lived with disability).

HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance.
(* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.)
HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Our analysis estimated that the model was 80.5% likely to perform equitably across race/ethnicity subgroups and 92.1% likely to perform equitably across sexes.

However, while the model was likely to perform equitably across age groups for cancer conditions specifically, we discovered that it had room for improvement across age groups for non-cancer conditions. For example, those 70+ have the poorest health outcomes related to non-cancer skin conditions, yet the model didn’t prioritize performance for this subgroup.

HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Putting things in context

For holistic evaluation, the HEAL metric cannot be employed in isolation. Instead this metric should be contextualized alongside many other factors ranging from computational efficiency and data privacy to ethical values, and aspects that may influence the results (e.g., selection bias or differences in representativeness of the evaluation data across demographic groups).

As an adversarial example, the HEAL metric can be artificially improved by deliberately reducing model performance for the most advantaged subpopulation until performance for that subpopulation is worse than all others. For illustrative purposes, given subpopulations A and B where A has worse health outcomes than B, consider the choice between two models: Model 1 (M1) performs 5% better for subpopulation A than for subpopulation B. Model 2 (M2) performs 5% worse on subpopulation A than B. The HEAL metric would be higher for M1 because it prioritizes performance on a subpopulation with worse outcomes. However, M1 may have absolute performances of just 75% and 70% for subpopulations A and B respectively, while M2 has absolute performances of 75% and 80% for subpopulations A and B respectively. Choosing M1 over M2 would lower overall performance, because subpopulation B would be worse off while subpopulation A would be no better off.

Accordingly, the HEAL metric should be used alongside a Pareto condition (discussed further in the paper), which restricts model changes so that performance for every subpopulation is either unchanged or improved compared to the status quo; no subpopulation may be made worse off.
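
As an illustration of that condition, the following minimal check (a hypothetical helper with illustrative values, not code from the paper) accepts a candidate model only if no subpopulation’s performance drops relative to the status quo.

# Minimal illustration of the Pareto condition described above (hypothetical values).
def satisfies_pareto(current: dict, candidate: dict) -> bool:
    """Return True if the candidate model leaves every subpopulation
    unchanged or improved relative to the current model."""
    return all(candidate[group] >= current[group] for group in current)

# The M1/M2 example from the text, treating M2 as the status quo.
status_quo = {"A": 0.75, "B": 0.80}  # M2
candidate = {"A": 0.75, "B": 0.70}   # M1
print(satisfies_pareto(status_quo, candidate))  # False: subpopulation B is worse off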

The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance for subpopulations with respect to pre-existing health disparities for specific subpopulations. This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap.

Conclusion

The HEAL framework enables a quantitative assessment of the likelihood that health AI technologies prioritize performance with respect to health disparities. The case study demonstrates how to apply the framework in the dermatological domain, indicating a high likelihood that model performance is prioritized with respect to health disparities across sex and race/ethnicity, but also revealing the potential for improvements for non-cancer conditions across age. The case study also illustrates limitations in the ability to apply all recommended aspects of the framework (e.g., mapping societal context, availability of data), thus highlighting the complexity of health equity considerations of ML-based tools.

This work is a proposed approach to address a grand challenge for AI and health equity, and may provide a useful evaluation framework not only during model development, but during pre-implementation and real-world monitoring stages, e.g., in the form of health equity dashboards. We hold that the strength of the HEAL framework is in its future application to various AI tools and use cases and its refinement in the process. Finally, we acknowledge that a successful approach towards understanding the impact of AI technologies on health equity needs to be more than a set of metrics. It will require a set of goals agreed upon by a community that represents those who will be most impacted by a model.

Acknowledgements

The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um’rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.

Read More

Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML.

Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of raw, non-sanitized sensitive information across different locations poses significant security and privacy risks, especially in regulated industries such as healthcare.

To address this issue, federated learning (FL) is a decentralized and collaborative ML training technique that offers data privacy while maintaining accuracy and fidelity. Unlike traditional ML training, FL training occurs within an isolated client location using an independent secure session. The client only shares its output model parameters with a centralized server, known as the training coordinator or aggregation server, and not the actual data used to train the model. This approach alleviates many data privacy concerns while enabling effective collaboration on model training.

Although FL is a step towards achieving better data privacy and security, it’s not a guaranteed solution. Insecure networks lacking access control and encryption can still expose sensitive information to attackers. Additionally, locally trained information can expose private data if reconstructed through an inference attack. To mitigate these risks, the FL model uses personalized training algorithms and effective masking and parameterization before sharing information with the training coordinator. Strong network controls at local and centralized locations can further reduce inference and exfiltration risks.

In this post, we share an FL approach using FedML, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to improve patient outcomes while addressing data privacy and security concerns.

The need for federated learning in healthcare

Healthcare relies heavily on distributed data sources to make accurate predictions and assessments about patient care. Limiting the available data sources to protect privacy negatively affects result accuracy and, ultimately, the quality of patient care. Therefore, ML creates challenges for AWS customers who need to ensure privacy and security across distributed entities without compromising patient outcomes.

Healthcare organizations must navigate strict compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, while implementing FL solutions. Ensuring data privacy, security, and compliance becomes even more critical in healthcare, requiring robust encryption, access controls, auditing mechanisms, and secure communication protocols. Additionally, healthcare datasets often contain complex and heterogeneous data types, making data standardization and interoperability a challenge in FL settings.

Use case overview

The use case outlined in this post involves heart disease data held by different organizations, on which an ML model runs classification algorithms to predict heart disease in a patient. Because this data is spread across organizations, we use federated learning to collate the findings.

The Heart Disease dataset from the University of California Irvine’s Machine Learning Repository is a widely used dataset for cardiovascular research and predictive modeling. It consists of 303 samples, each representing a patient, and contains a combination of clinical and demographic attributes, as well as the presence or absence of heart disease.

This multivariate dataset has 76 attributes in the patient information, out of which 14 attributes are most commonly used for developing and evaluating ML algorithms to predict the presence of heart disease based on the given attributes.
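
For local experimentation outside of the FedML packaging described later in this post, the commonly used 14-attribute Cleveland subset can be loaded with pandas roughly as follows. The URL and column names reflect the UCI repository’s conventional layout and should be verified against the repository before use.

# Sketch: load the commonly used 14-attribute Cleveland subset of the
# UCI Heart Disease dataset. The URL and column names follow the repository's
# conventional layout -- verify them against the UCI page before relying on this.
import pandas as pd

URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "heart-disease/processed.cleveland.data"
)
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

df = pd.read_csv(URL, names=COLUMNS, na_values="?")
df["target"] = (df["target"] > 0).astype(int)  # collapse classes 1-4 into "disease present"
print(df.shape)                   # expected: (303, 14)
print(df["target"].value_counts())

In the federated setup, each organization holds only its own partition of such data and trains locally, sharing model parameters rather than patient records.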

FedML framework

There is a wide selection of FL frameworks, but we decided to use the FedML framework for this use case because it is open source and supports several FL paradigms. FedML provides a popular open source library, MLOps platform, and application ecosystem for FL. These facilitate the development and deployment of FL solutions. It provides a comprehensive suite of tools, libraries, and algorithms that enable researchers and practitioners to implement and experiment with FL algorithms in a distributed environment. FedML addresses the challenges of data privacy, communication, and model aggregation in FL, offering a user-friendly interface and customizable components. With its focus on collaboration and knowledge sharing, FedML aims to accelerate the adoption of FL and drive innovation in this emerging field. The FedML framework is model agnostic, including recently added support for large language models (LLMs). For more information, refer to Releasing FedLLM: Build Your Own Large Language Models on Proprietary Data using the FedML Platform.

FedML Octopus

System hierarchy and heterogeneity are key challenges in real-life FL use cases, where different data silos may have different infrastructure with CPUs and GPUs. In such scenarios, you can use FedML Octopus.

FedML Octopus is an industrial-grade platform for cross-silo FL for cross-organization and cross-account training. Coupled with FedML MLOps, it enables developers or organizations to conduct open collaboration from anywhere at any scale in a secure manner. FedML Octopus runs a distributed training paradigm inside each data silo and uses synchronous or asynchronous training.

FedML MLOps

FedML MLOps enables local development of code that can later be deployed anywhere using FedML frameworks. Before initiating training, you must create a FedML account, as well as create and upload the server and client packages in FedML Octopus. For more details, refer to the steps in the FedML documentation and to Introducing FedML Octopus: scaling federated learning into production with simplified MLOps.

Solution overview

We deploy FedML into multiple EKS clusters integrated with SageMaker for experiment tracking. We use Amazon EKS Blueprints for Terraform to deploy the required infrastructure. EKS Blueprints helps compose complete EKS clusters that are fully bootstrapped with the operational software that is needed to deploy and operate workloads. With EKS Blueprints, the configuration for the desired state of the EKS environment, such as the control plane, worker nodes, and Kubernetes add-ons, is described as an infrastructure as code (IaC) blueprint. After a blueprint is configured, it can be used to create consistent environments across multiple AWS accounts and Regions using continuous deployment automation.

The content shared in this post reflects real-life situations and experiences, but deployments in other environments may vary. Although we use a single AWS account with separate VPCs, individual circumstances and configurations may differ. Therefore, the information provided should be used as a general guide and may require adaptation to specific requirements and local conditions.

The following diagram illustrates our solution architecture.

In addition to the tracking provided by FedML MLOps for each training run, we use Amazon SageMaker Experiments to track the performance of each client model and the centralized (aggregator) model.

SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your ML experiments. By recording experiment details, parameters, and results, researchers can accurately reproduce and validate their work. It allows for effective comparison and analysis of different approaches, leading to informed decision-making. Additionally, tracking experiments facilitates iterative improvement by providing insights into the progression of models and enabling researchers to learn from previous iterations, ultimately accelerating the development of more effective solutions.

We send the following to SageMaker Experiments for each run:

  • Model evaluation metrics – Training loss and Area Under the Curve (AUC)
  • Hyperparameters – Epoch, learning rate, batch size, optimizer, and weight decay

Prerequisites

To follow along with this post, you should have the following prerequisites:

Deploy the solution

To begin, clone the repository hosting the sample code locally:

git clone git@ssh.gitlab.aws.dev:west-ml-sa/fl_fedml.ai.git

Then deploy the use case infrastructure using the following commands:

terraform init
terraform apply

The Terraform template may take 20–30 minutes to fully deploy. After it’s deployed, follow the steps in the next sections to run the FL application.

Create an MLOps deployment package

As described in the FedML documentation, we need to create the client and server packages, which the MLOps platform distributes to the server and clients to begin training.

To create these packages, run the following script found in the root directory:

. ./build_mlops_pkg.sh

This creates the respective packages in the following directory under the project root:

mlops/dist-packages

Upload the packages to the FedML MLOps platform

Complete the following steps to upload the packages:

  1. On the FedML UI, choose My Applications in the navigation pane.
  2. Choose New Application.
  3. Upload the client and server packages from your workstation.
  4. You can also adjust the hyperparameters or create new ones.

Trigger federated training

To run federated training, complete the following steps:

  1. On the FedML UI, choose Project List in the navigation pane.
  2. Choose Create a new project.
  3. Enter a group name and a project name, then choose OK.
  4. Choose the newly created project and choose Create new run to trigger a training run.
  5. Select the edge client devices and the central aggregator server for this training run.
  6. Choose the application that you created in the previous steps.
  7. Update any of the hyperparameters or use the default settings.
  8. Choose Start to start training.
  9. Choose the Training Status tab and wait for the training run to complete. You can also explore the other available tabs.
  10. When training is complete, choose the System tab to see the training time durations on your edge servers and aggregation events.

View results and experiment details

When the training is complete, you can view the results using FedML and SageMaker.

On the FedML UI, on the Models tab, you can see the aggregator and client model. You can also download these models from the website.

You can also log in to Amazon SageMaker Studio and choose Experiments in the navigation pane.

The following screenshot shows the logged experiments.

Experiment tracking code

In this section, we explore the code that integrates SageMaker experiment tracking with the FL framework training.

In an editor of your choice, open the following folder to see the edits that inject SageMaker experiment tracking code into the training:

cd fl_fedml.ai/

To track the training, we create a SageMaker experiment with parameters and metrics logged using the log_parameters and log_metric commands, as outlined in the following code samples.

An entry in the config/fedml_config.yaml file declares the experiment prefix, which is referenced in the code to create unique experiment names: sm_experiment_name: "fed-heart-disease". You can update this to any value of your choice.

For example, see the following code for the heart_disease_trainer.py, which is used by each client to train the model on their own dataset:

# Import used by this snippet (typically placed at the top of heart_disease_trainer.py)
from sagemaker.utils import unique_name_from_base

# Add this code before the for loop on epochs.
# We are passing the experiment prefix and client rank from the config
# to the function to create a unique experiment name.
experiment_name = unique_name_from_base(args.sm_experiment_name + "-client-" + str(args.rank))
print(f"Sagemaker Experiment Name: {experiment_name}")

For each client run, the experiment details are tracked using the following code in heart_disease_trainer.py:

# Imports used by this snippet (typically placed at the top of the file)
from sagemaker.experiments.run import Run
from sagemaker.session import Session

# Create an experiment and start a new run; run_name is defined earlier in the script.
with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    run.log_parameters(
        {
            "Train Data Size": str(len(train_data.dataset)),
            "device": "cpu",
            "center": args.rank,
            "learning-rate": args.lr,
            "batch-size": args.batch_size,
            "client-optimizer": args.client_optimizer,
            "weight-decay": args.weight_decay,
        }
    )
    run.log_metric(name="Validation:AUC", value=epoch_auc)
    run.log_metric(name="Training:Loss", value=epoch_loss)

Similarly, you can use the code in heart_disease_aggregator.py to run a test on local data after updating the model weights. The details are logged after each communication round with the clients.

# create an experiment and start a new run
with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    run.log_parameters(
        {
            "Train Data Size": str(len(test_data_local_dict[i])),
            "device": "cpu",
            "round": i,
            "learning-rate": args.lr,
            "batch-size": args.batch_size,
            "client-optimizer": args.client_optimizer,
            "weight-decay": args.weight_decay,
        }
    )
    run.log_metric(name="Test:AUC", value=test_auc_metrics)
    run.log_metric(name="Test:Loss", value=test_loss_metrics)

Clean up

When you’re done with the solution, clean up the resources you created to avoid unnecessary costs. Actively tidying up the environment, such as deleting unused instances, stopping unnecessary services, and removing temporary data, keeps the infrastructure clean and organized. You can use the following commands to clean up your resources:

terraform destroy -target=module.m_fedml_edge_server.module.eks_blueprints_kubernetes_addons -auto-approve
terraform destroy -target=module.m_fedml_edge_client_1.module.eks_blueprints_kubernetes_addons -auto-approve
terraform destroy -target=module.m_fedml_edge_client_2.module.eks_blueprints_kubernetes_addons -auto-approve

terraform destroy -target=module.m_fedml_edge_client_1.module.eks -auto-approve
terraform destroy -target=module.m_fedml_edge_client_2.module.eks -auto-approve
terraform destroy -target=module.m_fedml_edge_server.module.eks -auto-approve

terraform destroy

Summary

By using Amazon EKS as the infrastructure and FedML as the framework for FL, we are able to provide a scalable and managed environment for training and deploying shared models while respecting data privacy. With the decentralized nature of FL, organizations can collaborate securely, unlock the potential of distributed data, and improve ML models without compromising data privacy.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.


About the Authors

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Arnab Sinha is a Senior Solutions Architect for AWS, acting as Field CTO to help organizations design and build scalable solutions supporting business outcomes across data center migrations, digital transformation and application modernization, big data, and machine learning. He has supported customers across a variety of industries, including energy, retail, manufacturing, healthcare, and life sciences. Arnab holds all AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Arnab was a technology leader and previously held architect and engineering leadership roles.

Prachi Kulkarni is a Senior Solutions Architect at AWS. Her specialization is machine learning, and she is actively working on designing solutions using various AWS ML, big data, and analytics offerings. Prachi has experience in multiple domains, including healthcare, benefits, retail, and education, and has worked in a range of positions in product engineering and architecture, management, and customer success.

Tamer Sherif is a Principal Solutions Architect at AWS, with a diverse background in the technology and enterprise consulting services realm, spanning over 17 years as a Solutions Architect. With a focus on infrastructure, Tamer’s expertise covers a broad spectrum of industry verticals, including commercial, healthcare, automotive, public sector, manufacturing, oil and gas, media services, and more. His proficiency extends to various domains, such as cloud architecture, edge computing, networking, storage, virtualization, business productivity, and technical leadership.

Hans Nesbitt is a Senior Solutions Architect at AWS based out of Southern California. He works with customers across the western US to craft highly scalable, flexible, and resilient cloud architectures. In his spare time, he enjoys spending time with his family, cooking, and playing guitar.

Chaoyang He is Co-founder and CTO of FedML, Inc., a startup building open and collaborative AI for a community, from anywhere at any scale. His research focuses on distributed and federated machine learning algorithms, systems, and applications. He received his PhD in Computer Science from the University of Southern California.

Al Nevarez is Director of Product Management at FedML. Before FedML, he was a group product manager at Google, and a senior manager of data science at LinkedIn. He has several data product-related patents, and he studied engineering at Stanford University.

Salman Avestimehr is Co-founder and CEO of FedML. He has been a Dean’s Professor at USC, Director of the USC-Amazon Center on Trustworthy AI, and an Amazon Scholar in Alexa AI. He is an expert on federated and decentralized machine learning, information theory, security, and privacy. He is a Fellow of IEEE and received his PhD in EECS from UC Berkeley.

Samir Lad is an accomplished enterprise technologist with AWS who works closely with customers’ C-level executives. As a former C-suite executive who has driven transformations across multiple Fortune 100 companies, Samir shares his invaluable experiences to help his clients succeed in their own transformation journey.

Stephen Kraemer is a Board and CxO advisor and former executive at AWS. Stephen advocates culture and leadership as the foundations of success. He professes security and innovation as the drivers of cloud transformation that enables highly competitive, data-driven organizations.

Read More

Enable data sharing through federated learning: A policy approach for chief digital officers

Enable data sharing through federated learning: A policy approach for chief digital officers

This is a guest blog post written by Nitin Kumar, a Lead Data Scientist at T and T Consulting Services, Inc.

In this post, we discuss the value and potential impact of federated learning in the healthcare field. This approach can help heart stroke patients, doctors, and researchers with faster diagnosis, enriched decision-making, and more informed, inclusive research work on stroke-related health issues, using a cloud-native approach with AWS services for lightweight lift and straightforward adoption.

Diagnosis challenges with heart strokes

Statistics from the Centers for Disease Control and Prevention (CDC) show that each year in the US, more than 795,000 people have a stroke, and about 1 in 4 of these strokes occur in people who have had a previous stroke. It is the number five cause of death according to the American Stroke Association and a leading cause of disability in the US. Therefore, it’s crucial to have prompt diagnosis and treatment to reduce brain damage and other complications in acute stroke patients.

CTs and MRIs are the gold standard in imaging technologies for classifying different sub-types of strokes and are crucial during preliminary assessment of patients, determining the root cause, and treatment. One critical challenge here, especially in the case of acute stroke, is the time to imaging diagnosis, which on average ranges from 30 minutes up to an hour and can be much longer depending on emergency department crowding.

Doctors and medical staff need quick and accurate image diagnosis to evaluate a patient’s condition and propose treatment options. In Dr. Werner Vogels’s own words at AWS re:Invent 2023, “every second that a person has a stroke counts.” Stroke victims can lose around 1.9 million neurons every minute they are not being treated.

Medical data restrictions

You can use machine learning (ML) to assist doctors and researchers in diagnosis tasks, thereby speeding up the process. However, the datasets needed to build the ML models and give reliable results are sitting in silos across different healthcare systems and organizations. This isolated legacy data has the potential for massive impact if cumulated. So why hasn’t it been used yet?

There are multiple challenges when working with medical domain datasets and building ML solutions, including patient privacy, security of personal data, and certain bureaucratic and policy restrictions. Additionally, research institutions have been tightening their data sharing practices. These obstacles also prevent international research teams from working together on diverse and rich datasets, which could save lives and prevent disabilities that can result from heart strokes, among other benefits.

Policies and regulations like the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and California Consumer Privacy Act (CCPA) put guardrails on sharing data from the medical domain, especially patient data. Additionally, the datasets at individual institutes, organizations, and hospitals are often too small, are unbalanced, or have biased distribution, leading to model generalization constraints.

Federated learning: An introduction

Federated learning (FL) is a decentralized form of ML—a dynamic engineering approach. In this decentralized ML approach, the ML model is shared between organizations for training on proprietary data subsets, unlike traditional centralized ML training, where the model generally trains on aggregated datasets. The data stays protected behind the organization’s firewalls or VPC, while the model with its metadata is shared.

In the training phase, a global FL model is disseminated and synchronized among participating organizations for training on their individual datasets, and each locally trained model is returned. The final global model is available for all participants to use for predictions, and can also be used as a base for further training to build local custom models for participating organizations. It can further be extended to benefit other institutes. This approach can significantly reduce the cybersecurity requirements for data in transit by removing the need for data to leave the organization’s boundaries at all.
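
Conceptually, the aggregation step amounts to a weighted average of client updates. The following sketch of one federated-averaging (FedAvg) round is illustrative only and is not tied to FedML’s API; the function and variable names are assumptions for this example.

# Conceptual sketch of one federated-averaging (FedAvg) round.
# Not FedML's API -- a production framework also handles communication,
# scheduling, and security.
import numpy as np

def local_update(global_weights, local_data):
    """Placeholder for a client's local training step. In practice this runs
    several epochs on data that never leaves the client and returns only
    the updated weights."""
    X, y = local_data
    return [w + 0.01 * np.random.randn(*w.shape) for w in global_weights]

def fedavg(client_weight_lists, client_sample_counts):
    """Aggregate client models, weighting each by its local sample count."""
    total = float(sum(client_sample_counts))
    return [
        sum(n / total * weights[layer]
            for weights, n in zip(client_weight_lists, client_sample_counts))
        for layer in range(len(client_weight_lists[0]))
    ]

# One round with two hypothetical clients (data silos).
global_weights = [np.zeros((4, 2)), np.zeros(2)]
clients = [((np.ones((50, 4)), np.ones(50)), 50), ((np.ones((30, 4)), np.ones(30)), 30)]
updates = [local_update(global_weights, data) for data, _ in clients]
global_weights = fedavg(updates, [n for _, n in clients])
print([w.shape for w in global_weights])  # aggregated model, redistributed to clients

In practice, the weighting, secure aggregation, and communication are handled by the FL framework you choose; several options are listed later in this post.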

The following diagram illustrates an example architecture.

In the following sections, we discuss how federated learning can help.

Federated learning to save the day (and save lives)

For good artificial intelligence (AI), you need good data.

Legacy systems, which are common in the federal domain, pose significant data processing challenges before you can derive any intelligence from them or merge them with newer datasets. This is an obstacle to providing valuable intelligence to leaders, and it can lead to inaccurate decision-making because legacy data is sometimes far more valuable than a newer, smaller dataset. You want to resolve this bottleneck effectively, without months or even years of manual consolidation and integration effort (including cumbersome mapping processes) for legacy and newer datasets sitting across hospitals and institutes. The legacy data is valuable because it holds the contextual information needed for accurate decision-making and well-informed model training, leading to reliable AI in the real world. Long data histories also reveal long-term variations and patterns that would otherwise go undetected and lead to biased, ill-informed predictions.

Breaking down these data silos to unite the untapped potential of the scattered data can save and transform many lives. It can also accelerate the research related to secondary health issues arising from heart strokes. This solution can help you share insights from data isolated between institutes due to policy and other reasons, whether you are a hospital, a research institute, or other health data-focused organizations. It can enable informed decisions on research direction and diagnosis. Additionally, it results in a centralized repository of intelligence via a secure, private, and global knowledge base.

Federated learning has many benefits in general and specifically for medical data settings.

Security and Privacy features:

  • Keeps sensitive data away from the internet and still uses it for ML, and harnesses its intelligence with differential privacy
  • Enables you to build, train, and deploy unbiased and robust models across not just machines but also networks, without any data security hazards
  • Overcomes the hurdles with multiple vendors managing the data
  • Eliminates the need for cross-site data sharing and global governance
  • Preserves privacy with differential privacy and offers secure multi-party computation with local training

Performance Improvements:

  • Addresses the small sample size problem in the medical imaging space and costly labeling processes
  • Balances the distribution of the data
  • Enables you to incorporate most traditional ML and deep learning (DL) methods
  • Uses pooled image sets to help improve statistical power, overcoming the sample size limitation of individual institutions

Resilience Benefits:

  • If any one party decides to leave, it won’t hinder the training
  • A new hospital or institute can join at any time; it’s not reliant on any specific dataset with any node organization
  • There is no need for extensive data engineering pipelines for the legacy data scattered across widespread geographical locations

These features can help bring the walls down between institutions hosting isolated datasets on similar domains. The solution can become a force multiplier by harnessing the unified powers of distributed datasets and improving efficiency by radically transforming the scalability aspect without the heavy infrastructure lift. This approach helps ML reach its full potential, becoming proficient at the clinical level and not just research.

Federated learning has comparable performance to regular ML, as shown in the following experiment by NVIDIA Clara (on the Medical Model ARchive (MMAR) using the BRATS2018 dataset). Here, FL achieved segmentation performance comparable to training with centralized data: over 80% after approximately 600 epochs while training a multi-modal, multi-class brain tumor segmentation task.

Federated learning has been tested recently in a few medical sub-fields for use cases including patient similarity learning, patient representation learning, phenotyping, and predictive modeling.

Application blueprint: Federated learning makes it possible and straightforward

To get started with FL, you can choose from many high-quality datasets. For example, datasets with brain images include ABIDE (Autism Brain Imaging Data Exchange initiative), ADNI (Alzheimer’s Disease Neuroimaging Initiative), RSNA (Radiological Society of North America) Brain CT, BraTS (Multimodal Brain Tumor Image Segmentation Benchmark) updated regularly for the Brain Tumor Segmentation Challenge under UPenn (University of Pennsylvania), UK BioBank (covered in the following NIH paper), and IXI. Similarly for heart images, you can choose from several publicly available options, including ACDC (Automatic Cardiac Diagnosis Challenge), which is a cardiac MRI assessment dataset with full annotation mentioned by the National Library of Medicine in the following paper, and M&M (Multi-Center, Multi-Vendor, and Multi-Disease) Cardiac Segmentation Challenge mentioned in the following IEEE paper.

The following images show a probabilistic lesion overlap map for the primary lesions from the ATLAS R1.1 dataset. (Strokes are one of the most common causes of brain lesions according to Cleveland Clinic.)

For Electronic Health Records (EHR) data, a few datasets are available that follow the Fast Healthcare Interoperability Resources (FHIR) standard. This standard helps you build straightforward pilots by removing certain challenges with heterogeneous, non-normalized datasets, allowing for seamless and secure exchange, sharing, and integration of datasets. FHIR enables maximum interoperability. Dataset examples include MIMIC-IV (Medical Information Mart for Intensive Care). Other good-quality datasets that aren’t currently FHIR but can be easily converted include Centers for Medicare & Medicaid Services (CMS) Public Use Files (PUF) and the eICU Collaborative Research Database from MIT (Massachusetts Institute of Technology). There are also other resources becoming available that offer FHIR-based datasets.

The lifecycle for implementing FL can include the following steps: task initialization, selection, configuration, model training, client/server communication, scheduling and optimization, versioning, testing, deployment, and termination. There are many time-intensive steps that go into preparing medical imaging data for traditional ML, as described in the following paper. Domain knowledge might be needed in some scenarios to preprocess raw patient data, especially due to its sensitive and private nature. These can be consolidated and sometimes eliminated for FL, saving crucial time for training and providing faster results.

Implementation

FL tools and libraries have grown with widespread support, making it straightforward to use FL without a heavy overhead lift. There are a lot of good resources and framework options available to get started. You can refer to the following extensive list of the most popular frameworks and tools in the FL domain, including PySyft, FedML, Flower, OpenFL, FATE, TensorFlow Federated, and NVFlare. It provides a beginner’s list of projects to get started quickly and build upon.

You can implement a cloud-native approach with Amazon SageMaker that seamlessly works with AWS VPC peering, keeping each node’s training in a private subnet in their respective VPC and enabling communication via private IPv4 addresses. Furthermore, model hosting on Amazon SageMaker JumpStart can help by exposing the endpoint API without sharing model weights.

Amazon Elastic Compute Cloud (Amazon EC2) resources also take away potential compute challenges associated with on-premises hardware. You can implement the FL client and servers on AWS with SageMaker notebooks and Amazon Simple Storage Service (Amazon S3), maintain regulated access to the data and model with AWS Identity and Access Management (IAM) roles, and use AWS Security Token Service (AWS STS) for client-side security. You can also build your own custom system for FL using Amazon EC2.
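
As one possible illustration of the client-side pattern above (the role ARN, session name, and bucket are placeholders, not values from this solution), a participating site could assume a narrowly scoped IAM role through AWS STS and use the temporary credentials for its own Amazon S3 access:

# Sketch: a participating site assumes a narrowly scoped IAM role via AWS STS
# and uses the temporary credentials for its own S3 access.
# The role ARN, session name, and bucket are placeholders.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/fl-client-site-a",  # placeholder
    RoleSessionName="fl-client-site-a-session",
    DurationSeconds=3600,
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
# The client exchanges only model artifacts, never raw patient data.
s3.download_file("example-fl-client-bucket", "models/global_model.bin", "/tmp/global_model.bin")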

For a detailed overview of implementing FL with the Flower framework on SageMaker, and a discussion of its difference from distributed training, refer to Machine learning with decentralized training data using federated learning on Amazon SageMaker.

The following figures illustrate the architecture of transfer learning in FL.

Addressing FL data challenges

Federated learning comes with its own data challenges, including privacy and security, but they are straightforward to address. First, you need to address the data heterogeneity problem with medical imaging data arising from data being stored across different sites and participating organizations, known as a domain shift problem (also referred to as client shift in an FL system), as highlighted by Guan and Liu in the following paper. This can lead to a difference in convergence of the global model.

Other components for consideration include ensuring data quality and uniformity at the source, incorporating expert knowledge into the learning process to inspire confidence in the system among medical professionals, and achieving model precision. For more information about some of the potential challenges you may face during implementation, refer to the following paper.

AWS helps you resolve these challenges with features like the flexible compute of Amazon EC2 and pre-built Docker images in SageMaker for straightforward deployment. You can resolve client-side problems like unbalanced data and computation resources for each node organization. You can address server-side learning problems like poisoning attacks from malicious parties with Amazon Virtual Private Cloud (Amazon VPC), security groups, and other security standards, preventing client corruption and implementing AWS anomaly detection services.

AWS also helps in addressing real-world implementation challenges, which can include integration challenges, compatibility issues with current or legacy hospital systems, and user adoption hurdles, by offering flexible, easy-to-use technology solutions with minimal lift.

With AWS services, you can enable large-scale FL-based research and clinical implementation and deployment, which can consist of various sites across the world.

Recent policies on interoperability highlight the need for federated learning

Many laws recently passed by the government include a focus on data interoperability, bolstering the need for cross-organizational interoperability of data for intelligence. FL can help fulfill this need under frameworks such as TEFCA (Trusted Exchange Framework and Common Agreement) and the expanded USCDI (United States Core Data for Interoperability).

The proposed idea also contributes towards the CDC’s capture and distribution initiative CDC Moving Forward. The following quote from the GovCIO article Data Sharing and AI Top Federal Health Agency Priorities in 2024 also echoes a similar theme: “These capabilities can also support the public in an equitable way, meeting patients where they are and unlocking critical access to these services. Much of this work comes down to the data.”

This can help medical institutes and agencies around the country (and across the globe) with data silos. They can benefit from seamless and secure integration and data interoperability, making medical data usable for impactful ML-based predictions and pattern recognition. You can start with images, but the approach is applicable to all EHR as well. The goal is to find the best approach for data stakeholders, with a cloud-native pipeline to normalize and standardize the data or directly use it for FL.

Let’s explore an example use case. Heart stroke imaging data and scans are scattered around the country and the world, sitting in isolated silos in institutes, universities, and hospitals, and separated by bureaucratic, geographical, and political boundaries. There is no single aggregated source and no easy way for medical professionals (non-programmers) to extract insights from it. At the same time, it’s not feasible to train ML and DL models on this data, which could help medical professionals make faster, more accurate decisions in critical times when heart scans can take hours to come in while the patient’s life could be hanging in the balance.

Other known use cases include POTS (Purchasing Online Tracking System) at NIH (National Institutes of Health) and cybersecurity for scattered and tiered intelligence solution needs at COMCOMs/MAJCOMs locations around the globe.

Conclusion

Federated learning holds great promise for legacy healthcare data analytics and intelligence. It’s straightforward to implement a cloud-native solution with AWS services, and FL is especially helpful for medical organizations with legacy data and technical challenges. FL can have a potential impact on the entire treatment cycle, and now even more so with the focus on data interoperability from large federal organizations and government leaders.

This solution can help you avoid reinventing the wheel and use the latest technology to take a leap from legacy systems and be at the forefront in this ever-evolving world of AI. You can also become a leader for best practices and an efficient approach to data interoperability within and across agencies and institutes in the health domain and beyond. If you are an institute or agency with data silos scattered around the country, you can benefit from this seamless and secure integration.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. It is each customers’ responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter an AWS Business Associate Addendum (BAA) and follow its configuration requirements.


About the Author

Nitin Kumar (MS, CMU) is a Lead Data Scientist at T and T Consulting Services, Inc. He has extensive experience with R&D prototyping, health informatics, public sector data, and data interoperability. He applies his knowledge of cutting-edge research methods to the federal sector to deliver innovative technical papers, POCs, and MVPs. He has worked with multiple federal agencies to advance their data and AI goals. Nitin’s other focus areas include natural language processing (NLP), data pipelines, and generative AI.

Read More