Deploy self-service question answering with the QnABot on AWS solution powered by Amazon Lex with Amazon Kendra and large language models

Powered by Amazon Lex, the QnABot on AWS solution is an open-source, multi-channel, multi-language conversational chatbot. QnABot allows you to quickly deploy self-service conversational AI into your contact center, websites, and social media channels, reducing costs, shortening hold times, and improving customer experience and brand sentiment. Customers now want to apply the power of large language models (LLMs) to further improve the customer experience with generative AI capabilities. This includes automatically generating accurate answers from existing company documents and knowledge bases, and making their self-service chatbots more conversational.

Our latest QnABot releases, v5.4.0+, can now use an LLM to disambiguate customer questions by taking conversational context into account, dynamically generating answers from relevant FAQs or Amazon Kendra search results and document passages. It also provides attribution and transparency by displaying links to the reference documents and context passages that were used by the LLM to construct the answers.

When you deploy QnABot, you can choose to automatically deploy a state-of-the-art open-source LLM (Falcon-40B-instruct) on an Amazon SageMaker endpoint. The LLM landscape is constantly evolving: new models are released frequently, and our customers want to experiment with different models and providers to see what works best for their use cases. This is why QnABot also integrates with any other LLM using an AWS Lambda function that you provide. To help you get started, we’ve also released a set of sample one-click deployable Lambda functions (plugins) to integrate QnABot with your choice of leading LLM providers, including our own Amazon Bedrock service and APIs from third-party providers, Anthropic and AI21.

In this post, we introduce the new Generative AI features for QnABot and walk through a tutorial to create, deploy, and customize QnABot to use these features. We also discuss some relevant use cases.

New Generative AI features

Using the LLM, QnABot now has two new important features, which we discuss in this section.

Generate answers to questions from Amazon Kendra search results or text passages

QnABot can now generate concise answers to questions from document extracts provided by an Amazon Kendra search, or text passages created or imported directly. This provides the following advantages:

  • The number of FAQs that you need to maintain and import into QnABot is reduced, because you can now synthesize concise answers on the fly from your existing documents.
  • Generated answers can be modified to create the best experience for the intended channel. For example, you can set the answers to be short, concise, and suitable for voice channel contact center bots, and website or text bots could potentially provide more detailed information.
  • Generated answers are fully compatible with QnABot’s multi-language support—users can interact in their chosen languages and receive generated answers in the same language.
  • Generated answers can include links to the reference documents and context passages used, to provide attribution and transparency on how the LLM constructed the answers.

For example, when asked “What is Amazon Lex?”, QnABot can retrieve relevant passages from an Amazon Kendra index (containing AWS documentation). QnABot then asks (prompts) the LLM to answer the question based on the context of the passages (which can also optionally be viewed in the web client). The following screenshot shows an example.

Disambiguate follow-up questions that rely on preceding conversation context

Understanding the direction and context of an ever-evolving conversation is key to building natural, human-like conversational interfaces. User queries often require a bot to interpret requests based on conversation memory and context. Now QnABot will ask the LLM to generate a disambiguated question based on the conversation history. This can then be used as a search query to retrieve the FAQs, passages, or Amazon Kendra results to answer the user’s question. The following is an example chat history:

Human: What is Amazon Lex?
AI: "Amazon Lex is an AWS service for building conversational interfaces for applications using voice and text..."
Human: Can it integrate with my CRM?

QnABot uses the LLM to rewrite the follow-up question to make “it” unambiguous, for example, “Can Amazon Lex integrate with my CRM system?” This allows users to interact like they would in a human conversation, and QnABot generates clear search queries to find the relevant FAQs or document passages that have the information to answer the user’s question.
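
To make the rewrite step concrete, here is a minimal sketch of how a disambiguation prompt might be assembled from the chat history. The template wording and the names `TEMPLATE` and `build_disambiguation_prompt` are hypothetical and are not QnABot's actual default prompt; in QnABot the real prompt is controlled by the LLM_GENERATE_QUERY_PROMPT_TEMPLATE setting.

```python
from typing import List, Tuple

# Hypothetical template for illustration only; not QnABot's default prompt.
TEMPLATE = (
    "Given the following conversation and a follow-up question, rephrase the "
    "follow-up question to be a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up question: {input}\n"
    "Standalone question:"
)


def build_disambiguation_prompt(history: List[Tuple[str, str]], user_input: str) -> str:
    """Format (speaker, utterance) turns and the new question into one prompt."""
    history_text = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in history)
    return TEMPLATE.format(history=history_text, input=user_input)


chat_history = [
    ("Human", "What is Amazon Lex?"),
    ("AI", "Amazon Lex is an AWS service for building conversational interfaces..."),
]
prompt = build_disambiguation_prompt(chat_history, "Can it integrate with my CRM?")
print(prompt)
```

The LLM's completion of this prompt would then serve as the search query in place of the raw follow-up question.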

These new features make QnABot more conversational and provide the ability to dynamically generate responses based on a knowledge base. This is still an experimental feature with tremendous potential. We strongly encourage users to experiment to find the best LLM and corresponding prompts and model parameters to use. QnABot makes it straightforward to experiment!

Tutorial

Time to try it! Let’s deploy the latest QnABot (v5.4.0 or later) and enable the new Generative AI features. The high-level steps are as follows:

  1. Create and populate an Amazon Kendra index.
  2. Choose and deploy an LLM plugin (optional).
  3. Deploy QnABot.
  4. Configure QnABot for your Lambda plugin (if using a plugin).
  5. Access the QnABot web client and start experimenting.
  6. Customize behavior using QnABot settings.
  7. Add curated Q&As and text passages to the knowledge base.

Create and populate an Amazon Kendra Index

Download and use the following AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and SageMaker. Deploying the stack requires about 30 minutes followed by about 15 minutes to synchronize it and ingest the data in the index.

When the Amazon Kendra index stack is successfully deployed, navigate to the stack’s Outputs tab and note the Index Id, which you will use later when deploying QnABot.

Alternatively, if you already have an Amazon Kendra index with your own content, you can use it instead with your own example questions for the tutorial.

Choose and deploy an LLM plugin (optional)

QnABot can deploy a built-in LLM (Falcon-40B-instruct on SageMaker) or use Lambda functions to call any other LLMs of your choice. In this section, we show you how to use the Lambda option with a pre-built sample Lambda function. Skip to the next step if you want to use the built-in LLM instead.

First, choose the plugin LLM you want to use. Review your options from the qnabot-on-aws-plugin-samples repository README. As of this writing, plugins are available for Amazon Bedrock (in preview), and for AI21 and Anthropic third-party APIs. We expect to add more sample plugins over time.

Deploy your chosen plugin by choosing Launch Stack in the Deploy a new Plugin stack section, which will deploy into the us-east-1 Region by default (to deploy in other Regions, see Build and Publish QnABot Plugins CloudFormation artifacts).

When the Plugin stack is successfully deployed, navigate to the stack’s Outputs tab (see the following screenshot) and inspect its contents, which you will use in the following steps to deploy and configure QnABot. Keep this tab open in your browser.

Deploy QnABot

Choose Launch Solution from the QnABot implementation guide to deploy the latest QnABot template via AWS CloudFormation. Provide the following parameters:

  • For DefaultKendraIndexId, use the Amazon Kendra Index ID (a GUID) you collected earlier
  • For EmbeddingsApi (see Semantic Search using Text Embeddings), choose one of the following:
    • SAGEMAKER (the default built-in embeddings model)
    • LAMBDA (to use the Amazon Bedrock embeddings API with the BEDROCK-EMBEDDINGS-AND-LLM Plugin)
      • For EmbeddingsLambdaArn, use the EmbeddingsLambdaArn output value from your BEDROCK-EMBEDDINGS-AND-LLM Plugin stack.
  • For LLMApi (see Query Disambiguation for Conversational Retrieval, and Generative Question Answering), choose one of the following:
    • SAGEMAKER (the default built-in LLM model)
    • LAMBDA (to use the LLM Plugin deployed earlier)
      • For LLMLambdaArn, use the LLMLambdaArn output value from your Plugin stack

For all other parameters, accept the defaults (see the implementation guide for parameter definitions), and proceed to launch the QnABot stack.
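
If you prefer to script the deployment, the parameter choices above can be captured as a CloudFormation-style parameter list. The following is a sketch only: the helper name `qnabot_stack_parameters` and the placeholder ID and ARN values are hypothetical, and only the parameter names come from the list above.

```python
from typing import Dict, List, Optional


def qnabot_stack_parameters(
    kendra_index_id: str,
    llm_lambda_arn: Optional[str] = None,
    embeddings_lambda_arn: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build a CloudFormation-style parameter list for the QnABot stack.

    Passing a Lambda ARN selects the LAMBDA option for that API;
    otherwise the built-in SAGEMAKER option is used.
    """
    values = {
        "DefaultKendraIndexId": kendra_index_id,
        "EmbeddingsApi": "LAMBDA" if embeddings_lambda_arn else "SAGEMAKER",
        "LLMApi": "LAMBDA" if llm_lambda_arn else "SAGEMAKER",
    }
    if embeddings_lambda_arn:
        values["EmbeddingsLambdaArn"] = embeddings_lambda_arn
    if llm_lambda_arn:
        values["LLMLambdaArn"] = llm_lambda_arn
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


# Placeholder values; substitute the outputs from your own stacks.
params = qnabot_stack_parameters(
    kendra_index_id="00000000-0000-0000-0000-000000000000",
    llm_lambda_arn="arn:aws:lambda:us-east-1:123456789012:function:example-llm-plugin",
)
for p in params:
    print(p)
```

A list in this ParameterKey/ParameterValue shape is what the `Parameters` argument of the boto3 CloudFormation `create_stack` call accepts.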

Configure QnABot for your Lambda plugin (if using a plugin)

If you deployed QnABot using a sample LLM Lambda plugin to access a different LLM, update the QnABot model parameters and prompt template settings as recommended for your chosen plugin. For more information, see Update QnABot Settings. If you used the SageMaker (built-in) LLM option, skip to the next step, because the settings are already configured for you.

Access the QnABot web client and start experimenting

On the AWS CloudFormation console, choose the Outputs tab of the QnABot CloudFormation stack and choose the ClientURL link. Alternatively, launch the client by choosing QnABot on AWS Client from the Content Designer tools menu.

Now, try to ask questions related to AWS services, for example:

  • What is Amazon Lex?
  • How does SageMaker scale up inference workloads?
  • Is Kendra a search service?

Then you can ask follow-up questions without specifying the previously mentioned services or context, for example:

  • Is it secure?
  • Does it scale?

Customize behavior using QnABot settings

You can customize many settings on the QnABot Content Designer Settings page—see README – LLM Settings for a full list of relevant settings. For example, try the following:

  • Set ENABLE_DEBUG_RESPONSES to TRUE, save the settings, and try the previous questions again. Now you will see additional debug output at the top of each response, showing you how the LLM generates the Amazon Kendra search query based on the chat history, how long the LLM inferences took to run, and more. For example:
    [User Input: "Is it fast?", LLM generated query (1207 ms): "Does Amazon Kendra provide search results quickly?", Search string: "Is it fast? / Does Amazon Kendra provide search results quickly?"["LLM: LAMBDA"], Source: KENDRA RETRIEVE API

  • Set ENABLE_DEBUG_RESPONSES back to FALSE, set LLM_QA_SHOW_CONTEXT_TEXT and LLM_QA_SHOW_SOURCE_LINKS to FALSE, and try the examples again. Now the context and sources links are not shown, and the output contains only the LLM-generated response.
  • If you feel adventurous, experiment also with the LLM prompt template settings—LLM_GENERATE_QUERY_PROMPT_TEMPLATE and LLM_QA_PROMPT_TEMPLATE. Refer to README – LLM Settings to see how you can use placeholders for runtime values like chat history, context, user input, query, and more. Note that the default prompts can most likely be improved and customized to better suit your use cases, so don’t be afraid to experiment! If you break something, you can always revert to the default settings using the RESET TO DEFAULTS option on the settings page.

Add curated Q&As and text passages to the knowledge base

QnABot can, of course, continue to answer questions based on curated Q&As. It can also use the LLM to generate answers from text passages created or imported directly into QnABot, in addition to using Amazon Kendra index.

QnABot attempts to find a good answer to the disambiguated user question in the following sequence:

  1. QnA items
  2. Text passage items
  3. Amazon Kendra index
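
The lookup order above amounts to a fallback chain. The sketch below illustrates the idea with stand-in retriever functions; all names here are hypothetical, and the real lookups inside QnABot query its OpenSearch index and Amazon Kendra rather than these toy functions.

```python
from typing import Callable, List, Optional, Tuple

# A retriever takes the disambiguated question and returns an answer or None.
Retriever = Callable[[str], Optional[str]]


def answer_question(
    question: str, sources: List[Tuple[str, Retriever]]
) -> Tuple[str, Optional[str]]:
    """Try each source in order and return (source_name, answer) for the first hit."""
    for name, retrieve in sources:
        answer = retrieve(question)
        if answer is not None:
            return name, answer
    return "no_hits", None


# Stand-in retrievers for illustration only.
def search_qna_items(question: str) -> Optional[str]:
    return None


def search_text_passages(question: str) -> Optional[str]:
    return "Humpty Dumpty sat on a wall." if "Humpty" in question else None


def search_kendra(question: str) -> Optional[str]:
    return "A passage from the Kendra index."


sources = [
    ("QnA items", search_qna_items),
    ("Text passage items", search_text_passages),
    ("Amazon Kendra index", search_kendra),
]
print(answer_question("Where did Humpty Dumpty sit?", sources))
print(answer_question("What is Amazon Lex?", sources))
```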

Let’s try some examples.

On the QnABot Content Designer tools menu, choose Import, then load the two example packages:

  • TextPassages-NurseryRhymeExamples
  • blog-samples-final

QnABot can use text embeddings to provide semantic search capability (using QnABot’s built-in OpenSearch index as a vector store), which improves accuracy and reduces question tuning, compared to standard OpenSearch keyword based matching. To illustrate this, try questions like the following:

  • “Tell me about the Alexa device with the screen”
  • “Tell me about Amazon’s video streaming device?”

These should ideally match the sample QnA items you imported, even though the words used to ask the question are poor keyword matches (but good semantic matches) with the configured QnA items: Alexa.001 (What is an Amazon Echo Show) and FireTV.001 (What is an Amazon Fire TV).

Even if you are not (yet) using Amazon Kendra (and you should!), QnABot can also answer questions based on passages created or imported into Content Designer. The following questions (and follow-up questions) are all answered from an imported text passage item that contains the nursery rhyme 0.HumptyDumpty:

  • “Where did Humpty Dumpty sit before he fell?”
  • “What happened after he fell? Was he OK?”

When using embeddings, a good answer is an answer that returns a similarity score above the threshold defined by the corresponding threshold setting. See Semantic question matching, using Large Language Model Text Embeddings for more details on how to test and tune the threshold settings.
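
As a rough illustration of threshold-based matching, the sketch below compares toy embedding vectors and accepts a match only when the score exceeds a threshold. Cosine similarity, the toy vectors, and the threshold value are assumptions made for illustration; QnABot's actual scores come from its OpenSearch index and its own threshold settings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_good_answer(question_vec: Sequence[float], item_vec: Sequence[float], threshold: float) -> bool:
    """A match counts as good only when its similarity is above the threshold."""
    return cosine_similarity(question_vec, item_vec) > threshold


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
question = [1.0, 0.0]
close_item = [0.9, 0.1]
unrelated_item = [0.0, 1.0]
THRESHOLD = 0.8  # hypothetical value; tune it using the TEST tab in Content Designer

print(is_good_answer(question, close_item, THRESHOLD))
print(is_good_answer(question, unrelated_item, THRESHOLD))
```

Raising the threshold trades recall for precision: fewer items match, but the ones that do are closer in meaning to the question.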

If there are no good answers, or if the LLM’s response matches the regular expression defined in LLM_QA_NO_HITS_REGEX, then QnABot invokes the configurable Custom Don’t Know (no_hits) behavior, which, by default, returns a message saying “You stumped me.”
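
The no-hits check can be pictured as a regular expression test on the LLM's response. In this sketch the pattern and the function name `resolve_response` are hypothetical; the pattern QnABot applies is whatever you configure in LLM_QA_NO_HITS_REGEX.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; use the LLM_QA_NO_HITS_REGEX value from your settings.
NO_HITS_REGEX = re.compile(r"sorry,?\s+i\s+don'?t\s+know", re.IGNORECASE)


def resolve_response(llm_response: Optional[str], no_hits_message: str = "You stumped me.") -> str:
    """Return the LLM response, or the no_hits message if there is no usable answer."""
    if llm_response is None or NO_HITS_REGEX.search(llm_response):
        return no_hits_message
    return llm_response


print(resolve_response("Sorry, I don't know."))
print(resolve_response("Amazon Lex is a service for building conversational interfaces."))
```

For this to work, the QA prompt needs to instruct the LLM to reply with a recognizable phrase when the context doesn't contain the answer, so that the regular expression has something consistent to match.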

Try some experiments by creating Q&As or text passage items in QnABot, as well as using an Amazon Kendra index for fallback generative answers. Experiment (using the TEST tab in the designer) to find the best values to use for the embedding threshold settings to get the behavior you want. It’s hard to get the perfect balance, but see if you can find a good enough balance that results in useful answers most of the time.

Clean up

You can, of course, leave QnABot running to experiment with it and show it to your colleagues! But it does incur some cost—see Plan your deployment – Cost for more details. To remove the resources and avoid costs, delete the following CloudFormation stacks:

  • QnABot stack
  • LLM Plugin stack (if applicable)
  • Amazon Kendra index stack

Use case examples

These new features make QnABot relevant for many customer use cases such as self-service customer service and support bots and automated web-based Q&A bots. We discuss two such use cases in this section.

Integrate with a contact center

QnABot’s automated question answering capabilities deliver effective self-service for inbound voice calls in contact centers, with compelling outcomes. For example, see how Kentucky Transportation Cabinet reduced call hold time and improved customer experience with self-service virtual agents using Amazon Connect and Amazon Lex. Integrating the new generative AI features strengthens this value proposition further by dynamically generating reliable answers from existing content such as documents, knowledge bases, and websites. This eliminates the need for bot designers to anticipate and manually curate responses to every possible question that a user might ask. To integrate QnABot with Amazon Connect, see Connecting QnABot on AWS to an Amazon Connect call center. To integrate with other contact centers, see how Amazon Chime SDK can be used to connect Amazon Lex voice bots with third-party contact centers via SIPREC and Build an AI-powered virtual agent for Genesys Cloud using QnABot and Amazon Lex.

The LLM-powered QnABot can also play a pivotal role as an automated real-time agent assistant. In this solution, QnABot passively listens to the conversation and uses the LLM to generate real-time suggestions for the human agents based on certain cues. It’s straightforward to set up and try—give it a go! This solution can be utilized with both Amazon Connect and other on-prem and cloud contact centers. For more information, see Live call analytics and agent assist for your contact center with Amazon language AI services.

Integrate with a website

Embedding QnABot in your websites and applications allows users to get automated assistance with natural dialogue. For more information, see Deploy a Web UI for your Chatbot. For curated Q&A content, use markdown syntax and UI buttons and incorporate links, images, videos, and other dynamic elements that inform and delight your users. Integrate the QnABot Amazon Lex web UI with Amazon Connect live chat to facilitate quick escalation to human agents when the automated assistant cannot fully address a user’s inquiry on its own.

The QnABot on AWS plugin samples repository

As shown in this post, QnABot v5.4.0+ not only offers built-in support for embeddings and LLMs hosted on SageMaker, but it also offers the ability to easily integrate with any other LLM by using Lambda functions. You can author your own custom Lambda functions or get started faster with one of the samples we have provided in our new qnabot-on-aws-plugin-samples repository.

This repository includes a ready-to-deploy plugin for Amazon Bedrock, which supports both embeddings and text generation requests. At the time of writing, Amazon Bedrock is available through private preview—you can request preview access. When Amazon Bedrock is generally available, we expect to integrate it directly with QnABot, but why wait? Apply for preview access and use our sample plugin to start experimenting!

Today’s LLM innovation cycle is driving a breakneck pace of new model releases, each aiming to surpass the last. This repository will expand to include additional QnABot plugin samples over time. As of this writing, we have support for two third-party model providers: Anthropic and AI21. We plan to add integrations for more LLMs, embeddings, and potentially common use case examples involving Lambda hooks and knowledge bases. These plugins are offered as-is without warranty, for your convenience—users are responsible for supporting and maintaining them once deployed.

We hope that the QnABot plugins repository will mature into a thriving open-source community project. Watch the qnabot-on-aws-plugin-samples GitHub repo to receive updates on new plugins and features, use the Issues forum to report problems or provide feedback, and contribute improvements via pull requests. Contributions are welcome!

Conclusion

In this post, we introduced the new generative AI features for QnABot and walked through a solution to create, deploy, and customize QnABot to use these features. We also discussed some relevant use cases. Automating repetitive inquiries frees up human workers and boosts productivity. Rich responses create engaging experiences. Deploying the LLM-powered QnABot can help you elevate the self-service experience for customers and employees.

Don’t miss this opportunity—get started today and revolutionize the user experience on your QnABot deployment!


About the authors

Clevester Teo is a Senior Partner Solutions Architect at AWS, focused on the Public Sector partner ecosystem. He enjoys building prototypes, staying active outdoors, and experiencing new cuisines. Clevester is passionate about experimenting with emerging technologies and helping AWS partners innovate and better serve public sector customers.

Windrich is a Solutions Architect at AWS who works with customers in industries such as finance and transport, to help accelerate their cloud adoption journey. He is especially interested in serverless technologies and how customers can leverage them to bring value to their business. Outside of work, Windrich enjoys playing and watching sports, as well as exploring different cuisines around the world.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Automatically generate impressions from findings in radiology reports using generative AI on AWS

Radiology reports are comprehensive, lengthy documents that describe and interpret the results of a radiological imaging examination. In a typical workflow, the radiologist supervises, reads, and interprets the images, and then concisely summarizes the key findings. The summarization (or impression) is the most important part of the report because it helps clinicians and patients focus on the critical contents of the report that contain information for clinical decision-making. Creating a clear and impactful impression involves much more effort than simply restating the findings. The entire process is therefore laborious, time consuming, and prone to error. It often takes years of training for doctors to accumulate enough expertise in writing concise and informative radiology report summarizations, further highlighting the significance of automating the process. Additionally, automatic generation of report findings summarization is critical for radiology reporting. It enables translation of reports into human readable language, thereby alleviating the patients’ burden of reading through lengthy and obscure reports.

To solve this problem, we propose the use of generative AI, a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Generative AI is powered by machine learning (ML) models—very large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). Recent advancements in ML (specifically the invention of the transformer-based neural network architecture) have led to the rise of models that contain billions of parameters or variables. The proposed solution in this post uses fine-tuning of pre-trained large language models (LLMs) to help generate summarizations based on findings in radiology reports.

This post demonstrates a strategy for fine-tuning publicly available LLMs for the task of radiology report summarization using AWS services. LLMs have demonstrated remarkable capabilities in natural language understanding and generation, serving as foundation models that can be adapted to various domains and tasks. There are significant benefits to using a pre-trained model. It reduces computation costs, reduces carbon footprints, and allows you to use state-of-the-art models without having to train one from scratch.

Our solution uses the FLAN-T5 XL FM, available through Amazon SageMaker JumpStart, which is an ML hub offering algorithms, models, and ML solutions. We demonstrate how to accomplish this using a notebook in Amazon SageMaker Studio. Fine-tuning a pre-trained model involves further training on specific data to improve performance on a different but related task. This solution involves fine-tuning the FLAN-T5 XL model, which is an enhanced version of the T5 (Text-to-Text Transfer Transformer) family of general-purpose LLMs. T5 reframes natural language processing (NLP) tasks into a unified text-to-text format, in contrast to BERT-style models that can only output either a class label or a span of the input. In this solution, the model is fine-tuned for a summarization task on 91,544 free-text radiology reports obtained from the MIMIC-CXR dataset.

Overview of solution

In this section, we discuss the key components of our solution: choosing the strategy for the task, fine-tuning an LLM, and evaluating the results. We also illustrate the solution architecture and the steps to implement the solution.

Identify the strategy for the task

There are various strategies to approach the task of automating clinical report summarization. For example, we could use a specialized language model pre-trained on clinical reports from scratch. Alternatively, we could directly fine-tune a publicly available general-purpose language model to perform the clinical task. Using a fine-tuned domain-agnostic model may be necessary in settings where training a language model from scratch is too costly. In this solution, we demonstrate the latter approach of using a FLAN-T5 XL model, which we fine-tune for the clinical task of summarization of radiology reports. The following diagram illustrates the model workflow.

A typical radiology report is well-organized and succinct. Such reports often have three key sections:

  • Background – Provides general information about the patient, including demographics, clinical history, relevant medical history, and details of the exam procedures
  • Findings – Presents detailed exam diagnosis and results
  • Impression – Concisely summarizes the most salient findings or interpretation of the findings with an assessment of significance and potential diagnosis based on the observed abnormalities

Using the findings section in the radiology reports, the solution generates the impression section, which corresponds to the doctors’ summarization. The following figure is an example of a radiology report.

Fine-tune a general-purpose LLM for a clinical task

In this solution, we fine-tune a FLAN-T5 XL model (tuning all the parameters of the model and optimizing them for the task). We fine-tune the model using the clinical domain dataset MIMIC-CXR, which is a publicly available dataset of chest radiographs. To fine-tune this model through SageMaker JumpStart, labeled examples must be provided in the form of {prompt, completion} pairs. In this case, we use pairs of {Findings, Impression} from the original reports in the MIMIC-CXR dataset. For inferencing, we use a prompt as shown in the following example:

The model is fine-tuned on an accelerated computing ml.p3.16xlarge instance with 64 virtual CPUs and 488 GiB memory. For validation, 5% of the dataset was randomly selected. The elapsed time of the SageMaker training job with fine-tuning was 38,468 seconds (approximately 11 hours).
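
The {Findings, Impression} pairing and the 5% validation hold-out described above might be sketched as follows. The field names `findings` and `impression`, the helper names, and the synthetic reports are assumptions made for illustration; the exact file layout that SageMaker JumpStart expects for training data is not shown here.

```python
import json
import random
from typing import Dict, List, Tuple

Pair = Dict[str, str]


def to_pairs(reports: List[Dict[str, str]]) -> List[Pair]:
    """Keep reports that have both sections and map them to prompt/completion pairs."""
    return [
        {"prompt": r["findings"].strip(), "completion": r["impression"].strip()}
        for r in reports
        if r.get("findings") and r.get("impression")
    ]


def split_train_validation(
    pairs: List[Pair], validation_fraction: float = 0.05, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle a copy of the pairs and hold out a random fraction for validation."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Synthetic placeholder reports, not real MIMIC-CXR data.
reports = [
    {"findings": f"Synthetic finding {i}.", "impression": f"Synthetic impression {i}."}
    for i in range(100)
]
reports.append({"findings": "Finding without impression.", "impression": ""})

pairs = to_pairs(reports)
train, validation = split_train_validation(pairs)
print(len(train), len(validation))
print(json.dumps(train[0]))
```

Reports missing either section are dropped before the split, because a pair without a completion gives the model nothing to learn from.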

Evaluate the results

When the training is complete, it’s critical to evaluate the results. For a quantitative analysis of the generated impression, we use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), the most commonly used metric for evaluating summarization. This metric compares an automatically produced summary or translation against one or more human-produced references. ROUGE1 refers to the overlap of unigrams (each word) between the candidate (the model’s output) and reference summaries. ROUGE2 refers to the overlap of bigrams (two words) between the candidate and reference summaries. ROUGEL is a sentence-level metric and refers to the longest common subsequence (LCS) between two pieces of text. It ignores newlines in the text. ROUGELsum is a summary-level metric. For this metric, newlines in the text aren’t ignored but are interpreted as sentence boundaries. The LCS is then computed between each pair of reference and candidate sentences, and then union-LCS is computed. For aggregation of these scores over a given set of reference and candidate sentences, the average is computed.
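
To make these definitions concrete, the following simplified sketch computes ROUGE1, ROUGE2, and ROUGEL for a single candidate and reference. It is not the evaluation code used in this post: it tokenizes on whitespace, skips stemming, omits ROUGELsum, and reports the F-measure, which is an assumption because the post doesn't say whether precision, recall, or F-measure is reported. The example sentences are made up.

```python
from collections import Counter
from typing import List


def f_measure(overlap: int, candidate_len: int, reference_len: int) -> float:
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_len
    recall = overlap / reference_len
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: List[str], reference: List[str], n: int) -> float:
    """N-gram overlap: n=1 gives ROUGE1 (unigrams), n=2 gives ROUGE2 (bigrams)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())  # clipped count of shared n-grams
    return f_measure(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str]) -> float:
    """ROUGEL: F-measure based on the longest common subsequence."""
    return f_measure(lcs_length(candidate, reference), len(candidate), len(reference))


reference = "no acute cardiopulmonary process".split()
candidate = "no acute process".split()
print(round(rouge_n(candidate, reference, 1), 4))
print(round(rouge_n(candidate, reference, 2), 4))
print(round(rouge_l(candidate, reference), 4))
```

In this example the candidate shares all three of its words with the reference but only one of its two bigrams, so ROUGE2 is noticeably lower than ROUGE1 and ROUGEL.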

Walkthrough and architecture

The overall solution architecture as shown in the following figure primarily consists of a model development environment that uses SageMaker Studio, model deployment with a SageMaker endpoint, and a reporting dashboard using Amazon QuickSight.

In the following sections, we demonstrate fine-tuning an LLM available on SageMaker JumpStart for summarization of a domain-specific task via the SageMaker Python SDK. In particular, we discuss the following topics:

  • Steps to set up the development environment
  • An overview of the radiology report datasets on which the model is fine-tuned and evaluated
  • A demonstration of fine-tuning the FLAN-T5 XL model using SageMaker JumpStart programmatically with the SageMaker Python SDK
  • Inferencing and evaluation of the pre-trained and fine-tuned models
  • Comparison of results from pre-trained model and fine-tuned models

The solution is available in the Generating Radiology Report Impression using generative AI with Large Language Model on AWS GitHub repo.

Prerequisites

To get started, you need an AWS account in which you can use SageMaker Studio. You will need to create a user profile for SageMaker Studio if you don’t already have one.

The training instance type used in this post is ml.p3.16xlarge. Note that the p3 instance type requires a service quota increase.

The MIMIC CXR dataset can be accessed through a data use agreement, which requires user registration and completion of a credentialing process.

Set up the development environment

To set up your development environment, you create an S3 bucket, configure a notebook, create endpoints and deploy the models, and create a QuickSight dashboard.

Create an S3 bucket

Create an S3 bucket called llm-radiology-bucket to host the training and evaluation datasets. This will also be used to store the model artifact during model development.

Configure a notebook

Complete the following steps:

  1. Launch SageMaker Studio from either the SageMaker console or the AWS Command Line Interface (AWS CLI).

For more information about onboarding to a domain, see Onboard to Amazon SageMaker Domain.

  2. Create a new SageMaker Studio notebook for cleaning the report data and fine-tuning the model. We use an ml.t3.medium 2vCPU+4GiB notebook instance with a Python 3 kernel.
  3. Within the notebook, install the relevant packages such as nest-asyncio, IPyWidgets (for interactive widgets for Jupyter notebook), and the SageMaker Python SDK:
!pip install nest-asyncio==1.5.5 --quiet 
!pip install ipywidgets==8.0.4 --quiet 
!pip install sagemaker==2.148.0 --quiet

Create endpoints and deploy the models for inference

For inferencing the pre-trained and fine-tuned models, create an endpoint and deploy each model in the notebook as follows:

  1. Create a model object from the Model class that can be deployed to an HTTPS endpoint.
  2. Create an HTTPS endpoint with the model object’s pre-built deploy() method:
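from sagemaker import image_uris, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

# model_id, model_version, aws_region, aws_role, and inference_instance_type
# are assumed to be defined earlier in the notebook.

# Retrieve the inference Docker image URI (deploy_image_uri is used below)
deploy_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)

# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)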

large_model_env = {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"}

pre_trained_name = name_from_base(f"jumpstart-demo-pre-trained-{model_id}")

# Create the SageMaker model instance of the pre-trained model
if ("small" in model_id) or ("base" in model_id):
    deploy_source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope="inference"
    )
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        source_dir=deploy_source_uri,
        entry_point="inference.py",
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
    )
else:
    # For those large models, we already repack the inference script and model
    # artifacts for you, so the `source_dir` argument to Model is not required.
    pre_trained_model = Model(
        image_uri=deploy_image_uri,
        model_data=pre_trained_model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=pre_trained_name,
        env=large_model_env,
    )

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=pre_trained_name,
)

Create a QuickSight dashboard

Create a QuickSight dashboard with an Athena data source with inference results in Amazon Simple Storage Service (Amazon S3) to compare the inference results with the ground truth. The following screenshot shows our example dashboard.

Radiology report datasets

The model is fine-tuned on 91,544 reports downloaded from the MIMIC-CXR v2.0 dataset, with all model parameters updated during training. Because we used only the radiology report text data, we downloaded just one compressed report file (mimic-cxr-reports.zip) from the MIMIC-CXR website. We evaluate the fine-tuned model on 2,000 reports (referred to as the dev1 dataset) from a separate held-out subset of this dataset. We use another 2,000 radiology reports (referred to as dev2), taken from the chest X-ray collection of the Indiana University hospital network, as a second evaluation set. All the datasets are read as JSON files and uploaded to the newly created S3 bucket llm-radiology-bucket. Note that none of the datasets contain Protected Health Information (PHI) by default; the providers replace all sensitive information with three consecutive underscores (___).

Fine-tune with the SageMaker Python SDK

For fine-tuning, the model_id is specified as huggingface-text2text-flan-t5-xl from the list of SageMaker JumpStart models. The training_instance_type is set as ml.p3.16xlarge and the inference_instance_type as ml.g5.2xlarge. The training data in JSON format is read from the S3 bucket. The next step is to use the selected model_id to extract the SageMaker JumpStart resource URIs, including image_uri (the Amazon Elastic Container Registry (Amazon ECR) URI for the Docker image), model_uri (the pre-trained model artifact Amazon S3 URI), and script_uri (the training script):

from sagemaker import image_uris, model_uris, script_uris

# Training instance will use this image
train_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Pre-trained model
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Script to execute on the training instance
train_script_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

output_location = f"s3://{output_bucket}/demo-llm-rad-fine-tune-flan-t5/"

Also, an output location is set up as a folder within the S3 bucket.

Only one hyperparameter, epochs, is changed (to 3); all others keep their default values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# We will override some default hyperparameters with custom values
hyperparameters["epochs"] = "3"
print(hyperparameters)

The training metrics such as eval_loss (for validation loss), loss (for training loss), and epoch to be tracked are defined and listed:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

model_name = "-".join(model_id.split("-")[2:])  # get the most informative part of ID
training_job_name = name_from_base(f"js-demo-{model_name}-{hyperparameters['epochs']}")
print(f"{bold}job name:{unbold} {training_job_name}")

# Raw strings keep the regex escape \. from being treated as a string escape
training_metric_definitions = [
    {"Name": "val_loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
    {"Name": "train_loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": r"'epoch': ([0-9\.]+)"},
]
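To see what these patterns capture, the following minimal sketch applies them to two sample log lines using Python's re module. The log lines are hypothetical and assume only that the training script prints metrics as a Python-style dictionary; the real log format may differ.

```python
import re

# Same patterns as in training_metric_definitions, written as raw strings
metric_definitions = [
    {"Name": "val_loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
    {"Name": "train_loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "epoch", "Regex": r"'epoch': ([0-9\.]+)"},
]

# Hypothetical log lines, for illustration only
sample_logs = [
    "{'loss': 1.8342, 'learning_rate': 0.0001, 'epoch': 0.5}",
    "{'eval_loss': 1.2071, 'eval_runtime': 12.3, 'epoch': 1.0}",
]


def extract_metrics(line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


for line in sample_logs:
    print(extract_metrics(line, metric_definitions))
```

Because the train_loss pattern includes the opening quote, it does not match inside 'eval_loss', so training and validation losses stay separate.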

We use the SageMaker JumpStart resource URIs (image_uri, model_uri, script_uri) identified earlier to create an estimator and fine-tune it on the training dataset by specifying the S3 path of the dataset. The Estimator class requires an entry_point parameter. In this case, JumpStart uses transfer_learning.py. The training job fails to run if this value is not set.

# Create SageMaker Estimator instance
sm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    model_uri=train_model_uri,
    source_dir=train_script_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    volume_size=300,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=output_location,
    metric_definitions=training_metric_definitions,
)

# Launch a SageMaker training job over data located in the given S3 path
# Training jobs can take hours. wait=True blocks until the job finishes;
# set wait=False to return immediately and monitor the job status on the SageMaker console
sm_estimator.fit({"training": train_data_location}, job_name=training_job_name, wait=True)

This training job can take hours to complete; therefore, it’s recommended to set the wait parameter to False and monitor the training job status on the SageMaker console. Use the TrainingJobAnalytics function to keep track of the training metrics at various timestamps:

from sagemaker import TrainingJobAnalytics

# Wait for a couple of minutes for the job to start before running this cell
# This can be called while the job is still running
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()

Deploy inference endpoints

In order to draw comparisons, we deploy inference endpoints for both the pre-trained and fine-tuned models.

First, retrieve the inference Docker image URI using model_id, and use this URI to create a SageMaker model instance of the pre-trained model. Deploy the pre-trained model by creating an HTTPS endpoint with the model object’s pre-built deploy() method. In order to run inference through SageMaker API, make sure to pass the Predictor class.

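from sagemaker import image_uris, model_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
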
# Retrieve the inference docker image URI. This is the base HuggingFace container image
deploy_image_uri = image_uris.retrieve(
    region=aws_region,
    framework=None,  # automatically inferred from model_id
    model_id=model_id,
    model_version=model_version,
    image_scope="inference",
    instance_type=inference_instance_type,
)

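# Retrieve the URI of the pre-trained model
pre_trained_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

# Single-worker settings for large models, and a unique name for the model and endpoint
large_model_env = {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"}
pre_trained_name = name_from_base(f"jumpstart-demo-pre-trained-{model_id}")

# Create the SageMaker model instance of the pre-trained model
pre_trained_model = Model(
    image_uri=deploy_image_uri,
    model_data=pre_trained_model_uri,
    role=aws_role,
    predictor_cls=Predictor,
    name=pre_trained_name,
    env=large_model_env,
)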

# Deploy the pre-trained model. Note that we need to pass Predictor class when we deploy model
# through Model class, for being able to run inference through the SageMaker API
pre_trained_predictor = pre_trained_model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=pre_trained_name,
)

Repeat the preceding step to create a SageMaker model instance of the fine-tuned model and create an endpoint to deploy the model.

Evaluate the models

First, set the length of summarized text, number of model outputs (should be greater than 1 if multiple summaries need to be generated), and number of beams for beam search.

Construct the inference request as a JSON payload and use it to query the endpoints for the pre-trained and fine-tuned models.
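As a rough illustration of this step, the following sketch builds the JSON payload and sends it through a predictor object. The key names text_inputs, max_length, num_return_sequences, and num_beams, as well as the generated_texts response key, are assumptions about the endpoint's request and response schema, and the default values are placeholders rather than the settings used in this post.

```python
import json


def build_payload(report_text, max_length=50, num_return_sequences=1, num_beams=4):
    """Build the JSON request body for a text2text summarization endpoint.

    The key names are assumptions about the endpoint's request schema.
    """
    payload = {
        "text_inputs": report_text,
        "max_length": max_length,
        "num_return_sequences": num_return_sequences,
        "num_beams": num_beams,
    }
    return json.dumps(payload).encode("utf-8")


def query_endpoint(predictor, body):
    """Send the request through any object exposing predict(data, initial_args)."""
    return predictor.predict(
        body, {"ContentType": "application/json", "Accept": "application/json"}
    )


def parse_response(raw_response):
    """Decode the endpoint response; the 'generated_texts' key is an assumption."""
    return json.loads(raw_response)["generated_texts"]


# Build and inspect a request body without calling an endpoint
body = build_payload("Findings: ___ No acute cardiopulmonary process.")
print(body.decode("utf-8"))
```

The same body can be sent to both the pre-trained and fine-tuned predictors so that their outputs are compared on identical inputs.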

Compute the aggregated ROUGE scores (ROUGE1, ROUGE2, ROUGEL, ROUGELsum) as described earlier.
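To make the metric concrete, the following is a simplified, dependency-free sketch of ROUGE1, ROUGE2, and ROUGEL F1 scores averaged over a dataset. It assumes lowercase whitespace tokenization with no stemming and omits ROUGELsum, so it is not the scoring implementation used for the results in this post and its numbers may differ from those of a full ROUGE library.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """F1 from an overlap count and the prediction/reference totals."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence (LCS) of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    prev = [0] * (len(ref) + 1)
    for p_tok in pred:
        curr = [0]
        for j, r_tok in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if p_tok == r_tok else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(pred), len(ref))


def aggregate_rouge(predictions, references):
    """Average each score over all (prediction, reference) pairs (non-empty input)."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for pred, ref in zip(predictions, references):
        totals["rouge1"] += rouge_n(pred, ref, 1)
        totals["rouge2"] += rouge_n(pred, ref, 2)
        totals["rougeL"] += rouge_l(pred, ref)
    return {name: total / len(predictions) for name, total in totals.items()}


print(aggregate_rouge(
    ["no acute cardiopulmonary process"],
    ["no acute cardiopulmonary abnormality"],
))
```

In this example, three of the four unigrams and two of the three bigrams overlap, which is why ROUGE2 is lower than ROUGE1.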

Compare the results

The following table depicts the evaluation results for the dev1 and dev2 datasets. On dev1 (2,000 findings from the MIMIC CXR Radiology Report), the fine-tuned model improves the aggregated average ROUGE1 and ROUGE2 scores by approximately 38 and 37 percentage points, respectively, compared to the pre-trained model. For dev2, the improvements are approximately 31 and 25 percentage points in ROUGE1 and ROUGE2 scores. Overall, fine-tuning led to an improvement of 38.2 percentage points and 31.3 percentage points in ROUGELsum scores for the dev1 and dev2 datasets, respectively.

Evaluation dataset  Pre-trained model                      Fine-tuned model
                    ROUGE1  ROUGE2  ROUGEL  ROUGELsum      ROUGE1  ROUGE2  ROUGEL  ROUGELsum
dev1                0.2239  0.1134  0.1891  0.1891         0.6040  0.4800  0.5705  0.5708
dev2                0.1583  0.0599  0.1391  0.1393         0.4660  0.3125  0.4525  0.4525
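The percentage-point improvements can be recomputed directly from the table values. The following short sketch does the arithmetic; the scores are copied from the table, and the dictionary layout is only an illustration.

```python
# Aggregated ROUGE scores copied from the evaluation table
scores = {
    "dev1": {
        "pre_trained": {"ROUGE1": 0.2239, "ROUGE2": 0.1134, "ROUGEL": 0.1891, "ROUGELsum": 0.1891},
        "fine_tuned": {"ROUGE1": 0.6040, "ROUGE2": 0.4800, "ROUGEL": 0.5705, "ROUGELsum": 0.5708},
    },
    "dev2": {
        "pre_trained": {"ROUGE1": 0.1583, "ROUGE2": 0.0599, "ROUGEL": 0.1391, "ROUGELsum": 0.1393},
        "fine_tuned": {"ROUGE1": 0.4660, "ROUGE2": 0.3125, "ROUGEL": 0.4525, "ROUGELsum": 0.4525},
    },
}


def improvement_pp(pre_trained, fine_tuned):
    """Absolute improvement in percentage points, rounded to one decimal place."""
    return {
        metric: round((fine_tuned[metric] - pre_trained[metric]) * 100, 1)
        for metric in pre_trained
    }


for dataset, result in scores.items():
    print(dataset, improvement_pp(result["pre_trained"], result["fine_tuned"]))
```

These are absolute differences between scores on a 0 to 1 scale, not relative (percent) gains.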

The following box plots depict the distribution of ROUGE scores for the dev1 and dev2 datasets evaluated using the fine-tuned model.

(a): dev1 (b): dev2

The following table shows that the ROUGE scores for the evaluation datasets have approximately the same median and mean, which indicates that they are roughly symmetrically distributed.

Dataset  Score      Count  Mean    Std deviation  Minimum  25th percentile  50th percentile  75th percentile  Maximum
dev1     ROUGE1     2000   0.6038  0.3065         0.0000   0.3653           0.6000           0.9384           1.0000
dev1     ROUGE2     2000   0.4798  0.3578         0.0000   0.1818           0.4000           0.8571           1.0000
dev1     ROUGEL     2000   0.5706  0.3194         0.0000   0.3000           0.5345           0.9101           1.0000
dev1     ROUGELsum  2000   0.5706  0.3194         0.0000   0.3000           0.5345           0.9101           1.0000
dev2     ROUGE1     2000   0.4659  0.2525         0.0000   0.2500           0.5000           0.7500           1.0000
dev2     ROUGE2     2000   0.3123  0.2645         0.0000   0.0664           0.2857           0.5610           1.0000
dev2     ROUGEL     2000   0.4529  0.2554         0.0000   0.2349           0.4615           0.7500           1.0000
dev2     ROUGELsum  2000   0.4529  0.2554         0.0000   0.2349           0.4615           0.7500           1.0000

Clean up

To avoid incurring future charges, delete the resources you created with the following code:

# Delete resources
pre_trained_predictor.delete_model()
pre_trained_predictor.delete_endpoint()
fine_tuned_predictor.delete_model()
fine_tuned_predictor.delete_endpoint()

Conclusion

In this post, we demonstrated how to fine-tune a FLAN-T5 XL model for a clinical domain-specific summarization task using SageMaker Studio. To build confidence in the model's output, we compared the predictions with ground truth and evaluated the results using ROUGE metrics. We demonstrated that a model fine-tuned for a specific task returns better results than a model pre-trained on a generic NLP task. We would like to point out that fine-tuning a general-purpose LLM eliminates the cost of pre-training altogether.

Although the work presented here focuses on chest X-ray reports, it has the potential to be expanded to bigger datasets with varied anatomies and modalities, such as MRI and CT, for which radiology reports might be more complex with multiple findings. In such cases, radiologists could generate impressions in order of criticality and include follow-up recommendations. Furthermore, setting up a feedback loop for this application would enable radiologists to improve the performance of the model over time.

As we showed in this post, the fine-tuned model generates impressions for radiology reports with high ROUGE scores. You can try to fine-tune LLMs on other domain-specific medical reports from different departments.


About the authors

Dr. Adewale Akinfaderin is a Senior Data Scientist in Healthcare and Life Sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in Physics and a Doctorate degree in Engineering.

Priya Padate is a Senior Partner Solutions Architect with extensive expertise in Healthcare and Life Sciences at AWS. Priya drives go-to-market strategies with partners and drives solution development to accelerate AI/ML-based development. She is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.

Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with AWS Healthcare and Life Sciences (HCLS) professional services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.

MLOps for batch inference with model monitoring and retraining using Amazon SageMaker, HashiCorp Terraform, and GitLab CI/CD

Maintaining machine learning (ML) workflows in production is a challenging task because it requires creating continuous integration and continuous delivery (CI/CD) pipelines for ML code and models, model versioning, monitoring for data and concept drift, model retraining, and a manual approval process to ensure new versions of the model satisfy both performance and compliance requirements.

In this post, we describe how to create an MLOps workflow for batch inference that automates job scheduling, model monitoring, retraining, and registration, as well as error handling and notification by using Amazon SageMaker, Amazon EventBridge, AWS Lambda, Amazon Simple Notification Service (Amazon SNS), HashiCorp Terraform, and GitLab CI/CD. The presented MLOps workflow provides a reusable template for managing the ML lifecycle through automation, monitoring, auditability, and scalability, thereby reducing the complexities and costs of maintaining batch inference workloads in production.

Solution overview

The following figure illustrates the proposed target MLOps architecture for enterprise batch inference for organizations who use GitLab CI/CD and Terraform infrastructure as code (IaC) in conjunction with AWS tools and services. GitLab CI/CD serves as the macro-orchestrator, orchestrating model build and model deploy pipelines, which include sourcing, building, and provisioning Amazon SageMaker Pipelines and supporting resources using the SageMaker Python SDK and Terraform. SageMaker Python SDK is used to create or update SageMaker pipelines for training, training with hyperparameter optimization (HPO), and batch inference. Terraform is used to create additional resources such as EventBridge rules, Lambda functions, and SNS topics for monitoring SageMaker pipelines and sending notifications (for example, when a pipeline step fails or succeeds). SageMaker Pipelines serves as the orchestrator for ML model training and inference workflows.

This architecture design represents a multi-account strategy where ML models are built, trained, and registered in a central model registry within a data science development account (which has more controls than a typical application development account). Then, inference pipelines are deployed to staging and production accounts using automation from DevOps tools such as GitLab CI/CD. The central model registry could optionally be placed in a shared services account as well. Refer to Operating model for best practices regarding a multi-account strategy for ML.

In the following subsections, we discuss different aspects of the architecture design in detail.

Infrastructure as code

IaC offers a way to manage IT infrastructure through machine-readable files, ensuring efficient version control. In this post and the accompanying code sample, we demonstrate how to use HashiCorp Terraform with GitLab CI/CD to manage AWS resources effectively. This approach underscores the key benefit of IaC, offering a transparent and repeatable process in IT infrastructure management.

Model training and retraining

In this design, the SageMaker training pipeline runs on a schedule (via EventBridge) or in response to an Amazon Simple Storage Service (Amazon S3) event trigger (for example, when a trigger file is placed in Amazon S3 or, if the training data is a single object, when new training data is uploaded) to regularly recalibrate the model with new data. This pipeline does not introduce structural or material changes to the model because it uses fixed hyperparameters that have been approved during the enterprise model review process.

The training pipeline registers the newly trained model version in the Amazon SageMaker Model Registry if the model meets a predefined model performance threshold (for example, RMSE for regression and F1 score for classification). When a new version of the model is registered in the model registry, it triggers a notification to the responsible data scientist via Amazon SNS. The data scientist then needs to review and manually approve the latest version of the model in the Amazon SageMaker Studio UI or via an API call using the AWS Command Line Interface (AWS CLI) or AWS SDK for Python (Boto3) before the new version of the model can be used for inference.
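Whether a model passes the threshold depends on the direction of the metric: an error metric such as RMSE should stay at or below its threshold, whereas a score such as F1 should stay at or above it. The following minimal sketch illustrates that logic; the function name, the set of error metrics, and the inclusive comparisons are assumptions for illustration and not the condition step defined in the sample code.

```python
# Metrics where a smaller value means a better model (illustrative set)
LOWER_IS_BETTER = {"rmse", "mse", "mae"}


def passes_threshold(metric_name, value, threshold):
    """Return True when the model is good enough to be registered."""
    if metric_name.lower() in LOWER_IS_BETTER:
        return value <= threshold
    return value >= threshold


print(passes_threshold("RMSE", 2.1, 2.5))  # error below the maximum allowed
print(passes_threshold("F1", 0.70, 0.75))  # score below the minimum required
```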

The SageMaker training pipeline and its supporting resources are created by the GitLab model build pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model build Git repository.

Batch inference

The SageMaker batch inference pipeline runs on a schedule (via EventBridge) or based on an S3 event trigger as well. The batch inference pipeline automatically pulls the latest approved version of the model from the model registry and uses it for inference. The batch inference pipeline includes steps for checking data quality against a baseline created by the training pipeline, as well as model quality (model performance) if ground truth labels are available.

If the batch inference pipeline discovers data quality issues, it will notify the responsible data scientist via Amazon SNS. If it discovers model quality issues (for example, RMSE is greater than a pre-specified threshold), the pipeline step for the model quality check will fail, which will in turn trigger an EventBridge event to start the training with HPO pipeline.

The SageMaker batch inference pipeline and its supporting resources are created by the GitLab model deploy pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model deploy Git repository.

Model tuning and retuning

The SageMaker training with HPO pipeline is triggered when the model quality check step of the batch inference pipeline fails. The model quality check is performed by comparing model predictions with the actual ground truth labels. If the model quality metric (for example, RMSE for regression and F1 score for classification) doesn’t meet a pre-specified criterion, the model quality check step is marked as failed. The SageMaker training with HPO pipeline can also be triggered manually (in the SageMaker Studio UI or via an API call using the AWS CLI or SageMaker Python SDK) by the responsible data scientist if needed. Because the model hyperparameters are changing, the responsible data scientist needs to obtain approval from the enterprise model review board before the new model version can be approved in the model registry.

The SageMaker training with HPO pipeline and its supporting resources are created by the GitLab model build pipeline, either via a manual run of the GitLab pipeline or automatically when code is merged into the main branch of the model build Git repository.

Model monitoring

Data statistics and constraints baselines are generated as part of the training and training with HPO pipelines. They are saved to Amazon S3 and also registered with the trained model in the model registry if the model passes evaluation. The proposed architecture for the batch inference pipeline uses Amazon SageMaker Model Monitor for data quality checks, while using custom Amazon SageMaker Processing steps for the model quality check. This design decouples data and model quality checks, which in turn allows you to send only a warning notification when data drift is detected and to trigger the training with HPO pipeline when a model quality violation is detected.
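The effect of decoupling the two checks can be summarized as a small decision function. The following is a conceptual sketch only: the class, field, and action names are hypothetical, and it assumes a lower-is-better metric such as RMSE. In the actual solution, these decisions are carried out by pipeline steps, EventBridge rules, and Amazon SNS rather than by a single function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MonitoringResult:
    data_quality_violations: int           # number of baseline constraint violations
    model_quality_metric: Optional[float]  # None when no ground truth is available
    metric_threshold: float                # maximum acceptable value, for example RMSE


def decide_actions(result: MonitoringResult) -> List[str]:
    """Map monitoring results to follow-up actions."""
    actions = []
    if result.data_quality_violations > 0:
        # Data drift only warns; it does not retrain on its own
        actions.append("notify_data_scientist")
    if (
        result.model_quality_metric is not None
        and result.model_quality_metric > result.metric_threshold
    ):
        # A model quality violation starts retuning
        actions.append("start_training_with_hpo_pipeline")
    return actions


print(decide_actions(MonitoringResult(3, 2.1, 2.5)))
print(decide_actions(MonitoringResult(0, 3.4, 2.5)))
```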

Model approval

After a newly trained model is registered in the model registry, the responsible data scientist receives a notification. If the model has been trained by the training pipeline (recalibration with new training data while hyperparameters are fixed), there is no need for approval from the enterprise model review board. The data scientist can review and approve the new version of the model independently. On the other hand, if the model has been trained by the training with HPO pipeline (retuning by changing hyperparameters), the new model version needs to go through the enterprise review process before it can be used for inference in production. When the review process is complete, the data scientist can proceed and approve the new version of the model in the model registry. Changing the status of the model package to Approved will trigger a Lambda function via EventBridge, which will in turn trigger the GitLab model deploy pipeline via an API call. This will automatically update the SageMaker batch inference pipeline to utilize the latest approved version of the model for inference.
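The approval-to-deployment hand-off could look roughly like the following Lambda handler, which calls the GitLab pipeline trigger API when the event reports an approved model package. This is a sketch under assumptions and not the function from the target architecture: the environment variable names are hypothetical, the ModelApprovalStatus field in the event detail is assumed, and the send parameter exists only so the handler can be exercised without network access.

```python
import os
import urllib.parse
import urllib.request


def build_trigger_request(detail, gitlab_url, project_id, trigger_token, ref="main"):
    """Return a request for GitLab's pipeline trigger API, or None when the
    model package was not set to Approved."""
    if detail.get("ModelApprovalStatus") != "Approved":
        return None
    url = f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/trigger/pipeline"
    form = urllib.parse.urlencode({"token": trigger_token, "ref": ref}).encode("utf-8")
    return urllib.request.Request(url, data=form, method="POST")


def lambda_handler(event, context, send=urllib.request.urlopen):
    """Lambda entry point; `send` is injectable so the handler can be tested offline."""
    request = build_trigger_request(
        event.get("detail", {}),
        gitlab_url=os.environ["GITLAB_URL"],  # hypothetical variable names
        project_id=os.environ["GITLAB_PROJECT_ID"],
        trigger_token=os.environ["GITLAB_TRIGGER_TOKEN"],
    )
    if request is None:
        return {"triggered": False}
    with send(request) as response:
        return {"triggered": True, "status": response.status}
```

In practice, the trigger token would typically come from a secrets store rather than a plain environment variable.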

There are two main ways to approve or reject a new model version in the model registry: using the AWS SDK for Python (Boto3) or from the SageMaker Studio UI. By default, both the training pipeline and training with HPO pipeline set ModelApprovalStatus to PendingManualApproval. The responsible data scientist can update the approval status for the model by calling the update_model_package API from Boto3. Refer to Update the Approval Status of a Model for details about updating the approval status of a model via the SageMaker Studio UI.
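As an illustration of the Boto3 route, the following sketch wraps the update_model_package call in a small helper. The helper name and the ARN in the usage comment are placeholders; the client is passed in as an argument so the request-building logic can be checked without AWS credentials.

```python
ALLOWED_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(sm_client, model_package_arn, status, description=None):
    """Update the approval status of a model package version."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"Unsupported approval status: {status}")
    request = {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
    if description:
        request["ApprovalDescription"] = description
    return sm_client.update_model_package(**request)


# Example usage (requires AWS credentials; the ARN is a placeholder):
#   import boto3
#   sm_client = boto3.client("sagemaker")
#   set_approval_status(
#       sm_client,
#       "arn:aws:sagemaker:<region>:<account-id>:model-package/<group-name>/<version>",
#       "Approved",
#       description="Reviewed evaluation metrics",
#   )
```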

Data I/O design

SageMaker interacts directly with Amazon S3 for reading inputs and storing outputs of individual steps in the training and inference pipelines. The following diagram illustrates how different Python scripts, raw and processed training data, raw and processed inference data, inference results and ground truth labels (if available for model quality monitoring), model artifacts, training and inference evaluation metrics (model quality monitoring), as well as data quality baselines and violation reports (for data quality monitoring) can be organized within an S3 bucket. The direction of arrows in the diagram indicates which files are inputs or outputs from their respective steps in the SageMaker pipelines. Arrows have been color-coded based on pipeline step type to make them easier to read. The pipeline will automatically upload Python scripts from the GitLab repository and store output files or model artifacts from each step in the appropriate S3 path.

The data engineer is responsible for the following:

  • Uploading labeled training data to the appropriate path in Amazon S3. This includes adding new training data regularly to ensure the training pipeline and training with HPO pipeline have access to recent training data for model retraining and retuning, respectively.
  • Uploading input data for inference to the appropriate path in the S3 bucket before a planned run of the inference pipeline.
  • Uploading ground truth labels to the appropriate S3 path for model quality monitoring.

The data scientist is responsible for the following:

  • Preparing ground truth labels and providing them to the data engineering team for uploading to Amazon S3.
  • Taking the model versions trained by the training with HPO pipeline through the enterprise review process and obtaining necessary approvals.
  • Manually approving or rejecting newly trained model versions in the model registry.
  • Approving the production gate for the inference pipeline and supporting resources to be promoted to production.

Sample code

In this section, we present sample code for batch inference operations with a single-account setup, as shown in the following architecture diagram. The sample code can be found in the GitHub repository, and can serve as a starting point for batch inference with model monitoring and automatic retraining using quality gates often required for enterprises. The sample code differs from the target architecture in the following ways:

  • It uses a single AWS account for building and deploying the ML model and supporting resources. Refer to Organizing Your AWS Environment Using Multiple Accounts for guidance on multi-account setup on AWS.
  • It uses a single GitLab CI/CD pipeline for building and deploying the ML model and supporting resources.
  • When a new version of the model is trained and approved, the GitLab CI/CD pipeline is not triggered automatically and needs to be run manually by the responsible data scientist to update the SageMaker batch inference pipeline with the latest approved version of the model.
  • It only supports S3 event-based triggers for running the SageMaker training and inference pipelines.

Prerequisites

You should have the following prerequisites before deploying this solution:

  • An AWS account
  • SageMaker Studio
  • A SageMaker execution role with Amazon S3 read/write and AWS Key Management Service (AWS KMS) encrypt/decrypt permissions
  • An S3 bucket for storing data, scripts, and model artifacts
  • Terraform version 0.13.5 or greater
  • GitLab with a working Docker runner for running the pipelines
  • The AWS CLI
  • jq
  • unzip
  • Python3 (Python 3.7 or greater) and the following Python packages:
    • boto3
    • sagemaker
    • pandas
    • pyyaml

Repository structure

The GitHub repository contains the following directories and files:

  • /code/lambda_function/ – This directory contains the Python file for a Lambda function that prepares and sends notification messages (via Amazon SNS) about the SageMaker pipelines’ step state changes
  • /data/ – This directory includes the raw data files (training, inference, and ground truth data)
  • /env_files/ – This directory contains the Terraform input variables file
  • /pipeline_scripts/ – This directory contains three Python scripts for creating and updating training, inference, and training with HPO SageMaker pipelines, as well as configuration files for specifying each pipeline’s parameters
  • /scripts/ – This directory contains additional Python scripts (such as preprocessing and evaluation) that are referenced by the training, inference, and training with HPO pipelines
  • .gitlab-ci.yml – This file specifies the GitLab CI/CD pipeline configuration
  • /events.tf – This file defines EventBridge resources
  • /lambda.tf – This file defines the Lambda notification function and the associated AWS Identity and Access Management (IAM) resources
  • /main.tf – This file defines Terraform data sources and local variables
  • /sns.tf – This file defines Amazon SNS resources
  • /tags.json – This JSON file allows you to declare custom tag key-value pairs and append them to your Terraform resources using a local variable
  • /variables.tf – This file declares all the Terraform variables
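To give a feel for what the notification Lambda function in /code/lambda_function/ does, the following is a minimal sketch of a handler that turns a pipeline step state change event into an Amazon SNS message. It is not the code from the repository: the event detail field names (stepName, currentStepStatus, pipelineExecutionArn, failureReason) and the SNS_TOPIC_ARN environment variable are assumptions for illustration.

```python
import json
import os


def build_notification(event):
    """Create an SNS subject and a JSON message body from an EventBridge event."""
    detail = event.get("detail", {})
    step = detail.get("stepName", "unknown")
    status = detail.get("currentStepStatus", "unknown")
    # SNS subjects are limited to 100 characters
    subject = f"SageMaker pipeline step {step}: {status}"[:100]
    body = {
        "pipelineExecutionArn": detail.get("pipelineExecutionArn"),
        "stepName": step,
        "currentStepStatus": status,
        "failureReason": detail.get("failureReason"),
    }
    return subject, json.dumps(body, indent=2)


def lambda_handler(event, context, sns_client=None):
    """Lambda entry point; a client can be injected for offline testing."""
    if sns_client is None:
        import boto3  # imported lazily so the formatting logic runs without AWS access

        sns_client = boto3.client("sns")
    subject, message = build_notification(event)
    return sns_client.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],  # hypothetical variable name
        Subject=subject,
        Message=message,
    )
```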

Variables and configuration

The following table shows the variables that are used to parameterize this solution. Refer to the ./env_files/dev_env.tfvars file for more details.

Name                        Description
bucket_name                 S3 bucket that is used to store data, scripts, and model artifacts
bucket_prefix               S3 prefix for the ML project
bucket_train_prefix         S3 prefix for training data
bucket_inf_prefix           S3 prefix for inference data
notification_function_name  Name of the Lambda function that prepares and sends notification messages about SageMaker pipelines’ step state changes
custom_notification_config  The configuration for customizing notification message for specific SageMaker pipeline steps when a specific pipeline run status is detected
email_recipient             The email address list for receiving SageMaker pipelines’ step state change notifications
pipeline_inf                Name of the SageMaker inference pipeline
pipeline_train              Name of the SageMaker training pipeline
pipeline_trainwhpo          Name of the SageMaker training with HPO pipeline
recreate_pipelines          If set to true, the three existing SageMaker pipelines (training, inference, training with HPO) will be deleted and new ones will be created when GitLab CI/CD is run
model_package_group_name    Name of the model package group
accuracy_mse_threshold      Maximum value of MSE before requiring an update to the model
role_arn                    IAM role ARN of the SageMaker pipeline execution role
kms_key                     KMS key ARN for Amazon S3 and SageMaker encryption
subnet_id                   Subnet ID for SageMaker networking configuration
sg_id                       Security group ID for SageMaker networking configuration
upload_training_data        If set to true, training data will be uploaded to Amazon S3, and this upload operation will trigger the run of the training pipeline
upload_inference_data       If set to true, inference data will be uploaded to Amazon S3, and this upload operation will trigger the run of the inference pipeline
user_id                     The employee ID of the SageMaker user that is added as a tag to SageMaker resources

Deploy the solution

Complete the following steps to deploy the solution in your AWS account:

  1. Clone the GitHub repository into your working directory.
  2. Review and modify the GitLab CI/CD pipeline configuration to suit your environment. The configuration is specified in the ./.gitlab-ci.yml file.
  3. Refer to the README file to update the general solution variables in the ./env_files/dev_env.tfvars file. This file contains variables for both Python scripts and Terraform automation.
    1. Check the additional SageMaker Pipelines parameters that are defined in the YAML files under ./batch_scoring_pipeline/pipeline_scripts/. Review and update the parameters if necessary.
  4. Review the SageMaker pipeline creation scripts in ./pipeline_scripts/ as well as the scripts that are referenced by them in the ./scripts/ folder. The example scripts provided in the GitHub repo are based on the Abalone dataset. If you are going to use a different dataset, ensure you update the scripts to suit your particular problem.
  5. Put your data files into the ./data/ folder using the following naming convention. If you are using the Abalone dataset along with the provided example scripts, ensure the data files are headerless, the training data includes both independent and target variables with the original order of columns preserved, the inference data only includes independent variables, and the ground truth file only includes the target variable.
    1. training-data.csv
    2. inference-data.csv
    3. ground-truth.csv
  6. Commit and push the code to the repository to trigger the GitLab CI/CD pipeline run (first run). Note that the first pipeline run will fail on the pipeline stage because there’s no approved model version yet for the inference pipeline script to use. Review the step log and verify a new SageMaker pipeline named TrainingPipeline has been successfully created.
    1. Open the SageMaker Studio UI, then review and run the training pipeline.
    2. After the successful run of the training pipeline, approve the registered model version in the model registry, then rerun the entire GitLab CI/CD pipeline.
  7. Review the Terraform plan output in the build stage. Approve the manual apply stage in the GitLab CI/CD pipeline to resume the pipeline run and authorize Terraform to create the monitoring and notification resources in your AWS account.
  8. Finally, review the SageMaker pipelines’ run status and output in the SageMaker Studio UI and check your email for notification messages, as shown in the following screenshot. The default message body is in JSON format.
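The data file conventions from step 5 can be sketched as a small helper that writes the three headerless CSV files from a list of rows. It assumes that the target variable is the last column of each row and that the final rows are held out for inference; both are assumptions for illustration, so adjust them to match your dataset and the provided scripts.

```python
import csv
from pathlib import Path


def write_data_files(rows, out_dir, n_holdout):
    """Write headerless training, inference, and ground truth CSV files.

    rows: list of rows with the target in the last column (assumption).
    n_holdout: number of trailing rows used for inference and ground truth.
    """
    if not 0 < n_holdout < len(rows):
        raise ValueError("n_holdout must be between 1 and len(rows) - 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train, holdout = rows[:-n_holdout], rows[-n_holdout:]
    files = {
        "training-data.csv": train,                          # features and target
        "inference-data.csv": [row[:-1] for row in holdout], # features only
        "ground-truth.csv": [[row[-1]] for row in holdout],  # target only
    }
    for name, content in files.items():
        with open(out / name, "w", newline="") as handle:
            csv.writer(handle).writerows(content)  # no header row is written
    return sorted(files)
```

For example, write_data_files(rows, "./data", n_holdout=100) would place all three files in the ./data/ folder.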

SageMaker pipelines

In this section, we describe the three SageMaker pipelines within the MLOps workflow.

Training pipeline

The training pipeline is composed of the following steps:

  • Preprocessing step, including feature transformation and encoding
  • Data quality check step for generating data statistics and constraints baseline using the training data
  • Training step
  • Training evaluation step
  • Condition step to check whether the trained model meets a pre-specified performance threshold
  • Model registration step to register the newly trained model in the model registry if the trained model meets the required performance threshold

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to True in the training pipeline. These parameters instruct the pipeline to skip the data quality check and just create and register new data statistics or constraints baselines using the training data. The following figure depicts a successful run of the training pipeline.

Batch inference pipeline

The batch inference pipeline is composed of the following steps:

  • Creating a model from the latest approved model version in the model registry
  • Preprocessing step, including feature transformation and encoding
  • Batch inference step
  • Data quality check preprocessing step, which creates a new CSV file containing both input data and model predictions to be used for the data quality check
  • Data quality check step, which checks the input data against baseline statistics and constraints associated with the registered model
  • Condition step to check whether ground truth data is available. If ground truth data is available, the model quality check step will be performed
  • Model quality calculation step, which calculates model performance based on ground truth labels

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to False in the inference pipeline. These parameters instruct the pipeline to perform a data quality check using the data statistics or constraints baseline associated with the registered model (supplied_baseline_statistics_data_quality and supplied_baseline_constraints_data_quality) and skip creating or registering new data statistics and constraints baselines during inference. The following figure illustrates a run of the batch inference pipeline where the data quality check step has failed due to poor performance of the model on the inference data. In this particular case, the training with HPO pipeline will be triggered automatically to fine-tune the model.
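
The way the two flags combine across the training and inference pipelines can be summarized in a few lines of plain Python. This is only an illustrative model of the behavior described above, not the SageMaker SDK itself; the `QualityCheckPlan` class and `plan_quality_check` function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityCheckPlan:
    """What a data quality check step does for a given flag combination."""
    run_check: bool                # compare the data against a baseline
    register_baseline: bool        # register newly computed statistics/constraints
    needs_supplied_baseline: bool  # an existing baseline must be supplied


def plan_quality_check(skip_check: bool, register_new_baseline: bool) -> QualityCheckPlan:
    """Model how the two flags described in this section combine (illustrative only)."""
    return QualityCheckPlan(
        run_check=not skip_check,
        register_baseline=register_new_baseline,
        # A check is only meaningful when there is a baseline to compare against.
        needs_supplied_baseline=not skip_check,
    )


# Training pipeline: both flags True, so no check runs and a new baseline is registered.
training = plan_quality_check(skip_check=True, register_new_baseline=True)
# Inference pipeline: both flags False, so the data is checked against the supplied baseline.
inference = plan_quality_check(skip_check=False, register_new_baseline=False)
print(training)
print(inference)
```

In the inference case, the supplied baseline corresponds to the supplied_baseline_statistics_data_quality and supplied_baseline_constraints_data_quality values mentioned above.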

Training with HPO pipeline

The training with HPO pipeline is composed of the following steps:

  • Preprocessing step (feature transformation and encoding)
  • Data quality check step for generating data statistics and constraints baseline using the training data
  • Hyperparameter tuning step
  • Training evaluation step
  • Condition step to check whether the trained model meets a pre-specified accuracy threshold
  • Model registration step if the best trained model meets the required accuracy threshold

Both the skip_check_data_quality and register_new_baseline_data_quality parameters are set to True in the training with HPO pipeline. The following figure depicts a successful run of the training with HPO pipeline.
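
The logic behind the last two steps (pick the best model from tuning, then register it only if it meets the threshold) can be sketched in plain Python. The trial names, accuracy values, and threshold below are made up for illustration, and the function names are hypothetical; the sketch also assumes that "meets the threshold" means greater than or equal to it.

```python
from typing import Optional


def pick_best_candidate(candidates: dict[str, float]) -> Optional[tuple[str, float]]:
    """Return the (name, accuracy) pair with the highest accuracy, or None if empty."""
    if not candidates:
        return None
    best_name = max(candidates, key=lambda name: candidates[name])
    return best_name, candidates[best_name]


def should_register(accuracy: float, threshold: float) -> bool:
    """Condition step: register only when accuracy meets the threshold (assumed >=)."""
    return accuracy >= threshold


# Hypothetical tuning results: trial name -> evaluation accuracy.
trials = {"trial-a": 0.81, "trial-b": 0.87, "trial-c": 0.84}
best = pick_best_candidate(trials)
if best is not None:
    name, accuracy = best
    print(name, accuracy, should_register(accuracy, threshold=0.85))
```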

Clean up

Complete the following steps to clean up your resources:

  1. Employ the destroy stage in the GitLab CI/CD pipeline to eliminate all resources provisioned by Terraform.
  2. Use the AWS CLI to list and remove any remaining pipelines that are created by the Python scripts.
  3. Optionally, delete other AWS resources such as the S3 bucket or IAM role created outside the CI/CD pipeline.
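
Step 2 refers to the AWS CLI; as an alternative, the same list-then-delete flow can be sketched with the SageMaker API through boto3. The function below is a hypothetical helper, not part of the solution's repository. It takes the client as an argument and defaults to a dry run so you can review the names before anything is deleted.

```python
import uuid


def delete_pipelines(sm_client, name_prefix: str, dry_run: bool = True) -> list[str]:
    """List SageMaker pipelines whose names start with name_prefix; delete them unless dry_run."""
    names: list[str] = []
    kwargs = {"PipelineNamePrefix": name_prefix}
    while True:
        page = sm_client.list_pipelines(**kwargs)
        names.extend(s["PipelineName"] for s in page.get("PipelineSummaries", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results
    if not dry_run:
        for name in names:
            # ClientRequestToken is an idempotency token; a fresh UUID is used per call.
            sm_client.delete_pipeline(
                PipelineName=name, ClientRequestToken=str(uuid.uuid4())
            )
    return names
```

A possible invocation, assuming boto3 is installed and credentials are configured, is `delete_pipelines(boto3.client("sagemaker"), "TrainingPipeline")`, followed by a second call with `dry_run=False` once the returned names look right.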

Conclusion

In this post, we demonstrated how enterprises can create MLOps workflows for their batch inference jobs using Amazon SageMaker, Amazon EventBridge, AWS Lambda, Amazon SNS, HashiCorp Terraform, and GitLab CI/CD. The presented workflow automates data and model monitoring, model retraining, as well as batch job runs, code versioning, and infrastructure provisioning. This can lead to significant reductions in complexities and costs of maintaining batch inference jobs in production. For more information about implementation details, review the GitHub repo.


About the Authors

Hasan Shojaei is a Sr. Data Scientist with AWS Professional Services, where he helps customers across different industries such as sports, insurance, and financial services solve their business challenges through the use of big data, machine learning, and cloud technologies. Prior to this role, Hasan led multiple initiatives to develop novel physics-based and data-driven modeling techniques for top energy companies. Outside of work, Hasan is passionate about books, hiking, photography, and history.

Wenxin Liu is a Sr. Cloud Infrastructure Architect. Wenxin advises enterprise companies on how to accelerate cloud adoption and supports their innovations on the cloud. He’s a pet lover and is passionate about snowboarding and traveling.

Vivek Lakshmanan is a Machine Learning Engineer at Amazon. He has a Master’s degree in Software Engineering with specialization in Data Science and several years of experience as an MLE. Vivek is excited about applying cutting-edge technologies and building AI/ML solutions for customers on the cloud. He is passionate about statistics, NLP, and model explainability in AI/ML. In his spare time, he enjoys playing cricket and taking road trips.

Andy Cracchiolo is a Cloud Infrastructure Architect. With more than 15 years in IT infrastructure, Andy is an accomplished and results-driven IT professional. In addition to optimizing IT infrastructure, operations, and automation, Andy has a proven track record of analyzing IT operations, identifying inconsistencies, and implementing process enhancements that increase efficiency, reduce costs, and increase profits.

University of San Francisco Data Science Conference 2023 Datathon in partnership with AWS and Amazon SageMaker Studio Lab

As part of the 2023 Data Science Conference (DSCO 23), AWS partnered with the Data Institute at the University of San Francisco (USF) to conduct a datathon. Participants, both high school and undergraduate students, competed on a data science project that focused on air quality and sustainability. The Data Institute at the USF aims to support cross-disciplinary research and education in the field of data science. The Data Institute and the Data Science Conference provide a distinctive fusion of cutting-edge academic research and the entrepreneurial culture of the technology industry in the San Francisco Bay Area.

The students used Amazon SageMaker Studio Lab, which is a free platform that provides a JupyterLab environment with compute (CPU and GPU) and storage (up to 15GB). Because most of the students were unfamiliar with machine learning (ML), they were given a brief tutorial illustrating how to set up an ML pipeline: how to conduct exploratory data analysis, feature engineering, model building, and model evaluation, and how to set up inference and monitoring. The tutorial referenced Amazon Sustainability Data Initiative (ASDI) datasets from the National Oceanic and Atmospheric Administration (NOAA) and OpenAQ to build an ML model to predict air quality levels using weather data via a binary classification AutoGluon model. Next, the students were turned loose to work on their own projects in their teams. The winning teams were led by Peter Ma, Ben Welner, and Ei Coltin, who were all awarded prizes at the opening ceremony of the Data Science Conference at USF.

Response from the event

“This was a fun event, and a great way to work with others. I learned some Python coding in class but this helped make it real. During the datathon, my team member and I conducted research on different ML models (LightGBM, logistic regression, SVM models, Random Forest Classifier, etc.) and their performance on an AQI dataset from NOAA aimed at detecting the toxicity of the atmosphere under specific weather conditions. We built a gradient boosting classifier to predict air quality from weather statistics.”

– Anay Pant, a junior at the Athenian School, Danville, California, and one of the winners of the datathon.

“AI is becoming increasingly important in the workplace, and 82% of companies need employees with machine learning skills. It’s critical that we develop the talent needed to build products and experiences that we will all benefit from, this includes software engineering, data science, domain knowledge, and more. We were thrilled to help the next generation of builders explore machine learning and experiment with its capabilities. Our hope is that they take this forward and expand their ML knowledge. I personally hope to one day use an app built by one of the students at this datathon!”

– Sherry Marcus, Director of AWS ML Solutions Lab.

“This is the first year we used SageMaker Studio Lab. We were pleased by how quickly high school/undergraduate students and our graduate student mentors could start their projects and collaborate using SageMaker Studio.”

– Diane Woodbridge from the Data Institute of the University of San Francisco.

Get started with Studio Lab

If you missed this datathon, you can still register for your own Studio Lab account and work on your own project. If you’re interested in running your own hackathon, reach out to your AWS representative for a Studio Lab referral code, which will give your participants immediate access to the service. Finally, you can look for next year’s challenge at the USF Data Institute.


About the Authors

Neha Narwal is a Machine Learning Engineer at AWS Bedrock, where she contributes to the development of large language models for generative AI applications. Her focus lies at the intersection of science and engineering to influence research in the Natural Language Processing domain.

Vidya Sagar Ravipati is an Applied Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Announcing the Preview of Amazon SageMaker Profiler: Track and visualize detailed hardware performance data for your model training workloads

Today, we’re pleased to announce the preview of Amazon SageMaker Profiler, a capability of Amazon SageMaker that provides a detailed view into the AWS compute resources provisioned during training deep learning models on SageMaker. With SageMaker Profiler, you can track all activities on CPUs and GPUs, such as CPU and GPU utilizations, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. In this post, we walk you through the capabilities of SageMaker Profiler.

SageMaker Profiler provides Python modules for annotating PyTorch or TensorFlow training scripts and activating SageMaker Profiler. It also offers a user interface (UI) that visualizes the profile, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

The need for profiling training jobs

With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node, multi-GPU clusters. As state-of-the-art models grow in size, on the order of trillions of parameters, their computational complexity and cost also increase rapidly. ML practitioners have to cope with common challenges of efficient resource utilization when training such large models. This is particularly evident in large language models (LLMs), which typically have billions of parameters and therefore require large multi-node GPU clusters to train them efficiently.

When training these models on large compute clusters, we can encounter compute resource optimization challenges such as I/O bottlenecks, kernel launch latencies, memory limits, and low resource utilizations. If the training job configuration is not optimized, these challenges can result in inefficient hardware utilization and longer training times or incomplete training runs, which increase the overall costs and timelines for the project.

Prerequisites

The following are the prerequisites to start using SageMaker Profiler:

  • A SageMaker domain in your AWS account – For instructions on setting up a domain, see Onboard to Amazon SageMaker Domain using quick setup. You also need to add domain user profiles for individual users to access the SageMaker Profiler UI application. For more information, see Add and remove SageMaker Domain user profiles.
  • Permissions – The following list is the minimum set of permissions that should be assigned to the execution role for using the SageMaker Profiler UI application:
    • sagemaker:CreateApp
    • sagemaker:DeleteApp
    • sagemaker:DescribeTrainingJob
    • sagemaker:SearchTrainingJobs
    • s3:GetObject
    • s3:ListBucket
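
As a rough sketch, the permissions listed above could be expressed as an IAM policy document like the one generated below. The action names are copied verbatim from the list; the `"*"` resource and the function name are assumptions made for brevity, and in practice you would likely scope the resources down and confirm the action names against the IAM documentation.

```python
import json

# Actions copied from the permissions list above.
PROFILER_UI_ACTIONS = [
    "sagemaker:CreateApp",
    "sagemaker:DeleteApp",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:SearchTrainingJobs",
    "s3:GetObject",
    "s3:ListBucket",
]


def build_profiler_ui_policy(resource: str = "*") -> dict:
    """Build an IAM policy document allowing the listed actions (resource is an assumption)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(PROFILER_UI_ACTIONS),
                "Resource": resource,
            }
        ],
    }


print(json.dumps(build_profiler_ui_policy(), indent=2))
```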

Prepare and run a training job with SageMaker Profiler

To start capturing kernel runs on GPUs while the training job is running, modify your training script using the SageMaker Profiler Python modules. Import the library and add the start_profiling() and stop_profiling() methods to define the beginning and the end of profiling. You can also use optional custom annotations to add markers in the training script to visualize hardware activities during particular operations in each step.

There are two approaches that you can take to profile your training scripts with SageMaker Profiler. The first approach is based on profiling full functions; the second approach is based on profiling specific code lines in functions.

To profile by functions, use the context manager smppy.annotate to annotate full functions. The following example script shows how to implement the context manager to wrap the training loop and full functions in each iteration:

import time

import smppy

# args, world_size, sampler, trainloader, net, criterion, and optimizer are
# assumed to be defined earlier in your training script.

# Configure the profiler and start profiling before the training loop.
sm_prof = smppy.SMProfiler.instance()
config = smppy.Config()
config.profiler = {
    "EnableCuda": "1",
}
sm_prof.configure(config)
sm_prof.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        # Annotate the whole step, then each phase inside it.
        with smppy.annotate("step_"+str(i)):
            inputs, labels = data
            inputs = inputs.to("cuda", non_blocking=True)
            labels = labels.to("cuda", non_blocking=True)

            optimizer.zero_grad()

            with smppy.annotate("Forward"):
                outputs = net(inputs)
            with smppy.annotate("Loss"):
                loss = criterion(outputs, labels)
            with smppy.annotate("Backward"):
                loss.backward()
            with smppy.annotate("Optimizer"):
                optimizer.step()

# Stop profiling after the training loop finishes.
sm_prof.stop_profiling()

You can also use smppy.annotation_begin() and smppy.annotation_end() to annotate specific lines of code in functions. For more information, refer to the documentation.

Configure the SageMaker training job launcher

After you’re done annotating and setting up the profiler initiation modules, save the training script and prepare the SageMaker framework estimator for training using the SageMaker Python SDK.

  1. Set up a profiler_config object using the ProfilerConfig and Profiler modules as follows:
    from sagemaker import ProfilerConfig, Profiler
    profiler_config = ProfilerConfig(
        profiler_params = Profiler(cpu_profiling_duration=3600))

  2. Create a SageMaker estimator with the profiler_config object created in the previous step. The following code shows an example of creating a PyTorch estimator:
    import sagemaker
    from sagemaker.pytorch import PyTorch
    
    estimator = PyTorch(
        framework_version="2.0.0",
        image_uri="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker",
        role=sagemaker.get_execution_role(),
        entry_point="train_with_profiler_demo.py", # your training job entry point
        source_dir=source_dir, # source dir for your training script
        output_path=output_path,
        base_job_name="sagemaker-profiler-demo",
        hyperparameters=hyperparameters, # if any
        instance_count=1,
        instance_type="ml.p4d.24xlarge",
        profiler_config=profiler_config
    )

If you want to create a TensorFlow estimator, import sagemaker.tensorflow.TensorFlow instead, and specify one of the TensorFlow versions supported by SageMaker Profiler. For more information about supported frameworks and instance types, see Supported frameworks.

  3. Start the training job by running the fit method:
    estimator.fit(wait=False)
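
Because fit(wait=False) returns immediately, you may want a way to know when the job has finished before opening the Profiler UI. The following is a minimal polling sketch that uses the SageMaker DescribeTrainingJob API through a boto3-style client passed in by the caller. The function name, polling interval, and poll limit are assumptions for illustration.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name: str, poll_seconds: float = 60.0,
                          max_polls: int = 240, sleep=time.sleep) -> str:
    """Poll DescribeTrainingJob until the job reaches a terminal status and return it."""
    for attempt in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # still InProgress or Stopping; wait and retry
    raise TimeoutError(f"Training job {job_name} did not finish after {max_polls} polls")
```

A possible invocation, assuming boto3 is installed, is `wait_for_training_job(boto3.client("sagemaker"), estimator.latest_training_job.name)`.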

Launch the SageMaker Profiler UI

When the training job is complete, you can launch the SageMaker Profiler UI to visualize and explore the profile of the training job. You can access the SageMaker Profiler UI application through the SageMaker Profiler landing page on the SageMaker console or through the SageMaker domain.

To launch the SageMaker Profiler UI application on the SageMaker console, complete the following steps:

  1. On the SageMaker console, choose Profiler in the navigation pane.
  2. Under Get started, select the domain in which you want to launch the SageMaker Profiler UI application.

If your user profile only belongs to one domain, you will not see the option for selecting a domain.

  3. Select the user profile for which you want to launch the SageMaker Profiler UI application.

If there is no user profile in the domain, choose Create user profile. For more information about creating a new user profile, see Add and Remove User Profiles.

  4. Choose Open Profiler.

You can also launch the SageMaker Profiler UI from the domain details page.

Gain insights from the SageMaker Profiler

When you open the SageMaker Profiler UI, the Select and load a profile page opens, as shown in the following screenshot.

You can view a list of all the training jobs that have been submitted to SageMaker Profiler and search for a particular training job by its name, creation time, and run status (In Progress, Completed, Failed, Stopped, or Stopping). To load a profile, select the training job you want to view and choose Load. The job name should appear in the Loaded profile section at the top.

Choose the job name to generate the dashboard and timeline. Note that when you choose the job, the UI automatically opens the dashboard. You can load and visualize one profile at a time. To load another profile, you must first unload the previously loaded profile. To unload a profile, choose the trash bin icon in the Loaded profile section.

For this post, we view the profile of an ALBEF training job on two ml.p4d.24xlarge instances.

After you finish loading and selecting the training job, the UI opens the Dashboard page, as shown in the following screenshot.

You can see the plots for key metrics, namely the GPU active time, GPU utilization over time, CPU active time, and CPU utilization over time. The GPU active time pie chart shows the percentage of GPU active time vs. GPU idle time, which enables us to check if the GPUs are more active than idle throughout the entire training job. The GPU utilization over time timeline graph shows the average GPU utilization rate over time per node, aggregating all the nodes in a single chart. You can check if the GPUs have an unbalanced workload, under-utilization issues, bottlenecks, or idle issues during certain time intervals. For more details on interpreting these metrics, refer to documentation.
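
To make the arithmetic behind these charts concrete, the sketch below computes an active versus idle split and finds the least-utilized node from utilization samples. How SageMaker Profiler computes these values internally is not described in this post, so the function names, the "utilization above zero counts as active" rule, and the sample values are all assumptions for illustration.

```python
from statistics import mean


def active_idle_split(samples: list[float], active_threshold: float = 0.0) -> tuple[float, float]:
    """Return (active %, idle %) where a sample is active if above active_threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    active = sum(1 for u in samples if u > active_threshold)
    active_pct = 100.0 * active / len(samples)
    return active_pct, 100.0 - active_pct


def least_utilized_node(per_node_samples: dict[str, list[float]]) -> str:
    """Return the node with the lowest mean utilization, a hint of unbalanced workload."""
    return min(per_node_samples, key=lambda node: mean(per_node_samples[node]))


# Hypothetical GPU utilization samples (percent) for two nodes.
utilization = {
    "node-0": [90.0, 85.0, 0.0, 95.0],
    "node-1": [40.0, 0.0, 0.0, 35.0],
}
for node, samples in utilization.items():
    print(node, active_idle_split(samples))
print("least utilized:", least_utilized_node(utilization))
```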

The dashboard provides you with additional plots, including time spent by all GPU kernels, time spent by the top 15 GPU kernels, launch counts of all GPU kernels, and launch counts of the top 15 GPU kernels, as shown in the following screenshot.

Lastly, the dashboard enables you to visualize additional metrics, such as the step time distribution, which is a histogram that shows the distribution of step durations on GPUs, and the kernel precision distribution pie chart, which shows the percentage of time spent on running kernels in different data types such as FP32, FP16, INT32, and INT8.

You can also obtain a pie chart on the GPU activity distribution that shows the percentage of time spent on GPU activities, such as running kernels, memory (memcpy and memset), and synchronization (sync). You can visualize the percentage of time spent on GPU memory operations from the GPU memory operations distribution pie chart.

You can also create your own histograms based on a custom metric that you annotated manually as described earlier in this post. When adding a custom annotation to a new histogram, select or enter the name of the annotation you added in the training script.

Timeline interface

The SageMaker Profiler UI also includes a timeline interface, which provides you with a detailed view into the compute resources at the level of operations and kernels scheduled on the CPUs and run on the GPUs. The timeline is organized in a tree structure, giving you information from the host level to the device level, as shown in the following screenshot.

For each CPU, you can track the CPU performance counters, such as clk_unhalted_ref.tsc and itlb_misses.miss_causes_a_walk. For each GPU on the two ml.p4d.24xlarge instances, you can see a host timeline and a device timeline. Kernel launches are on the host timeline and kernel runs are on the device timeline.

You can also zoom in to the individual steps. In the following screenshot, we have zoomed in to step_41. The timeline strip selected in the following screenshot is the AllReduce operation, an essential communication and synchronization step in distributed training, run on GPU-0. In the screenshot, note that the kernel launch in the GPU-0 host connects to the kernel run in the GPU-0 device stream 1, indicated with the arrow in cyan.

Availability and considerations

SageMaker Profiler is available in PyTorch (version 2.0.0 and 1.13.1) and TensorFlow (version 2.12.0 and 2.11.1). The following table provides the links to the supported AWS Deep Learning Containers for SageMaker.

Framework | Version | AWS DLC Image URI
PyTorch | 2.0.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
PyTorch | 1.13.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
TensorFlow | 2.12.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker
TensorFlow | 2.11.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.1-gpu-py39-cu112-ubuntu20.04-sagemaker

SageMaker Profiler is currently available in the following Regions: US East (Ohio, N. Virginia), US West (Oregon), and Europe (Frankfurt, Ireland).

SageMaker Profiler is available in the training instance types ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.g4dn.12xlarge.
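
The availability constraints above lend themselves to a small pre-flight check. The sketch below encodes the framework versions, Regions, and instance types listed in this section and fills in the <region> placeholder of the image URI. The helper name is hypothetical and the Region codes are the standard codes for the named Regions; the lists reflect this post only and may be out of date.

```python
# Image names taken from the table in this section.
SUPPORTED_IMAGES = {
    ("pytorch", "2.0.0"): "pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker",
    ("pytorch", "1.13.1"): "pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker",
    ("tensorflow", "2.12.0"): "tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker",
    ("tensorflow", "2.11.1"): "tensorflow-training:2.11.1-gpu-py39-cu112-ubuntu20.04-sagemaker",
}
# Ohio, N. Virginia, Oregon, Frankfurt, Ireland.
SUPPORTED_REGIONS = {"us-east-2", "us-east-1", "us-west-2", "eu-central-1", "eu-west-1"}
SUPPORTED_INSTANCE_TYPES = {"ml.p4d.24xlarge", "ml.p3dn.24xlarge", "ml.g4dn.12xlarge"}


def profiler_image_uri(framework: str, version: str, region: str, instance_type: str) -> str:
    """Validate a configuration against the lists above and return the DLC image URI."""
    key = (framework.lower(), version)
    if key not in SUPPORTED_IMAGES:
        raise ValueError(f"Unsupported framework/version: {framework} {version}")
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"Unsupported Region: {region}")
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        raise ValueError(f"Unsupported instance type: {instance_type}")
    return f"763104351884.dkr.ecr.{region}.amazonaws.com/{SUPPORTED_IMAGES[key]}"


print(profiler_image_uri("PyTorch", "2.0.0", "us-west-2", "ml.p4d.24xlarge"))
```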

For the full list of supported frameworks and versions, refer to the documentation.

SageMaker Profiler incurs charges after the SageMaker Free Tier or the free trial period of the feature ends. For more information, see Amazon SageMaker Pricing.

Performance of SageMaker Profiler

We compared the overhead of SageMaker Profiler against various open-source profilers. The baseline used for the comparison was obtained from running the training job without a profiler.

Our key finding revealed that SageMaker Profiler generally resulted in a shorter billable training duration because it had less overhead time on the end-to-end training runs. It also generated less profiling data (up to 10 times less) when compared against open-source alternatives. The smaller profiling artifacts generated by SageMaker Profiler require less storage, thereby also saving on costs.

Conclusion

SageMaker Profiler enables you to get detailed insights into the utilization of compute resources when training your deep learning models. This can enable you to resolve performance hotspots and bottlenecks to ensure efficient resource utilization that would ultimately drive down training costs and reduce the overall training duration.

To get started with SageMaker Profiler, refer to the documentation.


About the Authors

 Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Sushant Moon is a Data Scientist at AWS, India, specializing in guiding customers through their AI/ML endeavors. With a diverse background spanning retail, finance, and insurance domains, he delivers innovative and tailored solutions. Beyond his professional life, Sushant finds rejuvenation in swimming and seeks inspiration from his travels to diverse locales.

Diksha Sharma is an AI/ML Specialist Solutions Architect in the Worldwide Specialist Organization. She works with public sector customers to help them architect efficient, secure, and scalable machine learning applications including generative AI solutions on AWS. In her spare time, Diksha loves to read, paint, and spend time with her family.

Persistent Systems shapes the future of software engineering with Amazon CodeWhisperer

Amazon CodeWhisperer, the AWS AI coding companion, is a step change in developer productivity tools. Based on generative AI technology, Amazon CodeWhisperer offers contextualized code snippets or recommendations based on natural language prompts to build software quickly, responsibly, and securely. It enables productivity gains and increases accuracy for accelerated digital transformations. Amazon CodeWhisperer ensures enterprises have greater control over AI-generated code, especially the code written by developers who may have a limited understanding of code attribution, quality, and security requirements.

Persistent Systems, a global digital engineering provider, has run several pilots and formal studies with Amazon CodeWhisperer that point to shifts in software engineering, generative AI-led modernization, responsible innovation, and more. This post highlights four themes emerging from Persistent’s Amazon CodeWhisperer experiments that could change software engineering as we know it.

Beyond productivity gains: Reimagining coding with Amazon CodeWhisperer

In this section, we discuss some of the ways that Amazon CodeWhisperer is reimagining coding.

Improving responsible delivery

Ownership, explainability, and transparency of AI-generated code are the most contentious points for the commercial adoption of coding companions such as Amazon CodeWhisperer. Amazon gives developers complete ownership of the code they write using Amazon CodeWhisperer. The Amazon CodeWhisperer team has carefully curated the training data and omitted restrictive licenses, ensuring developers don’t inadvertently use restrictively licensed code when they use Amazon CodeWhisperer. In addition, because recommender pipelines can be strongly influenced by open-source code, if Amazon CodeWhisperer detects a lineage, it flags the license references (for example, MIT or Apache, an open-source project). This enables the developer to attribute code snippets to the source owners, instituting coding best practices. Although Amazon collects data such as code snippets, recommendations, and comments from files open in the integrated development environment, for Amazon CodeWhisperer Professional users, these are not stored or used to train the model. Also, Amazon CodeWhisperer Individual users can opt out of sharing content with AWS, limiting the chances of this being reproduced as recommendations to other users.

Persistent’s approach to generative AI mirrors the thinking of Richard P. Feynman, who said, “I would rather have questions that can’t be answered than answers that can’t be questioned.” Persistent prioritizes responsibility, accountability, and transparency to build client trust. One example of the potential of Amazon CodeWhisperer lies in its ability to reference code, helping clients circumvent legal liabilities that could derail other rewards. For more information about Persistent’s approach to generative AI, refer to Generative AI Services and Solutions.

Moving code security upstream and upfront

Seasoned developers will tell you that security cannot be tested in; it must be built from the ground up. Although some approaches, such as DevSecOps, make it easier for developers, code security experts, and operations teams to embed security testing while the code is written, Amazon CodeWhisperer takes this one step further. It runs security scans on the code directly in the integrated development environment (IDE), allowing a single developer resource to test the code for quality and security. This highly automated, shift-left scenario for security testing enables enterprises to arrest defects upstream and remedy them at a fraction of the cost and time. Especially now, as generative AI moves coding closer to business users, the automated, in-line security scans in Amazon CodeWhisperer will mean less rework, faster time to production, and more resilient code.

Persistent helps leading global organizations fortify their business applications with code embedded with security guardrails. It believes security testing has to shift closer to the developer (professional or citizen) and be encoded into applications as they are written. Amazon CodeWhisperer, with its transformative power to fast-track not just coding but secure coding, fits well into the narrative.

Enabling developer skills to undergo a reboot

Most developers must undergo at least 4 months of training before being assigned to projects. In our pilot, Amazon CodeWhisperer condensed the training period to 1 month and reduced the cognitive load of understanding the context or coding language. We see this bearing on how companies hire developers: evaluating them not on coding knowledge, which has been largely abstracted, but on prompt engineering expertise and the ability to be creative with tools such as Amazon CodeWhisperer.

The parameters for evaluating professional developers will change, and quickly, depending on their ability to tune the input to get the desired answer. This also opens the field for citizen developers or business technologists, bringing coding closer to the business.

Driving implementation closer to strategy

With so many moving parts, businesses and their technology partners will return to the whiteboard together. The engagement model will evolve to factor in these new variables (such as faster coding timelines, secure code, more citizen developers, or domain-oriented developers) unleashed by Amazon CodeWhisperer. Coding will now move closer to the business, automatically incorporating security guardrails and mandatory regulations into software applications as they are written, all at scale. And with verticalized workloads, success will depend on the development team’s domain expertise and the ability to translate code into innovation. This means the implementation of the company’s vision through this code will become even more watertight because it adheres to strategic pillars of security, quality, and speed.

From long shots to offshoots – what the future holds

We extrapolated these themes to map a future where Amazon CodeWhisperer can help realize “delivery moon shots” that, up until now, were aspirational. The future looks something like this:

  • Zero-wastage – Amazon CodeWhisperer, especially with its proactive security scans and reference tracker tool, will ensure the code is of shippable quality, enabling every allied function—from business to developers—to add value and minimize wastage in terms of effort, time to value, or rework. This will bring a singular focus on the core job for each stakeholder, further enforcing a value-first mindset.
  • Zero ramp-up – The ability to support multiple coding languages, factor developer notes and comments into code suggestions, and offer lines of code on the fly makes Amazon CodeWhisperer the perfect antidote to the cold start problem for developers. As mentioned, developers don’t need a gestation period before being onboarded on a project. This dramatically cuts down the time to value, allowing implementation partners to dynamically deploy resources across projects for better monetization.
  • Zero-shot translation – Amazon CodeWhisperer supports multiple programming languages, such as Python, Java, JavaScript, TypeScript, SQL, and more. It will be able to translate code from one programming language to another, or what is called zero-shot translation ability, where it uses reference code in language A to write code in language B more accurately. This unleashes significant changes in how legacy modernization projects are planned and implemented. With the zero-shot translation ability of Amazon CodeWhisperer, Persistent is confident legacy modernization will become faster and no longer be a moon shot.
  • Zero lifting – Amazon CodeWhisperer is optimized to generate accurate code for other AWS offerings, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. The accurate code generation makes the lift easy. Because AWS and other major cloud service providers are now pushing forward a multi-cloud narrative, Persistent expects Amazon CodeWhisperer to improve accuracy while recommending code for other solutions offered by AWS peers. This makes the road smoother for multi-cloud or multi-platform settings, eliminating the heavy lifting required while shifting workloads from one service vendor to another—supercharging digital transformation 2.0.

Conclusion

Amazon CodeWhisperer goes beyond improving developer productivity: it democratizes coding and brings it closer to business users while ensuring that best practices such as code attribution and enhanced security are never overlooked.

Persistent is excited about Amazon CodeWhisperer and its potential impact on businesses and partners. It is working to create an Amazon CodeWhisperer-ready developer workforce and informing its customers about the tool’s benefits to drive adoption. Persistent’s strong partnership with AWS makes it the best-fit technology partner to help businesses capitalize on the intrinsic value of Amazon CodeWhisperer.

To learn more about Persistent’s generative AI philosophy that reimagines the way software is engineered today and how Amazon CodeWhisperer aligns with it, refer to Generative AI Services and Solutions.


About the authors

Dr. Pandurang Kamat is Chief Technology Officer, responsible for advanced technology research focused on unlocking business value through innovation at scale. He is a seasoned technology leader who helps customers improve user experience, optimize business processes, and create new digital products. His vision for Persistent is to be an innovation powerhouse that anchors a global and diverse innovation ecosystem comprising academia and start-ups. He holds a bachelor’s degree in Computer Engineering from Goa University and a Ph.D. in Computer Science from Rutgers University. He is a well-published author with several international research publications, an ACM-India Eminent Speaker, serves on the board of studies at universities, and mentors technology start-ups.

Ankur Desai is a Principal Product Manager within the AWS AI Services team.

Kiran Randhi works for Amazon Web Services as a Principal Partner Solutions Architect in Seattle, Washington. He works closely with AWS Global Strategic SI partners to develop and implement effective cloud strategies that allow them to fully leverage the benefits of cloud technology. Kiran helps CIOs, CTOs, and architects turn their cloud visions into reality by providing architectural guidance and expertise throughout the implementation of strategic cloud solutions. He focuses on AWS security, Migration & Modernization, Data & Analytics, and other technologies to build solutions for different industries in the cloud.
