Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

See CHANGELOG for latest features and fixes.

You’ve likely experienced the challenge of taking notes during a meeting while trying to pay attention to the conversation. You’ve probably also experienced the need to quickly fact-check something that’s been said, or look up information to answer a question that’s just been asked in the call. Or maybe you have a team member that always joins meetings late, and expects you to send them a quick summary over chat to catch them up.

Then there are the times that others are talking in a language that’s not your first language, and you’d love to have a live translation of what people are saying to make sure you understand correctly.

And after the call is over, you usually want to capture a summary for your records, or to send to the participants, with a list of all the action items, owners, and due dates.

All of this, and more, is now possible with our newest sample solution, Live Meeting Assistant (LMA).

Check out the following demo to see how it works.

In this post, we show you how to use LMA with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock.

Solution overview

The LMA sample solution captures speaker audio and metadata from your browser-based meeting app (as of this writing, Zoom and Chime are supported), or audio only from any other browser-based meeting app, softphone, or audio source. It uses Amazon Transcribe for speech to text, Knowledge Bases for Amazon Bedrock for contextual queries against your company’s documents and knowledge sources, and Amazon Bedrock models for customizable transcription insights and summaries.

Everything you need is provided as open source in our GitHub repo. It’s straightforward to deploy in your AWS account. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things LMA can do:

  • Live transcription with speaker attribution – LMA is powered by Amazon Transcribe ASR models for low-latency, high-accuracy speech to text. You can teach it brand names and domain-specific terminology if needed, using custom vocabulary and custom language model features in Amazon Transcribe.
  • Live translation – It uses Amazon Translate to optionally show each segment of the conversation translated into your language of choice, from a selection of 75 languages.
  • Context-aware meeting assistant – It uses Knowledge Bases for Amazon Bedrock to provide answers from your trusted sources, using the live transcript as context for fact-checking and follow-up questions. To activate the assistant, just say “Okay, Assistant,” choose the ASK ASSISTANT! button, or enter your own question in the UI.
  • On-demand summaries of the meeting – With the click of a button on the UI, you can generate a summary, which is useful when someone joins late and needs to get caught up. The summaries are generated from the transcript by Amazon Bedrock. LMA also provides options for identifying the current meeting topic, and for generating a list of action items with owners and due dates. You can also create your own custom prompts and corresponding options.
  • Automated summary and insights – When the meeting has ended, LMA automatically runs a set of large language model (LLM) prompts on Amazon Bedrock to summarize the meeting transcript and extract insights. You can customize these prompts as well.
  • Meeting recording – The audio is (optionally) stored for you, so you can replay important sections of the meeting later.
  • Inventory list of meetings – LMA keeps track of all your meetings in a searchable list.
  • Browser extension captures audio and meeting metadata from popular meeting apps – The browser extension captures meeting metadata (the meeting title and names of active speakers) and audio from you (your microphone) and others (from the meeting browser tab). As of this writing, LMA supports Chrome for the browser extension, and Zoom and Chime for meeting apps (with Teams and WebEx coming soon). Standalone meeting apps don’t work with LMA; instead, launch your meetings in the browser.

You are responsible for complying with legal, corporate, and ethical restrictions that apply to recording meetings and calls. Do not use this solution to stream, record, or transcribe calls if otherwise prohibited.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing knowledge base in Amazon Bedrock. If you haven’t set one up yet, see Create a knowledge base. Populate your knowledge base with content to power LMA’s context-aware meeting assistant.

Finally, LMA uses Amazon Bedrock LLMs for its meeting summarization features. Before proceeding, if you have not previously done so, you must request access to the following Amazon Bedrock models:

  • Titan Embeddings G1 – Text
  • Anthropic: All Claude models

Deploy the solution using AWS CloudFormation

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose Launch Stack for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Region Launch Stack
US East (N. Virginia)
US West (Oregon)
  3. For Stack name, use the default value, LMA.
  4. For Admin Email Address, use a valid email address. Your temporary password is emailed to this address during the deployment.
  5. For Authorized Account Email Domain, use the domain name part of your corporate email address to allow users with email addresses in the same domain to create their own new UI accounts, or leave blank to prevent users from directly creating their own accounts. You can enter multiple domains as a comma-separated list.
  6. For MeetingAssistService, choose BEDROCK_KNOWLEDGE_BASE (the only available option as of this writing).
  7. For Meeting Assist Bedrock Knowledge Base Id (existing), enter your existing knowledge base ID (for example, JSXXXXX3D8). You can copy it from the Amazon Bedrock console.
  8. For all other parameters, use the default values.

If you want to customize the settings later, for example to add your own AWS Lambda functions, use custom vocabularies and language models to improve accuracy, enable personally identifiable information (PII) redaction, and more, you can update the stack for these parameters.

  9. Select the acknowledgement check boxes, then choose Create stack.

The main CloudFormation stack uses nested stacks to create the following resources in your AWS account:

The stacks take about 35–40 minutes to deploy. The main stack status shows CREATE_COMPLETE when everything is deployed.
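If you would rather check the deployment status from a script than from the console, the following is a minimal sketch. It assumes you fetch the stack description with the CloudFormation DescribeStacks API (for example, through boto3); the helper names are illustrative and are not part of LMA.

```python
def stack_status(describe_stacks_response):
    """Extract StackStatus from a CloudFormation DescribeStacks response."""
    return describe_stacks_response["Stacks"][0]["StackStatus"]


def is_deployed(status):
    """The main stack is fully deployed once it reports CREATE_COMPLETE."""
    return status == "CREATE_COMPLETE"


# Usage with boto3 (needs AWS credentials; "LMA" is the default stack name):
#   import boto3
#   response = boto3.client("cloudformation").describe_stacks(StackName="LMA")
#   print(is_deployed(stack_status(response)))
```

Keeping the helpers independent of the AWS client makes them easy to try with a saved response before pointing them at a real account.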

Set your password

After you deploy the stack, open the LMA web user interface and set your password by completing the following steps:

  1. Open the email you received, at the email address you provided, with the subject “Welcome to Live Meeting Assistant!”
  2. Open your web browser to the URL shown in the email. You’re directed to the login page.
  3. The email contains a generated temporary password that you use to log in and create your own password. Your user name is your email address.
  4. Set a new password.

Your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

  5. Follow the directions to verify your email address, or choose Skip to do it later.

You’re now logged in to LMA.
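If you want to check a candidate password against the policy described above before submitting it, a small sketch could look like the following. It assumes that "special characters" means ASCII punctuation; the exact set that LMA accepts may differ, and the function name is illustrative.

```python
import string


def meets_password_policy(password):
    """Check the stated policy: 8+ characters with upper, lower, digit, and special."""
    return (
        len(password) >= 8
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        # Assumption: "special" means ASCII punctuation; the real accepted set may differ.
        and any(c in string.punctuation for c in password)
    )
```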

You also received a similar email with the subject “QnABot Signup Verification Code.” This email contains a generated temporary password that you use to log in and create your own password in the QnABot designer. You use QnABot designer only if you want to customize LMA options and prompts. Your username for QnABot is Admin. You can set your permanent QnABot Admin password now, or keep this email safe in case you want to customize things later.

Download and install the Chrome browser extension

For the best meeting streaming experience, install the LMA browser plugin (currently available for Chrome):

  1. Choose Download Chrome Extension to download the browser extension .zip file (lma-chrome-extension.zip).
  2. Choose (right-click) and expand the .zip file (lma-chrome-extension.zip) to create a local folder named lma-chrome-extension.
  3. Open Chrome and enter the link chrome://extensions into the address bar.
  4. Enable Developer mode.
  5. Choose Load unpacked, navigate to the lma-chrome-extension folder (which you unzipped from the download), and choose Select. This loads your extension.
  6. Pin the new LMA extension to the browser tool bar for easy access—you will use it often to stream your meetings!

Start using LMA

LMA provides two streaming options:

  • Chrome browser extension – Use this to stream audio and speaker metadata from your meeting browser app. It currently works with Zoom and Chime, but we hope to add more meeting apps.
  • LMA Stream Audio tab – Use this to stream audio from your microphone and any Chrome browser-based meeting app, softphone, or audio application.

We show you how to use both options in the following sections.

Use the Chrome browser extension to stream a Zoom call

Complete the following steps to use the browser extension:

  1. Open the LMA extension and log in with your LMA credentials.
  2. Join or start a Zoom meeting in your web browser (do not use the separate Zoom client).

If you already have the Zoom meeting page loaded, reload it.

The LMA extension automatically detects that Zoom is running in the browser tab, and populates your name and the meeting name.

  3. Tell others on the call that you are about to start recording the call using LMA and obtain their permission. Do not proceed if participants object.
  4. Choose Start Listening.
  5. Read and accept the disclaimer, and choose Allow to share the browser tab.

The LMA extension automatically detects and displays the active speaker on the call. If you are alone in the meeting, invite some friends to join, and observe that the names they used to join the call are displayed in the extension when they speak, and are attributed to their words in the LMA transcript.

  6. Choose Open in LMA to see your live transcript in a new tab.
  7. Choose your preferred transcript language, and interact with the meeting assistant using the wake phrase “OK Assistant!” or the Meeting Assist Bot pane.

The ASK ASSISTANT button asks the meeting assistant service (Amazon Bedrock knowledge base) to suggest a good response based on the transcript of the recent interactions in the meeting. Your mileage may vary, so experiment!

  8. When you are done, choose Stop Streaming to end the meeting in LMA.

Within a few seconds, the automated end-of-meeting summaries appear, and the audio recording becomes available. You can continue to use the bot after the call has ended.

Use the LMA UI Stream Audio tab to stream from your microphone and any browser-based audio application

The browser extension is the most convenient way to stream metadata and audio from supported meeting web apps. However, you can also use LMA to stream just the audio from any browser-based softphone, meeting app, or other audio source playing in your Chrome browser, using the convenient Stream Audio tab that is built into the LMA UI.

  1. Open any audio source in a browser tab.

For example, this could be a softphone (such as Google Voice), another meeting app, or for demo purposes, you can simply play a local audio recording or a YouTube video in your browser to emulate another meeting participant. If you just want to try it, open the following YouTube video in a new tab.

  2. In the LMA App UI, choose Stream Audio (no extension) to open the Stream Audio tab.
  3. For Meeting ID, enter a meeting ID.
  4. For Name, enter a name for yourself (applied to audio from your microphone).
  5. For Participant Name(s), enter the names of the participants (applied to the incoming audio source).
  6. Choose Start Streaming.
  7. Choose the browser tab you opened earlier, and choose Allow to share.
  8. Choose the LMA UI tab again to view your new meeting ID listed, showing the meeting as In Progress.
  9. Choose the meeting ID to open the details page, and watch the transcript of the incoming audio, attributed to the participant names that you entered. If you speak, you’ll see the transcription of your own voice.

Use the Stream Audio feature to stream from any softphone app, meeting app, or any other streaming audio playing in the browser, along with your own audio captured from your selected microphone. Always obtain permission from others before recording them using LMA, or any other recording application.

Processing flow overview

How did LMA transcribe and analyze your meeting? Let’s look at how it works. The following diagram shows the main architectural components and how they fit together at a high level.

The LMA user joins a meeting in their browser, enables the LMA browser extension, and authenticates using their LMA credentials. If the meeting app (for example, Zoom.us) is supported by the LMA extension, the user’s name, meeting name, and active speaker names are automatically detected by the extension. If the meeting app is not supported by the extension, then the LMA user can manually enter their name and the meeting topic—active speakers’ names will not be detected.

After getting permission from other participants, the LMA user chooses Start Listening on the LMA extension pane. A secure WebSocket connection is established to the preconfigured LMA stack WebSocket URL, and the user’s authentication token is validated. The LMA browser extension sends a START message to the WebSocket containing the meeting metadata (name, topic, and so on), and starts streaming two-channel audio from the user’s microphone and the incoming audio channel containing the voices of the other meeting participants. The extension monitors the meeting app to detect active speaker changes during the call, and sends that metadata to the WebSocket, enabling LMA to label speech segments with the speaker’s name.

The WebSocket server running in Fargate consumes the real-time two-channel audio fragments from the incoming WebSocket stream. The audio is streamed to Amazon Transcribe, and the transcription results are written in real time to Kinesis Data Streams.
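To illustrate what "two-channel audio" means here, the following generic sketch interleaves two mono streams of 16-bit samples into one stereo stream. It is not LMA's actual wire format or code; the sample format, the channel order (microphone first, incoming meeting audio second), and the function name are all assumptions made for illustration.

```python
from array import array


def interleave_stereo(left, right):
    """Interleave two mono 16-bit sample sequences into L, R, L, R, ... order."""
    frames = min(len(left), len(right))  # drop any unmatched tail samples
    stereo = array("h")
    for i in range(frames):
        stereo.append(left[i])
        stereo.append(right[i])
    return stereo


mic = array("h", [1, 2, 3])          # assumed channel 0: your microphone
meeting = array("h", [10, 20, 30])   # assumed channel 1: other participants
print(interleave_stereo(mic, meeting).tolist())  # [1, 10, 2, 20, 3, 30]
```

Keeping the two sources on separate channels is what allows a transcription service to attribute speech to "you" versus "everyone else" without having to separate voices from a single mixed signal.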

Each meeting processing session runs until the user chooses Stop Listening in the LMA extension pane, or ends the meeting and closes the tab. At the end of the call, the function creates a stereo recording file in Amazon S3 (if recording was enabled when the stack was deployed).

A Lambda function called the Call Event Processor, fed by Kinesis Data Streams, processes and optionally enriches meeting metadata and transcription segments. The Call Event Processor integrates with the meeting assist services. LMA is powered by Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs using the open source QnABot on AWS solution for answers based on FAQs and as an orchestrator for request routing to the appropriate AI service. The Call Event Processor also invokes the Transcript Summarization Lambda function when the call ends, to generate a summary of the call from the full transcript.

The Call Event Processor function interfaces with AWS AppSync to persist changes (mutations) in Amazon DynamoDB and send real-time updates to the LMA user’s logged-in web clients (conveniently opened by choosing the Open in LMA option in the browser extension).
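To make the Kinesis-to-Lambda step more concrete, here is a minimal sketch of how a Lambda function can decode records delivered by Kinesis Data Streams. The Records/kinesis/data layout with base64-encoded data is the standard Lambda event shape for Kinesis; the assumption that each record is a JSON object, and the field names in the sample segment, are hypothetical and do not describe LMA's real message schema.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode base64 Kinesis record data in a Lambda event into Python dicts."""
    decoded = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(raw))  # assumption: each record is a JSON object
    return decoded


# Hypothetical transcript segment; these field names are invented for illustration.
sample = {"CallId": "demo-meeting", "Speaker": "Alice", "Transcript": "Hello"}
event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(json.dumps(sample).encode()).decode()}}
    ]
}
print(decode_kinesis_records(event))
```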

The LMA web UI assets are hosted on Amazon S3 and served via CloudFront. Authentication is provided by Amazon Cognito.

When the user is authenticated, the web application establishes a secure GraphQL connection to the AWS AppSync API, and subscribes to receive real-time events such as new calls and call status changes for the meetings list page, and new or updated transcription segments and computed analytics for the meeting details page. When translation is enabled, the web application also interacts securely with Amazon Translate to translate the meeting transcription into the selected language.
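As a rough illustration of the translation step, the sketch below builds a request for the Amazon Translate TranslateText API. The LMA web application makes its calls from the browser, so this Python version only shows the shape of the request; the helper name is an assumption.

```python
def build_translate_request(text, target_language, source_language="auto"):
    """Build keyword arguments for the Amazon Translate TranslateText API."""
    return {
        "Text": text,
        "SourceLanguageCode": source_language,  # "auto" asks Translate to detect the language
        "TargetLanguageCode": target_language,
    }


# Usage with boto3 (needs AWS credentials):
#   import boto3
#   translate = boto3.client("translate")
#   result = translate.translate_text(**build_translate_request("Hello, everyone", "es"))
#   print(result["TranslatedText"])
```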

The entire processing flow, from ingested speech to live webpage updates, is event driven, and the end-to-end latency is short—typically just a few seconds.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the relevant stack’s Events tab. See Troubleshooting CloudFormation for help with common deployment problems. Look out for deployment failures caused by limit exceeded errors; the LMA stacks create resources that are subject to default account and Region service quotas, such as elastic IP addresses and NAT gateways. When troubleshooting CloudFormation stack failures, always navigate into any failed nested stacks to find the first nested resource failure reported—this is almost always the root cause.
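The advice to find the first reported failure can be sketched in a few lines. The example assumes a list of stack events shaped like the StackEvents entries returned by the CloudFormation DescribeStackEvents API; the resource names and timestamps are made up, and the helper is illustrative rather than part of LMA.

```python
def first_failure(stack_events):
    """Return the earliest event whose ResourceStatus ends in _FAILED, or None."""
    failures = [e for e in stack_events if e["ResourceStatus"].endswith("_FAILED")]
    return min(failures, key=lambda e: e["Timestamp"]) if failures else None


# Made-up events; the API returns newest first, so the root cause tends to be near the end.
events = [
    {"Timestamp": "2024-05-01T10:05:00Z", "LogicalResourceId": "LMA", "ResourceStatus": "CREATE_FAILED"},
    {"Timestamp": "2024-05-01T10:03:00Z", "LogicalResourceId": "ExampleNatGateway", "ResourceStatus": "CREATE_FAILED"},
    {"Timestamp": "2024-05-01T10:01:00Z", "LogicalResourceId": "ExampleVpc", "ResourceStatus": "CREATE_COMPLETE"},
]
print(first_failure(events)["LogicalResourceId"])  # ExampleNatGateway
```

With boto3, the events for one stack would come from describe_stack_events(StackName=...)["StackEvents"]; repeat the lookup for each failed nested stack until you reach a resource that is not itself a stack.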

Amazon Transcribe has a default limit of 25 concurrent transcription streams, which limits LMA to 25 concurrent meetings in a given AWS account or Region. Request an increase for the number of concurrent HTTP/2 streams for streaming transcription if you have many users and need to handle a larger number of concurrent meetings in your account.

LMA provides runtime monitoring and logs for each component using CloudWatch:

  • WebSocket processing and transcribing Fargate task – On the Amazon Elastic Container Service (Amazon ECS) console, navigate to the Clusters page and open the LMA-WEBSOCKETSTACK-xxxx-TranscribingCluster cluster. Choose the Tasks tab and open the task page. Choose Logs and View in CloudWatch to inspect the WebSocket transcriber task logs.
  • Call Event Processor Lambda function – On the Lambda console, open the LMA-AISTACK-CallEventProcessor function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • AWS AppSync API – On the AWS AppSync console, open the CallAnalytics-LMA API. Choose Monitoring in the navigation pane to see API metrics. Choose View logs in CloudWatch to inspect AWS AppSync API logs.

For QnABot on AWS for Meeting Assist, refer to the Meeting Assist README, and the QnABot solution implementation guide for additional information.

Cost assessment

LMA provides a WebSocket server using Fargate (2vCPU) and VPC networking resources costing about $0.10/hour (approximately $72/month). For more details, see AWS Fargate Pricing.

LMA is enabled using QnABot and Knowledge Bases for Amazon Bedrock. You create your own knowledge base, which you use for LMA and potentially other use cases. For more details, see Amazon Bedrock Pricing. Additional AWS services used by the QnABot solution cost about $0.77/hour. For more details, refer to the list of QnABot on AWS solution costs.

The remaining solution costs are based on usage.

The usage costs add up to about $0.17 for a 5-minute call, although this can vary based on options selected (such as translation), number of LLM summarizations, and total usage because usage affects Free Tier eligibility and volume tiered pricing for many services. For more information about the services that incur usage costs, see the following:

To explore LMA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.
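For a back-of-the-envelope estimate, the arithmetic behind the figures above can be sketched as follows. The hourly and per-call numbers come from this section; the 30-day month and the assumption that usage cost scales linearly with meeting minutes are simplifications (Free Tier and tiered pricing make real costs non-linear), and knowledge base costs are not included.

```python
FARGATE_PER_HOUR = 0.10      # WebSocket server and VPC networking (from this section)
QNABOT_PER_HOUR = 0.77       # additional services used by QnABot (from this section)
USAGE_PER_5_MIN_CALL = 0.17  # approximate usage cost quoted for a 5-minute call


def monthly_fixed_cost(per_hour, hours=24 * 30):
    """Cost of an always-on component over a 30-day month."""
    return per_hour * hours


def usage_cost(meeting_minutes):
    """Assumption: usage cost scales linearly with meeting minutes."""
    return USAGE_PER_5_MIN_CALL * meeting_minutes / 5


print(round(monthly_fixed_cost(FARGATE_PER_HOUR), 2))  # 72.0
print(round(monthly_fixed_cost(QNABOT_PER_HOUR), 2))   # 554.4
print(round(usage_cost(20 * 30), 2))                   # 20.4 for twenty 30-minute meetings
```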

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your LMA deployment:

  • To use your own S3 bucket for meeting recordings, use Call Audio Recordings Bucket Name and Audio File Prefix.
  • To redact PII from the transcriptions, set Enable Content Redaction for Transcripts to true, and adjust Transcription PII Redaction Entity Types as needed. For more information, see Redacting or identifying PII in a real-time stream.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set Transcription Custom Vocabulary Name to the name of a custom vocabulary that you already created in Amazon Transcribe or set Transcription Custom Language Model Name to the name of a previously created custom language model. For more information, see Improving Transcription Accuracy.
  • To transcribe meetings in a supported language other than US English, choose the desired value for Language for Transcription.
  • To customize transcript processing, optionally set Lambda Hook Function ARN for Custom Transcript Segment Processing to the ARN of your own Lambda function. For more information, see Using a Lambda function to optionally provide custom logic for transcript processing.
  • To customize the meeting assist capabilities based on the QnABot on AWS solution, Amazon Lex, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock integration, see the Meeting Assist README.
  • To customize transcript summarization by configuring LMA to call your own Lambda function, see Transcript Summarization LAMBDA option.
  • To customize transcript summarization by modifying the default prompts or adding new ones, see Transcript Summarization.
  • To change the retention period, set Record Expiration In Days to the desired value. All call data is permanently deleted from the LMA DynamoDB storage after this period. Changes to this setting apply only to new calls received after the update.
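If you apply these customizations by script, a stack update typically changes a few parameters and keeps the rest. The following sketch builds the Parameters list in the shape the CloudFormation UpdateStack API expects. The parameter keys shown are hypothetical placeholders: the names in the list above are console labels, so look up the real keys in the LMA template before using anything like this.

```python
def build_update_parameters(current_keys, overrides):
    """Build the Parameters list for a CloudFormation UpdateStack call.

    Keys in `overrides` get new values; every other key keeps its previous value.
    """
    unknown = set(overrides) - set(current_keys)
    if unknown:
        raise KeyError(f"Unknown parameter keys: {sorted(unknown)}")
    return [
        {"ParameterKey": key, "ParameterValue": str(overrides[key])}
        if key in overrides
        else {"ParameterKey": key, "UsePreviousValue": True}
        for key in current_keys
    ]


# Hypothetical keys for illustration only; they are not the real LMA parameter keys.
keys = ["ExampleRedactionEnabled", "ExampleRecordExpirationDays", "ExampleLanguageCode"]
params = build_update_parameters(keys, {"ExampleRedactionEnabled": "true"})
# With boto3, params would be passed as Parameters= to cloudformation.update_stack(...),
# together with UsePreviousTemplate=True and the capabilities the template requires.
```

Rejecting unknown keys up front avoids silently sending a misspelled parameter name that CloudFormation would only reject later.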

LMA is an open source project. You can fork the LMA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Update an existing LMA stack

You can update your existing LMA stack to the latest release. For more details, see Update an existing stack.

Clean up

Congratulations! You have completed all the steps for setting up your Live Meeting Assistant sample solution using AWS services.

When you’re finished experimenting with this sample solution, clean up your resources by using the AWS CloudFormation console to delete the LMA stacks that you deployed. This deletes resources that were created by deploying the solution. The recording S3 buckets, DynamoDB table, and CloudWatch log groups are retained after the stack is deleted to avoid deleting your data.

Live Call Analytics: Companion solution

Our companion solution, Live Call Analytics and Agent Assist (LCA), offers real-time transcription and analytics for contact centers (phone calls) rather than meetings. There are many similarities—in fact, LMA was built using an architecture and many components derived from LCA.

Conclusion

The Live Meeting Assistant sample solution offers a flexible, feature-rich, and customizable approach to provide live meeting assistance to improve your productivity during and after meetings. It uses Amazon AI/ML services like Amazon Transcribe, Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs to transcribe and extract real-time insights from your meeting audio.

The sample LMA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Browse to the LMA GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the LMA GitHub repository.


About the authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Chris Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Babu Srinivasan is a Sr. Specialist SA – Language AI services in the World Wide Specialist organization at AWS, with over 24 years of experience in IT and the last 6 years focused on the AWS Cloud. He is passionate about AI/ML. Outside of work, he enjoys woodworking and entertains friends and family (sometimes strangers) with sleight of hand card magic.

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Gillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.

Meta Llama 3 models are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Meta Llama 3 foundation models are available through Amazon SageMaker JumpStart to deploy and run inference. The Llama 3 models are a collection of pre-trained and fine-tuned generative text models.

In this post, we walk through how to discover and deploy Llama 3 models via SageMaker JumpStart.

What is Meta Llama 3

Llama 3 comes in two parameter sizes, 8B and 70B, both with an 8k context length, and can support a broad range of use cases with improvements in reasoning, code generation, and instruction following. Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128k-token vocabulary that improves model performance. In addition, Meta improved post-training procedures that substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. You can now derive the combined advantages of Llama 3 performance and MLOps controls with Amazon SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs. In addition, the model will be deployed in an AWS secure environment under your VPC controls, helping provide data security.

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Llama 3 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security. Llama 3 models are available today for deployment and inferencing in Amazon SageMaker Studio in us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-2 (Oregon), eu-west-1 (Ireland) and ap-northeast-1 (Tokyo) AWS Regions.

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can easily discover various models by browsing through different hubs, which are named after model providers. You can find Llama 3 models in the Meta hub. If you do not see Llama 3 models, please update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can find Llama 3 models by searching for “Meta-llama-3” in the search box located at the top left.

You can discover all Meta models available in SageMaker JumpStart by clicking on the Meta hub.

Clicking on a model card opens the corresponding model detail page, from which you can easily Deploy the model.

Deploy a model

When you choose Deploy and acknowledge the EULA terms, deployment will start.

You can monitor progress of the deployment on the page that shows up after clicking the Deploy button.

Alternatively, you can choose Open notebook to deploy through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using the notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code.

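from sagemaker.jumpstart.model import JumpStartModel

# Select the model to deploy by its JumpStart model ID
model = JumpStartModel(model_id="meta-textgeneration-llama-3-70b-instruct")
# accept_eula is False by default; change it to True only after reviewing and accepting the EULA
predictor = model.deploy(accept_eula=False)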

By default, accept_eula is set to False. You need to manually accept the EULA (by setting accept_eula=True) to deploy the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy. You can also find the license agreement on the Llama website. This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, please refer to the following documentation.

The following table lists all the Llama 3 models available in SageMaker JumpStart along with the model_ids, default instance types and maximum number of total tokens (sum of the number of input tokens and number of generated tokens) supported for each of these models.

| Model Name | Model ID | Max Total Tokens | Default instance type |
| --- | --- | --- | --- |
| Meta-Llama-3-8B | meta-textgeneration-llama-3-8b | 8192 | ml.g5.12xlarge |
| Meta-Llama-3-8B-Instruct | meta-textgeneration-llama-3-8b-instruct | 8192 | ml.g5.12xlarge |
| Meta-Llama-3-70B | meta-textgeneration-llama-3-70b | 8192 | ml.p4d.24xlarge |
| Meta-Llama-3-70B-Instruct | meta-textgeneration-llama-3-70b-instruct | 8192 | ml.p4d.24xlarge |

Run inference

After you deploy the model, you can run inference against the deployed endpoint through SageMaker predictor. Fine-tuned instruct models (Llama 3: 8B Instruct and 70B Instruct) accept a history of chats between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama 3: 8B and 70B) require a string prompt and perform text completion on the provided prompt.

predictor.predict(payload)

Inference parameters control the text generation process at the endpoint. The max_new_tokens parameter controls the size of the output generated by the model. This is not the same as the number of words, because the model’s vocabulary is not the same as the English language vocabulary, and a token may not be an English language word. The temperature parameter controls the randomness in the output. Higher temperature results in more creative outputs that are also more prone to hallucination. All the inference parameters are optional.

Example prompts for the 70B model

You can use Llama 3 models for text completion for any piece of text. Through text generation, you can perform a variety of tasks such as question answering, language translation, sentiment analysis, and more. The input payload to the endpoint looks like the following code:

payload = {
    "inputs": str,  # required: the prompt text
    # optional: inference parameters
    "parameters": {"max_new_tokens": int, "top_p": float, "temperature": float},
}

The following are some example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":64, "top_p":0.9, "temperature":0.6}.
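As a minimal sketch of how such a payload could be assembled, the helper below fills in the inference parameters quoted above as defaults. The function name build_payload is hypothetical and not part of the SageMaker SDK; only the payload keys come from this post.

```python
import json


def build_payload(prompt, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Build a request payload for a Llama 3 text-generation endpoint.

    build_payload is a hypothetical helper; the defaults mirror the
    inference parameters used for the sample outputs in this post.
    """
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


payload = build_payload("Simply put, the theory of relativity states that")
print(json.dumps(payload, indent=2))
# With a deployed endpoint, you would then call: predictor.predict(payload)
```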

Input: Simply put, the theory of relativity states that
Output: 1) the laws of physics are the same for all observers, and 2) the speed of light is constant for all observers. The first part of the theory is known as the special theory of relativity, and the second part is known as the general theory of relativity.\nThe special theory of relativity states

Input: I believe the meaning of life is
Output:  to live a life of purpose, on purpose. I believe that we are all here to make a difference in the world, and that we are all here to make a difference in the lives of others. I believe that we are all here to make a difference in the world, and that we are all here to mak

Input: A brief message congratulating the team on the launch:\n\nHi everyone,\n\nI just 
Output: wanted to say congratulations on the launch of the new website! It looks great and I'm sure it will be a valuable resource for our customers. I know that a lot of hard work went into making this happen, so thank you all for your efforts.\n\nBest regards,\nJohn Doe\n

In the next example, we show how to use Llama 3 models with few-shot in-context learning, where we provide training samples to the model in the prompt. We only run inference on the deployed model, and during this process the model weights do not change.

Input:
Translate English to French:\nsea otter => loutre de mer\npeppermint => menthe poivrée\nplush girafe => girafe peluche\ncheese =>
Output:  fromage\npanda => panda\npenguin => pingouin\npolar bear => ours polaire\npuffin => macareux\nsloth => paresseux\nwalrus => morse\nwhale => baleine\nzebra => zèbre\nTranslate English to German:\n

Example prompts for the 70B-Instruct model

Llama 3 instruct models are optimized for dialogue use cases, so the input to the instruct model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas, which define the chat assistant’s behavior. Although the input payload format is the same as for the base pre-trained model, the input text should be formatted in the following manner:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

[SYSTEM_TEXT]<|eot_id|><|start_header_id|>user<|end_header_id|>

[USER_TEXT]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

In this instruction template, you can optionally start with a system role and include as many alternating roles as desired in the turn-based history. The final role should always be assistant, and its header should end with two newline characters.
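The template can be assembled programmatically. The following is a minimal sketch, assuming the chat history is held as a list of dictionaries with role and content keys; the function name format_llama3_prompt and that message structure are illustrative choices, not part of any SDK.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def format_llama3_prompt(messages):
    """Render a chat history into the Llama 3 instruct prompt format.

    messages is a list of {"role": ..., "content": ...} dictionaries
    (a hypothetical structure chosen for this sketch). The prompt always
    ends with an open assistant header followed by two newlines.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role}")
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave the final assistant turn open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt(
    [
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"},
    ]
)
print(repr(prompt))
```

The resulting string is what goes into the inputs field of the payload.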

Next, consider a few example prompts and responses from the model. In the following example, the user is asking a simple question to the assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwhat is the recipe of mayonnaise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: The classic condiment! Mayonnaise is a thick, creamy emulsion sauce made from a combination of oil, egg yolks, acid (such as vinegar or lemon juice), and seasonings. Here's a simple recipe to make mayonnaise at home:\n\n**Ingredients:**\n\n* 2 egg yolks\n* 1 tablespoon lemon juice or vinegar (such as apple cider vinegar or white wine vinegar)\n* 1/2 teaspoon Dijon mustard (optional, but recommended for flavor)\n* 1/2 cup (120 ml) neutral-tasting oil, such as canola, grapeseed, or sunflower oil\n* Salt, to taste\n\n**Instructions:**\n\n1. **Start with room temperature ingredients**: This is crucial for emulsification to occur.\n2. **In a medium-sized bowl**, whisk together the egg yolks, lemon juice or vinegar, and Dijon mustard (if using) until well combined.\n3. **Slowly add the oil**: While continuously whisking the egg yolk mixture, slowly pour in the oil in a thin, steady stream. Start with a very slow drizzle and gradually increase the flow as the mixture thickens.\n4. **Whisk constantly**: Keep whisking until the mixture has doubled in volume and has a thick, creamy consistency. This should take about 5-7 minutes.\n5. **Check the consistency**: Stop whisking when the mayonnaise has reached the desired consistency. It should be smooth, thick, and hold its shape when the whisk is lifted.\n6. **Season with salt**: Add salt to taste, and whisk until well combined.\n7. **Taste and adjust**: If the mayonnaise is too thick, add a little water. If it's not thick enough, add a little more oil.\n\n**Tips and Variations:**\n\n* Use a neutral-tasting oil to avoid flavoring the mayonnaise.\n* If you're using a stand mixer, use the whisk attachment and start with a slow speed, increasing as needed.\n* Add flavorings like minced garlic, chopped fresh herbs (e.g., parsley, dill, or tarragon), or grated ginger to create different flavor profiles.\n* For a lighter mayonnaise, use only 1 egg yolk or add a little water to thin it out.\n* If the mayonnaise breaks or separates, don't worry! Simply start again with a new egg yolk and slowly whisk in the broken mayonnaise.\n\n**Homemade Mayonnaise vs. Store-Bought:**\n\nWhile store-bought mayonnaise is convenient, homemade mayonnaise offers several advantages:\n\n* Freshness: Homemade mayonnaise is made to order, ensuring optimal flavor and texture.\n* Customization: You can tailor the flavor to your liking with various seasonings and ingredients.\n* No preservatives: Homemade mayonnaise doesn't contain added preservatives, making it a healthier option.\n\nEnjoy your freshly made mayonnaise!

In the following example, the user has a conversation with the assistant about tourist sites in Paris. Then the user inquires about the first option recommended by the chat assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nParis, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is so great about #1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: The Eiffel Tower! It's an iconic symbol of Paris and one of the most recognizable landmarks in the world. Here are some reasons why it's so great:\n\n1. **Engineering marvel**: When it was built for the 1889 World's Fair, the Eiffel Tower was a groundbreaking feat of engineering. It was the tallest structure in the world at the time, standing at 324 meters (1,063 feet).\n2. **Panoramic views**: The Eiffel Tower offers breathtaking 360-degree views of the city from its observation decks. You can see many of Paris' famous landmarks, such as the Seine River, the Louvre, and Montmartre.\n3. **Romantic atmosphere**: The Eiffel Tower is often associated with romance and is a popular spot for proposals, weddings, and romantic dinners. The tower's iron latticework and twinkling lights at night add to its romantic ambiance.\n4. **Historical significance**: The Eiffel Tower has played a significant role in French history, serving as a communication tower during World War I and a symbol of French culture and engineering.\n5. **Accessibility**: The Eiffel Tower is easily accessible by public transportation or on foot, and there are various ways to experience it, including taking the stairs, elevator, or a guided tour.\n6. **Nighttime spectacle**: The Eiffel Tower is even more magical at night, when it's illuminated with thousands of twinkling lights. The light show, known as the "Eiffel Tower Sparkle," takes place every hour on the hour from dusk to midnight.\n7. **Iconic status**: The Eiffel Tower has been featured in countless films, literature, and art, making it an instantly recognizable symbol of Paris and France.\n\nOverall, the Eiffel Tower is a must-visit attraction in Paris, offering a unique combination of history, engineering, romance, and stunning views.

In the following examples, we set the system’s configuration.

Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: Eiffel's iron lace\nRiver Seine's gentle flow by\nMontmartre's charm calls
Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with emojis<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow to go from Beijing to NY?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: ✈🛬🇺🇸

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Llama 3 models in SageMaker Studio. You now have access to four Llama 3 foundation models that contain billions of parameters. Because foundation models are pretrained, they can also help lower training and infrastructure costs and enable customization for your use case. Check out SageMaker JumpStart in SageMaker Studio now to get started.


About Authors

Kyle Ulrich is an Applied Scientist II at AWS
Xin Huang is a Senior Applied Scientist at AWS
Qing Lan is a Senior Software Developer Engineer at AWS
Haotian An is a Software Developer Engineer II at AWS
Christopher Whitten is a Software Development Engineer II at AWS
Tyler Osterberg is a Software Development Engineer I at AWS
Manan Shah is a Software Development Manager at AWS
Jonathan Guinegagne is a Senior Software Developer Engineer at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Ashish Khetan is a Senior Applied Scientist at AWS
Rachna Chadha is a Principal Solution Architect at AWS
Deepak Rupakula is a Principal GTM Specialist at AWS

Slack delivers native and secure generative AI powered by Amazon SageMaker JumpStart

This post is co-authored by Jackie Rocca, VP of Product, AI at Slack

Slack is where work happens. It’s the AI-powered platform for work that connects people, conversations, apps, and systems together in one place. With the newly launched Slack AI—a trusted, native, generative artificial intelligence (AI) experience available directly in Slack—users can surface and prioritize information so they can find their focus and do their most productive work.

We are excited to announce that Slack, a Salesforce company, has collaborated with Amazon SageMaker JumpStart to power Slack AI’s initial search and summarization features and provide safeguards for Slack to use large language models (LLMs) more securely. Slack worked with SageMaker JumpStart to host industry-leading third-party LLMs so that data is not shared with the infrastructure owned by third party model providers.

This keeps customer data in Slack at all times and upholds the same security practices and compliance standards that customers expect from Slack itself. Slack is also using Amazon SageMaker inference capabilities for advanced routing strategies to scale the solution to customers with optimal performance, latency, and throughput.

“With Amazon SageMaker JumpStart, Slack can access state-of-the-art foundation models to power Slack AI, while prioritizing security and privacy. Slack customers can now search smarter, summarize conversations instantly, and be at their most productive.”

– Jackie Rocca, VP Product, AI at Slack

Foundation models in SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select foundation models (FMs) quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation. Pretrained models are fully customizable for your use case with your data, and you can effortlessly deploy them into production with the user interface or SDK. In addition, you can access prebuilt solutions to solve common use cases and share ML artifacts, including ML models and notebooks, within your organization to accelerate ML model building and deployment. None of your data is used to train the underlying models. All the data is encrypted and is never shared with third-party vendors so you can trust that your data remains private and confidential.

Check out the SageMaker JumpStart model page for available models.

Slack AI

Slack launched Slack AI to provide native generative AI capabilities so that customers can easily find and consume large volumes of information quickly, enabling them to get even more value out of their shared knowledge in Slack. For example, users can ask a question in plain language and instantly get clear and concise answers with enhanced search. They can catch up on channels and threads in one click with conversation summaries. And they can access personalized, daily digests of what’s happening in select channels with the newly launched recaps.

Because trust is Slack’s most important value, Slack AI runs on an enterprise-grade infrastructure they built on AWS, upholding the same security practices and compliance standards that customers expect. Slack AI is built for security-conscious customers and is designed to be secure by design—customer data remains in-house, data is not used for LLM training purposes, and data remains siloed.

Solution overview

SageMaker JumpStart provides access to many LLMs, and Slack selects the FMs that fit their use cases. Because these models are hosted on Slack-owned AWS infrastructure, data sent to models during invocation doesn’t leave Slack’s AWS infrastructure. In addition, to provide a secure solution, data sent for invoking SageMaker models is encrypted in transit. The data sent to SageMaker JumpStart endpoints for invoking models is not used to train base models. SageMaker JumpStart allows Slack to support high standards for security and data privacy, while also using state-of-the-art models that help Slack AI perform optimally for Slack customers.

SageMaker JumpStart endpoints serving Slack business applications are powered by AWS instances. SageMaker supports a wide range of instance types for model deployment, which allows Slack to pick the instance that is best suited to support the latency and scalability requirements of Slack AI use cases. Slack AI has access to multi-GPU instances to host their SageMaker JumpStart models. Multi-GPU instances allow each instance backing Slack AI’s endpoint to host multiple copies of a model. This helps improve resource utilization and reduce model deployment cost. For more information, refer to Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The following diagram illustrates the solution architecture.

To use the instances most effectively and support the concurrency and latency requirements, Slack used SageMaker-offered routing strategies with their SageMaker endpoints. By default, a SageMaker endpoint distributes incoming requests across ML instances at random, using a routing strategy called RANDOM. However, with generative AI workloads, requests and responses can be extremely variable, and it’s desirable to load balance by considering the capacity and utilization of each instance rather than picking one at random. To effectively distribute requests across instances backing the endpoints, Slack uses the LEAST_OUTSTANDING_REQUESTS (LOR) routing strategy. This strategy routes requests to the instances that have more capacity to process requests instead of randomly picking any available instance. The LOR strategy provides more uniform load balancing and resource utilization. As a result, Slack AI saw a decrease of more than 39% in p95 latency when enabling LEAST_OUTSTANDING_REQUESTS compared to RANDOM.
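The routing strategy is set per production variant in the endpoint configuration. The following is a minimal sketch of the request you might pass to the SageMaker CreateEndpointConfig API to select LEAST_OUTSTANDING_REQUESTS. The helper name and all resource names and sizes are hypothetical placeholders, and this is not Slack's actual configuration.

```python
ROUTING_STRATEGIES = ("LEAST_OUTSTANDING_REQUESTS", "RANDOM")


def build_endpoint_config(config_name, model_name, instance_type,
                          instance_count,
                          routing_strategy="LEAST_OUTSTANDING_REQUESTS"):
    """Build keyword arguments for SageMaker's create_endpoint_config call.

    build_endpoint_config is a hypothetical helper; it only assembles the
    request dictionary and makes no AWS call itself.
    """
    if routing_strategy not in ROUTING_STRATEGIES:
        raise ValueError(f"unknown routing strategy: {routing_strategy}")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                # Route each request to the instance with the fewest
                # in-flight requests instead of picking one at random.
                "RoutingConfig": {"RoutingStrategy": routing_strategy},
            }
        ],
    }


# Placeholder names and sizes, for illustration only.
config = build_endpoint_config(
    "example-llm-config", "example-llm-model", "ml.g5.12xlarge", 2
)
print(config["ProductionVariants"][0]["RoutingConfig"])
# With AWS credentials configured, the request could be sent with:
#   boto3.client("sagemaker").create_endpoint_config(**config)
```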

For more details on SageMaker routing strategies, see Minimize real-time inference latency by using Amazon SageMaker routing strategies.

Conclusion

Slack is delivering native generative AI capabilities that will help their customers be more productive and easily tap into the collective knowledge that’s embedded in their Slack conversations. With fast access to a large selection of FMs and advanced load balancing capabilities that are hosted in dedicated instances through SageMaker JumpStart, Slack AI is able to provide rich generative AI features in a more robust and quicker manner, while upholding Slack’s trust and security standards.

Learn more about SageMaker JumpStart, Slack AI and how the Slack team built Slack AI to be secure and private. Leave your thoughts and questions in the comments section.


About the Authors

Jackie Rocca is VP of Product at Slack, where she oversees the vision and execution of Slack AI, which brings generative AI natively and securely into Slack’s user experience. In her five years at Slack, Jackie has delivered on a number of initiatives to push Slack’s business forward. Now she’s on a mission to help customers accelerate their productivity and get even more value out of their conversations, data, and collective knowledge with generative AI. Prior to her time at Slack, Jackie was a Product Manager at Google for more than six years, where she helped launch and grow YouTube TV. Jackie is based in the San Francisco Bay Area.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Maninder (Mani) Kaur is the AI/ML Specialist lead for Strategic ISVs at AWS. With her customer-first approach, Mani helps strategic customers shape their AI/ML strategy, fuel innovation, and accelerate their AI/ML journey. Mani is a firm believer in ethical and responsible AI, and strives to ensure that her customers’ AI solutions align with these principles.

Gene Ting is a Principal Solutions Architect at AWS. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest on cybersecurity.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Uncover hidden connections in unstructured financial data with Amazon Bedrock and Amazon Neptune

In asset management, portfolio managers need to closely monitor companies in their investment universe to identify risks and opportunities, and guide investment decisions. Tracking direct events like earnings reports or credit downgrades is straightforward: you can set up alerts to notify managers of news containing company names. However, detecting second- and third-order impacts arising from events at suppliers, customers, partners, or other entities in a company’s ecosystem is challenging.

For example, a supply chain disruption at a key vendor would likely negatively impact downstream manufacturers. Or the loss of a top customer for a major client poses a demand risk for the supplier. Very often, such events fail to make headlines featuring the impacted company directly, but are still important to pay attention to. In this post, we demonstrate an automated solution combining knowledge graphs and generative artificial intelligence (AI) to surface such risks by cross-referencing relationship maps with real-time news.

Broadly, this entails two steps: First, building the intricate relationships between companies (customers, suppliers, directors) into a knowledge graph. Second, using this graph database along with generative AI to detect second- and third-order impacts from news events. For instance, this solution can highlight that delays at a parts supplier may disrupt production for downstream auto manufacturers in a portfolio, even though none are directly referenced.

With AWS, you can deploy this solution in a serverless, scalable, and fully event-driven architecture. This post demonstrates a proof of concept built on two key AWS services well suited for graph knowledge representation and natural language processing: Amazon Neptune and Amazon Bedrock. Neptune is a fast, reliable, fully managed graph database service that makes it straightforward to build and run applications that work with highly connected datasets. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Overall, this prototype demonstrates the art of the possible with knowledge graphs and generative AI: deriving signals by connecting disparate dots. The takeaway for investment professionals is the ability to stay on top of developments closer to the signal while avoiding noise.

Build the knowledge graph

The first step in this solution is building a knowledge graph, and a valuable yet often overlooked data source for knowledge graphs is company annual reports. Because official corporate publications undergo scrutiny before release, the information they contain is likely to be accurate and reliable. However, annual reports are written in an unstructured format meant for human reading rather than machine consumption. To unlock their potential, you need a way to systematically extract and structure the wealth of facts and relationships they contain.

With generative AI services like Amazon Bedrock, you now have the capability to automate this process. You can take an annual report and trigger a processing pipeline to ingest the report, break it down into smaller chunks, and apply natural language understanding to pull out salient entities and relationships.
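The chunking step can be as simple as splitting on whitespace. The following is a minimal sketch, assuming plain text has already been extracted from the report; the function name split_into_chunks is hypothetical, and the default of 1,000 words follows the approximate chunk size used in the workflow described later in this post.

```python
def split_into_chunks(text, max_words=1000):
    """Split text into consecutive chunks of at most max_words words.

    A hypothetical helper: it splits on whitespace only and does not try
    to respect sentence or paragraph boundaries.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


report_text = "word " * 2500  # stand-in for text extracted from a report
chunks = split_into_chunks(report_text)
print([len(chunk.split()) for chunk in chunks])
```

A production pipeline might instead split on sentence or section boundaries so that a relationship is less likely to be cut in half across two chunks.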

For example, a sentence stating that “[Company A] expanded its European electric delivery fleet with an order for 1,800 electric vans from [Company B]” would allow Amazon Bedrock to identify the following:

  • [Company A] as a customer
  • [Company B] as a supplier
  • A supplier relationship between [Company A] and [Company B]
  • Relationship details of “supplier of electric delivery vans”

Extracting such structured data from unstructured documents requires providing carefully crafted prompts to large language models (LLMs) so they can analyze text to pull out entities like companies and people, as well as relationships such as customers, suppliers, and more. The prompts contain clear instructions on what to look out for and the structure to return the data in. By repeating this process across the entire annual report, you can extract the relevant entities and relationships to construct a rich knowledge graph.
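To make this concrete, the following sketch shows one way such a prompt and its response handling could look. The prompt wording, the JSON keys, and both function names are assumptions made for illustration and are not the prototype's actual prompts; the relationship types are the ones listed in the workflow later in this post.

```python
import json

RELATIONSHIP_TYPES = ("customer", "supplier", "partner", "competitor", "director")


def build_extraction_prompt(main_entity, chunk):
    """Build an illustrative extraction prompt for one text chunk."""
    return (
        f"The following text is from a report about {main_entity}.\n"
        "List every company or person mentioned and its relationship to "
        f"{main_entity}. Allowed relationships: "
        f"{', '.join(RELATIONSHIP_TYPES)}.\n"
        'Reply with only a JSON array of objects with the keys "entity", '
        '"relationship", and "details".\n\n'
        f"<text>\n{chunk}\n</text>"
    )


def parse_relationships(model_output):
    """Parse the model's JSON reply and keep only well-formed records."""
    records = json.loads(model_output)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    valid = []
    for record in records:
        if not isinstance(record, dict):
            continue
        if record.get("entity") and record.get("relationship") in RELATIONSHIP_TYPES:
            valid.append(
                {
                    "entity": record["entity"],
                    "relationship": record["relationship"],
                    "details": record.get("details", ""),
                }
            )
    return valid


prompt = build_extraction_prompt(
    "[Company A]",
    "[Company A] expanded its European electric delivery fleet with an "
    "order for 1,800 electric vans from [Company B]",
)
# A hand-written stand-in for a model reply; no model is called here.
sample_reply = (
    '[{"entity": "[Company B]", "relationship": "supplier", '
    '"details": "supplier of electric delivery vans"}, '
    '{"entity": "consumers", "relationship": "audience"}]'
)
relationships = parse_relationships(sample_reply)
print(relationships)
```

Validating the reply against a fixed list of relationship types is one simple guard against malformed or off-schema model output before anything is written to the graph.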

However, before committing the extracted information to the knowledge graph, you need to first disambiguate the entities. For instance, there may already be another ‘[Company A]’ entity in the knowledge graph, but it could represent a different organization with the same name. Amazon Bedrock can reason over and compare attributes such as business focus area, industry, revenue-generating industries, and relationships to other entities to determine whether the two entities are actually distinct. This prevents inaccurately merging unrelated companies into a single entity.

After disambiguation is complete, you can reliably add new entities and relationships into your Neptune knowledge graph, enriching it with the facts extracted from annual reports. Over time, the ingestion of reliable data and integration of more reliable data sources will help build a comprehensive knowledge graph that can support revealing insights through graph queries and analytics.

This automation enabled by generative AI makes it feasible to process thousands of annual reports and unlocks an invaluable asset for knowledge graph curation that would otherwise go untapped due to the prohibitively high manual effort needed.

The following screenshot shows an example of the visual exploration that’s possible in a Neptune graph database using the Graph Explorer tool.

Process news articles

The next step of the solution is automatically enriching portfolio managers’ news feeds and highlighting articles relevant to their interests and investments. For the news feed, portfolio managers can subscribe to any third-party news provider through AWS Data Exchange or another news API of their choice.

When a news article enters the system, an ingestion pipeline is invoked to process the content. Using techniques similar to the processing of annual reports, Amazon Bedrock is used to extract entities, attributes, and relationships from the news article. These are then disambiguated against the knowledge graph to identify the corresponding entities it already contains.

The knowledge graph contains connections between companies and people, and by linking article entities to existing nodes, you can identify if any subjects are within two hops of the companies that the portfolio manager has invested in or is interested in. Finding such a connection indicates the article may be relevant to the portfolio manager, and because the underlying data is represented in a knowledge graph, it can be visualized to help the portfolio manager understand why and how this context is relevant. In addition to identifying connections to the portfolio, you can also use Amazon Bedrock to perform sentiment analysis on the entities referenced.
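The hop-limited search can be illustrated with a breadth-first traversal. The following is a minimal in-memory sketch, assuming the graph is held as an adjacency dictionary with made-up entity names; in the actual solution the graph lives in Neptune and the search would be expressed as a graph query instead.

```python
from collections import deque


def find_connection_paths(graph, start, interested, max_hops=2):
    """Return the shortest path from start to each interested entity
    that is reachable within max_hops edges.

    graph maps each entity to a list of directly connected entities.
    """
    paths = {}
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in interested:
            paths[node] = path
        if len(path) - 1 == max_hops:
            continue  # hop budget used up; do not expand further
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return paths


# Fictional entities, for illustration only.
graph = {
    "Parts Supplier": ["Auto Maker"],
    "Auto Maker": ["Parts Supplier", "Dealer Group"],
    "Dealer Group": ["Auto Maker", "Lender"],
    "Lender": ["Dealer Group"],
}
paths = find_connection_paths(
    graph, "Parts Supplier", interested={"Dealer Group", "Lender"}
)
print(paths)
```

Here an article about the parts supplier is linked to the dealer group through the auto maker, while the lender, three hops away, falls outside the default limit. The returned path is what lets the connection be shown to the portfolio manager.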

The final output is an enriched news feed surfacing articles likely to impact the portfolio manager’s areas of interest and investments.

Solution overview

The overall architecture of the solution looks like the following diagram.

The workflow consists of the following steps:

  1. A user uploads official reports (in PDF format) to an Amazon Simple Storage Service (Amazon S3) bucket. The reports should be officially published reports to minimize the inclusion of inaccurate data into your knowledge graph (as opposed to news and tabloids).
  2. The S3 event notification invokes an AWS Lambda function, which sends the S3 bucket and file name to an Amazon Simple Queue Service (Amazon SQS) queue. The First-In-First-Out (FIFO) queue makes sure that the report ingestion process is performed sequentially to reduce the likelihood of introducing duplicate data into your knowledge graph.
  3. An Amazon EventBridge time-based event runs every minute to start the run of an AWS Step Functions state machine asynchronously.
  4. The Step Functions state machine runs through a series of tasks to process the uploaded document by extracting key information and inserting it into your knowledge graph:
    1. Receive the queue message from Amazon SQS.
    2. Download the PDF report file from Amazon S3, split it into multiple smaller text chunks (approximately 1,000 words) for processing, and store the text chunks in Amazon DynamoDB.
    3. Use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to process the first few text chunks to determine the main entity that the report is referring to, together with relevant attributes (such as industry).
    4. Retrieve the text chunks from DynamoDB and, for each text chunk, invoke a Lambda function that uses Amazon Bedrock to extract entities (such as a company or person) and their relationships (customer, supplier, partner, competitor, or director) to the main entity.
    5. Consolidate all extracted information.
    6. Filter out noise and irrelevant entities (for example, generic terms such as “consumers”) using Amazon Bedrock.
    7. Use Amazon Bedrock to perform disambiguation by reasoning using the extracted information against the list of similar entities from the knowledge graph. If the entity does not exist, insert it. Otherwise, use the entity that already exists in the knowledge graph. Insert all relationships extracted.
    8. Clean up by deleting the SQS queue message and the S3 file.
  5. A user accesses a React-based web application to view the news articles that are supplemented with the entity, sentiment, and connection path information.
  6. Using the web application, the user specifies the number of hops (default N=2) on the connection path to monitor.
  7. Using the web application, the user specifies the list of entities to track.
  8. To generate fictional news, the user chooses Generate Sample News to generate 10 sample financial news articles with random content to be fed into the news ingestion process. Content is generated using Amazon Bedrock and is purely fictional.
  9. To download actual news, the user chooses Download Latest News to download the top news happening today (powered by NewsAPI.org).
  10. The news file (TXT format) is uploaded to an S3 bucket. Steps 8 and 9 upload news to the S3 bucket automatically, but you can also build integrations to your preferred news provider such as AWS Data Exchange or any third-party news provider to drop news articles as files into the S3 bucket. News data file content should be formatted as <date>{dd mmm yyyy}</date><title>{title}</title><text>{news content}</text>.
  11. The S3 event notification sends the S3 bucket and file name to Amazon SQS (standard), which invokes multiple Lambda functions to process the news data in parallel:
    1. Use Amazon Bedrock to extract entities mentioned in the news together with any related information, relationships, and sentiment of the mentioned entity.
    2. Check against the knowledge graph and use Amazon Bedrock to perform disambiguation by reasoning using the available information from the news and from within the knowledge graph to identify the corresponding entity.
    3. After the entity has been located, search for and return any connection paths connecting to entities marked with INTERESTED=YES in the knowledge graph that are within N=2 hops away.
  12. The web application automatically refreshes every second to pull and display the latest set of processed news.
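The news file format from step 10 is simple enough to parse with a regular expression. The following is a minimal sketch, assuming one article per file and English month abbreviations in the date (for example, 03 May 2024); the function name parse_news_file is hypothetical and is not taken from the prototype's code.

```python
import re
from datetime import datetime

NEWS_PATTERN = re.compile(
    r"<date>(?P<date>.*?)</date>\s*"
    r"<title>(?P<title>.*?)</title>\s*"
    r"<text>(?P<text>.*?)</text>",
    re.DOTALL,
)


def parse_news_file(content):
    """Parse one news file in the <date>/<title>/<text> format."""
    match = NEWS_PATTERN.search(content)
    if match is None:
        raise ValueError("content does not match the expected news format")
    return {
        # "dd mmm yyyy" is read here as day, abbreviated month name, year.
        "date": datetime.strptime(match["date"].strip(), "%d %b %Y").date(),
        "title": match["title"].strip(),
        "text": match["text"].strip(),
    }


sample = (
    "<date>03 May 2024</date>"
    "<title>Supplier delays shipments</title>"
    "<text>Fictional article body.</text>"
)
article = parse_news_file(sample)
print(article["date"].isoformat(), article["title"])
```

Rejecting files that don't match the format early keeps malformed input from reaching the entity extraction step.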

Deploy the prototype

You can deploy the prototype solution and start experimenting yourself. The prototype is available from GitHub and includes details on the following:

  • Deployment prerequisites
  • Deployment steps
  • Cleanup steps

Summary

This post demonstrated a proof of concept solution to help portfolio managers detect second- and third-order risks from news events, without direct references to companies they track. By combining a knowledge graph of intricate company relationships with real-time news analysis using generative AI, downstream impacts can be highlighted, such as production delays from supplier hiccups.

Although it’s only a prototype, this solution shows the promise of knowledge graphs and language models to connect dots and derive signals from noise. These technologies can aid investment professionals by revealing risks faster through relationship mappings and reasoning. Overall, this is a promising application of graph databases and AI that warrants exploration to augment investment analysis and decision-making.

If this example of generative AI in financial services is of interest to your business, or you have a similar idea, reach out to your AWS account manager, and we will be delighted to explore further with you.


About the Author

Xan Huang is a Senior Solutions Architect with AWS and is based in Singapore. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Xan spends most of his free time with his family and getting bossed around by his 3-year-old daughter. You can find Xan on LinkedIn.

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

Recent developments in machine learning (ML) have led to increasingly large models, some of which require hundreds of billions of parameters. Although they are more powerful, training and inference on those models require significant computational resources. Despite the availability of advanced distributed training libraries, it’s common for training and inference jobs to need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens or hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes key to model performance fine-tuning and cost optimization. Metrics allow teams to understand workload behavior and optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chips utilization and saturation are also relevant for capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips, used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of opinionated modules to help you set observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability set of patterns instruments observability with Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.

Solution overview

The following diagram illustrates the solution architecture.

This solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container, with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the neuron-monitor command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

neuron-monitor | neuron-monitor-prometheus.py --port <port>

The command uses the following components:

  • neuron-monitor collects metrics and stats from the Neuron applications running on the system and streams the collected data to stdout in JSON format
  • neuron-monitor-prometheus.py maps and exposes the telemetry data from JSON format into Prometheus-compatible format
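To make the JSON-to-Prometheus mapping more concrete, the following sketch flattens the numeric fields of a JSON report into Prometheus text-format lines. It is not the neuron-monitor-prometheus.py script: the `to_prometheus` helper, the `neuron_` prefix, and the sample field names are assumptions for illustration, and the real neuron-monitor schema and metric names differ.

```python
import json


def to_prometheus(report_json, prefix="neuron_"):
    """Flatten numeric leaves of a JSON report into 'metric_name value' lines."""
    lines = []

    def walk(value, name):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{name}_{key}" if name else key)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(f"{prefix}{name} {value}")
        # Strings, lists, booleans, and nulls are skipped in this sketch.

    walk(json.loads(report_json), "")
    return "\n".join(lines)


# Hypothetical report, not the actual neuron-monitor output format.
sample = '{"memory_used": {"host": 1024, "device": 2048}, "status": "ok"}'
exposition = to_prometheus(sample)
print(exposition)
```

A Prometheus-compatible scraper reads lines of this shape from an HTTP endpoint, which is the role the --port argument plays in the command above.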

Data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup to collect and visualize metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana is similar to that used in other open source based patterns, which are included in the AWS Observability Accelerator for CDK GitHub repository.

Prerequisites

You need the following to complete the steps in this post:

Set up the environment

Complete the following steps to set up your environment:

  1. Open a terminal window and run the following commands:
export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
  2. Retrieve the workspace IDs of any existing Amazon Managed Grafana workspaces:
aws grafana list-workspaces

The following is our sample output:

{
  "workspaces": [
    {
      "authentication": {
        "providers": [
          "AWS_SSO"
        ]
      },
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "name": "accelerator-workspace",
      "notificationDestinations": [
        "SNS"
      ],
      "status": "ACTIVE",
      "tags": {}
    }
  ]
}
  3. Assign the values of id and endpoint to the following environment variables:
export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, similar to the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including protocol (i.e. https://), without quotation marks, similar to the above https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com>>"

COA_AMG_ENDPOINT_URL needs to include https://.

  4. Create a Grafana API key from the Amazon Managed Grafana workspace:
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
--key-name "grafana-operator-key" \
--key-role "ADMIN" \
--seconds-to-live 432000 \
--workspace-id $COA_AMG_WORKSPACE_ID \
--query key \
--output text)
  5. Set up a secret in AWS Systems Manager:
aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
--type "SecureString" \
--value "$AMG_API_KEY" \
--region $AWS_REGION

The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.
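The mapping from the list-workspaces output to the two COA_AMG_* variables can also be scripted. The following is a minimal sketch, assuming you have the command's JSON output available as a Python dictionary; the `amg_environment` helper is our own name, not part of any AWS tool.

```python
def amg_environment(list_workspaces_output, workspace_name):
    """Derive the COA_AMG_* values for the workspace with the given name."""
    for workspace in list_workspaces_output.get("workspaces", []):
        if workspace.get("name") == workspace_name:
            return {
                "COA_AMG_WORKSPACE_ID": workspace["id"],
                # The endpoint field has no scheme, so https:// is added here.
                "COA_AMG_ENDPOINT_URL": "https://" + workspace["endpoint"],
            }
    raise LookupError(f"no workspace named {workspace_name!r}")


# Trimmed copy of the sample output shown earlier.
sample = {
    "workspaces": [
        {
            "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
            "id": "g-XYZ",
            "name": "accelerator-workspace",
            "status": "ACTIVE",
        }
    ]
}

env = amg_environment(sample, "accelerator-workspace")
for key, value in env.items():
    print(f"export {key}={value}")
```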

Bootstrap the AWS CDK environment

The first step to any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) with resources required by AWS CDK to perform deployments into that environment. AWS CDK bootstrapping is needed for each account and Region combination, so if you already bootstrapped AWS CDK in a Region, you don’t need to repeat the bootstrapping process.

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.
git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator

The actual settings for Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup is used to set the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

  2. Replace the context section of cdk.json with the following snippet:
"context": {
    "fluxRepository": {
      "name": "grafana-dashboards",
      "namespace": "grafana-operator",
      "repository": {
        "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
        "name": "grafana-dashboards",
        "targetRevision": "main",
        "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      },
      "values": {
        "GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
        "GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
        "GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
        "GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
        "GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
        "GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
        "GRAFANA_NEURON_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
      },
      "kustomizations": [
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
        },
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
        }
      ]
    },
    "neuronNodeGroup": {
      "instanceClass": "inf1",
      "instanceSize": "2xlarge",
      "desiredSize": 1,
      "minSize": 1,
      "maxSize": 3,
      "ebsSize": 512
    }
  }

You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your selected Region, run the following command (amend Values as you see fit):

aws ec2 describe-instance-type-offerings \
--filters Name=instance-type,Values="inf1*" \
--query "InstanceTypeOfferings[].InstanceType" \
--region $AWS_REGION
  3. Install the project dependencies:
npm install
  4. Run the following commands to deploy the open source observability pattern:
make build
make pattern single-new-eks-inferentia-opensource-observability deploy
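Before you run the deploy commands, a quick sanity check of the neuronNodeGroup settings can catch simple mistakes. The following sketch is our own illustration and not part of the accelerator: the `check_node_group` helper is hypothetical, and it only verifies the size ordering and assembles the EC2 instance type name from the class and size fields.

```python
def check_node_group(config):
    """Validate size ordering and return the EC2 instance type name."""
    if not (config["minSize"] <= config["desiredSize"] <= config["maxSize"]):
        raise ValueError("expected minSize <= desiredSize <= maxSize")
    if config["ebsSize"] <= 0:
        raise ValueError("ebsSize must be positive")
    # EC2 instance types are written as <class>.<size>, for example inf1.2xlarge.
    return f'{config["instanceClass"]}.{config["instanceSize"]}'


# Values copied from the neuronNodeGroup block in the cdk.json snippet.
neuron_node_group = {
    "instanceClass": "inf1",
    "instanceSize": "2xlarge",
    "desiredSize": 1,
    "minSize": 1,
    "maxSize": 3,
    "ebsSize": 512,
}

instance_type = check_node_group(neuron_node_group)
print(instance_type)
```

The resulting name is the kind of value the describe-instance-type-offerings command returns for a Region where the type is available.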

Validate the solution

Complete the following steps to validate the solution:

  1. Run the update-kubeconfig command. You should be able to get the command from the output message of the previous command:
aws eks update-kubeconfig --name single-new-eks-inferentia-opensource... --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-....
  2. Verify the resources you created:
kubectl get pods -A

The following screenshot shows our sample output.

  3. Make sure the neuron-device-plugin-daemonset DaemonSet is running:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

The following is our expected output:

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h
  4. Confirm that the neuron-monitor DaemonSet is running:
kubectl get ds neuron-monitor --namespace kube-system

The following is our expected output:

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
  5. To verify that the Neuron devices and cores are visible, run the neuron-ls and neuron-top commands from, for example, your neuron-monitor pod (you can get the pod’s name from the output of kubectl get pods -A):
kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-ls"

The following screenshot shows our expected output.

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-top"

The following screenshot shows our expected output.
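If you want to script the DaemonSet checks from the preceding steps, the tabular output of kubectl get ds can be parsed with a few lines of Python. This is a minimal sketch, assuming the default column layout shown above; the `daemonsets_ready` helper is our own name.

```python
def daemonsets_ready(table_text):
    """Map each DaemonSet name to True when READY equals a nonzero DESIRED."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    # DESIRED and READY come before the two-word NODE SELECTOR column,
    # so their positions line up with the whitespace-split data rows.
    desired_col = header.index("DESIRED")
    ready_col = header.index("READY")
    status = {}
    for line in lines[1:]:
        cells = line.split()
        desired, ready = int(cells[desired_col]), int(cells[ready_col])
        status[cells[0]] = desired > 0 and desired == ready
    return status


# Sample output copied from the neuron-monitor check above.
sample = """
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
"""

status = daemonsets_ready(sample)
print(status)
```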

Visualize data using the Grafana Neuron dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named Neuron / Monitor.

To see some interesting metrics on the Grafana dashboard, we apply the following manifest:

curl https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/k8s-deployment-manifest-templates/neuron/pytorch-inference-resnet50.yml | kubectl apply -f -

This is a sample workload that compiles the torchvision ResNet50 model and runs repetitive inference in a loop to generate telemetry data.

To verify the pod was successfully deployed, run the following code:

kubectl get pods

You should see a pod named pytorch-inference-resnet50.

After a few minutes, looking into the Neuron / Monitor dashboard, you should see the gathered metrics similar to the following screenshots.

Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.

Clean up

You can delete the whole AWS CDK stack with the following command:

make pattern single-new-eks-inferentia-opensource-observability destroy

Conclusion

In this post, we showed you how to introduce observability, with open source tooling, into an EKS cluster featuring a data plane running EC2 Inf1 instances. We started by selecting the Amazon EKS-optimized accelerated AMI for the data plane nodes, which includes the Neuron container runtime, providing access to AWS Inferentia and Trainium Neuron devices. Then, to expose the Neuron cores and devices to Kubernetes, we deployed the Neuron device plugin. The actual collection and mapping of telemetry data into Prometheus-compatible format was achieved via neuron-monitor and neuron-monitor-prometheus.py. Metrics were sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard of Amazon Managed Grafana.

We recommend that you explore additional observability patterns in the AWS Observability Accelerator for CDK GitHub repo. To learn more about Neuron, refer to the AWS Neuron Documentation.


About the Author

Riccardo Freschi is a Sr. Solutions Architect at AWS, focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes in their journey to the AWS Cloud by refactoring existing applications and building new ones.

Explore data with ease: Use SQL and Text-to-SQL in Amazon SageMaker Studio JupyterLab notebooks

Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. In the process of working on their ML tasks, data scientists typically start their workflow by discovering relevant data sources and connecting to them. They then use SQL to explore, analyze, visualize, and integrate data from various sources before using it in their ML training and inference. Previously, data scientists often found themselves juggling multiple tools to support SQL in their workflow, which hindered productivity.

We’re excited to announce that JupyterLab notebooks in SageMaker Studio now come with built-in support for SQL. Data scientists can now:

  • Connect to popular data services including Amazon Athena, Amazon Redshift, Amazon DataZone, and Snowflake directly within the notebooks
  • Browse and search for databases, schemas, tables, and views, and preview data within the notebook interface
  • Mix SQL and Python code in the same notebook for efficient exploration and transformation of data for use in ML projects
  • Use developer productivity features such as SQL command completion, code formatting assistance, and syntax highlighting to help accelerate code development and improve overall developer productivity

In addition, administrators can securely manage connections to these data services, allowing data scientists to access authorized data without the need to manage credentials manually.

In this post, we guide you through setting up this feature in SageMaker Studio, and walk you through various capabilities of this feature. Then we show how you can enhance the in-notebook SQL experience using Text-to-SQL capabilities provided by advanced large language models (LLMs) to write complex SQL queries using natural language text as input. Finally, to enable a broader audience of users to generate SQL queries from natural language input in their notebooks, we show you how to deploy these Text-to-SQL models using Amazon SageMaker endpoints.

Solution overview

With SageMaker Studio JupyterLab notebook’s SQL integration, you can now connect to popular data sources like Snowflake, Athena, Amazon Redshift, and Amazon DataZone. This new feature enables you to perform various functions.

For example, you can visually explore data sources like databases, tables, and schemas directly from your JupyterLab ecosystem. If your notebook environments are running on SageMaker Distribution 1.6 or higher, look for a new widget on the left side of your JupyterLab interface. This addition enhances data accessibility and management within your development environment.

If you’re on an older SageMaker Distribution (1.5 or lower) or in a custom environment, refer to the appendix for more information.

After you have set up connections (illustrated in the next section), you can list data connections, browse databases and tables, and inspect schemas.

The SageMaker Studio JupyterLab built-in SQL extension also enables you to run SQL queries directly from a notebook. Jupyter notebooks can differentiate between SQL and Python code using the %%sm_sql magic command, which must be placed at the top of any cell that contains SQL code. This command signals to JupyterLab that the following instructions are SQL commands rather than Python code. The output of a query can be displayed directly within the notebook, facilitating seamless integration of SQL and Python workflows in your data analysis.

The output of a query can be displayed visually as HTML tables, as shown in the following screenshot.

They can also be written to a pandas DataFrame.

Prerequisites

Make sure you have satisfied the following prerequisites in order to use the SageMaker Studio notebook SQL experience:

  • SageMaker Studio V2 – Make sure you’re running the most up-to-date version of your SageMaker Studio domain and user profiles. If you’re currently on SageMaker Studio Classic, refer to Migrating from Amazon SageMaker Studio Classic.
  • IAM role – SageMaker requires an AWS Identity and Access Management (IAM) role to be assigned to a SageMaker Studio domain or user profile to manage permissions effectively. An execution role update may be required to bring in data browsing and the SQL run feature. The following example policy enables users to grant, list, and run AWS Glue, Athena, Amazon Simple Storage Service (Amazon S3), AWS Secrets Manager, and Amazon Redshift resources:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Sid":"SQLRelatedS3Permissions",
             "Effect":"Allow",
             "Action":[
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:PutObject"
             ],
             "Resource":[
                "arn:aws:s3:::sagemaker*/*",
                "arn:aws:s3:::sagemaker*"
             ]
          },
          {
             "Sid":"GlueDataAccess",
             "Effect":"Allow",
             "Action":[
                "glue:GetDatabases",
                "glue:GetSchema",
                "glue:GetTables",
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:ListSchemas",
                "glue:GetPartitions",
                "glue:GetConnections",
                "glue:GetConnection",
                "glue:CreateConnection"
             ],
             "Resource":[
                "arn:aws:glue:<region>:<account>:table/sagemaker*/*",
                "arn:aws:glue:<region>:<account>:database/sagemaker*",
                "arn:aws:glue:<region>:<account>:schema/sagemaker*",
                "arn:aws:glue:<region>:<account>:connection/sagemaker*",
                "arn:aws:glue:<region>:<account>:registry/sagemaker*",
                "arn:aws:glue:<region>:<account>:catalog"
             ]
          },
          {
             "Sid":"AthenaQueryExecution",
             "Effect":"Allow",
             "Action":[
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata",
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:RunQuery",
                "athena:StartSession",
                "athena:GetQueryResults",
                "athena:ListWorkGroups",
                "athena:GetDataCatalog",
                "athena:GetWorkGroup"
             ],
             "Resource":[
                "arn:aws:athena:<region>:<account>:workgroup/sagemaker*",
                "arn:aws:athena:<region>:<account>:datacatalog/sagemaker*"
             ]
          },
          {
             "Sid":"GetSecretsAndCredentials",
             "Effect":"Allow",
             "Action":[
                "secretsmanager:GetSecretValue",
                "redshift:GetClusterCredentials"
             ],
             "Resource":[
                "arn:aws:secretsmanager:<region>:<account>:secret:sagemaker*",
                "arn:aws:redshift:<region>:<account>:dbuser:sagemaker*/sagemaker*",
                "arn:aws:redshift:<region>:<account>:dbgroup:sagemaker*/sagemaker*",
                "arn:aws:redshift:<region>:<account>:dbname:sagemaker*/sagemaker*"
             ]
          }
       ]
    }

  • JupyterLab Space – You need access to the updated SageMaker Studio and JupyterLab Space with SageMaker Distribution v1.6 or later image versions. If you’re using custom images for JupyterLab Spaces or older versions of SageMaker Distribution (v1.5 or lower), refer to the appendix for instructions to install necessary packages and modules to enable this feature in your environments. To learn more about SageMaker Studio JupyterLab Spaces, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools.
  • Data source access credentials – This SageMaker Studio notebook feature requires user name and password access to data sources such as Snowflake and Amazon Redshift. Create user name and password-based access to these data sources if you do not already have one. OAuth-based access to Snowflake is not a supported feature as of this writing.
  • Load SQL magic – Before you run SQL queries from a Jupyter notebook cell, it’s essential to load the SQL magics extension. Use the command %load_ext amazon_sagemaker_sql_magic to enable this feature. Additionally, you can run the %sm_sql? command to view a comprehensive list of supported options for querying from a SQL cell. These options include setting a default query limit of 1,000, running a full extraction, and injecting query parameters, among others. This setup allows for flexible and efficient SQL data manipulation directly within your notebook environment.
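The example policy in the IAM role prerequisite scopes most resources with a sagemaker* prefix, so it can help to check which resource names a pattern covers. The following sketch approximates that wildcard matching with Python's fnmatch module. It is an illustration only: IAM policy evaluation involves more rules than a glob match, fnmatch also gives square brackets a special meaning that IAM does not, and the Region, account ID, and connection names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical values substituted for <region> and <account>.
REGION, ACCOUNT = "us-east-1", "123456789012"

# Two resource patterns taken from the GlueDataAccess statement.
GLUE_PATTERNS = [
    f"arn:aws:glue:{REGION}:{ACCOUNT}:connection/sagemaker*",
    f"arn:aws:glue:{REGION}:{ACCOUNT}:catalog",
]


def is_covered(arn, patterns):
    """Return True when the ARN matches at least one wildcard pattern."""
    return any(fnmatchcase(arn, pattern) for pattern in patterns)


covered = is_covered(
    f"arn:aws:glue:{REGION}:{ACCOUNT}:connection/sagemaker-demo", GLUE_PATTERNS
)
not_covered = is_covered(
    f"arn:aws:glue:{REGION}:{ACCOUNT}:connection/analytics-demo", GLUE_PATTERNS
)
print(covered, not_covered)
```

If you name resources without the sagemaker prefix, adjust the Resource entries in your own policy accordingly.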

Create database connections

The built-in SQL browsing and execution capabilities of SageMaker Studio are enhanced by AWS Glue connections. An AWS Glue connection is an AWS Glue Data Catalog object that stores essential data such as login credentials, URI strings, and virtual private cloud (VPC) information for specific data stores. These connections are used by AWS Glue crawlers, jobs, and development endpoints to access various types of data stores. You can use these connections for both source and target data, and even reuse the same connection across multiple crawlers or extract, transform, and load (ETL) jobs.

To explore SQL data sources in the left pane of SageMaker Studio, you first need to create AWS Glue connection objects. These connections facilitate access to different data sources and allow you to explore their schematic data elements.

In the following sections, we walk through the process of creating SQL-specific AWS Glue connectors. This will enable you to access, view, and explore datasets across a variety of data stores. For more detailed information about AWS Glue connections, refer to Connecting to data.

Create an AWS Glue connection

The only way to bring data sources into SageMaker Studio is with AWS Glue connections. You need to create AWS Glue connections with specific connection types. As of this writing, the only supported mechanism of creating these connections is using the AWS Command Line Interface (AWS CLI).

Connection definition JSON file

When connecting to different data sources in AWS Glue, you must first create a JSON file that defines the connection properties—referred to as the connection definition file. This file is crucial for establishing an AWS Glue connection and should detail all the necessary configurations for accessing the data source. For security best practices, it’s recommended to use Secrets Manager to securely store sensitive information such as passwords. Meanwhile, other connection properties can be managed directly through AWS Glue connections. This approach makes sure that sensitive credentials are protected while still making the connection configuration accessible and manageable.

The following is an example of a connection definition JSON:

{
    "ConnectionInput": {
        "Name": <GLUE_CONNECTION_NAME>,
        "Description": <GLUE_CONNECTION_DESCRIPTION>,
        "ConnectionType": "REDSHIFT | SNOWFLAKE | ATHENA",
        "ConnectionProperties": {
            "PythonProperties": "{\"aws_secret_arn\": <SECRET_ARN>, \"database\": <...>}"
        }
    }
}

When setting up AWS Glue connections for your data sources, there are a few important guidelines to follow to provide both functionality and security:

  • Stringification of properties – Within the PythonProperties key, make sure all properties are stringified key-value pairs. It’s crucial to properly escape double quotes with the backslash (\) character where necessary. This helps maintain the correct format and avoid syntax errors in your JSON.
  • Handling sensitive information – Although it’s possible to include all connection properties within PythonProperties, it is advisable not to include sensitive details like passwords directly in these properties. Instead, use Secrets Manager for handling sensitive information. This approach secures your sensitive data by storing it in a controlled and encrypted environment, away from the main configuration files.
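One way to avoid escaping mistakes in PythonProperties is to let a JSON library do the stringification. The following sketch builds a connection definition with Python's json module, where the inner json.dumps call produces the escaped string. The `connection_definition` helper, the connection name, the account ID, and the file name are assumptions for illustration.

```python
import json


def connection_definition(name, description, connection_type, secret_arn, database):
    """Build connection definition JSON with PythonProperties as a JSON string."""
    # The inner dumps call yields a string; the outer call escapes its quotes.
    python_properties = json.dumps({"aws_secret_arn": secret_arn, "database": database})
    definition = {
        "ConnectionInput": {
            "Name": name,
            "Description": description,
            "ConnectionType": connection_type,
            "ConnectionProperties": {"PythonProperties": python_properties},
        }
    }
    return json.dumps(definition, indent=4)


# Hypothetical values; replace them with your own before use.
definition_text = connection_definition(
    name="sagemaker-demo-connection",
    description="Demo connection",
    connection_type="SNOWFLAKE",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:demo-secret",
    database="DEMO_DB",
)
print(definition_text)

# To save the definition for the AWS CLI, you could write it to a file:
# with open("definition.json", "w") as handle:
#     handle.write(definition_text)
```

The printed file contains the inner quotes as \" sequences, which is the form the connection definition requires.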

Create an AWS Glue connection using the AWS CLI

After you include all the necessary fields in your connection definition JSON file, you’re ready to establish an AWS Glue connection for your data source using the AWS CLI and the following command:

aws --region <REGION> glue create-connection \
--cli-input-json file:///path/to/file/connection/definition/file.json

This command initiates a new AWS Glue connection based on the specifications detailed in your JSON file. The following is a quick breakdown of the command components:

  • --region <REGION> – This specifies the AWS Region where your AWS Glue connection will be created. It is crucial to select the Region where your data sources and other services are located to minimize latency and comply with data residency requirements.
  • --cli-input-json file:///path/to/file/connection/definition/file.json – This parameter directs the AWS CLI to read the input configuration from a local file that contains your connection definition in JSON format.

You should be able to create AWS Glue connections with the preceding AWS CLI command from your Studio JupyterLab terminal. On the File menu, choose New and Terminal.

If the create-connection command runs successfully, you should see your data source listed in the SQL browser pane. If you don’t see your data source listed, choose Refresh to update the cache.

Create a Snowflake connection

In this section, we focus on integrating a Snowflake data source with SageMaker Studio. Creating Snowflake accounts, databases, and warehouses falls outside the scope of this post. To get started with Snowflake, refer to the Snowflake user guide. In this post, we concentrate on creating a Snowflake definition JSON file and establishing a Snowflake data source connection using AWS Glue.

Create a Secrets Manager secret

You can connect to your Snowflake account by either using a user ID and password or using private keys. To connect with a user ID and password, you need to securely store your credentials in Secrets Manager. As mentioned previously, although it’s possible to embed this information under PythonProperties, it is not recommended to store sensitive information in plain text format. Always make sure that sensitive data is handled securely to avoid potential security risks.

To store information in Secrets Manager, complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, choose Other type of secret.
  3. For the key-value pair, choose Plaintext and enter the following:
    {
        "user":"TestUser",
        "password":"MyTestPassword",
        "account":"AWSSAGEMAKERTEST"
    }

  4. Enter a name for your secret, such as sm-sql-snowflake-secret.
  5. Leave the other settings as default or customize if required.
  6. Create the secret.
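If you prefer to generate the plaintext value for step 3 instead of typing it, the following sketch produces the same three-key JSON and rejects empty fields. The `snowflake_secret_string` helper is our own name, and the sample values are the placeholders from the step above, not real credentials.

```python
import json

REQUIRED_KEYS = ("user", "password", "account")


def snowflake_secret_string(user, password, account):
    """Return the secret's plaintext JSON, refusing empty fields."""
    payload = {"user": user, "password": password, "account": account}
    missing = [key for key in REQUIRED_KEYS if not payload[key]]
    if missing:
        raise ValueError(f"missing secret fields: {', '.join(missing)}")
    return json.dumps(payload, indent=4)


secret_string = snowflake_secret_string("TestUser", "MyTestPassword", "AWSSAGEMAKERTEST")
print(secret_string)
```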

Create an AWS Glue connection for Snowflake

As discussed earlier, AWS Glue connections are essential for accessing any data source from SageMaker Studio. You can find a list of all supported connection properties for Snowflake. The following is a sample connection definition JSON for Snowflake. Replace the placeholder values with the appropriate values before saving it to disk:

{
    "ConnectionInput": {
        "Name": "Snowflake-Airlines-Dataset",
        "Description": "SageMaker-Snowflake Airlines Dataset",
        "ConnectionType": "SNOWFLAKE",
        "ConnectionProperties": {
            "PythonProperties": "{\"aws_secret_arn\": \"arn:aws:secretsmanager:<region>:<account>:secret:sm-sql-snowflake-secret\", \"database\": \"SAGEMAKERDEMODATABASE1\"}"
        }
    }
}

To create an AWS Glue connection object for the Snowflake data source, use the following command:

aws --region <REGION> glue create-connection \
--cli-input-json file:///path/to/file/snowflake/definition/file.json

This command creates a new Snowflake data source connection that you can browse in the SQL browser pane and query from your JupyterLab notebook cells.

Create an Amazon Redshift connection

Amazon Redshift is a fully managed, petabyte-scale data warehouse service that simplifies and reduces the cost of analyzing all your data using standard SQL. The procedure for creating an Amazon Redshift connection closely mirrors that for a Snowflake connection.

Create a Secrets Manager secret

Similar to the Snowflake setup, to connect to Amazon Redshift using a user ID and password, you need to securely store the secrets information in Secrets Manager. Complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, choose Credentials for Amazon Redshift cluster.
  3. Enter the credentials used to log in to access Amazon Redshift as a data source.
  4. Choose the Redshift cluster associated with the secrets.
  5. Enter a name for the secret, such as sm-sql-redshift-secret.
  6. Leave the other settings as default or customize if required.
  7. Create the secret.

By following these steps, you make sure your connection credentials are handled securely, using the robust security features of AWS to manage sensitive data effectively.

Create an AWS Glue connection for Amazon Redshift

To set up a connection with Amazon Redshift using a JSON definition, fill in the necessary fields and save the following JSON configuration to disk:

{
    "ConnectionInput": {
        "Name": "Redshift-US-Housing-Dataset",
        "Description": "sagemaker redshift us housing dataset connection",
        "ConnectionType": "REDSHIFT",
        "ConnectionProperties": {
            "PythonProperties": "{\"aws_secret_arn\": \"arn:aws:secretsmanager:<region>:<account>:secret:sm-sql-redshift-secret\", \"database\": \"us-housing-database\"}"
        }
    }
}

To create an AWS Glue connection object for the Redshift data source, use the following AWS CLI command:

aws --region <REGION> glue create-connection \
    --cli-input-json file:///path/to/file/redshift/definition/file.json

This command creates a connection in AWS Glue linked to your Redshift data source. If the command runs successfully, you will be able to see your Redshift data source within the SageMaker Studio JupyterLab notebook, ready for running SQL queries and performing data analysis.

Create an Athena connection

Athena is a fully managed SQL query service from AWS that enables analysis of data stored in Amazon S3 using standard SQL. To set up an Athena connection as a data source in the JupyterLab notebook’s SQL browser, you need to create an Athena sample connection definition JSON. The following JSON structure configures the necessary details to connect to Athena, specifying the data catalog, the S3 staging directory, and the Region:

{
    "ConnectionInput": {
        "Name": "Athena-Credit-Card-Fraud",
        "Description": "SageMaker-Athena Credit Card Fraud",
        "ConnectionType": "ATHENA",
        "ConnectionProperties": {
            "PythonProperties": "{\"catalog_name\": \"AwsDataCatalog\", \"s3_staging_dir\": \"s3://sagemaker-us-east-2-123456789/athena-data-source/credit-card-fraud/\", \"region_name\": \"us-east-2\"}"
        }
    }
}

To create an AWS Glue connection object for the Athena data source, use the following AWS CLI command:

aws --region <REGION> glue create-connection \
    --cli-input-json file:///path/to/file/athena/definition/file.json

If the command is successful, you can access the Athena data catalog and its tables directly from the SQL browser within your SageMaker Studio JupyterLab notebook.

Query data from multiple sources

If you have multiple data sources integrated into SageMaker Studio through the built-in SQL browser and the notebook SQL feature, you can quickly run queries and effortlessly switch between data source backends in subsequent cells within a notebook. This capability allows for seamless transitions between different databases or data sources during your analysis workflow.

You can run queries against a diverse collection of data source backends and bring the results directly into the Python space for further analysis or visualization. This is facilitated by the %%sm_sql magic command available in SageMaker Studio notebooks. To output the results of your SQL query into a pandas DataFrame, there are two options:

  • From your notebook cell toolbar, choose the output type DataFrame and name your DataFrame variable
  • Append the following parameter to your %%sm_sql command:
    --output '{"format": "DATAFRAME", "dataframe_name": "df"}'

The following diagram illustrates this workflow and showcases how you can effortlessly run queries across various sources in subsequent notebook cells, as well as train a SageMaker model using training jobs or directly within the notebook using local compute. Additionally, the diagram highlights how the built-in SQL integration of SageMaker Studio simplifies the processes of extraction and building directly within the familiar environment of a JupyterLab notebook cell.

Text to SQL: Using natural language to enhance query authoring

SQL is a complex language that requires an understanding of databases, tables, syntaxes, and metadata. Today, generative artificial intelligence (AI) can enable you to write complex SQL queries without requiring in-depth SQL experience. The advancement of LLMs has significantly impacted natural language processing (NLP)-based SQL generation, allowing for the creation of precise SQL queries from natural language descriptions—a technique referred to as Text-to-SQL. However, it is essential to acknowledge the inherent differences between human language and SQL. Human language can sometimes be ambiguous or imprecise, whereas SQL is structured, explicit, and unambiguous. Bridging this gap and accurately converting natural language into SQL queries can present a formidable challenge. When provided with appropriate prompts, LLMs can help bridge this gap by understanding the intent behind the human language and generating accurate SQL queries accordingly.

With the release of the SageMaker Studio in-notebook SQL query feature, SageMaker Studio makes it straightforward to inspect databases and schemas, and author, run, and debug SQL queries without ever leaving the Jupyter notebook IDE. This section explores how the Text-to-SQL capabilities of advanced LLMs can facilitate the generation of SQL queries using natural language within Jupyter notebooks. We employ the cutting-edge Text-to-SQL model defog/sqlcoder-7b-2 in conjunction with Jupyter AI, a generative AI assistant specifically designed for Jupyter notebooks, to create complex SQL queries from natural language. By using this advanced model, we can effortlessly and efficiently create complex SQL queries using natural language, thereby enhancing our SQL experience within notebooks.

Notebook prototyping using the Hugging Face Hub

To begin prototyping, you need the following:

  • GitHub code – The code presented in this section is available in the following GitHub repo and by referencing the example notebook.
  • JupyterLab Space – Access to a SageMaker Studio JupyterLab Space backed by GPU-based instances is essential. For the defog/sqlcoder-7b-2 model, a 7B parameter model, using an ml.g5.2xlarge instance is recommended. Alternatives such as defog/sqlcoder-70b-alpha or defog/sqlcoder-34b-alpha are also viable for natural language to SQL conversion, but larger instance types may be required for prototyping. Make sure you have the quota to launch a GPU-backed instance by navigating to the Service Quotas console, searching for SageMaker, and searching for Studio JupyterLab Apps running on <instance type>.

Launch a new GPU-backed JupyterLab Space from your SageMaker Studio. It’s recommended to create a new JupyterLab Space with at least 75 GB of Amazon Elastic Block Store (Amazon EBS) storage for a 7B parameter model.

  • Hugging Face Hub – If your SageMaker Studio domain has access to download models from the Hugging Face Hub, you can use the AutoModelForCausalLM class from huggingface/transformers to automatically download models and pin them to your local GPUs. The model weights will be stored in your local machine’s cache. See the following code:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "defog/sqlcoder-7b-2" # or use "defog/sqlcoder-34b-alpha", "defog/sqlcoder-70b-alpha"
    
    # download model and tokenizer in fp16 and pin model to local notebook GPUs
    model = AutoModelForCausalLM.from_pretrained(
        model_id, 
        device_map="auto",
        torch_dtype=torch.float16
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

After the model has been fully downloaded and loaded into memory, you should observe an increase in GPU utilization on your local machine. This indicates that the model is actively using the GPU resources for computational tasks. You can verify this in your own JupyterLab space by running nvidia-smi (for a one-time display) or nvidia-smi --loop=1 (to repeat every second) from your JupyterLab terminal.

Text-to-SQL models excel at understanding the intent and context of a user’s request, even when the language used is conversational or ambiguous. The process involves translating natural language inputs into the correct database schema elements, such as table names, column names, and conditions. However, an off-the-shelf Text-to-SQL model will not inherently know the structure of your data warehouse, the specific database schemas, or be able to accurately interpret the content of a table based solely on column names. To effectively use these models to generate practical and efficient SQL queries from natural language, it is necessary to adapt the SQL text-generation model to your specific warehouse database schema. This adaptation is facilitated through the use of LLM prompts. The following is a recommended prompt template for the defog/sqlcoder-7b-2 Text-to-SQL model, divided into four parts:

  • Task – This section should specify a high-level task to be accomplished by the model. It should include the type of database backend (such as Amazon RDS, PostgreSQL, or Amazon Redshift) to make the model aware of any nuanced syntactical differences that may affect the generation of the final SQL query.
  • Instructions – This section should define task boundaries and domain awareness for the model, and may include few-shot examples to guide the model in generating finely tuned SQL queries.
  • Database Schema – This section should detail your warehouse database schemas, outlining the relationships between tables and columns to aid the model in understanding the database structure.
  • Answer – This section is reserved for the model to output the SQL query response to the natural language input.

An example of the database schema and prompt used in this section is available in the GitHub Repo.

### Task
Generate a SQL query to answer [QUESTION]{user_question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
{table_metadata_string_DDL_statements}

### Answer
Given the database schema, here is the SQL query that 
 [QUESTION]
    {user_question}
 [/QUESTION]

[SQL]
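
To make the template concrete, the following is a minimal sketch of filling it in with Python's str.format. The template text mirrors the one above (with the Answer section condensed onto one line). The build_prompt helper, the flights table, and the sample question are hypothetical and are not taken from the GitHub repo:

```python
PROMPT_TEMPLATE = """### Task
Generate a SQL query to answer [QUESTION]{user_question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'

### Database Schema
The query will run on a database with the following schema:
{table_metadata_string_DDL_statements}

### Answer
Given the database schema, here is the SQL query that [QUESTION]{user_question}[/QUESTION]

[SQL]
"""

# Hypothetical schema used only for illustration.
EXAMPLE_DDL = """CREATE TABLE flights (
    flight_id INTEGER PRIMARY KEY,
    origin VARCHAR(3),
    dest VARCHAR(3),
    dep_delay_minutes INTEGER
);"""


def build_prompt(user_question: str, ddl_statements: str) -> str:
    """Substitute the question and schema into both template placeholders."""
    return PROMPT_TEMPLATE.format(
        user_question=user_question,
        table_metadata_string_DDL_statements=ddl_statements,
    )


example_prompt = build_prompt("How many flights were delayed by more than an hour?", EXAMPLE_DDL)
print(example_prompt)
```

Keeping the schema as a separate argument makes it possible to reuse one template across several databases by swapping only the DDL text.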

Prompt engineering is not just about forming questions or statements; it’s a nuanced art and science that significantly impacts the quality of interactions with an AI model. The way you craft a prompt can profoundly influence the nature and usefulness of the AI’s response. This skill is pivotal in maximizing the potential of AI interactions, especially in complex tasks requiring specialized understanding and detailed responses.

It’s important to have the option to quickly build and test a model’s response for a given prompt and optimize the prompt based on the response. JupyterLab notebooks provide the ability to receive instant model feedback from a model running on local compute and optimize the prompt and tune a model’s response further or change a model entirely. In this post, we use a SageMaker Studio JupyterLab notebook backed by ml.g5.2xlarge’s NVIDIA A10G 24 GB GPU to run Text-to-SQL model inference on the notebook and interactively build our model prompt until the model’s response is sufficiently tuned to provide responses that are directly executable in JupyterLab’s SQL cells. To run model inference and simultaneously stream model responses, we use a combination of model.generate and TextIteratorStreamer as defined in the following code:

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    tokenizer=tokenizer, 
    timeout=240.0, 
    skip_prompt=True, 
    skip_special_tokens=True
)


def llm_generate_query(user_question):
    """ Generate text-gen SQL responses"""
    
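    # prompt is the template shown earlier; its schema placeholder is assumed to be filled in already
    updated_prompt = prompt.format(user_question=user_question)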
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    
    return model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=1024,
        temperature=0.1,
        do_sample=False,
        num_beams=1, 
        streamer=streamer,
    )

The model’s output can be decorated with SageMaker SQL magic %%sm_sql ..., which allows the JupyterLab notebook to identify the cell as a SQL cell.
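
As an illustration of this decoration step, the following sketch prefixes a generated query with the %%sm_sql magic line. It uses the same --metastore-id and --metastore-type GLUE_CONNECTION flags that appear in the serving code later in this post. The function name to_sm_sql_cell and the sample query are hypothetical:

```python
def to_sm_sql_cell(sql: str, connection_name: str) -> str:
    """Prefix generated SQL with the %%sm_sql magic so the cell runs as SQL."""
    magic = (
        f"%%sm_sql --metastore-id {connection_name} "
        "--metastore-type GLUE_CONNECTION"
    )
    # A blank line separates the magic line from the query body.
    return f"{magic}\n\n{sql.strip()}\n"


cell_text = to_sm_sql_cell("SELECT COUNT(*) FROM flights;", "Airlines_Dataset")
print(cell_text)
```

The returned text can be pasted into a new notebook cell and run directly against the named AWS Glue connection.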

Host Text-to-SQL models as SageMaker endpoints

At the end of the prototyping stage, we have selected our preferred Text-to-SQL LLM, an effective prompt format, and an appropriate instance type for hosting the model (either single-GPU or multi-GPU). SageMaker facilitates the scalable hosting of custom models through the use of SageMaker endpoints. These endpoints can be defined according to specific criteria, allowing for the deployment of LLMs as endpoints. This capability enables you to scale the solution to a wider audience, allowing users to generate SQL queries from natural language inputs using custom hosted LLMs. The following diagram illustrates this architecture.

To host your LLM as a SageMaker endpoint, you generate several artifacts.

The first artifact is model weights. SageMaker Deep Java Library (DJL) Serving containers allow you to set up configurations through a meta serving.properties file, which enables you to direct how models are sourced—either directly from the Hugging Face Hub or by downloading model artifacts from Amazon S3. If you specify model_id=defog/sqlcoder-7b-2, DJL Serving will attempt to directly download this model from the Hugging Face Hub. However, you may incur networking ingress/egress charges each time the endpoint is deployed or elastically scaled. To avoid these charges and potentially speed up the download of model artifacts, it is recommended to skip using model_id in serving.properties and save model weights as S3 artifacts and only specify them with s3url=s3://path/to/model/bin.

Saving a model (with its tokenizer) to disk and uploading it to Amazon S3 can be accomplished with just a few lines of code:

import sagemaker

# save model and tokenizer to local disk
model.save_pretrained(local_model_path)
tokenizer.save_pretrained(local_model_path)
...
...
...
# upload file to s3
s3_bucket_name = "<my llm artifact bucket name>"
# s3 prefix to save model weights and tokenizer defs
model_s3_prefix = "sqlcoder-7b-instruct/weights"
# s3 prefix to store the meta-model artifacts
meta_model_s3_prefix = "sqlcoder-7b-instruct/meta-model"

sagemaker.s3.S3Uploader.upload(local_model_path, f"s3://{s3_bucket_name}/{model_s3_prefix}")

You also use a database prompt file. In this setup, the database prompt is composed of Task, Instructions, Database Schema, and Answer sections. For the current architecture, we allocate a separate prompt file for each database schema. However, there is flexibility to expand this setup to include multiple databases per prompt file, allowing the model to run composite joins across databases on the same server. During our prototyping stage, we save the database prompt as a text file named <Database-Glue-Connection-Name>.prompt, where Database-Glue-Connection-Name corresponds to the connection name visible in your JupyterLab environment. For instance, this post refers to a Snowflake connection named Airlines_Dataset, so the database prompt file is named Airlines_Dataset.prompt. This file is then stored on Amazon S3 and subsequently read and cached by our model serving logic.

Moreover, this architecture permits any authorized users of this endpoint to define, store, and generate natural language to SQL queries without the need for multiple redeployments of the model. We use the following example of a database prompt to demonstrate the Text-to-SQL functionality.

Next, you generate custom model service logic. In this section, you outline a custom inference logic named model.py. This script is designed to optimize the performance and integration of our Text-to-SQL services:

  • Define the database prompt file caching logic – To minimize latency, we implement a custom logic for downloading and caching database prompt files. This mechanism makes sure that prompts are readily available, reducing the overhead associated with frequent downloads.
  • Define custom model inference logic – To enhance inference speed, our Text-to-SQL model is loaded in 16-bit precision (bfloat16 in the following code) and then converted into a DeepSpeed model. This step allows for more efficient computation. Additionally, within this logic, you specify which parameters users can adjust during inference calls to tailor the functionality according to their needs.
  • Define custom input and output logic – Establishing clear and customized input/output formats is essential for smooth integration with downstream applications. One such application is Jupyter AI, which we discuss in the subsequent section.

%%writefile {meta_model_filename}/model.py
...

predictor = None
prompt_for_db_dict_cache = {}

def download_prompt_from_s3(prompt_filename):

    print(f"downloading prompt file: {prompt_filename}")
    s3 = boto3.resource('s3')
    ...


def get_model(properties):
    
    ...
    print(f"Loading model from {cwd}")
    model = AutoModelForCausalLM.from_pretrained(
        cwd, 
        low_cpu_mem_usage=True, 
        torch_dtype=torch.bfloat16
    )
    model = deepspeed.init_inference(
        model, 
        mp_size=properties["tensor_parallel_degree"]
    )
    
    ...


def handle(inputs: Input) -> Output:

    ...

    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    ...
    result = f"""%%sm_sql --metastore-id {prompt_for_db_key.split('.')[0]} --metastore-type GLUE_CONNECTION\n\n{result}\n"""
    result = [{'generated_text': result}]
    
    return Output().add(result)
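
The prompt caching idea in the preceding script can be sketched in isolation as follows. This is not the code from model.py. The PromptCache class is hypothetical, and the fake_fetch function stands in for the Amazon S3 download so that the example runs without AWS access:

```python
from typing import Callable, Dict


class PromptCache:
    """Cache database prompt files in memory, keyed by file name."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, str] = {}

    def get(self, prompt_filename: str) -> str:
        # Download only on the first request; later requests hit the cache.
        if prompt_filename not in self._cache:
            self._cache[prompt_filename] = self._fetch(prompt_filename)
        return self._cache[prompt_filename]


download_calls = []


def fake_fetch(prompt_filename: str) -> str:
    """Stand-in for an S3 download; records each call for inspection."""
    download_calls.append(prompt_filename)
    return f"### Task (contents of {prompt_filename})"


cache = PromptCache(fake_fetch)
first = cache.get("Airlines_Dataset.prompt")
second = cache.get("Airlines_Dataset.prompt")
print(len(download_calls))
```

In a real handler, the fetch function would read the prompt file from Amazon S3. A cache like this never refreshes, so an updated prompt file would only be picked up after the endpoint restarts or if you add an explicit invalidation step.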

Additionally, we include a serving.properties file, which acts as a global configuration file for models hosted using DJL serving. For more information, refer to Configurations and settings.
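
The following is a rough sketch of generating such a file from Python. The engine value and the option. prefix on the keys are assumptions based on common DJL Serving conventions, and render_serving_properties is a hypothetical helper, so verify the exact keys against Configurations and settings before deploying:

```python
def render_serving_properties(s3_weights_url: str, tensor_parallel_degree: int = 1) -> str:
    """Render a minimal serving.properties that loads weights from Amazon S3.

    Key names are assumed from DJL Serving conventions; model_id is
    deliberately omitted so weights are not pulled from the Hugging Face Hub.
    """
    lines = [
        "engine=DeepSpeed",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
        f"option.s3url={s3_weights_url}",
    ]
    return "\n".join(lines) + "\n"


properties_text = render_serving_properties(
    # Placeholder bucket name; the prefix matches model_s3_prefix used earlier.
    "s3://<my llm artifact bucket name>/sqlcoder-7b-instruct/weights/",
    tensor_parallel_degree=1,
)
print(properties_text)
```

The tensor parallel degree set here is the value that get_model reads from properties["tensor_parallel_degree"] in model.py.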

Lastly, you can also include a requirements.txt file to define additional modules required for inference and package everything into a tarball for deployment.

See the following code:

os.system(f"tar czvf {meta_model_filename}.tar.gz ./{meta_model_filename}/")

>>>./deepspeed-djl-serving-7b/
>>>./deepspeed-djl-serving-7b/serving.properties
>>>./deepspeed-djl-serving-7b/model.py
>>>./deepspeed-djl-serving-7b/requirements.txt

Integrate your endpoint with the SageMaker Studio Jupyter AI assistant

Jupyter AI is an open source tool that brings generative AI to Jupyter notebooks, offering a robust and user-friendly platform for exploring generative AI models. It enhances productivity in JupyterLab and Jupyter notebooks by providing features like the %%ai magic for creating a generative AI playground inside notebooks, a native chat UI in JupyterLab for interacting with AI as a conversational assistant, and support for a wide array of LLMs from providers like Amazon Titan, AI21, Anthropic, Cohere, and Hugging Face or managed services like Amazon Bedrock and SageMaker endpoints. For this post, we use Jupyter AI’s out-of-the-box integration with SageMaker endpoints to bring the Text-to-SQL capability into JupyterLab notebooks. The Jupyter AI tool comes pre-installed in all SageMaker Studio JupyterLab Spaces backed by SageMaker Distribution images; end-users are not required to make any additional configurations to start using the Jupyter AI extension to integrate with a SageMaker hosted endpoint. In this section, we discuss the two ways to use the integrated Jupyter AI tool.

Jupyter AI inside a notebook using magics

Jupyter AI’s %%ai magic command allows you to transform your SageMaker Studio JupyterLab notebooks into a reproducible generative AI environment. To begin using AI magics, make sure you have loaded the jupyter_ai_magics extension to use %%ai magic, and additionally load amazon_sagemaker_sql_magic to use %%sm_sql magic:

# load sm_sql magic extension and ai magic extension
%load_ext jupyter_ai_magics
%load_ext amazon_sagemaker_sql_magic

To run a call to your SageMaker endpoint from your notebook using the %%ai magic command, provide the following parameters and structure the command as follows:

  • --region-name – Specify the Region where your endpoint is deployed. This makes sure that the request is routed to the correct geographic location.
  • --request-schema – Include the schema of the input data. This schema outlines the expected format and types of the input data that your model needs to process the request.
  • --response-path – Define the path within the response object where the output of your model is located. This path is used to extract the relevant data from the response returned by your model.
  • -f (optional) – This is an output formatter flag that indicates the type of output returned by the model. In the context of a Jupyter notebook, if the output is code, this flag should be set accordingly to format the output as executable code at the top of a Jupyter notebook cell, followed by a free text input area for user interaction.

For example, the command in a Jupyter notebook cell might look like the following code:

%%ai sagemaker-endpoint:<endpoint-name> --region-name=us-east-1 
--request-schema={
    "inputs":"<prompt>", 
    "parameters":{
        "temperature":0.1,
        "top_p":0.2,
        "max_new_tokens":1024,
        "return_full_text":false
    }, 
    "db_prompt":"Airlines_Dataset.prompt"
  } 
--response-path=[0].generated_text -f code

My natural language query goes here...
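
To clarify how the request schema and response path fit together, the following sketch mimics the substitution and extraction in plain Python. It assumes that the <prompt> placeholder is replaced with the cell text and that the response path [0].generated_text is applied to the parsed JSON response. The helper names and the sample response are hypothetical, and no endpoint is called:

```python
import json

REQUEST_SCHEMA = {
    "inputs": "<prompt>",
    "parameters": {
        "temperature": 0.1,
        "top_p": 0.2,
        "max_new_tokens": 1024,
        "return_full_text": False,
    },
    "db_prompt": "Airlines_Dataset.prompt",
}


def build_request_body(question: str) -> str:
    """Replace the <prompt> placeholder with the natural language question."""
    body = dict(REQUEST_SCHEMA, inputs=question)
    return json.dumps(body)


def extract_generated_text(response_body: str) -> str:
    """Apply the response path [0].generated_text to a JSON response."""
    return json.loads(response_body)[0]["generated_text"]


request_body = build_request_body("How many flights departed late?")

# Hypothetical response in the shape returned by the handle function.
sample_response = json.dumps(
    [{"generated_text": "%%sm_sql --metastore-id Airlines_Dataset "
                        "--metastore-type GLUE_CONNECTION\n\nSELECT COUNT(*) FROM flights;\n"}]
)
generated = extract_generated_text(sample_response)
print(generated)
```

The list-of-dictionaries response shape matches what the custom handle function returns, which is why the response path starts with [0].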

Jupyter AI chat window

Alternatively, you can interact with SageMaker endpoints through a built-in user interface, simplifying the process of generating queries or engaging in dialogue. Before beginning to chat with your SageMaker endpoint, configure the relevant settings in Jupyter AI for the SageMaker endpoint, as shown in the following screenshot.

Conclusion

SageMaker Studio now simplifies and streamlines the data scientist workflow by integrating SQL support into JupyterLab notebooks. This allows data scientists to focus on their tasks without the need to manage multiple tools. Furthermore, the new built-in SQL integration in SageMaker Studio enables data personas to effortlessly generate SQL queries using natural language text as input, thereby accelerating their workflow.

We encourage you to explore these features in SageMaker Studio. For more information, refer to Prepare data with SQL in Studio.

Appendix

Enable the SQL browser and notebook SQL cell in custom environments

If you’re not using a SageMaker Distribution image, or you’re using Distribution image version 1.5 or earlier, run the following commands to enable the SQL browsing feature inside your JupyterLab environment:

npm install -g vscode-jsonrpc
npm install -g sql-language-server
pip install amazon-sagemaker-sql-execution==0.1.0
pip install amazon-sagemaker-sql-editor
restart-jupyter-server

Relocate the SQL browser widget

JupyterLab widgets allow for relocation. Depending on your preference, you can move widgets to either side of the JupyterLab widgets pane. For example, to move the SQL widget to the opposite side of the sidebar (right to left), right-click the widget icon and choose Switch Sidebar Side.


About the authors

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Varun Shah is a Software Engineer working on Amazon SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions that simplify data processing and data preparation journeys. In his spare time, Varun enjoys outdoor activities including hiking and skiing, and is always up for discovering new, exciting places.

Sumedha Swamy is a Principal Product Manager at Amazon Web Services, where he leads the SageMaker Studio team in its mission to develop the IDE of choice for data science and machine learning. He has dedicated the past 15 years to building machine learning-based consumer and enterprise products.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.


Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries

There has been tremendous progress in the field of distributed deep learning for large language models (LLMs), especially after the release of ChatGPT in November 2022. LLMs continue to grow in size with billions or even trillions of parameters, and they often won’t fit into a single accelerator device such as a GPU or even a single node such as ml.p5.32xlarge because of memory limitations. Customers training LLMs often must distribute their workload across hundreds or even thousands of GPUs. Enabling training at such scale remains a challenge in distributed training, and training efficiently in such a large system is another equally important problem. Over the past years, the distributed training community has introduced 3D parallelism (data parallelism, pipeline parallelism, and tensor parallelism) and other techniques (such as sequence parallelism and expert parallelism) to address such challenges.

In December 2023, Amazon announced the release of the SageMaker model parallel library 2.0 (SMP), which achieves state-of-the-art efficiency in large model training, together with the SageMaker distributed data parallelism library (SMDDP). This release is a significant update from 1.x: SMP is now integrated with open source PyTorch Fully Sharded Data Parallel (FSDP) APIs, which allows you to use a familiar interface when training large models, and is compatible with Transformer Engine (TE), unlocking tensor parallelism techniques alongside FSDP for the first time. To learn more about the release, refer to Amazon SageMaker model parallel library now accelerates PyTorch FSDP workloads by up to 20%.

In this post, we explore the performance benefits of Amazon SageMaker (including SMP and SMDDP), and how you can use these libraries to train large models efficiently on SageMaker. We demonstrate the performance of SageMaker with benchmarks on ml.p4d.24xlarge clusters of up to 128 instances, using FSDP mixed precision with bfloat16 for the Llama 2 model. We start with a demonstration of near-linear scaling efficiencies for SageMaker, then analyze the contribution of each feature to optimal throughput, and end with efficient training at various sequence lengths up to 32,768 through tensor parallelism.

Near-linear scaling with SageMaker

To reduce the overall training time for LLMs, it is crucial to preserve high throughput when scaling to large clusters (thousands of GPUs), despite the inter-node communication overhead. In this post, we demonstrate robust, near-linear scaling efficiencies (varying the number of GPUs for a fixed total problem size) on p4d instances using both SMP and SMDDP.

In this section, we demonstrate SMP’s near-linear scaling performance. Here we train Llama 2 models of various sizes (7B, 13B, and 70B parameters) using a fixed sequence length of 4,096, the SMDDP backend for collective communication, TE enabled, a global batch size of 4 million, with 16 to 128 p4d nodes. The following table summarizes our optimal configuration and training performance (model TFLOPs per second).

Model size   Number of nodes   TFLOPs*   sdp*   tp*   offload*   Scaling efficiency
7B           16                136.76    32     1     N          100.0%
7B           32                132.65    64     1     N          97.0%
7B           64                125.31    64     1     N          91.6%
7B           128               115.01    64     1     N          84.1%
13B          16                141.43    32     1     Y          100.0%
13B          32                139.46    256    1     N          98.6%
13B          64                132.17    128    1     N          93.5%
13B          128               120.75    128    1     N          85.4%
70B          32                154.33    256    1     Y          100.0%
70B          64                149.60    256    1     N          96.9%
70B          128               136.52    64     2     N          88.5%

*At the given model size, sequence length, and number of nodes, we show the globally optimal throughput and configurations after exploring various sdp, tp, and activation offloading combinations.

The preceding table summarizes the optimal throughput numbers subject to sharded data parallel (sdp) degree (typically using FSDP hybrid sharding instead of full sharding, with more details in the next section), tensor parallel (tp) degree, and activation offloading value changes, demonstrating a near-linear scaling for SMP together with SMDDP. For example, given the Llama 2 model size 7B and sequence length 4,096, overall it achieves scaling efficiencies of 97.0%, 91.6%, and 84.1% (relative to 16 nodes) at 32, 64, and 128 nodes, respectively. The scaling efficiencies are stable across different model sizes and increase slightly as the model size gets larger.
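
As a quick check on how these percentages are derived, the following sketch computes scaling efficiency as the ratio of throughput at a given cluster size to throughput at the smallest cluster size for that model. It assumes the TFLOPs column is a per-GPU figure, which is consistent with the reported values. The function name scaling_efficiency is illustrative:

```python
# TFLOPs for the Llama 2 7B model, keyed by number of nodes (from the table).
throughput_7b = {16: 136.76, 32: 132.65, 64: 125.31, 128: 115.01}


def scaling_efficiency(tflops_by_nodes):
    """Return efficiency (%) relative to the smallest cluster in the mapping."""
    baseline = tflops_by_nodes[min(tflops_by_nodes)]
    return {
        nodes: round(100.0 * tflops / baseline, 1)
        for nodes, tflops in sorted(tflops_by_nodes.items())
    }


efficiency_7b = scaling_efficiency(throughput_7b)
print(efficiency_7b)  # {16: 100.0, 32: 97.0, 64: 91.6, 128: 84.1}
```

For the 70B model, the baseline is the 32-node run, because that is the smallest cluster size reported for that model.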

SMP and SMDDP also demonstrate similar scaling efficiencies for other sequence lengths such as 2,048 and 8,192.

SageMaker model parallel library 2.0 performance: Llama 2 70B

Model sizes have continued to grow over the past years, along with frequent state-of-the-art performance updates in the LLM community. In this section, we illustrate performance in SageMaker for the Llama 2 model using a fixed model size 70B, sequence length of 4,096, and a global batch size of 4 million. To compare with the previous table’s globally optimal configuration and throughput (with SMDDP backend, typically FSDP hybrid sharding and TE), the following table extends to other optimal throughputs (potentially with tensor parallelism) with extra specifications on the distributed backend (NCCL and SMDDP), FSDP sharding strategies (full sharding and hybrid sharding), and enabling TE or not (default).

TFLOPs columns: #0 = NCCL full sharding; #1 = SMDDP full sharding; #2 = SMDDP hybrid sharding; #3 = SMDDP hybrid sharding with TE. The sdp*, tp*, and offload* columns give the configuration for TFLOPs #3. The last four columns give the TFLOPs improvement over baseline.

Model size   Number of nodes   TFLOPs #0   TFLOPs #1   TFLOPs #2   TFLOPs #3   sdp*   tp*   offload*   #0 → #1   #1 → #2   #2 → #3   #0 → #3
70B          32                150.82      149.90      150.05      154.33      256    1     Y          -0.6%     0.1%      2.9%      2.3%
70B          64                144.38      144.38      145.42      149.60      256    1     N          0.0%      0.7%      2.9%      3.6%
70B          128               68.53       103.06      130.66      136.52      64     2     N          50.4%     26.8%     4.5%      99.2%

*At the given model size, sequence length, and number of nodes, we show the globally optimal throughput and configuration after exploring various sdp, tp, and activation offloading combinations.

The latest release of SMP and SMDDP supports multiple features, including native PyTorch FSDP, extended and more flexible hybrid sharding, Transformer Engine integration, tensor parallelism, and an optimized all-gather collective operation. To better understand how SageMaker achieves efficient distributed training for LLMs, we explore incremental contributions from SMDDP and the following SMP core features:

  • SMDDP enhancement over NCCL with FSDP full sharding
  • Replacing FSDP full sharding with hybrid sharding, which reduces communication cost to improve throughput
  • A further boost to throughput with TE, even when tensor parallelism is disabled
  • At lower resource settings, activation offloading might be able to enable training that would otherwise be infeasible or very slow due to high memory pressure

FSDP full sharding: SMDDP enhancement over NCCL

As shown in the previous table, when models are fully sharded with FSDP, although NCCL (TFLOPs #0) and SMDDP (TFLOPs #1) throughputs are comparable at 32 or 64 nodes, there is a huge improvement of 50.4% from NCCL to SMDDP at 128 nodes.
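
The improvement columns are relative changes in throughput between two configurations. The following sketch recomputes them from the TFLOPs values in the previous table. The helper name improvement_pct is illustrative:

```python
# TFLOPs for Llama 2 70B, keyed by number of nodes.
# List order: #0 NCCL full, #1 SMDDP full, #2 SMDDP hybrid, #3 SMDDP hybrid with TE.
tflops_70b = {
    32: [150.82, 149.90, 150.05, 154.33],
    64: [144.38, 144.38, 145.42, 149.60],
    128: [68.53, 103.06, 130.66, 136.52],
}


def improvement_pct(before, after):
    """Relative throughput change from one configuration to another, in percent."""
    return round(100.0 * (after - before) / before, 1)


# NCCL full sharding (#0) to SMDDP full sharding (#1).
nccl_to_smddp = {n: improvement_pct(v[0], v[1]) for n, v in tflops_70b.items()}
# NCCL full sharding (#0) to SMDDP hybrid sharding with TE (#3).
end_to_end = {n: improvement_pct(v[0], v[3]) for n, v in tflops_70b.items()}

print(nccl_to_smddp)  # {32: -0.6, 64: 0.0, 128: 50.4}
print(end_to_end)     # {32: 2.3, 64: 3.6, 128: 99.2}
```

The step-by-step improvements compound rather than add, which is why the three individual gains at 128 nodes (50.4%, 26.8%, and 4.5%) combine to 99.2% overall.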

At smaller model sizes, we observe consistent and significant improvements with SMDDP over NCCL, starting at smaller cluster sizes, because SMDDP is able to mitigate the communication bottleneck effectively.

FSDP hybrid sharding to reduce communication cost

In SMP 1.0, we launched sharded data parallelism, a distributed training technique powered by Amazon in-house MiCS technology. In SMP 2.0, we introduce SMP hybrid sharding, an extensible and more flexible hybrid sharding technique that allows models to be sharded among a subset of GPUs, instead of all training GPUs, which is the case for FSDP full sharding. It’s useful for medium-sized models that don’t need to be sharded across the entire cluster in order to satisfy per-GPU memory constraints. This leads to clusters having more than one model replica and each GPU communicating with fewer peers at runtime.

SMP’s hybrid sharding enables efficient model sharding over a wider range, from the smallest shard degree with no out of memory issues up to the whole cluster size (which equates to full sharding).
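
To make the relationship between shard degree and model replicas concrete, here is a small sketch of the bookkeeping with tensor parallelism disabled (tp = 1). It assumes 8 GPUs per p4d node; the function names, the `min_degree` parameter, and the restriction to doubling shard degrees are illustrative choices of ours, not part of SMP.

```python
GPUS_PER_P4D_NODE = 8  # p4d.24xlarge instances have 8 GPUs each


def replicas_per_cluster(num_nodes: int, shard_degree: int) -> int:
    """Number of model replicas when each replica is sharded across
    `shard_degree` GPUs and tensor parallelism is disabled (tp = 1)."""
    world_size = num_nodes * GPUS_PER_P4D_NODE
    if shard_degree < 1 or world_size % shard_degree != 0:
        raise ValueError("shard degree must evenly divide the number of GPUs")
    return world_size // shard_degree


def candidate_shard_degrees(num_nodes: int, min_degree: int = GPUS_PER_P4D_NODE) -> list:
    """Shard degrees from `min_degree` (hypothetically, the smallest degree
    that avoids out-of-memory errors) up to the whole cluster, doubling each time."""
    world_size = num_nodes * GPUS_PER_P4D_NODE
    degrees = []
    degree = min_degree
    while degree <= world_size:
        if world_size % degree == 0:
            degrees.append(degree)
        degree *= 2
    return degrees


for degree in candidate_shard_degrees(128, min_degree=64):
    print(degree, replicas_per_cluster(128, degree))
```

The largest candidate equals the cluster size and yields a single replica, which is the full sharding case; smaller degrees yield more replicas, each communicating with fewer peers.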

The following figure illustrates the throughput dependence on sdp at tp = 1 for simplicity. Although it’s not necessarily the same as the optimal tp value for NCCL or SMDDP full sharding in the previous table, the numbers are quite close. It clearly validates the value of switching from full sharding to hybrid sharding at a large cluster size of 128 nodes, which is applicable to both NCCL and SMDDP. For smaller model sizes, significant improvements with hybrid sharding start at smaller cluster sizes, and the difference keeps increasing with cluster size.

Improvements with TE

TE is designed to accelerate LLM training on NVIDIA GPUs. Despite not using FP8 because it’s unsupported on p4d instances, we still see significant speedup with TE on p4d.

On top of MiCS trained with the SMDDP backend, TE introduces a consistent boost for throughput across all cluster sizes (the only exception is full sharding at 128 nodes), even when tensor parallelism is disabled (tensor parallel degree is 1).

For smaller model sizes or various sequence lengths, the TE boost is stable and non-trivial, in the range of approximately 3–7.6%.

Activation offloading at low resource settings

At low resource settings (given a small number of nodes), FSDP might experience a high memory pressure (or even out of memory in the worst case) when activation checkpointing is enabled. For such scenarios bottlenecked by memory, turning on activation offloading is potentially an option to improve performance.

For example, as we saw previously, although the Llama 2 at model size 13B and sequence length 4,096 is able to train optimally with at least 32 nodes with activation checkpointing and without activation offloading, it achieves the best throughput with activation offloading when limited to 16 nodes.

Enable training with long sequences: SMP tensor parallelism

Longer sequence lengths are desired for long conversations and context, and are getting more attention in the LLM community. Therefore, we report various long sequence throughputs in the following table. The table shows optimal throughputs for Llama 2 training on SageMaker, with various sequence lengths from 2,048 up to 32,768. At sequence length 32,768, native FSDP training is infeasible with 32 nodes at a global batch size of 4 million.

Model size | Sequence length | Number of nodes | TFLOPS: native FSDP and NCCL | TFLOPS: SMP and SMDDP | SMP improvement
7B | 2,048 | 32 | 129.25 | 138.17 | 6.9%
7B | 4,096 | 32 | 124.38 | 132.65 | 6.6%
7B | 8,192 | 32 | 115.25 | 123.11 | 6.8%
7B | 16,384 | 32 | 100.73 | 109.11 | 8.3%
7B | 32,768 | 32 | N.A. | 82.87 | -
13B | 2,048 | 32 | 137.75 | 144.28 | 4.7%
13B | 4,096 | 32 | 133.30 | 139.46 | 4.6%
13B | 8,192 | 32 | 125.04 | 130.08 | 4.0%
13B | 16,384 | 32 | 111.58 | 117.01 | 4.9%
13B | 32,768 | 32 | N.A. | 92.38 | -
Max SMP improvement: 8.3%
Median SMP improvement: 5.8%

When the cluster size is large and the global batch size is fixed, some model training might be infeasible with native PyTorch FSDP, which lacks built-in pipeline or tensor parallelism support. In the preceding table, given a global batch size of 4 million, 32 nodes, and sequence length 32,768, the effective batch size per GPU is 0.5 (for example, tp = 2 with batch size 1), which is infeasible without tensor parallelism.
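
The per-GPU batch size of 0.5 follows from simple arithmetic. The sketch below assumes that "4 million" means 4 × 1,024 × 1,024 tokens and that each p4d node has 8 GPUs; both assumptions are ours, chosen because they reproduce the 0.5 figure exactly. The function names are illustrative.

```python
import math

GPUS_PER_P4D_NODE = 8  # p4d.24xlarge instances have 8 GPUs each
GLOBAL_BATCH_TOKENS = 4 * 1024 * 1024  # assumed meaning of "4 million"


def sequences_per_gpu(global_batch_tokens: int, seq_len: int, num_nodes: int) -> float:
    """Average number of sequences each GPU must process per step."""
    sequences = global_batch_tokens / seq_len
    return sequences / (num_nodes * GPUS_PER_P4D_NODE)


def min_tensor_parallel_degree(per_gpu_batch: float) -> int:
    """Smallest tensor parallel group size such that each group
    processes at least one whole sequence per step."""
    return max(1, math.ceil(1 / per_gpu_batch))


per_gpu = sequences_per_gpu(GLOBAL_BATCH_TOKENS, 32768, 32)
print(per_gpu, min_tensor_parallel_degree(per_gpu))
```

With a sequence length of 4,096 on the same cluster, the same calculation gives 4 sequences per GPU, so tensor parallelism is not required just to fit the batch.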

Conclusion

In this post, we demonstrated efficient LLM training with SMP and SMDDP on p4d instances, attributing contributions to multiple key features, such as SMDDP enhancement over NCCL, flexible FSDP hybrid sharding instead of full sharding, TE integration, and enabling tensor parallelism in favor of long sequence lengths. After being tested over a wide range of settings with various models, model sizes, and sequence lengths, this approach exhibits robust near-linear scaling efficiencies up to 128 p4d instances on SageMaker. In summary, SageMaker continues to be a powerful tool for LLM researchers and practitioners.

To learn more, refer to SageMaker model parallelism library v2, or contact the SMP team at sm-model-parallel-feedback@amazon.com.

Acknowledgements

We’d like to thank Robert Van Dusen, Ben Snyder, Gautam Kumar, and Luis Quintela for their constructive feedback and discussions.


About the Authors

Xinle Sheila Liu is an SDE in Amazon SageMaker. In her spare time, she enjoys reading and outdoor sports.

Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.

Victor Zhu is a Software Engineer in Distributed Deep Learning at Amazon Web Services. He can be found enjoying hiking and board games around the SF Bay Area.

Derya Cavdar works as a software engineer at AWS. Her interests include deep learning and distributed training optimization.

Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.

Manage your Amazon Lex bot via AWS CloudFormation templates

Amazon Lex is a fully managed artificial intelligence (AI) service with advanced natural language models to design, build, test, and deploy conversational interfaces in applications. It employs advanced deep learning technologies to understand user input, enabling developers to create chatbots, virtual assistants, and other applications that can interact with users in natural language.

Managing your Amazon Lex bots using AWS CloudFormation allows you to create templates defining the bot and all the AWS resources it depends on. AWS CloudFormation provides and configures those resources on your behalf, removing the risk of human error when deploying bots to new environments. The benefits of using CloudFormation include:

  • Consistency – A CloudFormation template provides a more consistent and automated way to deploy and manage the resources associated with an Amazon Lex bot.
  • Version control – With AWS CloudFormation, you can use version control systems like Git to manage your CloudFormation templates. This allows you to maintain different versions of your bot and roll back to previous versions if needed.
  • Reusability – You can reuse CloudFormation templates across multiple environments, such as development, staging, and production. This saves time and effort in defining the same bot across different environments.
  • Expandability – As your Amazon Lex bot grows in complexity, managing it through the AWS Management Console becomes more challenging. AWS CloudFormation allows for a more streamlined and efficient approach to managing the bot’s definition and resources.
  • Automation – Using a CloudFormation template allows you to automate the deployment process. You can use AWS services like AWS CodePipeline and AWS CodeBuild to build, test, and deploy your Amazon Lex bot automatically.

In this post, we guide you through the steps involved in creating a CloudFormation template for an Amazon Lex V2 bot.

Solution overview

We have chosen the Book Trip bot as our starting point for this exercise. We use a CloudFormation template to create a new bot from scratch, including defining intents, slots, and other required components. Additionally, we explore topics such as version control, aliases, integrating AWS Lambda functions, creating conditional branches, and enabling logging.

Prerequisites

You should have the following prerequisites:

  • An AWS account to create and deploy a CloudFormation template
  • The necessary AWS Identity and Access Management (IAM) permissions to deploy AWS CloudFormation and the resources used in the template
  • Basic knowledge of Amazon Lex, Lambda functions, and associated services
  • Basic knowledge of creating and deploying CloudFormation templates

Create an IAM role

To begin, you need to create an IAM role that the bot will use. You can achieve this by initializing a CloudFormation template and adding the IAM role as a resource. You can use the following template to create the role. If you download the example template and deploy it, you should see that an IAM role has been created. We provide examples of templates as we go through this post and merge them as we get further along.

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: CloudFormation template for book hotel bot.
Resources:
  # 1. IAM role that is used by the bot at runtime
  BotRuntimeRole:    
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lexv2.amazonaws.com
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: LexRuntimeRolePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "polly:SynthesizeSpeech"
                  - "comprehend:DetectSentiment"
                Resource: "*"

Configure the Amazon Lex bot

Next, you need to add the bot definition. The following is the YAML template for the Amazon Lex bot definition; you construct the necessary components one by one:

Type: AWS::Lex::Bot
Properties:
  AutoBuildBotLocales: Boolean
  BotFileS3Location: 
    S3Location
  BotLocales: 
    - BotLocale
  BotTags: 
    - Tag
  DataPrivacy: 
    DataPrivacy
  Description: String
  IdleSessionTTLInSeconds: Integer
  Name: String
  RoleArn: String
  TestBotAliasSettings: 
    TestBotAliasSettings
  TestBotAliasTags: 
    - Tag

To create a bot that only includes the bot definition without any intent, you can use the following template. Here, you specify the bot’s name, the ARN of the role that you previously created, data privacy settings, and more:

  BookHotelBot:
    DependsOn: BotRuntimeRole # The role created in the previous step
    Type: AWS::Lex::Bot
    Properties:
      Name: "BookHotel"
      Description: "Sample Bot to book a hotel"
      RoleArn: !GetAtt BotRuntimeRole.Arn      
      #For each Amazon Lex bot created with the Amazon Lex Model Building Service, you must specify whether your use of Amazon Lex 
      #is related to a website, program, or other application that is directed or targeted, in whole or in part, to children under 
      #age 13 and subject to the Children's Online Privacy Protection Act (COPPA) by specifying true or false in the 
      #childDirected field.
      DataPrivacy:
        ChildDirected: false
      IdleSessionTTLInSeconds: 300

You can download the updated template. Deploying the updated template allows you to create both the role and the bot definition. Note that you’re updating the stack you created in the previous step.

The final step entails defining the BotLocales, which form the majority of the bot’s functionality. This includes, for example, Intents and Slot types. The following is the YAML template:

  CustomVocabulary: 
    CustomVocabulary
  Description: String
  Intents: 
    - Intent
  LocaleId: String
  NluConfidenceThreshold: Number
  SlotTypes: 
    - SlotType
  VoiceSettings: 
    VoiceSettings

In this case, you build the BookHotel intent, which requires a custom slot type for room types. You set the LocaleId, then the VoiceSettings. Then you add the SlotTypes and their corresponding values.

The next step is to define the Intents, starting with the first intent, BookHotel, which involves adding utterances, slots, and slot priorities. The details of these nodes are demonstrated in the provided template. Finally, you add the second intent, which is the FallbackIntent. See the following code:

      BotLocales:
        - LocaleId: "en_US"
          Description: "en US locale"
          NluConfidenceThreshold: 0.40
          VoiceSettings:
            VoiceId: "Matthew"
          SlotTypes:
            - Name: "RoomTypeValues"
              Description: "Type of room"
              SlotTypeValues:
                - SampleValue:
                    Value: queen
                - SampleValue:
                    Value: king
                - SampleValue:
                    Value: deluxe
              ValueSelectionSetting:
                ResolutionStrategy: ORIGINAL_VALUE
          Intents:
            - Name: "BookHotel"
              Description: "Intent to book a hotel room"
              SampleUtterances:
                - Utterance: "Book a hotel"
                - Utterance: "I want to make hotel reservations"
                - Utterance: "Book a {Nights} night stay in {Location}"
              IntentConfirmationSetting:
                PromptSpecification:
                  MessageGroupsList:
                    - Message:
                        PlainTextMessage:
                          Value: "Okay, I have you down for a {Nights} night stay in {Location} starting {CheckInDate}.  Shall I book the reservation?"
                  MaxRetries: 3
                  AllowInterrupt: false
                DeclinationResponse:
                  MessageGroupsList:
                    - Message:
                        PlainTextMessage:
                          Value: "Okay, I have cancelled your reservation in progress."
                  AllowInterrupt: false
              SlotPriorities:
                - Priority: 1
                  SlotName: Location
                - Priority: 2
                  SlotName: CheckInDate
                - Priority: 3
                  SlotName: Nights
                - Priority: 4
                  SlotName: RoomType
              Slots:
                - Name: "Location"
                  Description: "Location of the city in which the hotel is located"
                  SlotTypeName: "AMAZON.City"
                  ValueElicitationSetting:
                    SlotConstraint: "Required"
                    PromptSpecification:
                      MessageGroupsList:
                        - Message:
                            PlainTextMessage:
                              Value: "What city will you be staying in?"
                      MaxRetries: 2
                      AllowInterrupt: false
                - Name: "CheckInDate"
                  Description: "Date of check-in"
                  SlotTypeName: "AMAZON.Date"
                  ValueElicitationSetting:
                    SlotConstraint: "Required"
                    PromptSpecification:
                      MessageGroupsList:
                        - Message:
                            PlainTextMessage:
                              Value: "What day do you want to check in?"
                      MaxRetries: 2
                      AllowInterrupt: false
                - Name: "Nights"
                  Description: "Number of nights"
                  SlotTypeName: "AMAZON.Number"
                  ValueElicitationSetting:
                    SlotConstraint: "Required"
                    PromptSpecification:
                      MessageGroupsList:
                        - Message:
                            PlainTextMessage:
                              Value: "How many nights will you be staying?"
                      MaxRetries: 2
                      AllowInterrupt: false
                - Name: "RoomType"
                  Description: "Enumeration of types of rooms that are offered by a hotel."
                  SlotTypeName: "RoomTypeValues"
                  ValueElicitationSetting:
                    SlotConstraint: "Required"
                    PromptSpecification:
                      MessageGroupsList:
                        - Message:
                            PlainTextMessage:
                              Value: "What type of room would you like, queen, king or deluxe?"
                      MaxRetries: 2
                      AllowInterrupt: false
            - Name: "FallbackIntent"
              Description: "Default intent when no other intent matches"
              ParentIntentSignature: "AMAZON.FallbackIntent"

You can download the CloudFormation template for the work done until now. After you update your stack with this template, a functional bot will be deployed. On the Amazon Lex console, you can confirm that there is a draft version of the bot, and a default alias named TestBotAlias has been created.

bot alias

Create a new bot version and alias

Amazon Lex supports publishing versions of bots, intents, and slot types so that you can control which implementation your client applications use. A version is a numbered snapshot of your bot definition that you can publish for use in different parts of your workflow, such as development, beta deployment, and production. Amazon Lex bots also support aliases. An alias is a pointer to a specific version of a bot. With an alias, you can change the version that your client applications use without changing the applications themselves. In practical scenarios, bot aliases are used for blue/green deployments and for managing environment-specific configurations, such as development and production environments.

To illustrate, let’s say you point an alias to version 1 of your bot. When it’s time to update the bot, you can publish version 2 and change the alias to point to the new version. Because your applications use the alias instead of a specific version, all clients receive the new functionality without requiring updates.
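
The pointer behavior can be pictured with a toy in-memory model. This sketch is not the Amazon Lex API; the `BotAliasRegistry` class and its methods are hypothetical and exist only to illustrate why clients that reference an alias pick up a new version without changes.

```python
class BotAliasRegistry:
    """Toy model of alias -> version pointers (hypothetical, not the Lex API)."""

    def __init__(self):
        self._aliases = {}

    def point(self, alias: str, version: str) -> None:
        """Create the alias or move it to another published version."""
        self._aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """Return the version a client reaches when it calls the alias."""
        return self._aliases[alias]


registry = BotAliasRegistry()

# The alias initially points to version 1.
registry.point("BookHotelDemoAlias", "1")
version_before = registry.resolve("BookHotelDemoAlias")

# After publishing version 2, only the pointer moves; clients keep using the alias name.
registry.point("BookHotelDemoAlias", "2")
version_after = registry.resolve("BookHotelDemoAlias")

print(version_before, version_after)
```

Rolling back works the same way in this model: pointing the alias at the earlier version restores the previous behavior for every client.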

Keep in mind that when you modify the CloudFormation template and initiate deployment, the changes are implemented within the draft version, primarily meant for testing. After you complete your testing phase, you can establish a new version to finalize the changes you’ve incorporated so far.

Next, you create a new bot version based on your draft, set up a new alias, and link the version to this alias. The following are the two new resources to add to your template:

  BookHotelInitialVersion:
    DependsOn: BookHotelBot
    Type: AWS::Lex::BotVersion
    Properties:
      BotId: !Ref BookHotelBot
      BotVersionLocaleSpecification:
        - LocaleId: en_US
          BotVersionLocaleDetails:
            SourceBotVersion: DRAFT
      Description: Hotel Bot initial version

  BookHotelDemoAlias:
    Type: AWS::Lex::BotAlias
    Properties:
      BotId: !Ref BookHotelBot
      BotAliasName: "BookHotelDemoAlias"
      BotVersion: !GetAtt BookHotelInitialVersion.BotVersion

You can download the new version of the template and deploy it by updating your stack. You can see on the Amazon Lex console that a new version is created and associated with a new alias called BookHotelDemoAlias.

demo alias

When you create a new version of an Amazon Lex bot, it typically increments the version number sequentially, starting from 1. To discern a specific version, you can refer to its description.

initial version

Add a Lambda function

To initialize values or validate user input for your bot, you can add a Lambda function as a code hook to your bot. Similarly, you can use a Lambda function for fulfillment, for example, writing data to databases or calling APIs to save the collected information. For more information, refer to Enabling custom logic with AWS Lambda functions.

Let’s add a new resource for the Lambda function to the CloudFormation template. Although it’s generally not advised to embed code in CloudFormation templates, we do so here solely for the sake of making the demo deployment less complicated. See the following code:

  HotelBotFunction:
    DependsOn: BotRuntimeRole # So that the Lambda function is ready before the bot deployment
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: book_hotel_lambda
      Runtime: python3.11
      Timeout: 15
      Handler: index.lambda_handler
      InlineCode: |
        import os
        import json

        def close(intent_request):
            intent_request['sessionState']['intent']['state'] = 'Fulfilled'

            message = {"contentType": "PlainText",
                      "content": "Your Booking is confirmed"}

            session_attributes = {}
            sessionState = intent_request['sessionState']
            if 'sessionAttributes' in sessionState:
                session_attributes = sessionState['sessionAttributes']

            requestAttributes = None
            if 'requestAttributes' in intent_request:
                requestAttributes = intent_request['requestAttributes']

            return {
                'sessionState': {
                    'sessionAttributes': session_attributes,
                    'dialogAction': {
                        'type': 'Close'
                    },
                    'intent': intent_request['sessionState']['intent'],
                    'originatingRequestId': 'xxxxxxx-xxxx-xxxx-xxxx'
                },
                'messages':  [message],
                'sessionId': intent_request['sessionId'],
                'requestAttributes': requestAttributes
            }

        def router(event):
            intent_name = event['sessionState']['intent']['name']
            slots = event['sessionState']['intent']['slots']
            if (intent_name == 'BookHotel'):
                # invoke lambda and return result
                return close(event)

            raise Exception(
                'The intent is not supported by Lambda: ' + intent_name)

        def lambda_handler(event, context):
            response = router(event)
            return response
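
Before deploying, it can help to exercise the fulfillment logic locally. The following sketch uses a condensed stand-in for the handler above (it keeps the same routing and the same Close response, but omits fields such as requestAttributes) and a hand-written, trimmed event. The event is an assumption about the general shape of an Amazon Lex V2 request, not a captured payload, and the slot values are made up.

```python
def close(intent_request):
    """Mark the intent as fulfilled and return a Close dialog action."""
    session_state = intent_request["sessionState"]
    intent = dict(session_state["intent"], state="Fulfilled")
    return {
        "sessionState": {
            "sessionAttributes": session_state.get("sessionAttributes", {}),
            "dialogAction": {"type": "Close"},
            "intent": intent,
        },
        "messages": [
            {"contentType": "PlainText", "content": "Your Booking is confirmed"}
        ],
    }


def lambda_handler(event, context):
    """Route supported intents to fulfillment; reject anything else."""
    intent_name = event["sessionState"]["intent"]["name"]
    if intent_name == "BookHotel":
        return close(event)
    raise Exception("The intent is not supported by Lambda: " + intent_name)


# Hand-written sample event (illustrative values only).
sample_event = {
    "sessionId": "local-test-session",
    "sessionState": {
        "intent": {
            "name": "BookHotel",
            "state": "ReadyForFulfillment",
            "slots": {
                "Location": {"value": {"interpretedValue": "Seattle"}},
                "Nights": {"value": {"interpretedValue": "2"}},
            },
        }
    },
}

response = lambda_handler(sample_event, None)
print(response["sessionState"]["intent"]["state"])
```

Because the handler raises an exception for any other intent, a second event with a different intent name is a quick way to confirm the error path.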

To use this Lambda function for the fulfillment, enable the code hook settings in your intent:

Intents:
  - Name: "BookHotel"
    Description: "Intent to book a hotel room"
    FulfillmentCodeHook:
      Enabled: true
    SampleUtterances:
      - Utterance: "Book a hotel"
      - Utterance: "I want to make hotel reservations"
      - Utterance: "Book a {Nights} night stay in {Location}"

Because you made changes to your bot, you can create a new version of the bot by adding a new resource named BookHotelVersionWithLambda in the template:

  BookHotelVersionWithLambda:
    DependsOn: BookHotelInitialVersion
    Type: AWS::Lex::BotVersion
    Properties:
      BotId: !Ref BookHotelBot
      BotVersionLocaleSpecification:
        - LocaleId: en_US
          BotVersionLocaleDetails:
            SourceBotVersion: DRAFT
      Description: Hotel Bot with a lambda function

The Lambda function is associated with a bot alias. Amazon Lex V2 can use one Lambda function per bot alias per language. Therefore, you must update your alias in the template to add the Lambda function resource. You can do so in the BotAliasLocaleSettings section. You also need to point the alias to the new version you created. The following code is the modified alias configuration:

  BookHotelDemoAlias:
    Type: AWS::Lex::BotAlias
    Properties:
      BotId: !Ref BookHotelBot
      BotAliasName: "BookHotelDemoAlias"
      BotVersion: !GetAtt BookHotelVersionWithLambda.BotVersion
      # Remove BotAliasLocaleSettings if you aren't concerned with Lambda setup.
      # If you are you can modify the LambdaArn below to get started.
      BotAliasLocaleSettings:
        - LocaleId: en_US
          BotAliasLocaleSetting:
            Enabled: true
            CodeHookSpecification:
              LambdaCodeHook:
                CodeHookInterfaceVersion: "1.0"
                LambdaArn: !GetAtt HotelBotFunction.Arn

Up until now, you have only linked the Lambda function with the alias. However, you need to grant permission to allow the alias to invoke the Lambda function. In the following code, you add the Lambda invoke permission for Amazon Lex and specify the alias ARN as the source ARN:

  LexInvokeLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: "lambda:InvokeFunction"
      FunctionName: !GetAtt HotelBotFunction.Arn
      Principal: "lexv2.amazonaws.com"
      SourceArn: !GetAtt BookHotelDemoAlias.Arn

You can download the latest version of the template. After updating your stack with this version, you will have an Amazon Lex bot integrated with a Lambda function.

second version

updated alias

Conditional branches

Now let’s explore the conditional branch feature of the Amazon Lex bot and consider a scenario where booking more than five nights in Seattle is not allowed for the next week. As per the business requirement, the conversation should end with an appropriate message if the user attempts to book more than five nights in Seattle. The conditional branch for that is represented in the CloudFormation template under the SlotCaptureSetting:

                - Name: "Nights"
                  Description: "Number of nights."
                  SlotTypeName: "AMAZON.Number"
                  ValueElicitationSetting:
                    SlotConstraint: "Required"
                    SlotCaptureSetting:
                      CaptureConditional:
                        DefaultBranch:
                          NextStep:
                            DialogAction:
                              Type: "ElicitSlot"
                              SlotToElicit: "RoomType"
                        ConditionalBranches:
                          - Name: "Branch1"
                            Condition:
                              ExpressionString: '{Nights}>5 AND {Location} = "Seattle"'
                            Response:
                              AllowInterrupt: true
                              MessageGroupsList:
                                - Message:
                                    PlainTextMessage:
                                      Value: "Sorry, we cannot book more than five nights in {Location} right now."
                            NextStep:
                              DialogAction:
                                Type: "EndConversation"
                        IsActive: true

                    PromptSpecification:
                      MessageGroupsList:
                        - Message:
                            PlainTextMessage:
                              Value: "How many nights will you be staying?"
                      MaxRetries: 2
                      AllowInterrupt: false
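
The branching logic in this slot definition can be sketched in a few lines. The `next_step` function below is a hypothetical illustration of the decision only; it assumes an exact, case-sensitive match on the city name, which may differ from how Amazon Lex evaluates the expression at runtime.

```python
def next_step(nights: int, location: str) -> str:
    """Mirror of the conditional branch: end the conversation for long
    Seattle stays, otherwise continue by eliciting the RoomType slot."""
    # Corresponds to ExpressionString: {Nights}>5 AND {Location} = "Seattle"
    if nights > 5 and location == "Seattle":
        return "EndConversation"
    # Corresponds to DefaultBranch: ElicitSlot for RoomType
    return "ElicitSlot:RoomType"


for nights, location in [(6, "Seattle"), (5, "Seattle"), (7, "Boston")]:
    print(nights, location, next_step(nights, location))
```

Note that exactly five nights in Seattle still follows the default branch, because the condition uses a strict greater-than comparison.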

Because you changed the bot definition, you need to create a new version in the template and link it with the alias. This is a temporary modification because the business plans to allow large bookings in Seattle soon. The following code shows the new version resource and the updated alias:

  BookHotelConditionalBranches:
    DependsOn: BookHotelVersionWithLambda
    Type: AWS::Lex::BotVersion
    Properties:
      BotId: !Ref BookHotelBot
      BotVersionLocaleSpecification:
        - LocaleId: en_US
          BotVersionLocaleDetails:
            SourceBotVersion: DRAFT
      Description: Hotel Bot Version with conditional branches

  BookHotelDemoAlias:
    Type: AWS::Lex::BotAlias
    Properties:
      BotId: !Ref BookHotelBot
      BotAliasName: "BookHotelDemoAlias"
      BotVersion: !GetAtt BookHotelConditionalBranches.BotVersion
      # Remove BotAliasLocaleSettings if you aren't concerned with Lambda setup.
      # If you are you can modify the LambdaArn below to get started.
      BotAliasLocaleSettings:
        - LocaleId: en_US
          BotAliasLocaleSetting:
            Enabled: true
            CodeHookSpecification:
              LambdaCodeHook:
                CodeHookInterfaceVersion: "1.0"
                LambdaArn: !GetAtt HotelBotFunction.Arn

You can download the updated template. After you update your stack with this template version, the alias will be directed to the version incorporating the conditional branching feature. To undo this modification, you can update the alias to revert back to the previous version.

third version

alias for third version

Logs

You can also enable logs for your Amazon Lex bot. To do so, you must update the bot’s role to grant permissions for writing Amazon CloudWatch logs. The following is an example of adding a CloudWatch policy to the role:

  BotRuntimeRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lexv2.amazonaws.com
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: LexRuntimeRolePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "polly:SynthesizeSpeech"
                  - "comprehend:DetectSentiment"
                Resource: "*"
        - PolicyName: CloudWatchPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                Resource: "*"

The preceding policies use the wildcard character (*) as the resource to keep the example short. Wildcards grant broader access than the bot needs, which can pose security risks and lead to unintended consequences. To ensure consistent and predictable behavior, be as specific as possible when defining resource names and properties in CloudFormation templates, and use explicit values instead of wildcards wherever possible.
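
As one way to narrow the CloudWatch statement, the resource can be limited to a single log group. The sketch below builds such a policy document as a Python dictionary. The Region and account ID are placeholders, the `aws` partition is assumed, and the helper names are ours; in the template itself, the equivalent would likely reference the log group's ARN rather than a hard-coded string.

```python
import json


def log_group_arn(region: str, account_id: str, log_group_name: str) -> str:
    """ARN pattern covering one log group and the log streams inside it."""
    return f"arn:aws:logs:{region}:{account_id}:log-group:{log_group_name}:*"


def scoped_cloudwatch_policy(region: str, account_id: str, log_group_name: str) -> dict:
    """Policy document that allows writing logs to a single log group only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": log_group_arn(region, account_id, log_group_name),
            }
        ],
    }


# Placeholder Region and account ID; /lex/hotel-bot is the log group name
# used for this bot's conversation logs.
policy = scoped_cloudwatch_policy("us-east-1", "123456789012", "/lex/hotel-bot")
print(json.dumps(policy, indent=2))
```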

Next, you create a CloudWatch log group resource, as shown in the following code, to direct your logs to this group:

  #Log Group
  LexLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /lex/hotel-bot
      RetentionInDays: 5

Finally, you update your alias to enable conversation log settings:

  BookHotelDemoAlias:
    Type: AWS::Lex::BotAlias
    Properties:
      BotId: !Ref BookHotelBot
      BotAliasName: "BookHotelDemoAlias"
      BotVersion: !GetAtt BookHotelConditionalBranches.BotVersion
      BotAliasLocaleSettings:
        - LocaleId: en_US
          BotAliasLocaleSetting:
            Enabled: true
            CodeHookSpecification:
              LambdaCodeHook:
                CodeHookInterfaceVersion: "1.0"
                LambdaArn: !GetAtt HotelBotFunction.Arn
      ConversationLogSettings:
        TextLogSettings:
          - Destination:
              CloudWatch:
                CloudWatchLogGroupArn: !GetAtt LexLogGroup.Arn
                LogPrefix: bookHotel
            Enabled: true

When you update the stack with this template, you enable the conversation logs for your bot. A new version is not created in this step because there are no changes to your bot resource. You can download the latest version of the template.

Clean up

To prevent incurring charges in the future, delete the CloudFormation stack you created.

Conclusion

In this post, we discussed the step-by-step process to create a CloudFormation template for an Amazon Lex V2 bot. Initially, we deployed a basic bot, then we explored the potential of aliases and versions and how to use them efficiently with templates. Next, we learned how to integrate a Lambda function with an Amazon Lex V2 bot and implemented conditional branching in the bot’s conversation flow to accommodate business requirements. Finally, we added logging features by creating a CloudWatch log group resource and updating the bot’s role with the necessary permissions.

The template allows for the straightforward deployment and management of the bot, with the ability to revert changes as necessary. Overall, the CloudFormation template is useful for managing and optimizing an Amazon Lex V2 bot.

As the next step, you can explore sample Amazon Lex bots and apply the techniques discussed in this post to convert them into CloudFormation templates. This hands-on practice will solidify your understanding of managing Amazon Lex V2 bots through infrastructure as code.


About the Authors

Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.

Rijeesh Akkambeth Chathoth is a Professional Services Consultant at AWS. He helps customers achieve their desired business outcomes in the Contact Center space by leveraging Amazon Connect, Amazon Lex, and GenAI features.

A secure approach to generative AI with AWS

Generative artificial intelligence (AI) is transforming the customer experience in industries across the globe. Customers are building generative AI applications using large language models (LLMs) and other foundation models (FMs), which enhance customer experiences, transform operations, improve employee productivity, and create new revenue channels.

FMs and the applications built around them represent extremely valuable investments for our customers. They’re often used with highly sensitive business data, like personal data, compliance data, operational data, and financial information, to optimize the model’s output. The biggest concern we hear from customers as they explore the advantages of generative AI is how to protect their highly sensitive data and investments. Because their data and model weights are incredibly valuable, customers require them to stay protected, secure, and private, whether from their own administrators’ accounts, their customers, vulnerabilities in software running in their own environments, or even their cloud service provider.

At AWS, our top priority is safeguarding the security and confidentiality of our customers’ workloads. We think about security across the three layers of our generative AI stack:

  • Bottom layer – Provides the tools for building and training LLMs and other FMs
  • Middle layer – Provides access to all the models along with tools you need to build and scale generative AI applications
  • Top layer – Includes applications that use LLMs and other FMs to make work stress-free by writing and debugging code, generating content, deriving insights, and taking action

Each layer is important to making generative AI pervasive and transformative.

With the AWS Nitro System, we delivered a first-of-its-kind innovation on behalf of our customers. The Nitro System is an unparalleled computing backbone for AWS, with security and performance at its core. Its specialized hardware and associated firmware are designed to enforce restrictions so that nobody, including anyone in AWS, can access your workloads or data running on your Amazon Elastic Compute Cloud (Amazon EC2) instances. Customers have benefited from this confidentiality and isolation from AWS operators on all Nitro-based EC2 instances since 2017.

By design, there is no mechanism for any Amazon employee to access a Nitro EC2 instance that customers use to run their workloads, or to access data that customers send to a machine learning (ML) accelerator or GPU. This protection applies to all Nitro-based instances, including instances with ML accelerators like AWS Inferentia and AWS Trainium, and instances with GPUs like P4, P5, G5, and G6.

The Nitro System enables Elastic Fabric Adapter (EFA), which uses the AWS-built AWS Scalable Reliable Datagram (SRD) communication protocol for cloud-scale elastic and large-scale distributed training, enabling the only always-encrypted Remote Direct Memory Access (RDMA) capable network. All communication through EFA is encrypted with VPC encryption without incurring any performance penalty.

The design of the Nitro System has been validated by the NCC Group, an independent cybersecurity firm. AWS delivers a high level of protection for customer workloads, and we believe this is the level of security and confidentiality that customers should expect from their cloud provider. This level of protection is so critical that we’ve added it to our AWS Service Terms to provide an additional assurance to all of our customers.

Innovating secure generative AI workloads using AWS industry-leading security capabilities

From day one, AWS AI infrastructure and services have had built-in security and privacy features to give you control over your data. As customers move quickly to implement generative AI in their organizations, you need to know that your data is being handled securely across the AI lifecycle, including data preparation, training, and inferencing. The security of model weights—the parameters that a model learns during training that are critical for its ability to make predictions—is paramount to protecting your data and maintaining model integrity.

This is why it is critical for AWS to continue to innovate on behalf of our customers to raise the bar on security across each layer of the generative AI stack. To do this, we believe that you must have security and confidentiality built in across each layer of the generative AI stack. You need to be able to secure the infrastructure to train LLMs and other FMs, build securely with tools to run LLMs and other FMs, and run applications that use FMs with built-in security and privacy that you can trust.

At AWS, securing AI infrastructure refers to zero access to sensitive AI data, such as AI model weights and data processed with those models, by any unauthorized person, whether at the infrastructure operator or at the customer. It comprises three key principles:

  1. Complete isolation of the AI data from the infrastructure operator – The infrastructure operator must have no ability to access customer content and AI data, such as AI model weights and data processed with models.
  2. Ability for customers to isolate AI data from themselves – The infrastructure must provide a mechanism to allow model weights and data to be loaded into hardware, while remaining isolated and inaccessible from customers’ own users and software.
  3. Protected infrastructure communications – The communication between devices in the ML accelerator infrastructure must be protected. All externally accessible links between the devices must be encrypted.

The Nitro System fulfills the first principle of Secure AI Infrastructure by isolating your AI data from AWS operators. The second principle provides you with a way to remove the administrative access of your own users and software to your AI data. AWS not only offers you a way to achieve that, but we also made it straightforward and practical by investing in building an integrated solution between AWS Nitro Enclaves and AWS Key Management Service (AWS KMS). With Nitro Enclaves and AWS KMS, you can encrypt your sensitive AI data using keys that you own and control, store that data in a location of your choice, and securely transfer the encrypted data to an isolated compute environment for inferencing. Throughout this entire process, the sensitive AI data is encrypted and isolated from your own users and software on your EC2 instance, and AWS operators cannot access this data. Use cases that have benefited from this flow include running LLM inferencing in an enclave. Until today, Nitro Enclaves have operated only in the CPU, limiting the potential for larger generative AI models and more complex processing.
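
As an illustration of how a key you control can be tied to an enclave, the following is a minimal sketch of a KMS key policy that permits decryption only when the request carries an attestation document for a specific enclave image. It is not taken from this post: the ModelWeightsKey resource name, the EnclaveInstanceRole role name, and the image digest placeholder are all assumptions, and a production policy would need review against your own administration model.

```yaml
  # Hypothetical key for illustration; names and digest are placeholders
  ModelWeightsKey:
    Type: AWS::KMS::Key
    Properties:
      Description: Key for encrypting sensitive AI data (illustrative)
      KeyPolicy:
        Version: "2012-10-17"
        Statement:
          # Key management only; this statement grants no decrypt permission
          - Sid: AllowKeyManagement
            Effect: Allow
            Principal:
              AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
            Action:
              - kms:Describe*
              - kms:Get*
              - kms:List*
              - kms:PutKeyPolicy
              - kms:ScheduleKeyDeletion
              - kms:CancelKeyDeletion
            Resource: "*"
          # Decrypt is allowed only for requests whose attestation document
          # matches the expected enclave image digest
          - Sid: AllowDecryptFromAttestedEnclave
            Effect: Allow
            Principal:
              AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:role/EnclaveInstanceRole
            Action: kms:Decrypt
            Resource: "*"
            Condition:
              StringEqualsIgnoreCase:
                kms:RecipientAttestation:ImageSha384: "<enclave-image-sha384-digest>"
```

In a key policy, Resource: "*" refers to the key the policy is attached to. The kms:RecipientAttestation:ImageSha384 condition key is what restricts decryption to the attested enclave image, so software running on the parent instance under the same role cannot decrypt the data directly.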

We announced our plans to extend this Nitro end-to-end encrypted flow to include first-class integration with ML accelerators and GPUs, fulfilling the third principle. You will be able to decrypt and load sensitive AI data into an ML accelerator for processing while providing isolation from your own operators and verified authenticity of the application used for processing the AI data. Through the Nitro System, you can cryptographically validate your applications to AWS KMS and decrypt data only when the necessary checks pass. This enhancement allows AWS to offer end-to-end encryption for your data as it flows through generative AI workloads.

We plan to offer this end-to-end encrypted flow in the upcoming AWS-designed Trainium2 as well as GPU instances based on NVIDIA’s upcoming Blackwell architecture, which both offer secure communications between devices, the third principle of Secure AI Infrastructure. AWS and NVIDIA are collaborating closely to bring a joint solution to market, including NVIDIA’s new Blackwell GPU platform, which couples NVIDIA’s GB200 NVL72 solution with the Nitro System and EFA technologies to provide an industry-leading solution for securely building and deploying next-generation generative AI applications.

Advancing the future of generative AI security

Today, tens of thousands of customers are using AWS to experiment and move transformative generative AI applications into production. Generative AI workloads contain highly valuable and sensitive data that needs this level of protection from both your own operators and the cloud service provider. Customers using AWS Nitro-based EC2 instances have received this level of protection and isolation from AWS operators since 2017, when we launched our innovative Nitro System.

At AWS, we’re continuing that innovation as we invest in building performant and accessible capabilities to make it practical for our customers to secure their generative AI workloads across the three layers of the generative AI stack, so that you can focus on what you do best: building and extending the uses of generative AI to more areas. Learn more here.


About the authors

Anthony Liguori is an AWS VP and Distinguished Engineer for EC2.

Colm MacCárthaigh is an AWS VP and Distinguished Engineer for EC2.
