Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

See CHANGELOG for latest features and fixes.

You’ve likely experienced the challenge of taking notes during a meeting while trying to pay attention to the conversation. You’ve probably also experienced the need to quickly fact-check something that’s been said, or look up information to answer a question that’s just been asked in the call. Or maybe you have a team member that always joins meetings late, and expects you to send them a quick summary over chat to catch them up.

Then there are the times that others are talking in a language that’s not your first language, and you’d love to have a live translation of what people are saying to make sure you understand correctly.

And after the call is over, you usually want to capture a summary for your records, or to send to the participants, with a list of all the action items, owners, and due dates.

All of this, and more, is now possible with our newest sample solution, Live Meeting Assistant (LMA).

Check out the following demo to see how it works.

In this post, we show you how to use LMA with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock.

Solution overview

The LMA sample solution captures speaker audio and metadata from your browser-based meeting app (as of this writing, Zoom and Chime are supported), or audio only from any other browser-based meeting app, softphone, or audio source. It uses Amazon Transcribe for speech to text, Knowledge Bases for Amazon Bedrock for contextual queries against your company’s documents and knowledge sources, and Amazon Bedrock models for customizable transcription insights and summaries.

Everything you need is provided as open source in our GitHub repo. It’s straightforward to deploy in your AWS account. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things LMA can do:

  • Live transcription with speaker attribution – LMA is powered by Amazon Transcribe ASR models for low-latency, high-accuracy speech to text. You can teach it brand names and domain-specific terminology if needed, using custom vocabulary and custom language model features in Amazon Transcribe.
  • Live translation – It uses Amazon Translate to optionally show each segment of the conversation translated into your language of choice, from a selection of 75 languages.
  • Context-aware meeting assistant – It uses Knowledge Bases for Amazon Bedrock to provide answers from your trusted sources, using the live transcript as context for fact-checking and follow-up questions. To activate the assistant, just say “Okay, Assistant,” choose the ASK ASSISTANT! button, or enter your own question in the UI.
  • On-demand summaries of the meeting – With the click of a button on the UI, you can generate a summary, which is useful when someone joins late and needs to get caught up. The summaries are generated from the transcript by Amazon Bedrock. LMA also provides options for identifying the current meeting topic, and for generating a list of action items with owners and due dates. You can also create your own custom prompts and corresponding options.
  • Automated summary and insights – When the meeting has ended, LMA automatically runs a set of large language model (LLM) prompts on Amazon Bedrock to summarize the meeting transcript and extract insights. You can customize these prompts as well.
  • Meeting recording – The audio is (optionally) stored for you, so you can replay important sections of the meeting later.
  • Inventory list of meetings – LMA keeps track of all your meetings in a searchable list.
  • Browser extension captures audio and meeting metadata from popular meeting apps – The browser extension captures meeting metadata—the meeting title and names of active speakers—and audio from you (your microphone) and others (from the meeting browser tab). As of this writing, LMA supports Chrome for the browser extension, and Zoom and Chime for meeting apps (with Teams and WebEx coming soon). Standalone meeting apps don’t work with LMA—instead, launch your meetings in the browser.

You are responsible for complying with legal, corporate, and ethical restrictions that apply to recording meetings and calls. Do not use this solution to stream, record, or transcribe calls if otherwise prohibited.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing knowledge base in Amazon Bedrock. If you haven’t set one up yet, see Create a knowledge base. Populate your knowledge base with content to power LMA’s context-aware meeting assistant.
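To see what powers this capability, the following is a minimal sketch of the kind of query the meeting assistant makes against a knowledge base, using boto3 and the Amazon Bedrock RetrieveAndGenerate API. The question, knowledge base ID, and model ARN are illustrative placeholders, not values from the LMA code.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# The knowledge base ID and model ARN are placeholders; copy yours from the Amazon Bedrock console.
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our product's data retention policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "JSXXXXX3D8",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])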

Finally, LMA uses Amazon Bedrock LLMs for its meeting summarization features. Before proceeding, if you have not previously done so, you must request access to the following Amazon Bedrock models:

  • Titan Embeddings G1 – Text
  • Anthropic: All Claude models

Deploy the solution using AWS CloudFormation

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose Launch Stack for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Launch Stack links are provided for the following Regions:
  • US East (N. Virginia)
  • US West (Oregon)
  3. For Stack name, use the default value, LMA.
  4. For Admin Email Address, use a valid email address—your temporary password is emailed to this address during the deployment.
  5. For Authorized Account Email Domain, use the domain name part of your corporate email address to allow users with email addresses in the same domain to create their own new UI accounts, or leave blank to prevent users from directly creating their own accounts. You can enter multiple domains as a comma-separated list.
  6. For MeetingAssistService, choose BEDROCK_KNOWLEDGE_BASE (the only available option as of this writing).
  7. For Meeting Assist Bedrock Knowledge Base Id (existing), enter your existing knowledge base ID (for example, JSXXXXX3D8). You can copy it from the Amazon Bedrock console.
  8. For all other parameters, use the default values.

If you want to customize the settings later, for example to add your own AWS Lambda functions, use custom vocabularies and language models to improve accuracy, enable personally identifiable information (PII) redaction, and more, you can update the stack for these parameters.

  9. Select the acknowledgement check boxes, then choose Create stack.

The main CloudFormation stack uses nested stacks to create the required resources in your AWS account.

The stacks take about 35–40 minutes to deploy. The main stack status shows CREATE_COMPLETE when everything is deployed.
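If you prefer to script the deployment rather than use the console, a CloudFormation stack can also be created with the AWS SDK. The following is a minimal sketch; the template URL and parameter keys shown are placeholders, so check the actual LMA template for the exact names.

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

cloudformation.create_stack(
    StackName="LMA",
    TemplateURL="https://example-bucket.s3.amazonaws.com/lma-main.yaml",  # placeholder template URL
    Parameters=[
        # Parameter keys below are hypothetical; use the keys defined in the LMA template
        {"ParameterKey": "AdminEmail", "ParameterValue": "admin@example.com"},
        {"ParameterKey": "MeetingAssistService", "ParameterValue": "BEDROCK_KNOWLEDGE_BASE"},
        {"ParameterKey": "BedrockKnowledgeBaseId", "ParameterValue": "JSXXXXX3D8"},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)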

Set your password

After you deploy the stack, open the LMA web user interface and set your password by completing the following steps:

  1. Open the email you received, at the email address you provided, with the subject “Welcome to Live Meeting Assistant!”
  2. Open your web browser to the URL shown in the email. You’re directed to the login page.
  3. The email contains a generated temporary password that you use to log in and create your own password. Your user name is your email address.
  4. Set a new password.

Your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

  5. Follow the directions to verify your email address, or choose Skip to do it later.

You’re now logged in to LMA.

You also received a similar email with the subject “QnABot Signup Verification Code.” This email contains a generated temporary password that you use to log in and create your own password in the QnABot designer. You use QnABot designer only if you want to customize LMA options and prompts. Your username for QnABot is Admin. You can set your permanent QnABot Admin password now, or keep this email safe in case you want to customize things later.

Download and install the Chrome browser extension

For the best meeting streaming experience, install the LMA browser plugin (currently available for Chrome):

  1. Choose Download Chrome Extension to download the browser extension .zip file (lma-chrome-extension.zip).
  2. Right-click and expand the .zip file (lma-chrome-extension.zip) to create a local folder named lma-chrome-extension.
  3. Open Chrome and enter the link chrome://extensions into the address bar.
  4. Enable Developer mode.
  5. Choose Load unpacked, navigate to the lma-chrome-extension folder (which you unzipped from the download), and choose Select. This loads your extension.
  6. Pin the new LMA extension to the browser toolbar for easy access—you will use it often to stream your meetings!

Start using LMA

LMA provides two streaming options:

  • Chrome browser extension – Use this to stream audio and speaker metadata from your meeting browser app. It currently works with Zoom and Chime, but we hope to add more meeting apps.
  • LMA Stream Audio tab – Use this to stream audio from your microphone and any Chrome browser-based meeting app, softphone, or audio application.

We show you how to use both options in the following sections.

Use the Chrome browser extension to stream a Zoom call

Complete the following steps to use the browser extension:

  1. Open the LMA extension and log in with your LMA credentials.
  2. Join or start a Zoom meeting in your web browser (do not use the separate Zoom client).

If you already have the Zoom meeting page loaded, reload it.

The LMA extension automatically detects that Zoom is running in the browser tab, and populates your name and the meeting name.

  3. Tell others on the call that you are about to start recording the call using LMA and obtain their permission. Do not proceed if participants object.
  4. Choose Start Listening.
  5. Read and accept the disclaimer, and choose Allow to share the browser tab.

The LMA extension automatically detects and displays the active speaker on the call. If you are alone in the meeting, invite some friends to join, and observe that the names they used to join the call are displayed in the extension when they speak, and are attributed to their words in the LMA transcript.

  6. Choose Open in LMA to see your live transcript in a new tab.
  7. Choose your preferred transcript language, and interact with the meeting assistant using the wake phrase “OK Assistant!” or the Meeting Assist Bot pane.

The ASK ASSISTANT button asks the meeting assistant service (Amazon Bedrock knowledge base) to suggest a good response based on the transcript of the recent interactions in the meeting. Your mileage may vary, so experiment!

  8. When you are done, choose Stop Streaming to end the meeting in LMA.

Within a few seconds, the automated end-of-meeting summaries appear, and the audio recording becomes available. You can continue to use the bot after the call has ended.

Use the LMA UI Stream Audio tab to stream from your microphone and any browser-based audio application

The browser extension is the most convenient way to stream metadata and audio from supported meeting web apps. However, you can also use LMA to stream just the audio from any browser-based softphone, meeting app, or other audio source playing in your Chrome browser, using the convenient Stream Audio tab that is built into the LMA UI.

  1. Open any audio source in a browser tab.

For example, this could be a softphone (such as Google Voice), another meeting app, or for demo purposes, you can simply play a local audio recording or a YouTube video in your browser to emulate another meeting participant. If you just want to try it, open the following YouTube video in a new tab.

  2. In the LMA App UI, choose Stream Audio (no extension) to open the Stream Audio tab.
  3. For Meeting ID, enter a meeting ID.
  4. For Name, enter a name for yourself (applied to audio from your microphone).
  5. For Participant Name(s), enter the names of the participants (applied to the incoming audio source).
  6. Choose Start Streaming.
  7. Choose the browser tab you opened earlier, and choose Allow to share.
  8. Choose the LMA UI tab again to view your new meeting ID listed, showing the meeting as In Progress.
  9. Choose the meeting ID to open the details page, and watch the transcript of the incoming audio, attributed to the participant names that you entered. If you speak, you’ll see the transcription of your own voice.

Use the Stream Audio feature to stream from any softphone app, meeting app, or any other streaming audio playing in the browser, along with your own audio captured from your selected microphone. Always obtain permission from others before recording them using LMA, or any other recording application.

Processing flow overview

How did LMA transcribe and analyze your meeting? Let’s look at how it works. The following diagram shows the main architectural components and how they fit together at a high level.

The LMA user joins a meeting in their browser, enables the LMA browser extension, and authenticates using their LMA credentials. If the meeting app (for example, Zoom.us) is supported by the LMA extension, the user’s name, meeting name, and active speaker names are automatically detected by the extension. If the meeting app is not supported by the extension, then the LMA user can manually enter their name and the meeting topic—active speakers’ names will not be detected.

After getting permission from other participants, the LMA user chooses Start Listening on the LMA extension pane. A secure WebSocket connection is established to the preconfigured LMA stack WebSocket URL, and the user’s authentication token is validated. The LMA browser extension sends a START message to the WebSocket containing the meeting metadata (name, topic, and so on), and starts streaming two-channel audio from the user’s microphone and the incoming audio channel containing the voices of the other meeting participants. The extension monitors the meeting app to detect active speaker changes during the call, and sends that metadata to the WebSocket, enabling LMA to label speech segments with the speaker’s name.

The WebSocket server running in Fargate consumes the real-time two-channel audio fragments from the incoming WebSocket stream. The audio is streamed to Amazon Transcribe, and the transcription results are written in real time to Kinesis Data Streams.
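To make this flow concrete, here is a minimal sketch of streaming two-channel PCM audio to Amazon Transcribe and forwarding final transcript segments to a Kinesis data stream. It assumes the open source amazon-transcribe streaming SDK for Python and boto3, and it illustrates the pattern rather than the actual LMA Fargate code; the stream name, call ID, and audio source are placeholders.

import asyncio
import json

import boto3
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

kinesis = boto3.client("kinesis")

class KinesisForwarder(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Forward only final (non-partial) results to the data stream
        for result in transcript_event.transcript.results:
            if result.is_partial:
                continue
            kinesis.put_record(
                StreamName="lma-call-events",      # placeholder stream name
                PartitionKey="example-call-id",    # placeholder call ID
                Data=json.dumps({
                    "channel": result.channel_id,
                    "transcript": result.alternatives[0].transcript,
                }),
            )

async def transcribe_two_channel(audio_chunks):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=8000,
        media_encoding="pcm",
        number_of_channels=2,               # channel 0: microphone, channel 1: meeting audio
        enable_channel_identification=True,
    )

    async def send_audio():
        async for chunk in audio_chunks:
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = KinesisForwarder(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())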

Each meeting processing session runs until the user chooses Stop Listening in the LMA extension pane, or ends the meeting and closes the tab. At the end of the call, a stereo recording file is created in Amazon S3 (if recording was enabled when the stack was deployed).

A Lambda function called the Call Event Processor, fed by Kinesis Data Streams, processes and optionally enriches meeting metadata and transcription segments. The Call Event Processor integrates with the meeting assist services. LMA is powered by Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs using the open source QnABot on AWS solution for answers based on FAQs and as an orchestrator for request routing to the appropriate AI service. The Call Event Processor also invokes the Transcript Summarization Lambda function when the call ends, to generate a summary of the call from the full transcript.

The Call Event Processor function interfaces with AWS AppSync to persist changes (mutations) in Amazon DynamoDB and send real-time updates to the LMA user’s logged-in web clients (conveniently opened by choosing the Open in LMA option in the browser extension).
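A Kinesis-triggered processor of this kind can be sketched as a short Lambda handler. The table name and item attributes below are hypothetical, and the AppSync mutation is only indicated in a comment; the real Call Event Processor does considerably more.

import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("LMA-CallEvents")  # hypothetical table name

def handler(event, context):
    """Decode transcript segments from the Kinesis trigger and persist them."""
    for record in event["Records"]:
        segment = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "PK": segment.get("CallId", "unknown"),        # hypothetical key schema
            "SK": record["kinesis"]["sequenceNumber"],
            "segment": json.dumps(segment),
        })
        # This is also where LMA enriches segments and issues an AWS AppSync
        # mutation so that subscribed web clients receive real-time updates.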

The LMA web UI assets are hosted on Amazon S3 and served via CloudFront. Authentication is provided by Amazon Cognito.

When the user is authenticated, the web application establishes a secure GraphQL connection to the AWS AppSync API, and subscribes to receive real-time events such as new calls and call status changes for the meetings list page, and new or updated transcription segments and computed analytics for the meeting details page. When translation is enabled, the web application also interacts securely with Amazon Translate to translate the meeting transcription into the selected language.

The entire processing flow, from ingested speech to live webpage updates, is event driven, and the end-to-end latency is short—typically just a few seconds.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the relevant stack’s Events tab. See Troubleshooting CloudFormation for help with common deployment problems. Look out for deployment failures caused by limit exceeded errors; the LMA stacks create resources that are subject to default account and Region service quotas, such as elastic IP addresses and NAT gateways. When troubleshooting CloudFormation stack failures, always navigate into any failed nested stacks to find the first nested resource failure reported—this is almost always the root cause.

Amazon Transcribe has a default limit of 25 concurrent transcription streams, which limits LMA to 25 concurrent meetings in a given AWS account or Region. Request an increase for the number of concurrent HTTP/2 streams for streaming transcription if you have many users and need to handle a larger number of concurrent meetings in your account.
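A quota increase can be requested on the Service Quotas console or programmatically. The following is a minimal boto3 sketch; the quota code is a placeholder that you would first look up from the list call.

import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# List Amazon Transcribe quotas to find the code for concurrent streaming transcriptions
for quota in quotas.list_service_quotas(ServiceCode="transcribe")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase using the quota code you found (placeholder shown here)
# quotas.request_service_quota_increase(
#     ServiceCode="transcribe", QuotaCode="L-XXXXXXXX", DesiredValue=50
# )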

LMA provides runtime monitoring and logs for each component using CloudWatch:

  • WebSocket processing and transcribing Fargate task – On the Amazon Elastic Container Service (Amazon ECS) console, navigate to the Clusters page and open the LMA-WEBSOCKETSTACK-xxxx-TranscribingCluster function. Choose the Tasks tab and open the task page. Choose Logs and View in CloudWatch to inspect the WebSocket transcriber task logs.
  • Call Event Processor Lambda function – On the Lambda console, open the LMA-AISTACK-CallEventProcessor function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • AWS AppSync API – On the AWS AppSync console, open the CallAnalytics-LMA API. Choose Monitoring in the navigation pane to see API metrics. Choose View logs in CloudWatch to inspect AWS AppSync API logs.
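If you prefer to query these logs programmatically, the following is a minimal CloudWatch Logs Insights sketch using boto3; the log group name is illustrative, so copy the actual name from the Lambda console.

import time
import boto3

logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName="/aws/lambda/LMA-AISTACK-CallEventProcessor",  # illustrative log group name
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
)["queryId"]

# Poll until the query finishes, then print the matching log lines
while True:
    results = logs.get_query_results(queryId=query_id)
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})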

For QnABot on AWS for Meeting Assist, refer to the Meeting Assist README, and the QnABot solution implementation guide for additional information.

Cost assessment

LMA provides a WebSocket server using Fargate (2vCPU) and VPC networking resources costing about $0.10/hour (approximately $72/month). For more details, see AWS Fargate Pricing.

LMA is enabled using QnABot and Knowledge Bases for Amazon Bedrock. You create your own knowledge base, which you use for LMA and potentially other use cases. For more details, see Amazon Bedrock Pricing. Additional AWS services used by the QnABot solution cost about $0.77/hour. For more details, refer to the list of QnABot on AWS solution costs.

The remaining solution costs are based on usage.

The usage costs add up to about $0.17 for a 5-minute call, although this can vary based on the options selected (such as translation), the number of LLM summarizations, and total usage, because usage affects Free Tier eligibility and volume-tiered pricing for many services. For more information about the usage costs of the individual services used by the solution, see their respective pricing pages.

To explore LMA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your LMA deployment:

  • To use your own S3 bucket for meeting recordings, use Call Audio Recordings Bucket Name and Audio File Prefix.
  • To redact PII from the transcriptions, set Enable Content Redaction for Transcripts to true, and adjust Transcription PII Redaction Entity Types as needed. For more information, see Redacting or identifying PII in a real-time stream.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set Transcription Custom Vocabulary Name to the name of a custom vocabulary that you already created in Amazon Transcribe, or set Transcription Custom Language Model Name to the name of a previously created custom language model. For more information, see Improving Transcription Accuracy. A sketch of creating a custom vocabulary follows this list.
  • To transcribe meetings in a supported language other than US English, choose the desired value for Language for Transcription.
  • To customize transcript processing, optionally set Lambda Hook Function ARN for Custom Transcript Segment Processing to the ARN of your own Lambda function. For more information, see Using a Lambda function to optionally provide custom logic for transcript processing.
  • To customize the meeting assist capabilities based on the QnABot on AWS solution, Amazon Lex, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock integration, see the Meeting Assist README.
  • To customize transcript summarization by configuring LMA to call your own Lambda function, see Transcript Summarization LAMBDA option.
  • To customize transcript summarization by modifying the default prompts or adding new ones, see Transcript Summarization.
  • To change the retention period, set Record Expiration In Days to the desired value. All call data is permanently deleted from the LMA DynamoDB storage after this period. Changes to this setting apply only to new calls received after the update.
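As referenced in the list above, here is a minimal sketch of creating a custom vocabulary with boto3 that you could then reference in the Transcription Custom Vocabulary Name parameter. The vocabulary name and phrases are examples only.

import boto3

transcribe = boto3.client("transcribe")

# Vocabulary name and phrases are examples; use your own domain terms
transcribe.create_vocabulary(
    VocabularyName="lma-domain-terms",
    LanguageCode="en-US",
    Phrases=["Bedrock", "QnABot", "Fargate", "AppSync"],
)

# Wait until the vocabulary state is READY, then set the stack parameter
# "Transcription Custom Vocabulary Name" to "lma-domain-terms" and update the stack
print(transcribe.get_vocabulary(VocabularyName="lma-domain-terms")["VocabularyState"])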

LMA is an open source project. You can fork the LMA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Update an existing LMA stack

You can update your existing LMA stack to the latest release. For more details, see Update an existing stack.

Clean up

Congratulations! You have completed all the steps for setting up the Live Meeting Assistant sample solution using AWS services.

When you’re finished experimenting with this sample solution, clean up your resources by using the AWS CloudFormation console to delete the LMA stacks that you deployed. This deletes resources that were created by deploying the solution. The recording S3 buckets, DynamoDB table, and CloudWatch log groups are retained after the stack is deleted to avoid deleting your data.

Live Call Analytics: Companion solution

Our companion solution, Live Call Analytics and Agent Assist (LCA), offers real-time transcription and analytics for contact centers (phone calls) rather than meetings. There are many similarities—in fact, LMA was built using an architecture and many components derived from LCA.

Conclusion

The Live Meeting Assistant sample solution offers a flexible, feature-rich, and customizable approach to provide live meeting assistance to improve your productivity during and after meetings. It uses Amazon AI/ML services like Amazon Transcribe, Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs to transcribe and extract real-time insights from your meeting audio.

The sample LMA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Browse to the LMA GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the LMA GitHub repository.


About the authors

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Chris Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Babu Srinivasan is a Sr. Specialist SA – Language AI services in the World Wide Specialist organization at AWS, with over 24 years of experience in IT and the last 6 years focused on the AWS Cloud. He is passionate about AI/ML. Outside of work, he enjoys woodworking and entertains friends and family (sometimes strangers) with sleight of hand card magic.

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Gillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.

Read More

Meta Llama 3 models are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Meta Llama 3 foundation models are available through Amazon SageMaker JumpStart to deploy and run inference. The Llama 3 models are a collection of pre-trained and fine-tuned generative text models.

In this post, we walk through how to discover and deploy Llama 3 models via SageMaker JumpStart.

What is Meta Llama 3

Llama 3 comes in two parameter sizes, 8B and 70B, with an 8K context length, and can support a broad range of use cases with improvements in reasoning, code generation, and instruction following. Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128K-token vocabulary that improves model performance. In addition, Meta improved post-training procedures, which substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. You can now derive the combined advantages of Llama 3 performance and MLOps controls with Amazon SageMaker features such as SageMaker Pipelines, SageMaker Debugger, and container logs. In addition, the model is deployed in an AWS secure environment under your VPC controls, helping provide data security.

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Llama 3 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security. Llama 3 models are available today for deployment and inference in Amazon SageMaker Studio in the us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-2 (Oregon), eu-west-1 (Ireland), and ap-northeast-1 (Tokyo) AWS Regions.

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can easily discover various models by browsing through different hubs, which are named after model providers. You can find Llama 3 models in the Meta hub. If you don't see Llama 3 models, update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can find Llama 3 models by searching for "Meta-llama-3" in the search box located at the top left.

You can discover all Meta models available in SageMaker JumpStart by choosing the Meta hub.

Choosing a model card opens the corresponding model detail page, from which you can easily deploy the model.

Deploy a model

When you choose Deploy and acknowledge the EULA terms, deployment will start.

You can monitor the progress of the deployment on the page that appears after you choose Deploy.

Alternatively, you can choose Open notebook to deploy through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using the notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code.

from sagemaker.jumpstart.model import JumpStartModel

# Create a model object for the Llama 3 70B Instruct model in SageMaker JumpStart
model = JumpStartModel(model_id="meta-textgeneration-llama-3-70b-instruct")
# Deploy to a real-time endpoint (accept_eula must be set to True for deployment to succeed)
predictor = model.deploy(accept_eula=False)

By default, accept_eula is set to False. You need to manually accept the EULA to deploy the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy. You can also find the license agreement on the Llama website. This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the JumpStartModel documentation.
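For example, a minimal sketch of overriding the defaults might look like the following; the instance type and count shown are illustrative.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-3-70b-instruct")
predictor = model.deploy(
    accept_eula=True,                  # set to True only after reviewing the EULA
    instance_type="ml.p4d.24xlarge",   # override the default instance type if needed
    initial_instance_count=1,
)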

The following table lists all the Llama 3 models available in SageMaker JumpStart along with the model_ids, default instance types and maximum number of total tokens (sum of the number of input tokens and number of generated tokens) supported for each of these models.

Model Name | Model ID | Max Total Tokens | Default Instance Type
Meta-Llama-3-8B | meta-textgeneration-llama-3-8B | 8192 | ml.g5.12xlarge
Meta-Llama-3-8B-Instruct | meta-textgeneration-llama-3-8B-instruct | 8192 | ml.g5.12xlarge
Meta-Llama-3-70B | meta-textgeneration-llama-3-70b | 8192 | ml.p4d.24xlarge
Meta-Llama-3-70B-Instruct | meta-textgeneration-llama-3-70b-instruct | 8192 | ml.p4d.24xlarge

Run inference

After you deploy the model, you can run inference against the deployed endpoint through SageMaker predictor. Fine-tuned instruct models (Llama 3: 8B Instruct and 70B Instruct) accept a history of chats between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama 3: 8B and 70B) require a string prompt and perform text completion on the provided prompt.

predictor.predict(payload)

Inference parameters control the text generation process at the endpoint. The Max new tokens parameter controls the size of the output generated by the model. This is not the same as the number of words, because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. The temperature parameter controls the randomness in the output: a higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.

Example prompts for the 70B model

You can use Llama 3 models for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as question answering, language translation, and sentiment analysis. The input payload to the endpoint looks like the following code:

payload = {
    "inputs": str,
    (optional) "parameters": {"max_new_tokens": int, "top_p": float, "temperature": float}
}
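For instance, a concrete invocation using the first example prompt below might look like the following minimal sketch; the exact response shape depends on the serving container, so we simply print what is returned.

payload = {
    "inputs": "Simply put, the theory of relativity states that",
    "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6},
}
response = predictor.predict(payload)
print(response)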

The following are some example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":64, "top_p":0.9, "temperature":0.6}.

Input: Simply put, the theory of relativity states that
Output: 1) the laws of physics are the same for all observers, and 2) the speed of light is constant for all observers. The first part of the theory is known as the special theory of relativity, and the second part is known as the general theory of relativity.\nThe special theory of relativity states

Input: I believe the meaning of life is
Output:  to live a life of purpose, on purpose. I believe that we are all here to make a difference in the world, and that we are all here to make a difference in the lives of others. I believe that we are all here to make a difference in the world, and that we are all here to mak

Input: A brief message congratulating the team on the launch:\n\nHi everyone,\n\nI just 
Output: wanted to say congratulations on the launch of the new website! It looks great and I'm sure it will be a valuable resource for our customers. I know that a lot of hard work went into making this happen, so thank you all for your efforts.\n\nBest regards,\nJohn Doe\n

In the next example, we show how to use Llama 3 models with few-shot in-context learning, where we provide training samples to the model. We only run inference on the deployed model, and during this process the model weights do not change.

Input:
Translate English to French:\nsea otter => loutre de mer\npeppermint => menthe poivrée\nplush girafe => girafe peluche\ncheese =>
Output:  fromage\npanda => panda\npenguin => pingouin\npolar bear => ours polaire\npuffin => macareux\nsloth => paresseux\nwalrus => morse\nwhale => baleine\nzebra => z\u00e8bre\nTranslate English to German:\n

Example prompts for the 70B-Instruct model

With Llama 3 instruct models, which are optimized for dialogue use cases, the input to the instruct model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas, which define the chat assistant's behavior. While the input payload format is the same as that of the base pre-trained model, the input text should be formatted in the following manner:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

[SYSTEM_TEXT]<|eot_id|><|start_header_id|>user<|end_header_id|>

[USER_TEXT]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

In this instruction template, you can optionally start with a system role and include as many alternating roles as desired in the turn-based history. The final role should always be assistant and end with two new line feeds.
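A small helper can assemble this template from a list of chat turns. The following sketch follows the format shown above; it is illustrative and not part of the SageMaker SDK.

def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts into the Llama 3 instruct template."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    # The final role is always assistant, ending with two newlines
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

prompt = format_llama3_prompt([
    {"role": "system", "content": "Always answer with Haiku"},
    {"role": "user", "content": "I am going to Paris, what should I see?"},
])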

Next, consider a few example prompts and responses from the model. In the following example, the user is asking a simple question to the assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwhat is the recipe of mayonnaise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: The classic condiment! Mayonnaise is a thick, creamy emulsion sauce made from a combination of oil, egg yolks, acid (such as vinegar or lemon juice), and seasonings. Here's a simple recipe to make mayonnaise at home:\n\n**Ingredients:**\n\n* 2 egg yolks\n* 1 tablespoon lemon juice or vinegar (such as apple cider vinegar or white wine vinegar)\n* 1/2 teaspoon Dijon mustard (optional, but recommended for flavor)\n* 1/2 cup (120 ml) neutral-tasting oil, such as canola, grapeseed, or sunflower oil\n* Salt, to taste\n\n**Instructions:**\n\n1. **Start with room temperature ingredients**: This is crucial for emulsification to occur.\n2. **In a medium-sized bowl**, whisk together the egg yolks, lemon juice or vinegar, and Dijon mustard (if using) until well combined.\n3. **Slowly add the oil**: While continuously whisking the egg yolk mixture, slowly pour in the oil in a thin, steady stream. Start with a very slow drizzle and gradually increase the flow as the mixture thickens.\n4. **Whisk constantly**: Keep whisking until the mixture has doubled in volume and has a thick, creamy consistency. This should take about 5-7 minutes.\n5. **Check the consistency**: Stop whisking when the mayonnaise has reached the desired consistency. It should be smooth, thick, and hold its shape when the whisk is lifted.\n6. **Season with salt**: Add salt to taste, and whisk until well combined.\n7. **Taste and adjust**: If the mayonnaise is too thick, add a little water. If it's not thick enough, add a little more oil.\n\n**Tips and Variations:**\n\n* Use a neutral-tasting oil to avoid flavoring the mayonnaise.\n* If you're using a stand mixer, use the whisk attachment and start with a slow speed, increasing as needed.\n* Add flavorings like minced garlic, chopped fresh herbs (e.g., parsley, dill, or tarragon), or grated ginger to create different flavor profiles.\n* For a lighter mayonnaise, use only 1 egg yolk or add a little water to thin it out.\n* If the mayonnaise breaks or separates, don't worry! Simply start again with a new egg yolk and slowly whisk in the broken mayonnaise.\n\n**Homemade Mayonnaise vs. Store-Bought:**\n\nWhile store-bought mayonnaise is convenient, homemade mayonnaise offers several advantages:\n\n* Freshness: Homemade mayonnaise is made to order, ensuring optimal flavor and texture.\n* Customization: You can tailor the flavor to your liking with various seasonings and ingredients.\n* No preservatives: Homemade mayonnaise doesn't contain added preservatives, making it a healthier option.\n\nEnjoy your freshly made mayonnaise!

In the following example, the user has a conversation with the assistant about tourist sites in Paris. Then the user inquires about the first option recommended by the chat assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nParis, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is so great about #1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: The Eiffel Tower! It's an iconic symbol of Paris and one of the most recognizable landmarks in the world. Here are some reasons why it's so great:\n\n1. **Engineering marvel**: When it was built for the 1889 World's Fair, the Eiffel Tower was a groundbreaking feat of engineering. It was the tallest structure in the world at the time, standing at 324 meters (1,063 feet).\n2. **Panoramic views**: The Eiffel Tower offers breathtaking 360-degree views of the city from its observation decks. You can see many of Paris' famous landmarks, such as the Seine River, the Louvre, and Montmartre.\n3. **Romantic atmosphere**: The Eiffel Tower is often associated with romance and is a popular spot for proposals, weddings, and romantic dinners. The tower's iron latticework and twinkling lights at night add to its romantic ambiance.\n4. **Historical significance**: The Eiffel Tower has played a significant role in French history, serving as a communication tower during World War I and a symbol of French culture and engineering.\n5. **Accessibility**: The Eiffel Tower is easily accessible by public transportation or on foot, and there are various ways to experience it, including taking the stairs, elevator, or a guided tour.\n6. **Nighttime spectacle**: The Eiffel Tower is even more magical at night, when it's illuminated with thousands of twinkling lights. The light show, known as the "Eiffel Tower Sparkle," takes place every hour on the hour from dusk to midnight.\n7. **Iconic status**: The Eiffel Tower has been featured in countless films, literature, and art, making it an instantly recognizable symbol of Paris and France.\n\nOverall, the Eiffel Tower is a must-visit attraction in Paris, offering a unique combination of history, engineering, romance, and stunning views.

In the following examples, we set the system’s configuration.

Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: Eiffel's iron lace\nRiver Seine's gentle flow by\nMontmartre's charm calls
Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAlways answer with emojis<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow to go from Beijing to NY?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Output: ✈🛬🇺🇸

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Llama 3 models in SageMaker Studio. You now have access to four Llama 3 foundation models that contain billions of parameters. Because foundation models are pretrained, they can also help lower training and infrastructure costs and enable customization for your use case. Check out SageMaker JumpStart in SageMaker Studio now to get started.


About the authors

Kyle Ulrich is an Applied Scientist II at AWS
Xin Huang is a Senior Applied Scientist at AWS
Qing Lan is a Senior Software Developer Engineer at AWS
Haotian An is a Software Developer Engineer II at AWS
Christopher Whitten is a Software Development Engineer II at AWS
Tyler Osterberg is a Software Development Engineer I at AWS
Manan Shah is a Software Development Manager at AWS
Jonathan Guinegagne is a Senior Software Developer Engineer at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Ashish Khetan is a Senior Applied Scientist at AWS
Rachna Chadha is a Principal Solution Architect at AWS
Deepak Rupakula is a Principal GTM Specialist at AWS

Read More

Wide Open: NVIDIA Accelerates Inference on Meta Llama 3   

NVIDIA today announced optimizations across all its platforms to accelerate Meta Llama 3, the latest generation of the large language model (LLM).

The open model combined with NVIDIA accelerated computing equips developers, researchers and businesses to innovate responsibly across a wide variety of applications.

Trained on NVIDIA AI

Meta engineers trained Llama 3 on a computer cluster packing 24,576 NVIDIA H100 Tensor Core GPUs, linked with an NVIDIA Quantum-2 InfiniBand network. With support from NVIDIA, Meta tuned its network, software and model architectures for its flagship LLM.

To further advance the state of the art in generative AI, Meta recently described plans to scale its infrastructure to 350,000 H100 GPUs.

Putting Llama 3 to Work

Versions of Llama 3, accelerated on NVIDIA GPUs, are available today for use in the cloud, data center, edge and PC.

From a browser, developers can try Llama 3 at ai.nvidia.com. It’s packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.

Businesses can fine-tune Llama 3 with their data using NVIDIA NeMo, an open-source framework for LLMs that’s part of the secure, supported NVIDIA AI Enterprise platform. Custom models can be optimized for inference with NVIDIA TensorRT-LLM and deployed with NVIDIA Triton Inference Server.

Taking Llama 3 to Devices and PCs

Llama 3 also runs on NVIDIA Jetson Orin for robotics and edge computing devices, creating interactive agents like those in the Jetson AI Lab.

What’s more, NVIDIA RTX and GeForce RTX GPUs for workstations and PCs speed inference on Llama 3. These systems give developers a target of more than 100 million NVIDIA-accelerated systems worldwide.

Get Optimal Performance with Llama 3

Best practices in deploying an LLM for a chatbot involve balancing low latency, good reading speed and optimal GPU use to reduce costs.

Such a service needs to deliver tokens — the rough equivalent of words to an LLM — at about twice a user’s reading speed, which is about 10 tokens/second.

Applying these metrics, a single NVIDIA H200 Tensor Core GPU generated about 3,000 tokens/second — enough to serve about 300 simultaneous users — in an initial test using the version of Llama 3 with 70 billion parameters.

That means a single NVIDIA HGX server with eight H200 GPUs could deliver 24,000 tokens/second, further optimizing costs by supporting more than 2,400 users at the same time.

For edge devices, the version of Llama 3 with eight billion parameters generated up to 40 tokens/second on Jetson AGX Orin and 15 tokens/second on Jetson Orin Nano.

Advancing Community Models

An active open-source contributor, NVIDIA is committed to optimizing community software that helps users address their toughest challenges. Open-source models also promote AI transparency and let users broadly share work on AI safety and resilience.

Learn more about NVIDIA’s AI inference platform, including how NIM, TensorRT-LLM and Triton use state-of-the-art techniques such as low-rank adaptation to accelerate the latest LLMs.

Read More

SAMMO: A general-purpose framework for prompt optimization

SAMMO optimizer diagram showing progression from starting prompt to optimized prompt.

Large language models (LLMs) have revolutionized a wide range of tasks and applications that were previously reliant on manually crafted machine learning (ML) solutions, streamlining them through automation. However, despite these advances, a notable challenge persists: the need for extensive prompt engineering to adapt these models to new tasks. New generations of language models like GPT-4 and Mixtral 8x7B advance the capability to process long input texts. This progress enables the use of longer inputs, providing richer context and detailed instructions to language models. A common technique that uses this enhanced capacity is the Retrieval Augmented Generation (RAG) approach. RAG dynamically incorporates information into the prompt based on the specific input example. This process is illustrated in Figure 1, which shows a RAG prompt designed to translate user queries into a domain-specific language (DSL), also known as semantic parsing.

Figure 1: A RAG prompt is used for a semantic parsing task. The underlying prompt consists of three larger parts, each with a variety of aspects that can be optimized: for example, the input example can be rendered using different formats, the few-shot examples can be retrieved using various similarity functions, or the task description can be paraphrased.

The example in Figure 1 combines three distinct structures to construct the final prompt. The first structure, the task description, remains static and independent of the input as a result of conventional prompt optimization techniques. However, RAG contains two input-specific structures: the example retriever and the input text itself. These introduce numerous optimization opportunities that surpass the scope of most traditional approaches. Despite previous efforts in prompt optimization, the evolution towards more complex prompt structures has rendered many older strategies ineffective in this new context. 

SAMMO: A prompt optimization approach 

To address these challenges, we developed the Structure-Aware Multi-objective Metaprompt Optimization (SAMMO) framework. SAMMO is a new open-source tool that streamlines the optimization of prompts, particularly those that combine different types of structural information like in the RAG example above. It can make structural changes, such as removing entire components or replacing them with different ones. These features enable AI practitioners and researchers to efficiently refine their prompts with little manual effort.

Central to SAMMO’s innovation is its approach to treating prompts not just as static text inputs but as dynamic, programmable entities—metaprompts. SAMMO represents these metaprompts as function graphs, where individual components and substructures can be modified to optimize performance, similar to the optimization process that occurs during traditional program compilation.

The following key features contribute to SAMMO’s effectiveness:

Structured optimization: Unlike current methods that focus on text-level changes, SAMMO focuses on optimizing the structure of metaprompts. This granular approach facilitates precise modifications and enables the straightforward integration of domain knowledge, for instance, through rewrite operations targeting specific stylistic objectives. 
 
Multi-objective search: SAMMO’s flexibility enables it to simultaneously address multiple objectives, such as improving accuracy and computational efficiency. Our paper illustrates how SAMMO can be used to compress prompts without compromising their accuracy.

General purpose application: SAMMO has proven to deliver significant performance improvements across a variety of tasks, including instruction tuning, RAG, and prompt compression.
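To give a feel for what structure-level optimization means, the following toy sketch mutates named components of a metaprompt and keeps the best-scoring variant. It illustrates the general idea only; it does not use SAMMO's actual API, and the scoring function is a random placeholder where a real evaluation on a labeled dev set would go.

import random

# A toy metaprompt represented as named components (a stand-in for a function graph)
metaprompt = {
    "task_description": "Translate the user query into the DSL.",
    "example_format": "json",
    "num_fewshot": 5,
}

# Candidate edits operate on the structure, not on raw text
mutations = [
    lambda p: {**p, "example_format": random.choice(["json", "xml", "plain"])},
    lambda p: {**p, "num_fewshot": max(0, p["num_fewshot"] + random.choice([-2, 2]))},
    lambda p: {**p, "task_description": p["task_description"].replace("Translate", "Convert")},
]

def score(prompt):
    """Placeholder objective; in practice, run the assembled prompt against a labeled dev set."""
    return random.random()

best, best_score = metaprompt, score(metaprompt)
for _ in range(20):
    candidate = random.choice(mutations)(best)
    candidate_score = score(candidate)
    if candidate_score > best_score:
        best, best_score = candidate, candidate_score

print(best, best_score)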


Exploring SAMMO’s impact through use cases 

Use case 1: RAG optimization 

A common application of LLMs involves translating natural user queries into domain-specific language (DSL) constructions, often to communicate with external APIs. For example, Figure 1 shows how an LLM can be used to map user queries about geography facts to a custom DSL.

In a realistic RAG scenario, SAMMO demonstrates significant performance improvements. To demonstrate this, we conducted experiments across three semantic parsing datasets of varying complexity: GeoQuery, SMCalFlow, and Overnight. Given the often limited availability of data in practical settings, we trained and tested the model on a subsampled dataset (training and retrieval set n=600, test set n=100). We compared SAMMO against a manually designed competitive baseline, using enumerative search within a search space of 24 configurations. This included variations in data formats, the number of few-shot examples, and DSL specifications.  

Evaluation 

As illustrated in Figure 2, SAMMO improved accuracy across different datasets and backend LLMs in almost all cases, with the most notable gains observed in older-generation models. However, even with newer models like GPT-4, SAMMO facilitated accuracy improvements exceeding 100 percent.

Figure 2: For semantic parsing with RAG, SAMMO achieves substantial improvements across most backend models and datasets. 

Use case 2: Instruction tuning 

Instruction tuning addresses the optimization of static instructions given to LLMs that provide the goal and constraints of a task. To show that SAMMO extends beyond many previous prompt tuning methods, we applied this conventional setting.

To align with previous research, we used eight zero-shot BigBench classification tasks where the baseline prompt for GPT-3.5 achieved an accuracy of less than 0.9. We compared it against Automatic Prompt Optimization (APO) and GrIPS, applying the open-source models Mixtral 8x7B and Llama-2 70B, alongside GPT-3.5, as backend LLMs. We did not include GPT-4 due to minimal improvement potential identified in pilot experiments. The results, shown in Figure 3, demonstrate that SAMMO outperformed all baselines regardless of the backend model, proving its effectiveness with even more complex metaprompts.

Figure 3: SAMMO does at least as well as older methods for instruction tuning on simpler tasks.

Implications and looking forward

SAMMO introduces a new and flexible approach to optimize prompts for specific requirements. Its design works with any LLM, and it features versatile components and operators suitable for a broad range of applications.

We are excited to integrate and apply SAMMO to the components and pipelines behind AI-powered assistant technologies. We also hope to establish a user-driven community centered around SAMMO, where people can exchange best practices and patterns, and encourage the expansion of the existing set of search operators.

Read More

Up to No Good: ‘No Rest for the Wicked’ Early Access Launches on GeForce NOW

It’s time to get a little wicked. Members can now stream No Rest for the Wicked from the cloud.

It leads six new games joining the GeForce NOW library of more than 1,500 games.

Holy Moly

No Rest For The Wicked on GeForce NOW
There’s always another fight to be won.

No Rest for the Wicked is the highly anticipated action role-playing game from Moon Studios, developer of the Ori series, and publisher Private Division. Amid a plague-ridden world, step into the boots of a Cerim, a holy warrior on a desperate mission. The Great Pestilence has ravaged the land of Sacra, and a new king reigns. As a colonialist inquisition unfolds, engage in visceral combat, battle plague-infested creatures and uncover the secrets of the continent. Make the character you want with the game’s flexible soft-class system, explore a rich storyline, and prepare for intense boss battles as you build up the town of Sacrament.

Embark on a dark and perilous journey, where no rest awaits the wicked. Rise to the challenge and stream from GeForce RTX 4080 servers with a GeForce NOW Ultimate membership for the smoothest gameplay from the cloud. Be among the first to experience early access of the game, without having to wait for downloads.

Shiny New Games

Evil West on GeForce NOW
“Yippie ki-yay, evil doers!”

Become a Wild West superhero in Evil West, streaming on GeForce NOW this week and part of PC Game Pass. It's one of six newly supported games this week:

  • Kill It With Fire 2 (New release on Steam, April 16)
  • The Crew Motorfest (New release on Steam, April 18)
  • No Rest for the Wicked (New release on Steam, April 18)
  • Evil West (Xbox, available on PC Game Pass)
  • Lightyear Frontier (Steam)
  • Tomb Raider I-III Remastered (Steam)

Riot Games shared in its 14.8 patch notes that it will soon add its Vanguard security software to League of Legends as part of the publisher’s commitment to remove scripters, bots and bot-leveled accounts from the game and make it more challenging for them to continue. Since Vanguard won’t support virtual machines when it’s added to League of Legends, the game will be put under maintenance and will no longer be playable on GeForce NOW once the 14.9 update goes live globally — currently planned for May 1, 2024. Members can continue to enjoy the game on GeForce NOW until then.

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Slack delivers native and secure generative AI powered by Amazon SageMaker JumpStart

This post is co-authored by Jackie Rocca, VP of Product, AI at Slack

Slack is where work happens. It’s the AI-powered platform for work that connects people, conversations, apps, and systems together in one place. With the newly launched Slack AI—a trusted, native, generative artificial intelligence (AI) experience available directly in Slack—users can surface and prioritize information so they can find their focus and do their most productive work.

We are excited to announce that Slack, a Salesforce company, has collaborated with Amazon SageMaker JumpStart to power Slack AI’s initial search and summarization features and provide safeguards for Slack to use large language models (LLMs) more securely. Slack worked with SageMaker JumpStart to host industry-leading third-party LLMs so that data is not shared with infrastructure owned by third-party model providers.

This keeps customer data in Slack at all times and upholds the same security practices and compliance standards that customers expect from Slack itself. Slack is also using Amazon SageMaker inference capabilities for advanced routing strategies to scale the solution to customers with optimal performance, latency, and throughput.

“With Amazon SageMaker JumpStart, Slack can access state-of-the-art foundation models to power Slack AI, while prioritizing security and privacy. Slack customers can now search smarter, summarize conversations instantly, and be at their most productive.”

– Jackie Rocca, VP Product, AI at Slack

Foundation models in SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select foundation models (FMs) quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation. Pretrained models are fully customizable for your use case with your data, and you can effortlessly deploy them into production with the user interface or SDK. In addition, you can access prebuilt solutions to solve common use cases and share ML artifacts, including ML models and notebooks, within your organization to accelerate ML model building and deployment. None of your data is used to train the underlying models. All the data is encrypted and is never shared with third-party vendors so you can trust that your data remains private and confidential.
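
As a rough sketch of the SDK path, the following example deploys a JumpStart foundation model to a real-time endpoint in your own account; the model ID and instance type are illustrative choices, not necessarily the models or sizes Slack uses.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Hypothetical model ID and instance type -- pick a model from the JumpStart
# model page and an instance that meets your latency and throughput needs.
model = JumpStartModel(
    model_id="huggingface-llm-mistral-7b-instruct",  # assumption, for illustration
    instance_type="ml.g5.2xlarge",
)

# Deploys the model to a real-time endpoint inside your own AWS account, so
# prompts and completions never leave your environment.
predictor = model.deploy(accept_eula=True)

response = predictor.predict({"inputs": "Summarize this conversation: ..."})
print(response)
```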

Check out the SageMaker JumpStart model page for available models.

Slack AI

Slack launched Slack AI to provide native generative AI capabilities so that customers can easily find and consume large volumes of information quickly, enabling them to get even more value out of their shared knowledge in Slack.  For example, users can ask a question in plain language and instantly get clear and concise answers with enhanced search. They can catch up on channels and threads in one click with conversation summaries. And they can access personalized, daily digests of what’s happening in select channels with the newly launched recaps.

Because trust is Slack’s most important value, Slack AI runs on enterprise-grade infrastructure that Slack built on AWS, upholding the same security practices and compliance standards that customers expect. Slack AI is built for security-conscious customers and is secure by design: customer data remains in-house, is not used for LLM training purposes, and stays siloed.

Solution overview

SageMaker JumpStart provides access to many LLMs, and Slack selects the right FMs that fit their use cases. Because these models are hosted on Slack’s owned AWS infrastructure, data sent to models during invocation doesn’t leave Slack’s AWS infrastructure. In addition, to provide a secure solution, data sent for invoking SageMaker models is encrypted in transit. The data sent to SageMaker JumpStart endpoints for invoking models is not used to train base models. SageMaker JumpStart allows Slack to support high standards for security and data privacy, while also using state-of-the-art models that help Slack AI perform optimally for Slack customers.

SageMaker JumpStart endpoints serving Slack business applications are powered by ML instances managed by SageMaker. SageMaker supports a wide range of instance types for model deployment, which allows Slack to pick the instances best suited to the latency and scalability requirements of Slack AI use cases. Slack AI uses multi-GPU instances to host its SageMaker JumpStart models. A multi-GPU instance allows each instance backing Slack AI’s endpoint to host multiple copies of a model, which improves resource utilization and reduces model deployment cost. For more information, refer to Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The following diagram illustrates the solution architecture.

To use the instances most effectively and meet concurrency and latency requirements, Slack used the routing strategies that SageMaker offers for its endpoints. By default, a SageMaker endpoint distributes incoming requests randomly across ML instances using a routing strategy called RANDOM. However, with generative AI workloads, requests and responses can vary widely in size and duration, so it’s preferable to load balance based on each instance’s capacity and utilization rather than randomly. To distribute requests effectively across the instances backing its endpoints, Slack uses the LEAST_OUTSTANDING_REQUESTS (LAR) routing strategy, which routes each request to the instance with the most capacity to process it instead of picking any available instance at random. The LAR strategy provides more uniform load balancing and resource utilization. As a result, Slack AI saw its p95 latency drop by more than 39% after enabling LEAST_OUTSTANDING_REQUESTS instead of RANDOM.
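
As a rough illustration of how a routing strategy is configured, the following sketch sets LEAST_OUTSTANDING_REQUESTS on an endpoint’s production variant with the AWS SDK for Python (Boto3); the resource names and instance type are hypothetical and not Slack’s actual configuration.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical resource names and instance sizing -- substitute your own model.
sm.create_endpoint_config(
    EndpointConfigName="llm-inference-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-jumpstart-llm-model",
            "InstanceType": "ml.g5.12xlarge",   # multi-GPU instance (assumption)
            "InitialInstanceCount": 2,
            "RoutingConfig": {
                # Route each request to the instance with the fewest in-flight
                # requests instead of the default RANDOM strategy.
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS",
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="llm-inference-endpoint",
    EndpointConfigName="llm-inference-config",
)
```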

For more details on SageMaker routing strategies, see Minimize real-time inference latency by using Amazon SageMaker routing strategies.

Conclusion

Slack is delivering native generative AI capabilities that will help its customers be more productive and easily tap into the collective knowledge embedded in their Slack conversations. With fast access to a large selection of FMs hosted on dedicated instances through SageMaker JumpStart, along with advanced load balancing capabilities, Slack AI can deliver rich generative AI features quickly and robustly while upholding Slack’s trust and security standards.

Learn more about SageMaker JumpStart, Slack AI and how the Slack team built Slack AI to be secure and private. Leave your thoughts and questions in the comments section.


About the Authors

Jackie Rocca is VP of Product at Slack, where she oversees the vision and execution of Slack AI, which brings generative AI natively and securely into Slack’s user experience. In her five years at Slack, Jackie has delivered on a number of initiatives to push Slack’s business forward. Now she’s on a mission to help customers accelerate their productivity and get even more value out of their conversations, data, and collective knowledge with generative AI. Prior to her time at Slack, Jackie was a Product Manager at Google for more than six years, where she helped launch and grow YouTube TV. Jackie is based in the San Francisco Bay Area.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Maninder (Mani) Kaur is the AI/ML Specialist lead for Strategic ISVs at AWS. With her customer-first approach, Mani helps strategic customers shape their AI/ML strategy, fuel innovation, and accelerate their AI/ML journey. Mani is a firm believer of ethical and responsible AI, and strives to ensure that her customers’ AI solutions align with these principles.

Gene Ting is a Principal Solutions Architect at AWS. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest on cybersecurity.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Read More

NVIDIA Honors Partners of the Year in Europe, Middle East, Africa

NVIDIA today recognized 18 partners in Europe, the Middle East and Africa for their achievements and commitment to driving AI adoption.

The recipients were honored at the annual EMEA Partner Day hosted by the NVIDIA Partner Network (NPN). The awards span seven categories that highlight the various ways partners work with NVIDIA to transform the region’s industries with AI.

“This year marks another milestone for NVIDIA and our partners across EMEA as we pioneer technological breakthroughs and unlock new business opportunities using NVIDIA’s full-stack platform,” said Dirk Barfuss, director of EMEA channel at NVIDIA. “These awards celebrate our partners’ dedication and expertise in delivering groundbreaking solutions that drive cost efficiencies, enhance productivity and inspire innovation.”

The 2024 NPN award winners for EMEA are:

Rising Star Awards

  • Vesper Technologies received the Rising Star Northern Europe award for its exceptional revenue growth and broad customer base deploying NVIDIA AI solutions in data centers. The company has demonstrated outstanding growth in recent years, augmenting the success of its existing business.
  • AMBER AI & Data Science Solutions GmbH received the Rising Star Central Europe award for its revenue growth of more than 100% across the complete portfolio of NVIDIA technologies. Through extensive collaboration with NVIDIA, the company has become a cornerstone of the NVIDIA partner landscape in Germany.
  • HIPER Global Enterprise Ltd. received the Rising Star Southern Europe & Middle East award for its excellence in serving its broad customer base with NVIDIA compute technologies. Last year, it supported one of the largest customer projects in the region, further accelerating its growth rate.

Star Performer Awards

  • Boston Limited received the Star Performer Northern Europe award for its consistent success in delivering full-stack implementations of NVIDIA technologies for customers across industries. Over the last year, the company achieved record revenue growth across its business areas.
  • DELTA Computer Products GmbH received the Star Performer Central Europe award for its outstanding sales achievements and strong customer relationships. With a massive technical knowledge base, the company has served as a trusted advisor for customers deploying NVIDIA technologies across industry, higher education and research.
  • COMMit DMCC received the Star Performer Southern Europe & Middle East award for its exceptional execution of strategic and complex solutions built on NVIDIA technologies, which led to record revenues for the United Arab Emirates-based company.

Distributor of the Year

  • PNY received the Distributor of the Year award for the third consecutive year, underscoring its consistent investment in technology training and commitment to providing NVIDIA accelerated computing platforms and software across markets.
  • TD Synnex received the Networking Distributor of the Year award for the second year in a row, highlighting its massive investments in NVIDIA’s portfolio of technologies — especially networking — and dedication to delivering technical expertise to customers.

Go-to-Market Excellence 

  • Bynet Data Communications Ltd. received the Go-to-Market Excellence award for its collaboration with NVIDIA regional leads to devise and execute effective go-to-market strategies for the Israeli market. This included identifying key opportunities and creating localized marketing campaigns. Its efforts led to great success with the installation of NVIDIA DGX SuperPODs into several new industries in the region.
  • Vesper Technologies was Highly Commended in the Go-to-Market Excellence category for its fully integrated go-to-market strategy around the launch of the NVIDIA GH200 Grace Hopper Superchip. The company successfully deployed a results-driven marketing campaign, demonstrated a commitment to technical training and developed a pre-sales trial and evaluation platform.
  • M Computers s.r.o. was Highly Commended in the Go-to-Market Excellence category for its success and leadership in engaging AI customers in eastern Europe with NVIDIA technologies. The company’s marketing efforts, including speeches at AI events and social media campaigns, helped lead to the first NVIDIA DGX H100 and NVIDIA Grace CPU Superchip projects in the region.

Industry Innovation 

  • WPP received the Industry Innovation award for its innovative applications of AI and NVIDIA technology in the marketing and advertising sector. The company worked with NVIDIA to build a groundbreaking generative AI-powered content engine, built on the NVIDIA Omniverse platform, that enables the creation of brand-consistent content at scale.
  • Ascon Systems was Highly Commended in the Industry Innovation category for its cutting-edge Industrial Metaverse Portal, powered by NVIDIA Omniverse, that helped transform BMW Group’s manufacturing processes with real-time product control and enhanced visualization and interaction.
  • Gcore was Highly Commended in the Industry Innovation category for its creation of the first speech-to-text technology for Luxembourgish, using its fine-tuned LuxemBERT AI model. The technology integrates seamlessly into corporate systems and Luxembourgish messaging platforms, fostering the preservation of the traditionally spoken language, which lacked adequate tools for written communication.

Pioneer

  • Arrow Electronics – Intelligent Business Solutions received the Pioneer Award for its work promoting the NVIDIA IGX Orin platform for healthcare applications and building strategies to drive adoption of the technology. The company’s innovative approach and support led to the first integration of the NVIDIA IGX Orin Developer Kit with an NVIDIA RTX 6000 Ada Generation GPU for a robotic surgery platform.

Consulting Partner of the Year

  • SoftServe received the Consulting Partner of the Year award for its excellence in working with partners to drive the adoption of NVIDIA’s full-stack technologies, helping transform customers’ business with generative AI and NVIDIA Omniverse. Through its SoftServe University corporate learning hub, SoftServe trained its employees, customers and partners to expertly use NVIDIA technology.
  • Deloitte was Highly Commended in the Consulting Partner of the Year category for its focus on building sales and technical skills, efforts to deliver meaningful impact through projects and go-to-market strategy that helped drive enterprise-level AI transformation in the region.
  • Data Monsters was Highly Commended in the Consulting Partner of the Year category for its development of a virtual assistant with lifelike hearing, speech and animation capabilities using NVIDIA Avatar Cloud Engine and large language models.

Learn how to join NPN, or find a local NPN partner.

Read More

Seeing Beyond: Living Optics CEO Robin Wang on Democratizing Hyperspectral Imaging

Step into the realm of the unseen with Robin Wang, CEO of Living Optics. The startup cofounder discusses the power of hyperspectral imaging with AI Podcast host Noah Kravitz in an episode recorded live at the NVIDIA GTC global AI conference. Living Optics’ hyperspectral imaging camera, which can capture visual data across 96 colors, reveals details invisible to the human eye. Potential applications range from monitoring plant health to detecting cracks in bridges. The startup aims to empower users across industries to gain new insights from richer, more informative datasets fueled by hyperspectral imaging technology.

Living Optics is a member of the NVIDIA Inception program for cutting-edge startups.

Stay tuned for more episodes recorded live from GTC.

Time Stamps

1:05: What is hyperspectral imaging?

1:45: The Living Optics camera’s ability to capture 96 colors

3:36: Where is hyperspectral imaging being used, and why is it so important?

7:19: How are hyperspectral images represented and accessed by the user?

9:34: Other use cases of hyperspectral imaging

13:07: What’s unique about Living Optics’ hyperspectral imaging camera?

18:36: Breakthroughs, challenges during the technology’s development

23:27: What’s next for Living Optics and hyperspectral imaging?

You Might Also Like…

Dotlumen CEO Cornel Amariei on Assistive Technology for the Visually Impaired – Ep. 217

Dotlumen is illuminating a new technology to help people with visual impairments navigate the world. In this episode of NVIDIA’s AI Podcast, recorded live at the NVIDIA GTC global AI conference, host Noah Kravitz spoke with the Romanian startup’s founder and CEO, Cornel Amariei, about developing its flagship Dotlumen Glasses.

DigitalPath’s Ethan Higgins on Using AI to Fight Wildfires – Ep. 211

DigitalPath is igniting change in the Golden State — using computer vision, generative adversarial networks and a network of thousands of cameras to detect signs of fire in real time. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with DigitalPath system architect Ethan Higgins about the company’s role in the ALERTCalifornia initiative, a collaboration between California’s wildfire fighting agency CAL FIRE and the University of California, San Diego.

MosaicML’s Naveen Rao on Making Custom LLMs More Accessible – Ep. 199

Startup MosaicML is on a mission to help the AI community enhance prediction accuracy, decrease costs, and save time by providing tools for easy training and deployment of large AI models. In this episode of NVIDIA’s AI Podcast, host Noah Kravitz speaks with MosaicML CEO and co-founder Naveen Rao about how the company aims to democratize access to large language models.

Peter Ma on Using AI to Find Promising Signals for Alien Life – Ep. 191

In this episode of the NVIDIA AI Podcast, host Noah Kravitz interviews Ma, an undergraduate student at the University of Toronto, about how he developed an AI algorithm that outperformed traditional methods in the search for extraterrestrial intelligence.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Research Focus: Week of April 15, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Appropriate reliance on Generative AI: Research synthesis

Appropriate reliance on AI happens when people accept correct AI outputs and reject incorrect ones. It requires users of AI systems to know when to trust the AI and when to trust themselves. But fostering appropriate reliance comes with new complexities when generative AI (genAI) systems are involved. Though their capabilities are advancing, genAI systems, which use generative models to produce content such as text, music, images, and videos, have limitations as well. Inappropriate reliance – either under-reliance or overreliance – on genAI can have negative consequences, such as poor task performance and even product abandonment.  

In a recent paper: Appropriate reliance on Generative AI: Research synthesis, researchers from Microsoft, who reviewed 50 papers from various disciplines, provide an overview of the factors that affect overreliance on genAI, the effectiveness of different mitigation strategies for overreliance on genAI, and potential design strategies to facilitate appropriate reliance on genAI. 


Characterizing Power Management Opportunities for LLMs in the Cloud

Cloud providers and datacenter operators are grappling with increased demand for graphics processing units (GPUs) due to expanding use of large language models (LLMs). To try to keep up, enterprises are exploring various means to address the challenge, such as power oversubscription and adding more servers. Proper power usage analysis and management could help providers meet demand safely and more efficiently. 

In a recent paper: Characterizing Power Management Opportunities for LLMs in the Cloud, researchers from Microsoft analyze power patterns for several popular, open-source LLMs across commonly used configurations and identify opportunities to improve power management for LLMs in the cloud. They present a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, POLCA simulations demonstrate it could deploy 30% more servers in existing clusters while incurring minimal power throttling events. POLCA improves power efficiency, reduces the need for additional energy sources and datacenters, and helps to promptly meet demand for running additional LLM workloads. 

Microsoft Research Podcast

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.


LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Various prompting techniques, such as chain-of-thought (CoT), in-context learning (ICL), and retrieval augmented generation (RAG), can empower large language models (LLMs) to handle complex and varied tasks through rich and informative prompts. However, these prompts are lengthy, sometimes exceeding tens of thousands of tokens, which increases computational and financial overhead and degrades the LLMs’ ability to perceive information. Recent efforts to compress prompts in a task-aware manner, without losing essential information, have resulted in shorter prompts tailored to a specific task or query. This typically enhances performance on downstream tasks, particularly in question answering. However, the task-specific features present challenges in efficiency and generalizability. 

In a recent paper: LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, researchers from Microsoft and Tsinghua University propose a data distillation procedure to derive knowledge from an LLM (GPT-4) and compress the prompts without losing crucial information. They introduce an extractive text compression dataset, containing pairs of original texts from MeetingBank and their compressed versions. Despite its small size, their model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. The new model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. 
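
For a sense of how prompt compression looks in practice, here is a minimal sketch using the open-source llmlingua package; the model name comes from the LLMLingua-2 release, and the exact arguments should be checked against the project README.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# Compressor backed by the published LLMLingua-2 model (name assumed from the release).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_prompt = "..."  # e.g., a lengthy transcript plus task instructions

result = compressor.compress_prompt(
    long_prompt,
    rate=0.33,                 # keep roughly one third of the tokens
    force_tokens=["\n", "?"],  # tokens that should never be dropped
)

print(result["compressed_prompt"])
```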


AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Despite recent progress in scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging. Evaluation is often performed using n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics like COMET have a higher correlation; however, challenges such as the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders, have hampered their applicability to African languages. 

In a recent paper: AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages, researchers from University College London, University of Maryland, Unbabel, Microsoft, and the Masakhane Community address these challenges, creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. They also develop AfriCOMET, a set of COMET evaluation metrics for African languages, by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLMR), achieving state-of-the-art results for MT evaluation of African languages in terms of Spearman-rank correlation with human judgments (0.441).


Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System

Video communication often lacks the inclusiveness and simultaneity enabled by physical presence in a shared space. This is especially apparent during hybrid meetings, where some attendees meet physically in a room while others join remotely. Remote participants are at a disadvantage, unable to navigate the physical space like in-room participants. 

In a Late Breaking Work paper to be presented at CHI 2024: Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System, Microsoft researchers present an experimental system for exploring designs for improving the inclusion of remote attendees in hybrid meetings. In-room users see remote participants on individual displays positioned around a table. Remote participants see video feeds from the room integrated into a digital twin of the meeting room, choosing where they appear in the meeting room and from where they view it. The researchers designed both a 2D and a 3D version of the interface. They found that 3D outperformed 2D in participants’ perceived sense of awareness, sense of agency, and physical presence. A majority of participants also subjectively preferred 3D over 2D. The next step in this research will test the inclusiveness of Hybridge 3D meetings against fully in-room meetings and traditional hybrid meetings.


FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction. This is because models like transformers and convolutional networks aggressively pool information over large areas. 

In a paper that was published at ICLR 2024: FeatUp: A Model-Agnostic Framework for Features at Any Resolution, researchers from Microsoft and external colleagues introduce a task- and model-agnostic framework to restore lost spatial information in deep features. The paper introduces two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multiview consistency loss with deep analogies to neural radiance fields (NeRFs), a deep learning method of building 3D representations of a scene using sparse 2D images. In the new research, features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains, even without re-training. FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation. 

The post Research Focus: Week of April 15, 2024 appeared first on Microsoft Research.

Read More

Uncover hidden connections in unstructured financial data with Amazon Bedrock and Amazon Neptune

In asset management, portfolio managers need to closely monitor companies in their investment universe to identify risks and opportunities, and guide investment decisions. Tracking direct events like earnings reports or credit downgrades is straightforward—you can set up alerts to notify managers of news containing company names. However, detecting second and third-order impacts arising from events at suppliers, customers, partners, or other entities in a company’s ecosystem is challenging.

For example, a supply chain disruption at a key vendor would likely negatively impact downstream manufacturers. Or if a major client loses one of its top customers, that poses a demand risk for the client’s suppliers. Very often, such events fail to make headlines featuring the impacted company directly, but they are still important to pay attention to. In this post, we demonstrate an automated solution combining knowledge graphs and generative artificial intelligence (AI) to surface such risks by cross-referencing relationship maps with real-time news.

Broadly, this entails two steps: First, building the intricate relationships between companies (customers, suppliers, directors) into a knowledge graph. Second, using this graph database along with generative AI to detect second and third-order impacts from news events. For instance, this solution can highlight that delays at a parts supplier may disrupt production for downstream auto manufacturers in a portfolio though none are directly referenced.

With AWS, you can deploy this solution in a serverless, scalable, and fully event-driven architecture. This post demonstrates a proof of concept built on two key AWS services well suited for graph knowledge representation and natural language processing: Amazon Neptune and Amazon Bedrock. Neptune is a fast, reliable, fully managed graph database service that makes it straightforward to build and run applications that work with highly connected datasets. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Overall, this prototype demonstrates the art of the possible with knowledge graphs and generative AI: deriving signals by connecting disparate dots. The takeaway for investment professionals is the ability to stay on top of relevant developments while filtering out the noise.

Build the knowledge graph

The first step in this solution is building a knowledge graph, and a valuable yet often overlooked data source for knowledge graphs is company annual reports. Because official corporate publications undergo scrutiny before release, the information they contain is likely to be accurate and reliable. However, annual reports are written in an unstructured format meant for human reading rather than machine consumption. To unlock their potential, you need a way to systematically extract and structure the wealth of facts and relationships they contain.

With generative AI services like Amazon Bedrock, you now have the capability to automate this process. You can take an annual report and trigger a processing pipeline to ingest the report, break it down into smaller chunks, and apply natural language understanding to pull out salient entities and relationships.

For example, a sentence stating that “[Company A] expanded its European electric delivery fleet with an order for 1,800 electric vans from [Company B]” would allow Amazon Bedrock to identify the following:

  • [Company A] as a customer
  • [Company B] as a supplier
  • A supplier relationship between [Company A] and [Company B]
  • Relationship details of “supplier of electric delivery vans”

Extracting such structured data from unstructured documents requires providing carefully crafted prompts to large language models (LLMs) so they can analyze text to pull out entities like companies and people, as well as relationships such as customers, suppliers, and more. The prompts contain clear instructions on what to look out for and the structure to return the data in. By repeating this process across the entire annual report, you can extract the relevant entities and relationships to construct a rich knowledge graph.
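
To make this concrete, the following is a minimal sketch of such an extraction call using the AWS SDK for Python (Boto3) and Anthropic’s Claude 3 Sonnet on Amazon Bedrock; the prompt wording is illustrative rather than the exact prompt used in this solution.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

passage = (
    "[Company A] expanded its European electric delivery fleet with an order "
    "for 1,800 electric vans from [Company B]."
)

# The prompt spells out which entity and relationship types to return and
# forces a machine-readable JSON structure.
prompt = f"""Extract the companies and people mentioned in the text below.
For each relationship, return a JSON object with the keys "source", "target",
"relationship" (one of: customer, supplier, partner, competitor, director)
and "details". Return only a JSON list.

Text: {passage}"""

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "temperature": 0,
    "messages": [{"role": "user", "content": prompt}],
})

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)

# Assumes the model follows the instruction to return only JSON.
relationships = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
print(relationships)
```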

However, before committing the extracted information to the knowledge graph, you need to first disambiguate the entities. For instance, there may already be another ‘[Company A]’ entity in the knowledge graph, but it could represent a different organization with the same name. Amazon Bedrock can reason over and compare attributes such as business focus area, industry, revenue-generating industries, and relationships to other entities to determine whether the two entities are actually distinct. This prevents unrelated companies from being inaccurately merged into a single entity.

After disambiguation is complete, you can reliably add new entities and relationships to your Neptune knowledge graph, enriching it with the facts extracted from annual reports. Over time, ingesting reliable data and integrating additional trusted data sources will help build a comprehensive knowledge graph that can reveal insights through graph queries and analytics.
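
For illustration, the following sketch upserts one extracted relationship into Neptune through its openCypher HTTPS endpoint; the cluster endpoint, node labels, and property names are hypothetical, and the request assumes IAM database authentication is disabled (with IAM auth enabled, it would need to be SigV4-signed).

```python
import json
import requests

# Hypothetical Neptune cluster endpoint.
OPENCYPHER_URL = "https://my-neptune-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/openCypher"

# Upsert both companies and the supplier relationship extracted from the report.
query = """
MERGE (a:Company {name: $companyA})
MERGE (b:Company {name: $companyB})
MERGE (b)-[r:SUPPLIER_OF {details: $details}]->(a)
RETURN a, r, b
"""

params = {
    "companyA": "Company A",
    "companyB": "Company B",
    "details": "supplier of electric delivery vans",
}

resp = requests.post(
    OPENCYPHER_URL,
    data={"query": query, "parameters": json.dumps(params)},
)
resp.raise_for_status()
print(resp.json())
```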

This automation enabled by generative AI makes it feasible to process thousands of annual reports and unlocks an invaluable asset for knowledge graph curation that would otherwise go untapped due to the prohibitively high manual effort needed.

The following screenshot shows an example of the visual exploration that’s possible in a Neptune graph database using the Graph Explorer tool.

Process news articles

The next step of the solution is automatically enriching portfolio managers’ news feeds and highlighting articles relevant to their interests and investments. For the news feed, portfolio managers can subscribe to any third-party news provider through AWS Data Exchange or another news API of their choice.

When a news article enters the system, an ingestion pipeline is invoked to process the content. Using techniques similar to the processing of annual reports, Amazon Bedrock is used to extract entities, attributes, and relationships from the news article, which are then used to disambiguate against the knowledge graph to identify the corresponding entity in the knowledge graph.

The knowledge graph contains connections between companies and people, and by linking article entities to existing nodes, you can identify if any subjects are within two hops of the companies that the portfolio manager has invested in or is interested in. Finding such a connection indicates the article may be relevant to the portfolio manager, and because the underlying data is represented in a knowledge graph, it can be visualized to help the portfolio manager understand why and how this context is relevant. In addition to identifying connections to the portfolio, you can also use Amazon Bedrock to perform sentiment analysis on the entities referenced.
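
As a sketch of that lookup, the following openCypher query (sent through the same hypothetical endpoint as before) returns paths of up to two hops between a company mentioned in an article and any company the portfolio manager tracks; the 'interested' property is an assumption about the graph schema.

```python
import json
import requests

OPENCYPHER_URL = "https://my-neptune-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/openCypher"  # hypothetical

# Find tracked companies (modeled here with an 'interested' property) within
# two hops of an entity extracted from a news article.
query = """
MATCH path = (news:Company {name: $newsEntity})-[*1..2]-(tracked:Company {interested: 'YES'})
RETURN path
LIMIT 25
"""

resp = requests.post(
    OPENCYPHER_URL,
    data={"query": query, "parameters": json.dumps({"newsEntity": "Company B"})},
)
resp.raise_for_status()
print(resp.json()["results"])
```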

The final output is an enriched news feed surfacing articles likely to impact the portfolio manager’s areas of interest and investments.

Solution overview

The overall architecture of the solution looks like the following diagram.

The workflow consists of the following steps:

  1. A user uploads official reports (in PDF format) to an Amazon Simple Storage Service (Amazon S3) bucket. The reports should be officially published reports to minimize the inclusion of inaccurate data into your knowledge graph (as opposed to news and tabloids).
  2. The S3 event notification invokes an AWS Lambda function, which sends the S3 bucket and file name to an Amazon Simple Queue Service (Amazon SQS) queue. The First-In-First-Out (FIFO) queue makes sure that the report ingestion process is performed sequentially to reduce the likelihood of introducing duplicate data into your knowledge graph.
  3. An Amazon EventBridge scheduled rule runs every minute to asynchronously start an execution of an AWS Step Functions state machine.
  4. The Step Functions state machine runs through a series of tasks to process the uploaded document by extracting key information and inserting it into your knowledge graph:
    1. Receive the queue message from Amazon SQS.
    2. Download the PDF report file from Amazon S3, split it into multiple smaller text chunks (approximately 1,000 words) for processing, and store the text chunks in Amazon DynamoDB.
    3. Use Anthropic’s Claude v3 Sonnet on Amazon Bedrock to process the first few text chunks to determine the main entity that the report is referring to, together with relevant attributes (such as industry).
    4. Retrieve the text chunks from DynamoDB and, for each text chunk, invoke a Lambda function that uses Amazon Bedrock to extract entities (such as companies or people) and their relationships (customer, supplier, partner, competitor, or director) to the main entity.
    5. Consolidate all extracted information.
    6. Filter out noise and irrelevant entities (for example, generic terms such as “consumers”) using Amazon Bedrock.
    7. Use Amazon Bedrock to perform disambiguation by reasoning using the extracted information against the list of similar entities from the knowledge graph. If the entity does not exist, insert it. Otherwise, use the entity that already exists in the knowledge graph. Insert all relationships extracted.
    8. Clean up by deleting the SQS queue message and the S3 file.
  5. A user accesses a React-based web application to view the news articles that are supplemented with the entity, sentiment, and connection path information.
  6. Using the web application, the user specifies the number of hops (default N=2) on the connection path to monitor.
  7. Using the web application, the user specifies the list of entities to track.
  8. To generate fictional news, the user chooses Generate Sample News to generate 10 sample financial news articles with random content to be fed into the news ingestion process. Content is generated using Amazon Bedrock and is purely fictional.
  9. To download actual news, the user chooses Download Latest News to download the top news happening today (powered by NewsAPI.org).
  10. The news file (TXT format) is uploaded to an S3 bucket. Steps 8 and 9 upload news to the S3 bucket automatically, but you can also build integrations to your preferred news provider, such as AWS Data Exchange or any third-party news provider, to drop news articles as files into the S3 bucket. News data file content should be formatted as <date>{dd mmm yyyy}</date><title>{title}</title><text>{news content}</text>; see the example file after this list.
  11. The S3 event notification sends the S3 bucket or file name to Amazon SQS (standard), which invokes multiple Lambda functions to process the news data in parallel:
    1. Use Amazon Bedrock to extract entities mentioned in the news together with any related information, relationships, and sentiment of the mentioned entity.
    2. Check against the knowledge graph and use Amazon Bedrock to perform disambiguation by reasoning using the available information from the news and from within the knowledge graph to identify the corresponding entity.
    3. After the entity has been located, search for and return any connection paths connecting to entities marked with INTERESTED=YES in the knowledge graph that are within N=2 hops away.
  12. The web application auto-refreshes every second to pull in and display the latest set of processed news.
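
To illustrate the news file format described in step 10, the following sketch writes a purely fictional article to the ingestion bucket; the bucket name, key, and article content are hypothetical.

```python
import boto3

# Fictional article formatted as expected by the news ingestion pipeline.
news_item = (
    "<date>18 Apr 2024</date>"
    "<title>Parts supplier reports production delays</title>"
    "<text>[Company B] announced a two-week halt at its main assembly plant...</text>"
)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-news-ingestion-bucket",              # hypothetical bucket
    Key="news/2024-04-18-supplier-delay.txt",
    Body=news_item.encode("utf-8"),
)
```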

Deploy the prototype

You can deploy the prototype solution and start experimenting yourself. The prototype is available from GitHub and includes details on the following:

  • Deployment prerequisites
  • Deployment steps
  • Cleanup steps

Summary

This post demonstrated a proof of concept solution to help portfolio managers detect second- and third-order risks from news events, without direct references to companies they track. By combining a knowledge graph of intricate company relationships with real-time news analysis using generative AI, downstream impacts can be highlighted, such as production delays from supplier hiccups.

Although it’s only a prototype, this solution shows the promise of knowledge graphs and language models to connect dots and derive signals from noise. These technologies can aid investment professionals by revealing risks faster through relationship mappings and reasoning. Overall, this is a promising application of graph databases and AI that warrants exploration to augment investment analysis and decision-making.

If this example of generative AI in financial services is of interest to your business, or you have a similar idea, reach out to your AWS account manager, and we will be delighted to explore further with you.


About the Author

Xan Huang is a Senior Solutions Architect with AWS and is based in Singapore. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Xan spends most of his free time with his family and getting bossed around by his 3-year-old daughter. You can find Xan on LinkedIn.

Read More