Build and deploy a UI for your generative AI applications with AWS and Python

The emergence of generative AI has ushered in a new era of possibilities, enabling the creation of human-like text, images, code, and more. However, as exciting as these advancements are, data scientists often face challenges when it comes to developing UIs, prototyping, and letting business users interact with their work. Traditionally, building frontend and backend applications has required knowledge of web development frameworks and infrastructure management, which can be daunting for those with expertise primarily in data science and machine learning.

AWS provides a powerful set of tools and services that simplify the process of building and deploying generative AI applications, even for those with limited experience in frontend and backend development. In this post, we explore a practical solution that uses Streamlit, a Python library for building interactive data applications, and AWS services like Amazon Elastic Container Service (Amazon ECS), Amazon Cognito, and the AWS Cloud Development Kit (AWS CDK) to create a user-friendly generative AI application with authentication and deployment.

Solution overview

For this solution, you deploy a demo application that provides a clean and intuitive UI for interacting with a generative AI model, as illustrated in the following screenshot.

The UI consists of a text input area where users can enter their queries, and an output area to display the generated results.

The default interface is simple and straightforward, but you can extend and customize it to fit your specific needs. With Streamlit’s flexibility, you can add additional features, adjust the styling, and integrate other functionalities as required by your use case.

The solution we explore consists of two main components: a Python application for the UI and an AWS deployment architecture for hosting and serving the application securely.

The Python application uses the Streamlit library to provide a user-friendly interface for interacting with a generative AI model. Streamlit allows data scientists to create interactive web applications using Python, using their existing skills and knowledge. With Streamlit, you can quickly build and iterate on your application without the need for extensive frontend development experience.
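
To give a sense of how little code is involved, the following is a minimal Streamlit sketch that sends a query to Amazon Bedrock and displays the answer. It is illustrative only; the application in the repository (docker_app/app.py) is more complete and may use different widget names and model settings.

# Minimal Streamlit sketch: text input, Amazon Bedrock call, output area
# (illustrative only -- the repository's docker_app/app.py differs)
import json

import boto3
import streamlit as st

st.title("Generative AI demo")
input_sent = st.text_area("Enter your query")

if input_sent:
    bedrock_runtime = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{"role": "user", "content": input_sent}],
    })
    response = bedrock_runtime.invoke_model(
        body=body, modelId="anthropic.claude-3-sonnet-20240229-v1:0"
    )
    answer = json.loads(response["body"].read())["content"][0]["text"]
    st.write(answer)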

The AWS deployment architecture makes sure the Python application is hosted and accessible from the internet to authenticated users. The solution uses the following key components:

  • Amazon ECS and AWS Fargate provide a serverless container orchestration platform for running the Python application
  • Amazon Cognito handles user authentication, making sure only authorized users can access the generative AI application
  • Application Load Balancer (ALB) and Amazon CloudFront are responsible for load balancing and content delivery, so the application is available for users worldwide
  • The AWS CDK allows you to define and provision AWS infrastructure resources using familiar programming languages like Python
  • Amazon Bedrock is a fully managed service that offers a choice of high-performing generative AI models through an API

The following diagram illustrates this architecture.
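
To give a concrete picture of how these pieces fit together, the following is a heavily simplified AWS CDK sketch in Python. Construct IDs, the container port, and the header name are illustrative, and the actual stack in the repository includes additional configuration, such as the listener rule that checks the CloudFront custom header and the Amazon Cognito integration:

# Simplified sketch of the deployment stack (illustrative; see the repository for the full stack)
from aws_cdk import Stack
from aws_cdk import aws_cloudfront as cloudfront
from aws_cdk import aws_cloudfront_origins as origins
from aws_cdk import aws_cognito as cognito
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct

class StreamlitAppStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Fargate service behind an Application Load Balancer,
        # built from the Dockerfile in the docker_app directory
        service = ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "StreamlitService",
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_asset("docker_app"),
                container_port=8501,  # Streamlit's default port; the repository may use another
            ),
        )

        # CloudFront distribution in front of the load balancer; it adds a custom
        # header so the load balancer only serves traffic coming through CloudFront
        cloudfront.Distribution(
            self, "Distribution",
            default_behavior=cloudfront.BehaviorOptions(
                origin=origins.LoadBalancerV2Origin(
                    service.load_balancer,
                    protocol_policy=cloudfront.OriginProtocolPolicy.HTTP_ONLY,
                    custom_headers={"X-Custom-Header": "<CUSTOM_HEADER_VALUE>"},
                ),
            ),
        )

        # User pool for authenticating users of the application
        cognito.UserPool(self, "UserPool", self_sign_up_enabled=False)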

Prerequisites

As a prerequisite, you need to enable model access in Amazon Bedrock and have access to a Linux or macOS development environment. You could also use a Windows development environment, in which case you need to update the instructions in this post.

Access to Amazon Bedrock foundation models is not granted by default. Complete the following steps to enable access to Anthropic’s Claude on Amazon Bedrock, which we use as part of this post:

  1. Sign in to the AWS Management Console.
  2. Choose the us-east-1 AWS Region from the top right corner.
  3. On the Amazon Bedrock console, choose Model access in the navigation pane.
  4. Choose Manage model access.
  5. Select the model you want access to (for this post, Anthropic’s Claude). You can also select other models for future use.
  6. Choose Next and then Submit to confirm your selection.

For more information on how to manage model access, see Access Amazon Bedrock foundation models.

Set up your development environment

To get started with deploying the Streamlit application, you need access to a development environment with the required software installed, including Python 3.8 or later, the AWS CLI, and Docker (which is used to build the application container image). The remaining dependencies are installed in the following steps.

You also need to configure the AWS CLI. One way to do this is to get your access key through the console and use the aws configure command in your terminal to set up your credentials.
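
To confirm that your credentials are configured correctly, you can, for example, call AWS Security Token Service (AWS STS) from Python (this quick check assumes boto3 is installed in your environment):

import boto3

# Prints the AWS account and IAM identity that your configured credentials resolve to
print(boto3.client("sts").get_caller_identity())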

Clone the GitHub repository

Use the terminal of your development environment to enter the commands in the following steps:

  1. Clone the deploy-streamlit-app repository from the AWS Samples GitHub repository:
git clone https://github.com/aws-samples/deploy-streamlit-app.git

  2. Navigate to the cloned repository:
cd deploy-streamlit-app

Create the Python virtual environment and install the AWS CDK

Complete the following steps to set up the virtual environment and the AWS CDK:

  1. Create a new Python virtual environment (your Python version should be 3.8 or greater):
python3 -m venv .venv
  2. Activate the virtual environment:
source .venv/bin/activate
  3. Install the AWS CDK, which is in the required Python dependencies:
pip install -r requirements.txt

Configure the Streamlit application

Complete the following steps to configure the Streamlit application:

  1. In the docker_app directory, locate the config_file.py file.
  2. Open config_file.py in your editor and modify the STACK_NAME and CUSTOM_HEADER_VALUE variables:
    1. The stack name enables you to deploy multiple applications in the same account. Choose a different stack name for each application. For your first application, you can leave the default value.
    2. The custom header value is a security token that CloudFront uses to authenticate on the load balancer. You can choose it randomly, and it must be kept secret.
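
Your edit will look something like the following sketch (the values shown are examples only; choose your own):

# docker_app/config_file.py (excerpt) -- example values, choose your own
STACK_NAME = "my-streamlit-genai-app"  # must differ for each application deployed in the account
CUSTOM_HEADER_VALUE = "replace-with-a-long-random-secret"  # shared secret between CloudFront and the ALB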

Deploy the AWS CDK template

Complete the following steps to deploy the AWS CDK template:

  1. From your terminal, bootstrap the AWS CDK:
cdk bootstrap
  2. Deploy the AWS CDK template, which will create the necessary AWS resources:
cdk deploy
  3. Enter y (yes) when asked if you want to deploy the changes.

The deployment process may take 5–10 minutes. When it’s complete, note the CloudFront distribution URL and Amazon Cognito user pool ID from the output.
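
If you need these values again later, you can retrieve them from the AWS CloudFormation console or programmatically. The following is a small sketch using boto3 (replace the stack name with the STACK_NAME you set in config_file.py):

import boto3

cloudformation = boto3.client("cloudformation")

# List the outputs of the deployed stack (CloudFront URL, Amazon Cognito user pool ID, and so on)
stack = cloudformation.describe_stacks(StackName="<STACK_NAME from config_file.py>")["Stacks"][0]
for output in stack["Outputs"]:
    print(output["OutputKey"], "=", output["OutputValue"])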

Create an Amazon Cognito user

Complete the following steps to create an Amazon Cognito user:

  1. On the Amazon Cognito console, navigate to the user pool that you created as part of the AWS CDK deployment.
  2. On the Users tab, choose Create user.

  3. Enter a user name and password.
  4. Choose Create user.
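
If you prefer to script this step, you can create the user with boto3 instead of the console. The following sketch uses example values; replace the user pool ID with the one from the AWS CDK output:

import boto3

cognito = boto3.client("cognito-idp")
user_pool_id = "<user pool ID from the AWS CDK output>"

# Create the user without sending an invitation email
cognito.admin_create_user(
    UserPoolId=user_pool_id,
    Username="demo-user",
    TemporaryPassword="ChangeMe123!",
    MessageAction="SUPPRESS",
)

# Optionally set a permanent password so the first sign-in doesn't force a reset
cognito.admin_set_user_password(
    UserPoolId=user_pool_id,
    Username="demo-user",
    Password="ReplaceWithAStrongPassword1!",
    Permanent=True,
)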

Access the Streamlit application

Complete the following steps to access the Streamlit application:

  1. Open a new web browser window or tab and navigate to the CloudFront distribution URL from the AWS CDK deployment output.

If you have not noted this URL, you can open the AWS CloudFormation console and find it in the outputs of the stack.

  2. Log in to the Streamlit application using the Amazon Cognito user credentials you created in the previous step.

You should now be able to access and interact with the Streamlit application, which is deployed and running on AWS using the provided AWS CDK template.

This deployment is intended as a starting point and a demo. Before using this application in a production environment, you should thoroughly review and implement appropriate security measures, such as configuring HTTPS on the load balancer and following AWS best practices for securing your resources. See the README.md file in the GitHub repository for more information.

Customize the application

The aws-samples/deploy-streamlit-app GitHub repository provides a solid foundation for building and deploying generative AI applications, but it’s also highly customizable and extensible.

Let’s explore how you can customize the Streamlit application. Because the application is written in Python, you can modify it to integrate with different generative AI models, add new features, or change the UI to better align with your application’s requirements.

For example, let’s say you want to add a button that triggers the LLM response instead of invoking the model automatically when the user enters input text. Complete the following steps to modify the docker_app/app.py file:

  1. After the definition of the input_sent text input, add a Streamlit button:
# Insert this after the line starting with input_sent = …
submit_button = st.button("Get LLM Response")
  2. Change the if condition to check whether the button is clicked instead of checking input_sent:
# Replace the line `if input_sent:` with the following
if submit_button:

  3. Redeploy the application by entering the following in the terminal:
cdk deploy

The deployment should take less than 5 minutes. In the next section, we show how to test your changes locally before deploying, which will accelerate your development workflow.

  4. When the deployment is complete, refresh the webpage in your browser.

The Streamlit application will now display a button labeled Get LLM Response. When the user chooses this button, the LLM will be invoked, and the output will be displayed on the UI.
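
Put together, the modified section of docker_app/app.py looks roughly like the following sketch (the surrounding code, such as the Amazon Bedrock call, is unchanged and is elided here; the exact widget types in the repository may differ):

# Sketch of the modified section of docker_app/app.py
input_sent = st.text_area("Enter your query")   # existing text input (widget type may differ)
submit_button = st.button("Get LLM Response")   # new button from step 1

if submit_button:                               # changed condition from `if input_sent:` (step 2)
    # ... existing code that invokes the LLM with input_sent and displays the answer ...
    ...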

This is just one example of how you can customize the Streamlit application to meet your specific requirements. You can modify the code further to integrate with different generative AI models, add additional features, or enhance the UI as needed.

Test your changes locally before deploying

Although deploying the application using cdk deploy allows you to test your changes in the actual AWS environment, it can be time-consuming, especially during the development and testing phase. Fortunately, you can run and test your application locally before deploying it to AWS.

To test your changes locally, follow these steps:

  1. In your terminal, navigate to the docker_app directory, where the Streamlit application is located:
cd docker_app
  2. If you haven’t already, install the dependencies of the Python application. These dependencies are different from those of the AWS CDK application that you installed previously:
pip install -r requirements.txt
  3. Start the Streamlit server with the following command:
streamlit run app.py --server.port 8080

This will start the Streamlit application on port 8080.

You should now be able to interact with the locally running Streamlit application and test your changes without having to redeploy the application to AWS.

Remember to stop the Streamlit server (by pressing Ctrl+C in the terminal) when you’re done testing.

By testing your changes locally, you can significantly speed up the development and testing cycle, allowing you to iterate more quickly and catch issues early in the process.

Clean up

To avoid incurring additional charges, clean up the resources created during this demo:

  1. Open the terminal in your development environment.
  2. Make sure you’re in the root directory of the project and your virtual environment is activated:
cd ~/environment/deploy-streamlit-app
source .venv/bin/activate
  3. Destroy the AWS CDK stack:
cdk destroy
  4. Confirm the deletion by entering yes when prompted.

Conclusion

Building and deploying user-friendly generative AI applications no longer requires extensive knowledge of frontend and backend development frameworks. By using Streamlit and AWS services, data scientists can focus on their core expertise while still delivering secure, scalable, and accessible applications to business users.

The full code of the demo is available in the GitHub repository. It provides a valuable starting point for building and deploying generative AI applications, allowing you to quickly set up a working prototype and iterate from there. We encourage you to explore the repository and experiment with the provided solution to create your own applications.

As the adoption of generative AI continues to grow, the ability to build and deploy user-friendly applications will become increasingly important. With AWS and Python, data scientists now have the tools and resources to bridge the gap between their technical expertise and the need to showcase their models to business users through secure and accessible UIs.


About the Author

Lior Perez is a Principal Solutions Architect on the Construction team based in Toulouse, France. He enjoys supporting customers in their digital transformation journey, using big data, machine learning, and generative AI to help solve their business challenges. He is also personally passionate about robotics and IoT, and constantly looks for new ways to use technologies for innovation.

Unearth insights from audio transcripts generated by Amazon Transcribe using Amazon Bedrock

Generative AI continues to push the boundaries of what’s possible. One area garnering significant attention is the use of generative AI to analyze audio and video transcripts, increasing our ability to extract valuable insights from content stored in audio or video files. Speech data is unique and complex, which makes it difficult to analyze and extract insights. Manually transcribing and analyzing it can be time-consuming and resource-intensive.

Existing methods for extracting insights from speech data often require tedious human transcription and review. You can use automatic speech recognition tools to convert your audio and video data to text. However, you still have to rely on manual processes to extract specific insights and data points, or to get summaries of the content. This approach is time-consuming, and as organizations amass vast amounts of this content, the need for a more efficient and insightful solution becomes increasingly pressing. There is a significant opportunity to add business value, given the amount of data organizations store in these formats and the valuable insights that might otherwise go undiscovered. The following are some of the new insights and capabilities that can be obtained through the use of large language models (LLMs) with audio transcripts:

  • LLMs can analyze and understand the context of a conversation, not just the words spoken, but also the implied meaning, intent, and emotions. Previously, this would have required extensive human interpretation.
  • LLMs can perform advanced sentiment analysis. Basic sentiment could be captured previously, but LLMs can detect more nuanced emotions, such as sarcasm, ambivalence, or mixed feelings, by understanding the context of the conversation.
  • LLMs can generate concise summaries not just by extracting content, but by understanding the context of the conversation.
  • Users can now ask complex, natural language questions and receive insightful answers.
  • LLMs can infer personas or roles in a conversation, enabling targeted insights and actions.
  • LLMs can support the creation of new content based on audio assets or conversations following predetermined templates or flows.

In this post, we examine how to create business value through speech analytics with some examples focused on the following:

  • Automatically summarizing, categorizing, and analyzing marketing content such as podcasts, recorded interviews, or videos, and creating new marketing materials based on those assets
  • Automatically extracting key points, summaries, and sentiment from a recorded meeting (such as an earnings call)
  • Transcribing and analyzing contact center calls to improve customer experience

The first step in getting these audio data insights involves transcribing the audio file using Amazon Transcribe. Amazon Transcribe is a machine learning (ML)-based managed service that automatically converts speech to text, enabling developers to seamlessly integrate speech-to-text capabilities into their applications. It also recognizes multiple speakers, automatically redacts personally identifiable information (PII), and allows you to enhance the accuracy of a transcription by providing custom vocabularies specific to your industry or use case, or by using custom language models.

The second step involves using foundation models (FMs) with Amazon Bedrock to summarize the content, identify topics, and recognize conclusions, extracting valuable insights that can guide strategic decisions and innovations. Automatic generation of new content also adds value, increasing creativity and productivity.

Generative AI is reshaping the way we analyze audio transcripts, enabling you to unlock insights such as customer sentiment, pain points, common themes, avenues for risk mitigation, and more, that were previously obscured.

Use case overview

In this post, we discuss three example use cases in detail. The code artifacts are in Python. We used a Jupyter notebook to run the code snippets. You can follow along by creating and running a notebook in Amazon SageMaker Studio.

Audio summarization and insights, and automated generation of new content using Amazon Transcribe and Amazon Bedrock

Through this use case, we demonstrate how to take an existing marketing asset (a video) and create a new blog post to announce the launch of the video, create an abstract, and extract the main topics and the search engine optimization (SEO) keywords present in the post for documenting and categorizing the asset.

Transcribe audio with Amazon Transcribe

In this case, we use an AWS re:Invent 2023 technical talk as a sample. For the purpose of this notebook, we downloaded the MP4 file for the recording and stored it in an Amazon Simple Storage Service (Amazon S3) bucket.

The first step is to transcribe the audio file using Amazon Transcribe:

# Create an Amazon Transcribe transcription job by specifying the audio/video file's S3 location
import boto3
import time
import random
transcribe = boto3.client('transcribe')
response = transcribe.start_transcription_job(
    TranscriptionJobName=f"podcast-transcription-{int(time.time())}_{random.randint(1000, 9999)}",
    LanguageCode='en-US',
    MediaFormat='mp4',
    Media={
        'MediaFileUri': '<S3 URI of the media file>'
    },
    OutputBucketName='<name of the S3 bucket that will store the output>',
    OutputKey='transcribe_bedrock_blog/output-files/',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 3
    }
)
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=response['TranscriptionJob']['TranscriptionJobName'])
    job_status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if job_status in ["COMPLETED", "FAILED"]:
        if job_status == "COMPLETED":
            print(
                f"Download the transcript fromn"
                f"t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}."
            )
        break
    else:
        print(f"Waiting for {response['TranscriptionJob']['TranscriptionJobName']}. Current status is {job_status}.")
    time.sleep(10)

The transcription job will take a few minutes to complete.

When the job is complete, you can inspect the transcription output and check the plain text transcript that was generated (the following has been trimmed for brevity):

# Get the Transcribe Output JSON file
import json
s3 = boto3.client('s3')
output_bucket = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[1]
output_file_key = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[2]
s3_response_object = s3.get_object(Bucket=output_bucket, Key=output_file_key)
object_content = s3_response_object['Body'].read()

transcription_output = json.loads(object_content)

# Let's see what we have inside the job output JSON
print(transcription_output['results']['transcripts'][0]['transcript'])

……….Once the alert comes, how do you kind of correlate these alerts, not by just text signals and text passing, but understanding the topology, the infrastructure topology that is supporting that application or that business service. It is the topology that ultimately gives the source of truth when an alert comes, right. That’s what we mean by a correlation that is assisted with topology in in in the thing that ultimately results in finding a probable root cause. And once ……….

After you have validated the existence of the text, you can use Amazon Bedrock to analyze the output:

# Now let's use this transcript to extract insights with the help of a large language model on Amazon Bedrock
# First, let's initialize the Bedrock runtime client to invoke the model.
bedrock_runtime = boto3.client('bedrock-runtime')
# Selecting Claude 3 Sonnet
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

Using the transcription from the technical talk, we use Amazon Bedrock to call an FM (we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock in this case). You can choose from the language models available on Amazon Bedrock from AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon.
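
If you want to see which foundation models are offered in your Region (model access is still granted separately, as described in the prerequisites), you can list them with the Amazon Bedrock control-plane client:

# List the foundation models available in the current Region
bedrock = boto3.client('bedrock')
for model in bedrock.list_foundation_models()['modelSummaries']:
    print(model['providerName'], '-', model['modelId'])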

You can now perform additional tasks.
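
Each of the tasks in the following sections repeats the same pattern: format a prompt, call invoke_model, and parse the response body. If you prefer, you can factor that pattern into a small helper function such as the following sketch and call it with different prompts:

# Optional helper that wraps the invoke_model pattern used in the following sections
def invoke_claude(prompt, max_tokens=1000):
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}]
        }
    )
    bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(bedrock_response.get('body').read())
    return response_body['content'][0]['text']

# Example usage:
# main_topics = invoke_claude(main_topics_prompt.format(transcript=transcription_output['results']['transcripts'][0]['transcript']))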

Extract the main topics with Amazon Bedrock

The following prompt provides instructions to ask the LLM for the main topics in the technical talk:

# Extracting the main topics with Amazon Bedrock
main_topics_prompt = """Based on the contents of <transcript></transcript>, what are the main topics being discussed? Display the topics as a list.

<transcript>
{transcript}
</transcript>
"""

user_message = {"role": "user", "content": main_topics_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
main_topics = response_body['content'][0]['text']
print(main_topics)

We have created a prompt that uses prompting best practices for Anthropic’s Claude. In this case, we pass the transcript within the <transcript></transcript> XML tags and ask for the main topics discussed. Based on that, we get the following output:

Based on the contents of the transcript, the main topics being discussed are:

  1. AI and Machine Learning in IT Operations
  2. Challenges faced by IT Operations teams
  3. Observable data explosion from cloud, on-premise, and edge infrastructure
  4. The convergence of IT, SRE, and DevOps roles
  5. Discovery to Resolution pipeline for IT Operations
  6. Alert correlation and intelligent alerting using AI/ML
  7. Probable root cause analysis with AI/ML
  8. Automated remediation and runbook automation
  9. Predictive analytics and preventive maintenance using AI/ML
  10. Benefits of AI/ML for IT Operations (proactive operations, reduced human effort, improved service health, cost optimization)

Generate a blog post to announce the video with Amazon Bedrock

The following prompt provides instructions to ask the LLM for the generation of a new blog post announcing the launch of the preceding recording. Just like the preceding prompt, we use prompting best practices for Anthropic’s Claude:

# Generating a blog post to announce the Podcast with Amazon Bedrock
generate_blog_prompt = """The contents of <transcript></transcript> are a transcription of a technical talk video recording. Based on this information write a blogpost that summarizes the technical talk and announces that the recording of this talk is now available for public consumption. Limit the blogpost to 1500 words. Do not include any information that is not mentioned in <transcript></transcript>. 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": generate_blog_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
generate_blog = response_body['content'][0]['text']
print(generate_blog)

Based on that, we get the following output:

Here is a draft 1462 word blogpost summarizing the technical talk and announcing the recording is now available:

Title: Revolutionizing IT Operations with AI/ML: A Future-Ready Approach

The world of IT operations is undergoing a seismic shift, driven by the explosion of observable data, the convergence of IT roles, and the advent of hybrid multi-cloud environments. Traditional reactive approaches are no longer sufficient to keep pace with the complexity and scale of modern IT systems. This is where AI and machine learning (ML) come into play, offering a future-ready approach to revolutionize IT operations.

In a recent technical talk, Verma from HPE OpsRamp outlined the drivers behind this AI/ML imperative and demonstrated how this cutting-edge technology is fundamentally transforming IT operations. The recording of this insightful session is now available for public consumption.

The Drivers Behind the AI/ML Imperative

The first driving force behind the adoption of AI/ML in IT operations is the sheer volume of observable data that………………..

Create an abstract with Amazon Bedrock

The following prompt provides instructions to create an abstract summarizing the content of the technical talk:

# Creating an Abstract with Amazon Bedrock
abstract_prompt = """The contents of <transcript></transcript> are a transcription of a technical talk video recording. Based on this information write an abstract for this technical talk in no more than 100 words. Do not include any information that is not mentioned in <transcript></transcript>.

<transcript>
{transcript}
</transcript>
"""

user_message = {"role": "user", "content": abstract_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
abstract = response_body['content'][0]['text']
print(abstract)

Based on this, we get the following output:

Here is a 96 word abstract for the technical talk:

This talk discussed how AI and machine learning innovations are transforming IT operations for hybrid multi-cloud workloads. The speakers outlined the challenges of dealing with exploding observable data from modern infrastructure and the convergence of IT roles. They demonstrated how AI/ML techniques like intelligent alerting, alert correlation, probable root cause analysis, and automated remediation can drive more proactive, predictive operations. Key benefits showcased included reduced human effort, improved service health, and cost optimization. The talk featured a demo of the OpsRamp platform leveraging AI/ML models to streamline the discovery-to-resolution pipeline for managing edge-to-cloud environments.

Extract SEO keywords from the generated blog post with Amazon Bedrock

The following prompt provides instructions to extract the main SEO keywords from the generated blog post. Based on online research on how to extract SEO keywords from long-form text, we came up with the following list of instructions. This demonstrates how you can empower an LLM like Anthropic’s Claude to follow instructions and best practices for a particular task or domain. The prompt also specifies that the output should be in JSON, which is helpful in use cases where you want to programmatically consume results from an LLM and therefore require consistent formatting. Following best practices for Anthropic’s Claude, we use the Assistant message in the Messages API to pre-fill the model’s response and gain further control over the output format:

# Extracting SEO keywords from the generated blog post
SEO_keywords_prompt = """Extract the most relevant keywords and phrases from the given blog post text present in <blog></blog> that would be valuable for SEO (search engine optimization) based on the instructions present in <instructions></instructions> below. The ideal keywords should capture the main topics, concepts, entities, and high-value terms present in the content. Use JSON format with key "keywords" and value as an array of keywords. Skip the preamble; go straight into the JSON. 

<blog>
{textblog}
</blog>

<instructions>
1. Carefully read through the entire blog post text to understand the main topics, concepts, and ideas covered.
2. Identify important nouns, noun phrases, multi-word phrases, and relevant adjective-noun combinations that relate to the core subject matter of the post.
3. Look for words and phrases that potential searchers might use to find content like this.
4. Prioritize terms that are highly specific and relevant to the blog topic over generic words.
5. Vary the keyword length and include both head terms (shorter, more popular keywords) and long-tail terms (longer, more specific phrases).
6. Aim to extract around 10-20 of the most valuable, high-impact keywords and phrases for SEO.
</instructions>
"""

user_message =  {"role": "user", "content": SEO_keywords_prompt.format(textblog = generate_blog)}
assistant_message = {"role": "assistant", "content": '{"keywords": ['}
messages = [user_message, assistant_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
SEO_keywords = response_body['content'][0]['text']
SEO_keywords = '{"keywords": [' + SEO_keywords
SEO_keywords_json = json.loads(SEO_keywords)
print(SEO_keywords_json)

Based on this, we get the following output:

{‘keywords’: [‘AI-driven IT operations’, ‘machine learning IT operations’, ‘proactive IT operations’, ‘predictive IT operations’, ‘AI for hybrid cloud’, ‘AI for multi-cloud’, ‘FutureOps’, ‘AI-assisted IT operations’, ‘AI-powered event correlation’, ‘intelligent alerting’, ‘automated remediation workflows’, ‘predictive analytics for IT’, ‘AI anomaly detection’, ‘AI root cause analysis’, ‘AI-driven observability’, ‘AI for DevOps’, ‘AI for SRE’, ‘AI IT operations management’]}

For consistent formatting and structured output, you can also use the Converse and ConverseStream APIs in Amazon Bedrock and use the tool calling capabilities of the LLMs that offer it.
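
For example, the earlier main-topics request could be sent through the Converse API, which uses a consistent message format across models; the following is a sketch (the inference parameters are examples):

# Sketch: the same request sent through the Converse API
converse_response = bedrock_runtime.converse(
    modelId=model_id,
    messages=[
        {
            "role": "user",
            "content": [{"text": main_topics_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}]
        }
    ],
    inferenceConfig={"maxTokens": 1000}
)
print(converse_response['output']['message']['content'][0]['text'])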

Generate a new blog post version emphasizing specific SEO keywords with Amazon Bedrock

The following prompt provides instructions to generate a new blog post emphasizing three keywords or phrases from the ones extracted in the previous step. In this case, we use the first three keywords in the returned keywords list. The instructions specified in the prompt are derived from SEO content best practices:

# Rewriting the blog post to emphasize the selected SEO keywords
rewrite_SEO_keywords_prompt = """Rewrite the given blog post text present in <blog></blog> to improve its ranking for 3 target keywords listed in <keywords></keywords>. Rewrite the blog post, naturally incorporating the target keywords multiple times throughout the text. Keep the rewritten content high-quality, readable, and relevant, while strategically emphasizing the keywords to improve SEO.

<blog>
{textblog}
</blog>

<keywords>
{kw1},
{kw2},
{kw3}
</keywords>
"""

user_message = {"role": "user", "content": rewrite_SEO_keywords_prompt.format(textblog = generate_blog, kw1 = SEO_keywords_json['keywords'][0], kw2 = SEO_keywords_json['keywords'][1], kw3 = SEO_keywords_json['keywords'][2])}
messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
rewrite_SEO_blog = response_body['content'][0]['text']
print(rewrite_SEO_blog)

Based on this, we get the following output:

<blog>

Title: Unleash the Power of AI-Driven and Machine Learning IT Operations for Proactive IT Operations

In today’s rapidly evolving digital landscape, traditional IT operations approaches are struggling to keep pace. The explosion of data from cloud workloads, Kubernetes clusters, and distributed edge locations has overwhelmed conventional monitoring tools, leading to fragmented views and reactive operations that respond to issues after they occur. To stay ahead of the curve, organizations must embrace AI-driven and machine learning IT operations, paving the way for proactive IT operations that enhance efficiency and minimize downtime.

At a recent technical talk, Verma and Radha Krishna Kunni from OpsRamp (recently acquired by HPE) delved into the transformative impact of AI and machine learning on IT operations, DevOps, and SRE for hybrid multi-cloud environments. They highlighted the key challenges ops teams face today and introduced the innovative “FutureOps” approach, which leverages AI and machine learning to revolutionize IT operations.

The full video recording of this insightful technical talk is now available [link], providing a comprehensive understanding of…………

Summarize content discussed in a recorded meeting using Amazon Transcribe and Amazon Bedrock

Through this use case, we demonstrate how to take an existing recording from a meeting (we use a recording from an Amazon earnings call) to summarize the content discussed, extract the key points, and provide details on the sentiment of the meeting. For additional information on this use case, see Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Amazon Bedrock Knowledge Bases or Amazon Q Business.

Transcribe audio with Amazon Transcribe

In this use case, we use an Amazon 2024 Q1 earnings call as a sample. For the purpose of this notebook, we downloaded the WAV file for the recording and stored it in an S3 bucket.

The first step is to transcribe the audio file using Amazon Transcribe:

# Create an Amazon Transcribe transcription job by specifying the audio/video file's S3 location
import boto3
import time
import random
transcribe = boto3.client('transcribe')
response = transcribe.start_transcription_job(
    TranscriptionJobName=f"meeting-transcription-{int(time.time())}_{random.randint(1000, 9999)}",
    LanguageCode='en-US',
    MediaFormat='wav',
    Media={
        'MediaFileUri': '<S3 URI of the media file>'
    },
    OutputBucketName='<name of the S3 bucket that will store the output>',
    OutputKey='transcribe_bedrock_blog/output-files/',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 10
    }
)
# Check whether the transcribe job is complete

max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=response['TranscriptionJob']['TranscriptionJobName'])
    job_status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if job_status in ["COMPLETED", "FAILED"]:
        if job_status == "COMPLETED":
            print(
                f"Download the transcript fromn"
                f"t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}."
            )
        break
    else:
        print(f"Waiting for {response['TranscriptionJob']['TranscriptionJobName']}. Current status is {job_status}.")
    time.sleep(10)

The transcription job will take a few minutes to complete.

When the job is complete, you can inspect the transcription output and check for the plain text transcript that was generated:

import json
# Get the Transcribe Output JSON file
s3 = boto3.client('s3')
output_bucket = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[1]
output_file_key = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[2]
s3_response_object = s3.get_object(Bucket=output_bucket, Key=output_file_key)
object_content = s3_response_object['Body'].read()

transcription_output = json.loads(object_content)

# Let's see what we have inside the job output JSON
print(transcription_output['results']['transcripts'][0]['transcript'])

Thank you for standing by. Good day, everyone and welcome to the amazon.com first quarter, 2024 financial results teleconference. At this time, all participants are in a listen only mode. After the presentation, we will conduct a question and answer session. Today’s call is being recorded and for opening remarks, I’ll be turning the call over to the Vice President of Investor………

After you have validated the existence of the text, you can use Amazon Bedrock to analyze the output:

# Now let's use this transcript to extract insights with the help of a large language model on Amazon Bedrock
# First, let's initialize the Bedrock runtime client to invoke the model.
bedrock_runtime = boto3.client('bedrock-runtime')
# Selecting Claude 3 Sonnet
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

Using the transcription from the earnings call recording, we use Amazon Bedrock to call an FM (we use Anthropic’s Claude 3 Sonnet in this case). You can choose from other FMs available on Amazon Bedrock.

You can now perform additional tasks.

Identify the financial ratios highlighted during this earnings call

The following prompt provides instructions to identify financial ratios highlighted during the earnings call and their implications:

# Identify the financial ratios highlighted during this earnings call 
financial_ratios_prompt = """Based on the contents of <transcript></transcript>, identify the financial ratios highlighted during this earnings call and their implications.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": financial_ratios_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
financial_ratios = response_body['content'][0]['text']
print(financial_ratios)

Based on this, we get the following output:

Based on the earnings call transcript, the following financial ratios and their implications were highlighted:

  1. Operating Income Margin:
    • Amazon reported its highest ever quarterly operating income of $15.3 billion, which was $3.3 billion above the high end of their guidance range. This was driven by strong operational performance across all three reportable segments (North America, International, and AWS) and better-than-expected operating leverage, including lower cost to serve.
    • North America segment operating income was $5 billion with an operating margin of 5.8%, up 460 basis points year-over-year, driven by improvements in cost to serve, including benefits from regionalization efforts, more consolidated customer shipments, and improved leverage.
    • International segment operating income was $903 million with an operating margin of 2.8%, up 710 basis points year-over-year, primarily driven by cost efficiencies through network design enhancements and improved volume leverage in established countries, as well as progress in emerging countries.
    • AWS operating income was $9.4 billion, an increase of $4.3 billion year-over-year, with improved leverage from managing infrastructure and fixed costs while growing at a healthy rate.

Implication: The higher operating income margins across all segments indicate Amazon’s focus on driving efficiencies and improving profitability while continuing to invest in growth opportunities.

  2. Revenue Growth:
    • Worldwide revenue was $143.3 billion, up 13% year-over-year (excluding the impact of foreign exchange).
    • AWS revenue grew 17.2% year-over-year, accelerating from 13.2% in Q4 2023, driven by strong demand for both generative AI and non-generative AI workloads.
    • Advertising revenue grew 24% year-over-year (excluding the impact of foreign exchange), primarily driven by sponsored products and improvements in relevancy and measurement capabilities.

Implication: The strong revenue growth, particularly in AWS and advertising, highlights Amazon’s diversified revenue streams and the growth opportunities in cloud computing and digital advertising.

  3. Capital Expenditures (Capex):
    • Amazon anticipates a meaningful increase in overall capital expenditures in 2024, primarily driven by higher infrastructure Capex for growth in AWS, including generative AI investments.
    • In Q1 2024, Capex was $14 billion, expected to be the lowest quarter of the year.

Implication: The increase in Capex signals Amazon’s confidence in the strong demand for AWS and their commitment to investing in emerging technologies like generative AI to drive future growth.

Overall, the financial ratios and commentary indicate Amazon’s focus on improving profitability, driving operational efficiencies, and investing in growth opportunities, particularly in AWS and generative AI, while maintaining a diversified revenue stream and managing costs effectively.

Identify the speakers from the earnings call with Amazon Bedrock

The following prompt provides instructions to identify the speakers in the meeting from the transcription:

# Identify the speakers from the earnings call 
speakers_prompt = """Based on the contents of <transcript></transcript>, identify the speakers on this earnings call.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": speakers_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
speakers = response_body['content'][0]['text']
print(speakers)

Based on this, we get the following output:

Based on the transcript, the key speakers on this Amazon earnings call appear to be:

  1. Andy Jassy – CEO of Amazon
  2. Brian Olsavsky – CFO of Amazon
  3. Dave Fildes – Vice President of Investor Relations at Amazon

The call begins with opening remarks from Dave Fildes, followed by prepared statements from Andy Jassy and Brian Olsavsky. They then take questions from analysts, with Andy and Brian providing the responses.

Obtain the challenges or negative areas discussed on the earnings call with Amazon Bedrock

The following prompt provides instructions to obtain the challenges or negative areas discussed from the transcription:

# Obtain the challenges or negative areas discussed on earnings call

challenges_prompt = """Based on the contents of <transcript></transcript>, identify the challenges or negative areas discussed on this earnings call.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": challenges_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
challenges = response_body['content'][0]['text']
print(challenges)

Based on this, we get the following output:

Based on the transcript, some of the key challenges or negative areas discussed include:

  1. Foreign exchange headwinds: Amazon faced an unfavorable impact from global currencies weakening against the U.S. dollar in Q1, leading to a $700 million or 50 basis point headwind to revenue compared to guidance.
  2. Increasing capital expenditures: Amazon expects to meaningfully increase its capital expenditures year-over-year in 2024, primarily driven by higher infrastructure spending for AWS growth, including investments in generative AI capabilities.
  3. Consumer spending concerns: Amazon mentioned keeping an eye on consumer spending trends, specifically in Europe, where it appears weaker relative to the U.S.
  4. International segment profitability: While the international segment’s profitability improved, with an operating margin of 2.8%, Amazon acknowledged the need to continue working on cost efficiencies and profitability, particularly in emerging countries.
  5. Cost optimization challenges: Although Amazon believes the majority of cost optimization efforts are behind them, there is still a need to continually streamline processes, optimize inventory placement, and invest in automation to further reduce the cost to serve.

Overall, the challenges centered around foreign exchange impacts, increasing capital intensity for AWS and generative AI investments, consumer demand uncertainties, and ongoing efforts to improve operational efficiencies and international profitability.

Get insights from a call center call between an agent and a customer using Amazon Transcribe and Amazon Bedrock

Through this use case, we demonstrate how to take an existing call recording from a contact center and summarize the content discussed, extract the main topic, key phrases, call reason, customer satisfaction, overall call sentiment, and sentiment about the products and services discussed. For additional details about this use case, see Live call analytics and agent assist for your contact center with Amazon language AI services and Post call analytics for your contact center with Amazon language AI services.

Transcribe audio with Amazon Transcribe

The first step is to transcribe the audio file using Amazon Transcribe. In this case, we use a sample from the Amazon Transcribe Post Call Analytics Solution GitHub repository. For the purpose of this notebook, we downloaded the WAV file and stored it in an S3 bucket.

# Create an Amazon Transcribe transcription job by specifying the audio/video file's S3 location
import boto3
import time
import random
transcribe = boto3.client('transcribe')
response = transcribe.start_transcription_job(
    TranscriptionJobName=f"call_center-transcription-{int(time.time())}_{random.randint(1000, 9999)}",
    LanguageCode='en-US',
    MediaFormat='wav',
    Media={
        'MediaFileUri': '<S3 URI of the media file>'
    },
    OutputBucketName='<name of the S3 bucket that will store the output>',
    OutputKey='transcribe_bedrock_blog/output-files/',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 3
    }
)
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=response['TranscriptionJob']['TranscriptionJobName'])
    job_status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if job_status in ["COMPLETED", "FAILED"]:
        if job_status == "COMPLETED":
            print(
                f"Download the transcript fromn"
                f"t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}."
            )
        break
    else:
        print(f"Waiting for {response['TranscriptionJob']['TranscriptionJobName']}. Current status is {job_status}.")
    time.sleep(10)

The transcription job will take a few minutes to complete.

When the job is complete, you can inspect the transcription output and check for the plain text transcript that was generated:

import json
# Get the Transcribe Output JSON file
s3 = boto3.client('s3')
output_bucket = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[1]
output_file_key = job['TranscriptionJob']['Transcript']['TranscriptFileUri'].split('https://')[1].split('/',2)[2]
s3_response_object = s3.get_object(Bucket=output_bucket, Key=output_file_key)
object_content = s3_response_object['Body'].read()

transcription_output = json.loads(object_content)

# Let's see what we have inside the job output JSON
print(transcription_output['results']['transcripts'][0]['transcript'])

Thank you for calling Big Jim’s Auto. This is Travis. How can I help you today? Hello, my name is Violet King and I bought a car not too long ago and a light is coming on um a light on the dashboard. And so I was wondering what I should do about that. Ok. It may depend on what kind of light we’re looking at here today, ma’am. Uh Could I get your first and last name spelled out for me so I can just get some information pulled up? Yes. My name is Violet Vviolet. My last name is King King. Ok, I got the call last week. Ok. Uh And what kind of car are we examining today? It’s, it’s a Ford Fusion. It’s 2017, 2017 Ford Fusion. OK. And for verification, ma’am. Do you happen to know the purchase date of the car? Yes, it, it, it was last Tuesday, August 10th. You say the 10th? Ok. And can you describe to me what kind of light we’re looking at? Y yes, it’s uh I, I think it’s a, an oil, an oil light, an oil light? Ok. Ok. And uh just for clarity on my end, ma’am. Um, uh, is this the first call you’ve made regarding this? Yes. Ok. And, uh, this might be kind of a silly question because I know you just got the car. But sometimes they make me ask a silly question about how many miles has the car been driven since you bought it? Oh. I’m not sure. Should I check? Uh, no, that’s ok. I’ll just, I’ll just put in that. We don’t know at this time. It’s ok. Um, so, um, uh, under the warranty we offer, um, we, the, we don’t handle in house oil changes. Um, we basically, when, when someone buys a car from us, the warranty, we have, it, it covers some stuff like, um, weather damage. Um, and, uh, if the engine light comes on, we take a look at that, but the oil change is something that we just don’t have, uh, here at the dealership that’s a little bit outsourced out and they’re pretty backed up right now because a lot of people have been, uh, staying in due to the recent pandemic and now everyone’s just starting to get out and a whole bunch of places are just completely bogged down. So we have a place that we typically outsource to, um, and they’re, they’re pretty reasonable. They’re about, I wanna say somewhere between 25 and $35 to do an oil change. So it’s really not that bad, but they’re a little bit backed up right now from what I’ve heard, I would recommend giving them a call as soon as you can before they close. Ok. What is their number? Uh, give me a second. Let me just rustle through the desk here. See if I can find their information. Uh, ok. Ok. Yes, I’m all right with them one moment, please. Sure. Ok. All their number is 888 333 2222. Ok, and they can fix my car. Yeah, they should be able to handle the oil change. I’m sorry, that’s not something that we cover under the warranty that uh, we have, um, but they should be able to get you settled and, uh, sorted. Ok. Ok. Thank you. Cool. No problem. Have a good one. Thank you. Yup. Bye.

After you have validated the existence of the text, you can use Amazon Bedrock to analyze the output:

# Now let's use this transcript to extract insights with the help of a large language model on Amazon Bedrock
# First, let's initialize the Bedrock runtime client to invoke the model.
bedrock_runtime = boto3.client('bedrock-runtime')
# Selecting Claude 3 Sonnet
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

Using the transcription from the recorded call center call, we use Amazon Bedrock to call an FM (we use Anthropic’s Claude 3 Sonnet in this case), but you can choose from the other language models available on Amazon Bedrock.

You can now perform additional tasks.

Summarize the call between agent and client with Amazon Bedrock

The following prompt provides instructions to summarize the call discussion from the transcription:

#Summarize the call between agent and customer
summarize_prompt = """Based on the contents of <transcript></transcript>, summarize the call between the agent and the customer, with a focus on the resolution.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": summarize_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
summarization = response_body['content'][0]['text']
print(summarization)

We get the following output:

Based on the contents of the transcript, here is a summary of the call between the agent (Travis) and the customer (Violet King) with a focus on resolution:

Violet King called about a light on the dashboard of her recently purchased 2017 Ford Fusion from Big Jim’s Auto. The light appeared to be an oil light. Travis explained that while their warranty covers certain issues like weather damage and engine lights, it does not cover oil changes. He recommended calling an outsourced oil change service that Big Jim’s Auto typically uses, which charges between $25-35 for an oil change.

Travis provided the phone number for the oil change service (888-333-2222) and mentioned that they are currently backed up due to the recent pandemic. He advised Violet to call them as soon as possible before they close to get her oil changed and resolve the issue with the oil light on her dashboard.

The resolution was for Violet to contact the recommended third-party oil change service to have her car’s oil changed, which should address the oil light issue she was experiencing with her newly purchased vehicle.

Extract the main topics with Amazon Bedrock

The following prompt provides instructions to extract the main topics discussed in the conversation from the transcription:

# Extracting the main topics from the conversation
maintopic_prompt = """The contents of <transcript></transcript> are a transcription of a conversation between an agent and a client. Based on the information, extract the main topics from the conversation.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": maintopic_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
main_topic = response_body['content'][0]['text']
print(main_topic)

We get the following output:

Based on the conversation transcript, the main topics appear to be:

  1. Dashboard warning light (specifically an oil light) on a recently purchased 2017 Ford Fusion.
  2. Determining if the issue is covered under the warranty provided by the dealership (Big Jim’s Auto).
  3. Recommendation to contact an external auto service provider (phone number provided) for an oil change service, as the dealership does not handle oil changes in-house.
  4. Confirming that the external auto service provider can likely resolve the oil light issue by performing an oil change.

Extract the key phrases with Amazon Bedrock

The following prompt provides instructions to extract the key phrases discussed in the conversation from the transcription:

# Extracting the key phrases with Amazon Bedrock
keyphrase_prompt = """The contents of <transcript></transcript> are a transcription of a conversation between an agent and a client. Based on the information, extract the key phrases discussed in the conversation.

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": keyphrase_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
keyphrases = response_body['content'][0]['text']
print(keyphrases)

We get the following output:

Based on the conversation transcript, here are the key phrases discussed:

  • Oil light
  • 2017 Ford Fusion
  • Purchase date: August 10th
  • First call regarding the issue
  • Car mileage unknown
  • Warranty does not cover oil changes
  • Outsourced oil change service recommended
  • Oil change service contact number: 888-333-2222
  • Oil change service cost: $25 – $35
  • Service is backed up due to the pandemic

Extract the reason why the client called the call center with Amazon Bedrock

The following prompt provides instructions to extract the reason for this client call to the call center from the transcription:

# Extracting the reason why client called the call center
reason_prompt = """The content of <transcript></transcript> is a transcription of a conversation between agent and client. Based on the information, extract the reason why the client called the call center <transcript></transcript>. 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": reason_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
reason = response_body['content'][0]['text']
print(reason)

We get the following output:

Based on the transcription, the client called the call center because a light (specifically an oil light) was coming on the dashboard of their recently purchased 2017 Ford Fusion car. The client was seeking guidance on what to do about the oil light being on.

Extract the level of customer satisfaction with Amazon Bedrock

The following prompt provides instructions to extract the level of customer satisfaction experienced by the client from the transcription:

# Extracting the level of Customer Satisfaction 
satisfaction_prompt = """The content of <transcript></transcript> is a transcription of a conversation between agent and client. Based on the information, extract the level of customer satisfaction <transcript></transcript>. 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": satisfaction_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
csat = response_body['content'][0]['text']
print(csat)

We get the following output:

Based on the transcript, the level of customer satisfaction seems to be moderate.

Evidence:

  1. The agent provided clear explanations regarding the issue with the oil light and why oil changes are not covered under their warranty.
  2. The agent offered a recommendation for an external service provider that could perform the oil change, along with their contact information.
  3. The customer acknowledged the information provided by the agent, indicating some level of satisfaction with the response.

However, there are no explicit statements from the customer expressing high satisfaction or dissatisfaction. The interaction remains polite and resolves the customer’s initial query, but there is no strong indication of exceptional satisfaction or disappointment.

Obtain the overall customer sentiment with Amazon Bedrock

The following prompt provides instructions to obtain the overall customer sentiment from the transcription:

# Extracting the overall customer sentiment.
sentiment_prompt = """The content of <transcript></transcript> is a transcription of a conversation between agent and client. Based on the information, what is the overall customer sentiment of the conversation <transcript></transcript>? 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": sentiment_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
sentiment = response_body['content'][0]['text']
print(sentiment)

We get the following output:

Based on the conversation transcript, the overall customer sentiment seems neutral to slightly positive. Although the customer, Violet King, was initially concerned about a warning light on her recently purchased car’s dashboard, the agent (Travis) explained the situation clearly and provided a recommendation for getting an oil change from a third-party service provider. The customer acknowledged and accepted the suggestion without expressing significant frustration or dissatisfaction. The conversation ended on a polite note with the customer thanking the agent.

Obtain sentiment about the products or services discussed during the call with Amazon Bedrock

The following prompt provides instructions to obtain sentiment about products and services discussed during the call from the transcription:

## Obtaining sentiment about the products or services discussed.
sentiment_product_prompt = """The content of <transcript></transcript> is a transcription of a conversation between agent and client. Based on the information, what is the overall sentiment about the products discussed in the conversation <transcript></transcript>? 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": sentiment_product_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
sentiment_product = response_body['content'][0]['text']
print(sentiment_product)

We get the following output:

Based on the conversation transcript, the overall sentiment about the products (the 2017 Ford Fusion car) discussed seems neutral to slightly positive. There are no major complaints or strong negative sentiments expressed about the car itself. The issue raised is regarding an oil light coming on, which seems to be a routine maintenance matter that needs to be addressed by getting an oil change. The agent helpfully provides the customer with information on where to get an oil change done, although mentioning that the warranty does not cover routine oil changes at the dealership. Overall, there is no indication of major dissatisfaction with the car itself from the conversation.

Create an email to send to the client with the call summary

The following prompt provides instructions to create an email to send to the client after the call is completed, summarizing the call:

## Creating an email to send to the client with the call summary
email_prompt = """The content of <transcript></transcript> is a transcription of a conversation between agent and client. Based on the information, create an email to send to the client after the call is completed summarizing the call. 
Use a formal but friendly English style. <transcript></transcript>. 

<transcript>
{transcript}
</transcript>
"""

user_message =  {"role": "user", "content": email_prompt.format(transcript = transcription_output['results']['transcripts'][0]['transcript'])}

messages = [user_message]

body=json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": messages
    }
)
bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(bedrock_response.get('body').read())
email_conversation = response_body['content'][0]['text']
print(email_conversation)

We get the following output:

Dear Ms. Violet King,

I hope this email finds you well. I’m writing to summarize our conversation regarding the issue with your 2017 Ford Fusion that you purchased last Tuesday, August 10th.

During our call, you mentioned that an oil light was illuminated on the dashboard of your vehicle. As discussed, our dealership’s warranty does not cover routine maintenance services like oil changes. However, we have a trusted partner that we typically recommend for such services.

The recommended auto service provider is reachable at 888-333-2222. They are experienced in handling oil changes and should be able to assist you with your vehicle’s needs. Please note that they have been experiencing a high volume of requests due to the recent pandemic, so it’s advisable to call them as soon as possible to schedule an appointment.
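
Each of the preceding analyses repeats the same Anthropic Messages API boilerplate around a different prompt. If you prefer, you can wrap that pattern in a small helper function. The following is a minimal sketch, not part of the original walkthrough; it assumes the bedrock_runtime client, model_id, and transcription_output variables defined earlier in this post, plus any of the prompt templates shown above.

# Optional helper that consolidates the repeated invoke_model pattern (sketch)
import json

def analyze_transcript(prompt_template, transcript, max_tokens=1000):
    # Format the prompt with the transcript and build the Messages API request body
    user_message = {"role": "user", "content": prompt_template.format(transcript=transcript)}
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [user_message],
        }
    )
    # Invoke the model on Amazon Bedrock and return the generated text
    bedrock_response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(bedrock_response.get('body').read())
    return response_body['content'][0]['text']

# Example: reuse the helper with any of the prompt templates shown earlier
transcript_text = transcription_output['results']['transcripts'][0]['transcript']
print(analyze_transcript(maintopic_prompt, transcript_text))

With this helper in place, each analysis becomes a single call that differs only in the prompt template you pass in.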

Conclusion

Using generative AI through Amazon Bedrock to analyze audio transcripts generated by Amazon Transcribe unlocks valuable insights that would otherwise remain hidden within the audio data. By combining the powerful speech-to-text capabilities of Amazon Transcribe with the natural language understanding and generation capabilities of LLMs like those available through Amazon Bedrock, businesses can more efficiently extract key information, generate summaries, identify topics and sentiments, and create new content from their audio and video assets. This approach not only saves time and resources compared to manual transcription and analysis, but also opens up new opportunities for using existing content in innovative ways.

Whether it’s repurposing marketing materials, quickly capturing key points from meetings, or improving customer experience through call center analytics, the combination of Amazon Transcribe and large language models (LLMs) on Amazon Bedrock provides a powerful solution for unlocking the full potential of audio data. As these use cases have demonstrated, this technology can be applied across various domains, from content creation and SEO optimization to business intelligence and customer service. By staying at the forefront of these advancements, organizations can gain a competitive edge by effectively harnessing the wealth of information contained within their audio and video repositories, driving insights, and making more informed decisions.


About the Authors

Ana Maria Echeverri is an AI/ML Worldwide Service Specialist at AWS, focused on driving adoption of generative AI speech analytics use cases. She has worked in the data and AI industry for over 30 years, with over 10 years focused on helping organizations grow their AI maturity and capabilities for successful execution of AI strategies.

Vishesh Jha is a Senior Solutions Architect at AWS. His area of interest lies in generative AI, and he has helped customers and partners get started with NLP using AWS services such as Amazon Bedrock, Amazon Transcribe, and Amazon SageMaker. He is an avid soccer fan, and in his free time enjoys watching and playing the sport. He also loves cooking, gaming, and traveling with his family.

Bala Krishna Jakka is a Technical Account manager at AWS, with a passion for contact center and generative AI technologies. With extensive expertise in helping organizations use cutting-edge solutions, he thrives on staying ahead of the curve in this rapidly evolving field. When not immersed in the realms of AI and customer experience, he finds joy in the game of cricket, showcasing his skills on the pitch. A devoted family man, he cherishes the moments spent with his loved ones, creating lasting memories and finding balance amidst the demands of his professional pursuits.

Read More

From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy

From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy


The essence of the biological world lies in the ever-changing nature of its molecules and their interactions. Understanding the dynamics and interactions of biomolecules is crucial for deciphering the mechanisms behind biological processes and for developing biomaterials and drugs. As Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and wigglings of atoms.” Yet capturing these real-life movements is nearly impossible through experiments. 

In recent years, with the development of deep learning methods represented by AlphaFold and RoseTTAFold, predicting the static crystal protein structures has been achieved with experimental accuracy (as recognized by the 2024 Nobel Prize in Chemistry). However, accurately characterizing dynamics at an atomic resolution remains much more challenging, especially when the proteins play their roles and interact with other biomolecules or drug molecules.

As one approach, Molecular Dynamics (MD) simulation combines the laws of physics with numerical simulations to tackle the challenge of understanding biomolecular dynamics. This method has been widely used for decades to explore the relationship between the movements of molecules and their biological functions. In fact, the significance of MD simulations was underscored when the classic version of this technique was recognized with a Nobel Prize in 2013, highlighting its crucial role in advancing our understanding of complex biological systems. Similarly, the quantum mechanical approach—known as Density Functional Theory (DFT)—received its own Nobel Prize in 1998, marking a pivotal moment in computational chemistry.

In MD simulations, molecules are modeled at the atomic level by numerically solving equations of motion that account for the system’s time evolution, through which kinetic and thermodynamic properties can be computed. MD simulations are used to model the time-dependent motions of biomolecules. If you think of proteins like intricate gears in a clock, AI2BMD doesn’t just capture them in place—it watches them spin, revealing how their movements drive the complex processes that keep life running.

MD simulations can be roughly divided into two classes: classical MD and quantum mechanics. Classical MD employs simplified representations of molecular systems, achieving fast simulation of long-timescale conformational changes but with lower accuracy. In contrast, quantum mechanical models such as Density Functional Theory provide first-principles calculations but are computationally prohibitive for large biomolecules.

Ab initio biomolecular dynamics simulation by AI 

Microsoft Research has been developing efficient methods that aim for ab initio-accuracy simulations of biomolecules. This method, AI2BMD (AI-based ab initio biomolecular dynamics system), has been published in the journal Nature and represents the culmination of a four-year research endeavor.

AI2BMD efficiently simulates a wide range of proteins in all-atom resolution with more than 10,000 atoms at an approximate ab initio—or first-principles—accuracy. It thus strikes a tradeoff previously inaccessible to standard simulation techniques: higher accuracy than classical simulation, at a computational cost that is greater than classical simulation but orders of magnitude lower than DFT. This development could unlock new capabilities in biomolecular modeling, especially for processes where high accuracy is needed, such as protein-drug interactions.

Fig.1 The overall pipeline of AI2BMD. Proteins are divided into protein units by a fragmentation process. The AI2BMD potential is designed based on ViSNet, and the datasets are generated at the DFT level. It calculates the energy and atomic forces for the whole protein. The AI2BMD simulation system is built upon these components and provides a generalizable solution for simulating the molecular dynamics of proteins. It achieves ab initio accuracy in energy and force calculations. Through comprehensive analysis from both kinetics and thermodynamics perspectives, AI2BMD exhibits good alignment with wet-lab experimental data and detects different phenomena compared to molecular mechanics.
Figure 1. The flowchart of AI2BMD

AI2BMD employs a newly designed, generalizable protein fragmentation approach that splits proteins into overlapping units, creating a dataset of 20 million snapshots—the largest ever at the DFT level. Based on our previously designed ViSNet, a universal molecular geometry modeling foundation model published in Nature Communications and incorporated into the PyTorch Geometric library, we trained AI2BMD’s potential energy function using machine learning. Simulations are then performed by the highly efficient AI2BMD simulation system, where at each step the ViSNet-based AI2BMD potential calculates the energy and atomic forces for the protein with ab initio accuracy. Through comprehensive analysis of both kinetics and thermodynamics, AI2BMD shows much better alignment with wet-lab data, such as protein folding free energies, and detects phenomena that classical MD does not.
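
To make the pipeline above concrete, the following is a purely illustrative Python sketch, not the published AI2BMD code: a learned potential predicts forces for overlapping protein fragments, the per-fragment contributions are combined, and the result drives a standard velocity Verlet integration step. The fragment objects, their weights, and MLPotential-style predict calls are hypothetical placeholders standing in for the fragmentation scheme and ViSNet-based potential described in the paper.

import numpy as np

# Illustrative sketch only; `fragments` and `potential` are hypothetical stand-ins
# for AI2BMD's protein-unit fragmentation and ViSNet-based potential.
def total_forces(positions, fragments, potential):
    # Sum per-fragment force predictions into forces on the whole protein
    forces = np.zeros_like(positions)
    for frag in fragments:
        frag_forces = potential.predict(positions[frag.indices])
        forces[frag.indices] += frag.weights[:, None] * frag_forces
    return forces

def velocity_verlet_step(positions, velocities, masses, fragments, potential, dt=5e-4):
    # One MD step driven by the machine-learned (ab initio-level) forces;
    # dt is in arbitrary units for this sketch
    forces = total_forces(positions, fragments, potential)
    velocities += 0.5 * dt * forces / masses[:, None]
    positions += dt * velocities
    forces = total_forces(positions, fragments, potential)
    velocities += 0.5 * dt * forces / masses[:, None]
    return positions, velocities

In the real system, a loop like this runs for many steps, with the learned potential standing in for the DFT calculation that would otherwise dominate the cost.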

Advancing biomolecular MD simulation

AI2BMD represents a significant advancement in the field of MD simulations from the following aspects: 

(1) Ab initio accuracy: AI2BMD introduces a generalizable “machine learning force field,” a machine-learned model of the interactions between atoms and molecules, for full-atom protein dynamics simulations with ab initio accuracy.

Fig.2 Evaluation of energy and force calculations by AI2BMD and molecular mechanics (MM). The upper panel exhibits the folded structures of four evaluated proteins. The lower panel exhibits the mean absolute error (MAE) of potential energy.
Figure 2. Evaluation on the energy calculation error between AI2BMD and Molecular Mechanics (MM) for different proteins. 

(2) Addressing generalization: It is the first to address the generalization challenge of a machine learned force field for simulating protein dynamics, demonstrating robust ab initio MD simulations for a variety of proteins. 

(3) General compatibility: AI2BMD expands the Quantum Mechanics (QM) modeling from small, localized regions to entire proteins without requiring any prior knowledge on the protein. This eliminates the potential incompatibility between QM and MM calculations for proteins and accelerates QM region calculation by several orders of magnitude, bringing near ab initio calculation for full-atom proteins to reality. Consequently, AI2BMD paves the road for numerous downstream applications and allows for a fresh perspective on characterizing complex biomolecular dynamics.

(4) Speed advantage: AI2BMD is several orders of magnitude faster than DFT and other quantum mechanics methods. It supports ab initio calculations for proteins with more than 10,000 atoms, making it one of the fastest AI-driven MD simulation programs across disciplines.

Fig.3 Comparison of time consumption between AI2BMD, DFT and other AI driven simulation software. The left panel shows the time consumption of AI2BMD and DFT. The right panel shows the time consumption of AI2BMD, DPMD and Allegro.
Figure 3. Comparison of time consumption between AI2BMD, DFT and other AI driven simulation software. 

(5) Diverse conformational space exploration: For protein folding and unfolding simulated by AI2BMD and MM, AI2BMD explores more of the conformational space, including regions that MM cannot detect. AI2BMD therefore opens more opportunities to study flexible protein motions during drug-target binding, enzyme catalysis, allosteric regulation, intrinsically disordered proteins, and more, better aligning with wet-lab experiments and providing more comprehensive explanations and guidance for biomechanism detection and drug discovery.

Fig.4 Analysis of the simulation trajectories performed by AI2BMD. In the upper panel, AI2BMD folds the protein Chignolin starting from an unfolded structure and achieves smaller energy error than MM. In the lower panel, it explores more conformational regions that MM cannot detect.
Figure 4. AI2BMD folds the protein Chignolin starting from an unfolded structure, achieves smaller energy error than MM, and explores conformational regions that MM cannot detect.

(6) Experimental agreement: AI2BMD outperforms the QM/MM hybrid approach and demonstrates high consistency with wet-lab experiments on different biological application scenarios, including J-coupling, enthalpy, heat capacity, folding free energy, melting temperature, and pKa calculations.

Looking ahead

Achieving ab initio accuracy in biomolecular simulations is challenging but holds great potential for understanding the mystery of biological systems and designing new biomaterials and drugs. This breakthrough is a testament to the vision of AI for Science—an initiative to channel the capabilities of artificial intelligence to revolutionize scientific inquiry. The proposed framework aims to address limitations regarding accuracy, robustness, and generalization in the application of machine learning force fields. AI2BMD provides generalizability, adaptability, and versatility in simulating various protein systems by considering the fundamental structure of proteins, namely stretches of amino acids. This approach enhances energy and force calculations as well as the estimation of kinetic and thermodynamic properties. 

One key application of AI2BMD is its ability to perform highly accurate virtual screening for drug discovery. In 2023, at the inaugural Global AI Drug Development competition, AI2BMD made a breakthrough by predicting a chemical compound that binds to the main protease of SARS-CoV-2. Its precise predictions surpassed those of all other competitors, securing first place and showcasing its immense potential to accelerate real-world drug discovery efforts.

Since 2022, Microsoft Research has also partnered with the Global Health Drug Discovery Institute (GHDDI), a nonprofit research institute founded and supported by the Gates Foundation, to apply AI technology to the design of drugs for diseases that disproportionately affect low- and middle-income countries (LMICs), such as tuberculosis and malaria. We are now closely collaborating with GHDDI to leverage AI2BMD and other AI capabilities to accelerate the drug discovery process.

AI2BMD can help advance solutions to scientific problems and enable new biomedical research in drug discovery, protein design, and enzyme engineering.  

The post From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy appeared first on Microsoft Research.

Read More

NVIDIA Advances Robot Learning and Humanoid Development With New AI and Simulation Tools

NVIDIA Advances Robot Learning and Humanoid Development With New AI and Simulation Tools

Robotics developers can greatly accelerate their work on AI-enabled robots, including humanoids, using new AI and simulation tools and workflows that NVIDIA revealed this week at the Conference for Robot Learning (CoRL) in Munich, Germany.

The lineup includes the general availability of the NVIDIA Isaac Lab robot learning framework; six new humanoid robot learning workflows for Project GR00T, an initiative to accelerate humanoid robot development; and new world-model development tools for video data curation and processing, including the NVIDIA Cosmos tokenizer and NVIDIA NeMo Curator for video processing.

The open-source Cosmos tokenizer provides robotics developers superior visual tokenization by breaking down images and videos into high-quality tokens with exceptionally high compression rates. It runs up to 12x faster than current tokenizers, while NeMo Curator provides video processing curation up to 7x faster than unoptimized pipelines.

Also timed with CoRL, NVIDIA presented 23 papers and nine workshops related to robot learning and released training and workflow guides for developers. Further, Hugging Face and NVIDIA announced they’re collaborating to accelerate open-source robotics research with LeRobot, NVIDIA Isaac Lab and NVIDIA Jetson for the developer community.

Accelerating Robot Development With Isaac Lab 

NVIDIA Isaac Lab is an open-source, robot learning framework built on NVIDIA Omniverse, a platform for developing OpenUSD applications for industrial digitalization and physical AI simulation.

Developers can use Isaac Lab to train robot policies at scale. This open-source unified robot learning framework applies to any embodiment — from humanoids to quadrupeds to collaborative robots — to handle increasingly complex movements and interactions.

Leading commercial robot makers, robotics application developers and robotics research entities around the world are adopting Isaac Lab, including 1X, Agility Robotics, The AI Institute, Berkeley Humanoid, Boston Dynamics, Field AI, Fourier, Galbot, Mentee Robotics, Skild AI, Swiss-Mile, Unitree Robotics and XPENG Robotics.

Project GR00T: Foundations for General-Purpose Humanoid Robots 

Building advanced humanoids is extremely difficult, demanding multilayer technological and interdisciplinary approaches to make the robots perceive, move and learn skills effectively for human-robot and robot-environment interactions.

Project GR00T is an initiative to develop accelerated libraries, foundation models and data pipelines to accelerate the global humanoid robot developer ecosystem.

Six new Project GR00T workflows provide humanoid developers with blueprints to realize the most challenging humanoid robot capabilities. They include:

  • GR00T-Gen for building generative AI-powered, OpenUSD-based 3D environments
  • GR00T-Mimic for robot motion and trajectory generation
  • GR00T-Dexterity for robot dexterous manipulation
  • GR00T-Control for whole-body control
  • GR00T-Mobility for robot locomotion and navigation
  • GR00T-Perception for multimodal sensing

“Humanoid robots are the next wave of embodied AI,” said Jim Fan, senior research manager of embodied AI at NVIDIA. “NVIDIA research and engineering teams are collaborating across the company and our developer ecosystem to build Project GR00T to help advance the progress and development of global humanoid robot developers.”

New Development Tools for World Model Builders

Today, robot developers are building world models — AI representations of the world that can predict how objects and environments respond to a robot’s actions. Building these world models is incredibly compute- and data-intensive, with models requiring thousands of hours of real-world, curated image or video data.

NVIDIA Cosmos tokenizers provide efficient, high-quality encoding and decoding to simplify the development of these world models. They set a new standard for minimizing distortion and temporal instability, enabling high-quality video and image reconstructions.

Providing high-quality compression and up to 12x faster visual reconstruction, the Cosmos tokenizer paves the path for scalable, robust and efficient development of generative applications across a broad spectrum of visual domains.

1X, a humanoid robot company, has updated the 1X World Model Challenge dataset to use the Cosmos tokenizer.

“NVIDIA Cosmos tokenizer achieves really high temporal and spatial compression of our data while still retaining visual fidelity,” said Eric Jang, vice president of AI at 1X Technologies. “This allows us to train world models with long horizon video generation in an even more compute-efficient manner.”

Other humanoid and general-purpose robot developers, including XPENG Robotics and Hillbot, are developing with the NVIDIA Cosmos tokenizer to manage high-resolution images and videos.

NeMo Curator now includes a video processing pipeline. This enables robot developers to improve their world-model accuracy by processing large-scale text, image and video data.

Curating video data poses challenges due to its massive size, requiring scalable pipelines and efficient orchestration for load balancing across GPUs. Additionally, models for filtering, captioning and embedding need optimization to maximize throughput.

NeMo Curator overcomes these challenges by streamlining data curation with automatic pipeline orchestration, reducing processing time significantly. It supports linear scaling across multi-node, multi-GPU systems, efficiently handling over 100 petabytes of data. This simplifies AI development, reduces costs and accelerates time to market.

Advancing the Robot Learning Community at CoRL

The nearly two dozen research papers the NVIDIA robotics team released with CoRL cover breakthroughs in integrating vision language models for improved environmental understanding and task execution, temporal robot navigation, long-horizon planning strategies for complex multistep tasks, and using human demonstrations for skill acquisition.

Groundbreaking papers for humanoid robot control and synthetic data generation include SkillGen, a system based on synthetic data generation for training robots with minimal human demonstrations, and HOVER, a robot foundation model for controlling humanoid robot locomotion and manipulation.

NVIDIA researchers will also be participating in nine workshops at the conference. Learn more about the full schedule of events.

Availability

NVIDIA Isaac Lab 1.2 is available now and is open source on GitHub. NVIDIA Cosmos tokenizer is available now on GitHub and Hugging Face. NeMo Curator for video processing will be available at the end of the month.

The new NVIDIA Project GR00T workflows are coming soon to help robot companies build humanoid robot capabilities with greater ease. Read more about the workflows on the NVIDIA Technical Blog.

Researchers and developers learning to use Isaac Lab can now access developer guides and tutorials, including an Isaac Gym to Isaac Lab migration guide.

Discover the latest in robot learning and simulation in an upcoming OpenUSD insider livestream on Nov. 13, and attend the NVIDIA Isaac Lab office hours for hands-on support and insights.

Developers can apply to join the NVIDIA Humanoid Robot Developer Program.

Read More

Hugging Face and NVIDIA to Accelerate Open-Source AI Robotics Research and Development

Hugging Face and NVIDIA to Accelerate Open-Source AI Robotics Research and Development

At the Conference for Robot Learning (CoRL) in Munich, Germany, Hugging Face and NVIDIA announced a collaboration to accelerate robotics research and development by bringing together their open-source robotics communities.

Hugging Face’s LeRobot open AI platform combined with NVIDIA AI, Omniverse and Isaac robotics technology will enable researchers and developers to drive advances across a wide range of industries, including manufacturing, healthcare and logistics.

Open-Source Robotics for the Era of Physical AI

The era of physical AI — robots understanding physical properties of environments — is here, and it’s rapidly transforming the world’s industries.

To drive and sustain this rapid innovation, robotics researchers and developers need access to open-source, extensible frameworks that span the development process of robot training, simulation and inference. With models, datasets and workflows released under shared frameworks, the latest advances are readily available for use without the need to recreate code.

Hugging Face’s leading open AI platform serves more than 5 million machine learning researchers and developers, offering tools and resources to streamline AI development. Hugging Face users can access and fine-tune the latest pretrained models and build AI pipelines on common APIs with over 1.5 million models, datasets and applications freely accessible on the Hugging Face Hub.

LeRobot, developed by Hugging Face, extends the successful paradigms from its  Transformers and Diffusers libraries into the robotics domain. LeRobot offers a comprehensive suite of tools for sharing data collection, model training and simulation environments along with designs for low-cost manipulator kits.

NVIDIA’s AI and simulation technologies and open-source modular robot learning frameworks such as NVIDIA Isaac Lab can accelerate LeRobot’s data collection, training and verification workflow. Researchers and developers can share their models and datasets built with LeRobot and Isaac Lab, creating a data flywheel for the robotics community.

Scaling Robot Development With Simulation

Developing physical AI is challenging. Unlike language models that use extensive internet text data, physics-based robotics relies on physical interaction data along with vision sensors, which is harder to gather at scale. Collecting real-world robot data for dexterous manipulation across a large number of tasks and environments is time-consuming and labor-intensive.

Making this easier, Isaac Lab, built on NVIDIA Isaac Sim, enables robot training by demonstration or trial-and-error in simulation using  high-fidelity rendering and physics simulation to create realistic synthetic environments and data. By combining GPU-accelerated physics simulations and parallel environment execution, Isaac Lab provides the ability to generate vast amounts of training data — equivalent to thousands of real-world experiences — from a single demonstration.

Generated motion data is then used to train a policy with imitation learning. After successful training and validation in simulation, the policies are deployed on a real robot, where they are further tested and tuned to achieve optimal performance.

This iterative process leverages real-world data’s accuracy and the scalability of simulated synthetic data, ensuring robust and reliable robotic systems.

By sharing these datasets, policies and models on Hugging Face, a robot data flywheel is created that enables developers and researchers to build upon each other’s work, accelerating progress in the field.

“The robotics community thrives when we build together,” said Animesh Garg, assistant professor at Georgia Tech. “By embracing open-source frameworks such as Hugging Face’s LeRobot and NVIDIA Isaac Lab, we accelerate the pace of research and innovation in AI-powered robotics.”

Fostering Collaboration and Community Engagement

The planned collaborative workflow involves collecting data through teleoperation and simulation in Isaac Lab and storing it in the standard LeRobotDataset format. Data generated using GR00T-Mimic will then be used to train a robot policy with imitation learning, which is subsequently evaluated in simulation. Finally, the validated policy is deployed on real-world robots with NVIDIA Jetson for real-time inference.

The initial steps in this collaboration have already been taken, including a demonstration of a physical picking setup with LeRobot software running on NVIDIA Jetson Orin Nano, a powerful, compact compute platform for deployment.

“Combining Hugging Face open-source community with NVIDIA’s hardware and Isaac Lab simulation has the potential to accelerate innovation in AI for robotics,” said Remi Cadene, principal research scientist at LeRobot.

This work builds on NVIDIA’s community contributions in generative AI at the edge, supporting the latest open models and libraries, such as Hugging Face Transformers, and optimizing inference for large language models (LLMs), small language models (SLMs) and multimodal vision-language models (VLMs), along with their action-based variants known as vision language action models (VLAs), diffusion policies and speech models, all with strong, community-driven support.

Together, Hugging Face and NVIDIA aim to accelerate the work of the global ecosystem of robotics researchers and developers transforming industries ranging from transportation to manufacturing and logistics.

Learn about NVIDIA’s robotics research papers at CoRL, including VLM integration for better environmental understanding, temporal navigation and long-horizon planning. Check out workshops at CoRL with NVIDIA researchers.

Read More

Get Plugged In: How to Use Generative AI Tools in Obsidian

Get Plugged In: How to Use Generative AI Tools in Obsidian

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

As generative AI evolves and accelerates industry, a community of AI enthusiasts is experimenting with ways to integrate the powerful technology into common productivity workflows.

Applications that support community plug-ins give users the power to explore how large language models (LLMs) can enhance a variety of workflows. By using local inference servers powered by the NVIDIA RTX-accelerated llama.cpp software library, users on RTX AI PCs can integrate local LLMs with ease.

Previously, we looked at how users can take advantage of Leo AI in the Brave web browser to optimize the web browsing experience. Today, we look at Obsidian, a popular writing and note-taking application, based on the Markdown markup language, that’s useful for keeping complex and linked records for multiple projects. The app supports community-developed plug-ins that bring additional functionality, including several that enable users to connect Obsidian to a local inferencing server like Ollama or LM Studio.

Using Obsidian and LM Studio to generate notes with a 27B-parameter LLM accelerated by RTX.

Connecting Obsidian to LM Studio only requires enabling the local server functionality in LM Studio by clicking on the “Developer” icon on the left panel, loading any downloaded model, enabling the CORS toggle and clicking “Start.” Take note of the chat completion URL from the “Developer” log console (“http://localhost:1234/v1/chat/completions” by default), as the plug-ins will need this information to connect.
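
Before configuring the plug-ins, you can optionally confirm that the local server responds by sending a request to the same OpenAI-compatible chat completions endpoint from Python. This is a minimal sketch assuming the default URL noted above, the requests library, and a placeholder model name; substitute whatever model you loaded in LM Studio.

import requests

# Quick check against LM Studio's OpenAI-compatible local server
url = "http://localhost:1234/v1/chat/completions"   # default endpoint from the Developer log
payload = {
    "model": "gemma-2-27b-instruct",  # placeholder; use the model name shown in LM Studio
    "messages": [{"role": "user", "content": "Suggest three note titles about Lunar City."}],
    "temperature": 0.7,
}
response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

If this returns a completion, the plug-ins should be able to reach the same endpoint.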

Next, launch Obsidian and open the “Settings” panel. Click “Community plug-ins” and then “Browse.” There are several community plug-ins related to LLMs, but two popular options are Text Generator and Smart Connections.

  • Text Generator is helpful for generating content in an Obsidian vault, like notes and summaries on a research topic.
  • Smart Connections is useful for asking questions about the contents of an Obsidian vault, such as the answer to an obscure trivia question previously saved years ago.

Each plug-in has its own way of entering the LM Server URL.

For Text Generator, open the settings and select “Custom” for “Provider profile” and paste the whole URL into the “Endpoint” field. For Smart Connections, configure the settings after starting the plug-in. In the settings panel on the right side of the interface, select “Custom Local (OpenAI Format)” for the model platform. Then, enter the URL and the model name (e.g., “gemma-2-27b-instruct”) into their respective fields as they appear in LM Studio.

Once the fields are filled in, the plug-ins will function. The LM Studio user interface will also show logged activity if users are curious about what’s happening on the local server side.

Transforming Workflows With Obsidian AI Plug-Ins

Both the Text Generator and Smart Connections plug-ins use generative AI in compelling ways.

For example, imagine a user wants to plan a vacation to the fictitious destination of Lunar City and brainstorm ideas for what to do there. The user would start a new note, titled “What to Do in Lunar City.” Since Lunar City is not a real place, the query sent to the LLM will need to include a few extra instructions to guide the responses. Click the Text Generator plug-in icon, and the model will generate a list of activities to do during the trip.

Obsidian, via the Text Generator plug-in, will request LM Studio to generate a response, and in turn LM Studio will run the Gemma 2 27B model. With RTX GPU acceleration in the user’s computer, the model can quickly generate a list of things to do.

The Text Generator community plug-in in Obsidian enables users to connect to an LLM in LM Studio and generate notes for an imaginary vacation.

Or, suppose many years later the user’s friend is going to Lunar City and wants to know where to eat. The user may not remember the names of the places where they ate, but they can check the notes in their vault (Obsidian’s term for a collection of notes) in case they’d written something down.

Rather than looking through all of the notes manually, a user can use the Smart Connections plug-in to ask questions about their vault of notes and other content. The plug-in uses the same LM Studio server to respond to the request, and provides relevant information it finds from the user’s notes to assist the process. The plug-in does this using a technique called retrieval-augmented generation.

The Smart Connections community plug-in in Obsidian uses retrieval-augmented generation and a connection to LM Studio to enable users to query their notes.
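
Conceptually, retrieval-augmented generation embeds the notes, retrieves the ones most similar to the question, and includes them as context in the prompt sent to the model. The snippet below is an illustrative outline only, not the Smart Connections implementation: embed_text is a toy hashed bag-of-words stand-in for a real embedding model, and the endpoint and model name mirror the LM Studio example above.

import numpy as np
import requests

def embed_text(text, dim=256):
    # Toy stand-in for a real embedding model: hashed bag-of-words vector
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def answer_from_notes(question, notes, top_k=3,
                      url="http://localhost:1234/v1/chat/completions"):
    # Rank notes by cosine similarity to the question embedding
    q = embed_text(question)
    scores = []
    for note in notes:
        v = embed_text(note)
        scores.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)))
    top_notes = [note for _, note in sorted(zip(scores, notes), reverse=True)[:top_k]]
    context = "\n\n".join(top_notes)

    # Send the retrieved context plus the question to the local model
    payload = {
        "model": "gemma-2-27b-instruct",  # placeholder model name
        "messages": [{"role": "user",
                      "content": f"Answer using only these notes:\n{context}\n\nQuestion: {question}"}],
    }
    response = requests.post(url, json=payload, timeout=120)
    return response.json()["choices"][0]["message"]["content"]

In practice, the plug-in manages the embeddings and retrieval for you; the sketch just shows the shape of the technique.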

These are fun examples, but after spending some time with these capabilities, users can see real benefits and improvements for everyday productivity. These Obsidian plug-ins are just two ways in which community developers and AI enthusiasts are embracing AI to supercharge their PC experiences.

NVIDIA GeForce RTX technology for Windows PCs can run thousands of open-source models for developers to integrate into their Windows apps.

Learn more about the power of LLMs and the Text Generator and Smart Connections plug-ins by integrating Obsidian into your workflow, and play with the accelerated experience available on RTX AI PCs.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

Abstracts: November 5, 2024

Abstracts: November 5, 2024


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Microsoft senior principal researchers Chris Hawblitzel and Jay Lorch join host Amber Tingle to discuss “Verus: A Practical Foundation for Systems Verification,” which received the Distinguished Artifact Award at this year’s Symposium on Operating Systems Principles, or SOSP. In their research, Hawblitzel, Lorch, and their coauthors leverage advances in programming languages and formal verification with two aims. The first aim is to help make software verification more accessible for systems developers so they can demonstrate their code will behave as intended. The second aim is to provide the research community with sound groundwork to tackle the application of formal verification to large, complex systems. 

Transcript 

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Our guests today are Chris Hawblitzel and Jay Lorch. They are both senior principal researchers at Microsoft and two of the coauthors on a paper called “Verus: A Practical Foundation for Systems Verification.” This work received the Distinguished Artifact Award at the 30th Symposium on Operating Systems Principles, also known as SOSP, which is happening right now in Austin, Texas. Chris and Jay, thank you for joining us today for Abstracts and congratulations!

JAY LORCH: Thank you for having us. 

CHRIS HAWBLITZEL: Glad to be here. 

TINGLE: Chris, let’s start with an overview. What problem does this research address, and why is Verus something that the broader research community should know about? 


HAWBLITZEL: So what we’re trying to address is a very simple problem where we’re trying to help developers write software that doesn’t have bugs in it. And we’re trying to provide a tool with Verus that will help developers show that their code actually behaves the way it’s supposed to; it obeys some sort of specification for what the program is supposed to do. 

TINGLE: How does this publication build on or differ from other research in this field, including your previous Verus-related work? 

HAWBLITZEL: So formal verification is a process where you write down what it is that you want your program to do in mathematical terms. So if you’re writing an algorithm to sort a list, for example, you might say that the output of this algorithm should be a new list that is a rearrangement of the elements of the old list, but now this rearrangement should be in sorted order. So you can write that down using standard mathematics. And now given that mathematical specification, the challenge is to prove that your piece of software written in a particular language, like Java or C# or Rust, actually generates an output that meets that mathematical specification. So this idea of using verification to prove that your software obeys some sort of specification, this has been around for a long time, so, you know, even Alan Turing talked about ways of doing this many, many decades ago. The challenge has always been that it’s really hard to develop these proofs for any large piece of software. It simply takes a long time for a human being to write down a proof of correctness of their software. And so what we’re trying to do is to build on earlier work in verification and recent developments in programming languages to try to make this as easy as possible and to try to make it as accessible to ordinary software developers as possible. So we’ve been using existing tools. There are automated theorem provers—one of them from Microsoft Research called Z3—where you give it a mathematical formula and ask it to prove that the formula is valid. We’re building on that. And we’re also taking a lot of inspiration from tools developed at Microsoft Research and elsewhere, like Dafny and F* and so on, that we’ve used in the past for our previous verification projects. And we’re trying to take ideas from those and make them accessible to developers who are using common programming languages. In this case, the Rust programming language is what we’re focusing on. 

TINGLE: Jay, could you describe your methodology for us and maybe share a bit about how you and your coauthors tested the robustness of Verus.

LORCH: So the question we really want to answer is, is Verus suitable for systems programming? So that means a variety of things. Is it amenable to a variety of kinds of software that you want to build as part of a system? Is it usable by developers? Can they produce compact proofs? And can they get timely feedback about those proofs? Can the verifier tell you quickly that your proof is correct or, if it’s wrong, that it’s wrong and guide you to fix it? So the main two methodological techniques we used were millibenchmarks and full systems. So the millibenchmarks are small pieces of programs that have been verified by other tools in the past, and we built them in Verus and compared to what other tools would do to find whether we could improve usability. And we found generally that we could verify the same things but with more compact proofs and proofs that would give much snappier feedback. The difference between one second and 10 seconds might not seem a lot, but when you’re writing code and working with the verifier, it’s much nicer to get immediate feedback about what is wrong with your proof so you can say, oh, what about this? And it can say, oh, well, I still see a problem there. And you could say, OK, let me fix that. As opposed to waiting 10, 20 seconds between each such query to the verifier. So the millibenchmarks helped us evaluate that. And the macrobenchmarks, the building entire systems, we built a couple of distributed systems that had been verified before—a key value store and a node replication system—to show that you could do them more effectively and with less verification time. We also built some new systems, a verified OS page table, a memory allocator, and a persistent memory append-only log. 

TINGLE: Chris, the paper mentions that successfully verifying system software has required—you actually use the word heroic to describe the developer effort. Thinking of those heroes in the developer community and perhaps others, what real-world impact do you expect Verus to have? What kind of gains are we talking about here? 

HAWBLITZEL: Yeah, so I think, you know, traditionally verification or this formal software verification that we’re doing has been considered a little bit of a pie-in-the-sky research agenda. Something that people have applied to small research problems but has not necessarily had a real-world impact before. And so I think it’s just, you know, recently, in the last 10 or 15 years, that we started to see a change in this and started to see verified software actually deployed in practice. So on one of our previous projects, we worked on verifying the cryptographic primitives that people use when, say, they browse the web or something and their data is encrypted. So in these cryptographic primitives, there’s a very clear specification for exactly what bytes you’re supposed to produce when you encrypt some data. And the challenge is just writing software that actually performs those operations and does so efficiently. So in one of our previous projects that we worked on called HACL* and EverCrypt, we verified some of the most commonly used and efficient cryptographic primitives for things like encryption and hashing and so on. And these are things that are actually used on a day-to-day basis. So we, kind of, took from that experience that the tools that we’re building are getting ready for prime time here. We can actually verify software that is security critical, reliability critical, and is in use. So some of the things that Jay just mentioned, like verifying, you know, persistent memory storage systems and so on, those are the things that we’re looking at next for software that would really benefit from reliability and where we can formally prove that your data that’s written to disk is read correctly back from disk and not lost during a crash, for example. So that’s the kind of software that we’re looking to verify to try to have a real-world impact. 

LORCH: The way I see the real-world impact, is it going to enable Microsoft to deal with a couple of challenges that are severe and increasing in scale? So the first challenge is attackers, and the second challenge is the vast scale at which we operate. There’s a lot of hackers out there with a lot of resources that are trying to get through our defenses, and every bug that we have offers them purchase, and techniques like this, that can get rid of bugs, allow us to deal with that increasing attacker capability. The other challenge we have is scale. We have billions of customers. We have vast amounts of data and compute power. And when you have a bug that you’ve thoroughly tested but then you run it on millions of computers over decades, those rare bugs eventually crop up. So they become a problem, and traditional testing has a lot of difficulty finding those. And this technology, which enables us to reason about the infinite possibilities in a finite amount of time and observe all possible ways that the system can go wrong and make sure that it can deal with them, that enables us to deal with the vast scale that Microsoft operates on today.

HAWBLITZEL: Yeah, and I think this is an important point that differentiates us from testing. Traditionally, you find a bug when you see that bug happen in running software. With formal verification, we’re catching the bugs before you run the software at all. We’re trying to prove that on all possible inputs, on all possible executions of the software, these bugs will not happen, and it’s much cheaper to fix bugs before you’ve deployed the software that has bugs, before attackers have tried to exploit those bugs. 

TINGLE: So, Jay, ideally, what would you like our listeners and your fellow SOSP conference attendees to tell their colleagues about Verus? What’s the key takeaway here? 

LORCH: I think the key takeaway is that it is possible now to build software without bugs, to build systems code that is going to obey its specification on all possible inputs always. We have that technology. And this is possible now because a lot of technology has advanced to the point where we can use it. So for one thing, there’s advances in programming languages. People are moving from C to Rust. They’ve discovered that you can get the high performance that you want for systems code without having to sacrifice the ability to reason about ownership and lifetimes, concurrency. The other thing that we build on is advances in computer-aided theorem proving. So we can really make compact and quick-to-verify mathematical descriptions of all possible behaviors of a program and get fast answers that allow us to rapidly turn around proof challenges from developers. 

TINGLE: Well, finally, Chris, what are some of the open questions or future opportunities for formal software verification research, and what might you and your collaborators tackle next? I heard a few of the things earlier. 

HAWBLITZEL: Yes, I think despite, you know, the effort that we and many other researchers have put into trying to make these tools more accessible, trying to make them easier to use, there still is a lot of work to prove a piece of software correct, even with advanced state-of-the-art tools. And so we’re still going to keep trying to push to make that easier. Trying to figure out how to automate the process better. There’s a lot of interest right now in artificial intelligence for trying to help with this, especially if you think about artificial intelligence actually writing software. You ask it to write a piece of software to do a particular task, and it generates some C code or some Rust code or some Java code, and then you hope that that’s correct because it could have generated any sort of code that performs the right thing or does total nonsense. So it would be really great going forward if when we ask AI to develop software, we also expect it to create a proof that the software is correct and does what the user asked for. We’ve started working on some projects, and we found that the AI is not quite there yet for realistic code. It can do small examples this way. But I think this is still a very large challenge going forward that could have a large payoff in the future if we can get AI to develop software and prove that the software is correct. 

LORCH: Yeah, I see there’s a lot of synergy between—potential synergy—between AI and verification. Artificial intelligence can solve one of the key challenges of verification, namely making it easy for developers to write that code. And verification can solve one of the key challenges of AI, which is hallucinations, synthesizing code that is not correct, and Verus can verify that that code actually is correct. 

TINGLE: Well, Chris Hawblitzel and Jay Lorch, thank you so much for joining us today on the Microsoft Research Podcast to discuss your work on Verus. 

[MUSIC] 

HAWBLITZEL: Thanks for having us. 

LORCH: Thank you. 

TINGLE: And to our listeners, we appreciate you, too. If you’d like to learn more about Verus, you’ll find a link to the paper at aka.ms/abstracts or you can read it on the SOSP website. Thanks for tuning in. I’m Amber Tingle, and we hope you’ll join us again for Abstracts.

[MUSIC FADES] 

The post Abstracts: November 5, 2024 appeared first on Microsoft Research.

Austin Calling: As Texas Absorbs Influx of Residents, Rekor Taps NVIDIA Technology for Roadway Safety, Traffic Relief

Austin is drawing people to jobs, music venues, comedy clubs, barbecue and more. But with this boom has come a case of big-city blues: traffic jams.

Rekor, which offers traffic management and public safety analytics, has a front-row seat to the increasing traffic from the influx of new residents migrating to Austin. The company works with the Texas Department of Transportation, which has a $7 billion project addressing the congestion, to help mitigate these roadway concerns.

“Texas has been trying to meet that growth and demand on the roadways by investing a lot in infrastructure, and they’re focusing a lot on digital infrastructure,” said Shervin Esfahani, vice president of global marketing and communications at Rekor. “It’s super complex, and they realized their traditional systems were unable to really manage and understand it in real time.”

Rekor, based in Columbia, Maryland, has been harnessing NVIDIA Metropolis for real-time video understanding and NVIDIA Jetson Xavier NX modules for edge AI in Texas, Florida, Philadelphia, Georgia, Nevada, Oklahoma and many more U.S. destinations as well as in Israel and other places internationally.

Metropolis is an application framework for smart infrastructure development with vision AI. It provides developer tools, including the NVIDIA DeepStream SDK, NVIDIA TAO Toolkit, pretrained models on the NVIDIA NGC catalog and NVIDIA TensorRT. NVIDIA Jetson is a compact, powerful and energy-efficient accelerated computing platform used for embedded and robotics applications.
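For readers curious what this stack looks like in code, the following is a minimal sketch of a DeepStream-style pipeline driven from Python. It assumes the DeepStream SDK and its GStreamer plugins are installed; the video file, resolution and detector config are placeholders following typical DeepStream samples, not anything Rekor ships.

```python
# Minimal sketch of a DeepStream-style GStreamer pipeline in Python.
# Assumes the NVIDIA DeepStream SDK and its GStreamer plugins (nvstreammux,
# nvinfer, nvdsosd, ...) are installed; paths and properties are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=traffic.mp4 ! decodebin ! "        # read and decode a sample clip
    "nvvideoconvert ! mux.sink_0 "                        # move frames into GPU memory
    "nvstreammux name=mux batch-size=1 width=1280 height=720 ! "  # batch frames
    "nvinfer config-file-path=vehicle_detector.txt ! "    # run a vehicle-detection model
    "nvvideoconvert ! nvdsosd ! "                         # draw bounding boxes
    "fakesink"                                            # discard output in this sketch
)

pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```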

Rekor’s efforts in Texas and Philadelphia to help better manage roads with AI are the latest development in an ongoing story for traffic safety and traffic management.

Reducing Rubbernecking, Pileups, Fatalities and Jams

Rekor offers two main products: Rekor Command and Rekor Discover. Command is an AI-driven platform for traffic management centers that rapidly identifies traffic events and zones of concern. It gives departments of transportation real-time situational awareness and alerts, allowing them to keep city roadways safer and less congested.

Discover taps into Rekor’s edge system to fully automate the capture of comprehensive traffic and vehicle data and provides robust traffic analytics that turn roadway data into measurable, reliable traffic knowledge. With Rekor Discover, departments of transportation can see a full picture of how vehicles move on roadways and the impact they make, allowing them to better organize and execute their future city-building initiatives.

The company has deployed Command across Austin to help detect issues, analyze incidents and respond to roadway activity with a real-time view.

“For every minute an incident happens and stays on the road, it creates four minutes of traffic, which puts a strain on the road, and the likelihood of a secondary incident like an accident from rubbernecking massively goes up,” said Paul-Mathew Zamsky, vice president of strategic growth and partnerships at Rekor. “Austin deployed Rekor Command and saw a 159% increase in incident detections, and they were able to respond eight and a half minutes faster to those incidents.”

Rekor Command takes in many feeds of data — like traffic camera footage, weather, connected car info and construction updates — and taps into any other data infrastructure, as well as third-party data. It then uses AI to make connections and surface anomalies, like a roadside incident. That information is presented in workflows to traffic management centers for review, confirmation and response.

“They look at it and respond to it, and they are doing it faster than ever before,” said Esfahani. “It helps save lives on the road, and it also helps people’s quality of life, helps them get home faster and stay out of traffic, and it reduces the strain on the system in the city of Austin.”
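The exact fusion logic is Rekor's own, but the general pattern is easy to picture. The hypothetical sketch below, which invents its own event schema purely for illustration, flags a road segment when several independent feeds report activity there within a short window.

```python
# Hypothetical sketch (not Rekor's implementation) of the fusion idea:
# combine independent roadway feeds and surface an anomaly when several
# of them point at the same stretch of road within a short time window.
from dataclasses import dataclass
from collections import defaultdict
from datetime import datetime, timedelta

@dataclass
class FeedEvent:
    source: str        # e.g. "camera", "connected_car", "construction"
    segment_id: str    # road segment the event refers to
    timestamp: datetime
    detail: str

def surface_anomalies(events, window=timedelta(minutes=5), min_sources=2):
    """Group events by road segment and flag segments where multiple
    independent sources report activity inside the time window."""
    by_segment = defaultdict(list)
    for e in events:
        by_segment[e.segment_id].append(e)

    anomalies = []
    for segment, evts in by_segment.items():
        evts.sort(key=lambda e: e.timestamp)
        for e in evts:
            in_window = [x for x in evts
                         if timedelta(0) <= x.timestamp - e.timestamp <= window]
            if len({x.source for x in in_window}) >= min_sources:
                anomalies.append((segment, in_window))
                break
    return anomalies
```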

In addition to adopting NVIDIA’s full-stack accelerated computing for roadway intelligence, Rekor is going all in on NVIDIA AI and NVIDIA AI Blueprints, which are reference workflows for generative AI use cases, built with NVIDIA NIM microservices as part of the NVIDIA AI Enterprise software platform. NVIDIA NIM is a set of easy-to-use inference microservices for accelerating deployments of foundation models on any cloud or data center while keeping data secure.
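NIM containers expose an OpenAI-compatible HTTP API, so calling a self-hosted microservice can be as simple as the sketch below. The base URL, API key and model name are placeholders for whatever NIM you actually deploy.

```python
# Minimal sketch of calling a self-hosted NVIDIA NIM endpoint through its
# OpenAI-compatible API. Base URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # NIM containers expose an OpenAI-style /v1 API
    api_key="not-used-for-local-nim",      # a local NIM typically ignores the key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",    # placeholder model name
    messages=[{"role": "user",
               "content": "Summarize the last hour of incident reports for this road segment."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```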

“Rekor has multiple large language models and vision language models running on NVIDIA Triton Inference Server in production,” according to Shai Maron, senior vice president of global software and data engineering at Rekor. 

“Internally, we’ll use it for data annotation, and it will help us optimize different aspects of our day to day,” he said. “LLMs externally will help us calibrate our cameras in a much more efficient way and configure them.”
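Rekor hasn't published its serving code, but querying a model hosted on Triton from Python typically looks like the sketch below, using the official client library. The model name and tensor names are placeholders that depend on the model's configuration, not on Rekor's setup.

```python
# Sketch of querying a model on NVIDIA Triton Inference Server with the
# official Python client. Model name and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([b"Describe the camera's field of view."], dtype=np.object_)
inp = httpclient.InferInput("TEXT", [1], "BYTES")
inp.set_data_from_numpy(prompt)

result = client.infer(
    model_name="camera_calibration_llm",               # placeholder model name
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(result.as_numpy("OUTPUT"))
```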

Rekor is using the NVIDIA AI Blueprint for video search and summarization to build AI agents for city services, particularly in areas such as traffic management, public safety and optimization of city infrastructure. NVIDIA recently announced the AI Blueprint for video search and summarization, which enables a range of interactive visual AI agents that extract complex activities from massive volumes of live or archived video.

Philadelphia Monitors Roads, EV Charger Needs, Pollution

Philadelphia Navy Yard is a tourism hub run by the Philadelphia Industrial Development Corporation (PIDC), which faces challenges in managing roads and gathering data on new developments in the popular area. The Navy Yard occupies 1,200 acres and hosts more than 150 companies and 15,000 employees, and a $6 billion redevelopment plan promises to bring 12,000-plus new jobs and thousands of new residents to the area.

PIDC sought greater visibility into the effects of road closures and construction projects on mobility, and into how to keep traffic moving during significant projects and events. PIDC also looked to strengthen the Navy Yard’s ability to understand the volume and flow of car carriers and other large vehicles, and to quantify the impact of speed-mitigating devices deployed across hazardous stretches of roadway.

Discover provided PIDC with insights into the additional infrastructure projects needed to manage changes in traffic.

By pulling insights from Rekor’s edge systems, built with NVIDIA Jetson Xavier NX modules for powerful edge processing and AI, Rekor Discover lets the Navy Yard understand how many electric vehicles (EVs) are entering and leaving, and where, giving PIDC clear insights for planning potential sites of future EV charging station deployments.

Rekor Discover enabled PIDC planners to create a hotspot map of EV traffic by looking at data provided by the AI platform. The solution relies on real-time traffic analysis using NVIDIA’s DeepStream data pipeline and Jetson. Additionally, it uses NVIDIA Triton Inference Server to enhance LLM capabilities.
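The hotspot map itself boils down to an aggregation over per-vehicle detections. The sketch below uses a made-up schema purely to illustrate the kind of roll-up involved; it is not Rekor Discover's data model.

```python
# Hypothetical sketch of the roll-up behind an EV hotspot map: aggregate
# per-vehicle detections (invented schema) by entry point and direction.
import pandas as pd

detections = pd.DataFrame({
    "gate":      ["north", "north", "south", "east", "north", "east"],
    "direction": ["in", "out", "in", "in", "in", "out"],
    "fuel_type": ["ev", "gas", "ev", "ev", "gas", "ev"],
})

ev_counts = (
    detections[detections["fuel_type"] == "ev"]
    .groupby(["gate", "direction"])
    .size()
    .unstack(fill_value=0)
)
print(ev_counts)   # rows: gates, columns: in/out counts -> candidate charger sites
```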

PIDC also wanted to address public safety issues related to speeding and collisions and to decrease property damage. Using speed insights, it’s deploying traffic-calming measures on road segments where average speeds exceed safe levels.

NVIDIA Jetson Xavier NX to Monitor Pollution in Real Time

Traditionally, urban planners look at satellite imagery to try to understand where pollution occurs, but Rekor’s vehicle recognition models, running on NVIDIA Jetson Xavier NX modules, were able to track pollution to its sources, a step further toward mitigation.

“It’s about air quality,” said Shobhit Jain, senior vice president of product management at Rekor. “We’ve built models to be really good at that. They can know how much pollution each vehicle is putting out.”

Looking ahead, Rekor is examining how NVIDIA Omniverse might be used for digital twin development to simulate different traffic-mitigation strategies. Omniverse is a platform for developing OpenUSD applications for industrial digitalization and generative physical AI.

Developing digital twins with Omniverse for municipalities has enormous implications for reducing traffic, pollution and road fatalities — all areas where Rekor sees major benefits for its customers.

“Our data models are granular, and we’re definitely exploring Omniverse,” said Jain. “We’d like to see how we can support those digital use cases.”

Learn about the NVIDIA AI Blueprint for building AI agents for video search and summarization.

Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing…Apple Machine Learning Research
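As a rough illustration of the retrieval idea (not the paper's exact method), the sketch below quantizes a large catalogue of biasing-entry embeddings into a small codebook, scores the audio-derived query against the centroids first, and only runs exact scoring on entries from the best-matching clusters.

```python
# Toy numpy sketch of vector-quantized retrieval for contextual biasing:
# cheap centroid scoring first, exact scoring only on a shortlist.
import numpy as np

rng = np.random.default_rng(0)
entries = rng.normal(size=(10_000, 64))   # catalogue of biasing-entry embeddings
query = rng.normal(size=(64,))            # one audio-derived query vector

# Toy "codebook": assign each entry to its highest-scoring of K random centroids.
K = 64
centroids = rng.normal(size=(K, 64))
assignment = np.argmax(entries @ centroids.T, axis=1)

# Stage 1: score the query against K centroids instead of 10,000 entries.
top_clusters = np.argsort(centroids @ query)[-4:]

# Stage 2: exact dot-product scoring only on the shortlisted entries.
shortlist = np.where(np.isin(assignment, top_clusters))[0]
scores = entries[shortlist] @ query
best_entries = shortlist[np.argsort(scores)[-10:]]
print(best_entries)
```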

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

This paper was accepted at the Adaptive Foundation Models (AFM) workshop at NeurIPS Workshop 2024.
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via…Apple Machine Learning Research