Revolutionizing large language model training with Arcee and AWS Trainium

Revolutionizing large language model training with Arcee and AWS Trainium

This is a guest post by Mark McQuade, Malikeh Ehghaghi, and Shamane Siri from Arcee.

In recent years, large language models (LLMs) have gained attention for their effectiveness, leading various industries to adapt general LLMs to their data for improved results, making efficient training and hardware availability crucial. At Arcee, we focus primarily on enhancing the domain adaptation of LLMs in a client-centric manner. Arcee’s innovative continual pre-training (CPT) and model merging techniques have brought a significant leap forward in the efficient training of LLMs, with particularly strong evaluations in the medical, legal, and financial verticals. Close collaboration with AWS Trainium has also played a major role in making the Arcee platform extremely performant, not only accelerating model training but also reducing overall costs and enforcing compliance and data integrity in the secure AWS environment. In this post, we show you how efficient we make our continual pre-training by using Trainium chips.

Understanding continual pre-training

Arcee recognizes the critical importance of continual CPT [1] in tailoring models to specific domains, as evidenced by previous studies such as PMC-LLaMA [2] and ChipNeMo [3]. These projects showcase the power of domain adaptation pre-training in enhancing model performance across diverse fields, from medical applications to industrial chip design. Inspired by these endeavors, our approach to CPT involves extending the training of base models like Llama 2 using domain-specific datasets, allowing us to fine-tune models to the nuances of specialized fields. To further amplify the efficiency of our CPT process, we collaborated with the Trainium team, using their cutting-edge technology to enhance a Llama 2 [4] model using a PubMed dataset [2] comprising 88 billion tokens. This collaboration represents a significant milestone in our quest for innovation, and through this post, we’re excited to share the transformative insights we’ve gained. Join us as we unveil the future of domain-specific model adaptation and the potential of CPT with Trainium in optimizing model performance for real-world applications.

Dataset collection

We followed the methodology outlined in the PMC-Llama paper [6] to assemble our dataset, which includes PubMed papers sourced from the Semantic Scholar API and various medical texts cited within the paper, culminating in a comprehensive collection of 88 billion tokens. For further details on the dataset, the original paper offers in-depth information.

To prepare this dataset for training, we used the Llama 2 tokenizer within an AWS Glue pipeline for efficient processing. We then organized the data so that each row contained 4,096 tokens, adhering to recommendations from the Neuron Distributed tutorials.

Why Trainium?

Continual pre-training techniques like the ones described in this post require access to high-performance compute instances, which has become more difficult to get as more developers are using generative artificial intelligence (AI) and LLMs for their applications. Traditionally, these workloads have been deployed to GPUs; however, in recent years, the cost and availability of GPUs has stifled model building innovations. With the introduction of Trainium, we are able to unlock new techniques that enable us to continue model innovations that will allow us to build models more efficiently and most importantly, at lower costs. Trainium is the second-generation machine learning (ML) accelerator that AWS purpose built to help developers access high-performance model training accelerators to help lower training costs by up to 50% over comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. With Trainium available in AWS Regions worldwide, developers don’t have to take expensive, long-term compute reservations just to get access to clusters of GPUs to build their models. Trainium instances offer developers the performance they need with the elasticity they want to optimize both for training efficiency and lowering model building costs.

Setting up the Trainium cluster

We used AWS ParallelCluster to build a High Performance Computing (HPC) compute environment that uses Trn1 compute nodes to run our distributed ML training job (see the GitHub tutorial). You can also use developer flows like Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Ray, or others (to learn more, see Developer Flows). After the nodes were launched, we ran a training task to confirm that the nodes were working, and used slurm commands to check the job status. In this part, we used the AWS pcluster command to run a .yaml file to generate the cluster. Our cluster consisted of 16 nodes, each equipped with a trn1n.32xlarge instance featuring 32 GB of VRAM.

We set up our ParallelCluster infrastructure as shown in the following diagram (source).

As shown in the preceding figure, inside a VPC, there are two subnets, a public one and a private one. The head node resides in the public subnet, and the compute fleet (in this case, Trn1 instances) is in the private subnet. A NAT gateway is also needed in order for nodes in the private subnet to connect to clients outside the VPC. In the following section, we describe how to set up the necessary infrastructure for Trn1 ParallelCluster.

Set up the environment

To set up your environment, complete the following steps:

  1. Install the VPC and necessary components for ParallelCluster. For instructions, see VPC setup for ParallelCluster with Trn1.
  2. Create and launch ParallelCluster in the VPC. For instructions, see Create ParallelCluster.

Now you can launch a training job to submit a model training script as a slurm job.

Deploy to Trainium

Trainium-based EC2 Trn1 instances use the AWS Neuron SDK and support common ML frameworks like PyTorch and TensorFlow. Neuron allows for effortless distributed training and has integrations with Megatron Nemo and Neuron Distributed.

When engaging with Trainium, it’s crucial to understand several key parameters:

  • Tensor parallel size – This determines the level of tensor parallelization, particularly in self-attention computations within transformers, and is crucial for optimizing memory usage (not computational time efficiency) during model loading
  • NeuronCores – Each Trainium device has two NeuronCores, and an eight-node setup equates to a substantial 256 cores
  • Mini batch – This reflects the number of examples processed in each batch as determined by the data loader
  • World size – This is the total count of nodes involved in the training operation

A deep understanding of these parameters is vital for anyone looking to harness the power of Trainium devices effectively.

Train the model

For this post, we train a Llama 2 7B model with tensor parallelism. For a streamlined and effective training process, we adhered to the following steps:

  1. Download the Llama 2 full checkpoints (model weights and tokenizer) from Hugging Face.
  2. Convert these checkpoints to a format compatible with the Neuron Distributed setup, so they can be efficiently utilized in our training infrastructure.
  3. Determine the number of steps required per epoch, incorporating the effective batch size and dataset size to tailor the training process to our specific needs.
  4. Launch the training job, carefully monitoring its progress and performance.
  5. Periodically save training checkpoints. Initially, this process may be slow due to its synchronous nature, but improvements are anticipated as the NeuronX team works on enhancements.
  6. Finally, convert the saved checkpoints back to a standard format for subsequent use, employing scripts for seamless conversion.

For more details, you can find the full implementation of the training steps in the following GitHub repository.

Clean up

Don’t forget to tear down any resources you set up in this post.

Results

Our study focused on evaluating the quality of the CPT-enhanced checkpoints. We monitored the perplexity of a held-out PubMed dataset [6] across various checkpoints obtained during training, which provided valuable insights into the model’s performance improvements over time.

Through this journey, we’ve advanced our model’s capabilities, and hope to contribute to the broader community’s understanding of effective model adaptation strategies.

The following figure shows the perplexity of the baseline Llama 2 7B checkpoint vs. its CPT-enhanced checkpoint on the PMC test dataset. Based on these findings, continual pre-training on domain-specific raw data, specifically PubMed papers in our study, resulted in an enhancement of the Llama 2 7B checkpoint, leading to improved perplexity of the model on the PMC test set.

The following figure shows the perplexity of the CPT-enhanced checkpoints of the Llama 2 7B model across varying numbers of trained tokens. The increasing number of trained tokens correlated with enhanced model performance, as measured by the perplexity metric.

The following figure shows the perplexity comparison between the baseline Llama 2 7B model and its CPT-enhanced checkpoints, with and without data mixing. This underscores the significance of data mixing, where we have added 1% of general tokens to the domain-specific dataset, wherein utilizing a CPT-enhanced checkpoint with data mixing exhibited better performance compared to both the baseline Llama 2 7B model and the CPT-enhanced checkpoint solely trained on PubMed data.

Conclusion

Arcee’s innovative approach to CPT and model merging, as demonstrated through our collaboration with Trainium, signifies a transformative advancement in the training of LLMs, particularly in specialized domains such as medical research. By using the extensive capabilities of Trainium, we have not only accelerated the model training process, but also significantly reduced costs, with an emphasis on security and compliance that provides data integrity within a secure AWS environment.

The results from our training experiments, as seen in the improved perplexity scores of domain-specific models, underscore the effectiveness of our method in enhancing the performance and applicability of LLMs across various fields. This is particularly evident from the direct comparisons of time-to-train metrics between Trainium and traditional GPU setups, where Trainium’s efficiency and cost-effectiveness shine.

Furthermore, our case study using PubMed data for domain-specific training highlights the potential of Arcee’s CPT strategies to fine-tune models to the nuances of highly specialized datasets, thereby creating more accurate and reliable tools for professionals in those fields.

As we continue to push the boundaries of what’s possible in LLM training, we encourage researchers, developers, and enterprises to take advantage of the scalability, efficiency, and enhanced security features of Trainium and Arcee’s methodologies. These technologies not only facilitate more effective model training, but also open up new avenues for innovation and practical application in AI-driven industries.

The integration of Trainium’s advanced ML capabilities with Arcee’s pioneering strategies in model training and adaptation is poised to revolutionize the landscape of LLM development, making it more accessible, economical, and tailored to meet the evolving demands of diverse industries.

To learn more about Arcee.ai, visit Arcee.ai or reach out to our team.

Additional resources

References

  1. Gupta, Kshitij, et al. “Continual Pre-Training of Large Language Models: How to (re) warm your model?.” arXiv preprint arXiv:2308.04014 (2023).
  2. Wu, Chaoyi, et al. “Pmc-LLaMA: Towards building open-source language models for medicine.” arXiv preprint arXiv:2305.10415 6 (2023).
  3. Liu, Mingjie, et al. “Chipnemo: Domain-adapted llms for chip design.” arXiv preprint arXiv:2311.00176 (2023).
  4. Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
  5. https://aws.amazon.com/ec2/instance-types/trn1/
  6. Wu, C., Zhang, X., Zhang, Y., Wang, Y., & Xie, W. (2023). Pmc-llama: Further fine tuning llama on medical papers. arXiv preprint arXiv:2304.14454.

About the Authors

Mark McQuade is the CEO/Co-Founder at Arcee. Mark co-founded Arcee with a vision to empower enterprises with industry-specific AI solutions. This idea emerged from his time at Hugging Face, where he helped spearhead the Monetization team, collaborating with high-profile enterprises. This frontline experience exposed him to critical industry pain points: the reluctance to rely on closed source APIs and the challenges of training open source models without compromising data security.

Shamane Siri Ph.D. is the Head of Applied NLP Research at Arcee. Before joining Arcee, Shamane worked in both industry and academia, developing recommendation systems using language models to address the cold start problem, and focusing on information retrieval, multi-modal emotion recognition, and summarization. Shamane has also collaborated with the Hugging Face Transformers crew and Meta Reality Labs on cutting-edge projects. He holds a PhD from the University of Auckland, where he specialized in domain adaptation of foundational language models.

Malikeh Ehghaghi is an Applied NLP Research Engineer at Arcee. Malikeh’s research interests are NLP, domain-adaptation of LLMs, ML for healthcare, and responsible AI. She earned an MScAC degree in Computer Science from the University of Toronto. She previously collaborated with Lavita AI as a Machine Learning Consultant, developing healthcare chatbots in partnership with Dartmouth Center for Precision Health and Artificial Intelligence. She also worked as a Machine Learning Research Scientist at Cambridge Cognition Inc. and Winterlight Labs, with a focus on monitoring and detection of mental health disorders through speech and language. Malikeh has authored several publications presented at top-tier conferences such as ACL, COLING, AAAI, NAACL, IEEE-BHI, and MICCAI.

Read More

SEA.AI Navigates the Future With AI at the Helm

SEA.AI Navigates the Future With AI at the Helm

Talk about commitment. When startup SEA.AI, an NVIDIA Metropolis partner, set out to create a system that would use AI to scan the seas to enhance maritime safety, entrepreneur Raphael Biancale wasn’t afraid to take the plunge. He donned a lifejacket and jumped into the ocean.

It’s a move that demonstrates Biancale’s commitment and pioneering approach. The startup, founded in 2018 and based in Linz, Austria, with subsidiaries in France, Portugal and the US, had to build its first-of-its-kind training data from scratch in order to train an AI to help oceangoers of all kinds scan the seas.

And to do that, the company needed photos of what a person in the water looked like. That’s when Biancale, now the company’s head of research, walked the plank.

The company has come a long way since then, with a full product line powered by NVIDIA AI technology that lets commercial and recreational sailors detect objects on the seas, whether potential hazards or people needing rescue.

It’s an effort inspired by Biancale’s experience on a night sail. The lack of visibility and situational awareness illuminated the pressing need for advanced safety technologies in the maritime world that AI is bringing to the automotive industry.

AI, of course, is finding its way into all things aquatic. Startup Saildrone’s autonomous sailing vessels can help conduct data-gathering for science, fisheries, weather forecasting, ocean mapping and maritime security. Other researchers are using AI to interpret whale songs and even protect beachgoers from dangerous rip currents.

SEA.AI, however, promises to make the seas safer for everyone who steps aboard a boat. First introduced for ocean racing sailboats, SEA.AI’s system has quickly evolved into an AI-powered situational awareness system that can be deployed on everything from recreational sail and powerboats to commercial shipping vessels.

SEA.AI directly addresses one of the most significant risks for all these vessels: collisions. Thanks to SEA.AI, commercial and recreational oceangoers worldwide can travel with more confidence.

How SEA.AI Works

At the heart of SEA.AI’s approach is a proprietary database of over 9 million annotated marine objects which is growing constantly.

When combined with high-tech optical sensors and the latest AI technology from NVIDIA, SEA.AI’s systems can recognize and classify objects in real-time, significantly improving maritime safety.

SEA.AI technology can detect a person in water up to 700 meters — almost half a mile — away, a dinghy up to 3,000 meters, and motorboats up to 7,500 meters.

This capability ensures maritime operators can identify hazards before they pose a threat. It complements older marine safety systems that rely on radar and satellite signals.

SEA.AI solutions integrate with central marine display units from industry-leading manufacturers like Raymarine, B&G, Garmin and Furuno as well as Android and iOS-based mobile devices. This provides broad applicability across the maritime sector, from recreational vessels to commercial and government ships.

The NVIDIA Jetson edge AI platform is integral to SEA.AI’s success. The platform for robotics and embedded computing applications enables SEA.AI products to achieve unparalleled processing power and efficiency, setting a new standard in maritime safety by quickly detecting, analyzing and alerting operators to objects.

AI Integration and Real-Time Object Detection

SEA.AI uses NVIDIA’s AI and machine vision technology to offer real-time object detection and classification, providing maritime operators with immediate identification of potential hazards.

SEA.AI is bringing its approach to oceangoers of all kinds with three product lines.

One, SEA.AI Sentry, provides 360-degree situational awareness for commercial vessels and motor yachts with features like collision avoidance, object tracking and perimeter surveillance.

Another, SEA.AI Offshore,  provides bluewater sailors with high-tech safety and convenience with simplified installation across several editions that can suit different detection and technical needs.

The third, SEA.AI Competition, offers reliable object detection for ocean racing and performance yacht sailors. Its ultra-lightweight design ensures maximum performance when navigating at high speeds.

With a growing team of more than 60 and a distribution network spanning over 40 countries, SEA.AI is charting a course to help ensure every journey on the waves is safer.

Read More

Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction

Sleep staging is a clinically important task for diagnosing various sleep disorders but remains challenging to deploy at scale because it requires clinical expertise, among other reasons. Deep learning models can perform the task but at the expense of large labeled datasets, which are unfeasible to procure at scale. While self-supervised learning (SSL) can mitigate this need, recent studies on SSL for sleep staging have shown performance gains saturate after training with labeled data from only tens of subjects, hence are unable to match peak performance attained with larger datasets. We…Apple Machine Learning Research

Databricks DBRX is now available in Amazon SageMaker JumpStart

Databricks DBRX is now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the DBRX model, an open, general-purpose large language model (LLM) developed by Databricks, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. The DBRX LLM employs a fine-grained mixture-of-experts (MoE) architecture, pre-trained on 12 trillion tokens of carefully curated data and a maximum context length of 32,000 tokens.

You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the DBRX model.

What is the DBRX model

DBRX is a sophisticated decoder-only LLM built on transformer architecture. It employs a fine-grained MoE architecture, incorporating 132 billion total parameters, with 36 billion of these parameters being active for any given input.

The model underwent pre-training using a dataset consisting of 12 trillion tokens of text and code. In contrast to other open MoE models like Mixtral and Grok-1, DBRX features a fine-grained approach, using a higher quantity of smaller experts for optimized performance. Compared to other MoE models, DBRX has 16 experts and chooses 4.

The model is made available under the Databricks Open Model license, for use without restrictions.

What is SageMaker JumpStart

SageMaker JumpStart is a fully managed platform that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly and with ease, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as DBRX, for a variety of tasks.

You can now discover and deploy DBRX models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

Discover models in SageMaker JumpStart

You can access the DBRX model through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

From the SageMaker JumpStart landing page, you can search for “DBRX” in the search box. The search results will list DBRX Instruct and DBRX Base.

You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find the Deploy button to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart

Deployment starts when you choose the Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

DBRX Base

To deploy using the SDK, we start by selecting the DBRX Base model, specified by the model_id with value huggingface-llm-dbrx-base. You can deploy any of the selected models on SageMaker with the following code. Similarly, you can deploy DBRX Instruct using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel

accept_eula = True

model = JumpStartModel(model_id="huggingface-llm-dbrx-base")
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The Eula value must be explicitly defined as True in order to accept the end-user license agreement (EULA). Also make sure you have the account-level service limit for using ml.p4d.24xlarge or ml.pde.24xlarge for endpoint usage as one or more instances. You can follow the instructions here in order to request a service quota increase.

After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "Hello!",
    "parameters": {
        "max_new_tokens": 10,
    },
}
predictor.predict(payload)

Example prompts

You can interact with the DBRX Base model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output.

Code generation

Using the preceding example, we can use code generation prompts as follows:

payload = { 
      "inputs": "Write a function to read a CSV file in Python using pandas library:", 
      "parameters": { 
          "max_new_tokens": 30, }, } 
           response = predictor.predict(payload)["generated_text"].strip() 
           print(response)

The following is the output:

import pandas as pd 
df = pd.read_csv("file_name.csv") 
#The above code will import pandas library and then read the CSV file using read_csv

Sentiment analysis

You can perform sentiment analysis using a prompt like the following with DBRX:

payload = {
"inputs": """
Tweet: "I am so excited for the weekend!"
Sentiment: Positive

Tweet: "Why does traffic have to be so terrible?"
Sentiment: Negative

Tweet: "Just saw a great movie, would recommend it."
Sentiment: Positive

Tweet: "According to the weather report, it will be cloudy today."
Sentiment: Neutral

Tweet: "This restaurant is absolutely terrible."
Sentiment: Negative

Tweet: "I love spending time with my family."
Sentiment:""",
"parameters": {
"max_new_tokens": 2,
},
}
response = predictor.predict(payload)["generated_text"].strip()
print(response)

The following is the output:

Positive

Question answering

You can use a question answering prompt like the following with DBRX:

# Question answering
payload = {
    "inputs": "Respond to the question: How did the development of transportation systems, such as railroads and steamships, impact global trade and cultural exchange?",
    "parameters": {
        "max_new_tokens": 225,
    },
}
response = predictor.predict(payload)["generated_text"].strip()
print(response)

The following is the output:

The development of transportation systems, such as railroads and steamships, impacted global trade and cultural exchange in a number of ways. 
The documents provided show that the development of these systems had a profound effect on the way people and goods were able to move around the world. 
One of the most significant impacts of the development of transportation systems was the way it facilitated global trade. 
The documents show that the development of railroads and steamships made it possible for goods to be transported more quickly and efficiently than ever before. 
This allowed for a greater exchange of goods between different parts of the world, which in turn led to a greater exchange of ideas and cultures. 
Another impact of the development of transportation systems was the way it facilitated cultural exchange. The documents show that the development of railroads and steamships made it possible for people to travel more easily and quickly than ever before. 
This allowed for a greater exchange of ideas and cultures between different parts of the world. Overall, the development of transportation systems, such as railroads and steamships, had a profound impact on global trade and cultural exchange.

 

DBRX Instruct

The instruction-tuned version of DBRX accepts formatted instructions where conversation roles must start with a prompt from the user and alternate between user instructions and the assistant (DBRX-instruct). The instruction format must be strictly respected, otherwise the model will generate suboptimal outputs. The template to build a prompt for the Instruct model is defined as follows:

<|im_start|>system
{system_message} <|im_end|>
<|im_start|>user
{human_message} <|im_end|>
<|im_start|>assistantn

<|im_start|> and <|im_end|> are special tokens for beginning of string (BOS) and end of string (EOS). The model can contain multiple conversation turns between system, user, and assistant, allowing for the incorporation of few-shot examples to enhance the model’s responses.

The following code shows how you can format the prompt in instruction format:

from typing import Dict, List

def format_instructions(instructions: List[Dict[str, str]]) -> List[str]:
    """Format instructions where conversation roles must alternate system/user/assistant/user/assistant/..."""
    prompt: List[str] = []
    for instruction in instructions:
        if instruction["role"] == "system":
            prompt.extend(["<|im_start|>systemn", (instruction["content"]).strip(), " <|im_end|>n"])
        elif instruction["role"] == "user":
            prompt.extend(["<|im_start|>usern", (instruction["content"]).strip(), " <|im_end|>n"])
        else:
            raise ValueError(f"Invalid role: {instruction['role']}. Role must be either 'user' or 'system'.")
    prompt.extend(["<|im_start|>assistantn"])
    return "".join(prompt)

def print_instructions(prompt: str, response: str) -> None:
    bold, unbold = '33[1m', '33[0m'
    print(f"{bold}> Input{unbold}n{prompt}nn{bold}> Output{unbold}n{response['generated_text'].strip()}n")

Knowledge retrieval

You can use the following prompt for knowledge retrieval:

instructions = [
{"role": "system", "content": "You are an expert in suggesting diet plans optimized for different sports or fitness activities, taking into account energy requirements, and recovery requirements."},
{"role": "user", "content": "I am trying to build muscle mass and increase my strength. What kind of high-protein, calorie-dense diet plan would you recommend for me?"}
]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 700, "temperature": 0.5, "do_sample": True}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

The following is the output:

> Input
<|im_start|>system
You are an expert in suggesting diet plans optimized for different sports or fitness activities, taking into account energy requirements, and recovery requirements. <|im_end|>
<|im_start|>user
I am trying to build muscle mass and increase my strength. What kind of high-protein, calorie-dense diet plan would you recommend for me? <|im_end|>
<|im_start|>assistant

> Output
Absolutely, I'd be happy to help you with that. Given your goal of building muscle mass and increasing strength, you'll want a diet that's high in protein and provides sufficient calories to support muscle growth and recovery. Here's a basic outline of what your diet plan could look like:

1. **Breakfast:**
   - 3 whole eggs
   - 3 egg whites
   - 1 cup of oatmeal
   - 1 scoop of whey protein
   - 1 banana

2. **Mid-Morning Snack:**
   - 1.5 ounces of almonds
   - 1 apple

3. **Lunch:**
   - 8 ounces of chicken breast
   - 1 cup of brown rice
   - 1 cup of mixed vegetables

4. **Afternoon Snack:**
   - 1.5 cups of Greek yogurt
   - 1 cup of mixed berries

5. **Dinner:**
   - 8 ounces of salmon or lean beef
   - 1 cup of quinoa
   - 1.5 cups of mixed vegetables

6. **Evening Snack:**
   - 1 scoop of casein protein
   - 1 tablespoon of almond butter

This is a high-protein, calorie-dense diet that includes a variety of nutrient-rich foods. It's important to note that everyone's nutritional needs are different, so you may need to adjust portion sizes or food choices based on your specific needs and preferences.

Also, remember to stay hydrated by drinking plenty of water throughout the day. Regular exercise, particularly strength training, is crucial to achieving your goals. It's also important to get enough sleep to aid in muscle recovery and growth.

Before starting any new diet or exercise program, it's a good idea to consult with a healthcare professional or a registered dietitian. They can provide personalized advice based on your health history and current lifestyle.

Code generation

DBRX models demonstrate benchmarked strengths for coding tasks. For example, see the following code:

instructions = [ 
{"role": "system", "content": "You are an expert in coding and programming. Your role is to provide examples, explanations, and insights related to programming languages, algorithms, data structures, and software development practices."}, 
{"role": "user", "content": "Write a Python script to read a CSV file containing stock prices and plot the closing prices over time using Matplotlib."} 
] 
prompt = format_instructions(instructions) 
payload = { "inputs": prompt, "parameters": {"max_new_tokens": 315, "temperature": 0.3, "do_sample": True} } 
response=predictor.predict(payload) 
print_instructions(prompt, response)

The following is the output:

> Input
<|im_start|>system
You are an expert in coding and programming. Your role is to provide examples, explanations, and insights related to programming languages, algorithms, data structures, and software development practices. <|im_end|>
<|im_start|>user
Write a Python script to read a CSV file containing stock prices and plot the closing prices over time using Matplotlib. <|im_end|>
<|im_start|>assistant

> Output
Sure, I'd be happy to help you with that. Here's a simple Python script that uses the `pandas` library to read a CSV file and `matplotlib` to plot the closing prices over time.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the data from a CSV file
df = pd.read_csv('stock_prices.csv')

# Plot the closing prices over time
plt.figure(figsize=(12,5))
plt.plot(df['Date'], df['Close'])
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Stock Prices Over Time')
plt.show()
```

In this script, replace `'stock_prices.csv'` with the path to your CSV file. The file should have columns named 'Date' and 'Close' for this script to work correctly. If your file uses different column names, you'll need to adjust the script accordingly.

The `pandas` library is a powerful tool for data manipulation and analysis. It can read data from a variety of sources, including CSV files, and represent the data in a tabular format. The `matplotlib` library is a popular choice for creating static, animated, and interactive visualizations in Python.

This script first loads the data from the CSV file into a pandas DataFrame. It then plots the 'Close' column against the 'Date' column using matplotlib's `plot()` function. The `figure()` function is used to specify the size of the plot, and `show()` is used to display the plot.

Mathematics and reasoning

The DBRX models also report strengths in mathematic accuracy. For example, see the following code:

instructions = [
  {"role": "system", "content": "You are an expert in mathematics and reasoning. Your role is to provide examples, explanations, and insights related to mathematical concepts, problem-solving techniques, and logical reasoning."},
  {"role": "user", "content": "A car rental company charges $40 per day plus $0.25 per mile driven. If I rented a car for 3 days and drove 250 miles, what would be the total cost? Break down the calculation steps before giving the final answer."}
]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 400, "temperature": 0.2, "do_sample": True}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

DBRX can provide comprehension as shown in the following output with the math logic:

> Input
<|im_start|>system
You are an expert in mathematics and reasoning. Your role is to provide examples, explanations, and insights related to mathematical concepts, problem-solving techniques, and logical reasoning. <|im_end|>
<|im_start|>user
A car rental company charges $40 per day plus $0.25 per mile driven. If I rented a car for 3 days and drove 250 miles, what would be the total cost? Break down the calculation steps before giving the final answer. <|im_end|>
<|im_start|>assistant

> Output
Sure, I'd be happy to help with that! The problem you've presented involves both fixed costs (the daily rate) and variable costs (the cost per mile). Let's break it down:

1. First, let's calculate the total cost of the fixed daily rate. The car rental company charges $40 per day, and you rented the car for 3 days. So, the total cost for the daily rate is $40/day * 3 days = $120.

2. Next, let's calculate the total cost of the variable rate, which is the cost per mile driven. The company charges $0.25 per mile, and you drove 250 miles. So, the total cost for the mileage is $0.25/mile * 250 miles = $62.50.

3. Finally, to find the total cost of renting the car, we add the total cost of the daily rate and the total cost of the mileage. So, the total cost is $120 + $62.50 = $182.50.

So, the total cost of renting the car for 3 days and driving 250 miles would be $182.50.

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with DBRX in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

Resources


About the Authors

Shikhar Kwatra is an AI/ML Specialist Solutions Architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 400 patents in the AI/ML and IoT domains. He has over 8 years of industry experience from startups to large-scale enterprises, from IoT Research Engineer, Data Scientist, to Data & AI Architect. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for organizations and supports GSI partners in building strategic industry

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrating cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Read More

Knowledge Bases in Amazon Bedrock now simplifies asking questions on a single document

Knowledge Bases in Amazon Bedrock now simplifies asking questions on a single document

At AWS re:Invent 2023, we announced the general availability of Knowledge Bases for Amazon Bedrock. With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for fully managed Retrieval Augmented Generation (RAG).

In previous posts, we covered new capabilities like hybrid search support, metadata filtering to improve retrieval accuracy, and how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.

Today, we’re introducing the new capability to chat with your document with zero setup in Knowledge Bases for Amazon Bedrock. With this new capability, you can securely ask questions on single documents, without the overhead of setting up a vector database or ingesting data, making it effortless for businesses to use their enterprise data. You only need to provide a relevant data file as input and choose your FM to get started.

But before we jump into the details of this feature, let’s start with the basics and understand what RAG is, its benefits, and how this new capability enables content retrieval and generation for temporal needs.

What is Retrieval Augmented Generation?

FM-powered artificial intelligence (AI) assistants have limitations, such as providing outdated information or struggling with context outside their training data. RAG addresses these issues by allowing FMs to cross-reference authoritative knowledge sources before generating responses.

With RAG, when a user asks a question, the system retrieves relevant context from a curated knowledge base, such as company documentation. It provides this context to the FM, which uses it to generate a more informed and precise response. RAG helps overcome FM limitations by augmenting its capabilities with an organization’s proprietary knowledge, enabling chatbots and AI assistants to provide up-to-date, context-specific information tailored to business needs without retraining the entire FM. At AWS, we recognize RAG’s potential and have worked to simplify its adoption through Knowledge Bases for Amazon Bedrock, providing a fully managed RAG experience.

Short-term and instant information needs

Although a knowledge base does all the heavy lifting and serves as a persistent large store of enterprise knowledge, you might require temporary access to data for specific tasks or analysis within isolated user sessions. Traditional RAG approaches are not optimized for these short-term, session-based data access scenarios.

Businesses incur charges for data storage and management. This may make RAG less cost-effective for organizations with highly dynamic or ephemeral information requirements, especially when data is only needed for specific, isolated tasks or analyses.

Ask questions on a single document with zero setup

This new capability to chat with your document within Knowledge Bases for Amazon Bedrock addresses the aforementioned challenges. It provides a zero-setup method to use your single document for content retrieval and generation-related tasks, along with the FMs provided by Amazon Bedrock. With this new capability, you can ask questions of your data without the overhead of setting up a vector database or ingesting data, making it effortless to use your enterprise data.

You can now interact with your documents in real time without prior data ingestion or database configuration. You don’t need to take any further data readiness steps before querying the data.

This zero-setup approach makes it straightforward to use your enterprise information assets with generative AI using Amazon Bedrock.

Use cases and benefits

Consider a recruiting firm that needs to analyze resumes and match candidates with suitable job opportunities based on their experience and skills. Previously, you would have to set up a knowledge base, invoking a data ingestion workflow to make sure only authorized recruiters can access the data. Additionally, you would need to manage cleanup when the data was no longer required for a session or candidate. In the end, you would pay more for the vector database storage and management than for the actual FM usage. This new feature in Knowledge Bases for Amazon Bedrock enables recruiters to quickly and ephemerally analyze resumes and match candidates with suitable job opportunities based on the candidate’s experience and skill set.

For another example, consider a product manager at a technology company who needs to quickly analyze customer feedback and support tickets to identify common issues and areas for improvement. With this new capability, you can simply upload a document to extract insights in no time. For example, you could ask “What are the requirements for the mobile app?” or “What are the common pain points mentioned by customers regarding our onboarding process?” This feature empowers you to rapidly synthesize this information without the hassle of data preparation or any management overhead. You can also request summaries or key takeaways, such as “What are the highlights from this requirements document?”

The benefits of this feature extend beyond cost savings and operational efficiency. By eliminating the need for vector databases and data ingestion, this new capability within Knowledge Bases for Amazon Bedrock helps secure your proprietary data, making it accessible only within the context of isolated user sessions.

Now that we’ve covered the feature benefits and the use cases it enables, let’s dive into how you can start using this new feature from Knowledge Bases for Amazon Bedrock.

Chat with your document in Knowledge Bases for Amazon Bedrock

You have multiple options to begin using this feature:

  • The Amazon Bedrock console
  • The Amazon Bedrock RetrieveAndGenerate API (SDK)

Let’s see how we can get started using the Amazon Bedrock console:

  1. On the Amazon Bedrock console, under Orchestration in the navigation pane, choose Knowledge bases.
  2. Choose Chat with your document.
  3. Under Model, choose Select model.
  4. Choose your model. For this example, we use the Claude 3 Sonnet model (we are only supporting Sonnet at the time of the launch).
  5. Choose Apply.
  6. Under Data, you can upload the document you want to chat with or point to the Amazon Simple Storage Service (Amazon S3) bucket location that contains your file. For this post, we upload a document from our computer.

The supported file formats are PDF, MD (Markdown), TXT, DOCX, HTML, CSV, XLS, and XLSX. Make that the file size does not exceed 10 MB and contains no more than 20,000 tokens. A token is considered to be a unit of text, such as a word, sub-word, number, or symbol, that is processed as a single entity. Due to the preset ingestion token limit, it is recommended to use a file under 10MB. However, a text-heavy file, that is much smaller than 10MB, can potentially breach the token limit.

You’re now ready to chat with your document.

As shown in the following screenshot, you can chat with your document in real time.

To customize your prompt, enter your prompt under System prompt.

Similarly, you can use the AWS SDK through the retrieve_and_generate API in major coding languages. In the following example, we use the AWS SDK for Python (Boto3):

import boto3

bedrock_client = boto3.client(service_name='bedrock-agent-runtime')
model_id = "your_model_id_here"    # Replace with your modelID
document_uri = "your_s3_uri_here"  # Replace with your S3 URI

def retrieveAndGenerate(input_text, sourceType, model_id, document_s3_uri=None, data=None):
    region = 'us-west-2'  
    model_arn = f'arn:aws:bedrock:{region}::foundation-model/{model_id}'

    if sourceType == "S3":
        return bedrock_client.retrieve_and_generate(
            input={'text': input_text},
            retrieveAndGenerateConfiguration={
                'type': 'EXTERNAL_SOURCES',
                'externalSourcesConfiguration': {
                    'modelArn': model_arn,
                    'sources': [
                        {
                            "sourceType": sourceType,
                            "s3Location": {
                                "uri": document_s3_uri  
                            }
                        }
                    ]
                }
            }
        )
        
    else:
        return bedrock_client.retrieve_and_generate(
            input={'text': input_text},
            retrieveAndGenerateConfiguration={
                'type': 'EXTERNAL_SOURCES',
                'externalSourcesConfiguration': {
                    'modelArn': model_arn,
                    'sources': [
                        {
                            "sourceType": sourceType,
                            "byteContent": {
                                "identifier": "testFile.txt",
                                "contentType": "text/plain",
                                "data": data  
                            }
                        }
                    ]
                }
            }
        )

response = retrieveAndGenerate(
                                input_text="What is the main topic of this document?",
                                sourceType="S3", 
                                model_id=model_id,
                                document_s3_uri=document_uri
                              )
                    
print(response['output']['text'])

Conclusion

In this post, we covered how Knowledge Bases for Amazon Bedrock now simplifies asking questions on a single document. We explored the core concepts behind RAG, the challenges this new feature addresses, and the various use cases it enables across different roles and industries. We also demonstrated how to configure and use this capability through the Amazon Bedrock console and the AWS SDK, showcasing the simplicity and flexibility of this feature, which provides a zero-setup solution to gather information from a single document, without setting up a vector database.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:

Share and learn with our generative AI community at community.aws.


About the authors

Suman Debnath is a Principal Developer Advocate for Machine Learning at Amazon Web Services. He frequently speaks at AI/ML conferences, events, and meetups around the world. He is passionate about large-scale distributed systems and is an avid fan of Python.

Sebastian Munera is a Software Engineer in the Amazon Bedrock Knowledge Bases team at AWS where he focuses on building customer solutions that leverage Generative AI and RAG applications. He has previously worked on building Generative AI-based solutions for customers to streamline their processes and Low code/No code applications. In his spare time he enjoys running, lifting and tinkering with technology.

Read More

AI Drives Future of Transportation at Asia’s Largest Automotive Show

AI Drives Future of Transportation at Asia’s Largest Automotive Show

The latest trends and technologies in the automotive industry are in the spotlight at the Beijing International Automotive Exhibition, aka Auto China, which opens to the public on Saturday, April 27.

An array of NVIDIA auto partners is embracing this year’s theme, “New Era, New Cars,” by making announcements and showcasing their latest offerings powered by NVIDIA DRIVE, the platform for AI-defined vehicles.

NVIDIA Auto Partners Announce New Vehicles and Technologies

Image courtesy of JIYUE.

Electric vehicle (EV) makers Chery (booth E107) and JIYUE (booth W206), a joint venture between Baidu and Geely (booth W204), announced they have adopted the next-generation NVIDIA DRIVE Thor centralized car computer.

DRIVE Thor will integrate the new NVIDIA Blackwell GPU architecture, designed for transformer, large language model and generative AI workloads.

In addition, a number of automakers are building next-gen vehicles on NVIDIA DRIVE Orin, including:

smart, a joint venture between Mercedes-Benz and Geely, previewed its largest and most spacious model to date, an electric SUV called #5. It will be built on its Pilot Assist 3.0 intelligent driving-assistance platform, powered by NVIDIA DRIVE Orin, which supports point-to-point automatic urban navigation. smart #5 will be available for purchase in the second half of this year. smart will be at booth E408.

NIO, a pioneer in the premium smart EV market, unveiled its updated ET7 sedan, featuring upgraded cabin intelligence and smart-driving capabilities. NIO also showcased its 2024 ET5 and ES7. All 2024 models are equipped with four NVIDIA DRIVE Orin systems-on-a-chip (SoCs). Intelligent-driving capabilities in urban areas will fully launch soon. NIO will be at booth E207.

Image courtesy of GWM.

GWM revealed the WEY Blue Mountain (Lanshan) Intelligent Driving Edition, its luxury, high-end SUV. This upgraded vehicle is built on GWM’s Coffee Pilot Ultra intelligent-driving system, powered by NVIDIA DRIVE Orin, and can support features such as urban navigate-on-autopilot (NOA) and cross-floor memory parking. GWM will be at booth E303.

XPENG, a designer and manufacturer of intelligent EVs, announced that it is streamlining the design workflow of its flagship XPENG X9 using the NVIDIA Omniverse platform. In March, XPENG announced it will adopt NVIDIA DRIVE Thor for its next-generation EV fleets. XPENG will be at booth W402.

Innovation on Display

On the exhibition floor, NVIDIA partners are showcasing their NVIDIA DRIVE-powered vehicles:

Image courtesy of BYD.

BYD, DENZA and YANGWANG are featuring their latest vehicles built on NVIDIA DRIVE Orin. The largest EV maker in the world, BYD is building both its Ocean and Dynasty series on NVIDIA DRIVE Orin. In addition, BYDE, a subsidiary of BYD, will tap into the NVIDIA Isaac and NVIDIA Omniverse platforms to develop tools and applications for virtual factory planning and retail configurators. BYD will be at booth W106, DENZA at W408 and YANGWANG at W105.

DeepRoute.ai is showcasing its new intelligent driving-platform, DeepRoute IO, and highlighting its end-to-end model. Powered by NVIDIA DRIVE Orin, the first mass-produced car built on DeepRoute IO will focus on assisted driving and parking. DeepRoute.ai will be at booth W4-W07.

Hyper, a luxury brand owned by GAC AION, is displaying its latest Hyper GT and Hyper HT models, powered by NVIDIA DRIVE Orin. These vehicles feature advanced level 2+ driving capabilities in high-speed environments. Hyper recently announced it selected DRIVE Thor for its next-generation EVs with level 4 driving capabilities. Hyper will be at booth W310.

IM Motors is exhibiting the recently launched L6 Super Intelligent Vehicle. The entire lineup of the IM L6 is equipped with NVIDIA DRIVE Orin to power intelligent driving abilities, including urban NOA features. IM Motors will be at booth W205.

Li Auto is showcasing its recently released L6 model, as well as L7, L8, L9 and MEGA. Models equipped with Li Auto’s AD Max system are powered by dual NVIDIA DRIVE Orin SoCs, which help bring ever-upgrading intelligent functionality to Li Auto’s NOA feature. Li Auto will be at booth E405.

Image courtesy of Lotus.

Lotus is featuring a full range of vehicles, including the Emeya electric hyper-GT powered by NVIDIA DRIVE Orin. Lotus will be at booth E403.

Mercedes-Benz is exhibiting its Concept CLA Class, the first car to be developed on the all-new Mercedes-Benz Modular Architecture. The Concept CLA Class fully runs on MB.OS, which handles infotainment, automated driving, comfort and charging. Mercedes-Benz will be at booth E404.

Momenta is rolling out a new NVIDIA DRIVE Orin solution to accelerate commercialization of urban NOA capabilities at scale.

Image courtesy of Polestar.

Polestar is featuring the Polestar 3, the Swedish car manufacturer’s battery electric mid-size luxury crossover SUV powered by DRIVE Orin. Polestar will be at booth E205.

SAIC R Motors is showcasing the Rising Auto R7 and F7 powered by NVIDIA DRIVE Orin at booth W406.

WeRide is exhibiting Chery’s Exeed Sterra ET SUV and ES sedan, both powered by NVIDIA DRIVE Orin. The vehicles demonstrate progress made by Bosch and WeRide on level 2 to level 3 autonomous-driving technology. WeRide will be at booth E1-W04.

Xiaomi is displaying its NVIDIA DRIVE Orin-powered SU7 and “Human x Car x Home” smart ecosystem, designed to seamlessly connect people, cars and homes, at booth W203.

ZEEKR unveiled its SEA-M architecture and is showcasing the ZEEKR 007 powered by NVIDIA DRIVE Orin at booth E101.

Auto China runs through Saturday, May 4, at the China International Exhibition Center in Beijing.

Learn more about the industry-leading designs and technologies NVIDIA is developing with its automotive partners.

Featured image courtesy of JIYUE.

Read More

Deploy a Hugging Face (PyAnnote) speaker diarization model on Amazon SageMaker as an asynchronous endpoint

Deploy a Hugging Face (PyAnnote) speaker diarization model on Amazon SageMaker as an asynchronous endpoint

Speaker diarization, an essential process in audio analysis, segments an audio file based on speaker identity. This post delves into integrating Hugging Face’s PyAnnote for speaker diarization with Amazon SageMaker asynchronous endpoints.

We provide a comprehensive guide on how to deploy speaker segmentation and clustering solutions using SageMaker on the AWS Cloud. You can use this solution for applications dealing with multi-speaker (over 100) audio recordings.

Solution overview

Amazon Transcribe is the go-to service for speaker diarization in AWS. However, for non-supported languages, you can use other models (in our case, PyAnnote) that will be deployed in SageMaker for inference. For short audio files where the inference takes up to 60 seconds, you can use real-time inference. For longer than 60 seconds, asynchronous inference should be used. The added benefit of asynchronous inference is the cost savings by auto scaling the instance count to zero when there are no requests to process.

Hugging Face is a popular open source hub for machine learning (ML) models. AWS and Hugging Face have a partnership that allows a seamless integration through SageMaker with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK. SageMaker features and capabilities help developers and data scientists get started with natural language processing (NLP) on AWS with ease.

The integration for this solution involves using Hugging Face’s pre-trained speaker diarization model using the PyAnnote library. PyAnnote is an open source toolkit written in Python for speaker diarization. This model, trained on the sample audio dataset, enables effective speaker partitioning in audio files. The model is deployed on SageMaker as an asynchronous endpoint setup, providing efficient and scalable processing of diarization tasks.

The following diagram illustrates the solution architecture.Solution architecture

For this post, we use the following audio file.

Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels. Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

Prerequisites

Complete the following prerequisites:

  1. Create a SageMaker domain.
  2. Make sure your AWS Identity and Access Management (IAM) user has the necessary access permissions for creating a SageMaker role.
  3. Make sure the AWS account has a service quota for hosting a SageMaker endpoint for an ml.g5.2xlarge instance.

Create a model function for accessing PyAnnote speaker diarization from Hugging Face

You can use the Hugging Face Hub to access the desired pre-trained PyAnnote speaker diarization model. You use the same script for downloading the model file when creating the SageMaker endpoint.

Hugging face

See the following code:

from PyAnnote.audio import Pipeline

def model_fn(model_dir):
# Load the model from the specified model directory
model = Pipeline.from_pretrained(
"PyAnnote/speaker-diarization-3.1",
use_auth_token="Replace-with-the-Hugging-face-auth-token")
return model

Package the model code

Prepare essential files like inference.py, which contains the inference code:

%%writefile model/code/inference.py
from PyAnnote.audio import Pipeline
import subprocess
import boto3
from urllib.parse import urlparse
import pandas as pd
from io import StringIO
import os
import torch

def model_fn(model_dir):
    # Load the model from the specified model directory
    model = Pipeline.from_pretrained(
        "PyAnnote/speaker-diarization-3.1",
        use_auth_token="hf_oBxxxxxxxxxxxx)
    return model 


def diarization_from_s3(model, s3_file, language=None):
    s3 = boto3.client("s3")
    o = urlparse(s3_file, allow_fragments=False)
    bucket = o.netloc
    key = o.path.lstrip("/")
    s3.download_file(bucket, key, "tmp.wav")
    result = model("tmp.wav")
    data = {} 
    for turn, _, speaker in result.itertracks(yield_label=True):
        data[turn] = (turn.start, turn.end, speaker)
    data_df = pd.DataFrame(data.values(), columns=["start", "end", "speaker"])
    print(data_df.shape)
    result = data_df.to_json(orient="split")
    return result


def predict_fn(data, model):
    s3_file = data.pop("s3_file")
    language = data.pop("language", None)
    result = diarization_from_s3(model, s3_file, language)
    return {
        "diarization_from_s3": result
    }

Prepare a requirements.txt file, which contains the required Python libraries necessary to run the inference:

with open("model/code/requirements.txt", "w") as f:
    f.write("transformers==4.25.1n")
    f.write("boto3n")
    f.write("PyAnnote.audion")
    f.write("soundfilen")
    f.write("librosan")
    f.write("onnxruntimen")
    f.write("wgetn")
    f.write("pandas")

Lastly, compress the inference.py and requirements.txt files and save it as model.tar.gz:

!tar zcvf model.tar.gz *

Configure a SageMaker model

Define a SageMaker model resource by specifying the image URI, model data location in Amazon Simple Storage Service (S3), and SageMaker role:

import sagemaker
import boto3

sess = sagemaker.Session()

sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Upload the model to Amazon S3

Upload the zipped PyAnnote Hugging Face model file to an S3 bucket:

s3_location = f"s3://{sagemaker_session_bucket}/whisper/model/model.tar.gz"
!aws s3 cp model.tar.gz $s3_location

Create a SageMaker asynchronous endpoint

Configure an asynchronous endpoint for deploying the model on SageMaker using the provided asynchronous inference configuration:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join
from sagemaker.utils import name_from_base

async_endpoint_name = name_from_base("custom-asyc")

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,  # path to your model and script
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.17",  # transformers version used
    pytorch_version="1.10",  # pytorch version used
    py_version="py38",  # python version used
)

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "async_inference/output"
    ),  # Where our results will be stored
    # Add nofitication SNS if needed
    notification_config={
        # "SuccessTopic": "PUT YOUR SUCCESS SNS TOPIC ARN",
        # "ErrorTopic": "PUT YOUR ERROR SNS TOPIC ARN",
    },  #  Notification configuration
)

env = {"MODEL_SERVER_WORKERS": "2"}

# deploy the endpoint endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.xx",
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
    env=env,
)

Test the endpoint

Evaluate the endpoint functionality by sending an audio file for diarization and retrieving the JSON output stored in the specified S3 output path:

# Replace with a path to audio object in S3
from sagemaker.async_inference import WaiterConfig
res = async_predictor.predict_async(data=data)
print(f"Response output path: {res.output_path}")
print("Start Polling to get response:")

config = WaiterConfig(
  max_attempts=10, #  number of attempts
  delay=10#  time in seconds to wait between attempts
  )
res.get_result(config)
#import waiterconfig

To deploy this solution at scale, we suggest using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), or Amazon Simple Queue Service (Amazon SQS). These services are designed for scalability, event-driven architectures, and efficient resource utilization. They can help decouple the asynchronous inference process from the result processing, allowing you to scale each component independently and handle bursts of inference requests more effectively.

Results

Model output is stored at s3://sagemaker-xxxx /async_inference/output/. The output shows that the audio recording has been segmented into three columns:

  • Start (start time in seconds)
  • End (end time in seconds)
  • Speaker (speaker label)

The following code shows an example of our results:

[0.9762308998, 8.9049235993, "SPEAKER_01"]

[9.533106961, 12.1646859083, "SPEAKER_01"]

[13.1324278438, 13.9303904924, "SPEAKER_00"]

[14.3548387097, 26.1884550085, "SPEAKER_00"]

[27.2410865874, 28.2258064516, "SPEAKER_01"]

[28.3446519525, 31.298811545, "SPEAKER_01"]

Clean up

You can set a scaling policy to zero by setting MinCapacity to 0; asynchronous inference lets you auto scale to zero with no requests. You don’t need to delete the endpoint, it scales from zero when needed again, reducing costs when not in use. See the following code:

# Common class representing application autoscaling for SageMaker 
client = boto3.client('application-autoscaling') 

# This is the format in which application autoscaling references the endpoint
resource_id='endpoint/' + <endpoint_name> + '/variant/' + <'variant1'> 

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=0,
    MaxCapacity=5
)

If you want to delete the endpoint, use the following code:

async_predictor.delete_endpoint(async_endpoint_name)

Benefits of asynchronous endpoint deployment

This solution offers the following benefits:

  • The solution can efficiently handle multiple or large audio files.
  • This example uses a single instance for demonstration. If you want to use this solution for hundreds or thousands of videos and use an asynchronous endpoint to process across multiple instances, you can use an auto scaling policy, which is designed for a large number of source documents. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload.
  • The solution optimizes resources and reduces system load by separating long-running tasks from real-time inference.

Conclusion

In this post, we provided a straightforward approach to deploy Hugging Face’s speaker diarization model on SageMaker using Python scripts. Using an asynchronous endpoint provides an efficient and scalable means to deliver diarization predictions as a service, accommodating concurrent requests seamlessly.

Get started today with asynchronous speaker diarization for your audio projects. Reach out in the comments if you have any questions about getting your own asynchronous diarization endpoint up and running.


About the Authors

Sanjay Tiwary is a Specialist Solutions Architect AI/ML who spends his time working with strategic customers to define business requirements, provide L300 sessions around specific use cases, and design AI/ML applications and services that are scalable, reliable, and performant. He has helped launch and scale the AI/ML powered Amazon SageMaker service and has implemented several proofs of concept using Amazon AI services. He has also developed the advanced analytics platform as a part of the digital transformation journey.

Kiran Challapalli is a deep tech business developer with the AWS public sector. He has more than 8 years of experience in AI/ML and 23 years of overall software development and sales experience. Kiran helps public sector businesses across India explore and co-create cloud-based solutions that use AI, ML, and generative AI—including large language models—technologies.

Read More