Train a Large Language Model on a single Amazon SageMaker GPU with Hugging Face and LoRA

This post is co-written with Philipp Schmid from Hugging Face.

We have all heard about the progress being made in the field of large language models (LLMs) and the ever-growing number of problem sets where LLMs are providing valuable insights. Large models, when trained over massive datasets and several tasks, are also able to generalize well over tasks that they aren’t trained specifically for. Such models are called foundation models, a term first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. Even though these foundation models are able to generalize well, especially with the help of prompt engineering techniques, often the use case is so domain specific, or the task is so different, that the model needs further customization. One approach to improve performance of a large model for a specific domain or task is to further train the model with a smaller, task-specific dataset. Although this approach, known as fine-tuning, successfully improves the accuracy of LLMs, it requires modifying all of the model weights. Fine-tuning is much faster than the pre-training of a model thanks to the much smaller dataset size, but still requires significant computing power and memory. Fine-tuning modifies all the parameter weights of the original model, which makes it expensive and results in a model that is the same size as the original.

To address these challenges, Hugging Face introduced the Parameter-Efficient Fine-Tuning library (PEFT). This library allows you to freeze most of the original model weights and replace or extend model layers by training an additional, much smaller, set of parameters. This makes training much less expensive in terms of required compute and memory.

In this post, we show you how to train the 7-billion-parameter BloomZ model using just a single graphics processing unit (GPU) on Amazon SageMaker, Amazon’s machine learning (ML) platform for preparing, building, training, and deploying high-quality ML models. BloomZ is a general-purpose natural language processing (NLP) model. We use PEFT to optimize this model for the specific task of summarizing messenger-like conversations. The single-GPU instance that we use is a low-cost example of the many instance types AWS provides. Training this model on a single GPU highlights AWS’s commitment to being the most cost-effective provider of AI/ML services.

The code for this walkthrough can be found on the Hugging Face notebooks GitHub repository under the sagemaker/24_train_bloom_peft_lora folder.

Prerequisites

In order to follow along, you should have the following prerequisites:

  • An AWS account.
  • A Jupyter notebook within Amazon SageMaker Studio or SageMaker notebook instances.
  • You will need access to the SageMaker ml.g5.2xlarge instance type, containing a single NVIDIA A10G GPU. On the AWS Management Console, navigate to Service Quotas for SageMaker and request a 1-instance increase for the ml.g5.2xlarge for training job usage quota (and, if you plan to deploy the model later in this post, the corresponding G5 endpoint usage quota).
  • After your requested quotas are applied to your account, you can use the default Studio Python 3 (Data Science) image with a ml.t3.medium instance to run the notebook code snippets. For the full list of available kernels, refer to Available Amazon SageMaker Kernels.

Set up a SageMaker session

Use the following code to set up your SageMaker session:

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Load and prepare the dataset

We use the samsum dataset, a collection of 16,000 messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English. The following is an example of the dataset:

{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?rnJerry: Sure!rnAmanda: I'll bring you tomorrow :-)"
}

To train the model, you need to convert the inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. For more information, refer to Chapter 6 of the Hugging Face NLP Course.

Convert the inputs with the following code:

from transformers import AutoTokenizer

model_id="bigscience/bloomz-7b1"

# Load tokenizer of BLOOMZ
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value

Before starting training, you need to process the data. Once it’s trained, the model will take a set of text messages as the input and generate a summary as the output. You need to format the data as a prompt (the messages) with a correct response (the summary). You also need to chunk examples into longer input sequences to optimize the model training. See the following code:

from random import randint
from itertools import chain
from functools import partial
from datasets import load_dataset

# load the samsum training split used below
dataset = load_dataset("samsum", split="train")

# custom instruct prompt start
prompt_template = f"Summarize the chat dialogue:n{{dialogue}}n---nSummary:n{{summary}}{{eos_token}}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(dialogue=sample["dialogue"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length
    else:
        # not enough tokens for a full chunk yet; keep everything for the next batch
        batch_chunk_length = 0

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

Now you can use the FileSystem integration to upload the dataset to Amazon Simple Storage Service (Amazon S3):

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/samsum-sagemaker/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Fine-tune BLOOMZ-7B with LoRA and bitsandbytes int-8 on SageMaker

The Hugging Face BLOOMZ-7B model card indicates its initial training was distributed over 8 nodes, each with 8 A100 80 GB GPUs and 512 GB of CPU memory. This computing configuration is not readily accessible, is cost-prohibitive for most consumers, and requires expertise in distributed training performance optimization. SageMaker lowers the barriers to replicating this setup through its distributed training libraries; however, eight comparable on-demand ml.p4de.24xlarge instances would cost $376.88 per hour. Furthermore, the fully trained model consumes about 40 GB of memory, which exceeds the available memory of many consumer-grade GPUs and requires large-model inference strategies to work around. As a result, fully fine-tuning the model for your task over multiple training runs, and then deploying it, would require significant compute, memory, and storage costs on hardware that isn’t readily accessible to most consumers.

Our goal is to find a way to adapt BLOOMZ-7B to our chat summarization use case in a more accessible and cost-effective way while maintaining accuracy. To enable our model to be fine-tuned on a SageMaker ml.g5.2xlarge instance with a single consumer-grade NVIDIA A10G GPU, we employ two techniques to reduce the compute and memory requirements for fine-tuning: LoRA and quantization.

LoRA (Low-Rank Adaptation) is a technique that significantly reduces the number of trainable parameters and the compute needed to fine-tune for a new task, without a loss in predictive performance. It freezes the original model weights and instead optimizes small rank-decomposition weight matrices that capture the task-specific update, which can later be injected back into the original model. Because far fewer weights receive gradient updates, fine-tuning requires less compute and GPU memory. The intuition behind this approach is that the weight changes needed to adapt a large pre-trained model to a new task have a low intrinsic rank, so they can be represented by a pair of much smaller matrices. To deepen your understanding of the LoRA technique, refer to the original paper LoRA: Low-Rank Adaptation of Large Language Models.
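To make the savings concrete, the following minimal sketch (illustrative only; the 4096x4096 layer size is an assumption, while r=8 matches the LoRA configuration shown later in this post) compares the number of trainable weights for a full fine-tune of a single weight matrix with its LoRA update:

# Illustrative LoRA parameter count for a single weight matrix W of shape (d, k).
# With LoRA, W is frozen and the update is factored as B @ A, where
# B has shape (d, r) and A has shape (r, k) with rank r much smaller than d and k.
d, k, r = 4096, 4096, 8  # assumed layer size; r=8 matches the LoraConfig below

full_finetune_params = d * k          # parameters updated by full fine-tuning
lora_params = d * r + r * k           # parameters updated by LoRA

print(f"full fine-tune: {full_finetune_params:,} trainable weights")   # 16,777,216
print(f"LoRA (r={r}):   {lora_params:,} trainable weights")            # 65,536
print(f"reduction:      {full_finetune_params / lora_params:.0f}x")    # 256x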

In addition to the LoRA technique, you use the bitsandbytes Hugging Face integration and its LLM.int8() method to quantize the frozen BloomZ model, reducing the precision of its weights by rounding them from float16 to int8. Quantization roughly halves the memory the frozen BloomZ weights require compared to float16 (about a four-times reduction compared to float32), which enables you to fit the model on the A10G GPU without a significant loss in predictive performance. To deepen your understanding of how int8 quantization works, its implementation in the bitsandbytes library, and its integration with the Hugging Face Transformers library, see A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes.
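The following minimal sketch illustrates the core absmax quantization idea behind LLM.int8() on a handful of made-up values; the actual bitsandbytes implementation works vector-wise and additionally keeps outlier features in float16:

import numpy as np

# Toy absmax int8 quantization of a few float16 weights (illustration only)
weights_fp16 = np.array([0.021, -0.74, 0.33, 1.18, -0.002], dtype=np.float16)

scale = 127 / np.max(np.abs(weights_fp16))               # map the largest magnitude to 127
weights_int8 = np.round(weights_fp16 * scale).astype(np.int8)
dequantized = (weights_int8 / scale).astype(np.float16)  # approximate reconstruction

print(weights_int8)   # int8 values: 1 byte each instead of 2 bytes per float16 weight
print(dequantized)    # close to the original values, with a small rounding error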

Hugging Face has made LoRA and quantization accessible across a broad range of transformer models through the PEFT library and its integration with the bitsandbytes library. The create_peft_config() function in the prepared script run_clm.py illustrates their usage in preparing your model for training:

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8, # Lora attention dimension.
        lora_alpha=32, # the alpha parameter for Lora scaling.
        lora_dropout=0.05, # the dropout probability for Lora layers.
        target_modules=["query_key_value"],
    )

    # prepare int-8 model for training
    model = prepare_model_for_int8_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model

With LoRA, the output from print_trainable_parameters() indicates we were able to reduce the number of trainable parameters from 7 billion to 3.9 million. This means that only about 0.06% of the original model parameters need to be updated. This significant reduction in compute and memory requirements allows us to fit and train the model on the GPU without issues.

To create a SageMaker training job, you will need a Hugging Face estimator. The estimator handles end-to-end SageMaker training and deployment tasks. SageMaker takes care of starting and managing all the required Amazon Elastic Compute Cloud (Amazon EC2) instances for you. Additionally, it provides the correct Hugging Face training container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at the path /opt/ml/input/data. Then, it starts the training job. See the following code:

import time
# define Training Job Name 
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save training dataset
  'epochs': 3,                                         # number of training epochs
  'per_device_train_batch_size': 1,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge', # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.26',            # the transformers version used in the training job
    pytorch_version      = '1.13',            # the pytorch_version version used in the training job
    py_version           = 'py39',            # the python version used in the training job
    hyperparameters      =  hyperparameters
)

You can now start your training job using the .fit() method, passing the S3 path of your training dataset:

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as inputs
huggingface_estimator.fit(data, wait=True)

Using LoRA and quantization makes fine-tuning BLOOMZ-7B to our task affordable and efficient with SageMaker. When using SageMaker training jobs, you only pay for GPUs for the duration of model training. In our example, the SageMaker training job took 20,632 seconds, which is about 5.7 hours. The ml.g5.2xlarge instance we used costs $1.515 per hour for on-demand usage. As a result, the total cost for training our fine-tuned BLOOMZ-7B model was only $8.63. Comparatively, full fine-tuning of the model’s 7 billion weights would cost an estimated $600, or 6,900% more per training run, assuming linear GPU scaling on the original computing configuration outlined in the Hugging Face model card. In practice, this would further vary depending upon your training strategy, instance selection, and instance pricing.

We could also further reduce our training costs by using SageMaker managed Spot Instances. However, there is a possibility this would result in the total training time increasing due to Spot Instance interruptions. See Amazon SageMaker Pricing for instance pricing details.

Deploy the model to a SageMaker endpoint for inference

With LoRA, you previously adapted a smaller set of weights to your new task. You need a way to combine these task-specific weights with the pre-trained weights of the original model. In the run_clm.py script, the PEFT library merge_and_unload() method takes care of merging the base BLOOMZ-7B model with the updated adapter weights fine-tuned to your task to make them easier to deploy without introducing any inference latency compared to the original model.
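If you fine-tune outside the provided script, a minimal sketch of this merge step with the PEFT library could look like the following (the adapter and output paths are placeholders, not values from run_clm.py):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "bigscience/bloomz-7b1"
adapter_path = "lora-adapter/"   # placeholder: directory containing the trained LoRA adapter
output_dir = "merged-model/"     # placeholder: where the merged model is saved

# Load the frozen base model and attach the trained LoRA adapter weights
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, adapter_path)

# Fold the low-rank updates into the base weights so inference needs no extra adapter layers
merged_model = model.merge_and_unload()

merged_model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)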

In this section, we go through the steps to create a SageMaker model from the fine-tuned model artifact and deploy it to a SageMaker endpoint for inference. First, you can create a Hugging Face model using your new fine-tuned model artifact for deployment to a SageMaker endpoint. Because you previously trained the model with a SageMaker Hugging Face estimator, you can deploy the model immediately. You could instead upload the trained model artifact to an S3 bucket and use it to create a model package later. See the following code:

from sagemaker.huggingface import HuggingFaceModel

# 1. create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=huggingface_estimator.model_data,
   #model_data="s3://hf-sagemaker-inference/model.tar.gz",  # Change to your model path
   role=role, 
   transformers_version="4.26", 
   pytorch_version="1.13", 
   py_version="py39",
   model_server_workers=1
)

As with any SageMaker model, you can deploy it using the deploy() method from the Hugging Face model object, passing in the desired number and type of instances. In this example, we use a G5 instance type equipped with a single NVIDIA A10G GPU, the same GPU family the model was fine-tuned on in the previous step:

# 2. deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type= "ml.g5.4xlarge"
)

It may take 5–10 minutes for the SageMaker endpoint to bring your instance online and download your model in order to be ready to accept inference requests.

When the endpoint is running, you can test it by sending a sample dialog from the dataset test split. First load the test split using the Hugging Face Datasets library. Next, select a random integer for index slicing a single test sample from the dataset array. Using string formatting, combine the test sample with a prompt template into a structured input to guide our model’s response. This structured input can then be combined with additional model input parameters into a formatted sample JSON payload. Finally, invoke the SageMaker endpoint with the formatted sample and print the model’s output summarizing the sample dialog. See the following code:

from random import randint
from datasets import load_dataset

# 1. Load dataset from the hub
test_dataset = load_dataset("samsum", split="test")

# 2. select a random test sample
sample = test_dataset[randint(0, len(test_dataset) - 1)]

# 3. format the sample
prompt_template = f"Summarize the chat dialogue:n{{dialogue}}n---nSummary:n"

formatted_sample = {
  "inputs": prompt_template.format(dialogue=sample["dialogue"]),
  "parameters": {
    "do_sample": True, # sample from the predicted output probabilities
    "top_p": 0.9, # nucleus sampling, Fan et al. (2018)
    "temperature": 0.1, # increase the likelihood of high-probability words and decrease the likelihood of low-probability words
    "max_new_tokens": 100, # maximum number of tokens to generate
  }
}

# 4. Invoke the SageMaker endpoint with the formatted sample
res = predictor.predict(formatted_sample)


# 5. Print the model output
print(res[0]["generated_text"].split("Summary:")[-1])
# Sample model output: Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together.

Now let’s compare the model summarized dialog output to the test sample summary:

print(sample["summary"])
# Sample reference summary: Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling.

Clean up

Now that you’ve tested your model, make sure that you clean up the associated SageMaker resources to prevent continued charges:

predictor.delete_model()
predictor.delete_endpoint()

Summary

In this post, you used the Hugging Face Transformers, PEFT, and bitsandbytes libraries with SageMaker to fine-tune a BloomZ large language model on a single GPU for about $8.63 and then deployed the model to a SageMaker endpoint for inference on a test sample. SageMaker offers multiple ways to use Hugging Face models; for more examples, check out the AWS Samples GitHub.

To continue using SageMaker to fine-tune foundation models, try out some of the techniques in the post Architect personalized generative AI SaaS applications on Amazon SageMaker. We also encourage you to learn more about Amazon generative AI capabilities by exploring SageMaker JumpStart, Amazon Titan models, and Amazon Bedrock.


About the Authors

Philipp Schmid is a Technical Lead at Hugging Face with the mission to democratize good machine learning through open source and open science. Philipp is passionate about productionizing cutting-edge and generative AI machine learning models. He loves to share his knowledge on AI and NLP at various meetups such as Data Science on AWS, and on his technical blog.

Robert Fisher is a Sr. Solutions Architect for Healthcare and Life Sciences customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML space. Robert has many years of experience in software engineering across a range of industry verticals including medical devices, fintech, and consumer-facing applications.

Doug Kelly is an AWS Sr. Solutions Architect who serves as a trusted technical advisor to top machine learning startups in verticals ranging from machine learning platforms and autonomous vehicles to precision agriculture. He is a member of the AWS ML technical field community, where he specializes in supporting customers with MLOps and ML inference workloads.

Read More

Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker

This post is co-written with Philipp Schmid and Jeff Boudier from Hugging Face.

Today, as part of Amazon Web Services’ partnership with Hugging Face, we are excited to announce the release of a new Hugging Face Deep Learning Container (DLC) for inference with Large Language Models (LLMs). This new Hugging Face LLM DLC is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving Large Language Models. TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, StableLM, Llama, and T5.

Large Language Models are growing in popularity but can be difficult to deploy

LLMs have emerged as the leading edge of artificial intelligence, captivating developers and enthusiasts alike with their ability to comprehend and generate human-like text across diverse domains. These powerful models, such as those based on the GPT and T5 architectures, have experienced an unprecedented surge in popularity for a broad set of applications, including language understanding, conversational experiences, and automated writing assistance. As a result, companies across industries are seizing the opportunity to unlock their potential and offer new LLM-powered experiences in their applications.

Hosting LLMs at scale presents a unique set of complex engineering challenges. To provide an ideal user experience, an LLM hosting service should provide adequate response times while scaling to a large number of concurrent users. Given the high resource requirements of large models, general-purpose inference frameworks may not provide the optimizations required to maximize the utilization of available resources and provide the best possible performance.

Some of these optimizations include:

  • Tensor parallelism to distribute the computation across multiple accelerators
  • Model quantization to reduce the memory footprint of the model
  • Dynamic batching of inference requests to improve throughput, among others

The Hugging Face LLM DLC provides these optimizations out of the box and makes it easier to host LLM models at scale.

Hugging Face’s Text Generation Inference simplifies LLM deployment

TGI is an open source, purpose-built solution for deploying Large Language Models (LLMs). It incorporates optimizations including tensor parallelism for faster multi-GPU inference, dynamic batching to boost overall throughput, and optimized transformers code using flash-attention for popular model architectures including BLOOM, T5, GPT-NeoX, StarCoder, and LLaMa.

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like HuggingChat, OpenAssistant, and Inference API for LLM models on the Hugging Face Hub, while enjoying SageMaker’s managed service capabilities, such as autoscaling, health checks, and model monitoring.

Get started with TGI on SageMaker Hosting

Let’s walk through a code example that deploys a GPT NeoX 20B parameter model on a SageMaker Endpoint. You can find our complete example notebook here.

First, make sure that the latest version of SageMaker SDK is installed:

%pip install "sagemaker>=2.161.0"

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

Next, we retrieve the LLM image URI. We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference. The function takes a required backend parameter and several optional parameters. The backend specifies the type of backend to use for the model; the values can be “lmi” or “huggingface”. “lmi” stands for the SageMaker Large Model Inference backend, and “huggingface” refers to the Hugging Face TGI backend used in this tutorial.

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", # or lmi
  region=region
)

Now that we have the image uri, the next step is to configure the model object. We specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables including the HF_MODEL_ID which corresponds to the model from the HuggingFace Hub that will be deployed, and the HF_TASK which configures the inference task to be performed by the model.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Here, you should set SM_NUM_GPUS to the number of available GPUs on your selected instance type. For example, in this tutorial, we set SM_NUM_GPUS to 4 because our selected instance type ml.g4dn.12xlarge has 4 available GPUs.

Note that you can optionally reduce the memory and computational footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to “true”, but this lower weight precision could affect the quality of the output for some models.

model_name = "gpt-neox-20b-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'EleutherAI/gpt-neox-20b',
    'HF_TASK':'text-generation',
    'SM_NUM_GPUS':'4',
    'HF_MODEL_QUANTIZE':'true'
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)

Next, we invoke the deploy method to deploy the model.

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g4dn.12xlarge",
  endpoint_name=model_name
)

Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}

predictor.predict(input_data)

We receive the following auto-generated text response:

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]

To mitigate the risk of potential exploitation of Generative AI capabilities by automated bots, the response is watermarked. Such watermarked responses can be easily detected by algorithms, promoting the responsible use of Generative AI.



Once we are done experimenting, we delete the endpoint and the model resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion and next steps

Deploying Large Language Models using Hugging Face’s Text Generation Inference and SageMaker Hosting is a straightforward solution for hosting open-source models like GPT-NeoX, Flan-T5-XXL, StarCoder, or LLaMa. State-of-the-art LLMs are deployed within the secure, managed SageMaker environment, and AWS customers can benefit from Large Language Models while keeping full control over their implementation and without sending their data to a third-party API.

In this tutorial, we demonstrated the deployment of GPT-NeoX using the new Hugging Face LLM Inference DLC, leveraging the power of 4 GPUs on a SageMaker ml.g4dn.12xlarge instance. With this approach, users can effortlessly harness the capabilities of state-of-the-art language models, enabling a wide range of applications and advancements in natural language processing.

As a next step, you can learn more about Hugging Face LLM Inference on SageMaker in the Hugging Face and SageMaker documentation.


About the authors


Philipp Schmid
is a Technical Lead at Hugging Face with the mission to democratize good machine learning through open source and open science. Philipp is passionate about productionizing cutting-edge & generative AI machine learning models.

Jeff Boudier builds products at Hugging Face, the #1 open platform for AI builders. Previously Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing,  Business Development and Corporate Development.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Xin Yang is a Software Development Engineer at AWS. She has been working on deploying and optimizing deep learning inference systems. Her work spans both the realms of real-time inference and scalable offline inference solutions. In her spare time, Xin enjoys reading and hiking.

Gagan Singh is a Senior Technical Account Manager at AWS helping digital native startups maximize business success. He helps customers with adoption and optimization of real-time, multi-model ML inferencing endpoints using Amazon SageMaker. In his spare time, Gagan enjoys trekking in the Himalayas and listening to music.

Read More

Implement a multi-object tracking solution on a custom dataset with Amazon SageMaker

The demand for multi-object tracking (MOT) in video analysis has increased significantly in many industries, such as live sports, manufacturing, and traffic monitoring. For example, in live sports, MOT can track soccer players in real time to analyze physical performance such as real-time speed and moving distance.

Since its introduction in 2021, ByteTrack has remained one of the best-performing methods on various benchmark datasets, among the latest model developments for MOT applications. In ByteTrack, the authors propose a simple, effective, and generic data association method (referred to as BYTE) for matching detection boxes and tracklets. Rather than keeping only the high-score detection boxes, it also keeps the low-score ones, which can help recover unmatched tracklets when occlusion, motion blur, or size changes occur. The BYTE association strategy can also be used in other Re-ID based trackers, such as FairMOT. The experiments showed improvements compared to the vanilla tracker algorithms. For example, FairMOT achieved an improvement of 1.3% on MOTA (FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking), one of the main metrics in the MOT task, when applying BYTE in data association.
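The following is a deliberately simplified sketch of the two-stage BYTE association idea (for intuition only; the actual ByteTrack implementation uses Kalman filter motion prediction and Hungarian matching rather than the greedy IoU matching shown here):

# Simplified sketch of the BYTE association strategy (illustration only)
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def byte_associate(tracklets, detections, high_thresh=0.6, match_thresh=0.3):
    """Two-stage association: match high-score boxes first, then low-score boxes."""
    high = [d for d in detections if d["score"] >= high_thresh]
    low = [d for d in detections if d["score"] < high_thresh]
    unmatched = list(tracklets)
    matches = []

    for pool in (high, low):  # stage 1: high-score boxes, stage 2: low-score boxes
        for det in pool:
            if not unmatched:
                break
            best = max(unmatched, key=lambda t: iou(t["box"], det["box"]))
            if iou(best["box"], det["box"]) >= match_thresh:
                matches.append((best["id"], det))
                unmatched.remove(best)
    return matches, unmatched  # unmatched tracklets can be kept alive or removed


# Example: two tracklets from the previous frame and two detections in the current frame
tracklets = [{"id": 1, "box": (100, 100, 140, 220)}, {"id": 2, "box": (300, 90, 340, 210)}]
detections = [
    {"box": (102, 101, 141, 223), "score": 0.92},  # clearly visible person
    {"box": (296, 92, 338, 214), "score": 0.35},   # partially occluded person, low score
]
print(byte_associate(tracklets, detections))

In this example, the partially occluded person only produces a low-score detection, but the second association stage still matches it to its existing tracklet instead of dropping it.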

In the post Train and deploy a FairMOT model with Amazon SageMaker, we demonstrated how to train and deploy a FairMOT model with Amazon SageMaker on the MOT challenge datasets. When applying a MOT solution in real-world cases, you need to train or fine-tune a MOT model on a custom dataset. With Amazon SageMaker Ground Truth, you can effectively create labels on your own video dataset.

Following on the previous post, we have added the following contributions and modifications:

  • Generate labels for a custom video dataset using Ground Truth
  • Preprocess the Ground Truth generated label to be compatible with ByteTrack and other MOT solutions
  • Train the ByteTrack algorithm with a SageMaker training job (with the option to extend a pre-built container)
  • Deploy the trained model with various deployment options, including asynchronous inference

We also provide the code sample on GitHub, which uses SageMaker for labeling, building, training, and inference.

SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker provides several built-in algorithms and container images that you can use to accelerate training and deployment of ML models. Additionally, custom algorithms such as ByteTrack can also be supported via custom-built Docker container images. For more information about deciding on the right level of engagement with containers, refer to Using Docker containers with SageMaker.

SageMaker provides plenty of options for model deployment, such as real-time inference, serverless inference, and asynchronous inference. In this post, we show how to deploy a tracking model with different deployment options, so that you can choose the suitable deployment method in your own use case.

Overview of solution

Our solution consists of the following high-level steps:

  1. Label the dataset for tracking, with a bounding box on each object (for example, pedestrian, car, and so on). Set up the resources for ML code development and execution.
  2. Train a ByteTrack model and tune hyperparameters on a custom dataset.
  3. Deploy the trained ByteTrack model with different deployment options depending on your use case: real-time processing, asynchronous, or batch prediction.

The following diagram illustrates the architecture in each step.

Prerequisites

Before getting started, complete the following prerequisites:

  1. Create an AWS account or use an existing AWS account.
  2. We recommend running the source code in the us-east-1 Region.
  3. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge for single-GPU training, or ml.p3.16xlarge for a distributed training job). Other types of GPU instances are also supported, with varying performance differences.
  4. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for inference endpoint.
  5. Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for running batch prediction with processing jobs.

If this is your first time running SageMaker services on the aforementioned instance types, you may have to request a quota increase for the required instances.

Set up your resources

After you complete all the prerequisites, you’re ready to deploy the solution.

  1. Create a SageMaker notebook instance. For this task, we recommend the ml.t3.medium instance type. While running the code, we use docker build to extend the SageMaker training image with the ByteTrack code (the docker build command runs locally within the notebook instance environment). Therefore, we recommend increasing the volume size to 100 GB (the default volume size is 5 GB) from the advanced configuration options. For your AWS Identity and Access Management (IAM) role, choose an existing role or create a new role, and attach the AmazonS3FullAccess, AmazonSNSFullAccess, AmazonSageMakerFullAccess, and AmazonElasticContainerRegistryPublicFullAccess policies to the role.
  2. Clone the GitHub repo to the /home/ec2-user/SageMaker folder on the notebook instance you created.
  3. Create a new Amazon Simple Storage Service (Amazon S3) bucket or use an existing bucket.

Label the dataset

In the data-preparation.ipynb notebook, we download an MOT16 test video file and split the video file into small video files with 200 frames. Then we upload those video files to the S3 bucket as the data source for labeling.

To label the dataset for the MOT task, refer to Getting started. When the labeling job is complete, we can access the annotation directory at the job output location in the S3 bucket.

The manifests directory should contain an output folder if we finished labeling all the files. We can see the file output.manifest in the output folder. This manifest file contains information about the video and video tracking labels that you can use later to train and test a model.

Train a ByteTrack model and tune hyperparameters on the custom dataset

To train your ByteTrack model, we use the bytetrack-training.ipynb notebook. The notebook consists of the following steps:

  1. Initialize the SageMaker setting.
  2. Perform data preprocessing.
  3. Build and push the container image.
  4. Define a training job.
  5. Launch the training job.
  6. Tune hyperparameters.

Especially in data preprocessing, we need to convert the labeled dataset from the Ground Truth output format to the MOT17 format, and then convert the MOT17 format dataset to an MSCOCO format dataset (as shown in the following figure) so that we can train a YOLOX model on the custom dataset. Because we keep both the MOT format dataset and the MSCOCO format dataset, you can train other MOT algorithms that don't separate detection and tracking on the MOT format dataset. You can also easily change the detector to another algorithm, such as YOLOv7, to use your existing object detection algorithm.
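As an illustration of that second conversion, the following hedged sketch turns MOT-format ground-truth lines into a COCO-style annotation dictionary; the column layout follows the MOT challenge convention, and the sample values and category mapping are made up for illustration rather than taken from the notebook:

# Hedged sketch: convert MOT-format ground-truth lines into COCO-style annotations.
# MOT gt.txt columns: frame, track_id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility.
import json

def mot_line_to_coco(line, ann_id):
    frame, track_id, x, y, w, h, conf, cls, vis = [float(v) for v in line.split(",")[:9]]
    return {
        "id": ann_id,
        "image_id": int(frame),   # one COCO image per video frame
        "category_id": int(cls),  # category mapping below is an assumption for illustration
        "bbox": [x, y, w, h],     # COCO uses [top-left x, top-left y, width, height]
        "area": w * h,
        "iscrowd": 0,
    }

mot_lines = ["1,1,794.27,247.59,71.24,174.88,1,1,0.8", "1,2,1648.10,119.61,66.50,163.24,1,1,0.9"]
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "height": 1080, "width": 1920}],
    "categories": [{"id": 1, "name": "pedestrian"}],
    "annotations": [mot_line_to_coco(l, i) for i, l in enumerate(mot_lines, start=1)],
}
print(json.dumps(coco, indent=2))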

Deploy the trained ByteTrack model

After we train the YOLOX model, we deploy the trained model for inference. SageMaker provides several options for model deployment, such as real-time inference, asynchronous inference, serverless inference, and batch inference. In our post, we use the sample code for real-time inference, asynchronous inference, and batch inference. You can choose the suitable code from these options based on your own business requirements.

Because SageMaker batch transform requires the data to be partitioned and stored on Amazon S3 as input and the invocations are sent to the inference endpoints concurrently, it doesn’t meet the requirements in object tracking tasks where the targets need to be sent in a sequential manner. Therefore, we don’t use the SageMaker batch transform jobs to run the batch inference. In this example, we use SageMaker processing jobs to do batch inference.

The following table summarizes the configuration for our inference jobs.

| Inference Type | Payload | Processing Time | Auto Scaling |
| --- | --- | --- | --- |
| Real-time | Up to 6 MB | Up to 1 minute | Minimum instance count is 1 or higher |
| Asynchronous | Up to 1 GB | Up to 15 minutes | Minimum instance count can be zero |
| Batch (with processing job) | No limit | No limit | Not supported |

Deploy a real-time inference endpoint

To deploy a real-time inference endpoint, we can run the bytetrack-inference-yolox.ipynb notebook. We separate ByteTrack inference into object detection and tracking. In the inference endpoint, we only run the YOLOX model for object detection. In the notebook, we create a tracking object, receive the result of object detection from the inference endpoint, and update trackers.

We use the SageMaker Python SDK PyTorchModel class to create and deploy a ByteTrack model as follows:

from sagemaker.pytorch.model import PyTorchModel
 
pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="sagemaker-serving/code",
    entry_point="inference.py",
    framework_version="1.7.1",
    py_version="py3",
)
 
endpoint_name = "<endpoint name>"
pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name=endpoint_name
)

After we deploy the model to an endpoint successfully, we can invoke the inference endpoint with the following code snippet:

with open(f"datasets/frame_{frame_id}.png", "rb") as f:
    payload = f.read()

response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/x-image", Body=payload
)
outputs = json.loads(response["Body"].read().decode())

We run the tracking task on the client side after accepting the detection result from the endpoint (see the following code). By drawing the tracking results in each frame and saving as a tracking video, you can confirm the tracking result on the tracking video.

aspect_ratio_thresh = 1.6
min_box_area = 10
tracker = BYTETracker(
        frame_rate=30,
        track_thresh=0.5,
        track_buffer=30,
        mot20=False,
        match_thresh=0.8
    )

online_targets = tracker.update(torch.as_tensor(outputs[0]), [height, width], (800, 1440))
online_tlwhs = []
online_ids = []
online_scores = []
for t in online_targets:
    tlwh = t.tlwh
    tid = t.track_id
    vertical = tlwh[2] / tlwh[3] > aspect_ratio_thresh
    if tlwh[2] * tlwh[3] > min_box_area and not vertical:
        online_tlwhs.append(tlwh)
        online_ids.append(tid)
        online_scores.append(t.score)
        results.append(
            f"{frame_id},{tid},{tlwh[0]:.2f},{tlwh[1]:.2f},{tlwh[2]:.2f},{tlwh[3]:.2f},{t.score:.2f},-1,-1,-1n"
        )
online_im = plot_tracking(
    frame, online_tlwhs, online_ids, frame_id=frame_id + 1, fps=1. / timer.average_time
)

Deploy an asynchronous inference endpoint

SageMaker asynchronous inference is the ideal option for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near-real-time latency requirements. For MOT tasks, it’s common that a video file is beyond 6 MB, which is the payload limit of a real-time endpoint. Therefore, we deploy an asynchronous inference endpoint. Refer to Asynchronous inference for more details of how to deploy an asynchronous endpoint. We can reuse the model created for the real-time endpoint; for this post, we put a tracking process into the inference script so that we can get the final tracking result directly for the input video.

To use scripts related to ByteTrack on the endpoint, we need to put the tracking script and model into the same folder and compress the folder as the model.tar.gz file, and then upload it to the S3 bucket for model creation. The following diagram shows the structure of model.tar.gz.

We need to explicitly set the request size, response size, and response timeout as the environment variables, as shown in the following code. The name of the environment variable varies depending on the framework. For more details, refer to Create an Asynchronous Inference Endpoint.

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    entry_point="inference.py",
    framework_version="1.7.1",
    sagemaker_session=sm_session,
    py_version="py3",
    env={
        'TS_MAX_REQUEST_SIZE': '1000000000', # default max request size is 6 MB for TorchServe; increase it to support the 1 GB input payload
        'TS_MAX_RESPONSE_SIZE': '1000000000',
        'TS_DEFAULT_RESPONSE_TIMEOUT': '900' # max timeout is 15mins (900 seconds)
    }
)

pytorch_model.create(
    instance_type="ml.p3.2xlarge",
)

When invoking the asynchronous endpoint, instead of sending the payload in the request, we send the Amazon S3 URL of the input video. When the model inference finishes processing the video, the results will be saved on the S3 output path. We can configure Amazon Simple Notification Service (Amazon SNS) topics so that when the results are ready, we can receive an SNS message as a notification.
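A minimal sketch of this flow with the SageMaker Python SDK is shown below; the bucket paths, content type, and SNS topics are placeholders, and the exact wiring in the sample notebook may differ:

import boto3
from sagemaker.async_inference import AsyncInferenceConfig

# Deploy the model created above as an asynchronous endpoint (paths and names are placeholders)
async_config = AsyncInferenceConfig(
    output_path="s3://<bucket>/async-inference/output",
    max_concurrent_invocations_per_instance=1,
    # Optionally notify via SNS when a request succeeds or fails:
    # notification_config={"SuccessTopic": "<sns-topic-arn>", "ErrorTopic": "<sns-topic-arn>"},
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    async_inference_config=async_config,
)

# Invoke with the S3 URI of the input video instead of sending the payload itself
sm_runtime = boto3.client("sagemaker-runtime")
response = sm_runtime.invoke_endpoint_async(
    EndpointName=predictor.endpoint_name,
    InputLocation="s3://<bucket>/async-inference/input/test-video.mp4",
    ContentType="video/mp4",  # content type is an assumption for illustration
)
print(response["OutputLocation"])  # S3 URI where the tracking result will be written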

Run batch inference with SageMaker processing

For video files bigger than 1 GB, we use a SageMaker processing job to do batch inference. We define a custom Docker container to run a SageMaker processing job (see the following code). We draw the tracking result on the input video. You can find the result video in the S3 bucket defined by s3_output.

from sagemaker.processing import ProcessingInput, ProcessingOutput
script_processor.run(
    code='./container-batch-inference/predict.py',
    inputs=[
        ProcessingInput(source=s3_input, destination="/opt/ml/processing/input"),
        ProcessingInput(source=s3_model_uri, destination="/opt/ml/processing/model"),
    ], 
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output', destination=s3_output),
    ]
)

Clean up

To avoid unnecessary costs, delete the resources you created as part of this solution, including the inference endpoint.

Conclusion

This post demonstrated how to implement a multi-object tracking solution on a custom dataset using one of the state-of-the-art algorithms on SageMaker. We also demonstrated three deployment options on SageMaker so that you can choose the optimal option for your own business scenario. If the use case requires low latency and needs a model to be deployed on an edge device, you can deploy the MOT solution at the edge with AWS Panorama.

For more information, refer to Multi Object Tracking using YOLOX + BYTE-TRACK and data analysis.


About the Authors

Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Guang Yang is a Senior Applied Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Read More

Translate documents in real time with Amazon Translate

A critical component of business success is the ability to connect with customers. Businesses today want to connect with their customers by offering their content across multiple languages in real time. For most customers, the content creation process is disconnected from the localization effort of translating content into multiple target languages. These disconnected processes delay the business’s ability to simultaneously publish content in multiple languages, inhibiting outreach efforts, which negatively impacts time to market and revenue.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Now, Amazon Translate offers real-time document translation to seamlessly integrate and accelerate content creation and localization. You can submit a document from the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK and receive the translated document in real time while maintaining the format of the original document. This feature eliminates the wait for documents to be translated in asynchronous batch mode.

Real-time document translation currently supports plain text and HTML documents. You can use other Amazon Translate features such as custom terminology, profanity masking, and formality as part of the real-time document translation.
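As an illustration, a hedged sketch of combining these features in a single SDK call might look like the following (the terminology name is a placeholder, and this is separate from the walkthrough later in this post):

import boto3

translate = boto3.client('translate')

with open('source-lang.txt', 'rb') as f:
    data = f.read()

# Sketch: real-time document translation combined with other Amazon Translate features.
# "my-terminology" is a placeholder for a custom terminology you have already created.
result = translate.translate_document(
    Document={"Content": data, "ContentType": "text/plain"},
    SourceLanguageCode="en",
    TargetLanguageCode="fr",
    TerminologyNames=["my-terminology"],
    Settings={"Formality": "FORMAL", "Profanity": "MASK"},
)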

In this post, we will show you how to use this new feature.

Solution overview

This post walks you through the steps required to use real-time document translation with the console, AWS CLI, and Amazon Translate SDK. As an example, we will translate this sample text file from English to French.

Use Amazon Translate via the console

Follow these steps to try out real-time document translation on the console:

  1. On the Amazon Translate console, choose Real-time translation in the navigation pane.
  2. Choose the Document tab.
  3. Specify the language of the source file as English.
  4. Specify the language of the target file as French.

Note: The source or target language must be English for real-time document translation.

  5. Select Choose file and upload the file you want to translate.
  6. Specify the document type.

Text and HTML formats are supported at the time of this writing.

  7. Under Additional settings, you can use other Amazon Translate features in conjunction with real-time document translation.

For more information about Amazon Translate features, refer to the following resources:

  8. Choose Translate and Download.

The translated file is automatically saved to your browser’s downloaded folder, usually to Downloads. The target language code will be prefixed to the translated file’s name. For example, if your source file name is lang.txt and your target language is French (fr), then the translated file will be named fr.lang.txt.

Use Amazon Translate with the AWS CLI

You can translate the contents of a file using the following AWS CLI command. In this example, the contents of source-lang.txt will be translated into target-lang.txt.

aws translate translate-document --source-language-code en --target-language-code es \
    --document-content fileb://source-lang.txt \
    --document ContentType=text/plain \
    --query "TranslatedDocument.Content" \
    --output text | base64 --decode > target-lang.txt

Use the Amazon Translate SDK (Python Boto3)

You can use the following Python code to invoke Amazon Translate SDK API to translate text or HTML documents synchronously:

import boto3
import argparse

# Initialize parser
parser = argparse.ArgumentParser()
parser.add_argument("SourceLanguageCode")
parser.add_argument("TargetLanguageCode")
parser.add_argument("SourceFile")
args = parser.parse_args()


translate = boto3.client('translate')

localFile = args.SourceFile
file = open(localFile, "rb")
data = file.read()
file.close()


result = translate.translate_document(
    Document={
            "Content": data,
            "ContentType": "text/html"
        },
    SourceLanguageCode=args.SourceLanguageCode,
    TargetLanguageCode=args.TargetLanguageCode
)
if "TranslatedDocument" in result:
    fileName = localFile.split("/")[-1]
    tmpfile = f"{args.TargetLanguageCode}-{fileName}"
    with open(tmpfile, 'w', encoding='utf-8') as f:
        # Content is returned as bytes; decode it before writing the text file
        f.write(result["TranslatedDocument"]["Content"].decode('utf-8'))

    print("Translated document ", tmpfile)

This program accepts three arguments: source language, target language, and file path. Use the following command to invoke this program:

python syncDocumentTranslation.py en es source-lang.txt

Conclusion

The real-time document translation feature in Amazon Translate can expedite time to market by enabling easy integration with your content creation and localization processes.

For more information about Amazon Translate, visit Amazon Translate resources to find video resources and blog posts, and refer to the Amazon Translate FAQs.


About the Authors

Sathya Balakrishnan is a Senior Consultant in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.

RG Thiyagarajan is a Senior Consultant in Professional Services at AWS, specializing in application migration, security, and resiliency with US federal financial clients.

Sid Padgaonkar is the Senior Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him playing squash and exploring the food scene in the Pacific Northwest.

Read More

Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances

Running machine learning (ML) workloads with containers is becoming a common practice. Containers can fully encapsulate not just your training code, but the entire dependency stack down to the hardware libraries and drivers. What you get is an ML development environment that is consistent and portable. With containers, scaling on a cluster becomes much easier.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are purpose built for high-performance deep learning training. Trn1 instances deliver up to 50% savings on training costs over other comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. The AWS Neuron SDK was also released to support this acceleration, giving developers tools such as a compiler, runtime, and profiler to achieve high-performance and cost-effective model training.

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications. Simply describe your application and the resources required, and Amazon ECS will launch, monitor, and scale your application across flexible compute options with automatic integrations to other supporting AWS services that your application needs.

In this post, we show you how to run your ML training jobs in a container using Amazon ECS to deploy, manage, and scale your ML workload.

Solution overview

We walk you through the following high-level steps:

  1. Provision an ECS cluster of Trn1 instances with AWS CloudFormation.
  2. Build a custom container image with the Neuron SDK and push it to Amazon Elastic Container Registry (Amazon ECR).
  3. Create a task definition to define an ML training job to be run by Amazon ECS.
  4. Run the ML task on Amazon ECS.

Prerequisites

To follow along, you should be familiar with core AWS services such as Amazon EC2 and Amazon ECS.

Provision an ECS cluster of Trn1 instances

To get started, launch the provided CloudFormation template, which will provision required resources such as a VPC, ECS cluster, and EC2 Trainium instance.

We use the Neuron SDK to run deep learning workloads on AWS Inferentia and Trainium-based instances. It supports your end-to-end ML development lifecycle: creating new models, optimizing them, and then deploying them for production. To train your model with Trainium, the Neuron SDK needs to be installed on the EC2 instances where the ECS tasks will run (to map the NeuronDevice associated with the hardware) and in the Docker image that is pushed to Amazon ECR (to provide the commands that train your model).

Standard versions of Amazon Linux 2 or Ubuntu 20 don’t come with AWS Neuron drivers installed. Therefore, we have two different options.

The first option is to use a Deep Learning Amazon Machine Image (DLAMI) that has the Neuron SDK already installed. A sample is available on the GitHub repo. You can choose a DLAMI based on the operating system. Then run the following command to get the AMI ID:

aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

The output will be as follows:

ami-06c40dd4f80434809

This AMI ID can change over time, so make sure to use the command to get the right AMI ID.

Now you can change this AMI ID in the CloudFormation script and use the ready-to-use Neuron SDK. To do this, look for EcsAmiId in Parameters:

"EcsAmiId": { 
    "Type": "String", 
    "Description": "AMI ID", 
    "Default": "ami-09def9404c46ac27c" 
}

The second option is to install the Neuron SDK through the user data field during stack creation. You don't need to install it manually because CloudFormation sets this up for you. For more information, refer to the Neuron Setup Guide.

For this post, we use option 2, in case you need to use a custom image. Complete the following steps:

  1. Launch the provided CloudFormation template.
  2. For KeyName, enter the name of your desired key pair, and it will preload the parameters. For this post, we use trainium-key.
  3. Enter a name for your stack.
  4. If you’re running in the us-east-1 Region, you can keep the values for ALBName and AZIds at their default.

To check what Availability Zone in the Region has Trn1 available, run the following command:

aws ec2 describe-instance-type-offerings --region us-east-1 --location-type availability-zone --filter Name=instance-type,Values=trn1.2xlarge

  5. Choose Next and finish creating the stack.

When the stack is complete, you can move to the next step.

Prepare and push an ECR image with the Neuron SDK

Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. We use Amazon ECR to store a custom Docker image containing our scripts and Neuron packages needed to train a model with ECS jobs running on Trn1 instances. You can create an ECR repository using the AWS Command Line Interface (AWS CLI) or AWS Management Console. For this post, we use the console. Complete the following steps:

  1. On the Amazon ECR console, create a new repository.
  2. For Visibility settings¸ select Private.
  3. For Repository name, enter a name.
  4. Choose Create repository.

Now that you have a repository, let's build and push an image, which you can build locally (on your laptop) or in an AWS Cloud9 environment. We are training a multi-layer perceptron (MLP) model. For the original code, refer to Multi-Layer Perceptron Training Tutorial.

  1. Copy the train.py and model.py files into a project.

It’s already compatible with Neuron, so you don’t need to change any code.

  2. Create a Dockerfile that has the commands to install the Neuron SDK and the training scripts:
FROM amazonlinux:2

RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install aws-neuronx-collectives-2.* -y
RUN yum install aws-neuronx-runtime-lib-2.* -y
RUN yum install aws-neuronx-tools-2.* -y
RUN yum install -y tar gzip pip
RUN yum install -y python3 python3-pip
RUN yum install -y python3.7-venv gcc-c++
RUN python3.7 -m venv aws_neuron_venv_pytorch

# Activate Python venv
ENV PATH="/aws_neuron_venv_pytorch/bin:$PATH"
RUN python -m pip install -U pip
RUN python -m pip install wget
RUN python -m pip install awscli

RUN python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
RUN python -m pip install torchvision tqdm torch-neuronx neuronx-cc==2.* pillow
RUN mkdir -p /opt/ml/mnist_mlp
COPY model.py /opt/ml/mnist_mlp/model.py
COPY train.py /opt/ml/mnist_mlp/train.py
RUN chmod +x /opt/ml/mnist_mlp/train.py
CMD ["python3", "/opt/ml/mnist_mlp/train.py"]

To create your own Dockerfile using Neuron, refer to Develop on AWS ML accelerator instance, where you can find guides for other OS and ML frameworks.

  3. Build an image and then push it to Amazon ECR using the following code (provide your Region, account ID, and ECR repository name):
aws ecr get-login-password --region {your-region} | docker login --username AWS --password-stdin {your-account-id}.dkr.ecr.{your-region}.amazonaws.com

docker build -t mlp_trainium .

docker tag mlp_trainium:latest {your-account-id}.dkr.ecr.{your-region}.amazonaws.com/{your-ecr-repo-name}:latest

docker push {your-account-id}.dkr.ecr.{your-region}.amazonaws.com/{your-ecr-repo-name}:latest

After this, your image version should be visible in the ECR repository that you created.
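If you want to verify the push programmatically, the following minimal Boto3 sketch lists the image tags in the repository (the repository name and Region are placeholders; replace them with your own):

import boto3

# Placeholder repository name and Region: use the ECR repository and Region you created earlier
ecr = boto3.client("ecr", region_name="us-east-1")

response = ecr.describe_images(repositoryName="mlp_trainium")
for image in response["imageDetails"]:
    print(image.get("imageTags"), image["imagePushedAt"])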

Run the ML training job as an ECS task

To run the ML training task on Amazon ECS, you first need to create a task definition. A task definition is required to run Docker containers in Amazon ECS.

  1. On the Amazon ECS console, choose Task definitions in the navigation pane.
  2. On the Create new task definition menu, choose Create new task definition with JSON.

You can use the following task definition template as a baseline. Note that in the image field, you can use the one generated in the previous step. Make sure it includes your account ID and ECR repository name.

To confirm that Neuron is available, check that the device /dev/neuron0 is mapped in the devices block. This maps to a single NeuronDevice running on the trn1.2xlarge instance with two cores.

  3. Create your task definition using the following template:
{
    "family": "mlp_trainium",
    "containerDefinitions": [
        {
            "name": "mlp_trainium",
            "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
            "cpu": 0,
            "memoryReservation": 1000,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "IPC_LOCK"
                    ]
                },
                "devices": [
                    {
                        "hostPath": "/dev/neuron0",
                        "containerPath": "/dev/neuron0",
                        "permissions": [
                            "read",
                            "write"
                        ]
                    }
                ]
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/task-logs",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:ecs.os-type == linux"
        },
        {
            "type": "memberOf",
            "expression": "attribute:ecs.instance-type == trn1.2xlarge"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "1024",
    "memory": "3072"
}

You can also complete this step using the AWS CLI with the following command:

aws ecs register-task-definition \
--family mlp-trainium \
--container-definitions '[{    
    "name": "my-container-1",    
    "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
    "cpu": 0,
    "memoryReservation": 1000,
    "portMappings": [],
    "essential": true,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-create-group": "true",
            "awslogs-group": "/ecs/task-logs",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
        }
    },
    "linuxParameters": {
        "capabilities": {
            "add": [
                "IPC_LOCK"
            ]
        },
        "devices": [{
            "hostPath": "/dev/neuron0",
            "containerPath": "/dev/neuron0",
            "permissions": ["read", "write"]
        }]
    }
}]' \
--requires-compatibilities EC2 \
--cpu "8192" \
--memory "16384" \
--placement-constraints '[{
    "type": "memberOf",
    "expression": "attribute:ecs.instance-type == trn1.2xlarge"
}, {
    "type": "memberOf",
    "expression": "attribute:ecs.os-type == linux"
}]'

Run the task on Amazon ECS

After we have created the ECS cluster, pushed the image to Amazon ECR, and created the task definition, we run the task definition to train a model on Amazon ECS.

  1. On the Amazon ECS console, choose Clusters in the navigation pane.
  2. Open your cluster.
  3. On the Tasks tab, choose Run new task.

  4. For Launch type, choose EC2.

  5. For Application type, select Task.
  6. For Family, choose the task definition you created.

  7. In the Networking section, specify the VPC created by the CloudFormation stack, subnet, and security group.

  8. Choose Create.

You can monitor your task on the Amazon ECS console.

You can also run the task using the AWS CLI:

aws ecs run-task --cluster <your-cluster-name> --task-definition <your-task-name> --count 1 --network-configuration '{"awsvpcConfiguration": {"subnets": ["<your-subnet-name>"], "securityGroups": ["<your-sg-name>"]}}'

The result will look like the following screenshot.

You can also check the details of the training job through the Amazon CloudWatch log group.

After you train your models, you can store them in Amazon Simple Storage Service (Amazon S3).

Clean up

To avoid additional expenses, set the Auto Scaling group's Minimum capacity and Desired capacity to zero to shut down the Trainium instances. For a complete cleanup, delete the CloudFormation stack to remove all resources created by this template.
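If you prefer to script the scale-down, the following minimal Boto3 sketch sets both capacities to zero (the Auto Scaling group name is a placeholder; use the group created by your CloudFormation stack):

import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder Auto Scaling group name: use the one created by your CloudFormation stack
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="trainium-ecs-asg",
    MinSize=0,
    DesiredCapacity=0,
)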

Conclusion

In this post, we showed how to use Amazon ECS to deploy your ML training jobs. We created a CloudFormation template to create the ECS cluster of Trn1 instances, built a custom Docker image, pushed it to Amazon ECR, and ran the ML training job on the ECS cluster using a Trainium instance.

For more information about Neuron and what you can do with Trainium, check out the following resources:


About the Authors

Guilherme Ricci is a Senior Startup Solutions Architect on Amazon Web Services, helping startups modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working with a team of AI/ML specialists.

Evandro Franco is an AI/ML Specialist Solutions Architect working on Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 15 years working with technology, from software development, infrastructure, serverless, to machine learning.

Matthew McClean leads the Annapurna ML Solution Architecture team that helps customers adopt AWS Trainium and AWS Inferentia products. He is passionate about generative AI and has been helping customers adopt AWS technologies for the last 10 years.

Read More

Host ML models on Amazon SageMaker using Triton: CV model with PyTorch backend

Host ML models on Amazon SageMaker using Triton: CV model with PyTorch backend

PyTorch is a machine learning (ML) framework based on the Torch library, used for applications such as computer vision and natural language processing. One of the primary reasons that customers are choosing a PyTorch framework is its simplicity and the fact that it’s designed and assembled to work with Python. PyTorch supports dynamic computational graphs, enabling network behavior to be changed at runtime. This provides a major flexibility advantage over the majority of ML frameworks, which require neural networks to be defined as static objects before runtime. In this post, we dive deep to see how Amazon SageMaker can serve these PyTorch models using NVIDIA Triton Inference Server.

SageMaker provides several options for customers who are looking to host their ML models. One of the key available features is SageMaker real-time inference endpoints. Real-time workloads can have varying levels of performance expectations and service level agreements (SLAs), which materialize as latency and throughput requirements.

With real-time endpoints, different deployment options adjust to different tiers of expected performance. For example, your business may rely on a model that must meet very strict SLAs for latency and throughput with predictable performance. In this case, SageMaker provides single-model endpoints (SMEs), allowing you to deploy a single ML model to a logical endpoint, which will use the underlying server’s networking and compute capacity. For other use cases where you need a better balance between performance and cost, multi-model endpoints (MMEs) allow you to deploy multiple models behind a logical endpoint and invoke them individually, while abstracting their loading and unloading from memory.

SageMaker provides support for single-model and multi-model endpoints through NVIDIA Triton Inference Server. Triton supports various backends as engines to power the running and serving of different framework models, like PyTorch, TensorFlow, TensorRT, or ONNX Runtime. For any Triton deployment, it’s crucial to understand how the backend behavior impacts your workload and what to expect from its unique configuration parameters. In this post, we help you understand the Triton PyTorch backend in depth.

Triton with PyTorch backend

The PyTorch backend is designed to run TorchScript models using the PyTorch C++ API. TorchScript is a static subset of Python that captures the structure of a PyTorch model. To use this backend, you need to convert your PyTorch model to TorchScript using Just-In-Time (JIT) compilation. JIT compiles the TorchScript code into an optimized intermediate representation, making it suitable for deployment in non-Python environments. Triton uses TorchScript for improved performance and flexibility.
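As a brief illustration (the model choice and file name here are arbitrary and not the exact export script used later in this post), a PyTorch model can be traced into TorchScript and saved for the PyTorch backend like this:

import torch
import torchvision.models as models

# Load a pre-trained model and switch to evaluation mode
model = models.resnet50(pretrained=True).eval()

# Trace the model with a representative input to produce a TorchScript module
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Save the serialized TorchScript model; the Triton PyTorch backend loads it as model.pt
traced_model.save("model.pt")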

Each model deployed with Triton requires a configuration file (config.pbtxt) that specifies model metadata, such as input and output tensors, model name, and platform. The configuration file is essential for Triton to understand how to load, run, and optimize the model. For PyTorch models, the platform field in the configuration file should be set to pytorch_libtorch. You can load Triton PyTorch models on GPU and CPU (see Multiple Model Instances), and model weights will be kept in GPU memory/VRAM or in host memory/RAM, respectively.

Note that only the model’s forward method is called when using the PyTorch backend; if you rely on more complex logic to prepare, iterate over, and transform your raw model’s predictions to respond to a request, you should wrap it in a custom model forward. Alternatively, you can use ensemble models or business logic scripting.
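For example, a hypothetical sketch of folding post-processing into the forward method (this wrapper is illustrative and not part of the example notebook):

import torch
import torch.nn as nn

class ModelWithPostprocessing(nn.Module):
    # Wraps a base model so that post-processing runs inside forward(),
    # which is the only method the Triton PyTorch backend calls.
    def __init__(self, base_model: nn.Module):
        super().__init__()
        self.base_model = base_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.base_model(x)
        # Example post-processing: convert logits to class probabilities
        return torch.softmax(logits, dim=1)

# The wrapped model is then traced and saved like any other TorchScript model:
# traced = torch.jit.trace(ModelWithPostprocessing(base_model).eval(), example_input)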

You can optimize PyTorch model performance on Triton by using a combination of available configuration-based features. Some of these are backend-agnostic, like dynamic batching and concurrent model runs (see Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker to learn more), and some are PyTorch-specific. Let’s take a deeper look into these configuration parameters and how you should use them:

  • DISABLE_OPTIMIZED_EXECUTION – Use this parameter to control optimized execution of TorchScript models. Optimized execution slows down the initial call to a loaded TorchScript model, and in some cases might not benefit or might even hinder model performance. Set to true if your tolerance to scaling or cold start latency is very low.
  • INFERENCE_MODE – Use this parameter to toggle PyTorch inference mode. In inference mode, computations aren’t recorded in the backward graph, and it allows PyTorch to speed up your model. This better runtime comes with a drawback: you won’t be able to use tensors created in inference mode in computations to be recorded by autograd after exiting inference mode. Set to true if the preceding conditions apply to your use case (mostly true for inference workloads).
  • ENABLE_NVFUSER – Use this parameter to enable NvFuser (CUDA Graph Fuser) optimization for TorchScript models. If not specified, the default PyTorch fuser is used.
  • ENABLE_WEIGHT_SHARING – Use this parameter to allow model instances (copies) on the same device to share weights. This can reduce memory usage of model loading and inference. It should not be used with models that maintain state.
  • ENABLE_CACHE_CLEANING – Use this parameter to enable CUDA cache cleaning after each model run (only has an effect if the model is on GPU). Setting this flag to true will negatively impact the performance due to additional CUDA cache cleaning operations after each model run. You should only use this flag if you serve multiple models with Triton and encounter CUDA out of memory issues during model runs.
  • ENABLE_JIT_EXECUTOR, ENABLE_JIT_PROFILING, and ENABLE_TENSOR_FUSER – Use these parameters to disable certain PyTorch optimizations that can sometimes cause latency regressions in models with complex run modes and dynamic shapes.
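These flags are set as string values in the parameters section of the model’s config.pbtxt. A minimal sketch, assuming you want inference mode enabled and weight sharing turned on (adjust the keys and values to your workload):

parameters: {
  key: "INFERENCE_MODE"
  value: { string_value: "true" }
}
parameters: {
  key: "ENABLE_WEIGHT_SHARING"
  value: { string_value: "true" }
}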

Triton Inference on SageMaker

SageMaker allows you to deploy both SMEs and MMEs with NVIDIA Triton Inference Server. The following figure shows Triton’s high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server via HTTPS and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests and the outputs are then returned.

When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for SMEs, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For MMEs, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple MMEs, or spend added time fine-tuning their auto scaling group policy to obtain the best cost and performance balance. See Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints for more information on auto scaling policy considerations for MMEs. (Note that although the MMS configurations don’t apply in this case, the policy considerations still do.)
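As an illustration, the following Boto3 sketch attaches a target tracking policy on SageMakerVariantInvocationsPerInstance to an endpoint variant (the endpoint name, policy name, capacity limits, and target value are placeholders to tune for your workload):

import boto3

autoscaling = boto3.client("application-autoscaling")
endpoint_name = "triton-mme-endpoint"  # placeholder endpoint name
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=4,
)

# Scale out when average invocations per instance exceed the target value
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)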

For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.

Solution overview

In the following sections, we walk through an example available on GitHub to understand how we can use Triton and SageMaker MMEs on GPU to deploy a ResNet model for image classification. For demonstration purposes, we use a pre-trained ResNet50 model that can classify images into 1,000 categories.

Prerequisites

You first need an AWS account and an AWS Identity and Access Management (IAM) administrator user. For instructions on how to set up an AWS account, see How do I create and activate a new AWS account. For instructions on how to secure your account with an IAM administrator user, see Creating your first IAM admin user and user group.

SageMaker needs access to the Amazon Simple Storage Service (Amazon S3) bucket that stores your model. Create an IAM role with a policy that gives SageMaker read access to your bucket.

If you plan to run the notebook in Amazon SageMaker Studio, refer to Get Started for setup instructions.

Set up your environment

To set up your environment, complete the following steps:

  1. Launch a SageMaker notebook instance with a g5.xlarge instance.

You can also run this example on a Studio notebook instance.

  2. Select Clone a public git repository to this notebook instance only and specify the GitHub repository URL.
  3. When JupyterLab is ready, launch the resnet_pytorch_python_backend_MME.ipynb notebook with the conda_python3 conda kernel and run through this notebook step by step.

Install the dependencies and import the required library

Use the following code to install dependencies and import the required library:

!pip install nvidia-pyindex --quiet
!pip install tritonclient[http] --quiet

# imports
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
from PIL import Image
import tritonclient.http as httpclient
# variables
s3_client = boto3.client("s3")

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

Prepare the model artifacts

The generate_model_pytorch.sh file in the workspace directory contains scripts to load and save a PyTorch model. First, we load a pre-trained ResNet50 model using the torchvision models package. We save the model as a model.pt file in TorchScript optimized and serialized format. TorchScript needs example inputs to do a model forward pass, so we pass one instance of an RGB image with three color channels of dimension 224X224. The script for exporting this model can be found on the GitHub repo.

!docker run --gpus=all --rm -it \
    -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
    /bin/bash generate_model_pytorch.sh

Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model, as shown in the following example. The value 1 represents version 1 of our Pytorch model. Each model is run by its specific backend, so each version subdirectory must contain the model artifact required by that backend. Because we’re using a PyTorch backend, a model.pt file is required within the version directory. For more details on naming conventions for model files, refer to Model Files.
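In this walkthrough, the resulting repository layout looks like the following (assuming the model directory is named resnet, which matches the config and packaging commands used later in this example):

triton-serve-pt/
└── resnet/
    ├── config.pbtxt
    └── 1/
        └── model.pt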

Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Our config.pbtxt file specifies the backend as pytorch_libtorch, and defines input and output tensor shapes and data type information. We also specify that we want to run this model on the GPU via the instance_group parameter. See the following code:

name: "resnet"
platform: "pytorch_libtorch"

max_batch_size: 128
input {
  name: "INPUT__0"
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP32
  dims: 1000
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

For the instance_group config, when only a count is specified, Triton loads that many instances of the model on each available GPU device. If you want to control which GPU devices your models are loaded on, you can specify the GPU device IDs explicitly. Note that for MMEs, explicitly specifying GPU device IDs might lead to poor memory management, because multiple models may explicitly try to allocate the same GPU device.

We then tar.gz the model artifacts, which is the format expected by SageMaker:

!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz resnet
model_uri_pt = sagemaker_session.upload_data(path="resnet_pt_v0.tar.gz", key_prefix=prefix)

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint.

Deploy the model

We now deploy the Triton model to a SageMaker MME. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. We set the container with an image that supports deploying MMEs with GPU (refer to the MME container images for more details). Setting the Mode parameter to MultiModel indicates that SageMaker should create the endpoint with MME container specifications; this is the key differentiator.

container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

Using the SageMaker Boto3 client, create the model using the create_model API. We pass the container definition to the create_model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print("Model Arn: " + create_model_response["ModelArn"])

Create MME configurations using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (for this post, we use a g4dn.4xlarge instance). We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Invoke the model and run predictions

The following method transforms a sample image into the payload that we send to the Triton server for inference.

The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference:

s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", "shiba_inu_dog.jpg"
)


def get_sample_image():
    image_path = "./shiba_inu_dog.jpg"
    img = Image.open(image_path).convert("RGB")
    img = img.resize((224, 224))
    img = (np.array(img).astype(np.float32) / 255) - np.array(
        [0.485, 0.456, 0.406], dtype=np.float32
    ).reshape(1, 1, 3)
    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
    img = np.transpose(img, (2, 0, 1))
    return img.tolist()


def _get_sample_image_binary(input_name, output_name):
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], "FP32"))
    input_data = np.array(get_sample_image(), dtype=np.float32)
    input_data = np.expand_dims(input_data, axis=0)
    inputs[0].set_data_from_numpy(input_data, binary_data=True)
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length


def get_sample_image_binary_pt():
    return _get_sample_image_binary("INPUT__0", "OUTPUT__0")

After the endpoint is successfully created, we can send inference requests to the MME using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type:

request_body, header_length = get_sample_image_binary_pt()
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="resnet_pt_v0.tar.gz",
)
# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
output0_data = result.as_numpy("OUTPUT__0")
print(output0_data)

Additionally, SageMaker MMEs provide instance-level metrics to monitor using Amazon CloudWatch:

  • LoadedModelCount – Number of models loaded in the containers
  • GPUUtilization – Percentage of GPU units that are used by the containers
  • GPUMemoryUtilization – Percentage of GPU memory used by the containers
  • DiskUtilization – Percentage of disk space used by the containers

SageMaker MMEs also provide model loading metrics such as the following:

  • ModelLoadingWaitTime – Time interval for the model to be downloaded or loaded
  • ModelUnloadingTime – Time interval to unload the model from the container
  • ModelDownloadingTime – Time to download the model from Amazon S3
  • ModelCacheHit – Number of invocations to the model that are already loaded onto the container to get model invocation-level insights

For more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch.
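For example, you can retrieve one of these metrics programmatically. The following sketch reuses the endpoint_name from earlier and assumes the instance-level MME metrics are published to the /aws/sagemaker/Endpoints namespace with EndpointName and VariantName dimensions (verify the namespace and dimensions for your endpoint in the CloudWatch console):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Assumed namespace and dimensions for instance-level MME metrics
response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(datapoint["Timestamp"], round(datapoint["Average"], 2))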

Clean up

In order to avoid incurring charges, delete the model endpoint:

sm_client.delete_model(ModelName=sm_model_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_endpoint(EndpointName=endpoint_name)

Best practices

When using the PyTorch backend, most optimization decisions will depend on your specific workload latency or throughput requirements and what model architecture you are using. In general, in order to do a data-driven comparison of configuration parameters to improve performance, you should use Triton’s Performance Analyzer. With this tool, you should adopt the following decision logic:

  • Experiment and check if your model architecture can be transformed into a TensorRT engine and deployed with the Triton TensorRT backend. This is the preferable way to deploy models with NVIDIA GPUs because both the TensorRT model format and runtime make the best use of the underlying hardware capabilities.
  • Always set INFERENCE_MODE to true for pure inference workloads where no autograd calculations are required.
  • If deploying SMEs, maximize hardware utilization by properly defining instance group configuration according to the available GPU memory or RAM (use the Performance Analyzer tool to find the right size).

For more MME-specific best practices, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.

Conclusion

In this post, we dove deep into the PyTorch backend supported by Triton Inference Server, which provides acceleration for both CPU- and GPU-based models. We went through some of the configuration parameters you can adjust to optimize model performance. Finally, we provided a walkthrough of an example notebook that demonstrates deploying a model to a SageMaker multi-model endpoint. Be sure to try it out!


About the Authors

Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.

Read More