May 2023 – Vedere AI

Translate documents in real time with Amazon Translate

A critical component of business success is the ability to connect with customers. Businesses today want to connect with their customers by offering their content across multiple languages in real time. For most customers, the content creation process is disconnected from the localization effort of translating content into multiple target languages. These disconnected processes delay the business ability to simultaneously publish content in multiple languages, inhibiting their outreach efforts which negatively impacts time to market and revenue.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Now, Amazon Translate offers real-time document translation to seamlessly integrate and accelerate content creation and localization. You can submit a document from the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK and receive the translated document in real time while maintaining the format of the original document. This feature eliminates the wait for documents to be translated in asynchronous batch mode.

Real-time document translation currently supports plain text and HTML documents. You can use other Amazon Translate features such as custom terminology, profanity masking, and formality as part of the real-time document translation.

In this post, we will show you how to use this new feature.

Solution overview

This post walks you through the steps required to use real-time document translation with the console, AWS CLI, and Amazon Translate SDK. As an example, we will translate this sample text file from English to French.

Use Amazon Translate via the console

Follow these steps to try out real-time document translation on the console:

On the Amazon Translate console, choose Real-time translation in the navigation pane.
Choose the Document tab.
Specify the language of the source file as English.
Specify the language of the target file as French.

Note: Source or Target language should be English for real-time document translation.

Select Choose file and upload the file you want to translate.
Specify the document type.

Text and HTML formats are supported at the time of this writing.

Under Additional settings, you can use other Amazon Translate features in conjunction with real-time document translation.

For more information about Amazon Translate features, refer to the following resources:

- Custom terminology – Customize Amazon Translate output to meet your domain and organization specific vocabulary
- Formality – Select Formal or Informal for the translated text
- Profanity masking – Apply profanity masking in Amazon Translate
Choose Translate and Download.

The translated file is automatically saved to your browser’s downloaded folder, usually to Downloads. The target language code will be prefixed to the translated file’s name. For example, if your source file name is lang.txt and your target language is French (fr), then the translated file will be named fr.lang.txt.

Use Amazon Translate with the AWS CLI

You can translate the contents of a file using the following AWS CLI command. In this example, the contents of source-lang.txt will be translated into target-lang.txt.

aws translate translate-document --source-language-code en --target-language es 
--document-content fileb://source-lang.txt 
--document ContentType=text/plain 
--query "TranslatedDocument.Content" 
--output text | base64 
--decode > target-lang.txt

Use the Amazon Translate SDK (Python Boto3)

You can use the following Python code to invoke Amazon Translate SDK API to translate text or HTML documents synchronously:

import boto3
import argparse

# Initialize parser
parser = argparse.ArgumentParser()
parser.add_argument("SourceLanguageCode")
parser.add_argument("TargetLanguageCode")
parser.add_argument("SourceFile")
args = parser.parse_args()


translate = boto3.client('translate’)

localFile = args.SourceFile
file = open(localFile, "rb")
data = file.read()
file.close()


result = translate.translate_document(
    Document={
            "Content": data,
            "ContentType": "text/html"
        },
    SourceLanguageCode=args.SourceLanguageCode,
    TargetLanguageCode=args.TargetLanguageCode
)
if "TranslatedDocument" in result:
    fileName = localFile.split("/")[-1]
    tmpfile = f"{args.TargetLanguageCode}-{fileName}"
    with open(tmpfile,  'w', encoding='utf-8') as f:
     
    f.write(str(result["TranslatedDocument"]["Content"]))

    print("Translated document ", tmpfile)

This program accepts three arguments: source language, target language, and file path. Use the following command to invoke this program:

python syncDocumentTranslation.py en es source-lang.txt

Conclusion

The real-time document translation feature in Amazon Translate can expedite time to market by enabling easy integration with content creation and localization. Real-time document translation improves content creation and the localization process.

For more information about Amazon Translate, visit Amazon Translate resources to find video resources and blog posts, and refer to AWS Translate FAQs.

About the Authors

Sathya Balakrishnan is a Senior Consultant in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.

RG Thiyagarajan is a Senior Consultant in Professional Services at AWS, specializing in application migration, security, and resiliency with US federal financial clients.

Sid Padgaonkar is the Senior Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him playing squash and exploring the food scene in the Pacific Northwest.

Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances

Running machine learning (ML) workloads with containers is becoming a common practice. Containers can fully encapsulate not just your training code, but the entire dependency stack down to the hardware libraries and drivers. What you get is an ML development environment that is consistent and portable. With containers, scaling on a cluster becomes much easier.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are purpose built for high-performance deep learning training. Trn1 instances deliver up to 50% savings on training costs over other comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. Also, the AWS Neuron SDK was released to improve this acceleration, giving developers tools to interact with this technology such as to compile, runtime, and profile to achieve high-performance and cost-effective model trainings.

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications. Simply describe your application and the resources required, and Amazon ECS will launch, monitor, and scale your application across flexible compute options with automatic integrations to other supporting AWS services that your application needs.

In this post, we show you how to run your ML training jobs in a container using Amazon ECS to deploy, manage, and scale your ML workload.

Solution overview

We walk you through the following high-level steps:

Provision an ECS cluster of Trn1 instances with AWS CloudFormation.
Build a custom container image with the Neuron SDK and push it to Amazon Elastic Container Registry (Amazon ECR).
Create a task definition to define an ML training job to be run by Amazon ECS.
Run the ML task on Amazon ECS.

Prerequisites

To follow along, familiarity with core AWS services such as Amazon EC2 and Amazon ECS is implied.

Provision an ECS cluster of Trn1 instances

To get started, launch the provided CloudFormation template, which will provision required resources such as a VPC, ECS cluster, and EC2 Trainium instance.

We use the Neuron SDK to run deep learning workloads on AWS Inferentia and Trainium-based instances. It supports you in your end-to-end ML development lifecycle to create new models, optimize them, then deploy them for production. To train your model with Trainium, you need to install the Neuron SDK on the EC2 instances where the ECS tasks will run to map the NeuronDevice associated with the hardware, as well as the Docker image that will be pushed to Amazon ECR to access the commands to train your model.

Standard versions of Amazon Linux 2 or Ubuntu 20 don’t come with AWS Neuron drivers installed. Therefore, we have two different options.

The first option is to use a Deep Learning Amazon Machine Image (DLAMI) that has the Neuron SDK already installed. A sample is available on the GitHub repo. You can choose a DLAMI based on the opereating system. Then run the following command to get the AMI ID:

aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

The output will be as follows:

ami-06c40dd4f80434809

This AMI ID can change over time, so make sure to use the command to get the right AMI ID.

Now you can change this AMI ID in the CloudFormation script and use the ready-to-use Neuron SDK. To do this, look for EcsAmiId in Parameters:

"EcsAmiId": { 
    "Type": "String", 
    "Description": "AMI ID", 
    "Default": "ami-09def9404c46ac27c" 
}

The second option is to create an instance filling the userdata field during stack creation. You don’t need to install it because CloudFormation will set this up. For more information, refer to the Neuron Setup Guide.

For this post, we use option 2, in case you need to use a custom image. Complete the following steps:

Launch the provided CloudFormation template.
For KeyName, enter a name of your desired key pair, and it will preload the parameters. For this post, we use trainium-key.
Enter a name for your stack.
If you’re running in the us-east-1 Region, you can keep the values for ALBName and AZIds at their default.

To check what Availability Zone in the Region has Trn1 available, run the following command:

aws ec2 describe-instance-type-offerings --region us-east1 --location-type availability-zone --filter Name=instance-type,Values=trn1.2xlarge

Choose Next and finish creating the stack.

When the stack is complete, you can move to the next step.

Prepare and push an ECR image with the Neuron SDK

Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. We use Amazon ECR to store a custom Docker image containing our scripts and Neuron packages needed to train a model with ECS jobs running on Trn1 instances. You can create an ECR repository using the AWS Command Line Interface (AWS CLI) or AWS Management Console. For this post, we use the console. Complete the following steps:

On the Amazon ECR console, create a new repository.
For Visibility settings¸ select Private.
For Repository name, enter a name.
Choose Create repository.

Now that you have a repository, let’s build and push an image, which could be built locally (into your laptop) or in a AWS Cloud9 environment. We are training a multi-layer perceptron (MLP) model. For the original code, refer to Multi-Layer Perceptron Training Tutorial.

Copy the train.py and model.py files into a project.

It’s already compatible with Neuron, so you don’t need to change any code.

5. Create a Dockerfile that has the commands to install the Neuron SDK and training scripts:

FROM amazonlinux:2

RUN echo $'[neuron] n
name=Neuron YUM Repository n
baseurl=https://yum.repos.neuron.amazonaws.com n
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install aws-neuronx-collectives-2.* -y
RUN yum install aws-neuronx-runtime-lib-2.* -y
RUN yum install aws-neuronx-tools-2.* -y
RUN yum install -y tar gzip pip
RUN yum install -y python3 python3-pip
RUN yum install -y python3.7-venv gcc-c++
RUN python3.7 -m venv aws_neuron_venv_pytorch

# Activate Python venv
ENV PATH="/aws_neuron_venv_pytorch/bin:$PATH"
RUN python -m pip install -U pip
RUN python -m pip install wget
RUN python -m pip install awscli

RUN python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
RUN python -m pip install torchvision tqdm torch-neuronx neuronx-cc==2.* pillow
RUN mkdir -p /opt/ml/mnist_mlp
COPY model.py /opt/ml/mnist_mlp/model.py
COPY train.py /opt/ml/mnist_mlp/train.py
RUN chmod +x /opt/ml/mnist_mlp/train.py
CMD ["python3", "/opt/ml/mnist_mlp/train.py"]

To create your own Dockerfile using Neuron, refer to Develop on AWS ML accelerator instance, where you can find guides for other OS and ML frameworks.

6. Build an image and then push it to Amazon ECR using the following code (provide your Region, account ID, and ECR repository):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {your-account-id}.dkr.ecr.{your-region}.amazonaws.com

docker build -t mlp_trainium .

docker tag mlp_trainium:latest {your-account-id}.dkr.ecr.us-east-1.amazonaws.com/mlp_trainium:latest

docker push {your-account-id}.dkr.ecr.{your-region}.amazonaws.com/{your-ecr-repo-name}:latest

After this, your image version should be visible in the ECR repository that you created.

Run the ML training job as an ECS task

To run the ML training task on Amazon ECS, you first need to create a task definition. A task definition is required to run Docker containers in Amazon ECS.

On the Amazon ECS console, choose Task definitions in the navigation pane.
On the Create new task definition menu, choose Create new task definition with JSON.

You can use the following task definition template as a baseline. Note that in the image field, you can use the one generated in the previous step. Make sure it includes your account ID and ECR repository name.

To make sure that Neuron is installed, you can check if the volume /dev/neuron0 is mapped in the devices block. This maps to a single NeuronDevice running on the trn1.2xlarge instance with two cores.

Create your task definition using the following template:

{
    "family": "mlp_trainium",
    "containerDefinitions": [
        {
            "name": "mlp_trainium",
            "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
            "cpu": 0,
            "memoryReservation": 1000,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "IPC_LOCK"
                    ]
                },
                "devices": [
                    {
                        "hostPath": "/dev/neuron0",
                        "containerPath": "/dev/neuron0",
                        "permissions": [
                            "read",
                            "write"
                        ]
                    }
                ]
            },
            ,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/task-logs",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:ecs.os-type == linux"
        },
        {
            "type": "memberOf",
            "expression": "attribute:ecs.instance-type == trn1.2xlarge"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "1024",
    "memory": "3072"
}

You can also complete this step on the AWS CLI using the following task definition or with the following command:

aws ecs register-task-definition 
--family mlp-trainium 
--container-definitions '[{    
    "name": "my-container-1",    
    "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
    "cpu": 0,
    "memoryReservation": 1000,
    "portMappings": [],
    "essential": true,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-create-group": "true",
            "awslogs-group": "/ecs/task-logs",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
        }
    },
    "linuxParameters": {
        "capabilities": {
            "add": [
                "IPC_LOCK"
            ]
        },
        "devices": [{
            "hostPath": "/dev/neuron0",
            "containerPath": "/dev/neuron0",
            "permissions": ["read", "write"]
        }]
    }
}]' 
--requires-compatibilities EC2
--cpu "8192" 
--memory "16384" 
--placement-constraints '[{
    "type": "memberOf",
    "expression": "attribute:ecs.instance-type == trn1.2xlarge"
}, {
    "type": "memberOf",
    "expression": "attribute:ecs.os-type == linux"
}]'

Run the task on Amazon ECS

After we have created the ECS cluster, pushed the image to Amazon ECR, and created the task definition, we run the task definition to train a model on Amazon ECS.

On the Amazon ECS console, choose Clusters in the navigation pane.
Open your cluster.
On the Tasks tab, choose Run new task.

For Launch type, choose EC2.

For Application type, select Task.
For Family, choose the task definition you created.

In the Networking section, specify the VPC created by the CloudFormation stack, subnet, and security group.

Choose Create.

You can monitor your task on the Amazon ECS console.

You can also run the task using the AWS CLI:

aws ecs run-task --cluster <your-cluster-name> --task-definition <your-task-name> --count 1 --network-configuration '{"awsvpcConfiguration": {"subnets": ["<your-subnet-name> "], "securityGroups": ["<your-sg-name> "] }}'

The result will look like the following screenshot.

You can also check the details of the training job through the Amazon CloudWatch log group.

After you train your models, you can store them in Amazon Simple Storage Service (Amazon S3).

Clean up

To avoid additional expenses, you can change the Auto Scaling group to Minimum capacity and Desired capacity to zero, to shut down the Trainium instances. To do a complete cleanup, delete the CloudFormation stack to remove all resources created by this template.

Conclusion

In this post, we showed how to use Amazon ECS to deploy your ML training jobs. We created a CloudFormation template to create the ECS cluster of Trn1 instances, built a custom Docker image, pushed it to Amazon ECR, and ran the ML training job on the ECS cluster using a Trainium instance.

For more information about Neuron and what you can do with Trainium, check out the following resources:

About the Authors

Guilherme Ricci is a Senior Startup Solutions Architect on Amazon Web Services, helping startups modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working with a team of AI/ML specialists.

Evandro Franco is an AI/ML Specialist Solutions Architect working on Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 15 years working with technology, from software development, infrastructure, serverless, to machine learning.

Matthew McClean leads the Annapurna ML Solution Architecture team that helps customers adopt AWS Trainium and AWS Inferentia products. He is passionate about generative AI and has been helping customers adopt AWS technologies for the last 10 years.

Host ML models on Amazon SageMaker using Triton: CV model with PyTorch backend

PyTorch is a machine learning (ML) framework based on the Torch library, used for applications such as computer vision and natural language processing. One of the primary reasons that customers are choosing a PyTorch framework is its simplicity and the fact that it’s designed and assembled to work with Python. PyTorch supports dynamic computational graphs, enabling network behavior to be changed at runtime. This provides a major flexibility advantage over the majority of ML frameworks, which require neural networks to be defined as static objects before runtime. In this post, we dive deep to see how Amazon SageMaker can serve these PyTorch models using NVIDIA Triton Inference Server.

SageMaker provides several options for customers who are looking to host their ML models. One of the key available features is SageMaker real-time inference endpoints. Real-time workloads can have varying levels of performance expectations and service level agreements (SLAs), which materialize as latency and throughput requirements.

With real-time endpoints, different deployment options adjust to different tiers of expected performance. For example, your business may rely on a model that must meet very strict SLAs for latency and throughput with predictable performance. In this case, SageMaker provides single-model endpoints (SMEs), allowing you to deploy a single ML model to a logical endpoint, which will use the underlying server’s networking and compute capacity. For other use cases where you need a better balance between performance and cost, multi-model endpoints (MMEs) allows you to deploy multiple models behind a logical endpoint and invoke them individually, while abstracting their loading and unloading from memory.

SageMaker provides support for single-model and multi-model endpoints through NVIDIA Triton Inference Server. Triton supports various backends as engines to power the running and serving of different framework models, like PyTorch, TensorFlow, TensorRT, or ONNX Runtime. For any Triton deployment, it’s crucial to understand how the backend behavior impacts your workload and what to expect from its unique configuration parameters. In this post, we help you understand the Triton PyTorch backend in depth.

Triton with PyTorch backend

The PyTorch backend is designed to run TorchScript models using the PyTorch C++ API. TorchScript is a static subset of Python that captures the structure of a PyTorch model. To use this backend, you need to convert your PyTorch model to TorchScript using Just-In-Time (JIT) compilation. JIT compiles the TorchScript code into an optimized intermediate representation, making it suitable for deployment in non-Python environments. Triton uses TorchScript for improved performance and flexibility.

Each model deployed with Triton requires a configuration file (config.pbtxt) that specifies model metadata, such as input and output tensors, model name, and platform. The configuration file is essential for Triton to understand how to load, run, and optimize the model. For PyTorch models, the platform field in the configuration file should be set to pytorch_libtorch. You can load Triton PyTorch models on GPU and CPU (see Multiple Model Instances) and model weights will be kept either in GPU memory/VRAM or in host memory/RAM correspondingly.

Note that only the model’s forward method will be called when using the Pytorch backend; if you rely on more complex logic to prepare, iterate, and transform your raw model’s predictions to respond to a request, you should wrap it as a custom model forward. Alternatively, you can use ensemble models or business logic scripting.

You can optimize PyTorch model performance on Triton by using a combination of available configuration-based features. Some of these are backend-agnostic, like dynamic batching and concurrent model runs (see Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker to learn more), and some are PyTorch-specific. Let’s take a deeper look into these configuration parameters and how you should use them:

DISABLE_OPTIMIZED_EXECUTION – Use this parameter to optimize running TorchScript models. This parameter slows down the initial call to a loaded TorchScript model, and might not benefit or even hinder model performance in some cases. Set to false if your tolerance to scaling or cold start latency is very low.
INFERENCE_MODE – Use this parameter to toggle PyTorch inference mode. In inference mode, computations aren’t recorded in the backward graph, and it allows PyTorch to speed up your model. This better runtime comes with a drawback: you won’t be able to use tensors created in inference mode in computations to be recorded by autograd after exiting inference mode. Set to true if the preceding conditions apply to your use case (mostly true for inference workloads).
ENABLE_NVFUSER – Use this parameter to enable NvFuser (CUDA Graph Fuser) optimization for TorchScript models. If not specified, the default PyTorch fuser is used.
ENABLE_WEIGHT_SHARING – Use this parameter to allow model instances (copies) on the same device to share weights. This can reduce memory usage of model loading and inference. It should not be used with models that maintain state.
ENABLE_CACHE_CLEANING – Use this parameter to enable CUDA cache cleaning after each model run (only has an effect if the model is on GPU). Setting this flag to true will negatively impact the performance due to additional CUDA cache cleaning operations after each model run. You should only use this flag if you serve multiple models with Triton and encounter CUDA out of memory issues during model runs.
ENABLE_JIT_EXECUTOR, ENABLE_JIT_PROFILING, and ENABLE_TENSOR_FUSER – Use these parameters to disable certain PyTorch optimizations that can sometimes cause latency regressions in models with complex run modes and dynamic shapes.

Triton Inference on SageMaker

SageMaker allows you to deploy both SMEs and MMEs with NVIDIA Triton Inference Server. The following figure shows Triton’s high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server via HTTPS and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests and the outputs are then returned.

When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for SMEs, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For MMEs, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple MMEs, or spend added time fine-tuning their auto scaling group policy to obtain the best cost and performance balance. See Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints for more information on auto scaling policy considerations for MMEs. (Note that although the MMS configurations don’t apply in this case, the policy considerations still do.)

For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.

Solution overview

In the following sections, we walk through an example available on GitHub to understand how we can use Triton and SageMaker MMEs on GPU to deploy a ResNet model for image classification. For demonstration purposes, we use a pre-trained ResNet50 model that can classify images into 1,000 categories.

Prerequisites

You first need an AWS account and an AWS Identity and Access Management (IAM) administrator user. For instructions on how to set up an AWS account, see How do I create and activate a new AWS account. For instructions on how to secure your account with an IAM administrator user, see Creating your first IAM admin user and user group.

SageMaker needs access to the Amazon Simple Storage Service (Amazon S3) bucket that stores your model. Create an IAM role with a policy that gives SageMaker read access to your bucket.

If you plan to run the notebook in Amazon SageMaker Studio, refer to Get Started for setup instructions.

Set up your environment

To set up your environment, complete the following steps:

Launch a SageMaker notebook instance with a g5.xlarge instance.

You can also run this example on a Studio notebook instance.

Select Clone a public git repository to this notebook instance only and specify the GitHub repository URL.
When JupyterLab is ready, launch the resnet_pytorch_python_backend_MME.ipynb notebook with the conda_python3 conda kernel and run through this notebook step by step.

Install the dependencies and import the required library

Use the following code to install dependencies and import the required library:

!pip install nvidia-pyindex --quiet
!pip install tritonclient[http] --quiet

# imports
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
from PIL import Image
import tritonclient.http as httpclient
# variables
s3_client = boto3.client("s3")

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

Prepare the model artifacts

The generate_model_pytorch.sh file in the workspace directory contains scripts to load and save a PyTorch model. First, we load a pre-trained ResNet50 model using the torchvision models package. We save the model as a model.pt file in TorchScript optimized and serialized format. TorchScript needs example inputs to do a model forward pass, so we pass one instance of an RGB image with three color channels of dimension 224X224. The script for exporting this model can be found on the GitHub repo.

!docker run --gpus=all --rm -it 
-v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 
            /bin/bash generate_model_pytorch.sh

Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model, as shown in the following example. The value 1 represents version 1 of our Pytorch model. Each model is run by its specific backend, so each version subdirectory must contain the model artifact required by that backend. Because we’re using a PyTorch backend, a model.pt file is required within the version directory. For more details on naming conventions for model files, refer to Model Files.

Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Out config.pbtxt file specifies the backend as pytorch_libtorch, and defines input and output tensor shapes and data type information. We also specify that we want to run this model on the GPU via the instance_group parameter. See the following code:

name: "resnet"
platform: "pytorch_libtorch"

max_batch_size: 128
input {
  name: "INPUT__0"
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP32
  dims: 1000
}

instance_group [
{
count: 1
kind: KIND_GPU
}

For the instance_group config, when simply a count is specified, Triton loads x counts of the model on each available GPU device. If you want to control which GPU devices to load your models on, you can do so explicitly by specifying the GPU device IDs. Note that for MMEs, explicitly specifying such GPU device IDs might lead to poor memory management because multiple models may explicitly try to allocate the same GPU device.

We then tar.gz the model artifacts, which is the format expected by SageMaker:

!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz 
resnetmodel_uri_pt = sagemaker_session.upload_data(path="resnet_pt_v0.tar.gz", key_prefix=prefix)

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint.

Deploy the model

We now deploy the Triton model to a SageMaker MME. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate SageMaker will create the endpoint with MME container specifications. We set the container with an image that supports deploying MMEs with GPU (refer to the MME container images for more details). Note that the parameter mode is set to MultiModel. This is the key differentiator.

container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

Using the SageMaker Boto3 client, create the model using the create_model API. We pass the container definition to the create_model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print("Model Arn: " + create_model_response["ModelArn"])

Create MME configurations using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (for this post, we use a g4dn.4xlarge instance). We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
EndpointConfigName=endpoint_config_name,
ProductionVariants=[
{
"InstanceType": "ml.g4dn.4xlarge",
"InitialVariantWeight": 1,
"InitialInstanceCount": 1,
"ModelName": sm_model_name,
"VariantName": "AllTraffic",
}
],
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Invoke the model and run predictions

The following method transforms a sample image we will be using for inference into the payload that can be sent for inference to the Triton server.

The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference:

s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", "shiba_inu_dog.jpg"
)


def get_sample_image():
    image_path = "./shiba_inu_dog.jpg"
    img = Image.open(image_path).convert("RGB")
    img = img.resize((224, 224))
    img = (np.array(img).astype(np.float32) / 255) - np.array(
        [0.485, 0.456, 0.406], dtype=np.float32
    ).reshape(1, 1, 3)
    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
    img = np.transpose(img, (2, 0, 1))
    return img.tolist()


def _get_sample_image_binary(input_name, output_name):
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], "FP32"))
    input_data = np.array(get_sample_image(), dtype=np.float32)
    input_data = np.expand_dims(input_data, axis=0)
    inputs[0].set_data_from_numpy(input_data, binary_data=True)
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length


def get_sample_image_binary_pt():
    return _get_sample_image_binary("INPUT__0", "OUTPUT__0")

After the endpoint is successfully created, we can send inference requests to the MME using the invoke_enpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type:

request_body, header_length = get_sample_image_binary_pt()
response = runtime_sm_client.invoke_endpoint(
EndpointName=endpoint_name,
ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
header_length
),
Body=request_body,
TargetModel="resnet_pt_v0.tar.gz",
)
# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
response["Body"].read(), header_length=int(header_length_str)
)
output0_data = result.as_numpy("OUTPUT__0")
print(output0_data)

Additionally, SageMaker MMEs provide instance-level metrics to monitor using Amazon CloudWatch:

LoadedModelCount – Number of models loaded in the containers
GPUUtilization – Percentage of GPU units that are used by the containers
GPUMemoryUtilization – Percentage of GPU memory used by the containers
DiskUtilization – Percentage of disk space used by the containers

SageMaker MMEs also provides model loading metrics such as the following:

ModelLoadingWaitTime – Time interval for the model to be downloaded or loaded
ModelUnloadingTime – Time interval to unload the model from the container
ModelDownloadingTime – Time to download the model from Amazon S3
ModelCacheHit – Number of invocations to the model that are already loaded onto the container to get model invocation-level insights

For more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

Clean up

In order to avoid incurring charges, delete the model endpoint:

sm_client.delete_model(ModelName=sm_model_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_endpoint(EndpointName=endpoint_name)

Best practices

When using the PyTorch backend, most optimization decisions will depend on your specific workload latency or throughput requirements and what model architecture you are using. In general, in order to do a data-driven comparison of configuration parameters to improve performance, you should use Triton’s Performance Analyzer. With this tool, you should adopt the following decision logic:

Experiment and check if your model architecture can be transformed into a TensorRT engine and deployed with the Triton TensorRT backend. This is the preferable way to deploy models with NVIDIA GPUs because both the TensorRT model format and runtime make the best use of the underlying hardware capabilities.
Always set INFERENCE_MODE to true for pure inference workloads where no autograd calculations are required.
If deploying SMEs, maximize hardware utilization by properly defining instance group configuration according to the available GPU memory or RAM (use the Performance Analyzer tool to find the right size).

For more MME-specific best practices, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.

Conclusion

In this post, we dove deep into the PyTorch backend supported by Triton Inference Server, which provides acceleration for both CPU and GPU based models. We went through some of the configuration parameters you can adjust to optimize model performance. Finally, we provided a walkthrough of an example notebook to demonstrate deploying a SageMaker multi-model endpoint deployment. Be sure to try it out!

About the Authors

Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.

Configure and use defaults for Amazon SageMaker resources with the SageMaker Python SDK

The Amazon SageMaker Python SDK is an open-source library for training and deploying machine learning (ML) models on Amazon SageMaker. Enterprise customers in tightly controlled industries such as healthcare and finance set up security guardrails to ensure their data is encrypted and traffic doesn’t traverse the internet. To ensure the SageMaker training and deployment of ML models follow these guardrails, it’s a common practice to set restrictions at the account or AWS Organizations level through service control policies and AWS Identity and Access Management (IAM) policies to enforce the usage of specific IAM roles, Amazon Virtual Private Cloud (Amazon VPC) configurations, and AWS Key Management Service (AWS KMS) keys. In such cases, data scientists have to provide these parameters to their ML model training and deployment code manually, by noting down subnets, security groups, and KMS keys. This puts the onus on the data scientists to remember to specify these configurations, to successfully run their jobs, and avoid getting Access Denied errors.

Starting with SageMaker Python SDK version 2.148.0, you can now configure default values for parameters such as IAM roles, VPCs, and KMS keys. Administrators and end-users can initialize AWS infrastructure primitives with defaults specified in a configuration file in YAML format. Once configured, the Python SDK automatically inherits these values and propagates them to the underlying SageMaker API calls such as CreateProcessingJob(), CreateTrainingJob(), and CreateEndpointConfig(), with no additional actions needed. The SDK also supports multiple configuration files, allowing admins to set a configuration file for all users, and users can override it via a user-level configuration that can be stored in Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS) for Amazon SageMaker Studio, or the user’s local file system.

In this post, we show you how to create and store the default configuration file in Studio and use the SDK defaults feature to create your SageMaker resources.

Solution overview

We demonstrate this new feature with an end-to-end AWS CloudFormation template that creates the required infrastructure, and creates a Studio domain in the deployed VPC. In addition, we create KMS keys for encrypting the volumes used in training and processing jobs. The steps are as follows:

Launch the CloudFormation stack in your account. Alternatively, if you want to explore this feature on an existing SageMaker domain or notebook, skip this step.
Populate the config.yaml file and save the file in the default location.
Run a sample notebook with an end-to-end ML use case, including data processing, model training, and inference.
Override the default configuration values.

Prerequisites

Before you get started, make sure you have an AWS account and an IAM user or role with administrator privileges. If you are a data scientist currently passing infrastructure parameters to resources in your notebook, you can skip the next step of setting up your environment and start creating the configuration file.

To use this feature, make sure to upgrade your SageMaker SDK version by running pip install --upgrade sagemaker.

Set up the environment

To deploy a complete infrastructure including networking and a Studio domain, complete the following steps:

Clone the GitHub repository.
Log in to your AWS account and open the AWS CloudFormation console.
To deploy the networking resources, choose Create stack.
Upload the template under setup/vpc_mode/01_networking.yaml.
Provide a name for the stack (for example, networking-stack), and complete the remaining steps to create the stack.
To deploy the Studio domain, choose Create stack again.
Upload the template under setup/vpc_mode/02_sagemaker_studio.yaml.
Provide a name for the stack (for example, sagemaker-stack), and provide the name of the networking stack when prompted for the CoreNetworkingStackName parameter.
Proceed with the remaining steps, select the acknowledgements for IAM resources, and create the stack.

When the status of both stacks update to CREATE_COMPLETE, proceed to the next step.

Create the configuration file

To use the default configuration for the SageMaker Python SDK, you create a config.yaml file in the format that the SDK expects. For the format for the config.yaml file, refer to Configuration file structure. Depending on your work environment, such as Studio notebooks, SageMaker notebook instances, or your local IDE, you can either save the configuration file at the default location or override the defaults by passing a config file location. For the default locations for other environments, refer to Configuration file locations. The following steps showcase the setup for a Studio notebook environment.

To easily create the config.yaml file, run the following cells in your Studio system terminal, replacing the placeholders with the CloudFormation stack names from the previous step:

git clone https://github.com/aws-samples/amazon-sagemaker-build-train-deploy.git
cd amazon-sagemaker-build-train-deploy
pip install boto3
python generate-defaults.py --networking-stack <network-stack-name> 
--sagemaker-stack <sagemaker-stack-name>

# save the file to the default location
mkdir .config/sagemaker
cp user-configs.yaml ~/.config/sagemaker/config.yaml

This script automatically populates the YAML file, replacing the placeholders with the infrastructure defaults, and saves the file in the home folder. Then it copies the file into the default location for Studio notebooks. The resulting config file should look similar to the following format:

SageMaker:
  Model:
    EnableNetworkIsolation: false
    VpcConfig:
      SecurityGroupIds:
      - sg-xxxx
      Subnets:
      - subnet-xxxx
      - subnet-xxxx
  ProcessingJob:
    NetworkConfig:
      EnableNetworkIsolation: false
      VpcConfig:
        SecurityGroupIds:
        - sg-xxxx
        Subnets:
        - subnet-xxxx
        - subnet-xxxx
    ProcessingOutputConfig:
      KmsKeyId: arn:aws:kms:us-east-2:0123456789:alias/kms-defaults
    RoleArn: arn:aws:iam::0123456789:role/service-role/AmazonSageMakerExecutionRole-xxx
  TrainingJob:
    EnableNetworkIsolation: false
    VpcConfig:
      SecurityGroupIds:
      - sg-xxxx
      Subnets:
      - subnet-xxxx
      - subnet-xxxx
SchemaVersion: '1.0'
something: '1.0'

If you have an existing domain and networking configuration set up, create the config.yaml file in the required format and save it in the default location for Studio notebooks.

Note that these defaults simply auto-populate the configuration values for the appropriate SageMaker SDK calls, and don’t enforce the user to any specific VPC, subnet, or role. As an administrator, if you want your users to use a specific configuration or role, use IAM condition keys to enforce the default values.

Additionally, each API call can have its own configurations. For example, in the preceding config file sample, you can specify vpc-a and subnet-a for training jobs, and specify vpc-b and subnet-c, subnet-d for processing jobs.

Run a sample notebook

Now that you have set the configuration file, you can start running your model building and training notebooks as usual, without the need to explicitly set networking and encryption parameters, for most SDK functions. See Supported APIs and parameters for a complete list of supported API calls and parameters.

In Studio, choose the File Explorer icon in the navigation pane and open 03_feature_engineering/03_feature_engineering.ipynb, as shown in the following screenshot.

Run the notebook cells one by one, and notice that you are not specifying any additional configuration. When you create the processor object, you will see the cell outputs like the following example.

As you can see in the output, the default configuration is automatically applied to the processing job, without needing any additional input from the user.

When you run the next cell to run the processor, you can also verify the defaults are set by viewing the job on the SageMaker console. Choose Processing jobs under Processing in the navigation pane, as shown in the following screenshot.

Choose the processing job with the prefix end-to-end-ml-sm-proc, and you should be able to view the networking and encryption already configured.

You can continue running the remaining notebooks to train and deploy the model, and you will notice that the infrastructure defaults are automatically applied for both training jobs and models.

Override the default configuration file

There could be cases where a user needs to override the default configuration, for example, to experiment with public internet access, or update the networking configuration if the subnet runs out of IP addresses. In such cases, the Python SDK also allows you to provide a custom location for the configuration file, either on local storage, or you can point to a location in Amazon S3. In this section, we explore an example.

Open the user-configs.yaml file on your home directory and update the EnableNetworkIsolation value to True, under the TrainingJob section.

Now, open the same notebook, and add the following cell to the beginning of the notebook:

import os
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = "~/config.yaml"

With this cell, you point the location of the config file to the SDK. Now, when you create the processor object, you’ll notice that the default config has been overridden to enable network isolation, and the processing job will fail in network isolation mode.

You can use the same override environment variable to set the location of the configuration file if you’re using your local environment such as VSCode.

Debug and retrieve defaults

For quick troubleshooting if you run into any errors when running API calls from your notebook, the cell output displays the applied default configurations as shown in the previous section. To view the exact Boto3 call created to view the attribute values passed from default config file, you can debug by turning on Boto3 logging. To turn on logging, run the following cell at the top of the notebook:

import boto3
import logging
boto3.set_stream_logger(name='botocore.endpoint', level=logging.DEBUG)

Any subsequent Boto3 calls will be logged with the complete request, visible under the body section in the log.

You can also view the collection of default configurations using the session.sagemaker_config value as shown in the following example.

Finally, if you’re using Boto3 to create your SageMaker resources, you can retrieve the default configuration values using the sagemaker_config variable. For example, to run the processing job in 03_feature_engineering.ipynb using Boto3, you can enter the contents of the following cell in the same notebook and run the cell:

import boto3
import sagemaker
session = sagemaker.Session()
client = boto3.client('sagemaker')

# get the default values
subnet_ids = session.sagemaker_config["SageMaker"]["ProcessingJob"]['NetworkConfig']["VpcConfig"]["Subnets"]
security_groups = session.sagemaker_config["SageMaker"]["ProcessingJob"]['NetworkConfig']["VpcConfig"]["SecurityGroupIds"]
kms_key = session.sagemaker_config["SageMaker"]["ProcessingJob"]["ProcessingOutputConfig"]["KmsKeyId"]
role_arn = session.sagemaker_config["SageMaker"]["ProcessingJob"]["RoleArn"]

# upload processing code
code_location = sagemaker_session.upload_data('./source_dir/preprocessor.py', 
                              bucket=s3_bucket_name, 
                              key_prefix=f'{s3_key_prefix}/code')
code_location = ('/').join(code_location.split('/')[:-1])

# create a processing job
response = client.create_processing_job(
    ProcessingJobName='end-to-end-ml-sm-proc-boto3',
    ProcessingInputs=[
        {
            'InputName': 'raw_data',
            "S3Input": {
                "S3Uri": raw_data_path,
                "LocalPath": "/opt/ml/processing/input",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            }
        },
        {
            "InputName": "code",
            "S3Input": {
                "S3Uri": code_location,
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            }
        }
    ],
    ProcessingOutputConfig={
        'Outputs': [
            {
                'OutputName': 'train_data',
                'S3Output': {
                    'S3Uri': train_data_path,
                    'LocalPath': "/opt/ml/processing/train",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'val_data',
                'S3Output': {
                    'S3Uri': val_data_path,
                    'LocalPath': "/opt/ml/processing/val",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'test_data',
                'S3Output': {
                    'S3Uri': test_data_path,
                    'LocalPath': "/opt/ml/processing/test",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'model',
                'S3Output': {
                    'S3Uri': model_path,
                    'LocalPath': "/opt/ml/processing/model",
                    'S3UploadMode': 'EndOfJob'
                },
            },
        ],
        'KmsKeyId': kms_key
    },
    ProcessingResources={
        'ClusterConfig': {
            'InstanceCount': 1,
            'InstanceType': 'ml.m5.large',
            'VolumeSizeInGB': 30,
        }
    },
    AppSpecification={
        "ImageUri": "257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3",
        "ContainerArguments": [
            "--train-test-split-ratio",
            "0.2"
        ],
        "ContainerEntrypoint": [
            "python3",
            "/opt/ml/processing/input/code/preprocessor.py"
        ]
    },
    NetworkConfig={
        'EnableNetworkIsolation': False,
        'VpcConfig': {
            'SecurityGroupIds': security_groups,
            'Subnets': subnet_ids
        }
    },
    RoleArn=role_arn,
)

Automate config file creation

For administrators, having to create the config file and save the file to each SageMaker notebook instance or Studio user profile can be a daunting task. Although you can recommend that users use a common file stored in a default S3 location, it puts the additional overhead of specifying the override on the data scientists.

To automate this, administrators can use SageMaker Lifecycle Configurations (LCC). For Studio user profiles or notebook instances, you can attach the following sample LCC script as a default LCC for the user’s default Jupyter Server app:

# sample LCC script to set default config files on the user's folder
#!/bin/bash

set -eux

# add --endpoint-url [S3 Interface Endpoint] if using an S3 interface endpoint
aws s3 cp <s3-location-of-config-file> ~/.config/sagemaker/config.yaml
echo "config file saved in the default location"

See Use Lifecycle Configurations for Amazon SageMaker Studio or Customize a Notebook Instance for instructions on creating and setting a default lifecycle script.

Clean up

When you’re done experimenting with this feature, clean up your resources to avoid paying additional costs. If you have provisioned new resources as specified in this post, complete the following steps to clean up your resources:

Shut down your Studio apps for the user profile. See Shut Down and Update SageMaker Studio and Studio Apps for instructions. Ensure that all apps are deleted before deleting the stack.
Delete the EFS volume created for the Studio domain. You can view the EFS volume attached with the domain by using a DescribeDomain API call.
Delete the Studio domain stack.
Delete the security groups created for the Studio domain. You can find them on the Amazon Elastic Compute Cloud (Amazon EC2) console, with the names security-group-for-inbound-nfs-d-xxx and security-group-for-outbound-nfs-d-xxx
Delete the networking stack.

Conclusion

In this post, we discussed configuring and using default values for key infrastructure parameters using the SageMaker Python SDK. This allows administrators to set default configurations for data scientists, thereby saving time for users and admins, eliminating the burden of repetitively specifying parameters, and resulting in leaner and more manageable code. For the full list of supported parameters and APIs, see Configuring and using defaults with the SageMaker Python SDK. For any questions and discussions, join the Machine Learning & AI community.

About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years software engineering an ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, Computer Vision, NLP, and involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size on helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise are Machine Learning end to end, Machine Learning Industrialization and MLOps. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and long walks with her 5-year-old husky.

Large sequence models for software development activities

Posted by Petros Maniatis and Daniel Tarlow, Research Scientists, Google

Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers.

Today we describe DIDACT (Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google’s software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are extremely promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.

DIDACT is a multi-task model trained on development activities that include editing, debugging, repair, and code review.

We built and deployed internally three DIDACT tools, Comment Resolution (which we recently announced), Build Repair, and Tip Prediction, each integrated at different stages of the development workflow. All three of these tools received enthusiastic feedback from thousands of internal developers. We see this as the ultimate test of usefulness: do professional developers, who are often experts on the code base and who have carefully honed workflows, leverage the tools to improve their productivity?

Perhaps most excitingly, we demonstrate how DIDACT is a first step towards a general-purpose developer-assistance agent. We show that the trained model can be used in a variety of surprising ways, via prompting with prefixes of developer activities, and by chaining together multiple predictions to roll out longer activity trajectories. We believe DIDACT paves a promising path towards developing agents that can generally assist across the software development process.

A treasure trove of data about the software engineering process

Google’s software engineering toolchains store every operation related to code as a log of interactions among tools and developers, and have done so for decades. In principle, one could use this record to replay in detail the key episodes in the “software engineering video” of how Google’s codebase came to be, step-by-step — one code edit, compilation, comment, variable rename, etc., at a time.

Google code lives in a monorepo, a single repository of code for all tools and systems. A software developer typically experiments with code changes in a local copy-on-write workspace managed by a system called Clients in the Cloud (CitC). When the developer is ready to package a set of code changes together for a specific purpose (e.g., fixing a bug), they create a changelist (CL) in Critique, Google’s code-review system. As with other types of code-review systems, the developer engages in a dialog with a peer reviewer about functionality and style. The developer edits their CL to address reviewer comments as the dialog progresses. Eventually, the reviewer declares “LGTM!” (“looks good to me”), and the CL is merged into the code repository.

Of course, in addition to a dialog with the code reviewer, the developer also maintains a “dialog” of sorts with a plethora of other software engineering tools, such as the compiler, the testing framework, linters, static analyzers, fuzzers, etc.

An illustration of the intricate web of activities involved in developing software: small actions by the developer, interactions with a code reviewer, and invocations of tools such as compilers.

A multi-task model for software engineering

DIDACT utilizes interactions among engineers and tools to power ML models that assist Google developers, by suggesting or enhancing actions developers take — in context — while pursuing their software-engineering tasks. To do that, we have defined a number of tasks about individual developer activities: repairing a broken build, predicting a code-review comment, addressing a code-review comment, renaming a variable, editing a file, etc. We use a common formalism for each activity: it takes some State (a code file), some Intent (annotations specific to the activity, such as code-review comments or compiler errors), and produces an Action (the operation taken to address the task). This Action is like a mini programming language, and can be extended for newly added activities. It covers things like editing, adding comments, renaming variables, marking up code with errors, etc. We call this language DevScript.

The DIDACT model is prompted with a task, code snippets, and annotations related to that task, and produces development actions, e.g., edits or comments.

This state-intent-action formalism enables us to capture many different tasks in a general way. What’s more, DevScript is a concise way to express complex actions, without the need to output the whole state (the original code) as it would be after the action takes place; this makes the model more efficient and more interpretable. For example, a rename might touch a file in dozens of places, but a model can predict a single rename action.

An ML peer programmer

DIDACT does a good job on individual assistive tasks. For example, below we show DIDACT doing code clean-up after functionality is mostly done. It looks at the code along with some final comments by the code reviewer (marked with “human” in the animation), and predicts edits to address those comments (rendered as a diff).

Given an initial snippet of code and the comments that a code reviewer attached to that snippet, the Pre-Submit Cleanup task of DIDACT produces edits (insertions and deletions of text) that address those comments.

The multimodal nature of DIDACT also gives rise to some surprising capabilities, reminiscent of behaviors emerging with scale. One such capability is history augmentation, which can be enabled via prompting. Knowing what the developer did recently enables the model to make a better guess about what the developer should do next.

An illustration of history-augmented code completion in action.

A powerful such task exemplifying this capability is history-augmented code completion. In the figure below, the developer adds a new function parameter (1), and moves the cursor into the documentation (2). Conditioned on the history of developer edits and the cursor position, the model completes the line (3) by correctly predicting the docstring entry for the new parameter.

An illustration of edit prediction, over multiple chained iterations.

In an even more powerful history-augmented task, edit prediction, the model can choose where to edit next in a fashion that is historically consistent. If the developer deletes a function parameter (1), the model can use history to correctly predict an update to the docstring (2) that removes the deleted parameter (without the human developer manually placing the cursor there) and to update a statement in the function (3) in a syntactically (and — arguably — semantically) correct way. With history, the model can unambiguously decide how to continue the “editing video” correctly. Without history, the model wouldn’t know whether the missing function parameter is intentional (because the developer is in the process of a longer edit to remove it) or accidental (in which case the model should re-add it to fix the problem).

The model can go even further. For example, we started with a blank file and asked the model to successively predict what edits would come next until it had written a full code file. The astonishing part is that the model developed code in a step-by-step way that would seem natural to a developer: It started by first creating a fully working skeleton with imports, flags, and a basic main function. It then incrementally added new functionality, like reading from a file and writing results, and added functionality to filter out some lines based on a user-provided regular expression, which required changes across the file, like adding new flags.

Conclusion

DIDACT turns Google’s software development process into training demonstrations for ML developer assistants, and uses those demonstrations to train models that construct code in a step-by-step fashion, interactively with tools and code reviewers. These innovations are already powering tools enjoyed by Google developers every day. The DIDACT approach complements the great strides taken by large language models at Google and elsewhere, towards technologies that ease toil, improve productivity, and enhance the quality of work of software engineers.

Acknowledgements

This work is the result of a multi-year collaboration among Google Research, Google Core Systems and Experiences, and DeepMind. We would like to acknowledge our colleagues Jacob Austin, Pascal Lamblin, Pierre-Antoine Manzagol, and Daniel Zheng, who join us as the key drivers of this project. This work could not have happened without the significant and sustained contributions of our partners at Alphabet (Peter Choy, Henryk Michalewski, Subhodeep Moitra, Malgorzata Salawa, Vaibhav Tulsyan, and Manushree Vijayvergiya), as well as the many people who collected data, identified tasks, built products, strategized, evangelized, and helped us execute on the many facets of this agenda (Ankur Agarwal, Paige Bailey, Marc Brockschmidt, Rodrigo Damazio Bovendorp, Satish Chandra, Savinee Dancs, Matt Frazier, Alexander Frömmgen, Nimesh Ghelani, Chris Gorgolewski, Chenjie Gu, Vincent Hellendoorn, Franjo Ivančić, Marko Ivanković, Emily Johnston, Luka Kalinovcic, Lera Kharatyan, Jessica Ko, Markus Kusano, Kathy Nix, Sara Qu, Marc Rasi, Marcus Revaj, Ballie Sandhu, Michael Sloan, Tom Small, Gabriela Surita, Maxim Tabachnyk, David Tattersall, Sara Toth, Kevin Villela, Sara Wiltberger, and Donald Duo Zhao) and our extremely supportive leadership (Martín Abadi, Joelle Barral, Jeff Dean, Madhura Dudhgaonkar, Douglas Eck, Zoubin Ghahramani, Hugo Larochelle, Chandu Thekkath, and Niranjan Tulpule). Thank you!

Accelerate your learning towards AWS Certification exams with automated quiz generation using Amazon SageMaker foundations models

Getting AWS Certified can help you propel your career, whether you’re looking to find a new role, showcase your skills to take on a new project, or become your team’s go-to expert. And because AWS Certification exams are created by experts in the relevant role or technical area, preparing for one of these exams helps you build the required skills identified by skilled practitioners in the field.

Reading the FAQ page of the AWS services relevant for your certification exam is important in order to acquire a deeper understanding of the service. However, this could take quite some time. Reading FAQs of even one service can take half a day to read and understand. For example, the Amazon SageMaker FAQ contains about 33 pages (printed) of content just on SageMaker.

Wouldn’t it be an easier and more fun learning experience if you could use a system to test yourself on the AWS service FAQ pages? Actually, you can develop such a system using state-of-the-art language models and a few lines of Python.

In this post, we present a comprehensive guide of deploying a multiple-choice quiz solution for the FAQ pages of any AWS service, based on the AI21 Jurassic-2 Jumbo Instruct foundation model on Amazon SageMaker Jumpstart.

Large language models

In recent years, language models have seen a huge surge in size and popularity. In 2018, BERT-large made its debut with its 340 million parameters and innovative transformer architecture, setting the benchmark for performance on NLP tasks. In a few short years, the state-of-the-art in terms of model size has ballooned by over 500 times; OpenAI’s GPT-3 and Bloom 176 B, both with 175 billion parameters, and AI21 Jurassic-2 Jumbo Instruct with 178 billion parameters are just three examples of large language models (LLMs) raising the bar on natural language processing (NLP) accuracy.

SageMaker foundation models

SageMaker provides a range of models from popular model hubs including Hugging Face, PyTorch Hub, and TensorFlow Hub, and propriety ones from AI21, Cohere, and LightOn, which you can access within your machine learning (ML) development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which have billions of parameters and are trained on massive amounts of data. Those foundation models can be adapted to a wide range of use cases, such as text summarization, generating digital art, and language translation. Because these models can be expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

With JumpStart, you can find foundation models from different providers, enabling you to get started with foundation models quickly. You can review model characteristics and usage terms, and try out these models using a test UI widget. When you’re ready to use a foundation model at scale, you can do so easily without leaving SageMaker by using pre-built notebooks from model providers. Your data, whether used for evaluating or using the model at scale, is never shared with third parties because the models are hosted and deployed on AWS.

AI21 Jurassic-2 Jumbo Instruct

Jurassic-2 Jumbo Instruct is an LLM by AI21 Labs that can be applied to any language comprehension or generation task. It’s optimized to follow natural language instructions and context, so there is no need to provide it with any examples. The endpoint comes pre-loaded with the model and ready to serve queries via an easy-to-use API and Python SDK, so you can hit the ground running. Jurassic-2 Jumbo Instruct is a top performer at HELM, particularly in tasks related to reading and writing.

Solution overview

In the following sections, we go through the steps to test the Jurassic-2 Jumbo instruct model in SageMaker:

Choose the Jurassic-2 Jumbo instruct model on the SageMaker console.
Evaluate the model using the playground.
Use a notebook associated with the foundation model to deploy it in your environment.

Access Jurassic-2 Jumbo Instruct through the SageMaker console

The first step is to log in to the SageMaker console. Under JumpStart in the navigation pane, choose Foundation models to request access to the model list.

After your account is allow listed, you can see a list of models on this page and search for the Jurassic-2 Jumbo Instruct model.

Evaluate the Jurassic-2 Jumbo Instruct model in the model playground

On the AI21 Jurassic-2 Jumbo Instruct listing, choose View Model. You will see a description of the model and the tasks that you can perform. Read through the EULA for the model before proceeding.

Let’s first try out the model to generate a test based on the SageMaker FAQ page. Navigate to the Playground tab.

On the Playground tab, you can provide sample prompts to the Jurassic-2 Jumbo Instruct model and view the output.

Note that you can use a maximum of 500 tokens. We set the Max length to 500, which is the maximum number of tokens to generate. This model has an 8,192-token context window (the length of the prompt plus completion should be at most 8,192 tokens).

To make it easier to see the prompt, you can enlarge the Prompt box.

Because we can use a maximum of 500 tokens, we take a small portion of the Amazon SageMaker FAQs page, the Low-code ML section, for our test prompt.

We use the following prompt:

Below is SageMaker Low-code ML FAQ:

##
Q: Will my data (from inference or training) be used or shared to update the base model that is offered to customers using Amazon SageMaker JumpStart?
No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers.

Q: Can I see the model weights and scripts of proprietary models in preview with Amazon SageMaker JumpStart?
No. Proprietary models do not allow customers to view model weights and scripts.

Q: Which open-source models are supported with Amazon SageMaker JumpStart?
Amazon SageMaker JumpStart includes 150+ pre-trained open-source models from PyTorch Hub and TensorFlow Hub. For vision tasks such as image classification and object detection, you can use models such as ResNet, MobileNet, and Single-Shot Detector (SSD). For text tasks such as sentence classification, text classification, and question answering, you can use models such as BERT, RoBERTa, and DistilBERT.

Q: What solutions come pre-built with Amazon SageMaker JumpStart?
SageMaker JumpStart includes solutions that are preconfigured with all necessary AWS services to launch a solution into production. Solutions are fully customizable so you can easily modify them to fit your specific use case and dataset. You can use solutions for over 15 use cases including demand forecasting, fraud detection, and predictive maintenance, and readily deploy solutions with just a few clicks. For more information about all solutions available, visit the SageMaker getting started page. 

Q: What built-in algorithms are supported in Amazon SageMaker Autopilot?
Amazon SageMaker Autopilot supports 2 built-in algorithms: XGBoost and Linear Learner.

Q: Can I stop an Amazon SageMaker Autopilot job manually?
Yes. You can stop a job at any time. When an Amazon SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.
##

Create a multiple choice quiz on the topic of SageMaker Low-code ML FAQ consisting of 4 questions. Each question should have 4 options. Also include the correct answer for each question using the starting string 'Correct Answer:`

Prompt engineering is an iterative process. You should be clear and specific, and give the model time to think.

Here we specified the context with ## as stop sequences, which signals the model to stop generating after this character or string is generated. It’s useful when using a few-shot prompt.

Below is SageMaker Low-code ML FAQ:

##
<SageMaker Low-code ML FAQ content>
##

Next, we are clear and very specific in our prompt, asking for a multiple-choice quiz, consisting of four questions with four options. We ask the model to include the correct answer for each question using the starting string 'Correct Answer:' so we can parse it later using Python:

Create a multiple choice quiz on the topic of SageMaker Low-code ML FAQ consisting of 4 questions. Each question should have 4 options. Also include the correct answer for each question using the starting string 'Correct Answer:`

A well-designed prompt can make the model more creative and generalized so that it can easily adapt to new tasks. Prompts can also help incorporate domain knowledge on specific tasks and improve interpretability. Prompt engineering can greatly improve the performance of zero-shot and few-shot learning models. Creating high-quality prompts requires careful consideration of the task at hand, as well as a deep understanding of the model’s strengths and limitations.

In the scope of this post, we don’t cover this wide area further.

Copy the prompt and enter it in the Prompt box, then choose Generate text.

This sends the prompt to the Jurassic-2 Jumbo Instruct model for inference. Note that experimenting in the playground is free.

Also keep in mind that despite the cutting-edge nature of LLMs, they are still prone to biases, errors, and hallucinations.

After reading the model output thoroughly and carefully, we can see that the model generated quite a good quiz!

After you have played with the model, it’s time to use the notebook and deploy it as an endpoint in your environment. We use a small Python function to parse the output and simulate an interactive test.

Deploy the Jurassic-2 Jumbo Instruct foundation model from a notebook

You can use the following sample notebook to deploy Jurassic-2 Jumbo Instruct using SageMaker. Note that this example uses an ml.p4d.24xlarge instance. If your default limit for your AWS account is 0, you need to request a limit increase for this GPU instance.

Let’s create the endpoint using SageMaker inference. First, we set the necessary variables, then we deploy the model from the model package:

endpoint_name = "j2-jumbo-instruct"

content_type = "application/json"

real_time_inference_instance_type = (
"ml.p4d.24xlarge"
)

# create a deployable model from the model package.
model = ModelPackage(
role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=endpoint_name,
model_data_download_timeout=3600,
container_startup_health_check_timeout=600,
)

After the endpoint is deployed, you can run inference queries against the model.

After the model is deployed, you can interact with the deployed endpoint using the following code snippet:

response = ai21.Completion.execute(sm_endpoint=endpoint_name,
prompt=instruction,
maxTokens=2048,
temperature=0.7,
numResults=1,
stopSequences=['##'])

output = response['completions'][0]['data']['text']

With the Jurassic-2 Jumbo Instruct foundation model deployed on an ml.p4d.24xlarge instance SageMaker endpoint, you can use a prompt with 4,096 tokens. You can take the same prompt we used in the playground and add many more questions. In this example, we added the FAQ’s entire Low-code ML section as context into the prompt.

We can see the output of the model, which generated a multiple-choice quiz with four questions and four options for each question.

Now you can develop a Python function to parse the output and create an interactive multiple-choice quiz.

It’s quite straightforward to develop such a function with a few lines of code. You can parse the answer easily because the model created a line with “Correct Answer: ” for each question, exactly as we requested in the prompt. We don’t provide the Python code for the quiz generation in the scope of this post.

Run the quiz in the notebook

Using the Python function we created earlier and the output from the Jurassic-2 Jumbo Instruct foundation model, we run the interactive quiz in the notebook.

You can see I answered three out of four questions correctly and got a 75% grade. Perhaps I need to read the SageMaker FAQ a few more times!

Clean up

After you have tried out the endpoint, make sure to remove the SageMaker inference endpoint and the model to prevent any charges:

model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

model.delete_model()

Conclusion

In this post, we showed you how you can test and use AI21’s Jurassic-2 Jumbo Instruct model using SageMaker to build an automated quiz generation system. This was achieved using a rather simple prompt with a publicly available SageMaker FAQ page’s text embedded and a few lines of Python code.

Similar to this example mentioned in the post, you can customize a foundation model for your business with just a few labeled examples. Because all the data is encrypted and doesn’t leave your AWS account, you can trust that your data will remain private and confidential.

Request access to try out the foundation model in SageMaker today, and let us know your feedback!

About the Author

Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Improving Mathematical Reasoning with Process Supervision

We’ve trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.OpenAI Blog

Amazon SageMaker XGBoost now offers fully distributed GPU training

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

The SageMaker XGBoost algorithm allows you to easily run XGBoost training and inference on SageMaker. XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in ML competitions because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. You can use GPUs to accelerate training on large datasets.

Today, we are happy to announce that SageMaker XGBoost now offers fully distributed GPU training.

Starting with version 1.5-1 and above, you can now utilize all GPUs when using multi-GPU instances. The new feature addresses your needs to use fully distributed GPU training when dealing with large datasets. This means being able to use multiple Amazon Elastic Compute Cloud (Amazon EC2) instances (GPU) and using all GPUs per instance.

Distributed GPU training with multi-GPU instances

With SageMaker XGBoost version 1.2-2 or later, you can use one or more single-GPU instances for training. The hyperparameter tree_method needs to be set to gpu_hist. When using more than one instance (distributed setup), the data needs to be divided among instances as follows (the same as the non-GPU distributed training steps mentioned in XGBoost Algorithm). Although this option is performant and can be used in various training setups, it doesn’t extend to using all GPUs when choosing multi-GPU instances such as g5.12xlarge.

With SageMaker XGBoost version 1.5-1 and above, you can now use all GPUs on each instance when using multi-GPU instances. The ability to use all GPUs in multi-GPU instance is offered by integrating the Dask framework.

You can use this setup to complete training quickly. Apart from saving time, this option will also be useful to work around blockers such as maximum usable instance (soft) limits, or if the training job is unable to provision a large number of single-GPU instances for some reason.

The configurations to use this option are the same as the previous option, except for the following differences:

Add the new hyperparameter use_dask_gpu_training with string value true.
When creating TrainingInput, set the distribution parameter to FullyReplicated, whether using single or multiple instances. The underlying Dask framework will carry out the data load and split the data among Dask workers. This is different from the data distribution setting for all other distributed training with SageMaker XGBoost.

Note that splitting data into smaller files still applies for Parquet, where Dask will read each file as a partition. Because you’ll have a Dask worker per GPU, the number of files should be greater than instance count * GPU count per instance. Also, making each file too small and having a very large number of files can degrade performance. For more information, see Avoid Very Large Graphs. For CSV, we still recommend splitting up large files into smaller ones to reduce data download time and enable quicker reads. However, it’s not a requirement.

Currently, the supported input formats with this option are:

text/csv
application/x-parquet

The following input mode is supported:

File mode

The code will look similar to the following:

import os
import boto3
import re
import sagemaker
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost

role = sagemaker.get_execution_role()
region = sagemaker.Session().boto_region_name
session = Session()

bucket = "<Specify S3 Bucket>"
prefix = "<Specify S3 prefix>"

hyperparams = {
    "objective": "reg:squarederror",
    "num_round": "500",
    "verbosity": "3",
    "tree_method": "gpu_hist",
    "eval_metric": "rmse",
    "use_dask_gpu_training": "true"
}


output_path = "s3://{}/{}/output".format(bucket, prefix)

content_type = "application/x-parquet"
instance_type = "ml.g4dn.2xlarge"

xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
xgb_script_mode_estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container,
    hyperparameters=hyperparams,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    output_path=output_path,
    max_run=7200,

)

test_data_uri = " <specify the S3 uri for training dataset>"
validation_data_uri = “<specify the S3 uri for validation dataset>”

train_input = TrainingInput(
    test_data_uri, content_type=content_type
)

validation_input = TrainingInput(
    validation_data_uri, content_type=content_type
)

xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})

The following screenshots show a successful training job log from the notebook.

Benchmarks

We benchmarked evaluation metrics to ensure that the model quality didn’t deteriorate with the multi-GPU training path compared to single-GPU training. We also benchmarked on large datasets to ensure that our distributed GPU setups were performant and scalable.

Billable time refers to the absolute wall-clock time. Training time is only the XGBoost training time, measured from the train() call until the model is saved to Amazon Simple Storage Service (Amazon S3).

Performance benchmarks on large datasets

The use of multi-GPU is usually appropriate for large datasets with complex training. We created a dummy dataset with 2,497,248,278 rows and 28 features for testing. The dataset was 150 GB and composed of 1,419 files. Each file was sized between 105–115 MB. We saved the data in Parquet format in an S3 bucket. To simulate somewhat complex training, we used this dataset for a binary classification task, with 1,000 rounds, to compare performance between the single-GPU training path and the multi-GPU training path.

The following table contains the billable training time and performance comparison between the single-GPU training path and the multi-GPU training path.

Single-GPU Training Path
Instance Type	Instance Count	Billable Time / Instance(s)	Training Time(s)
g4dn.xlarge	20	Out of Memory
g4dn.2xlarge	20	Out of Memory
g4dn.4xlarge	15	1710	1551.9
	16	1592	1412.2
	17	1542	1352.2
	18	1423	1281.2
	19	1346	1220.3

Multi-GPU Training Path (with Dask)
Instance Type	Instance Count	Billable Time / Instance(s)	Training Time(s)
.	.	.	.
g4dn.12xlarge	7	Out of Memory
	8	1143	784.7
	9	1039	710.73
	10	978	676.7
	12	940	614.35

We can see that using multi-GPU instances results in low training time and low overall time. The single-GPU training path still has some advantage in downloading and reading only part of the data in each instance, and therefore low data download time. It also doesn’t suffer from Dask’s overhead. Therefore, the difference between training time and total time is smaller. However, due to using more GPUs, multi-GPU setup can cut training time significantly.

You should use an EC2 instance that has enough compute power to avoid out of memory errors when dealing with large datasets.

It’s possible to reduce total time further with the single-GPU setup by using more instances or more powerful instances. However, in terms of cost, it might be more expensive. For example, the following table shows the training time and cost comparison with a single-GPU instance g4dn.8xlarge.

Single-GPU Training Path
Instance Type	Instance Count	Billable Time / Instance(s)	Cost ($)
g4dn.8xlarge	15	1679	15.22
	17	1509	15.51
	19	1326	15.22

Multi-GPU Training Path (with Dask)
Instance Type	Instance Count	Billable Time / Instance(s)	Cost ($)
g4dn.12xlarge	8	1143	9.93
	10	978	10.63
	12	940	12.26

Cost calculation is based on the On-Demand price for each instance. For more information, refer to Amazon EC2 G4 Instances.

Model quality benchmarks

For model quality, we compared evaluation metrics between the Dask GPU option and the single-GPU option, and ran training on various instance types and instance counts. For different tasks, we used different datasets and hyperparameters, with each dataset split into training, validation, and test sets.

For a binary classification (binary:logistic) task, we used the HIGGS dataset in CSV format. The training split of the dataset has 9,348,181 rows and 28 features. The number of rounds used was 1,000. The following table summarizes the results.

Multi-GPU Training with Dask
*Instance Type*	*Num GPUs / Instance*	*Instance Count*	*Billable Time / Instance(s)*	*Accuracy %*	*F1 %*	*ROC AUC %*
g4dn.2xlarge	1	1	343	75.97	77.61	84.34
g4dn.4xlarge	1	1	413	76.16	77.75	84.51
g4dn.8xlarge	1	1	413	76.16	77.75	84.51
g4dn.12xlarge	4	1	157	76.16	77.74	84.52

For regression (reg:squarederror), we used the NYC green cab dataset (with some modifications) in Parquet format. The training split of the dataset has 72,921,051 rows and 8 features. The number of rounds used was 500. The following table shows the results.

Multi-GPU Training with Dask
*Instance Type*	*Num GPUs / Instance*	*Instance Count*	*Billable Time / Instance(s)*	*MSE*	R2	*MAE*
g4dn.2xlarge	1	1	775	21.92	0.7787	2.43
g4dn.4xlarge	1	1	770	21.92	0.7787	2.43
g4dn.8xlarge	1	1	705	21.92	0.7787	2.43
g4dn.12xlarge	4	1	253	21.93	0.7787	2.44

Model quality metrics are similar between the multi-GPU (Dask) training option and the existing training option. Model quality remains consistent when using a distributed setup with multiple instances or GPUs.

Conclusion

In this post, we gave an overview of how you can use different instance type and instance count combinations for distributed GPU training with SageMaker XGBoost. For most use cases, you can use single-GPU instances. This option provides a wide range of instances to use and is very performant. You can use multi-GPU instances for training with large datasets and lots of rounds. It can provide quick training with a smaller number of instances. Overall, you can use SageMaker XGBoost’s distributed GPU setup to immensely speed up your XGBoost training.

To learn more about SageMaker and distributed training using Dask, check out Amazon SageMaker built-in LightGBM now offers distributed training using Dask

About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Dewan Choudhury is a Software Development Engineer with Amazon Web Services. He works on Amazon SageMaker’s algorithms and JumpStart offerings. Apart from building AI/ML infrastructures, he is also passionate about building scalable distributed systems.

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Tony Cruz

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 5: Hosting

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we have helped hundreds of customers optimize their workloads, set guardrails, and improve visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. In this post, we focus on SageMaker inference environments: real-time inference, batch transform, asynchronous inference, and serverless inference.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

SageMaker offers multiple inference options for you to pick from based on your workload requirements:

Real-time inference for online, low latency, or high throughput requirements
Batch transform for offline, scheduled processing and when you don’t need a persistent endpoint
Asynchronous inference for when you have large payloads with long processing times and want to queue requests
Serverless inference for when you have intermittent or unpredictable traffic patterns and can tolerate cold starts

In the following sections, we discuss each inference option in more detail.

SageMaker real-time inference

When you create an endpoint, SageMaker attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to the Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts the endpoint. This is true for all instance types that don’t come with a SSD storage. Because the d* instance types come with an NVMe SSD storage, SageMaker doesn’t attach an EBS storage volume to these ML compute instances. Refer to Host instance storage volumes for the size of the storage volumes that SageMaker attaches for each instance type for a single endpoint and for a multi-model endpoint.

The cost of SageMaker real-time endpoints is based on the per instance-hour consumed for each instance while the endpoint is running, the cost of GB-month of provisioned storage (EBS volume), as well as the GB data processed in and out of the endpoint instance, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can view real-time endpoint costs by applying a filter on the usage type. The names of these usage types are structured as follows:

REGION-Host:instanceType (for example, USE1-Host:ml.c5.9xlarge)
REGION-Host:VolumeUsage.gp2 (for example, USE1-Host:VolumeUsage.gp2)
REGION-Hst:Data-Bytes-Out (for example, USE2-Hst:Data-Bytes-In)
REGION-Hst:Data-Bytes-Out (for example, USW2-Hst:Data-Bytes-Out)

As shown in the following screenshot, filtering by the usage type Host: will show a list of real-time hosting usage types in an account.

You can either select specific usage types or select Select All and choose Apply to display the cost breakdown of SageMaker real-time hosting usage. To see the cost and usage breakdown by instance hours, you need to de-select all the REGION-Host:VolumeUsage.gp2 usage types before applying the usage type filter. You can also apply additional filters such as account number, EC2 instance type, cost allocation tag, Region, and more. The following screenshot shows cost and usage graphs for the selected hosting usage types.

Additionally, you can explore the cost associated with one or more hosting instances by using the Instance type filter. The following screenshot shows cost and usage breakdown for hosting instance ml.p2.xlarge.

Similarly, the cost for GB data processed in and processed out can be displayed by selecting the associated usage types as an applied filter, as shown in the following screenshot.

After you have achieved your desired results with filters and groupings, you can either download your results by choosing Download as CSV or save the report by choosing Save to report library. For general guidance on using Cost Explorer, refer to AWS Cost Explorer’s New Look and Common Use Cases.

Optionally, you can enable AWS Cost and Usage Reports (AWS CUR) to gain insights into the cost and usage data for your accounts. AWS CUR contains hourly AWS consumption details. It’s stored in Amazon Simple Storage Service (Amazon S3) in the payer account, which consolidates data for all the linked accounts. You can run queries to analyze trends in your usage and take appropriate action to optimize cost. Amazon Athena is a serverless query service that you can use to analyze the data from AWS CUR in Amazon S3 using standard SQL. More information and example queries can be found in the AWS CUR Query Library.

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

You can obtain resource-level information such as endpoint ARN, endpoint instance types, hourly instance rate, daily usage hours, and more from AWS CUR. You can also include cost-allocation tags in your query for an additional level of granularity. The following example query returns real-time hosting resource usage for the last 3 months for the given payer account:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS endpoint_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Host%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the query using Athena. For more information, refer to Querying Cost and Usage Reports using Amazon Athena.

The result of the query shows that endpoint mme-xgboost-housing with ml.x4.xlarge instance is reporting 24 hours of runtime for multiple consecutive days. The instance rate is $0.24/hour and the daily cost for running for 24 hours is $5.76.

AWS CUR results can help you identify patterns of endpoints running for consecutive days in each of the linked accounts, as well as endpoints with the highest monthly cost. This can also help you decide whether the endpoints in non-production accounts can be deleted to save cost.

Optimize costs for real-time endpoints

From a cost management perspective, it’s important to identify under-utilized (or over-sized) instances and bring the instance size and counts, if required, in line with workload requirements. Common system metrics like CPU/GPU utilization and memory utilization are written to Amazon CloudWatch for all hosting instances. For real-time endpoints, SageMaker makes several additional metrics available in CloudWatch. Some of the commonly monitored metrics include invocation counts and invocation 4xx/5xx errors. For a full list of metrics, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

The metric CPUUtilization provides the sum of each individual CPU core’s utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the CPUUtilization range is 0–400%. The metric MemoryUtilization is the percentage of memory that is used by the containers on an instance. This value range is 0–100%. The following screenshot shows an example of CloudWatch metrics CPUUtilization and MemoryUtilization for an endpoint instance ml.m4.10xlarge that comes with 40 vCPUs and 160 GiB memory.

These metrics graphs show maximum CPU utilization of approximately 3,000%, which is the equivalent of 30 vCPUs. This means that this endpoint isn’t utilizing more than 30 vCPUs out of the total capacity of 40 vCPUs. Similarly, the memory utilization is below 6%. Using this information, you can possibly experiment with a smaller instance that can match this resource need. Furthermore, the CPUUtilization metric shows a classic pattern of periodic high and low CPU demand, which makes this endpoint a good candidate for auto scaling. You can start with a smaller instance and scale out first as your compute demand changes. For information, see Automatically Scale Amazon SageMaker Models.

SageMaker is great for testing new models because you can easily deploy them into an A/B testing environment using production variants, and you only pay for what you use. Each production variant runs on its own compute instance and you’re charged per instance-hour consumed for each instance while the variant is running.

SageMaker also supports shadow variants, which have the same components as a production variant and run on their own compute instance. With shadow variants, SageMaker automatically deploys the model in a test environment, routes a copy of the inference requests received by the production model to the test model in real time, and collects performance metrics such as latency and throughput. This enables you to validate any new candidate component of your model serving stack before promoting it to production.

When you’re done with your tests and aren’t using the endpoint or the variants extensively anymore, you should delete it to save cost. Because the model is stored in Amazon S3, you can recreate it as needed. You can automatically detect these endpoints and take corrective actions (such as deleting them) by using Amazon CloudWatch Events and AWS Lambda functions. For example, you can use the Invocations metric to get the total number of requests sent to a model endpoint and then detect if the endpoints have been idle for the past number of hours (with no invocations over a certain period, such as 24 hours).

If you have several under-utilized endpoint instances, consider hosting options such as multi-model endpoints (MMEs), multi-container endpoints (MCEs), and serial inference pipelines to consolidate usage to fewer endpoint instances.

For real-time and asynchronous inference model deployment, you can optimize cost and performance by deploying models on SageMaker using AWS Graviton. AWS Graviton is a family of processors designed by AWS that provide the best price performance and are more energy efficient than their x86 counterparts. For guidance on deploying an ML model to AWS Graviton-based instances and details on the price performance benefit, refer to Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker. SageMaker also supports AWS Inferentia accelerators through the ml.inf2 family of instances for deploying ML models for real-time and asynchronous inference. You can use these instances on SageMaker to achieve high performance at a low cost for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers.

In addition, you can use Amazon SageMaker Inference Recommender to run load tests and evaluate the price performance benefits of deploying your model on these instances. For additional guidance on automatically detecting idle SageMaker endpoints, as well as instance right-sizing and auto scaling for SageMaker endpoints, refer to Ensure efficient compute resources on Amazon SageMaker.

SageMaker batch transform

Batch inference, or offline inference, is the process of generating predictions on a batch of observations. Offline predictions are suitable for larger datasets and in cases where you can afford to wait several minutes or hours for a response.

The cost for SageMaker batch transform is based on the per instance-hour consumed for each instance while the batch transform job is running, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can explore batch transform costs by applying a filter on the usage type. The name of this usage type is structured as REGION-Tsform:instanceType (for example, USE1-Tsform:ml.c5.9xlarge).

As shown in the following screenshot, filtering by usage type Tsform: will show a list of SageMaker batch transform usage types in an account.

You can either select specific usage types or select Select All and choose Apply to display the cost breakdown of batch transform instance usage for the selected types. As mentioned earlier, you can also apply additional filters. The following screenshot shows cost and usage graphs for the selected batch transform usage types.

Optimize costs for batch transform

SageMaker batch transform only charges you for the instances used while your jobs are running. If your data is already in Amazon S3, then there is no cost for reading input data from Amazon S3 and writing output data to Amazon S3. All output objects are attempted to be uploaded to Amazon S3. If all are successful, then the batch transform job is marked as complete. If one or more objects fail, the batch transform job is marked as failed.

Charges for batch transform jobs apply in the following scenarios:

The job is successful
Failure due to ClientError and the model container is SageMaker or a SageMaker managed framework
Failure due to AlgorithmError or ClientError and the model container is your own custom container (BYOC)

The following are some of the best practices for optimizing a SageMaker batch transform job. These recommendations can reduce the total runtime of your batch transform job, thereby lowering costs:

Set BatchStrategy to MultiRecord and SplitType to Line if you need the batch transform job to make mini batches from the input file. If it can’t automatically split the dataset into mini batches, you can divide it into mini batches by putting each batch in a separate input file, placed in the data source S3 bucket.
Make sure that the batch size fits into the memory. SageMaker usually handles this automatically; however, when dividing batches manually, this needs to be tuned based on the memory.
Batch transform partitions the S3 objects in the input by key and maps those objects to instances. When you have multiples files, one instance might process input1.csv, and another instance might process input2.csv. If you have one input file but initialize multiple compute instances, only one instance processes the input file and the rest of the instances are idle. Make sure the number of files is equal to or greater than the number of instances.
If you have a large number of small files, it may be beneficial to combine multiple files into a small number of bigger files to reduce Amazon S3 interaction time.
If you’re using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy:
- MaxConcurrentTransforms indicates the maximum number of parallel requests that can be sent to each instance in a transform job. The ideal value for MaxConcurrentTransforms is equal to the number of vCPU cores in an instance.
- MaxPayloadInMB is the maximum allowed size of the payload, in MB. The value in MaxPayloadInMB must be greater than or equal to the size of a single record. To estimate the size of a record in MB, divide the size of your dataset by the number of records. To ensure that the records fit within the maximum payload size, we recommend using a slightly larger value. The default value is 6 MB.
- MaxPayloadInMB must not be greater than 100 MB. If you specify the optional MaxConcurrentTransforms parameter, then the value of (MaxConcurrentTransforms * MaxPayloadInMB) must also not exceed 100 MB.
- For cases where the payload might be arbitrarily large and is transmitted using HTTP chunked encoding, set the MaxPayloadInMB value to 0. This feature works only in supported algorithms. Currently, SageMaker built-in algorithms do not support HTTP chunked encoding.
Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. If a single instance is not sufficient to meet your performance requirements, consider using multiple instances in parallel to distribute the workload. For key considerations when architecting batch transform jobs, refer to Batch Inference at Scale with Amazon SageMaker.
Continuously monitor the performance metrics of your SageMaker batch transform jobs using CloudWatch. Look for bottlenecks, such as high CPU or GPU utilization, memory usage, or network throughput, to determine if you need to adjust instance sizes or configurations.
SageMaker uses the Amazon S3 multipart upload API to upload results from a batch transform job to Amazon S3. If an error occurs, the uploaded results are removed from Amazon S3. In some cases, such as when a network outage occurs, an incomplete multipart upload might remain in Amazon S3. To avoid incurring storage charges, we recommend that you add the S3 bucket policy to the S3 bucket lifecycle rules. This policy deletes incomplete multipart uploads that might be stored in the S3 bucket. For more information, see Managing your storage lifecycle.

SageMaker asynchronous inference

Asynchronous inference is a great choice for cost-sensitive workloads with large payloads and burst traffic. Requests can take up to 1 hour to process and have payload sizes of up to 1 GB, so it’s more suitable for workloads that have relaxed latency requirements.

Invocation of asynchronous endpoints differs from real-time endpoints. Rather than passing a request payload synchronously with the request, you upload the payload to Amazon S3 and pass an S3 URI as a part of the request. Internally, SageMaker maintains a queue with these requests and processes them. During endpoint creation, you can optionally specify an Amazon Simple Notification Service (Amazon SNS) topic to receive success or error notifications. When you receive the notification that your inference request has been successfully processed, you can access the result in the output Amazon S3 location.

The cost for asynchronous inference is based on the per instance-hour consumed for each instance while the endpoint is running, cost of GB-month of provisioned storage, as well as GB data processed in and out of the endpoint instance, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can filter asynchronous inference costs by applying a filter on the usage type. The name of this usage type is structured as REGION-AsyncInf:instanceType (for example, USE1-AsyncInf:ml.c5.9xlarge). Note that GB volume and GB data processed usage types are the same as real-time endpoints, as mentioned earlier in this post.

As shown in the following screenshot, filtering by the usage type AsyncInf: in Cost Explorer displays a cost breakdown by asynchronous endpoint usage types.

To see the cost and usage breakdown by instance hours, you need to de-select all the REGION-Host:VolumeUsage.gp2 usage types before applying the usage type filter. You can also apply additional filters. Resource-level information such as endpoint ARN, endpoint instance types, hourly instance rate, and daily usage hours can be obtained from AWS CUR. The following is an example of an AWS CUR query to obtain asynchronous hosting resource usage for the last 3 months:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS endpoint_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%AsyncInf%'
      AND line_item_operation = 'RunInstance'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that endpoint sagemaker-abc-model-5 with ml.m5.xlarge instance is reporting 24 hours of runtime for multiple consecutive days. The instance rate is $0.23/hour and the daily cost for running for 24 hours is $5.52.

As mentioned earlier, AWS CUR results can help you identify patterns of endpoints running for consecutive days, as well as endpoints with the highest monthly cost. This can also help you decide whether the endpoints in non-production accounts can be deleted to save cost.

Optimize costs for asynchronous inference

Just like the real-time endpoints, the cost for asynchronous endpoints is based on the instance type usage. Therefore, it’s important to identify under-utilized instances and resize them based on the workload requirements. In order to monitor asynchronous endpoints, SageMaker makes several metrics such as ApproximateBacklogSize, HasBacklogWithoutCapacity, and more available in CloudWatch. These metrics can show requests in the queue for an instance and can be used for auto scaling an endpoint. SageMaker asynchronous inference also includes host-level metrics. For information on host-level metrics, see SageMaker Jobs and Endpoint Metrics. These metrics can show resource utilization that can help you right-size the instance.

SageMaker supports auto scaling for asynchronous endpoints. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down instances to zero by setting the minimum capacity to zero. For asynchronous endpoints, SageMaker strongly recommends that you create a policy configuration for target-tracking scaling for a deployed model (variant). You need to define the scaling policy that scaled on the ApproximateBacklogPerInstance custom metric and set the MinCapacity value to zero.

Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests. Requests that are received when there are zero instances are queued for processing after the endpoint scales up. Therefore, for use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive. Cold start time depends on the time required to launch a new endpoint from scratch. Also, if the model itself is big, then the time can be longer. If your job is expected to take longer than the 1-hour processing time, you may want to consider SageMaker batch transform.

Additionally, you may also consider your request’s queued time combined with the processing time to choose the instance type. For example, if your use case can tolerate hours of wait time, you can choose a smaller instance to save cost.

For additional guidance on instance right-sizing and auto scaling for SageMaker endpoints, refer to Ensure efficient compute resources on Amazon SageMaker.

Serverless inference

Serverless inference allows you to deploy ML models for inference without having to configure or manage the underlying infrastructure. Based on the volume of inference requests your model receives, SageMaker serverless inference automatically provisions, scales, and turns off compute capacity. As a result, you pay for only the compute time to run your inference code and the amount of data processed, not for idle time. For serverless endpoints, instance provisioning is not necessary. You need to provide the memory size and maximum concurrency. Because serverless endpoints provision compute resources on demand, your endpoint may experience a few extra seconds of latency (cold start) for the first invocation after an idle period. You pay for the compute capacity used to process inference requests, billed by the millisecond, GB-month of provisioned storage, and the amount of data processed. The compute charge depends on the memory configuration you choose.

In Cost Explorer, you can filter serverless endpoints costs by applying a filter on the usage type. The name of this usage type is structured as REGION-ServerlessInf:Mem-MemorySize (for example, USE2-ServerlessInf:Mem-4GB). Note that GB volume and GB data processed usage types are the same as real-time endpoints.

You can see the cost breakdown by applying additional filters such as account number, instance type, Region, and more. The following screenshot shows the cost breakdown by applying filters for the serverless inference usage type.

Optimize cost for serverless inference

When configuring your serverless endpoint, you can specify the memory size and maximum number of concurrent invocations. SageMaker serverless inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. With serverless inference, you only pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. The compute charge depends on the memory configuration you choose. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB. The pricing increases with the memory size increments, as explained in Amazon SageMaker Pricing, so it’s important to select the correct memory size. As a general rule, the memory size should be at least as large as your model size. However, it’s a good practice to refer to memory utilization when deciding the endpoint memory size, in addition to the model size itself.

General best practices for optimizing SageMaker inference costs

Optimizing hosting costs isn’t a one-time event. It’s a continuous process of monitoring deployed infrastructure, usage patterns, and performance, and also keeping a keen eye on new innovative solutions that AWS releases that could impact cost. Consider the following best practices:

Choose an appropriate instance type – SageMaker supports multiple instance types, each with varying combinations of CPU, GPU, memory, and storage capacities. Based on your model’s resource requirements, choose an instance type that provides the necessary resources without over-provisioning. For information about available SageMaker instance types, their specifications, and guidance on selecting the right instance, refer to Ensure efficient compute resources on Amazon SageMaker.
Test using local mode – In order to detect failures and debug faster, it’s recommended to test the code and container (in case of BYOC) in local mode before running the inference workload on the remote SageMaker instance. Local mode is a great way to test your scripts before running them in a SageMaker managed hosting environment.
Optimize models to be more performant – Unoptimized models can lead to longer runtimes and use more resources. You can choose to use more or bigger instances to improve performance; however, this leads to higher costs. By optimizing your models to be more performant, you may be able to lower costs by using fewer or smaller instances while keeping the same or better performance characteristics. You can use Amazon SageMaker Neo with SageMaker inference to automatically optimize models. For more details and samples, see Optimize model performance using Neo.
Use tags and cost management tools – To maintain visibility into your inference workloads, it’s recommended to use tags as well as AWS cost management tools such as AWS Budgets, the AWS Billing console, and the forecasting feature of Cost Explorer. You can also explore SageMaker Savings Plans as a flexible pricing model. For more information about these options, refer to Part 1 of this series.

Conclusion

In this post, we provided guidance on cost analysis and best practices when using SageMaker inference options. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility. Reach out to your AWS team for cost guidance on your SageMaker workloads.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and rock and roll climbing.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 4: Training jobs

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we’ve helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In this post, we focus on SageMaker training jobs.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

SageMaker training jobs

SageMaker training jobs are asynchronous batch processes with built-in features for ML model training and optimization.

With SageMaker training jobs, you can bring your own algorithm or choose from more than 25 built-in algorithms. SageMaker supports various data sources and access patterns, distributed training including heterogenous clusters, as well as experiment management features and automatic model tuning.

The cost of a training job is based on the resources you use (instances and storage) for the duration (in seconds) that those instances are running. This includes the time training takes place and, if you’re using the warm pool feature, the keep alive period you configure. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. You can filter training costs by applying a filter on the usage type. The names of these usage types are as follows:

REGION-Train:instanceType (for example, USE1-Train:ml.m5.large)
REGION-Train:VolumeUsage.gp2 (for example, USE1-Train:VolumeUsage.gp2)

To view a breakdown of your training costs in Cost Explorer, you can enter train: as a prefix for Usage type. If you filter only for hours used (see the following screenshot), Cost Explorer will generate two graphs: Cost and Usage. This view will help you prioritize your optimization opportunities and identify which instances are long-running and costly.

Before optimizing an existing training job, we recommend following the best practices covered in Optimizing costs for machine learning with Amazon SageMaker: test your code locally and use local mode for testing, use pre-trained models where possible, and consider managed spot training (which can optimize cost up to 90% over On-Demand instances).

When an On-Demand job is launched, it goes through five phases: Starting, Downloading, Training, Uploading, and Completed. You can see those phases and descriptions on the training job’s page on the SageMaker console.

From a pricing perspective, you are charged for Downloading, Training, and Uploading phases.

Reviewing these phases is a first step in diagnosing where to optimize your training costs. In this post, we discuss the Downloading and Training phases.

Downloading phase

In the preceding example, the Downloading phase took less than a minute. However, if data downloading is a big factor of your training cost, you should consider the data source you are using and access methods. SageMaker training jobs support three data sources natively: Amazon Elastic File System (Amazon EFS), Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre. For Amazon S3, SageMaker offers three managed ways that your algorithm can access the training: File mode (where data is downloaded to the instance block storage), Pipe mode (data is streamed to the instance, thereby eliminating the duration of the Downloading phase) and Fast File mode (combines the ease of use of the existing File mode with the performance of Pipe mode). For detailed guidance on choosing the right data source and access methods, refer to Choose the best data source for your Amazon SageMaker training job.

When using managed spot training, any repeated Downloading phases that occurred due to interruption are not charged (so you’re only charged for the duration of the data download one time).

It’s important to note that although SageMaker training jobs support the data sources we mentioned, they are not mandatory. In your training code, you can implement any method for downloading the training data from any source (provided that the training instance can access it). There are additional ways to speed up download time, such as using the Boto3 API with multiprocessing to download files concurrently, or using third-party libraries such as WebDataset or s5cmd for faster download from Amazon S3. For more information, refer to Parallelizing S3 Workloads with s5cmd.

Training phase

Optimizing the Training phase cost consists of optimizing two vectors: choosing the right infrastructure (instance family and size), and optimizing the training itself. We can roughly divide training instances into two categories: accelerated GPU-based, mostly for deep-learning models, and CPU-based for common ML frameworks. For guidance on selecting the right instance family for training, refer to Ensure efficient compute resources on Amazon SageMaker. If your training requires GPUs instances, we recommend referring to the video How to select Amazon EC2 GPU instances for deep learning.

As a general guidance, if your workload does require an NVIDIA GPU, we found customers gain significant cost savings with two Amazon Elastic Compute Cloud (Amazon EC2) instance types: ml.g4dn and ml.g5. The ml.g4dn is equipped with NVIDIA T4 and offers a particularly low cost per memory. The ml.g5 instance is equipped with NVIDIA A10g Tensor Core and has the lowest cost-per-CUDA flop (fp32).

AWS offers specific cost saving features for deep learning training:

Amazon SageMaker Training Compiler, which implements optimizations to reduce training time on GPU instances
Trn1 instances, powered by AWS Trainium accelerators, which are purpose built for high-performance training while offering up to 50% cost-to-train savings over comparable GPU-based instances

In order to right-size and optimize your instance, you should first look at the Amazon CloudWatch metrics the training jobs are generating. For more information, refer to SageMaker Jobs and Endpoint Metrics. You can further use CloudWatch custom algorithm metrics to monitor the training performance.

These metrics can indicate bottlenecks or over-provisioning of resources. For example, if you’re observing high CPU with low GPU utilizations, you can address the issue by using heterogeneous clusters. Another example can be seeing consistent low CPU utilization throughout the job duration—this can lead to reducing the size of the instance.

If you’re using distributed training, you should test different distribution methods (tower, Ring-AllReduce, mirrored, and so on) to validate maximum utilization and fine-tune your framework parameters accordingly (for an example, see Best practices for TensorFlow 1.x acceleration training on Amazon SageMaker). It’s important to highlight that you can use the SageMaker distribution API and libraries like SageMaker Distributed Data Parallel, SageMaker Model Parallel, and SageMaker Sharded Data Parallel, which are optimized for AWS infrastructure and help reduce training costs.

Note that distributed training doesn’t necessarily scale linearly and might introduce some overhead, which will affect the overall runtime.

For deep learning models, another optimization technique is using mixed precision. Mixed precision can speed up training, thereby reducing both training time and memory usage with minimal to no impact on model accuracy. For more information, see the Train with Data Parallel and Model Parallel section in Distributed Training in Amazon SageMaker.

Finally, optimizing framework-specific parameters can have a significant impact in optimizing the training process. SageMaker automatic model tuning finds hyperparameters that perform the best, as measured by an objective metric that you choose. Setting the training time as an objective metric and framework configuration as hyperparameters can help remove bottlenecks and reduce overall training time. For an example of optimizing the default TensorFlow settings and removing a CPU bottleneck, refer to Aerobotics improves training speed by 24 times per sample with Amazon SageMaker and TensorFlow.

Another opportunity for optimizing both download and processing time is to consider training on a subset of your data. If your data consists of multiple duplicate entries or features with low information gain, you might be able to train on a subset of data and reduce downloading and training time as well as use a smaller instance and Amazon Elastic Block Store (Amazon EBS) volume. For an example, refer to Use a data-centric approach to minimize the amount of data required to train Amazon SageMaker models. Also, Amazon SageMaker Data Wrangler can simplify the analysis and creation of training samples. For more information, refer to Create random and stratified samples of data with Amazon SageMaker Data Wrangler.

SageMaker Debugger

To ensure efficient training and resource utilization, SageMaker can profile your training job using Amazon SageMaker Debugger. Debugger offers built-in rules to alert on common issues that are affecting your training like CPU bottleneck, GPU memory increase, or I/O bottleneck, or you can create your own rules. You can access and analyze the generated report in Amazon SageMaker Studio. For more information, refer to Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments. The following screenshot shows the Debugger view in Studio.

You can drill down into the Python operators and functions (the Top operations on GPU section) that are run to perform the training job. The Debugger built-in rules for profiling watch framework operation-related issues, including excessive training initialization time due to data downloading before training starts and step duration outliers in training loops. You should note that although using the built-in rules are free, costs for custom rules apply based on the instance that you configure for the duration of the training job and storage that is attached to it.

Conclusion

In this post, we provided guidance on cost analysis and best practices when training ML models using SageMaker training jobs. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

Solution overview

Use Amazon Translate via the console

Use Amazon Translate with the AWS CLI

Use the Amazon Translate SDK (Python Boto3)

Conclusion

About the Authors

Solution overview

Prerequisites

Provision an ECS cluster of Trn1 instances

Prepare and push an ECR image with the Neuron SDK

Run the ML training job as an ECS task

Run the task on Amazon ECS

Clean up

Conclusion

About the Authors

Triton with PyTorch backend

Triton Inference on SageMaker

Solution overview

Prerequisites

Set up your environment

Install the dependencies and import the required library

Prepare the model artifacts

Deploy the model

Invoke the model and run predictions

Clean up

Best practices

Conclusion

About the Authors

Solution overview

Prerequisites

Set up the environment

Create the configuration file

Run a sample notebook

Override the default configuration file

Debug and retrieve defaults

Automate config file creation

Clean up

Conclusion

About the Authors

A treasure trove of data about the software engineering process

A multi-task model for software engineering

An ML peer programmer

Conclusion

Acknowledgements

Large language models

SageMaker foundation models

AI21 Jurassic-2 Jumbo Instruct

Solution overview

Access Jurassic-2 Jumbo Instruct through the SageMaker console

Evaluate the Jurassic-2 Jumbo Instruct model in the model playground

Deploy the Jurassic-2 Jumbo Instruct foundation model from a notebook

Run the quiz in the notebook

Clean up

Conclusion

About the Author

Distributed GPU training with multi-GPU instances

Benchmarks

Performance benchmarks on large datasets

Model quality benchmarks

Conclusion

About the Authors

SageMaker real-time inference

Optimize costs for real-time endpoints

SageMaker batch transform

Optimize costs for batch transform

SageMaker asynchronous inference

Optimize costs for asynchronous inference

Serverless inference

Optimize cost for serverless inference

General best practices for optimizing SageMaker inference costs

Conclusion

About the Authors

SageMaker training jobs

Downloading phase

Training phase

SageMaker Debugger

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities