Intelligent governance of document processing pipelines for regulated industries

Processing large volumes of documents such as PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare records like doctor-patient visit notes and medical bills, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.

Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.

This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.

A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.

For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.

Data lineage

Data lineage is the part of data governance that acts as a GPS for your data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it’s headed from this point on.

It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.

Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.

It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.

To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.

Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.

The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.

This design uses the following AWS services (you can also follow along in the GitHub repo):

  • Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text, and can do so in multiple languages.
  • Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
  • Amazon DynamoDB Streams – A change data capture (CDC) service. It captures an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
  • Amazon Elasticsearch Service (Amazon ES) – A fully managed service that makes it possible for you to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
  • AWS Lambda – A serverless compute service that runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon Simple Notification Service (Amazon SNS) – An AWS managed service for application-to-application communications, with a pub/sub model enabling high-throughput, low-latency message relaying.
  • Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service that stores your documents and allows for central management with fine-tuned access controls.
  • Amazon Textract – A fully managed ML service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond OCR to identify, understand, and extract data from forms and tables.

Architecture

The overall design is grouped into five segments:

  • Metadata services module
  • Ingestion module
  • OCR module
  • NLP module
  • Analytics module

All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.

The following diagram illustrates the physical design.

Metadata services

This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.

At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.

In addition, you can extend this design to include additional data governance components such as data quality.

The tables are defined as follows.

| Table Name | Purpose | DynamoDB Stream Enabled? | Data Governance Component | Sample Use |
| --- | --- | --- | --- | --- |
| Document Registry | Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table. | Yes | Catalog | Provides the ability to quickly look up and understand the document source and context metadata. |
| Document Ownership | Covers the responsibility model of the data governance in which each document acquired by the pipeline has a defined owner. | No | Ownership | Provides notification services and can be extended to manage data quality controls. |
| Document Lineage | Keeps track of all data movements. It provides detailed lineage info that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN of the AWS service that processed the document, and timestamp. | No | Lineage | A simple PartiQL query against this table based on the document ID provides a list of all steps the original document has taken. |
| Pipeline Operations | Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order. | Yes | Operation | An operational query on a document ID determines where in the pipeline the current document processing is. |

Query output from the Document Lineage table can include the following columns:

  • Document ID
  • Original document name
  • Timestamp
  • Source S3 bucket
  • Source file name
  • Destination S3 bucket
  • Destination file name
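To make the lineage lookup concrete, the following is a minimal sketch of such a PartiQL query issued with boto3. The table and attribute names (DocumentLineage, documentId, timestamp) are illustrative placeholders rather than the exact schema used in the GitHub repo.

import boto3

dynamodb = boto3.client("dynamodb")

def get_document_lineage(document_id: str):
    """Return every recorded movement for a document, oldest first."""
    # Table and attribute names are illustrative; substitute the names used in your pipeline.
    response = dynamodb.execute_statement(
        Statement='SELECT * FROM "DocumentLineage" WHERE documentId = ?',
        Parameters=[{"S": document_id}],
    )
    items = response["Items"]
    # Each item carries the source/destination S3 buckets and file names,
    # the ARN of the service that processed the document, and a timestamp.
    return sorted(items, key=lambda item: item.get("timestamp", {}).get("S", ""))

print(get_document_lineage("7c0e9f2a-example-id"))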

DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.

In addition, DynamoDB Streams provides developer teams with an efficient way of connecting your application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or alert when certain unexpected problems arise while processing some documents).

The Lambda functions provide microservice APIs that the document pipeline calls to self-register its movements and the actions undertaken by the pipeline code:

  • Document Arrival Register API – Registers the incoming document’s metadata and location within the Document Registry table
  • Document Lineage API – Registers the lineage information within the Document Lineage table
  • Pipeline Operations API – Provides up-to-date information on the state of the pipeline
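As an illustration of what one of these microservices might look like, the following is a minimal sketch of a Document Lineage API handler in Python. The DynamoDB table name and attribute names are assumptions for illustration, not the exact schema used in the GitHub repo.

import time
import boto3

# Illustrative sketch of a Document Lineage API handler; the table and
# attribute names are assumptions, not the exact schema used in the repo.
lineage_table = boto3.resource("dynamodb").Table("DocumentLineage")

def lambda_handler(event, context):
    record = {
        "documentId": event["documentId"],            # unique ID assigned at registration
        "timestamp": str(time.time()),                # when this movement happened
        "sourceBucket": event["sourceBucket"],
        "sourceFileName": event["sourceFileName"],
        "destinationBucket": event["destinationBucket"],
        "destinationFileName": event["destinationFileName"],
        "processorArn": event["processorArn"],        # ARN of the service/function that moved the data
    }
    lineage_table.put_item(Item=record)
    return {"statusCode": 200, "body": "lineage registered"}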

The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.

Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
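The following sketch shows how a pipeline step could publish an event to such a FIFO topic with boto3. The topic ARN is a placeholder, and using the document ID as the message group ID is one way to preserve per-document ordering.

import json
import uuid
import boto3

sns = boto3.client("sns")

# Hypothetical FIFO topic ARN; replace with the topic created by the metadata services stack.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:document-pipeline.fifo"

def publish_pipeline_event(document_id: str, event_type: str, payload: dict):
    """Publish a pipeline event; grouping by document ID keeps per-document ordering."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"documentId": document_id, "eventType": event_type, **payload}),
        MessageGroupId=document_id,               # FIFO ordering per document
        MessageDeduplicationId=str(uuid.uuid4()),  # or rely on content-based deduplication
    )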

The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.

The following diagram shows the architecture of our metadata services.

Ingestion module

The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).

The ingestion module follows a four-step process (as shown in the following diagram):

  1. A document is uploaded to the NLP/Raw S3 bucket.
  2. The Document Registrar Lambda function is invoked, which calls the metadata services API to register the document and receive a unique ID. This ID is added to the document as a tag, and the metadata is registered within the DynamoDB table Document Registry.
  3. After the document metadata is registered with Metadata Services, the DynamoDB Document Registration stream is invoked to start the Document Classification Lambda function. This function examines the metadata registered on the document and determines if the downstream OCR segment should be invoked on this document. The result of this examination is written back to the metadata services.
  4. The metadata registration of the previous step invokes the DynamoDB Pipeline Operations stream, which invokes the Document Extension Detector Lambda function. This function examines the incoming file formats and separates the image files from the PDF documents.

All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
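The following is a simplified sketch of what the Document Registrar function could look like. The table name, attribute names, and tagging scheme are illustrative assumptions, not the exact implementation in the GitHub repo.

import uuid
import boto3

s3 = boto3.client("s3")
registry_table = boto3.resource("dynamodb").Table("DocumentRegistry")  # illustrative name

def lambda_handler(event, context):
    # Triggered by an s3:ObjectCreated event on the NLP/Raw bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        document_id = str(uuid.uuid4())  # unique ID for the document's whole lifecycle

        # Register the document in the catalog.
        registry_table.put_item(Item={
            "documentId": document_id,
            "bucket": bucket,
            "key": key,
            "eventTime": record["eventTime"],
        })

        # Tag the object so every downstream step can trace it back to the registry entry.
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "documentId", "Value": document_id}]},
        )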

OCR module

This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.

The OCR module consists of the following process, as illustrated in the architecture diagram:

  1. Image files are uploaded to the NLP/image S3 bucket and the Sync Processor Lambda function is invoked. The function synchronously points Amazon Textract to the S3 location of the image file, and waits for a response.
  2. Amazon Textract transforms the format to text and deposits the text output in the NLP/Textract S3 bucket. This step concludes OCR processing of the image file types.
  3. PDF files are placed within the NLP/PDF S3 bucket. This bucket invokes the Async Processor Lambda function. This function submits the document to Amazon Textract for asynchronous processing and registers its state with the metadata services.
  4. When the Amazon Textract document analysis is complete, an SNS message is sent to a specified SNS topic, notifying downstream consumers of the job completion. In this implementation, an SQS queue captures that message.
  5. The SQS queue message is the event that triggers the Result Processor Lambda function.
  6. The function extracts the results of document analysis from Amazon Textract and formats it according to the type of text it analyzed (forms, tables, and raw text).
  7. The results are pushed to the NLP/Textract S3 bucket, page by page for every type of text, and as a complete JSON response.

All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
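The following sketch illustrates the two Amazon Textract invocation patterns used by this module: synchronous for images and asynchronous for PDFs. The bucket names, SNS topic, and IAM role ARNs are placeholders.

import boto3

textract = boto3.client("textract")

# Hypothetical ARNs; use the SNS topic and IAM role provisioned for the pipeline.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:textract-job-complete"
ROLE_ARN = "arn:aws:iam::123456789012:role/TextractPublishRole"

def process_image_sync(bucket: str, key: str) -> str:
    """Synchronous path for image files: returns the extracted raw text."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)

def process_pdf_async(bucket: str, key: str) -> str:
    """Asynchronous path for PDFs: Textract notifies the SNS topic when the job completes."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": TOPIC_ARN, "RoleArn": ROLE_ARN},
    )
    return response["JobId"]  # the Result Processor function later calls get_document_text_detection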

NLP module

This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).

After key phrases are extracted, indexing them in an analytical tool lets you find the relevant documents quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing the results of this module.

We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.

Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):

  1. The incoming text file uploaded to the NLP/Textract S3 bucket invokes the Sync Comprehend Processor Lambda function.
  2. The function feeds the incoming file to Amazon Comprehend for processing.
  3. The results from Amazon Comprehend, in JSON format, are deposited in the NLP/JSON S3 bucket.
  4. The results from Amazon Comprehend are sent to Amazon ES, the service we incorporate as our document search engine.

All steps are registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
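As a minimal illustration of the synchronous Amazon Comprehend calls, the following sketch detects key phrases and entities for a page of extracted text and shapes a document for indexing. The field names are assumptions, and per-request size limits apply, so long pages may need to be split.

import boto3

comprehend = boto3.client("comprehend")

def extract_key_phrases_and_entities(text: str, document_id: str) -> dict:
    """Detect key phrases and entities for one page of OCR text."""
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")["KeyPhrases"]
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]

    # Shape of the document sent to the search index; field names are illustrative.
    return {
        "documentId": document_id,
        "keyPhrases": [p["Text"] for p in key_phrases],
        "entities": [{"text": e["Text"], "type": e["Type"]} for e in entities],
    }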

Analytics module

This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:

  1. The output from Amazon Comprehend, in JSON format, is fed to Amazon Neptune. Neptune allows end users to discover relationships across documents. This is an example of a downstream analytics application that is not implemented in this post.
  2. The end users have access to the original document in four formats (CSV, JSON, original, text), and can search key phrases using Amazon ES. They can identify relationships using Neptune. A JSON version of the document is available in the NLP/JSON S3 bucket. The original document is available in the NLP/Raw S3 bucket.
  3. Full lineage can be obtained from the Document Lineage table in DynamoDB.

The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database Service (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
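For example, once key phrases are indexed, a search could look like the following sketch. The domain endpoint, index name, and field names are placeholders, and request signing or authentication (which an Amazon ES domain typically requires) is omitted for brevity.

import requests

# Hypothetical domain endpoint and index; substitute your Amazon ES domain and index name.
ES_ENDPOINT = "https://search-nlp-demo.us-east-1.es.amazonaws.com"

def search_documents(phrase: str):
    """Find documents whose indexed key phrases match a search term."""
    query = {
        "query": {"match": {"keyPhrases": phrase}},
        "_source": ["documentId", "keyPhrases"],
    }
    response = requests.post(f"{ES_ENDPOINT}/documents/_search", json=query, timeout=10)
    return [hit["_source"] for hit in response.json()["hits"]["hits"]]

print(search_documents("charitable giving"))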

Conclusion

In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository to store documents one time but process them multiple times. A data governance framework as illustrated in this design provides you with the necessary guardrails to ensure documents are governed in a standard fashion across the organization, while giving lines of business the autonomy to choose their own NLP and OCR models and tooling.

The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.


About the Authors

  David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.

 

 

Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.

 

Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.

Read More

Announcing the AWS DeepComposer Chartbusters challenges 2021 season launch

We’re back with two new challenges for the AWS DeepComposer Chartbusters 2021 season! Chartbusters is a global challenge in which developers use AWS DeepComposer to create original compositions and compete in monthly challenges to showcase their machine learning (ML) and generative artificial intelligence (AI) skills. Regardless of your background in music or ML, one of the two new challenges will be right for you.

You can choose between two different challenges this season. In the basic challenge, Melody-Go-Round, you can use any of the generative AI models available in the AWS DeepComposer Music studio to create new compositions. In the advanced challenge, Melody Harvest, you train a custom generative AI model with your own dataset using Amazon SageMaker. In this challenge, you can dive deeper into the mechanics of data preparation, model training, and evaluation to teach a model to play your favorite style of music.

The 2021 season runs through October 31, 2021. Winners of each challenge are selected on the last day of each month, and we’ll feature the winners in an AWS Machine Learning Blog post. Monthly winners of the Melody Harvest challenge will also win a ticket to AWS re:Invent 2021. To participate, go to the AWS DeepComposer console and choose the Chartbusters challenge that’s right for you in the navigation pane.

Compete in the Melody-Go-Round challenge

You can compete in the AWS DeepComposer Chartbusters Melody-Go-Round challenge in just a few simple steps:

  1. In the AWS DeepComposer Music studio, record a track, import a track, or pick any of the available input tracks.

  2. Get creative and explore different combinations of available models. You can also explore advanced parameters under each model.

  3. Use the Edit melody feature to add or remove notes, or change the note duration and pitch. When finished, choose Apply changes. You can iterate by adjusting the advanced parameters and choosing Enhance again. Repeat these steps until you’re satisfied with the generated music.

You can also download the melody and import it into a digital audio workstation like GarageBand and further indulge your creativity.

  4. When your melody is complete, go to the submission form and choose an existing composition or import a post-processed audio track. Choose Melody-Go-Round for the competition type, register or sign in to SoundCloud, and choose Submit.

For more information on judging criteria, visit the AWS DeepComposer Melody-Go-Round page.

Compete in the Melody Harvest challenge

  1. Explore our GitHub pages for Generative Adversarial Networks (GANs), Autoregressive Convolutional Neural Networks (AR-CNNs), and Transformers. Then train your own model and start composing your music.
  2. You can upload the generated MIDI file to a digital audio workstation like GarageBand and further improve it.
  3. When your melody is complete, go to the submission form, choose Melody Harvest for the competition type, import a postprocessed audio track, and add the link to your GitHub repository. Make sure your GitHub repository has your notebook and your model’s checkpoint files.

For more information on datasets and judging criteria, visit the AWS DeepComposer Melody Harvest page.

Conclusion

Congratulations! You have successfully submitted your composition to the AWS DeepComposer Chartbusters challenge. Now you can invite your friends and family to listen to your creation on SoundCloud, vote for their favorite, and join the fun by participating in the competition.

Although you don’t need a physical keyboard to compete, we’re offering the AWS DeepComposer keyboard at a special price of $69.00 (30% off) for a limited time on Amazon.com to improve your music generation experience. The pricing includes the keyboard and 3 months of the AWS DeepComposer free trial. To learn more about the different generative AI techniques supported by AWS DeepComposer, check out the learning capsules available on the AWS DeepComposer console.


About the Authors

Maryam Rezapoor is a Senior Product Manager with AWS AI Devices team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

 

 

 Chris Whittam is a Senior Product Manager on the AWS AI Devices team helping developers get hands on (literally) with machine learning.

Read More

AWS DeepRacer device software now open source

AWS DeepRacer is the fastest way to get started with machine learning (ML). You can train reinforcement learning (RL) models by using a 1/18th scale autonomous vehicle in a cloud-based virtual simulator and compete for prizes and glory in the global AWS DeepRacer League. Today, we’re expanding AWS DeepRacer’s ability to provide fun, hands-on learning by open-sourcing the AWS DeepRacer device software.

Why open source

The AWS DeepRacer virtual and in-person leagues have been a hit, but now developers want to go beyond league racing with their car. Because the AWS DeepRacer is an Ubuntu-based computer on wheels powered by the Robot Operating System (ROS), we are able to open source the code, making it straightforward for a developer with basic Linux coding skills to prototype new and interesting uses for their car. Now that the AWS DeepRacer device software is openly available, anyone with the car and an idea can make new uses for their device a reality.

We’ve compiled six sample projects from the AWS DeepRacer team and members of the global AWS DeepRacer community to help you get started exploring the possibilities that open source provides. As developers share new projects using #deepracerproject, we will highlight our favorites on the AWS DeepRacer robotics projects page. Whether you’re mounting a Nerf cannon on the car with the DeepBlaster project, creating visualizations of your home or office with the Mapping project, or coming up with new ways of racing your friends and colleagues with the DeepDriver project, you can do all that and more with the open source code and sample projects. Documentation is available in GitHub and open for collaboration with thousands of community members in the AWS DeepRacer Slack channel. The only limit to what you can do with AWS DeepRacer is your imagination (and, well, the laws of physics).

Let the experiments begin

With the open-sourcing of the AWS DeepRacer device code, you can quickly and easily change the default behavior of your currently track-obsessed race car. Want to block other cars from overtaking it by deploying countermeasures? Want to deploy your own custom algorithm to make the car go faster from point A to B? You just need to dream it and code it. We can’t wait to see the ideas that you come up with, from new racing formats to new uses for AWS DeepRacer.

Starting today, you can choose from six projects (Follow the Leader, Mapping, and Off Road created by AWS, and RoboCat, DeepBlaster, and DeepDriver created by the open source community) or create your own. You can get started with the Follow the Leader sample project, which trains the car to detect and follow an object. It’s the quickest project to build and run, and in the next section we’ll demonstrate how easy it is to modify the default behavior of your AWS DeepRacer car. To complete this setup, upgrade to the latest software version and access the car via SSH.

Download the Follow the Leader project

Connect to the car using SSH, switch to the root user, and create a working directory. Then clone the Follow the Leader GitHub repository:

sudo su
mkdir -p ~/deepracer_ws
cd ~/deepracer_ws
git clone https://github.com/aws-deepracer/aws-deepracer-follow-the-leader-sample-project.git

The process to fully clone the project repository to your car can take a few minutes (depending on the speed of your internet connection). The Follow the Leader project contains several installation scripts that help shortcut the process to get you up and running faster. You can also complete the next few steps manually if you’re more comfortable with running shell-based commands or want to learn more about the process using the links to the relevant documentation for each stage.

Download and convert the object detection model

First, we need to download and convert the object detection model. To do this, we run the script that came in the Follow the Leader repository:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash install_object_detection_model.sh

The installer script downloads and optimizes the model before copying the optimized artifacts to the model location. This process takes approximately 3–4 minutes to complete.

You can complete this stage manually using the detailed instructions to download and convert the object detection model.

Initialize rosdep if it hasn’t been initialized previously

rosdep helps install the dependency packages. Initialize rosdep if it hasn’t been done on the device before:

sudo rosdep init
sudo rosdep update

Build the Follow the Leader packages

Next, we fetch the package dependencies needed for the project and build them:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash build_and_install_ftl_application.sh

When successful, you should see a screen similar to the following:

The script downloads and installs the required package dependencies and builds the packages. This process can take approximately 8–10 minutes to complete.

You can also complete this stage manually by following steps 1–10 in “Download and Building” in the Follow the Leader README.md. The install script performs the same steps (it just saves you some typing).

Launch the Follow the Leader application

Now we run the Follow the Leader application:

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash run_ftl_application.sh

Enable Follow the Leader mode

Finally, we need to open another SSH session to the car to enable Follow the Leader mode using the command line interface (CLI):

sudo su
cd ~/deepracer_ws/aws-deepracer-follow-the-leader-sample-project/installers
/usr/bin/bash enable_ftl_mode.sh

Now you, or a willing volunteer (or object), can move around and watch the car begin to follow! How cool is that?

Share your results

Congratulations! You completed your first sample project. Share your experience with friends and family on social media with the tag #deepracerproject so we can see what you’re up to. As the community invents new projects for AWS DeepRacer, we’ll be adding them to the AWS DeepRacer GitHub Organization as well as featuring them in future blog posts so that everyone can get inspired. Purchase an AWS DeepRacer car to start experimenting with your first AWS DeepRacer robotics project today! We’re offering a 25% discount on the AWS DeepRacer ($100 off) and AWS DeepRacer Evo ($150 off) until May 27, 2021.


About the Author

David Smith is a Sr. Solutions Architect for AWS DeepRacer. He is passionate about AWS DeepRacer, technology as an enabler and learning. Outside of work he’s into Formula 1, flying (and crashing) drones, 3d printing, running (Parkrun), tinkering with code and spending time with the family.

Read More

Monitor and Manage Anomaly Detection Models on a fleet of Wind Turbines with Amazon SageMaker Edge Manager

In industrial IoT, running machine learning (ML) models on edge devices is necessary for many use cases, such as predictive maintenance, quality improvement, real-time monitoring, process optimization, and security. The energy industry, for instance, invests heavily in ML to automate power delivery, monitor consumption, optimize efficiency, and extend the lifetime of their equipment.

Wind energy is one of the most popular renewable energy sources. According to the Global Wind Energy Council, 22,893 wind turbines were installed globally in 2019, produced by 33 suppliers and accounting for over 63 GW of wind power capacity. With such scale, energy companies need an efficient platform to manage and maintain their wind turbine fleets, and the ML models running on the devices. A commercial wind turbine costs around $3–4 million. If a turbine is out of service, it costs $800–1,600 per day and results in a total loss of 7.5 megawatts, which is enough energy to power approximately 2,500 homes.

A wind turbine is a complex piece of engineering and consists of many sensors that can be used by a monitoring mechanism to capture data such as vibration, temperature, wind speed, and air humidity. You could train an ML model with this data, deploy it to an edge device connected to the turbine’s sensors, and predict anomalies in real time at the edge. It would reduce the operational cost of your fleet of turbines. But imagine the effort to maintain this solution on a fleet of thousands or millions of devices. How do you operate, secure, deploy, run, and monitor ML models on a fleet of devices at the edge?

Amazon SageMaker Edge Manager can help you to answer this question. The service allows you to optimize, secure, monitor, and maintain ML models on fleets of smart cameras, robots, personal computers, industrial equipment, mobile devices, and more. With Edge Manager, you can manage the lifecycle of each ML model on each device in your device fleets for up to thousands or millions of devices. The service provides a software agent that runs on edge devices and a management interface on the AWS Management Console.

In this post, we show how to use Edge Manager to create a robust end-to-end solution that manages the lifecycle of ML models deployed to a wind turbine fleet. But instead of using real wind turbines, you learn how to build your own fleet of mini 3D printed wind turbines. This is a DIY open-source, open-hardware project created to demonstrate how to build an ML-at-the-edge solution with Amazon SageMaker. You can use it as a platform to learn, experiment, and get inspired.

The next sections cover the following topics:

  • The specifications of the wind turbine farm
  • How to configure each Jetson Nano
  • How to build an anomaly detection model using SageMaker
  • How to run your own mini wind turbine farm

The wind turbine farm

The wind turbine farm created for this project has five mini 3D printed wind turbines connected to five distinct Jetson Nanos via USB. The Jetson Nanos are connected to the internet through Ethernet cables plugged to a cable modem. A fan, positioned in front of the farm, produces the wind to simulate an outdoor condition. The following image shows how the wind farm is organized.

The mini wind turbine

The mini wind turbine of this project is a mechanical device integrated with a microcontroller (Arduino) and some sensors. It was modeled using FreeCAD, an open-source tool for designing industrial parts. These parts were then 3D printed using PETG (plastic filament type) and assembled with the electronics components. Its base is static, which means that the turbine doesn’t align with the wind direction by itself. This restriction was important to simplify the project.

Each turbine has one voltage generator (small motor) and seven different sensors:

  • Vibration (MPU6050: 6 axis accelerometer/gyroscope)
  • Infrared rotation encoder (rotations per second)
  • Gearbox temperature (MPU6050)
  • Ambient temperature (BME680)
  • Atmospheric pressure (BME680)
  • Air humidity (BME680)
  • Air quality (BME680)

An Arduino Mini Pro is responsible for interfacing with these sensors and collecting data from them. This data is streamed through the serial pins (TX, RX). An FTDI device that converts this serial signal to USB is the bridge between the Arduino and the Jetson Nano. A Python application that runs on the Jetson Nano receives the raw data from the sensors through this bridge.

A micro servo was modified and transformed into a voltage generator. Its internal gearbox increases the generator (motor) speed by five times to produce a (low) voltage between 0–3.3V. This generator is also connected to the Arduino through an analog input pin. This information is also sent with the sensor’s readings.

The frequency at which the data is collected depends on the sensor. All the signals from the BME680 are collected every 150 milliseconds, the rotation encoder every 1 second, and the voltage generator and the vibration sensor every 50 milliseconds.

If you want to know more about these technical details and learn how to build your own mini wind turbine, see the GitHub repository.

The edge device

Each Jetson Nano has a built-in GPU with 128-core NVIDIA Maxwell™ and a Quad-core ARM® A57 CPU running at 1.43 GHz. This hardware is enough to run a Python application that collects and formats the data from the sensors of the turbine and then calls the Edge Manager agent API to get the predictions. This application compares the prediction with a threshold to check for anomalies in the data. The model is invoked in real time.

When SageMaker Neo compiles the ML model for Jetson Nano, a runtime (DLR) optimized for this target device is included in the deployment package. This runtime detects automatically that it’s running on a Jetson Nano and loads the model directly into the device’s GPU for maximum performance.
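For reference, loading and invoking a Neo-compiled model with the DLR runtime looks roughly like the following sketch. The model path, input name, and tensor shape are illustrative (the shape matches the 6x10x10 input described later in this post).

import numpy as np
from dlr import DLRModel  # runtime packaged alongside the Neo-compiled model

# Illustrative path and device; on a Jetson Nano the model can be loaded onto the GPU.
model = DLRModel("/home/user/agent/models/windturbine", dev_type="gpu", dev_id=0)

# Dummy input with the anomaly detection model's expected shape (batch, features, steps, steps).
input_tensor = np.random.rand(1, 6, 10, 10).astype(np.float32)
prediction = model.run({"input": input_tensor})[0]  # input name is an assumption
print(prediction.shape)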

The Edge Manager agent is also distributed as a Linux (arm64) application that can be run as a background process (daemon) on your Jetson Nano. It uses the runtime SageMaker Neo includes in the compilation package to interface with the optimized model and expose it as a well-defined API. This API is integrated with the local application through a low latency protocol (grpc + unix socket).

The cloud services

Now that you know some details about the physical hardware used to develop the wind turbine farm, it’s time to see which AWS services support the solution on the cloud side. A minimal, standalone setup to get a model deployed and running on the Edge Manager agent requires only SageMaker and nothing more. However, other services were used in this project to provide two important features: a mechanism for over-the-air (OTA) deployment and a dashboard for monitoring the anomalies in near-real time.

In summary, the components required for this project are:

  • A device fleet (Edge Manager), which organizes and controls one or more registered devices through the agent (running on each device)
  • One IoT thing per device and IoT thing group, which is used by the OTA mechanism to communicate with the devices via MQTT
  • AWS IoT rules, and an AWS Lambda function to get and filter application logs and ingest them into Amazon Elasticsearch Service (Amazon ES)
  • A Lambda function to parse the model metrics captured by the agent and ingest them into Amazon ES
  • An Elasticsearch server with Kibana, which has dashboards for monitoring the anomalies (optional)
  • SageMaker to build, compile, and package the ML model

The following diagram illustrates this architecture.

Putting everything together

Now that we have all the components of our wind turbine farm, it’s time to understand the steps we need to take to integrate all these moving parts, deploy a model to our edge devices, and keep an application running and predicting anomalies in real time.

The following diagram shows all the steps involved in the process.

The solution consists of the following steps:

  1. The data scientist explores the dataset and designs an anomaly detection model (autoencoder) with PyTorch, using SageMaker Studio.
  2. The model is trained with a SageMaker training job.
  3. With Neo, the model is optimized (compiled) to Jetson Nano.
  4. Edge Manager creates a deployment package with the compiled model.
  5. The data scientist creates an IoT job that sends a notification of the new model available to the edge devices.
  6. The application running on Jetson Nano performs the following:
    1. Receives this notification and downloads the model package from the Amazon Simple Storage Service (Amazon S3) bucket.
    2. Unpacks the model and loads it using the Edge Manager agent API (LoadModel).
    3. Reads the sensors from the wind turbine, prepares the data, invokes the ML model, and captures some model metrics using the Edge Manager agent API.
    4. Compares the prediction with a baseline to detect potential anomalies.
    5. Sends the raw sensor data to an AWS IoT topic.
  7. Through a rule, AWS IoT reads the app logs topic and exports the data to Amazon ES.
  8. A Lambda function captures the model metrics (mean absolute error) exported by the agent and ingests the data into Amazon ES.
  9. The operator uses a Kibana dashboard to check for any anomalies.

Configure your edge device

The Edge Manager agent uses certificates provided by AWS IoT Core to authenticate and call other AWS services. Therefore, you need to create an IoT thing first and then an edge device fleet. But first, you need to prepare some basic resources to support your solution.

Create prerequisite resources

Before getting started, you need to configure the AWS Command Line Interface (AWS CLI) on your workstation first (if necessary) and then create the following resources:

  • An S3 bucket to store the captured data
  • An AWS Identity and Access Management (IAM) role for your devices
  • An IoT thing to map to your Edge Manager device
  • An IoT policy to control the permissions of the temporary credentials of the edge device
  1. Create a new bucket for the solution.

Each time you call CaptureData in the agent API, it uploads the tensors (input and predictions) into this bucket.

Next, you create your IAM role.

  1. On the IAM console, create a role named WindTurbineFarm so the devices can access resources in your account.
  2. Add permissions to this role to upload files to the S3 bucket you created.
  3. Add the following trusted entities to the role:
    1. credentials.iot.amazonaws.com
    2. iot.amazonaws.com
    3. sagemaker.amazonaws.com

Use the following code (provide the name for the S3 bucket, your AWS account, and Region):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<<S3_BUCKET_NAME>>",
                "arn:aws:s3:::<<S3_BUCKET_NAME>>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:CreateRoleAlias",
                "iot:DescribeRoleAlias",
                "iot:UpdateRoleAlias",
                "iot:TagResource",
                "iot:ListTagsForResource"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*SageMaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*Sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/*sagemaker*",
                "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:GetDeviceRegistration",
                "sagemaker:SendHeartbeat",
		  "iot:DescribeEndpoint",
		  "s3:ListAllMyBuckets”
            ],
            "Resource": "*",
            "Effect": "Allow"
        }, 
        {
            "Action": [
                "sagemaker:DescribeDevice"
            ],
            "Resource": [
                "arn:aws:sagemaker:<<REGION>>:<<AWS_ACCOUNT_ID>>:device-fleet/windturbinefarm*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iot:Publish",
                "iot:Receive"
            ],
            "Resource": [
                "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*"
            ],
            "Effect": "Allow"
        }
    ]
}

You’re now ready to create your IoT thing, which you later map to your Edge Manager device.

  1. On the AWS IoT Core console, under Manage, choose Things.
  2. Choose Create.
  3. Name your device (for this post, edge-device-0).
  4. Create a new group or choose an existing group (for this post, WindTurbineFarm).
  5. Create a certificate.
  6. Download the certificates, including the root CA.
  7. Activate the certificate.

You now create your policy, which controls the permissions of the temporary credentials of the edge device.

  1. On the AWS IoT Core console, under Secure, choose Policies.
  2. Choose Create.
  3. Name the policy (for this post, WindTurbine).
  4. Choose Advanced Mode.
  5. Enter the following policy, providing your AWS account and Region:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iot:Connect"
      ],
      "Resource": "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:client/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Publish",
        "iot:Receive"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Subscribe"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/$aws/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topic/$aws/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:UpdateThingShadow"
      ],
      "Resource": [
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:topicfilter/wind-turbine/*",
        "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:thing/edge-device-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iot:AssumeRoleWithCertificate",
      "Resource": "arn:aws:iot:<<REGION>>:<<AWS_ACCOUNT_ID>>:rolealias/SageMakerEdge-WindTurbineFarm"
    }
  ]
}
  6. Choose Create.

Lastly, you attach the policy to the certificate.

  1. On the AWS IoT Core console, under Secure, choose Certificates.
  2. Select the certificate you created.
  3. On the Actions menu, choose Attach policy.
  4. Select the policy WindTurbine.
  5. Choose Attach.

Now your IoT thing is ready to be linked to an edge device. Repeat these steps (except for creating the policy) for each additional device in your device fleet. For a production environment with hundreds or thousands of devices, you would apply a different approach, using automated scripts and parameter files to provision all the IoT things.

Create the edge fleet

To create your edge fleet, complete the following steps:

  1. On the SageMaker console, under Edge Inference, choose Edge device fleets.
  2. Choose Create device fleet.

  1. Enter a name for the device fleet (for this post, WindTurbineFarm).
  2. Enter the ARN of the IAM role you used in the previous steps (arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/WindTurbineFarm).
  3. Enter the output S3 bucket URI (s3://<<NAME_OF_YOUR_BUCKET>>/wind_turbine_data/).
  4. Choose Submit.

Now you need to add a new device to the fleet.

  1. On the SageMaker console, under Edge Inference, choose Edge devices.
  2. Choose Register devices.

  1. For Device Properties, enter the name of the device fleet you created (WindTurbineFarm).
  2. Choose Next.
  3. For Device name, enter any unique name for your device (for this post, edge-device-wind-turbine-00000000000).
  4. For IoT name, enter the name of the thing you created earlier (edge-device-0).
  5. Choose Submit.

Repeat the registering process for all your other devices. Now you can SSH to your Jetson Nano and complete the configuration of your device.

Prepare the edge device

Before you start configuring your Jetson Nano, you need to install JetPack 4.4.1 on your Nano. This is the version you use to build, run, and test this demo.

The model preparation process for your target device is very sensitive to the versions of the libraries installed on your device. For instance, because the target device is a Jetson Nano, Neo optimizes the model and runtime for a given version of TensorRT and CUDA. The runtime (libdlr.so) is physically linked to the versions you specify in the compilation job. This means that if you compile your model using Neo for JetPack 4.4.1, it doesn’t work with JetPack 3.x, and vice versa.

  1. With JetPack 4.4.1 running on your Jetson Nano, you can start configuring your device with the following commands:
echo "export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647" >> ~/.bashrc
echo "export SM_EDGE_AGENT_HOME=/home/${USER}/agent" >> ~/.bashrc

# Also export the variables for the current session
export TVM_TENSORRT_MAX_WORKSPACE_SIZE=2147483647
export SM_EDGE_AGENT_HOME=/home/${USER}/agent


sudo apt install -y protobuf-compiler python3-serial 
sudo apt install -y python3-pip python3-joblib python3-boto3 libssl-dev
sudo apt install -y curl
sudo pip3 install grpcio-tools grpcio PyWavelets paho-mqtt
  1. Download the Linux ARMv8 version of the Edge Manager agent.
  2. Copy the package to your Jetson Nano (scp). Create a folder for the agent and unpack the package in your home directory:
mkdir -p ~/agent/certificates/iot
mkdir -p ~/agent/certificates/root
tar -xzvf <<agent_package>>.tgz -C ~/agent
  1. Copy the AWS IoT Core certificates you provisioned for your thing in the previous section to the directory ~/agent/certificates/iot in your Jetson Nano.

You should see the following files in this directory:

  • AmazonRootCA1.pem – CA root
  • <<CERT_PREFIX>>-public.pem.key – Public key
  • <<CERT_PREFIX>>-private.pem.key – Private key
  • <<CERT_PREFIX>>-certificate.pem.crt – Certificate
  1. Get the root certificate used to sign the deployment package created by Edge Manager. The agent uses this to validate the model.
aws s3 cp s3://sagemaker-edge-release-store-us-west-2-linux-armv8/Certificates/<<AWS_REGION>>/<<AWS_REGION>>.pem .
  1. Copy this certificate to the directory ~/agent/certificates/root in your Jetson Nano.

Next, you create the Edge Manager agent configuration file.

  1. Open an empty file named ~/agent/sagemaker_edge_config.json and enter the following code:
{
    "sagemaker_edge_core_device_uuid": "<<SAGEMAKER_EDGE_DEVICE_NAME>>",
    "sagemaker_edge_core_device_fleet_name": "WindTurbineFarm",
    "sagemaker_edge_core_capture_data_buffer_size": 30,
    "sagemaker_edge_core_capture_data_batch_size": 10,
    "sagemaker_edge_core_capture_data_push_period_seconds": 4,
    "sagemaker_edge_core_folder_prefix": "wind_turbine_data",
    "sagemaker_edge_core_region": "<<AWS_REGION>>",
    "sagemaker_edge_core_root_certs_path": "/home/<<LINUX_USER>>/agent/certificates/root",
    "sagemaker_edge_provider_aws_ca_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/AmazonRootCA1.pem",
    "sagemaker_edge_provider_aws_cert_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-certificate.pem.crt",
    "sagemaker_edge_provider_aws_cert_pk_file": "/home/<<LINUX_USER>>/agent/certificates/iot/<<CERT_PREFIX>>-private.pem.key",
    "sagemaker_edge_provider_aws_iot_cred_endpoint": "https://<<CREDENTIALS_ENDPOINT_HOST>>/role-aliases/SageMakerEdge-WindTurbineFarm/credentials",
    "sagemaker_edge_provider_provider": "Aws",
    "sagemaker_edge_provider_s3_bucket_name": "<<S3_BUCKET>>",
    "sagemaker_edge_core_capture_data_destination": "Cloud"
}

Provide the information for the following resources:

  • SAGEMAKER_EDGE_DEVICE_NAME – The unique name of your device you defined previously.
  • AWS_REGION – The Region where you created your edge device.
  • LINUX_USER – The Linux user name you’re using in Jetson Nano.
  • CERT_PREFIX – The prefix of the certificate files you created when you provisioned your IoT thing in the previous section.
  • CREDENTIALS_ENDPOINT_HOST – Your endpoint host. You can get this endpoint through the AWS Command Line Interface (AWS CLI). (Install the AWS CLI if you don’t have it already). Use credentials of the same account and the same Region you used in the previous sections (this isn’t the IoT thing shadow URL). Then run the following command to retrieve the endpoint host:
aws iot describe-endpoint --endpoint-type iot:CredentialProvider
  • S3_BUCKET – The name of the S3 bucket you used to configure your edge device fleet in the previous section.
  1. Save the file with all these modifications.

Now you’re ready to run the Edge Manager agent in your Jetson Nano.

  1. To test the agent, run the following commands:
cd ~/agent
rm -f /tmp/edge_agent
./bin/sagemaker_edge_agent_binary -c sagemaker_edge_config.json -a /tmp/edge_agent &

The following screenshot shows your output.

The agent is now running. After a few minutes, you can see the heartbeat of the device, reported on the console. To see it on the SageMaker console, under Edge Inference, choose Edge Devices and choose your device.

Configure the application

Now it’s time to set up the application that runs on the edge device. This application is responsible for the following:

  • Get the temporary credentials using the certificate
  • Listen to the OTA update topics to see whether a new model package is ready to deploy
  • Deploy the available model package to the edge device
  • Load the model to the agent if necessary
  • Perform an infinite loop:
    • Read the sensor data
    • Format the input data
    • Invoke the ML model and capture some metrics of the prediction
    • Compare the prediction’s MAE (mean absolute error) to the baseline
    • Publish raw data to an IoT topic (MQTT)

To install the application, first get the custom AWS IoT endpoint. On the AWS IoT Core console, choose Settings. Copy the endpoint and use it in the following code:

cd ~/
git clone https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo wind_turbine
cd wind_turbine/04_EdgeApplication
## Replace the IoT endpoint placeholder in the application code with the AWS IoT endpoint host you just copied, and save the file
chmod +x run.py
./run.py &

The application outputs something like the following screenshot.

Optional: run this application with the parameter --test-mode if you just want to run a test with no wind turbine connected to the edge device.

If everything went fine, the application keeps waiting for a new model. It’s time to train a new model and deploy it to the Jetson Nano.

Train and deploy the ML model

This post demonstrates how to detect anomalies in the components of a wind turbine. There are many ways of doing this with the data collected by its sensors. To keep this example as simple as possible, you prepare a model that analyzes vibration, wind speed, rotation (per second), and the produced voltage to determine whether an anomaly exists or not. For that purpose, we train an autoencoder using PyTorch on SageMaker and prepare it for deployment on your Jetson Nano.

This model architecture has two advantages: it’s unsupervised, so we don’t need to label our data, and you can collect data from wind turbines that are working perfectly. Therefore, your model is trained to detect what you consider normal behavior of your wind turbines. When a defect appears in any part of the turbine, a drift occurs on the sensors data, which the model interprets as abnormal behavior (an anomaly).

The following screenshot is a sample of the raw data captured by the turbine sensors.

The data has the following features:

  • nanoId – ID of the edge device that collected the data
  • turbineId – ID of the turbine that produced this data
  • arduino_timestamp – Timestamp of the Arduino that was operating this turbine
  • nanoFreemem – Amount of free memory in bytes
  • eventTime – Timestamp of the row
  • rps – Rotation of the rotor in rotations per second
  • voltage – Voltage produced by the generator in millivolts
  • qw, qx, qy, qz – Quaternion angular acceleration
  • gx, gy, gz – Gravity acceleration
  • ax, ay, az – Linear acceleration
  • gearboxtemp – Internal temperature
  • ambtemp – External temperature
  • humidity – Air humidity
  • pressure – Air pressure
  • gas – Air quality
  • wind_speed_rps – Wind speed in rotations per second

The selected features, based on our goals, are qw, qx, qy, qz (angular acceleration), wind_speed_rps, rps, and voltage. The following image is a sample of the feature qx. The data produced by the accelerometer is noisy, so we need to clean it first.

The angular velocity (quaternion) is first converted to Euler Angles (roll, pitch, yaw). Then we denoise all the features with Wavelets (PyWavelets), and normalize them. The following screenshot shows the signals after these transformations.

Finally, we apply a sliding window to this resulting dataset (six features) to capture the temporal relationship between neighboring readings and create the input tensor of our ML model. The average interval between two sequential samples is approximately 50 milliseconds. Each time window (of our sliding window) is then converted into a tensor, using the following structure:

  • Tensor – 6 features × 10 steps (100 samples) = 6×100
    • Step – Group of time steps
    • Time step – Group of intervals (time_step=20 = ~5 seconds)
    • Interval – Group of samples (interval=5 = ~250 milliseconds)
  • Reshaped tensor – 6×10×10

Interval, time step, and step are hyperparameters that you can adjust during training. The final result is a stream of data, encoded as a multidimensional tensor (representing a few seconds in the past). The trained autoencoder tries to recreate the input tensor as the output (prediction). By measuring the MAE between the input and output and comparing it with a predefined threshold, you can identify potential anomalies.
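A minimal sketch of the windowing and thresholding logic looks like the following; the shapes follow the 6×10×10 layout described above, and the trained model and threshold are assumed to come from the training notebook.

import numpy as np
import torch

FEATURES, STEPS, TIME_STEPS = 6, 10, 10  # reshaped tensor: 6 x 10 x 10

def make_windows(data, window=STEPS * TIME_STEPS, stride=1):
    # data has shape (num_samples, FEATURES); each window becomes a 6x10x10 tensor
    tensors = []
    for start in range(0, len(data) - window + 1, stride):
        chunk = data[start:start + window]          # shape (100, 6)
        tensors.append(chunk.T.reshape(FEATURES, STEPS, TIME_STEPS))
    return torch.tensor(np.stack(tensors), dtype=torch.float32)

def detect_anomalies(model, windows, threshold):
    # Flag windows whose reconstruction error (MAE) exceeds the trained threshold
    with torch.no_grad():
        reconstruction = model(windows)
    mae = torch.mean(torch.abs(reconstruction - windows), dim=(1, 2, 3))
    return mae > threshold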

One important aspect of this approach is that it extracts the linear and non-linear correlations between the features to better capture the impact of one feature on another, such as the effect of wind speed on rotation or produced voltage.

Now it’s time to run this experiment.

  1. First, you need to set up your Studio environment if you don’t have one yet.
  2. Clone the GitHub repo https://github.com/aws-samples/amazon-sagemaker-edge-manager-demo inside a Studio terminal.

The repository contains a folder named 03_Notebooks with two Jupyter notebooks.

  3. Follow the instructions in the first notebook to prepare the dataset. Because the accelerometer data is a signal, it contains noise, so you run a denoising mechanism to clean the data.

The final dataset has only six features: roll, pitch, yaw (converted from a quaternion to Euler angles), wind_speed_rps, rps (rotations per second), and voltage (produced by the generator).

  4. Follow the instructions in the second notebook to train, package, and deploy the model:
    1. Use SageMaker to train your PyTorch autoencoder (CNN based).
    2. Run a batch prediction to compute MAE and threshold used by the app to determine whether the prediction is an anomaly or not.
    3. Compile the model for the Jetson Nano using SageMaker Neo.
    4. Create a deployment package with Edge Manager.
    5. Create an IoT job that publishes a JSON document to a topic listened to by the application that is running on your Jetson Nano.

The application gets the package, unpacks it, loads the model in the Edge Manager agent, and unblocks the application run.

Both notebooks are very detailed, so follow the steps carefully, after which you’ll have an anomaly detection model to deploy in your Jetson Nano.

Compilation job and model optimization

One of the most important steps of the whole process is the model optimization step in the second notebook. When you compile a model with SageMaker Neo, it not only optimizes the model to improve the prediction performance on the target device, it also converts the original model into an intermediate representation. After this conversion, you don't need to use the original framework anymore (PyTorch, TensorFlow, MXNet). This representation is then interpreted by a light runtime (DLR), which is packaged with the model by Neo. Both the runtime and the optimized model are libraries, compiled as native programs for a specific operating system and architecture. In the case of the Jetson Nano, the OS is a Linux distro and the architecture is ARMv8 64-bit. The runtime in this case uses TensorRT for maximum performance on the Jetson's GPU.

When you launch a compilation job on Neo, you need to specify some parameters related to the setup of your target device, for instance:

  • trt-ver – 7.1.3
  • cuda-ver – 10.2
  • gpu-code – sm_53

The Jetson Nano's GPU is an NVIDIA Maxwell, architecture version 53, so the gpu-code parameter is the same for all compilation jobs. However, trt-ver and cuda-ver depend on the versions of TensorRT and CUDA installed on your Nano. When you were preparing your edge device, you set up your Jetson Nano with JetPack 4.4.1. This ensures that the model you optimize using Neo is compatible with your Jetson Nano.
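For reference, the same compilation job can be launched programmatically with boto3; the job name, S3 locations, role, and input shape below are placeholders, and the compiler options mirror the values above. The notebook remains the authoritative source for the exact settings.

import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="wind-turbine-anomaly-jetson-nano",       # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",      # your execution role
    InputConfig={
        "S3Uri": "s3://your-bucket/models/model.tar.gz",         # trained PyTorch model
        "DataInputConfig": '{"input0": [1, 6, 10, 10]}',         # must match your input tensor shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://your-bucket/compiled/",
        "TargetDevice": "jetson_nano",
        "CompilerOptions": '{"trt-ver": "7.1.3", "cuda-ver": "10.2", "gpu-code": "sm_53"}',
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)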

Visualize the results

The dashboard setup is out of scope for this post. For more information, see Analyze device-generated data with AWS IoT and Amazon Elasticsearch Service.

Now that you have your model deployed and running on your Jetson Nano, it’s time to look at the behavior of your wind turbines through a dashboard. The application you deployed to the Jetson Nano collects some logs and sends them to two different places:

  • The IoT MQTT topic wind-turbine/logs/<<iot_thing_name>> contains the app logs and raw data collected from the wind turbine sensors
  • The S3 bucket s3://<<S3_BUCKET>>/wind_turbine_data contains the metrics of the ML model

You can get this data and ingest it into Amazon ES or another database. Then you can use your preferred reporting tool to prepare dashboards.

The following visualization shows three different but correlated things for each one of the five turbines: the rotation speed (in RPS), the produced voltage, and the detected anomalies for voltage, rotation, and vibration.

Some noise was injected in the raw data from the turbines to simulate failures.

The following visualization shows an aggregation of the turbines’ speed and produced voltage anomalies over time.

Conclusion

Securely and reliably maintaining the lifecycle of an ML model deployed across a fleet of devices isn’t an easy task. However, with Edge Manager, you can reduce the implementation effort and operational cost of such a solution. Also, with a demo like the mini wind turbine farm, you can experiment, optimize, and automate your ML pipeline with the services and expertise provided by AWS.

To build a solution for your own needs, get the code and artifacts used in this project from the GitHub repo. If you want more practice using Edge Manager, check out the end-to-end workshop for Edge Manager on Studio.


About the Author

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Read More

Build a medical sentence matching application using BERT and Amazon SageMaker

Determining the relevance of a sentence when compared to a specific document is essential for many different types of applications across various industries. In this post, we focus on a use case within the healthcare field to help determine the accuracy of information regarding patient health.

Frequently, during each patient visit, a new document is created with the information from the visit. This information often consists of a medical transcription that has been dictated by either the nurse or the physician. Such a document may contain a brief description statement (also known as a restatement) that explains the main details from that specific patient visit. In future visits, doctors may rely on previous visits' restatements to quickly get an overview of the patient's overall status. Such restatements may also be used during patient handoffs. However, this introduces the potential for errors to be made during patient handoffs to new medical teams if the restatements are difficult to understand or if they contain inadequate information (Staggers et al. 2011). Therefore, having an accurate description of the patient's status is important, because the cost of errors in such restatements can be high and may negatively affect the patient's overall care (Garcia et al. 2017).

This post walks you through how to deploy a machine learning (ML) model that aims to determine the top sentences from the document that best match the corresponding document restatement; this can be a first step to ensure the accuracy of the patient’s health records overall by determining the relevance of the restatement. We emphasize that this model determines the top ranking sentences that match the restatement; it does not generate the restatement itself.

When creating this solution, we were faced with a dual-sided challenge. Beyond the technical challenge of actually creating an AI/ML model, several surrounding components complicate actually using such models in the real world. Indeed, the actual ML code may be a very small part of the system as a whole (Sculley et al. 2015). This is especially so in complex architectures frequently deployed in the context of the healthcare and life science space.

We focused on one particular challenge: creating the ability to serve the model so that others (applications, services, or people) can use it. By serving a model, we mean to grant others the ability to pass new data to the model so they can get the predictions they need. This post provides a broad overview of the problem, the solution, and a few points to keep in mind if you plan to use a similar approach in your own use cases. A full technical write-up, including a readme and a step-by-step deployment of the architecture, is available in the GitHub code repository. For more information about approaches to serving models, see Build, Train, and Deploy a Machine Learning Model With Amazon SageMaker and AWS Deep Learning Containers on Amazon ECS.

Background and use case

In the medical field (as well as other industries), documents are frequently associated with a shorter restatement text of the original document. We use the term restatement, but in fact this shorter text can be a summary, highlight, description, or other metadata about the document. For example, an after-visit clinical summary given to a patient summarizes the content of the patient visit to a physician.

For illustration purposes, the following is an example that’s unrelated to the medical industry.

Document:

On Monday morning, Joshua ate a large breakfast of bacon and eggs. He then went for a brisk walk. Finally, he returned home and sat at his desk.

Restatement:

Joshua went for a walk.

In this example, the restatement is just a rewording of the highlighted sentence in the full document. This example shows that, although the use case that we focus on in this post is specific to the medical field, you can use and modify this approach for many other text analysis applications.

Let’s now take a closer look at the use case for this post. We used data taken from MTSamples (which we downloaded from Kaggle). This data contains many different samples of transcribed medical texts. It includes documents with raw transcriptions of sample notes, as well as shorter descriptions of those notes (which we treat as restatements).

The following is an example from the MTSamples dataset.

Document:

HISTORY OF PRESENT ILLNESS: , I have seen ABC today. He is a very pleasant gentleman who is 42 years old, 344 pounds. He is 5’9″. He has a BMI of 51. He has been overweight for ten years since the age of 33, at his highest he was 358 pounds, at his lowest 260. He is pursuing surgical attempts of weight loss to feel good, get healthy, and begin to exercise again. He wants to be able to exercise and play volleyball. Physically, he is sluggish. He gets tired quickly. He does not go out often. When he loses weight he always regains it and he gains back more than he lost. His biggest weight loss is 25 pounds and it was three months before he gained it back. He did six months of not drinking alcohol and not taking in many calories. He has been on multiple commercial weight loss programs including Slim Fast for one month one year ago and Atkin’s Diet for one month two years ago.,PAST MEDICAL HISTORY: , He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to public seating, difficulty walking, high cholesterol, and high blood pressure. He has asthma and difficulty walking two blocks or going eight to ten steps. He has sleep apnea and snoring. He is a diabetic, on medication. He has joint pain, knee pain, back pain, foot and ankle pain, leg and foot swelling. He has hemorrhoids.,PAST SURGICAL HISTORY: , Includes orthopedic or knee surgery.,SOCIAL HISTORY: , He is currently single. He drinks alcohol ten to twelve drinks a week, but does not drink five days a week and then will binge drink. He smokes one and a half pack a day for 15 years, but he has recently stopped smoking for the past two weeks.,FAMILY HISTORY: , Obesity, heart disease, and diabetes. Family history is negative for hypertension and stroke.,CURRENT MEDICATIONS:, Include Diovan, Crestor, and Tricor.,MISCELLANEOUS/EATING HISTORY: ,He says a couple of friends of his have had heart attacks and have had died. He used to drink everyday, but stopped two years ago. He now only drinks on weekends. He is on his second week of Chantix, which is a medication to come off smoking completely. Eating, he eats bad food. He is single. He eats things like bacon, eggs, and cheese, cheeseburgers, fast food, eats four times a day, seven in the morning, at noon, 9 p.m., and 2 a.m. He currently weighs 344 pounds and 5’9″. His ideal body weight is 160 pounds. He is 184 pounds overweight. If he lost 70% of his excess body weight that would be 129 pounds and that would get him down to 215.,REVIEW OF SYSTEMS: , Negative for head, neck, heart, lungs, GI, GU, orthopedic, or skin. He also is positive for gout. He denies chest pain, heart attack, coronary artery disease, congestive heart failure, arrhythmia, atrial fibrillation, pacemaker, pulmonary embolism, or CVA. He denies venous insufficiency or thrombophlebitis. Denies shortness of breath, COPD, or emphysema. Denies thyroid problems, hip pain, osteoarthritis, rheumatoid arthritis, GERD, hiatal hernia, peptic ulcer disease, gallstones, infected gallbladder, pancreatitis, fatty liver, hepatitis, rectal bleeding, polyps, incontinence of stool, urinary stress incontinence, or cancer. He denies cellulitis, pseudotumor cerebri, meningitis, or encephalitis.,PHYSICAL EXAMINATION: ,He is alert and oriented x 3. Cranial nerves II-XII are intact. Neck is soft and supple. Lungs: He has positive wheezing bilaterally. Heart is regular rhythm and rate. His abdomen is soft. 
Extremities: He has 1+ pitting edema.,IMPRESSION/PLAN:, I have explained to him the risks and potential complications of laparoscopic gastric bypass in detail and these include bleeding, infection, deep venous thrombosis, pulmonary embolism, leakage from the gastrojejuno-anastomosis, jejunojejuno-anastomosis, and possible bowel obstruction among other potential complications. He understands. He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass. He will need to get a letter of approval from Dr. XYZ. He will need to see a nutritionist and mental health worker. He will need an upper endoscopy by either Dr. XYZ. He will need to go to Dr. XYZ as he previously had a sleep study. We will need another sleep study. He will need H. pylori testing, thyroid function tests, LFTs, glycosylated hemoglobin, and fasting blood sugar. After this is performed, we will submit him for insurance approval.

Restatement:

Consult for laparoscopic gastric bypass.

Although the raw transcript document is quite long, only a few of the sentences actually appear to be related to the restatement “Consult for laparoscopic gastric bypass.” We highlighted two sentences within the document that you might intuitively think best match the restatement. The approach we deployed quantifies the similarities and reports the sentences in the document that best match the restatement. We did this by using a pretrained BERT language model trained specifically on clinical texts (published by Alsentzer et al. 2019). The model itself is hosted by HuggingFace, a platform for sharing open-source natural language processing (NLP) projects. We used this model to calculate sentence-by-sentence similarities using the sentence-transformers Python library.
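The following sketch shows the general idea with the sentence-transformers library. The model identifier is our assumption for the clinical BERT checkpoint published by Alsentzer et al. and hosted on HuggingFace; the exact model and ranking logic used by the solution are described in the technical write-up.

from sentence_transformers import SentenceTransformer, util

# Wrap the clinical BERT checkpoint with mean pooling to get sentence embeddings
model = SentenceTransformer("emilyalsentzer/Bio_ClinicalBERT")  # assumed model ID

restatement = "Consult for laparoscopic gastric bypass."
sentences = [
    "He is a very pleasant gentleman who is 42 years old, 344 pounds.",
    "He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass.",
    "He has sleep apnea and snoring.",
]

sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
restatement_embedding = model.encode(restatement, convert_to_tensor=True)

# Cosine distance = 1 - cosine similarity; a lower distance means a closer match
distances = 1 - util.cos_sim(restatement_embedding, sentence_embeddings)[0]
for distance, sentence in sorted(zip(distances.tolist(), sentences))[:5]:
    print(f"{distance:.4f}  {sentence}")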

Note that in this example and in this solution, we perform the sentence ranking without explicitly detecting or extracting medical entities. However, many applications rely on explicitly extracting and analyzing diagnoses, medications, and other health information. To detect medical entities such as conditions and medications in medical text, consider using Amazon Comprehend Medical, a HIPAA-eligible service built to extract medical information from unstructured medical text.

More information about this approach is available in our technical write-up.

Architecture diagram

In this section, we go over the architecture diagram for this solution at a very high level. For more details and to see the step-by-step framework, see our technical write-up.

In the model development and testing phase, we use Amazon SageMaker Studio. Studio is a powerful integrated development environment (IDE) for building, training, testing, and deploying ML models. Because we use a prebuilt model for this solution, we don’t need to use Studio’s full ability to train algorithms at scale. Instead, we use it for development and deployment purposes.

We created a Jupyter notebook that you can import into Studio. This notebook walks you through the entire development and deployment process. We start by writing the code for our model to a file. The model is then built using an NGINX/Flask framework, so that new data can be passed to it at inference time. Prior to deploying the model, we package it as a Docker container, build it using AWS CodeBuild, and push it to Amazon Elastic Container Registry (Amazon ECR). Then we deploy the model using Amazon Elastic Container Service (Amazon ECS).
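As an illustration of the serving layer, a minimal Flask endpoint could look like the following sketch; the route names and JSON fields are assumptions for illustration, not the exact application code in the repository.

from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer, util

app = Flask(__name__)
model = SentenceTransformer("emilyalsentzer/Bio_ClinicalBERT")  # assumed model ID

@app.route("/ping", methods=["GET"])
def ping():
    # Health check used by the load balancer
    return jsonify(status="ok")

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json()
    sentences = payload["document_sentences"]   # assumed request schema
    restatement = payload["restatement"]
    doc_emb = model.encode(sentences, convert_to_tensor=True)
    query_emb = model.encode(restatement, convert_to_tensor=True)
    distances = (1 - util.cos_sim(query_emb, doc_emb)[0]).tolist()
    ranked = sorted(zip(distances, sentences))[:5]
    return jsonify(results=[{"distance": d, "sentence": s} for d, s in ranked])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)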

The final result is a model that you can query using a simple API call. This is an important point: the ability to query models via an API is an essential component of designing scalable, easy-to-use interfaces. For more information, see Implementing Microservices on AWS.

After we deploy our model, we create a graphical user interface (using Streamlit) so that our model can be easily accessed through a webpage. Streamlit is an open-source library used to create front ends for ML applications. After we create our webpage, we deploy it in a similar way to how we deployed our model: we package it as a separate Docker container, build it using CodeBuild, push it to Amazon ECR, and deploy it using Amazon ECS.
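For completeness, a minimal Streamlit front end that calls such an endpoint might look like the following; the endpoint URL and JSON schema match the assumptions in the serving sketch above.

import requests
import streamlit as st

st.title("Medical sentence matching")  # illustrative page title

restatement = st.text_input("Restatement")
document = st.text_area("Document text")

if st.button("Find matching sentences") and restatement and document:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    response = requests.post(
        "http://localhost:8080/invocations",   # assumed endpoint URL
        json={"restatement": restatement, "document_sentences": sentences},
        timeout=60,
    )
    for result in response.json()["results"]:
        st.write(f'{result["distance"]:.4f}  {result["sentence"]}')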

By creating and deploying this webpage, we provide users with no programming experience the ability to use our model to test their own documents and restatements. The following screenshot shows what the webpage looks like.

After the user inputs their restatement and corresponding document, the top five results (the five sentences that best match the statement) are returned. If you deploy the entire solution using our original MTSamples example, the final result looks like the following screenshot.

The solution reports the following results:

  • The top five sentences within the document that best match the restatement.
  • The similarity distance between each sentence and the restatement. A lower distance means closer similarities between that sentence and the restatement sentence.

In this example, the best matching sentence is “He wants to proceed with workup and evaluation for laparoscopic Roux-en-Y gastric bypass” with a distance of 0.0672. Therefore, this approach has correctly identified a sentence within the document that matches the restatement.

Limitations

Like any algorithm, this approach has some limitations. For instance, this approach is not designed to handle cases where the restatement of the document is actually high-level metadata about the document not directly related to the text of the document itself. You can solve such use cases by using Amazon Comprehend custom models. For more information, see Comprehend Custom and Building a custom classifier using Amazon Comprehend.

Another limitation in our approach is that it doesn’t explicitly handle negation (words such as “not,” “no,” and “denies”), which may change the meaning of the text. AWS services such as Amazon Comprehend and Amazon Comprehend Medical use deep learning models to handle negation.

Conclusion

In this post, we walked through the high-level steps to deploy a pre-built NLP model to analyze medical texts. If you’re interested in deploying this yourself, see our step-by-step technical write-up.


About the Authors

Joshua Broyde is an AI/ML Specialist Solutions Architect on the Global Healthcare and Life Sciences team at Amazon Web Services. He works with customers in the healthcare and life sciences industry at all levels of the machine learning lifecycle on a number of AI/ML fronts, including analyzing medical images and video, analyzing machine sensor data, and performing natural language processing of medical and healthcare texts.

 

Claire Palmer is a Solutions Architect at Amazon Web Services. She is on the Global Account Development team, supporting healthcare and life sciences customers. Claire has a passion for driving innovation initiatives and developing solutions that are both secure and scalable. She is based out of Seattle, Washington and enjoys exploring the PNW in her free time.

Read More

Cognitive document processing for automated mortgage processing

This post was guest authored by AWS Advanced Consulting Partner Quantiphi.

The mortgage industry is highly complex and largely dependent on documents for the information required across different stages in their business value chain. Day-to-day operations for mortgage underwriting, property appraisal, and mortgage insurance underwriting are heavily dependent on the comprehension of different types of documents. The slow pace of document transfer between different business units of an organization slows down the overall approval process, leading to poor customer experience.

The mortgage loan approval process usually takes multiple weeks because a multitude of user-submitted documents are scrutinized at each stage to assess the underlying risk. Organizations need the right information at the right time to increase operational efficiency and improve document management.

In the wake of COVID-19, the mortgage industry is reeling under pressure to undergo a digital transformation to provide a better customer experience. Large companies are cutting down capital and operational expenditure to sustain operations. The need for operational efficiency is higher than ever.

This post analyzes the role of machine learning (ML) solutions in document extraction in the mortgage industry to enhance business operations.

We highlight the key aspects of Quantiphi’s document processing solution built on AWS, and unveil how it helped a US-based mortgage insurance company address document management challenges through artificial intelligence (AI) and ML techniques.

Quantiphi is an AWS Partner Network (APN) Advanced Consulting Partner with AWS competencies in Machine Learning, Financial Services, Data & Analytics, and DevOps. Quantiphi also has multiple AWS Service Delivery designations, recognizing its expertise in leveraging specific AWS services.

ML-based document extraction for the mortgage industry

Lenders usually have to manually sift through large volumes of loan packages containing structured and unstructured information to classify documents and identify key information. The identified information is further used for risk assessment. Most of this key information is usually contained in paragraphs, key-value pairs, and tables.

These lenders usually receive loan packages in bulk containing different types of documents, such as W2s, tax statements, and 1008 forms. Currently, people have to first classify these documents manually and extract the relevant information. Therefore, mortgage firms are looking for meaningful ways to incorporate cognitive capabilities and solutions into their existing mortgage processing pipelines to automate the identification of key information and facilitate risk scoring, in order to improve operational excellence and reduce manual effort.

Quantiphi’s cognitive document processing solution combines state-of-the-art AI and ML services from AWS with Quantiphi’s custom document processing models to digitize a wide variety of mortgage documents. Quantiphi’s solution leverages services like Amazon Textract, Amazon SageMaker, Amazon Comprehend, Amazon Kendra, and Amazon Augmented AI (Amazon A2I) to help mortgage firms extract information from structured and unstructured documents, classify them into document types, and further address needs around risk assessment through ML.

Document classification and information extraction

Mortgage underwriting is done to assess the underlying risk for each application by analyzing the multitude of user-submitted documents, such as W2 or I9 forms, tax returns, loan application (1003) forms, underwriting transmittal (1008) forms, demographic addendum, credit reports, bank account statements, and paycheck stubs. For example, the underwriting transmittal (1008) form contains the summary of the key information used during the risk assessment, such as monthly income, qualifying rate, property details, and occupancy status. Paycheck stubs are another example of such documents, used to verify a borrower's income and confirm that the borrower is able to repay the loan.

Similarly, property appraisal documents such as chain of title document and deed documents (assignment, trust, quitclaim) along with the property appraisal report are used to complement the property valuation process. Deed documents are processed to establish ownership and legal rights to a property. For example, if the lender sells a mortgage loan to another lender, they need to issue an assignment of deed of trust to give the new lender the same legal rights to the property.

Based on the inherent structure of the different types of mortgage documents, we have defined three broad segments to classify these documents:

  • Structured documents, which are standard documents with some variations over the years and across different states. All values, check boxes, and tables are usually contained in predefined areas of the document.
  • Semi-structured documents, which don’t follow a standardized template strictly but have a similar format.
  • Unstructured documents, which don’t follow any defined format.

Structured documents

Examples of structured documents include the loan application (1003) form, underwriting transmittal summary (1008) form, verification of employment (1005) form, and W2 form.

Consider the underwriting transmittal summary (1008) form. Quantiphi's solution uses the standardized 1008 document as a reference for training, which is then used for extraction (see the following screenshot).

Key information that can be extracted from the 1008 form includes borrower and co-borrower names, property address, SSN, sales price, and appraised value.

Semi-structured documents

Semi-structured documents include pay stubs, bank statements, credit reports, and loan estimates. Here, Quantiphi’s solution uses a generic key-value pair and table detection model to extract the relevant features. Searching for certain common keywords results in a more efficient extraction.

The following screenshot shows data extraction from a pay stub.

Key information that can be extracted from a pay stub includes pay period, deductions, net pay, 401K summary, and more.

Unstructured documents

Unstructured documents include deed documents, appraisal reports, and more. Consider the assignment of deed of trust. Quantiphi's solution uses custom NLP techniques like entity recognition and syntax analysis to extract information (see the following screenshot).

Key information that can be extracted from the assignment of deed of trust includes the date of assignment, assignor, assignee, executor name, principal sum, and more.

Quantiphi’s cognitive document processing solution

Quantiphi’s cognitive document processing solution works across all types of structured and unstructured mortgage documents. Some key aspects of Quantiphi’s solution are as follows:

  • Capture – Powered by Amazon Textract and SageMaker. This feature uses deep-learning based OCR to identify and extract information such as key-value pairs, check boxes, tables, and signatures for further consumption by risk scoring models.
  • Categorize – Powered by SageMaker and Amazon Comprehend. This feature includes the automated classification of various types of mortgage documents like loan application (1003) forms, underwriting transmittal summary (1008) forms, W2 forms, pay stubs, bank statements, and credit reports.
  • Call out – Powered by Amazon Textract. This feature converts scanned applications and supporting documents into digital (searchable) PDFs and highlights the important information in the document, along with bookmarks, to help loan underwriters quickly navigate the document and consume the relevant information for the decision-making process.
  • Redaction – Powered by Amazon Textract and Amazon Comprehend. This feature enables identification and redaction of PII and PCI data like addresses, phone numbers, and names.
  • Interpret – Powered by Amazon Kendra and Amazon Elasticsearch Service (Amazon ES). This feature offers tools for easy consumption like a contextual and keyword-based search engine. Users can directly perform a search on a repository of processed documents to retrieve the relevant information along with the corresponding document link.
  • Human in the loop – Powered by Amazon A2I. This feature allows you to review and edit the extracted information based on the confidence score against the ground truth through an augmented AI workflow. This manual feedback is then used for continuous improvement of the extraction results through active learning.

The extracted information can be further fed into a risk assessment module to enable risk scoring of submissions in which low-risk applications are auto-approved and high-risk applications are marked for human review.

Quantiphi’s cognitive document processing solution is capable of achieving over 90% accuracy, provides substantial cost reductions, and facilitates better visibility of the mortgage processing workflow while assuring faster processing.

Let’s look at how Quantiphi built this solution by using a combination of AI and ML services provided by AWS.

Components used in the architecture ensure that the complete solution remains robust and scalable while providing high performance and reliability to process the incoming workload of documents in a cost-effective manner.

Solution architecture

The following diagram illustrates the architecture of Quantiphi’s solution.

The architecture consists of the following elements:

  1. The UI is hosted on Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront is used for web distribution to ensure low latency.
  2. Through the UI, the user can upload multiple types of documents to Amazon S3 and select use cases like information extraction, document searchability, document classification, entity recognition, and insight generation.
  3. AWS Batch carries out preprocessing and stores these preprocessed images back into Amazon S3. The metadata information is captured in Amazon Aurora.
  4. AWS Lambda invokes Amazon Textract.
  5. Amazon Textract performs OCR to extract information (a minimal invocation sketch follows this list). When the OCR job is complete, Amazon Textract triggers an Amazon Simple Notification Service (Amazon SNS) notification to add the completed job to an Amazon Simple Queue Service (Amazon SQS) queue.
  6. Lambda receives the Amazon Textract output and stores it in Amazon S3.
  7. Amazon Elastic Compute Cloud (Amazon EC2) instances in an Auto Scaling group convert these scanned images into digital (searchable) PDFs and write the output to Amazon S3. Depending on the selected use cases, they write the message into the respective four postprocessing SQS queues.
  8. If the uploaded PDFs are digital, AWS Batch skips this conversion step and writes directly to the four postprocessing SQS queues.
  9. The solution uses a document classifier Docker container to classify pages and documents into categories.
  10. The enterprise search Docker container enables the contextual search engine with the Q&A option on document content, content snippet generation, and document ranking.
  11. A document entity recognition and insights Docker container is used for keyword and entity recognition and highlighting, masking of confidential data, chronological distribution of information, summarization, and topic modeling.
  12. The information extraction Docker container has two functions:
    1. If the uploaded documents match any pre-trained documents, it extracts information from the documents and presents them to the user via the UI.
    2. If the documents don’t match any pre-trained model documents, it calls SageMaker via Amazon API Gateway and extracts key-value pairs, tables, check boxes, signatures, and stamps.
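The following boto3 sketch makes the Amazon Textract step (step 5) more concrete by starting an asynchronous text detection job with an SNS completion notification; the bucket, document key, topic, and role are placeholders rather than Quantiphi's actual configuration.

import boto3

textract = boto3.client("textract")

response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "your-mortgage-docs-bucket", "Name": "loan-package.pdf"}
    },
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-job-complete",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractPublishToSNS",
    },
)
job_id = response["JobId"]

# After the SNS/SQS notification signals completion, fetch the OCR output
result = textract.get_document_text_detection(JobId=job_id)
lines = [block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"]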

Customer use case: US-based mortgage insurance company

For this post, we present a use case in which the client is a leading US-based mortgage insurance company with a suite of mortgage, risk, real estate, and title services.

The client had millions of scanned pages of mortgage documents containing both handwritten and typed content, which were manually parsed to extract information and classify them accordingly. Processing of new mortgage loans was extremely time-consuming due to manual handling of over 400 different document types.

Quantiphi solution

Quantiphi developed an AI virtual assistant that takes user-uploaded documents and automatically classifies the documents and pages contained in them into different categories, such as bank statements, credit reports, tax returns, and property tax bills and statements.

To augment the consumption of information, the solution highlights key entities (with bounding boxes) present in them. The user can review and edit the extraction results via a custom reviewer UI tool, which is then used for accuracy benchmarking and re-training purposes.

Amazon QuickSight was used to build an interactive dashboard for presenting accuracy metrics and reconciliation.

The solution successfully digitized the processing of more than 50 million pages yearly, with an accuracy of over 97% in classification and 90% in the extraction of more than 40 different data points, such as borrower's name and loan amount.

Quantiphi helped expand the customer's profit margin by lowering its document processing costs. Processing efficiency was enhanced through quick and accurate extraction and detection of data, and the elimination of manual effort greatly reduced loan processing time.

Summary

Traditional methods of mortgage loan processing are manual in nature and highly time-consuming. Customers are often asked to provide a large number of documents that lenders have to manually go through for assessment.

Quantiphi’s cognitive document processing solution expedites the process by automating information extraction from the documents. Mortgage companies can use Quantiphi’s solution to increase their operational efficiency and significantly reduce their mortgage processing time.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Arnav Gupta is AWS Practice Head at Quantiphi.

Bhaskar Kalita is FSI Head at Quantiphi.

Read More

Securing Amazon SageMaker Studio internet traffic using AWS Network Firewall

Amazon SageMaker Studio is a web-based fully integrated development environment (IDE) where you can perform end-to-end machine learning (ML) development to prepare data and build, train, and deploy models.

Like other AWS services, Studio supports a rich set of security-related features that allow you to build highly secure and compliant environments.

One of these fundamental security features allows you to launch Studio in your own Amazon Virtual Private Cloud (Amazon VPC). This allows you to control, monitor, and inspect network traffic within and outside your VPC using standard AWS networking and security capabilities. For more information, see Securing Amazon SageMaker Studio connectivity using a private VPC.

Customers in regulated industries, such as financial services, often don't allow any internet access in ML environments. They often use only VPC endpoints for AWS services, and connect only to private source code repositories in which all libraries have been vetted both in terms of security and licensing. Customers may want to provide internet access but still require controls such as domain name or URL filtering to allow access to only specific public repositories and websites, packet inspection, or other network traffic-related security controls. For these cases, a deployment based on AWS Network Firewall and a NAT gateway may provide a suitable solution.

In this post, we show how you can use Network Firewall to build a secure and compliant environment by restricting and monitoring internet access, inspecting traffic, and using stateless and stateful firewall engine rules to control the network flow between Studio notebooks and the internet.

Depending on your security, compliance, and governance rules, you may not need to or cannot completely block internet access from Studio and your AI and ML workloads. You may have requirements beyond the scope of network security controls implemented by security groups and network access control lists (ACLs), such as application protocol protection, deep packet inspection, domain name filtering, and intrusion prevention system (IPS). Your network traffic controls may also require many more rules compared to what is currently supported in security groups and network ACLs. In these scenarios, you can use Network Firewall—a managed network firewall and IPS for your VPC.

Solution overview

When you deploy Studio in your VPC, you control how Studio accesses the internet with the parameter AppNetworkAccessType (via the Amazon SageMaker API) or by selecting your preference on the console when you create a Studio domain.

If you select Public internet Only (PublicInternetOnly), all the ingress and egress internet traffic from Amazon SageMaker notebooks flows through an AWS managed internet gateway attached to a VPC in your SageMaker account. The following diagram shows this network configuration.

Studio provides public internet egress through a platform-managed VPC for data scientists to download notebooks, packages, and datasets. Traffic to the attached Amazon Elastic File System (Amazon EFS) volume always goes through the customer VPC and never through the public internet egress.

To use your own control flow for the internet traffic, like a NAT or internet gateway, you must set the AppNetworkAccessType parameter to VpcOnly (or select VPC Only on the console). When you launch your app, this creates an elastic network interface in the specified subnets in your VPC. You can apply all available layers of security control—security groups, network ACLs, VPC endpoints, AWS PrivateLink, or Network Firewall endpoints—to the internal network and internet traffic to exercise fine-grained control of network access in Studio. The following diagram shows the VpcOnly network configuration.

In this mode, the direct internet access to or from notebooks is completely disabled, and all traffic is routed through an elastic network interface in your private VPC. This also includes traffic from Studio UI widgets and interfaces, such as Experiments, Autopilot, and Model Monitor, to their respective backend SageMaker APIs.

For more information about network access parameters when creating a domain, see CreateDomain.
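For illustration, creating a domain in VpcOnly mode through the API looks roughly like the following; the VPC, subnet, security group, and role identifiers are placeholders.

import boto3

sagemaker = boto3.client("sagemaker")
sagemaker.create_domain(
    DomainName="studio-vpc-only-domain",                     # placeholder name
    AuthMode="IAM",
    AppNetworkAccessType="VpcOnly",                          # route all traffic through your VPC
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],                  # the SageMaker subnet
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        "SecurityGroups": ["sg-0123456789abcdef0"],          # controls Studio ingress/egress
    },
)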

The solution in this post uses the VpcOnly option and deploys the Studio domain into a VPC with three subnets:

  • SageMaker subnet – Hosts all Studio workloads. All ingress and egress network flow is controlled by a security group.
  • NAT subnet – Contains a NAT gateway. We use the NAT gateway to access the internet without exposing any private IP addresses from our private network.
  • Network Firewall subnet – Contains a Network Firewall endpoint. The route tables are configured so that all inbound and outbound external network traffic is routed via Network Firewall. You can configure stateful and stateless Network Firewall policies to inspect, monitor, and control the traffic.

The following diagram shows the overview of the solution architecture and the deployed components.

VPC resources

The solution deploys the following resources in your account:

  • A VPC with a specified Classless Inter-Domain Routing (CIDR) block
  • Three private subnets with specified CIDRs
  • Internet gateway, NAT gateway, Network Firewall, and a Network Firewall endpoint in the Network Firewall subnet
  • A Network Firewall policy and stateful domain list group with an allow domain list
  • Elastic IP allocated to the NAT gateway
  • Two security groups for SageMaker workloads and VPC endpoints, respectively
  • Four route tables with configured routes
  • An Amazon S3 VPC endpoint (type Gateway)
  • AWS service access VPC endpoints (type Interface) for various AWS services that need to be accessed from Studio

The solution also creates an AWS Identity and Access Management (IAM) execution role for SageMaker notebooks and Studio with preconfigured IAM policies.

Network routing for targets outside the VPC is configured in such a way that all ingress and egress internet traffic goes via the Network Firewall and NAT gateway. For details and reference network architectures with Network Firewall and NAT gateway, see Architecture with an internet gateway and a NAT gateway, Deployment models for AWS Network Firewall, and Enforce your AWS Network Firewall protections at scale with AWS Firewall Manager. The AWS re:Invent 2020 video Which inspection architecture is right for you? discusses which inspection architecture is right for your use case.

SageMaker resources

The solution creates a SageMaker domain and user profile.

The solution uses only one Availability Zone and is not highly available. A best practice is to use a Multi-AZ configuration for any production deployment. You can implement the highly available solution by duplicating the Single-AZ setup—subnets, NAT gateway, and Network Firewall endpoints—to additional Availability Zones.

You use Network Firewall and its policies to control entry and exit of the internet traffic in your VPC. You create an allow domain list rule to allow internet access to the specified network domains only and block traffic to any domain not on the allow list.

AWS CloudFormation resources

The source code and AWS CloudFormation template for solution deployment are provided in the GitHub repository. To deploy the solution on your account, you need:

Network Firewall is a Regional service; for more information on Region availability, see the AWS Region Table.

Your CloudFormation stack doesn’t have any required parameters. You may want to change the DomainName or *CIDR parameters to avoid naming conflicts with the existing resources and your VPC CIDR allocations. Otherwise, use the following default values:

  • ProjectName – sagemaker-studio-vpc-firewall
  • DomainName – sagemaker-anfw-domain
  • UserProfileName – anfw-user-profile
  • VPCCIDR – 10.2.0.0/16
  • FirewallSubnetCIDR – 10.2.1.0/24
  • NATGatewaySubnetCIDR – 10.2.2.0/24
  • SageMakerStudioSubnetCIDR – 10.2.3.0/24

Deploy the CloudFormation template

To start experimenting with the Network Firewall and stateful rules, you need first to deploy the provided CloudFormation template to your AWS account.

  1. Clone the GitHub repository:
git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git
cd amazon-sagemaker-studio-vpc-networkfirewall 
  2. Create an S3 bucket in the Region where you deploy the solution:
aws s3 mb s3://<your s3 bucket name>

You can skip this step if you already have an S3 bucket.

  3. Deploy the CloudFormation stack:
make deploy CFN_ARTEFACT_S3_BUCKET=<your s3 bucket name>

The deployment procedure packages the CloudFormation template and copies it to the S3 bucket you provided. Then the CloudFormation template is deployed from the S3 bucket to your AWS account.

The stack deploys all the needed resources like VPC, network devices, route tables, security groups, S3 buckets, IAM policies and roles, and VPC endpoints, and also creates a new Studio domain and user profile.

When the deployment is complete, you can see the full list of stack output values by running the following command in the terminal:

aws cloudformation describe-stacks \
    --stack-name sagemaker-studio-demo \
    --output table \
    --query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
  4. Launch Studio via the SageMaker console.

Experiment with Network Firewall

Now you can learn how to control the internet inbound and outbound access with Network Firewall. In this section, we discuss the initial setup, accessing resources not on the allow list, adding domains to the allow list, configuring logging, and additional firewall rules.

Initial setup

The solution deploys a Network Firewall policy with a stateful rule group with an allow domain list. This policy is attached to the Network Firewall. All inbound and outbound internet traffic is blocked now, except for the .kaggle.com domain, which is on the allow list.

Let’s try to access https://kaggle.com by opening a new notebook in Studio and attempting to download the front page from kaggle.com:

!wget https://kaggle.com

The following screenshot shows that the request succeeds because the domain is allowed by the firewall policy. Users can connect to this and only to this domain from any Studio notebook.

 

Access resources not on the allowed domain list

In the Studio notebook, try to clone any public GitHub repository, such as the following:

!git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git

This operation times out after 5 minutes because any internet traffic except to and from the .kaggle.com domain isn’t allowed and is dropped by Network Firewall.

Add a domain to the allowed domain list

To be able to run the git clone command, you must allow internet traffic to the .github.com domain.

  1. On the Amazon VPC console, choose Firewall policies.
  2. Choose the policy network-firewall-policy-<ProjectName>.

  3. In the Stateful rule groups section, select the rule group domain-allow-sagemaker-<ProjectName>.

You can see the domain .kaggle.com on the allow list.

  4. Choose Add domain.

  5. Enter .github.com.
  6. Choose Save.

You now have two names on the allow domain list.

The firewall policy is propagated to Network Firewall in real time, and your changes take effect immediately. Any inbound or outbound traffic from or to these domains is now allowed by the firewall, and all other traffic is dropped.
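You can make the same change programmatically. The following boto3 sketch fetches the current rule group and rewrites its domain allow list; the rule group name is an assumption based on the stack naming above.

import boto3

nfw = boto3.client("network-firewall")
rule_group_name = "domain-allow-sagemaker-sagemaker-studio-vpc-firewall"  # assumed name

current = nfw.describe_rule_group(RuleGroupName=rule_group_name, Type="STATEFUL")
rule_group = current["RuleGroup"]

# Append the new domain to the existing allow list
targets = rule_group["RulesSource"]["RulesSourceList"]["Targets"]
if ".github.com" not in targets:
    targets.append(".github.com")

nfw.update_rule_group(
    UpdateToken=current["UpdateToken"],   # optimistic-locking token from the describe call
    RuleGroupName=rule_group_name,
    Type="STATEFUL",
    RuleGroup=rule_group,
)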

To validate the new configuration, go to your Studio notebook and try to clone the same GitHub repository again:

!git clone https://github.com/aws-samples/amazon-sagemaker-studio-vpc-networkfirewall.git

The operation succeeds this time—Network Firewall allows access to the .github.com domain.

Network Firewall logging

In this section, you configure Network Firewall logging for your firewall’s stateful engine. Logging gives you detailed information about network traffic, including the time that the stateful engine received a packet, detailed information about the packet, and any stateful rule action taken against the packet. The logs are published to the log destination that you configured, where you can retrieve and view them.

  1. On the Amazon VPC console, choose Firewalls.
  2. Choose your firewall.

  3. Choose the Firewall details tab.

  4. In the Logging section, choose Edit.

  5. Configure your firewall logging by selecting what log types you want to capture and providing the log destination.

For this post, select Alert log type, set Log destination for alerts to CloudWatch Log group, and provide an existing or a new log group where the firewall logs are delivered.

  6. Choose Save.

To check your settings, go back to Studio and try to access pypi.org to install a Python package:

!pip install -U scikit-learn

This command fails with ReadTimeoutError because Network Firewall drops any traffic to any domain not on the allow list (which contains only two domains: .github.com and .kaggle.com).

On the Amazon CloudWatch console, navigate to the log group and browse through the recent log streams.

The pypi.org domain shows the blocked action. The log event also provides additional details such as various timestamps, protocol, port and IP details, event type, Availability Zone, and the firewall name.
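You can also query the alert log group programmatically, for example to list recently blocked hostnames; the log group name is a placeholder for the one you configured, and the JSON layout follows the Suricata-style alert records that Network Firewall emits.

import json
import boto3

logs = boto3.client("logs")
events = logs.filter_log_events(
    logGroupName="/network-firewall/alerts",   # the log group you configured above
    filterPattern='"blocked"',
    limit=50,
)

for event in events["events"]:
    alert = json.loads(event["message"])
    detail = alert.get("event", {})
    host = detail.get("tls", {}).get("sni") or detail.get("http", {}).get("hostname")
    print(detail.get("timestamp"), host)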

You can continue experimenting with Network Firewall by adding .pypi.org and .pythonhosted.org domains to the allowed domain list.

Then validate your access to them via your Studio notebook.

Additional firewall rules

You can create any other stateless or stateful firewall rules and implement traffic filtering based on a standard stateful 5-tuple rule for network traffic inspection (protocol, source IP, source port, destination IP, destination port). Network Firewall also supports industry standard stateful Suricata compatible IPS rule groups. You can implement protocol-based rules to detect and block any non-standard or promiscuous usage or activity. For more information about creating and managing Network Firewall rule groups, see Rule groups in AWS Network Firewall.

Additional security controls with Network Firewall

In the previous section, we looked at one feature of Network Firewall: filtering network traffic based on the domain name. In addition to stateless or stateful firewall rules, Network Firewall provides several other tools and features for further security controls and monitoring.

Build secure ML environments

A robust security design normally includes multi-layer security controls for the system. For SageMaker environments and workloads, you can use the following AWS security services and concepts to secure, control, and monitor your environment:

  • VPC and private subnets to perform secure API calls to other AWS services and restrict internet access for downloading packages.
  • S3 bucket policies that restrict access to specific VPC endpoints.
  • Encryption of ML model artifacts and other system artifacts that are either in transit or at rest. Requests to the SageMaker API and console are made over a Secure Sockets Layer (SSL) connection.
  • Restricted IAM roles and policies for SageMaker runs and notebook access based on resource tags and project ID.
  • Restricted access to Amazon public services, such as Amazon Elastic Container Registry (Amazon ECR) to VPC endpoints only.

For a reference deployment architecture and ready-to-use deployable constructs for your environment, see Amazon SageMaker with Guardrails on AWS.

Conclusion

In this post, we showed how you can secure, log, and monitor internet ingress and egress traffic in Studio notebooks for your sensitive ML workloads using managed Network Firewall. You can use the provided CloudFormation templates to automate SageMaker deployment as part of your Infrastructure as Code (IaC) strategy.

For more information about other possibilities to secure your SageMaker deployments and ML workloads, see Building secure machine learning environments with Amazon SageMaker.


About the Author

Yevgeniy Ilyin is a Solutions Architect at AWS. He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud native solutions with a focus on big data, analytics, and data engineering.

Read More

Perform medical transcription analysis in real-time with AWS AI services and Twilio Media Streams

Medical providers often need to analyze and dictate patient phone conversations, doctors' notes, clinical trial reports, and patient health records. By automating transcription, providers can quickly and accurately capture details such as medical conditions, medication, dosage, strength, and frequency.

Generic artificial intelligence-based transcription models can be used to transcribe voice to text. However, medical voice data often uses complex medical terms and abbreviations. Transcribing such data needs medical/healthcare-specific machine learning (ML) models. To address this issue, AWS launched Amazon Transcribe Medical, an automatic speech recognition (ASR) service that makes it easy for you to add medical speech-to-text capabilities to your voice-enabled applications.

Additionally, Amazon Comprehend Medical is a HIPAA-eligible service that helps providers extract information from unstructured medical text accurately and quickly. To transcribe voice in real time, providers need access to raw audio from the call while in-progress. Twilio, an AWS partner, offers real-time telephone voice integration.

In this post, we show you how to integrate Twilio Media Streams with Amazon Transcribe Medical and Amazon Comprehend Medical to transcribe and analyze data from phone calls. For non-healthcare industries, you can use this same solution with Amazon Transcribe and Amazon Comprehend.

Twilio Media Streams works in the context of a traditional Twilio voice application, like an Interactive Voice Response (IVR), that serves customers directly, as well as a contact center, like Twilio Flex, where agents are serving consumers. You have discrete control over your voice data within your contact center to build the experience your customers prefer.

Amazon Transcribe Medical is an ML service that makes it easy to quickly create accurate transcriptions between patients and physicians. Amazon Comprehend Medical is a natural language processing (NLP) service that makes it easy to use ML to extract relevant medical information from unstructured text. You can quickly and accurately gather information (such as medical condition, medication, dosage, strength, and frequency), from a variety of sources (like doctors’ notes, clinical trial reports, and patient health records). Amazon Comprehend Medical can also link the detected information to medical ontologies such as ICD-10-CM or RxNorm so downstream healthcare applications can use it easily.
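For example, once a transcript segment is available as text, extracting entities and ICD-10-CM links takes a couple of boto3 calls; the sample sentence below is illustrative.

import boto3

cm = boto3.client("comprehendmedical")
text = "Patient reports taking metformin 500 mg twice daily for type 2 diabetes."

# Detect medications, dosage, frequency, and medical conditions
entities = cm.detect_entities_v2(Text=text)
for entity in entities["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"], round(entity["Score"], 2))

# Link detected conditions to ICD-10-CM codes
icd10 = cm.infer_icd10_cm(Text=text)
for entity in icd10["Entities"]:
    top = entity["ICD10CMConcepts"][0]
    print(entity["Text"], "->", top["Code"], top["Description"])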

The following diagram illustrates how Amazon Comprehend Medical supports medical named entity and relationship extractions.

Amazon Transcribe Medical, Amazon Comprehend Medical, and Twilio Media Streams are all managed platforms. This means that data scientists and healthcare IT teams don't need to build services from the ground up. Voice integration is provided by Twilio and AWS ML service APIs, and only requires a simple plug-and-play with AWS and Twilio services to build the end-to-end workflow.

Solution overview

Our solution uses Twilio Media Streams to provide telephony service to the customer. This service provides a telephone number and backend to media services to integrate it with REST API-based web applications. In this solution, we build a Node.js web app and deploy it with AWS Amplify. Amplify helps front-end web and mobile developers build secure, scalable, full stack applications. The web app interfaces with Twilio Media Streams to receive phone calls in voice format, and uses Amazon Transcribe Medical to convert voice to text. Upon receiving the transcription, the application interfaces with Amazon Comprehend Medical to extract medical terms and insights from the transcription. The insights are displayed on the web app and stored in an Amazon DynamoDB table for further analysis. The solution also uses Amazon Simple Storage Service (Amazon S3) and an AWS Cloud9 environment.

The following diagram illustrates the solution architecture.

To implement the solution, we complete the following high-level steps:

  1. Create a trial Twilio account.
  2. Create an AWS Identity and Access Management (IAM) user.
  3. Create an AWS Cloud9 integrated development environment (IDE).
  4. Clone the GitHub repo.
  5. Create a secured HTTPS tunnel using ngrok and set up Twilio phone number’s voice configuration.
  6. Run the application.

Create a trial Twilio account

Before getting started, make sure to sign up for a trial Twilio account (https://www.twilio.com/try-twilio), if you don’t already have one.

Create an IAM user

To create an IAM user, complete the following steps:

  1. On the IAM console, under Access management, choose Users.
  2. Choose Add user.
  3. On the Set user details page, for User name, enter a name.
  4. For Access type, select Programmatic access.
  5. Choose Next: Permissions.

  6. On the Set permissions page, choose Attach existing policies directly.
  7. Select the following AWS managed policies: AmazonTranscribeFullAccess, ComprehendMedicalFullAccess, AmazonDynamoDBFullAccess, and AmazonS3FullAccess.
  8. Choose Next: Tags.
  9. Skip adding tags and choose Next: Review.
  10. Review the IAM user details and attached policies, and choose Create user.
  11. On the next page, copy the access key ID and secret access key to your clipboard or download the CSV file.

We use these credentials for testing the Node.js application.
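If you prefer the command line, the following AWS CLI sketch creates an equivalent user; the user name transcribe-medical-demo-user is only an example:

# Create the IAM user
aws iam create-user --user-name transcribe-medical-demo-user

# Attach the four managed policies the application needs
for policy in AmazonTranscribeFullAccess ComprehendMedicalFullAccess \
    AmazonDynamoDBFullAccess AmazonS3FullAccess; do
    aws iam attach-user-policy \
        --user-name transcribe-medical-demo-user \
        --policy-arn arn:aws:iam::aws:policy/$policy
done

# Generate the access key ID and secret access key
aws iam create-access-key --user-name transcribe-medical-demo-user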

Create an S3 bucket

To create your S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a name for the bucket.
  3. For Block Public Access settings for this bucket, select Block all public access.
  4. Review the settings and choose Create bucket.
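You can also create the bucket with the AWS CLI; the bucket name below is only an example and must be globally unique:

# Create the bucket (outside us-east-1, add --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket --bucket my-twilio-transcribe-demo-bucket

# Block all public access on the bucket
aws s3api put-public-access-block \
    --bucket my-twilio-transcribe-demo-bucket \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true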

Create a DynamoDB table

To create your DynamoDB table, complete the following steps:

  1. On the DynamoDB console, choose Create table.
  2. For Table name, enter a name for the table.
  3. For Primary key, enter ROWID.
  4. Review the table settings and choose Create.
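The equivalent AWS CLI command follows; the table name is an example, and we assume ROWID is a string partition key:

# Create the table with ROWID as the partition key (assumed to be a string)
aws dynamodb create-table \
    --table-name twilio-medical-transcribe-demo \
    --attribute-definitions AttributeName=ROWID,AttributeType=S \
    --key-schema AttributeName=ROWID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST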

Create an AWS Cloud9 environment

To create your AWS Cloud9 environment, complete the following steps:

  1. On the AWS Cloud9 console, choose Environments.
  2. Choose Create environment.
  3. For Name, enter a name for the environment.
  4. For Description, enter an optional description.
  5. Choose Next step.
  6. On the Configure settings page, select Ubuntu Server 18.04 LTS for Platform and leave the other settings at their defaults.
  7. Review the settings and choose Create environment.

The AWS Cloud9 IDE opens in a new browser tab; you may have to wait a few minutes for the environment creation to complete.

Clone the GitHub repo

In the AWS Cloud9 environment, close the Welcome and AWS Toolkit – QuickStart tabs. To clone the GitHub repository, enter the following commands in the bash terminal:

git clone https://github.com/aws-samples/amazon-transcribe-comprehend-medical-twilio

cd amazon-transcribe-comprehend-medical-twilio && npm install --silent

Edit the config.json file in the project directory, replacing the values with the names of your S3 bucket and DynamoDB table.

Set up ngrok and the Twilio phone number

Before we start the Node.js application, we need to start a secured HTTPS tunnel using ngrok and set up the Twilio phone number’s voice configuration.

  1. On the terminal tab bar, choose the + icon.
  2. Choose New Terminal.
  3. In the new terminal, install ngrok:
    sudo snap install ngrok
  4. After ngrok is installed, run the following command to expose the local Express Node.js server to the internet:
    ngrok http 8080
  5. Copy the public HTTPS URL.

You use this URL for the Twilio phone number’s voice configuration.

  1. Sign in to your Twilio account.
  2. On the dashboard, choose the icon to open the Settings menu.
  3. Choose Phone Numbers.
  4. On the Phone Numbers page, choose your Twilio phone number.
  5. In the Voice section, for A Call Comes In, choose Webhook.
  6. Enter the ngrok tunnel URL followed by /twiml (for example, https://<your-subdomain>.ngrok.io/twiml).
  7. Save the configuration.

Run the application

Now run the application, which ties together Twilio Media Streams, Amazon Transcribe Medical, and Amazon Comprehend Medical, by entering the following command:

npm start

We can preview the application in AWS Cloud9. In the environment, on the Preview menu, choose Preview Running Application.

You can copy the public URL to view the application in another browser tab.

Enter the IAM user’s access key ID and secret access key, and your Twilio account SID, auth token, and phone number.

Demonstration

In this section, we use two sample recordings to demonstrate real-time audio transcription with Twilio Media Streams.

After you enter your IAM and Twilio credentials, choose Submit Credentials.

The following screenshot shows the transcription for our first audio file, sample-1.mp4.

The following screenshot shows the transcription for our second file, sample-3.mp4.

This application uses Amazon Transcribe Medical to transcribe media content in real time, and stores the output in Amazon S3 for further analysis. The application then uses Amazon Comprehend Medical to detect the following entities:

  • ANATOMY – Detects references to the parts of the body or body systems and the locations of those parts or systems
  • MEDICAL_CONDITION – Detects the signs, symptoms, and diagnosis of medical conditions
  • MEDICATION – Detects medication and dosage information for the patient
  • PROTECTED_HEALTH_INFORMATION – Detects the patient’s personal information
  • TEST_TREATMENT_PROCEDURE – Detects the procedures that are used to determine a medical condition
  • TIME_EXPRESSION – Detects entities related to time when they are associated with a detected entity

These entities are stored in the DynamoDB table. Healthcare providers can use this data to create patient diagnoses and treatment plans.
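To spot-check what the application has written, you can scan a few items from the table with the AWS CLI (the table name below is the example used earlier):

# Return a handful of stored records for inspection
aws dynamodb scan --table-name twilio-medical-transcribe-demo --max-items 5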

You can further analyze this data through services such as Amazon Elasticsearch Service (Amazon ES) and Amazon Kendra.

Clean up your resources

Many of the AWS services used in this solution are covered by the AWS Free Tier. To avoid incurring additional charges when you’re finished, clean up the following resources:

  • AWS Cloud9 environment
  • S3 bucket
  • DynamoDB table
  • IAM user
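If you created these resources with the AWS CLI commands shown earlier, the following sketch removes them; replace the example names and IDs with your own:

# Empty and delete the S3 bucket
aws s3 rb s3://my-twilio-transcribe-demo-bucket --force

# Delete the DynamoDB table
aws dynamodb delete-table --table-name twilio-medical-transcribe-demo

# Delete the AWS Cloud9 environment (list environment IDs with: aws cloud9 list-environments)
aws cloud9 delete-environment --environment-id <environment-id>

# Remove the IAM user's access key, detach its policies, and then delete the user
aws iam delete-access-key --user-name transcribe-medical-demo-user --access-key-id <access-key-id>
for policy in AmazonTranscribeFullAccess ComprehendMedicalFullAccess \
    AmazonDynamoDBFullAccess AmazonS3FullAccess; do
    aws iam detach-user-policy \
        --user-name transcribe-medical-demo-user \
        --policy-arn arn:aws:iam::aws:policy/$policy
done
aws iam delete-user --user-name transcribe-medical-demo-user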

Conclusion

In this post, we showed how to integrate Twilio Media Streams with Amazon Transcribe Medical and Amazon Comprehend Medical to transcribe and analyze medical data from audio files. You can also use this solution in non-healthcare industries to transcribe information from audio.

We invite you to check out the code in the GitHub repo and try out the solution, and even expand on the data analysis with Amazon ES or Amazon Kendra.


About the Author

Mahendra Bairagi is a Principal Machine Learning Prototyping Architect at Amazon Web Services. He helps customers build machine learning solutions on AWS. He has extensive experience with ML, robotics, IoT, and analytics services. Prior to joining Amazon Web Services, he had a long tenure as an entrepreneur, enterprise architect, and software developer.

Jay Park is a Prototyping Solutions Architect for AWS. Jay is focused on helping AWS customers speed their adoption of cloud-native workloads through rapid prototyping.
