Celebrate over 20 years of AI/ML at Innovation Day

Be our guest as we celebrate 20 years of AI/ML innovation on October 25, 2022, 9:00 AM – 10:30 AM PT.  The first 1,500 people to register will receive $50 of AWS credits. Register here.

Over the past 20 years, Amazon has delivered many world firsts for artificial intelligence (AI) and machine learning (ML). ML is an integral part of Amazon and is used for everything from applying personalization models at checkout, to forecasting the demand for products globally, to creating autonomous flight for Amazon Prime Air drones, to natural language processing (NLP) on Alexa. And the use of ML isn’t slowing down anytime soon, because ML helps Amazon exceed customer expectations for convenience, cost, and delivery speed.

During the virtual AI/ML innovation event on October 25, we'll take time to reflect on what's been done at Amazon and how we have packaged this innovation into a wide breadth and depth of AI/ML services. The AWS ML stack helps you rapidly innovate and enhance customer experiences, enable faster and better decision-making, and optimize business processes using the same technology that Amazon uses every day. With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight.

We will also take a moment to celebrate customer success using AWS to harness the power of data with ML and, in many cases, change the way we live for the better. Mueller Water Products, Siemens Energy, Inspire, and ResMed will show what’s possible using ML for sustainability and accessibility challenges such as water conservation, predictive maintenance for industrial plants, personalized medical care resources for patients and caregivers, and cloud-connected customized recommendations for patients and their healthcare providers.

The 90-minute session doesn’t stop there! We have special guest speaker Professor Michael Jordan, who will talk about the decision-making side of ML spanning computational, inferential, and economic perspectives. Much of the recent focus in ML has been on the pattern recognition side of the field. In Professor Jordan’s talk, he will focus on the decision-making side, where many fundamental challenges remain. Some are statistical in nature, including the challenges associated with making multiple decisions. Others are economic, involving learning systems that must cope with scarcity, competition, and incentives, and some are algorithmic, including the challenge of coordinated decision-making on distributed platforms and the need for algorithms to converge to equilibria rather than optima. He will ponder how next-generation ML platforms can provide environments that support this kind of large-scale, dynamic, data-aware, and market-aware decision-making.

Finally, we wrap up the celebration with Dr. Bratin Saha, VP of AI/ML, who will explain how AWS AI/ML has grown to over 100,000 customers so quickly, including how Amazon SageMaker became one of the fastest growing services in the history of AWS. Hint—SageMaker incorporates many world firsts, including fully managed infrastructure, tools such as IDEs and feature stores, workflows, AutoML, and no-code capabilities.

AWS has helped foster ML growth through capabilities that help you operationalize processes and deploy ML at scale. We have seen this play out in many different industries. For example, in the automotive industry, the assembly line standardized automotive design and manufacturing, and launched a revolution in transportation by helping us transition from hand-assembled cars to mass production.

Similarly, the software industry went from a few specialized business applications to becoming ubiquitous in every aspect of our lives. That happened through automation, tooling, and implementing and standardizing processes—in effect through the industrialization of software. In the same way, ML services from AWS are driving this transformation. In fact, customers today are running millions of models, billions of parameters, and hundreds of billions of predictions on AWS.

Dr. Saha will also look back at the history of flagship AI services, including services for text and documents, speech, vision, healthcare, industrial, search, business processes, and DevOps. He will explain how to use the AI Use Case Explorer, where you can explore use cases, discover customer success stories, and mobilize your team around the power of AI and ML. Dr. Saha will end on his vision for AWS AI/ML services.

We can’t wait to celebrate with you, so register now! If you’re among the first 1,500 people to register, you will receive $50 of AWS credits.

Happy innovating!


About the author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

Read More

AWS Panorama now supports NVIDIA JetPack SDK 4.6.2

AWS Panorama is a collection of machine learning (ML) devices and a software development kit (SDK) that brings computer vision to on-premises internet protocol (IP) cameras. AWS Panorama device options include the AWS Panorama Appliance and the Lenovo ThinkEdge SE70, powered by AWS Panorama. These device options provide you choices in price and performance, depending on your unique use case. Both AWS Panorama devices are built on NVIDIA® Jetson™ System on Modules (SOMs) and use the NVIDIA JetPack SDK.

AWS has released a new software update for AWS Panorama that supports NVIDIA JetPack SDK version 4.6.2. You can download this software update and apply it to the AWS Panorama device via an over-the-air (OTA) upgrade process. For more details, see Managing an AWS Panorama Appliance.

This release is not backward compatible with previous software releases for AWS Panorama; you must rebuild and redeploy your applications. This post provides a step-by-step guide to updating your application software libraries to the latest supported versions.

Overview of update

NVIDIA JetPack SDK version 4.6.2 includes newer versions of key libraries: CUDA 10.2, cuDNN 8.2.1, and TensorRT 8.2.1. Other notable libraries now supported include DeepStream 6.0 and Triton Inference Server 21.07. In addition, TensorRT 8.2.1 includes an expanded list of supported DNN operators, Sigmoid/Tanh INT8 support for DLA, and better integration with TensorFlow and PyTorch. Torch to TensorRT conversion is now supported, as well as TensorFlow to TensorRT, without the need to convert to ONNX as an intermediate step. For additional details, refer to NVIDIA Announces TensorRT 8.2 and Integrations with PyTorch and TensorFlow.
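
To illustrate the direct Torch-to-TensorRT path, the following is a minimal sketch that compiles a PyTorch module with the torch-tensorrt package. The availability of torch-tensorrt in your build environment, the pretrained model, and the input shape are assumptions made for this example, not part of the AWS Panorama tooling.

import torch
import torch_tensorrt  # assumption: the torch-tensorrt package is installed in the build environment
import torchvision

# Load a pretrained model and switch it to inference mode on the GPU
model = torchvision.models.resnet18(pretrained=True).eval().cuda()

# Compile directly from Torch to a TensorRT-optimized module (no ONNX export step)
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],  # illustrative input shape
    enabled_precisions={torch.float},                 # FP32; use torch.half for FP16 where supported
)

# Run inference with the optimized module
example_input = torch.randn(1, 3, 224, 224).cuda()
output = trt_model(example_input)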

You can redeploy your applications by following the steps in the following sections.

Prerequisites

As a prerequisite, you need an AWS account and an AWS Panorama device.

Upgrade your AWS Panorama device

First, you upgrade your AWS Panorama device to the latest version.

  1. On the AWS Panorama console, choose Devices in the navigation pane.
  2. Choose an Appliance.
  3. Choose Settings.
  4. Under System software, choose View software update.
  5. Choose system software version 5.0 or above, then proceed to install it.

Redeploy your application

If you don’t use the Open GPU access feature in your application, you can use the Replace feature on the AWS Panorama console. The Replace function rebuilds your model for the latest software.

  1. On the AWS Panorama console, choose Deployed applications in the navigation pane.
  2. Select an application.
  3. Choose Replace.

For applications that use the Open GPU access feature, your container accesses the underlying GPU hardware directly and you deploy and manage your own models and runtime, so upgrading involves additional steps. We recommend using NVIDIA TensorRT in your application, but you’re not limited to it.

You also need to update the libraries in your Dockerfile. Typical libraries to update include CUDA 10.2, cuDNN 8.2.1, TensorRT 8.2.1, DeepStream 6.0, OpenCV 4.1.1, and VPI 1.1. All related CUDA/NVIDIA changes in the software stack are listed at JetPack SDK 4.6.2.

Next, rebuild your models for TensorRT 8.2.1 and update your package.json file with the updated assets. You can then build your container with the updated dependencies and models and deploy the application container to your appliance using the AWS Panorama console or APIs.

At this point, your AWS Panorama applications should be able to use JetPack SDK version 4.6.2. AWS Panorama sample applications that are compatible with this version are located on GitHub.

Conclusion

With the new update to AWS Panorama, you must rebuild and redeploy your applications. This post walked you through the steps to update your AWS Panorama application software libraries to the latest supported versions.

Please reach out to AWS Panorama with any questions at AWS re:Post.


About the Authors

Vinod Raman is a Principal Product Manager at AWS.

Steven White is a Senior Computer Vision/EdgeML Solutions Architect at AWS.

Read More

Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker

In this post, we demonstrate how Kubeflow on AWS (an AWS-specific distribution of Kubeflow) used with AWS Deep Learning Containers and Amazon Elastic File System (Amazon EFS) simplifies collaboration and provides flexibility in training deep learning models at scale on both Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon SageMaker utilizing a hybrid architecture approach.

Machine learning (ML) development relies on complex and continuously evolving open-source frameworks and toolkits, as well as complex and continuously evolving hardware ecosystems. This poses a challenge when scaling out ML development to a cluster. Containers offer a solution, because they can fully encapsulate not just the training code, but the entire dependency stack down to the hardware libraries. This ensures an ML environment that is consistent and portable, and facilitates reproducibility of the training environment on each individual node of the training cluster.

Kubernetes is a widely adopted system for automating infrastructure deployment, resource scaling, and management of these containerized applications. However, Kubernetes wasn’t built with ML in mind, so it can feel counterintuitive to data scientists due to its heavy reliance on YAML specification files. There is no built-in Jupyter experience, and Kubernetes lacks ML-specific capabilities such as workflow management and pipelines, as well as other capabilities ML practitioners expect, such as hyperparameter tuning and model hosting. Such capabilities can be built, but Kubernetes wasn’t designed to provide them as its primary objective.

The open-source community took notice and developed a layer on top of Kubernetes called Kubeflow. Kubeflow aims to make the deployment of end-to-end ML workflows on Kubernetes simple, portable, and scalable. You can use Kubeflow to deploy best-of-breed open-source systems for ML to diverse infrastructures.

Kubeflow and Kubernetes provide flexibility and control to data science teams. However, ensuring high utilization of training clusters running at scale with reduced operational overhead is still challenging.

This post demonstrates how customers who have on-premises restrictions or existing Kubernetes investments can address this challenge by using Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training with a self-managed approach, while using SageMaker for cost-optimized, fully managed, production-scale training infrastructure. This includes a step-by-step implementation of a hybrid distributed training architecture that lets you choose between the two approaches at runtime, giving you maximum control and flexibility for deployments with stringent requirements. You will see how you can continue using open-source libraries in your deep learning training script and keep it compatible to run on both Kubernetes and SageMaker in a platform-agnostic way.

How do Kubeflow on AWS and SageMaker help?

Neural network models built with deep learning frameworks like TensorFlow, PyTorch, MXNet, and others provide much higher accuracy by using significantly larger training datasets, especially in computer vision and natural language processing use cases. However, with large training datasets, it takes longer to train the deep learning models, which ultimately slows down the time to market. If we could scale out a cluster and bring down the model training time from weeks to days or hours, it could have a huge impact on productivity and business velocity.

Amazon EKS provides a managed Kubernetes control plane. You can use Amazon EKS to create large-scale training clusters with CPU and GPU instances and use the Kubeflow toolkit to provide ML-friendly, open-source tools and to operationalize ML workflows that are portable and scalable using Kubeflow Pipelines, improving your team’s productivity and reducing the time to market.

However, there could be a couple of challenges with this approach:

  • Ensuring maximum utilization of a cluster across data science teams. For example, you should provision GPU instances on demand and ensure their high utilization for demanding production-scale tasks such as deep learning training, and use CPU instances for less demanding tasks such as data preprocessing.
  • Ensuring high availability of heavyweight Kubeflow infrastructure components, including database, storage, and authentication, that are deployed on Kubernetes cluster worker nodes. For example, the Kubeflow control plane generates artifacts (such as MySQL instances, pod logs, or MinIO storage) that grow over time and need resizable storage volumes with continuous monitoring capabilities.
  • Sharing the training dataset, code, and compute environments between developers, training clusters, and projects is challenging. For example, if you’re working on your own set of libraries and those libraries have strong interdependencies, it gets really hard to share and run the same piece of code between data scientists in the same team. Also, each training run requires you to download the training dataset and build the training image with new code changes.

Kubeflow on AWS helps address these challenges and provides an enterprise-grade semi-managed Kubeflow product. With Kubeflow on AWS, you can replace some Kubeflow control plane services like database, storage, monitoring, and user management with AWS managed services like Amazon Relational Database Service (Amazon RDS), Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx, Amazon CloudWatch, and Amazon Cognito.

Replacing these Kubeflow components decouples critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design. This approach also frees up storage and compute resources from the EKS data plane, which may be needed by applications such as distributed model training or user notebook servers. Kubeflow on AWS also provides native integration of Jupyter notebooks with Deep Learning Container (DLC) images, which are pre-packaged and preconfigured with AWS optimized deep learning frameworks such as PyTorch and TensorFlow that allow you to start writing your training code right away without dealing with dependency resolutions and framework optimizations. Also, Amazon EFS integration with training clusters and the development environment allows you to share your code and processed training dataset, which avoids building the container image and loading huge datasets after every code change. These integrations with Kubeflow on AWS help you speed up the model building and training time and allow for better collaboration with easier data and code sharing.

Kubeflow on AWS helps build a highly available and robust ML platform. This platform provides flexibility to build and train deep learning models and provides access to many open-source toolkits, insights into logs, and interactive debugging for experimentation. However, achieving maximum utilization of infrastructure resources while training deep learning models on hundreds of GPUs still involves a lot of operational overhead. This could be addressed by using SageMaker, which is a fully managed service designed and optimized for handling performant and cost-optimized training clusters that are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing close to 100% resource utilization. You can integrate SageMaker with Kubeflow Pipelines using managed SageMaker components. This allows you to operationalize ML workflows as part of Kubeflow pipelines, where you can use Kubernetes for local training and SageMaker for production-scale training in a hybrid architecture.

Solution overview

The following architecture describes how we use Kubeflow Pipelines to build and deploy portable and scalable end-to-end ML workflows to conditionally run distributed training on Kubernetes using Kubeflow training or SageMaker based on the runtime parameter.

Kubeflow Training is a group of Kubernetes operators that add support to Kubeflow for distributed training of ML models using frameworks like TensorFlow, PyTorch, and others. pytorch-operator is the Kubeflow implementation of the Kubernetes custom resource (PyTorchJob) used to run distributed PyTorch training jobs on Kubernetes.

We use the PyTorchJob Launcher component as part of the Kubeflow pipeline to run PyTorch distributed training during the experimentation phase when we need flexibility and access to all the underlying resources for interactive debugging and analysis.

We also use SageMaker components for Kubeflow Pipelines to run our model training at production scale. This allows us to take advantage of powerful SageMaker features such as fully managed services, distributed training jobs with maximum GPU utilization, and cost-effective training through Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.

As part of the workflow creation process, you complete the following steps (as shown in the preceding diagram) to create this pipeline:

  1. Use the Kubeflow manifest file to create a Kubeflow dashboard and access Jupyter notebooks from the Kubeflow central dashboard.
  2. Use the Kubeflow pipeline SDK to create and compile Kubeflow pipelines using Python code. Pipeline compilation converts the Python function to a workflow resource, which is an Argo-compatible YAML format.
  3. Use the Kubeflow Pipelines SDK client to call the pipeline service endpoint to run the pipeline.
  4. The pipeline evaluates the conditional runtime variables and decides between SageMaker or Kubernetes as the target run environment.
  5. Use the Kubeflow PyTorch Launcher component to run distributed training on the native Kubernetes environment, or use the SageMaker component to submit the training on the SageMaker managed platform.

The following figure shows the Kubeflow Pipelines components involved in the architecture that give us the flexibility to choose between Kubernetes or SageMaker distributed environments.

Kubeflow Pipelines components

Use case workflow

We use the following step-by-step approach to install and run the use case for distributed training on Amazon EKS and SageMaker using Kubeflow on AWS.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A machine with Docker and the AWS Command Line Interface (AWS CLI) installed.
  • Optionally, you can use AWS Cloud9, a cloud-based integrated development environment (IDE) that enables completing all the work from your web browser. For setup instructions, refer to Setup Cloud9 IDE. From your Cloud9 environment, choose the plus sign and open a new terminal.
  • Create a role with the name sagemakerrole. Add the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker access to S3 buckets. This role is used by the SageMaker training job submitted as part of the Kubeflow Pipelines step; a programmatic sketch for creating it follows this list.
  • Make sure your account has the SageMaker Training resource limit for ml.p3.2xlarge increased to 2 by using the Service Quotas console.
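
If you prefer to create the sagemakerrole role programmatically instead of in the IAM console, a minimal boto3 sketch looks like the following. The trust policy allowing SageMaker to assume the role is an assumption of this sketch rather than a snippet from the walkthrough.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that allows SageMaker to assume the role (assumed for this sketch)
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="sagemakerrole",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the managed policies named in the prerequisites
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]:
    iam.attach_role_policy(RoleName="sagemakerrole", PolicyArn=policy_arn)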

1. Install Amazon EKS and Kubeflow on AWS

You can use several different approaches to build a Kubernetes cluster and deploy Kubeflow. In this post, we focus on an approach that we believe brings simplicity to the process. First, we create an EKS cluster, then we deploy Kubeflow on AWS v1.5 on it. For each of these tasks, we use a corresponding open-source project that follows the principles of the Do Framework. Rather than installing a set of prerequisites for each task, we build Docker containers that have all the necessary tools and perform the tasks from within the containers.

We use the Do Framework in this post, which automates the Kubeflow deployment with Amazon EFS as an add-on. For the official Kubeflow on AWS deployment options for production deployments, refer to Deployment.

Configure the current working directory and AWS CLI

We configure a working directory so we can refer to it as the starting point for the steps that follow:

export working_dir=$PWD

We also configure an AWS CLI profile. To do so, you need an access key ID and secret access key of an AWS Identity and Access Management (IAM) user account with administrative privileges (attach the existing managed policy) and programmatic access. See the following code:

aws configure --profile=kubeflow
AWS Access Key ID [None]: <enter access key id>
AWS Secret Access Key [None]: <enter secret access key>
Default region name [None]: us-west-2
Default output format [None]: json

# (In Cloud9, select “Cancel” and “Permanently disable” when the AWS managed temporary credentials dialog pops up)

export AWS_PROFILE=kubeflow

1.1 Create an EKS cluster

If you already have an EKS cluster available, you can skip to the next section. For this post, we use the aws-do-eks project to create our cluster.

  1. First clone the project in your working directory
    cd ${working_dir}
    git clone https://github.com/aws-samples/aws-do-eks
    cd aws-do-eks/

  2. Then build and run the aws-do-eks container:
    ./build.sh
    ./run.sh

    The build.sh script creates a Docker container image that has all the necessary tools and scripts for provisioning and operation of EKS clusters. The run.sh script starts a container using the created Docker image and keeps it up, so we can use it as our EKS management environment. To see the status of your aws-do-eks container, you can run ./status.sh. If the container is in Exited status, you can use the ./start.sh script to bring the container up, or to restart the container, you can run ./stop.sh followed by ./run.sh.

  3. Open a shell in the running aws-do-eks container:
    ./exec.sh

  4. To review the EKS cluster configuration for our Kubeflow deployment, run the following command:
    vi ./eks-kubeflow.yaml

    By default, this configuration creates a cluster named eks-kubeflow in the us-west-2 Region with six m5.xlarge nodes. Also, EBS volume encryption is not enabled by default. You can enable it by adding "volumeEncrypted: true" to the nodegroup, which encrypts volumes with the default key. Modify other configuration settings if needed.

  5. To create the cluster, run the following command:
    export AWS_PROFILE=kubeflow
    eksctl create cluster -f ./eks-kubeflow.yaml

    The cluster provisioning process may take up to 30 minutes.

  6. To verify that the cluster was created successfully, run the following command:
    kubectl get nodes

    The output from the preceding command for a cluster that was created successfully looks like the following code:

    root@cdf4ecbebf62:/eks# kubectl get nodes
    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-192-168-0-166.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-13-28.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-45-240.us-west-2.compute.internal   Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-63-84.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-75-56.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-85-226.us-west-2.compute.internal   Ready    <none>   23m   v1.21.14-eks-ba74326

Create an EFS volume for the SageMaker training job

In this use case, you speed up the SageMaker training job by training deep learning models from data already stored in Amazon EFS. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times.

We create an EFS volume and deploy the EFS Container Storage Interface (CSI) driver. This is accomplished by a deployment script located in /eks/deployment/csi/efs within the aws-do-eks container.

This script assumes you have one EKS cluster in your account. Set CLUSTER_NAME=<eks_cluster_name> in case you have more than one EKS cluster.

cd /eks/deployment/csi/efs
./deploy.sh

This script provisions an EFS volume and creates mount targets for the subnets of the cluster VPC. It then deploys the EFS CSI driver and creates the efs-sc storage class and efs-pv persistent volume in the EKS cluster.

Upon successful completion of the script, you should see output like the following:

Generating efs-sc.yaml ...

Applying efs-sc.yaml ...
storageclass.storage.k8s.io/efs-sc created
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
efs-sc          efs.csi.aws.com         Delete          Immediate              false                  1s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  36m

Generating efs-pv.yaml ...
Applying efs-pv.yaml ...
persistentvolume/efs-pv created
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
efs-pv   5Gi        RWX            Retain           Available           efs-sc                  10s

Done ...

Create an Amazon S3 VPC endpoint

You use a private VPC that your SageMaker training job and EFS file system have access to. To give the SageMaker training cluster access to the S3 buckets from your private VPC, you create a VPC endpoint:

cd /eks/vpc 
export CLUSTER_NAME=<eks-cluster> 
export REGION=<region> 
./vpc-endpoint-create.sh
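
For reference, creating an S3 gateway endpoint programmatically with boto3 looks roughly like the following sketch. The VPC ID and route table IDs are illustrative placeholders, and this is an approximation of the step rather than a description of the script’s internals.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Create a gateway endpoint so traffic to Amazon S3 stays inside the VPC
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # VPC of your EKS cluster (placeholder)
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route tables of the private subnets (placeholder)
)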

You may now exit the aws-do-eks container shell and proceed to the next section:

exit

root@cdf4ecbebf62:/eks/deployment/csi/efs# exit
exit
TeamRole:~/environment/aws-do-eks (main) $

1.2 Deploy Kubeflow on AWS on Amazon EKS

To deploy Kubeflow on Amazon EKS, we use the aws-do-kubeflow project.

  1. Clone the repository using the following commands:
    cd ${working_dir}
    git clone https://github.com/aws-samples/aws-do-kubeflow
    cd aws-do-kubeflow

  2. Then configure the project:
    ./config.sh

    This script opens the project configuration file in a text editor. It’s important for AWS_REGION to be set to the Region your cluster is in, as well as AWS_CLUSTER_NAME to match the name of the cluster that you created earlier. By default, your configuration is already properly set, so if you don’t need to make any changes, just close the editor.

    ./build.sh
    ./run.sh
    ./exec.sh

    The build.sh script creates a Docker container image that has all the tools necessary to deploy and manage Kubeflow on an existing Kubernetes cluster. The run.sh script starts a container, using the Docker image, and the exec.sh script opens a command shell into the container, which we can use as our Kubeflow management environment. You can use the ./status.sh script to see if the aws-do-kubeflow container is up and running and the ./stop.sh and ./run.sh scripts to restart it as needed.

  3. After you have a shell opened in the aws-do-kubeflow container, you can verify that the configured cluster context is as expected:
    root@ip-172-31-43-155:/kubeflow# kubectx
    kubeflow@eks-kubeflow.us-west-2.eksctl.io

  4. To deploy Kubeflow on the EKS cluster, run the kubeflow-deploy.sh script:
    ./kubeflow-deploy.sh

    The deployment is successful when all pods in the kubeflow namespace enter the Running state. A typical output looks like the following code:

    Waiting for all Kubeflow pods to start Running ...
    
    Waiting for all Kubeflow pods to start Running ...
    
    Restarting central dashboard ...
    pod "centraldashboard-79f489b55-vr6lp" deleted
    /kubeflow/deploy/distro/aws/kubeflow-manifests /kubeflow/deploy/distro/aws
    /kubeflow/deploy/distro/aws
    
    Kubeflow deployment succeeded
    Granting cluster access to kubeflow profile user ...
    Argument not provided, assuming default user namespace kubeflow-user-example-com ...
    clusterrolebinding.rbac.authorization.k8s.io/kubeflow-user-example-com-cluster-admin-binding created
    Setting up access to Kubeflow Pipelines ...
    Argument not provided, assuming default user namespace kubeflow-user-example-com ...
    
    Creating pod-default for namespace kubeflow-user-example-com ...
    poddefault.kubeflow.org/access-ml-pipeline created

  5. To monitor the state of the Kubeflow pods, in a separate window, you can use the following command:
    watch kubectl -n kubeflow get pods

  6. Press Ctrl+C when all pods are Running, then expose the Kubeflow dashboard outside the cluster by running the following command:
    ./kubeflow-expose.sh

You should see output that looks like the following code:

root@ip-172-31-43-155:/kubeflow# ./kubeflow-expose.sh
root@ip-172-31-43-155:/kubeflow# Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

This command port-forwards the Istio ingress gateway service from your cluster to your local port 8080. To access the Kubeflow dashboard, visit http://localhost:8080 and log in using the default user credentials (user@example.com/12341234). If you’re running the aws-do-kubeflow container in AWS Cloud9, then you can choose Preview, then choose Preview Running Application. If you’re running on Docker Desktop, you may need to run the ./kubeflow-expose.sh script outside of the aws-do-kubeflow container.

2. Set up the Kubeflow on AWS environment

To set up your Kubeflow on AWS environment, we create an EFS volume and a Jupyter notebook.

2.1 Create an EFS volume

To create an EFS volume, complete the following steps (a programmatic alternative using the Kubernetes Python client is sketched after this list):

  • On the Kubeflow dashboard, choose Volumes in the navigation pane.
  • Choose New volume.
  • For Name, enter efs-sc-claim.
  • For Volume size, enter 10.
  • For Storage class, choose efs-sc.
  • For Access mode, choose ReadWriteOnce.
  • Choose Create.
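
The same persistent volume claim can also be created with the Kubernetes Python client, as in the following sketch. The kubeflow-user-example-com namespace and the 10Gi size mirror the defaults used in this walkthrough and are assumptions of this sketch.

from kubernetes import client, config

# Load the kubeconfig for the EKS cluster (in-cluster config also works from a notebook pod)
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(
        name="efs-sc-claim",
        namespace="kubeflow-user-example-com",  # assumed default profile namespace
    ),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="efs-sc",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow-user-example-com", body=pvc
)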

2.2 Create a Jupyter notebook

To create a new notebook, complete the following steps:

  • On the Kubeflow dashboard, choose Notebooks in the navigation pane.
  • Choose New notebook.
  • For Name, enter aws-hybrid-nb.
  • For Jupyter Docker Image, choose the image c9e4w0g3/notebook-servers/jupyter-pytorch:1.11.0-cpu-py38-ubuntu20.04-e3-v1.1 (the latest available jupyter-pytorch DLC image).
  • For CPU, enter 1.
  • For Memory, enter 5.
  • For GPUs, leave as None.
  • Don’t make any changes to the Workspace Volume section.
  • In the Data Volumes section, choose Attach existing volume and expand the Existing volume section.
  • For Name, choose efs-sc-claim.
  • For Mount path, enter /home/jovyan/efs-sc-claim.
    This mounts the EFS volume to your Jupyter notebook pod, and you can see the folder efs-sc-claim in your Jupyter lab interface. You save the training dataset and training code to this folder so the training clusters can access it without needing to rebuild the container images for testing.
  • Select Allow access to Kubeflow Pipelines in the Configuration section.
  • Choose Launch.
    Verify that your notebook is created successfully (it may take a couple of minutes).
  • On the Notebooks page, choose Connect to log in to the JupyterLab environment.
  • On the Git menu, choose Clone a Repository.
  • For Clone a repo, enter https://github.com/aws-samples/aws-do-kubeflow.

3. Run distributed training

After you set up the Jupyter notebook, you can run the entire demo using the following high-level steps from the folder aws-do-kubeflow/workshop in the cloned repository:

  • PyTorch Distributed Data Parallel (DDP) training script: Refer to the PyTorch DDP training script cifar10-distributed-gpu-final.py, which includes a sample convolutional neural network and logic to distribute training on a multi-node CPU and GPU cluster. (Refer to 3.1 for details.)
  • Install libraries: Run the notebook 0_initialize_dependencies.ipynb to initialize all dependencies. (Refer to 3.2 for details.)
  • Run distributed PyTorch job training on Kubernetes: Run the notebook 1_submit_pytorchdist_k8s.ipynb to create and submit distributed training on one primary and two worker containers using the Kubernetes custom resource PyTorchJob YAML file created with Python code. (Refer to 3.3 for details.)
  • Create a hybrid Kubeflow pipeline: Run the notebook 2_create_pipeline_k8s_sagemaker.ipynb to create the hybrid Kubeflow pipeline that runs distributed training on either SageMaker or Amazon EKS based on the runtime variable training_runtime. (Refer to 3.4 for details.)

Make sure you ran the notebook 1_submit_pytorchdist_k8s.ipynb before you start notebook 2_create_pipeline_k8s_sagemaker.ipynb.

In the subsequent sections, we discuss each of these steps in detail.

3.1 PyTorch Distributed Data Parallel (DDP) training script

As part of the distributed training, we train a classification model created by a simple convolutional neural network that operates on the CIFAR10 dataset. The training script cifar10-distributed-gpu-final.py contains only the open-source libraries and is compatible to run both on Kubernetes and SageMaker training clusters on either GPU devices or CPU instances. Let’s look at a few important aspects of the training script before we run our notebook examples.

We use the torch.distributed module, which contains PyTorch support and communication primitives for multi-process parallelism across nodes in the cluster:

...
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision
from torchvision import datasets, transforms
...

We create a simple image classification model using a combination of convolutional, max pooling, and linear layers to which a ReLU activation function is applied in the forward pass of the model training:

# Define models
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

We use the PyTorch DataLoader, which combines the dataset with a DistributedSampler (so that each process loads only its own subset of the data when training with torch.nn.parallel.DistributedDataParallel) and provides a single-process or multi-process iterator over the data:

# Define data loader for training dataset
def _get_train_data_loader(batch_size, training_dir, is_distributed):
    logger.info("Get train data loader")

    train_set = torchvision.datasets.CIFAR10(root=training_dir,
                                             train=True,
                                             download=False,
                                             transform=_get_transforms())

    train_sampler = (
        torch.utils.data.distributed.DistributedSampler(train_set) if is_distributed else None
    )

    return torch.utils.data.DataLoader(
        train_set,
        batch_size=batch_size,
        shuffle=train_sampler is None,
        sampler=train_sampler)
...

If the training cluster has GPUs, the script runs the training on CUDA devices and the device variable holds the default CUDA device:

device = "cuda" if torch.cuda.is_available() else "cpu"
...

Before you use PyTorch DistributedDataParallel to run distributed processing on multiple nodes, you need to initialize the distributed environment by calling init_process_group. This is done on each machine of the training cluster.

dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
...
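
For reference, in a Kubeflow PyTorchJob environment the rank and world size are typically available through environment variables that the training operator injects into each pod. The following sketch illustrates this; it is an approximation, not the exact logic of cifar10-distributed-gpu-final.py.

import os

import torch.distributed as dist

# The Kubeflow PyTorch training operator injects WORLD_SIZE and RANK into each pod
# (illustrative; the actual script may derive these values differently)
world_size = int(os.environ.get("WORLD_SIZE", 1))
host_rank = int(os.environ.get("RANK", 0))

if world_size > 1:
    dist.init_process_group(backend="gloo", rank=host_rank, world_size=world_size)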

We instantiate the classifier model and copy the model to the target device. If distributed training is enabled to run on multiple nodes, the DistributedDataParallel class is used as a wrapper object around the model object, which allows synchronous distributed training across multiple machines. The input data is split along the batch dimension, and a replica of the model is placed on each machine and each device.

model = Net().to(device)

if is_distributed:
    model = torch.nn.parallel.DistributedDataParallel(model)

...

3.2 Install libraries

You will install all the libraries necessary to run the PyTorch distributed training example. This includes the Kubeflow Pipelines SDK, the Training Operator Python SDK, the Python client for Kubernetes, and the Amazon SageMaker Python SDK.

#Please run the below commands to install necessary libraries

!pip install kfp==1.8.4

!pip install kubeflow-training

!pip install kubernetes

!pip install sagemaker

3.3 Run distributed PyTorch job training on Kubernetes

The notebook 1_submit_pytorchdist_k8s.ipynb creates the Kubernetes custom resource PyTorchJob YAML file using Kubeflow training and the Kubernetes client Python SDK. The following are a few important snippets from this notebook.

We create the PyTorchJob YAML with the primary and worker containers as shown in the following code:

# Define PyTorchJob custom resource manifest
pytorchjob = V1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=pytorch_distributed_jobname, namespace=user_namespace),
    spec=V1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)

This is submitted to the Kubernetes control plane using PyTorchJobClient:

# Create and submit the PyTorchJob custom resource to Kubernetes
pytorchjob_client = PyTorchJobClient()

pytorch_job_manifest = pytorchjob_client.create(pytorchjob)

View the Kubernetes training logs

You can view the training logs either from the same Jupyter notebook using Python code or from the Kubernetes client shell.

  • From the notebook, run the following code with the appropriate log_type parameter value to view the primary, worker, or all logs:
    #  Function Definition: read_logs(pyTorchClient: str, jobname: str, namespace: str, log_type: str) -> None:
    #    log_type: all, worker:all, master:all, worker:0, worker:1
    
    read_logs(pytorchjob_client, pytorch_distributed_jobname, user_namespace, "master:0")

  • From the Kubernetes client shell connected to the Kubernetes cluster, run the following commands using Kubectl to see the logs (substitute your namespace and pod names):
    kubectl get pods -n kubeflow-user-example-com
    kubectl logs <pod-name> -n kubeflow-user-example-com -f

    We set the world size to 3 because we’re distributing the training across three processes running in one primary and two worker pods. Data is split along the batch dimension, and a third of the data is processed by the model replica in each container.

3.4 Create a hybrid Kubeflow pipeline

The notebook 2_create_pipeline_k8s_sagemaker.ipynb creates a hybrid Kubeflow pipeline based on the conditional runtime variable training_runtime, as shown in the following code. The notebook uses the Kubeflow Pipelines SDK, which provides a set of Python packages to specify and run ML workflow pipelines. As part of this SDK, we use the following packages:

  • The domain-specific language (DSL) package decorator dsl.pipeline, which decorates the Python functions to return a pipeline
  • The dsl.Condition package, which represents a group of operations that are only run when a certain condition is met, such as checking whether the training_runtime value is sagemaker or kubernetes

See the following code:

# Define your training runtime value with either 'sagemaker' or 'kubernetes'
training_runtime='sagemaker'

# Create Hybrid Pipeline using Kubeflow PyTorch Training Operators and Amazon SageMaker Service
@dsl.pipeline(name="PyTorch Training pipeline", description="Sample training job test")
def pytorch_cnn_pipeline(<training parameters>):

    # Pipeline Step 1: evaluate the condition. You can enter any logic here. For demonstration, we check whether GPUs are needed for training
    condition_result = check_condition_op(training_runtime)

    # Pipeline Step 2: run training on Kubernetes using the PyTorch Training Operator. This runs if GPUs are not needed
    with dsl.Condition(condition_result.output == 'kubernetes', name="PyTorch_Comp"):
        train_task = pytorch_job_op(
            name=training_job_name,
            namespace=user_namespace,
            master_spec=json.dumps(master_spec_loaded),  # refer to the file at pipeline_yaml_specifications/pipeline_master_spec.yml
            worker_spec=json.dumps(worker_spec_loaded),  # refer to the file at pipeline_yaml_specifications/pipeline_worker_spec.yml
            delete_after_done=False
        ).after(condition_result)

    # Pipeline Step 3: run training on SageMaker using SageMaker Components for Kubeflow Pipelines. This runs if GPUs are needed
    with dsl.Condition(condition_result.output == 'sagemaker', name="SageMaker_Comp"):
        training = sagemaker_train_op(
            region=region,
            image=train_image,
            job_name=training_job_name,
            training_input_mode=training_input_mode,
            hyperparameters='{ \
                "backend": "'+str(pytorch_backend)+'", \
                "batch-size": "64", \
                "epochs": "3", \
                "lr": "'+str(learning_rate)+'", \
                "model-type": "custom", \
                "sagemaker_container_log_level": "20", \
                "sagemaker_program": "cifar10-distributed-gpu-final.py", \
                "sagemaker_region": "us-west-2", \
                "sagemaker_submit_directory": "'+source_s3+'" \
            }',
            channels=channels,
            instance_type=instance_type,
            instance_count=instance_count,
            volume_size=volume_size,
            max_run_time=max_run_time,
            model_artifact_path=f's3://{bucket_name}/jobs',
            network_isolation=network_isolation,
            traffic_encryption=traffic_encryption,
            role=role,
            vpc_subnets=subnet_id,
            vpc_security_group_ids=security_group_id
        ).after(condition_result)

We configure SageMaker distributed training using two ml.p3.2xlarge instances.

After the pipeline is defined, you can compile the pipeline to an Argo YAML specification using the Kubeflow Pipelines SDK’s kfp.compiler package. You can run this pipeline using the Kubeflow Pipeline SDK client, which calls the Pipelines service endpoint and passes in appropriate authentication headers right from the notebook. See the following code:

# DSL Compiler that compiles pipeline functions into workflow yaml.
kfp.compiler.Compiler().compile(pytorch_cnn_pipeline, "pytorch_cnn_pipeline.yaml")

# Connect to Kubeflow Pipelines using the Kubeflow Pipelines SDK client
client = kfp.Client()

experiment = client.create_experiment(name="kubeflow")

# Run a specified pipeline
my_run = client.run_pipeline(experiment.id, "pytorch_cnn_pipeline", "pytorch_cnn_pipeline.yaml")

# Please click “Run details” link generated below this cell to view your pipeline. You can click every pipeline step to see logs.

If you get a sagemaker import error, run !pip install sagemaker and restart the kernel (on the Kernel menu, choose Restart Kernel).

Choose the Run details link under the last cell to view the Kubeflow pipeline.

Repeat the pipeline creation step with training_runtime='kubernetes' to test the pipeline run on a Kubernetes environment. The training_runtime variable can also be passed in your CI/CD pipeline in a production scenario.

View the Kubeflow pipeline run logs for the SageMaker component

The following screenshot shows our pipeline details for the SageMaker component.

Choose the training job step and on the Logs tab, choose the CloudWatch logs link to access the SageMaker logs.

The following screenshot shows the CloudWatch logs for each of the two ml.p3.2xlarge instances.

Choose any of the groups to see the logs.

View the Kubeflow pipeline run logs for the Kubeflow PyTorchJob Launcher component

The following screenshot shows the pipeline details for our Kubeflow component.

Run the following commands using Kubectl on your Kubernetes client shell connected to the Kubernetes cluster to see the logs (substitute your namespace and pod names):

kubectl get pods -n kubeflow-user-example-com
kubectl logs <pod-name> -n kubeflow-user-example-com -f

4. Clean up

To clean up all the resources we created in the account, we need to remove them in reverse order.

  1. Delete the Kubeflow installation by running ./kubeflow-remove.sh in the aws-do-kubeflow container. The first set of commands is optional and can be used in case you don’t already have a command shell open into your aws-do-kubeflow container.
    cd aws-do-kubeflow
    ./status.sh
    ./start.sh
    ./exec.sh
    
    ./kubeflow-remove.sh

  2. From the aws-do-eks container folder, remove the EFS volume. The first set of commands is optional and can be used in case you don’t already have a command shell into your aws-do-eks container open.
    cd aws-do-eks
    ./status.sh
    ./start.sh
    ./exec.sh
    
    cd /eks/deployment/csi/efs
    ./delete.sh
    ./efs-delete.sh

    Deleting Amazon EFS is necessary in order to release the network interface associated with the VPC we created for our cluster. Note that deleting the EFS volume destroys any data that is stored on it.

  3. From the aws-do-eks container, run the eks-delete.sh script to delete the cluster and any other resources associated with it, including the VPC:
    cd /eks
    ./eks-delete.sh

Summary

In this post, we discussed some of the typical challenges of distributed model training and ML workflows. We provided an overview of the Kubeflow on AWS distribution and shared two open-source projects (aws-do-eks and aws-do-kubeflow) that simplify provisioning the infrastructure and the deployment of Kubeflow on it. Finally, we described and demonstrated a hybrid architecture that enables workloads to transition seamlessly between running on a self-managed Kubernetes and fully managed SageMaker infrastructure. We encourage you to use this hybrid architecture for your own use cases.

You can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.

Special thanks to Sree Arasanagatta (Software Development Manager AWS ML) and Suraj Kota (Software Dev Engineer) for their support to the launch of this post.


About the authors

Kanwaljit Khurmi is an AI/ML Specialist Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance, helping them improve the value of their hybrid ML solutions on AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Gautam Kumar is a Software Engineer with AWS AI Deep Learning. He has developed AWS Deep Learning Containers and the AWS Deep Learning AMI. He is passionate about building tools and systems for AI. In his spare time, he enjoys biking and reading books.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source Do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Read More

Bundesliga Match Fact Pressure Handling: Evaluating players’ performances in high-pressure situations on AWS

Pressing or pressure in football is a process in which a team seeks to apply stress to the opponent player who possesses the ball. A team applies pressure to limit the time an opposition player has left to make a decision, reduce passing options, and ultimately attempt to turn over ball possession. Although nearly all teams seek to apply pressure to their opponents, their strategy to do so may vary.

Some teams adopt a so-called deep press, leaving the opposition with time and room to move the ball up the pitch. However, once the ball reaches the last third of the field, defenders aim to intercept the ball by pressuring the ball carrier. A slightly less conservative approach is the middle press. Here pressure is applied around the halfway line, where defenders attempt to lead the buildup in a certain direction, blocking open players and passing lanes to ultimately force the opposition back. Borussia Dortmund under Jürgen Klopp was one of the most efficient teams to use a middle press. The most aggressive type of pressing teams apply is the high press strategy. Here a team seeks to pressure the defenders and goalkeeper, focusing on direct pressure on the ball carrier, leaving them with little time to select the right passing option as they have to cover the ball. In this strategy, the pressing team seeks to turn over possession by challenges or intercepting sloppy passes.

In February 2021, the Bundesliga released the first insight into how teams apply pressure with the Most Pressed Player Match Fact powered by AWS. Most Pressed Player quantifies defensive pressure encountered by players in real time, allowing fans to compare how some players receive pressure vs. others. Over the last 1.5 years, this Match Fact provided fans with new insights on how much teams were applying pressure, but also resulted in new questions, such as “Was this pressure successful?” or “How is this player handling pressure?”

Introducing Pressure Handling, a new Bundesliga Match Fact that aims to evaluate the performance of a frequently pressed player using different metrics. Pressure Handling is a further development of Most Pressed Player, and adds a quality component to the number of significant pressure situations a player in ball possession finds themselves in. A central statistic in this new Match Fact is Escape Rate, which indicates how often a player resolves pressure situations successfully by keeping possession for their team. In addition, fans get insight into the passing and shooting performance of players under pressure.

This post dives deeper into how the AWS team worked closely together with the Bundesliga to bring the Pressure Handling Match Fact to life.

How does it work?

This new Bundesliga Match Fact profiles the performance of players in pressure situations. For example, an attacking player in ball possession may get pressured by opposing defenders. There is a significant probability of them losing the ball. If that player manages to resolve the pressure situation without losing the ball, they increase their performance under pressure. Not losing the ball is defined as the team retaining ball possession after the individual ball possession of the player ends. This could, for instance, be by a successful pass to a teammate, being fouled, or obtaining a throw-in or a corner kick. In contrast, a pressed player can lose the ball through a tackle or an unsuccessful pass. We only count those ball possessions in which the player received the ball from a teammate. That way, we exclude situations where they intercept the ball and are under pressure immediately (which usually happens).

We aggregate the pressure handling performance of a player into a single KPI called escape rate. The escape rate is defined as the fraction of ball possessions by a player where they were under pressure and didn’t lose the ball. In this case, “under pressure” is defined as a pressure value of >0.6 (see our previous post for more information on the pressure value itself). The escape rate allows us to evaluate players on a per-match or per-season basis. The following heuristic is used for computing the escape rate (a code sketch follows the list):

  1. We start with a series of pressure events, based on the existing Most Pressed Player Match Fact. Each event consists of a list containing all individual pressure events on the ball carrier during one individual ball possession (IBP) phase.
  2. For each phase, we compute the maximum aggregated pressure on the ball carrier.
  3. As mentioned earlier, a pressure phase needs to satisfy two conditions in order to be considered:
    1. The previous IBP was by a player of the same team.
    2. The maximum pressure on the player during the current IBP was > 0.6.
  4. If the subsequent IBP belongs to a player of the same team, we count this as an escape. Otherwise, it’s counted as a lost ball.
  5. For each player, we compute the escape rate by counting the number of escapes and dividing it by the number of pressure events.
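
The following is a minimal Python sketch of this heuristic. The per-phase data structure and its field names (player, team, prev_team, next_team, max_pressure) are illustrative assumptions for this sketch, not the production schema.

from collections import defaultdict

# Illustrative per-phase structure (not the production schema):
#   {"player": "A", "team": "X", "prev_team": "X", "next_team": "X", "max_pressure": 0.8}
# prev_team/next_team are the teams of the previous and subsequent IBP phases.

PRESSURE_THRESHOLD = 0.6

def escape_rates(phases):
    pressured = defaultdict(int)
    escapes = defaultdict(int)
    for phase in phases:
        # Condition 1: the previous IBP was by a player of the same team
        if phase["prev_team"] != phase["team"]:
            continue
        # Condition 2: the maximum aggregated pressure during this IBP exceeds 0.6
        if phase["max_pressure"] <= PRESSURE_THRESHOLD:
            continue
        pressured[phase["player"]] += 1
        # Escape: the subsequent IBP also belongs to the same team
        if phase["next_team"] == phase["team"]:
            escapes[phase["player"]] += 1
    # Escape rate = number of escapes divided by number of pressure events, per player
    return {player: escapes[player] / count for player, count in pressured.items()}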

Examples of escapes

To illustrate the different ways of successfully resolving pressure, the following videos show four examples of Joshua Kimmich escaping pressure situations (Matchday 5, Season 22/23 – Union Berlin vs. Bayern Munich).

Joshua Kimmich moving out of pressure and passing to the wing.

Joshua Kimmich playing a quick pass forward to escape ensuing pressure.

Joshua Kimmich escaping pressure twice. The first escape is by a sliding tackle of the opponent, which nevertheless resulted in retained team ball possession. The second escape is by being fouled and thereby retaining team ball possession.

Joshua Kimmich escapes pressure with a quick move and a pass.

Pressure Handling findings

Let’s look at some findings.

With the Pressure Handling Match Fact, players are ranked according to their escape rate on a match basis. In order to have a fair comparison between players, we only rank players that were under pressure at least 10 times.

The following table shows the number of times a player was in the top 2 of match rankings over the first seven matchdays of the 2022/23 season. We only show players with at least three appearances in the top 2.

Number of Times in Top 2   Player              Number of Times in Ranking
4                          Joshua Kimmich      5
4                          Exequiel Palacios   6
3                          Jude Bellingham     7
3                          Alphonso Davies     6
3                          Lars Stindl         3
3                          Jonas Hector        6
3                          Vincenzo Grifo      4
3                          Kevin Stöger        7

Joshua Kimmich and Exequiel Palacios lead the pack with four appearances in the top 2 of match rankings each. A special mention may go to Lars Stindl, who appeared in the top 2 three times despite playing only three times before an injury prevented further Bundesliga starts.

How is it implemented?

The Bundesliga Match Fact Pressure Handling consumes positions and event data, as well as data from other Bundesliga Match Facts, namely xPasses and Most Pressed Player. Match Facts are independently running AWS Fargate containers inside Amazon Elastic Container Service (Amazon ECS). To guarantee the latest data is being reflected in the Pressure Handling calculations, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Amazon MSK allows different Bundesliga Match Facts to send and receive the newest events and updates in real time. By consuming from Kafka, we receive the most up-to-date events from all systems. The following diagram illustrates the end-to-end workflow for Pressure Handling.

Pressure Handling starts its calculation after an event is received from the Most Pressed Player Match Fact. The Pressure Handling container writes the current statistics to a topic in Amazon MSK. A central AWS Lambda function consumes these messages from Amazon MSK, and writes the escape rates to an Amazon Aurora database. This data is then used for interactive near-real-time visualizations using Amazon QuickSight. Besides that, the results are also sent to a feed, which then triggers another Lambda function that sends the data to external systems where broadcasters worldwide can consume it.
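
As an illustration of this consuming step, the following is a minimal sketch of such a Lambda handler. The environment variables, payload fields, escape_rates table, and the choice of an Aurora MySQL-compatible cluster accessed with the PyMySQL client are all assumptions for this sketch, not the production implementation.

import base64
import json
import os

import pymysql  # assumption: an Aurora MySQL-compatible cluster reached with the PyMySQL client

def handler(event, context):
    # Persist escape-rate messages received from the Amazon MSK event source to Aurora
    connection = pymysql.connect(
        host=os.environ["AURORA_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with connection.cursor() as cursor:
            # MSK event sources group records by topic-partition; Kafka values arrive base64-encoded
            for records in event.get("records", {}).values():
                for record in records:
                    payload = json.loads(base64.b64decode(record["value"]))
                    # Illustrative payload and table schema: one escape-rate row per player and match
                    cursor.execute(
                        "REPLACE INTO escape_rates (match_id, player_id, escape_rate) VALUES (%s, %s, %s)",
                        (payload["match_id"], payload["player_id"], payload["escape_rate"]),
                    )
        connection.commit()
    finally:
        connection.close()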

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Pressure Handling makes it possible to quantify and objectively compare the performance of different Bundesliga players in high-pressure situations. To do so, we build on and combine previously published Bundesliga Match Facts in real time. This allows commentators and fans to understand which players shine when pressured by their opponents.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists. Extraordinary escape rates are shown in the live ticker of the respective matches in the official Bundesliga app. During a broadcast, escape rates are provided to commentators through the data story finder and visually shown to fans at key moments, such as when a player with a high pressure count and escape rate scores a goal, passes exceptionally well, or overcomes many challenges while staying in control of the ball.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, Rolfes serves as Managing Director Sport at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS. There he offers his expertise as a former player, captain, and TV analyst to highlight the impact of advanced statistics and machine learning on the world of football.

Luuk Figdor is a Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Fotinos Kyriakides is a Consultant with AWS Professional Services. Through his work as a Data Engineer and Application Developer, he supports customers in developing applications in the cloud that leverage and innovate on insights generated from data. In his spare time, he likes to run and explore nature.

Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.

Read More

Bundesliga Match Fact Win Probability: Quantifying the effect of in-game events on winning chances using machine learning on AWS

Ten years from now, the technological fitness of clubs will be a key contributor towards their success. Today we’re already witnessing the potential of technology to revolutionize the understanding of football. xGoals quantifies and allows comparison of goal scoring potential of any shooting situation, while xThreat and EPV models predict the value of any in-game moment. Ultimately, these and other advanced statistics serve one purpose: to improve the understanding of who will win and why. Enter the new Bundesliga Match Fact: Win Probability.

In Bayern’s second match against Bochum last season, the tables turned unexpectedly. Early in the match, Lewandowski scores to make it 1:0 after just 9 minutes. The “Grey Mouse” of the league is instantly reminded of their 7:0 disaster when facing Bayern for the first time that season. But not this time: Christopher Antwi-Adjei scores his first goal for the club just 5 minutes later. After conceding a penalty goal in the 38th minute, the team from Monaco di Bavaria seems paralyzed and things begin to unravel: Gamboa nutmegs Coman and finishes with an absolute corker of a goal, and Holtmann makes it 4:1 close to halftime with a dipper from the left. Bayern hadn’t conceded this many goals in the first half since 1975, and was barely able to walk away with a 4:2 result. Who could have guessed that? Both teams played without their first-choice keepers, which for Bayern meant missing out on their captain Manuel Neuer. Could his presence have saved them from this unexpected result?

Similarly, Cologne pulled off two extraordinary upsets in the 2020/2021 season. When they faced Dortmund, they had gone 18 matches without a win, while BVB’s Haaland was delivering a master class in goal scoring that season (23 in 22 matches). The role of the favorite was clear, yet Cologne took an early lead with just 9 minutes on the clock. At the beginning of the second half, Skhiri scored a carbon copy of his first goal: 0:2. Dortmund subbed in attacking strength, created big chances, and pulled one back to make it 1:2. Of all players, it was Haaland who missed a sitter 5 minutes into stoppage time, handing Cologne their first 3 points in Dortmund in almost 30 years.

Later that season, Cologne, then last in the home table, surprised RB Leipzig, who had every motivation to close in on the league leader Bayern. Leipzig pressured the “Billy Goats” with a team season record of 13 shots at goal in the first half, increasing their already high chances of a win. Ironically, Cologne scored the 1:0 with their first shot at goal, in minute 46. After the “Red Bulls” scored a well-deserved equalizer, they slept on a throw-in just 80 seconds later, allowing Jonas Hector to score for Cologne again. Just like Dortmund, Leipzig then put all their energy into offense, but the best they managed was hitting the post in stoppage time.

For all of these matches, experts and novices alike would have wrongly guessed the winner, even well into the match. But what are the events that led to these surprising in-game swings of win probability? At what minute did the underdog’s chance of winning overtake the favorite’s as they ran out of time? Bundesliga and AWS have worked together to compute and illustrate the live development of winning chances throughout matches, enabling fans to see key moments of probability swings. The result is the new machine learning (ML)-powered Bundesliga Match Fact: Win Probability.

How does it work?

The new Bundesliga Match Fact Win Probability was developed by building ML models that analyzed over 1,000 historical games. The live model takes the pre-match estimates and adjusts them according to the match proceedings based on features that affect the outcome, including the following:

  • Goals
  • Penalties
  • Red cards
  • Substitutions
  • Time passed
  • Goal scoring chances created
  • Set-piece situations

The live model is trained using a neural network architecture and uses a Poisson distribution approach to predict a goals-per-minute rate r for each team, as described in the following equation:
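As a hedged reconstruction (assuming the standard Poisson form rather than quoting the post’s exact notation), the probability of a team scoring k goals in the remaining t minutes, given a per-minute rate r, is:

P(k \mid r, t) = \frac{(r t)^k \, e^{-r t}}{k!}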

Those rates can be viewed as an estimation of a team’s strength and are computed using a series of dense layers based on the inputs. Based on these rates and the difference between the opponents, the probabilities of a win and a draw are computed in real time.

The input to the model is a 3-tuple: a set of input features, the current goal difference, and the remaining playtime in minutes.

The first component of the three input dimensions consists of a feature set that describes the current game action in real time for both teams in performance metrics. These include various aggregated team-based xG values, with particular attention to the shots taken in the last 15 minutes before the prediction. We also process red cards, penalties, corner kicks, and the number of dangerous free kicks. A dangerous free kick is classified as a free kick closer than 25m to the opponent’s goal. During the development of the model, besides the influence of the previously published Bundesliga Match Fact xGoals, we also evaluated the impact of the Bundesliga Match Fact Skill in the model. This means that the model reacts to the substitution of top players (players with badges in the skills Finisher, Initiator, or Ball Winner).
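For illustration only (the feature names below are assumptions, not the model’s actual input schema), the 3-tuple passed to the live model could be assembled along these lines:

# Hypothetical live-model input for one prediction step.
team_features = {
    "xg_last_15_min": 0.8,       # aggregated xG of shots in the last 15 minutes
    "red_cards": 0,
    "penalties": 1,
    "corner_kicks": 4,
    "dangerous_free_kicks": 2,   # free kicks closer than 25m to the opponent's goal
}

model_input = (
    team_features,   # performance metrics (one such set per team in practice)
    1,               # current goal difference
    27,              # remaining playtime in minutes
)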

Win Probability example

Let’s look at a match from the current season (2022/2023). The following graph shows the win probability for the Bayern Munich and Stuttgart match from matchday 6.

The pre-match model calculated a win probability of 67% for Bayern, 14% for Stuttgart, and 19% for a draw. Looking at the course of the match, we see a large impact from the goals scored in minutes 36, 57, and 60. Until the first minute of stoppage time, the score was 2:1 for Bayern. Only a converted penalty by S. Guirassy in minute 90+2 secured the draw. The Win Probability Live Model therefore corrected the draw forecast from 5% to over 90%. The result is an unexpected late swing, with Bayern’s win probability dropping from 90% to 8% in minute 90+2. The graph mirrors the swing in atmosphere in the Allianz Arena that day.

How is it implemented?

Win Probability consumes event data from an ongoing match (goal events, fouls, red cards, and more) as well as data produced by other Match Facts, such as xGoals. For real-time updates of probabilities, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a central data streaming and messaging solution. This way, event data, positions data, and outputs of different Bundesliga Match Facts can be communicated between containers in real time.

The following diagram illustrates the end-to-end workflow for Win Probability.

Gathered match-related data gets ingested through an external provider (DataHub). Metadata of the match is ingested and processed in an AWS Lambda function. Positions and events data are ingested through an AWS Fargate container (MatchLink). All ingested data is then published for consumption in respective MSK topics. The heart of the Win Probability Match Fact sits in a dedicated Fargate container (BMF WinProbability), which runs for the duration of the respective match and consumes all required data obtained through Amazon MSK. The ML models (live and pre-match) are deployed on Amazon SageMaker Serverless Inference endpoints. Serverless endpoints automatically launch compute resources and scale them depending on incoming traffic, eliminating the need to choose instance types or manage scaling policies. With this pay-per-use model, Serverless Inference is ideal for workloads that have idle periods between traffic spurts. When there are no Bundesliga matches, there is no cost for idle resources.
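For reference, a serverless endpoint of this kind is configured with a memory size and maximum concurrency instead of an instance type. The following boto3 sketch uses placeholder names, not the actual Win Probability resources:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names; the real models are created by the Win Probability deployment.
sm.create_endpoint_config(
    EndpointConfigName="win-probability-live-serverless",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "win-probability-live-model",
            "ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 5},
        }
    ],
)
sm.create_endpoint(
    EndpointName="win-probability-live",
    EndpointConfigName="win-probability-live-serverless",
)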

Shortly before kick-off, we generate our initial set of features and calculate the pre-match win probabilities by calling the PreMatch SageMaker endpoint. With those PreMatch probabilities, we then initialize the live model, which reacts in real time to relevant in-game events and is continuously queried to receive current win probabilities.

The calculated probabilities are then sent back to DataHub to be provided to other MatchFacts consumers. Probabilities are also sent to the MSK cluster to a dedicated topic, to be consumed by other Bundesliga Match Facts. A Lambda function consumes all probabilities from the respective Kafka topic, and writes them to an Amazon Aurora database. This data is then used for interactive near-real-time visualizations using Amazon QuickSight.

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Win Probability shows the impact of in-game events on the chances of a team winning or losing a match. To do so, we build on and combine previously published Bundesliga Match Facts in real time. This allows commentators and fans to uncover moments of probability swings and more during live matches.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists. Win probabilities are shown in the live ticker of the respective matches in the official Bundesliga app. During a broadcast, win probabilities are provided to commentators through the data story finder and visually shown to fans at key moments, such as when the underdog takes the lead and is now most likely to win the game.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, Rolfes serves as Managing Director Sport at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS. There he offers his expertise as a former player, captain, and TV analyst to highlight the impact of advanced statistics and machine learning on the world of football.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music and AI in his spare time.

Luuk Figdor is a Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Gabriel Zylka is a Machine Learning Engineer within AWS Professional Services. He works closely with customers to accelerate their cloud adoption journey. Specialized in the MLOps domain, he focuses on productionizing machine learning workloads by automating end-to-end machine learning lifecycles and helping achieve desired business outcomes.

Jakub Michalczyk is a Data Scientist at Sportec Solutions AG. Several years ago, he chose math studies over playing football, as he came to the conclusion that he wasn’t good enough at the latter. Now he combines both these passions in his professional career by applying machine learning methods to gain a better insight into this beautiful game. In his spare time, he still enjoys playing seven-a-side football, watching crime movies, and listening to film music.

Read More

Fast Reduce and Mean in TensorFlow Lite

Posted by Alan Kelly, Software Engineer

We are happy to share that TensorFlow Lite version 2.10 has optimized Reduce (All, Any, Max, Min, Prod, Sum) and Mean operators. These common operators replace one or more dimensions of a multi-dimensional tensor with a scalar. Sum, Product, Min, Max, Bitwise And, Bitwise Or, and Mean variants of reduce are available. Reduce is now fast for all possible inputs.

Benchmark for Reduce Mean on Google Pixel 6 Pro Cortex A55 (small core). The input tensor is 4D of shape [32, 256, 5, 128], reduced over axes [1, 3]; the output is a 2D tensor of shape [32, 5].

Benchmark for Reduce Prod on Google Pixel 6 Pro Cortex A55 (small core). The input tensor is 4D of shape [32, 256, 5, 128], reduced over axes [1, 3]; the output is a 2D tensor of shape [32, 5].

Benchmark for Reduce Sum on Google Pixel 6 Pro Cortex A55 (small core). The input tensor is 4D of shape [32, 256, 5, 128], reduced over axes [0, 2]; the output is a 2D tensor of shape [256, 128].


These speed-ups are available by default using the latest version of TFLite on all architectures.

How does this work?

To understand how these improvements were made, we need to look at the problem from a different perspective. Let’s take a 3D tensor of shape [3, 2, 5].

Let’s reduce this tensor over axis [0] using Reduce Max. This will give us an output tensor of shape [2, 5], as dimension 0 will be removed. Each element in the output tensor will contain the max of the three elements in the same position along dimension 0. So the first element will be max{0, 10, 20} = 20. This gives us the following output:

To simplify things, let’s reshape the original 3D tensor as a 2D tensor of shape [3, 10]. This is the exact same tensor, just visualized differently.

Reducing this over dimension 0 by taking the max of each column gives us:

We then reshape this back to the expected output shape of [2, 5].

This demonstrates how simply changing how we visualize the tensor dramatically simplifies the implementation. In this case, dimensions 1 and 2 are adjacent and not being reduced over. This means that we can fold them into one larger dimension of size 2 x 5 = 10, transforming the 3D tensor into a 2D one. We can do the same to adjacent dimensions which are being reduced over.
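The following NumPy sketch (illustrating the idea, not the TFLite kernels themselves) confirms that the folded reduction yields the same result as reducing the original 3D tensor:

import numpy as np

x = np.arange(30).reshape(3, 2, 5)   # 3D tensor of shape [3, 2, 5]

direct = x.max(axis=0)               # reduce over dimension 0 -> shape [2, 5]

# Fold the adjacent, non-reduced dimensions 1 and 2 into one of size 2 * 5 = 10,
# reduce the columns of the 2D view, then restore the output shape.
folded = x.reshape(3, 10).max(axis=0).reshape(2, 5)

assert np.array_equal(direct, folded)
print(direct[0, 0])                  # max{0, 10, 20} = 20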

Let’s take a look at all possible Reduce permutations for the same 3D tensor of shape [3, 2, 5].

Of all 8 permutations, only two 3D permutations remain after we re-visualize the input tensor. For any number of dimensions, there are only two possible reduction permutations: the rows or the columns. All other ones simplify to a lower dimension.

This is the trick to an efficient and simple reduction operator as we no longer need to calculate input and output tensor indices and our memory access patterns are much more cache friendly.

This also allows the compiler to auto-vectorize the integer reductions. The compiler won’t auto-vectorize floats because float addition is not associative. You can see the code which removes redundant axes here and the reduction code here.

Changing how we visualize tensors is a powerful code simplification and optimization technique which is used by many TensorFlow Lite operators.

Next steps

We are always working on adding new operators and speeding up existing ones. We’d love to hear about models of yours which have benefited from this work. Get in touch via the TensorFlow Forum. Thanks for reading!

Read More

Unified data preparation, model training, and deployment with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 2

Depending on the quality and complexity of data, data scientists spend 45–80% of their time on data preparation tasks. This implies that data preparation and cleansing take valuable time away from real data science work. After a machine learning (ML) model is trained with prepared data and readied for deployment, data scientists must often rewrite the data transformations used for preparing data for ML inference. This may stretch the time it takes to deploy a useful model that can run inference and score data in its raw shape and form.

In Part 1 of this series, we demonstrated how Data Wrangler enables a unified data preparation and model training experience with Amazon SageMaker Autopilot in just a few clicks. In this second and final part of this series, we focus on a feature that includes and reuses Amazon SageMaker Data Wrangler transforms, such as missing value imputers, ordinal or one-hot encoders, and more, along with the Autopilot models for ML inference. This feature enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference, further reducing the time required to deploy a trained model to production.

Solution overview

Data Wrangler reduces the time to aggregate and prepare data for ML from weeks to minutes, and Autopilot automatically builds, trains, and tunes the best ML models based on your data. With Autopilot, you still maintain full control and visibility of your data and model. Both services are purpose-built to make ML practitioners more productive and accelerate time to value.

The following diagram illustrates our solution architecture.

Prerequisites

Because this post is the second in a two-part series, make sure you’ve successfully read and implemented Part 1 before continuing.

Export and train the model

In Part 1, after data preparation for ML, we discussed how you can use the integrated experience in Data Wrangler to analyze datasets and easily build high-quality ML models in Autopilot.

This time, we use the Autopilot integration once again to train a model against the same training dataset, but instead of performing bulk inference, we perform real-time inference against an Amazon SageMaker inference endpoint that is created automatically for us.

In addition to the convenience provided by automatic endpoint deployment, we demonstrate how you can also deploy with all the Data Wrangler feature transforms as a SageMaker serial inference pipeline. This enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference.

Note that this feature is currently only supported for Data Wrangler flows that don’t use join, group by, concatenate, and time series transformations.

We can use the new Data Wrangler integration with Autopilot to directly train a model from the Data Wrangler data flow UI.

  1. Choose the plus sign next to the Scale values node, and choose Train model.
  2. For Amazon S3 location, specify the Amazon Simple Storage Service (Amazon S3) location where SageMaker exports your data.
    If presented with a root bucket path by default, Data Wrangler creates a unique export sub-directory under it; you don’t need to modify this default root path unless you want to. Autopilot uses this location to automatically train a model, saving you time from having to define the output location of the Data Wrangler flow and then define the input location of the Autopilot training data. This makes for a more seamless experience.
  3. Choose Export and train to export the transformed data to Amazon S3.

    When export is successful, you’re redirected to the Create an Autopilot experiment page, with the Input data S3 location already filled in for you (it was populated from the results of the previous page).
  4. For Experiment name, enter a name (or keep the default name).
  5. For Target, choose Outcome as the column you want to predict.
  6. Choose Next: Training method.

As detailed in the post Amazon SageMaker Autopilot is up to eight times faster with new ensemble training mode powered by AutoGluon, you can either let Autopilot select the training mode automatically based on the dataset size, or select the training mode manually for either ensembling or hyperparameter optimization (HPO).

The details of each option are as follows:

  • Auto – Autopilot automatically chooses either ensembling or HPO mode based on your dataset size. If your dataset is larger than 100 MB, Autopilot chooses HPO; otherwise it chooses ensembling.
  • Ensembling – Autopilot uses the AutoGluon ensembling technique to train several base models and combines their predictions using model stacking into an optimal predictive model.
  • Hyperparameter optimization – Autopilot finds the best version of a model by tuning hyperparameters using the Bayesian optimization technique and running training jobs on your dataset. HPO selects the algorithms most relevant to your dataset and picks the best range of hyperparameters to tune the models.

For our example, we leave the default selection of Auto (a hedged programmatic sketch of these modes follows this list).
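If you prefer to launch an equivalent experiment programmatically instead of through the UI, the training mode can be set when creating the AutoML job. The following boto3 sketch uses placeholder names, paths, and role ARN, and is an assumption about how you might wire it up rather than a step from this walkthrough:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names, S3 paths, and role ARN.
sm.create_auto_ml_job(
    AutoMLJobName="diabetes-autopilot-experiment",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://your-bucket/data-wrangler-export/",
                }
            },
            "TargetAttributeName": "Outcome",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/autopilot-output/"},
    AutoMLJobConfig={"Mode": "AUTO"},  # or "ENSEMBLING" / "HYPERPARAMETER_TUNING"
    RoleArn="arn:aws:iam::111122223333:role/YourSageMakerExecutionRole",
)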
  1. Choose Next: Deployment and advanced settings to continue.
  2. On the Deployment and advanced settings page, select a deployment option.
    It’s important to understand the deployment options in more detail; what we choose will impact whether or not the transforms we made earlier in Data Wrangler will be included in the inference pipeline:

    • Auto deploy best model with transforms from Data Wrangler – With this deployment option, when you prepare data in Data Wrangler and train a model by invoking Autopilot, the trained model is deployed alongside all the Data Wrangler feature transforms as a SageMaker serial inference pipeline. This enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference. Note that the inference endpoint expects the format of your data to be in the same format as when it’s imported into the Data Wrangler flow.
    • Auto deploy best model without transforms from Data Wrangler – This option deploys a real-time endpoint that doesn’t use Data Wrangler transforms. In this case, you need to apply the transforms defined in your Data Wrangler flow to your data prior to inference.
    • Do not auto deploy best model – You should use this option when you don’t want to create an inference endpoint at all. It’s useful if you want to generate a best model for later use, such as locally run bulk inference. (This is the deployment option we selected in Part 1 of the series.) Note that when you select this option, the model created (from Autopilot’s best candidate via the SageMaker SDK) includes the Data Wrangler feature transforms as a SageMaker serial inference pipeline.

    For this post, we use the Auto deploy best model with transforms from Data Wrangler option.

  3. For Deployment option, select Auto deploy best model with transforms from Data Wrangler.
  4. Leave the other settings as default.
  5. Choose Next: Review and create to continue.
    On the Review and create page, we see a summary of the settings chosen for our Autopilot experiment.
  6. Choose Create experiment to begin the model creation process.

You’re redirected to the Autopilot job description page. The models appear on the Models tab as they are generated. To confirm that the process is complete, go to the Job Profile tab and look for a Completed value for the Status field.

You can get back to this Autopilot job description page at any time from Amazon SageMaker Studio:

  1. Choose Experiments and Trials on the SageMaker resources drop-down menu.
  2. Select the name of the Autopilot job you created.
  3. Choose (right-click) the experiment and choose Describe AutoML Job.

View the training and deployment

When Autopilot completes the experiment, we can view the training results and explore the best model from the Autopilot job description page.

Choose (right-click) the model labeled Best model, and choose Open in model details.

The Performance tab displays several model measurement tests, including a confusion matrix, the area under the precision/recall curve (AUCPR), and the area under the receiver operating characteristic curve (ROC). These illustrate the overall validation performance of the model, but they don’t tell us if the model will generalize well. We still need to run evaluations on unseen test data to see how accurately the model makes predictions (for this example, we predict if an individual will have diabetes).

Perform inference against the real-time endpoint

Create a new SageMaker notebook to perform real-time inference to assess the model performance. Enter the following code into a notebook to run real-time inference for validation:

import boto3

### Define required boto3 clients

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client(service_name="sagemaker-runtime")

### Define endpoint name

endpoint_name = "<YOUR_ENDPOINT_NAME_HERE>"

### Define input data

payload_str = '5,166.0,72.0,19.0,175.0,25.8,0.587,51'
payload = payload_str.encode()
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

response["Body"].read()

After you set up the code to run in your notebook, you need to configure two variables:

  • endpoint_name
  • payload_str

Configure endpoint_name

endpoint_name represents the name of the real-time inference endpoint the deployment auto-created for us. Before we set it, we need to find its name.

  1. Choose Endpoints on the SageMaker resources drop-down menu.
  2. Locate the name of the endpoint that has the name of the Autopilot job you created with a random string appended to it.
  3. Choose (right-click) the endpoint, and choose Describe Endpoint.

    The Endpoint Details page appears.
  4. Highlight the full endpoint name, and press Ctrl+C to copy it to the clipboard.
  5. Enter this value (make sure it’s quoted) for endpoint_name in the inference notebook.

Configure payload_str

The notebook comes with a default payload string payload_str that you can use to test your endpoint, but feel free to experiment with different values, such as those from your test dataset.

To pull values from the test dataset, follow the instructions in Part 1 to export the test dataset to Amazon S3. Then on the Amazon S3 console, you can download the file and select rows from it to use.

Each row in your test dataset has nine columns, with the last column being the outcome value. For this notebook code, make sure you only use a single data row (never a CSV header) for payload_str. Also make sure you only send a payload_str with eight columns, where you have removed the outcome value.

For example, if your test dataset files look like the following code, and we want to perform real-time inference of the first row:

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome 
10,115,0,0,0,35.3,0.134,29,0 
10,168,74,0,0,38.0,0.537,34,1 
1,103,30,38,83,43.3,0.183,33,0

We set payload_str to 10,115,0,0,0,35.3,0.134,29. Note how we omitted the outcome value of 0 at the end.

If by chance the target value of your dataset is not the first or last value, just remove the value with the comma structure intact. For example, assume we’re predicting bar, and our dataset looks like the following code:

foo,bar,foobar
85,17,20

In this case, we set payload_str to 85,,20.
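If you’d rather not edit rows by hand, a small helper along these lines (illustrative only, not part of the notebook) blanks out the target column while keeping the comma structure intact:

def build_payload(header: str, row: str, target: str) -> str:
    # Blank out the target column, e.g. ("foo,bar,foobar", "85,17,20", "bar") -> "85,,20".
    # If the target is the last column, strip the resulting trailing comma before sending.
    columns = header.split(",")
    values = row.split(",")
    values[columns.index(target)] = ""
    return ",".join(values)

print(build_payload("foo,bar,foobar", "85,17,20", "bar"))  # 85,,20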

When the notebook is run with the properly configured payload_str and endpoint_name values, you get a CSV response back in the format of outcome (0 or 1), confidence (0-1).

Clean up

To make sure you don’t incur tutorial-related charges after completing this tutorial, be sure to shut down the Data Wrangler app (https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-shut-down.html), as well as all notebook instances used to perform inference tasks. Also delete the inference endpoints created by the Autopilot deployment to prevent additional charges, as shown in the sketch that follows.
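For the endpoint specifically, one way to clean up from the same notebook is via boto3 (a sketch; substitute the endpoint name you used for inference above):

import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "<YOUR_ENDPOINT_NAME_HERE>"  # the endpoint used for inference above

# Look up the endpoint configuration, then delete both to stop incurring charges.
config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)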

Conclusion

In this post, we demonstrated how to integrate your data processing, feature engineering, and model building using Data Wrangler and Autopilot. Building on Part 1 in the series, we highlighted how you can easily train, tune, and deploy a model to a real-time inference endpoint with Autopilot directly from the Data Wrangler user interface. In addition to the convenience provided by automatic endpoint deployment, we demonstrated how you can also deploy with all the Data Wrangler feature transforms as a SageMaker serial inference pipeline, providing for automatic preprocessing of the raw data, with the reuse of Data Wrangler feature transforms at the time of inference.

Low-code and AutoML solutions like Data Wrangler and Autopilot remove the need to have deep coding knowledge to build robust ML models. Get started using Data Wrangler today to experience how easy it is to build ML models using Autopilot.


About the authors

Geremy Cohen is a Solutions Architect with AWS where he helps customers build cutting-edge, cloud-based solutions. In his spare time, he enjoys short walks on the beach, exploring the bay area with his family, fixing things around the house, breaking things around the house, and BBQing.

Pradeep Reddy is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot and SageMaker Automatic Model Tuning. Outside of work, Pradeep enjoys reading, running, and geeking out with palm-sized computers like the Raspberry Pi and other home automation tech.

Dr. John He is a senior software development engineer with Amazon AI, where he focuses on machine learning and distributed computing. He holds a PhD degree from CMU.

Read More