Amazon Forecast now provides estimated run time for forecast creation jobs, enabling you to manage your time efficiently

Amazon Forecast now displays the estimated time it takes to complete an in-progress workflow for importing your data, training the predictor, and generating the forecast. You can now manage your time more efficiently and better plan for your next workflow around the estimated time remaining for your in-progress workflow. Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.

Previously, you had no clear insight into how long a workflow would take to complete, which forced you to proactively monitor each stage, whether it was importing your data, training the predictor, or generating the forecast. This made it difficult to plan for subsequent steps, which was especially frustrating because the time required to import data, train a predictor, and create forecasts can vary widely depending on the size and characteristics of your data.

Now, you have visibility into how long a workflow may take, which is especially useful when you run your forecast workloads manually or are experimenting. Knowing how long each workflow will take allows you to focus on other tasks and return to your forecast journey later. Additionally, the displayed estimate refreshes automatically, which keeps your expectations accurate as the workflow progresses.

In this post, we walk through the Forecast console experience of reading the estimated time to workflow completion. To check the estimated time through the APIs, refer to DescribeDatasetImportJob, DescribePredictor, and DescribeForecast.
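For example, the following boto3 sketch reads the estimate for a dataset import job; the same pattern applies to describe_predictor and describe_forecast. The EstimatedTimeRemainingInMinutes field name and the placeholder ARN are assumptions here, so confirm them against the API reference.

import boto3

forecast = boto3.client("forecast")

# Describe an in-progress dataset import job and print its status along with
# the estimated time remaining (field name assumed; see the API reference)
response = forecast.describe_dataset_import_job(
    DatasetImportJobArn="arn:aws:forecast:us-east-1:123456789012:dataset-import-job/my_dataset/my_import"  # placeholder ARN
)
print(response["Status"], response.get("EstimatedTimeRemainingInMinutes"))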

If you want to build automated workflows for Forecast, we recommend following the steps outlined in Create forecasting systems faster with automated workflows and notifications in Amazon Forecast, which walks through integrating Forecast with Amazon EventBridge to build event-driven Forecast workflows. EventBridge removes the need to manually check the estimated time for a workflow to complete, because it starts your desired next workflow automatically.

Check the estimated time to completion of your dataset import workflow

After you create a new dataset import job, you can see the Create pending status for the newly created job. When the status changes to Create in progress, you can see the estimated time remaining in the Status column of the Datasets imports section. This estimated time refreshes automatically until the status changes to Active.

On the details page of the newly created dataset import job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the import job to complete and Actual import time shows -. This section refreshes automatically with the estimated time to completion. After the import job is complete and the status becomes Active, the Actual import time shows the total time of the import.

Check the estimated time to completion of your predictor training workflow

After you create a new predictor, you first see the Create pending status for the newly created job. When the status changes to Create in progress, you see the estimated time remaining in the Status column in the Predictors section. This estimated time refreshes automatically until the status changes to Active.

On the details page of the newly created predictor job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the predictor training to complete and the actual time field shows -. This section refreshes automatically with the estimated time to completion. After training is complete and the status becomes Active, the actual time field shows the total time taken to create the predictor.

Check the estimated time to completion of your forecast creation workflow

After you create a new forecast, you first see the Create pending status for the newly created job. When the status changes to Create in progress, you see the estimated time remaining in the Status column. This estimated time refreshes automatically until it changes to Active.

On the details page of the newly created forecast job, when the status is Create in progress, the Estimated time remaining field shows the remaining time for the forecast job to complete and the actual time field shows -. This section refreshes automatically with the estimated time to completion. After the forecast job is complete and the status changes to Active, the actual time field shows the total time taken to create the forecast.

Conclusion

You can now see how long a workflow will take when you initiate it in Forecast, which can help you manage your time more efficiently. The new field appears automatically in the responses of the Describe* calls, without requiring any setup.

To learn more about this capability, see DescribeDatasetImportJob, DescribePredictor, and DescribeForecast. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Alex Kim is a Sr. Product Manager for Amazon Forecast. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.

 

 

 

Ranjith Kumar Bodla is an SDE in the Amazon Forecast team. He works as a backend developer within a distributed environment with a focus on AI/ML and leadership. During his spare time, he enjoys playing table tennis, traveling, and reading.

 

 

 

Gautam Puri is a Software Development Engineer on the Amazon Forecast team. His focus area is on building distributed systems that solve machine learning problems. In his free time, he enjoys hiking and basketball.

 

 

 

Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.

 

Read More

Build an event-based tracking solution using Amazon Lookout for Vision

Amazon Lookout for Vision is a machine learning (ML) service that spots defects and anomalies in visual representations using computer vision (CV). With Amazon Lookout for Vision, manufacturing companies can increase quality and reduce operational costs by quickly identifying differences in images of objects at scale.

Many enterprise customers want to identify missing components in products, damage to vehicles or structures, irregularities in production lines, minuscule defects in silicon wafers, and other similar problems. Amazon Lookout for Vision uses ML to see and understand images from any camera as a person would, but with an even higher degree of accuracy and at a much larger scale. Amazon Lookout for Vision eliminates the need for costly and inconsistent manual inspection, while improving quality control, defect and damage assessment, and compliance. In minutes, you can begin using Amazon Lookout for Vision to automate inspection of images and objects—with no ML expertise required.

In this post, we look at how we can automate detecting anomalies in silicon wafers and notifying operators in real time.

Solution overview

Keeping track of the quality of products in a manufacturing line is a challenging task. Some process steps take images of the product that humans then review in order to assure good quality. Thanks to artificial intelligence, you can automate these anomaly detection tasks, but human intervention may be necessary after anomalies are detected. A standard approach is sending emails when problematic products are detected. These emails might be overlooked, which could cause a loss in quality in a manufacturing plant.

In this post, we automate the process of detecting anomalies in silicon wafers and notifying operators in real time using automated phone calls. The following diagram illustrates our architecture. We deploy a static website using AWS Amplify, which serves as the entry point for our application. Whenever a new image is uploaded via the UI (1), an AWS Lambda function invokes the Amazon Lookout for Vision model (2) and predicts whether this wafer is anomalous or not. The function stores each uploaded image to Amazon Simple Storage Service (Amazon S3) (3). If the wafer is anomalous, the function sends the confidence of the prediction to Amazon Connect and calls an operator (4), who can take further action (5).
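The Lambda code ships with the solution's GitHub repository; the following Python sketch only illustrates what steps 2-4 could look like, with the environment variable names and the request payload shape being assumptions for illustration.

import base64
import json
import os
import boto3

s3 = boto3.client("s3")
lookout = boto3.client("lookoutvision")
connect = boto3.client("connect")

def handler(event, context):
    # Decode the uploaded image sent by the static website (payload shape assumed)
    image_bytes = base64.b64decode(json.loads(event["body"])["image"].split(",")[-1])

    # Store the uploaded image for later review
    s3.put_object(Bucket=os.environ["BUCKET"], Key="uploads/wafer.jpg", Body=image_bytes)

    # Ask the Lookout for Vision model whether the wafer is anomalous
    result = lookout.detect_anomalies(
        ProjectName=os.environ["PROJECT_NAME"],
        ModelVersion=os.environ["MODEL_VERSION"],
        Body=image_bytes,
        ContentType="image/jpeg",
    )["DetectAnomalyResult"]

    if result["IsAnomalous"]:
        # Call the operator through the imported Amazon Connect contact flow
        connect.start_outbound_voice_contact(
            DestinationPhoneNumber=os.environ["DEST_NUMBER"],
            ContactFlowId=os.environ["FLOW_ID"],
            InstanceId=os.environ["INSTANCE_ID"],
            SourcePhoneNumber=os.environ["SOURCE_NUMBER"],
            Attributes={"Confidence": str(round(result["Confidence"] * 100, 2))},
        )

    return {
        "statusCode": 200,
        "body": json.dumps({"IsAnomalous": result["IsAnomalous"], "Confidence": result["Confidence"]}),
    }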

Setting up Amazon Connect and the associated contact flow

To configure Amazon Connect and the contact flow, you complete the following high-level steps:

  1. Create an Amazon Connect instance.
  2. Set up the contact flow.
  3. Claim your phone number.

Create an Amazon Connect instance

The first step is to create an Amazon Connect instance. For the rest of the setup, we use the default values, but don’t forget to create an administrator login.

Instance creation can take a few minutes, after which we can log in to the Amazon Connect instance using the admin account we created.

Setting up the contact flow

In this post, we have a predefined contact flow that we can import. For more information about importing an existing contact flow, see Import/export contact flows.

  1. Choose the file contact-flow/wafer-anomaly-detection from the GitHub repo.
  2. Choose Import.

The imported contact flow looks similar to the following screenshot.

  3. On the flow details page, expand Show additional flow information.

Here you can find the ARN of the contact flow.

  4. Record the contact flow ID and contact center ID, which you need later.

Claim your phone number

Claiming a number is easy and takes just a few clicks. Make sure to choose the previously imported contact flow while claiming the number.

If no numbers are available in the country of your choice, raise a support ticket.

Contact flow overview

The following screenshot shows our contact flow.

The contact flow performs the following functions:

  • Enable logging
  • Set the output Amazon Polly voice (for this post, we use the Kendra voice)
  • Get customer input using DTMF (only keys 1 and 2 are valid).
  • Based on the user’s input, the flow does one of the following:
    • Prompt a goodbye message stating no action will be taken and exit
    • Prompt a goodbye message stating an action will be taken and exit
    • Fail and deliver a fallback block stating that the machine will shut down and exit

Optionally, you can enhance your system with an Amazon Lex bot.

Deploy the solution

Now that you have set up Amazon Connect, deployed your contact flow, and noted the information you need for the rest of the deployment, we can deploy the remaining components. In the cloned GitHub repository, edit the build.sh script and run it from the command line:

#Global variables
ApplicationRegion="YOUR_REGION"
S3SourceBucket="YOUR_S3_BUCKET-sagemaker"
LookoutProjectName="YOUR_PROJECT_NAME"
FlowID="YOUR_FLOW_ID"
InstanceID="YOUR_INSTANCE_ID"
SourceNumber="YOUR_CLAIMED_NUMBER"
DestNumber="YOUR_MOBILE_PHONE_NUMBER"
CloudFormationStack="YOUR_CLOUD_FORMATION_STACK_NAME"

Provide the following information:

  • Your Region
  • The S3 bucket name you want to use (make sure the name includes the word sagemaker).
  • The name of the Amazon Lookout for Vision project you want to use
  • The ID of your contact flow
  • Your Amazon Connect instance ID
  • The number you’ve claimed in Amazon Connect in E.164 format (for example, +132398765)
  • A name for the AWS CloudFormation stack you create by running this script

This script then performs the following actions:

  • Create an S3 bucket for you
  • Build the .zip files for your Lambda function
  • Upload the CloudFormation template and the Lambda function to your new S3 bucket
  • Create the CloudFormation stack

After the stack is deployed, you can find the following resources created on the AWS CloudFormation console.

You can see that an Amazon SageMaker notebook called amazon-lookout-vision-create-project is also created.

Build, train, and deploy the Amazon Lookout for Vision model

In this section, we see how to build, train, and deploy the Amazon Lookout for Vision model using the open-source Python SDK. For more information about the Amazon Lookout for Vision Python SDK, see this blog post.

You can build the model via the AWS Management Console. For programmatic deployment, complete the following steps:

  1. On the SageMaker console, on the Notebook instances page, access the SageMaker notebook instance that was created earlier by choosing Open Jupyter.

In the instance, you can find the GitHub repository of the Amazon Lookout for Vision Python SDK automatically cloned.

  2. Navigate into the amazon-lookout-for-vision-python-sdk/example folder.

The folder contains an example notebook that walks you through building, training, and deploying a model. Before you get started, you need to upload the images you'll use to train the model to your notebook instance.

  3. In the example/ folder, create two new folders named good and bad.
  4. Navigate into both folders and upload your images accordingly.

Example images are in the downloaded GitHub repository.

  5. After you upload the images, open the lookout_for_vision_example.ipynb notebook.

The notebook walks you through the process of creating your model. One important step you should do first is provide the following information:

# Training & Inference
input_bucket = "YOUR_S3_BUCKET_FOR_TRAINING"
project_name = "YOUR_PROJECT_NAME"
model_version = "1" # leave this as one if you start right at the beginning

# Inference
output_bucket = "YOUR_S3_BUCKET_FOR_INFERENCE" # can be same as input_bucket
input_prefix = "YOUR_KEY_TO_FILES_TO_PREDICT/" # used in batch_predict
output_prefix = "YOUR_KEY_TO_SAVE_FILES_AFTER_PREDICTION/" # used in batch_predict

You can ignore the inference section, but feel free to also play around with this part of the notebook. Because you’re just getting started, you can leave model_version set to “1”.

For input_bucket and project_name, use the S3 bucket and Amazon Lookout for Vision project name that are provided as part of the build.sh script. You can then run each cell in the notebook, which successfully deploys the model.

You can view the training metrics using the SDK, but you can also find them on the console. To do so, open your project, navigate to the models, and choose the model you’ve trained. The metrics are available on the Performance metrics tab.
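If you prefer to pull the metrics programmatically, a small boto3 sketch like the following can retrieve them; the project name and model version are placeholders, and the exact response keys should be confirmed against the Lookout for Vision API reference.

import boto3

lookout = boto3.client("lookoutvision")

# Fetch the status and performance metrics of a trained model version
description = lookout.describe_model(ProjectName="YOUR_PROJECT_NAME", ModelVersion="1")["ModelDescription"]
performance = description.get("Performance", {})
print("Status:", description["Status"])
print("F1:", performance.get("F1Score"), "Precision:", performance.get("Precision"), "Recall:", performance.get("Recall"))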

You’re now ready to deploy a static website that can call your model on demand.

Deploy the static website

Your first step is to add the endpoint of your Amazon API Gateway to your static website’s source code.

  1. On the API Gateway console, find the REST API called LookoutVisionAPI.
  2. Open the API and choose Stages.
  3. On the stage's drop-down menu (for this post, dev), choose the POST method.
  4. Copy the value for Invoke URL.

We add the URL to the HTML source code.

  5. Open the file html/index.html.

At the end of the file, you can find a section that uses jQuery to trigger an AJAX request. One key is called url, which has an empty string as its value.

  6. Enter the URL you copied as your new url value and save the file.

The code should look similar to the following:

$.ajax({
    type: 'POST',
    url: 'https://<API_Gateway_ID>.execute-api.<AWS_REGION>.amazonaws.com/dev/amazon-lookout-vision-api',
    data: JSON.stringify({coordinates: coordinates, image: reader.result}),
    cache: false,
    contentType: false,
    processData: false,
    success:function(data) {
        var anomaly = data["IsAnomalous"]
        var confidence = data["Confidence"]
        text = "Anomaly:" + anomaly + "<br>" + "Confidence:" + confidence + "<br>";
        $("#json").html(text);
    },
    error: function(data){
        console.log("error");
        console.log(data);
}});
  7. Convert the index.html file to a .zip file.
  8. On the AWS Amplify console, choose the app ObjectTracking.

The front-end environment page of your app opens automatically.

  9. Select Deploy without Git provider.

You can enhance this piece to connect AWS Amplify to Git and automate your whole deployment.

  10. Choose Connect branch.

  11. For Environment name, enter a name (for this post, we enter dev).
  12. For Method, select Drag and drop.
  13. Choose Choose files to upload the index.html.zip file you created.
  14. Choose Save and deploy.

After the deployment is successful, you can use your web application by choosing the domain displayed in AWS Amplify.

Detect anomalies

Congratulations! You just built a solution to automate the detection of anomalies in silicon wafers and alert an operator to take appropriate action. The data we use for Amazon Lookout for Vision is a wafer map taken from Wikipedia. A few “bad” spots have been added to mimic real-world scenarios in semiconductor manufacturing.

After deploying the solution, you can run a test to see how it works. When you open the AWS Amplify domain, you see a website that lets you upload an image. For this post, we present the result of detecting a bad wafer with a so-called donut pattern. After you upload the image, it’s displayed on your website.

If the image is detected as an anomaly, Amazon Connect calls your phone number and you can interact with the service.

Conclusion

In this post, we used Amazon Lookout for Vision to automate the detection of anomalies in silicon wafers and alert an operator in real time using Amazon Connect so they can take action as needed.

This solution isn’t bound to just wafers. You can extend it to object tracking in transportation, products in manufacturing, and other endless possibilities.


About the Authors

Tolla Cherwenka is an AWS Global Solutions Architect who is certified in data and analytics. She uses an art-of-the-possible approach to work backwards from business goals and develop transformative event-driven data architectures that enable data-driven decisions. She is passionate about creating prescriptive solutions for refactoring mission-critical monolithic workloads to microservices, as well as supply chain and connected factory solutions that leverage IoT, machine learning, big data, and analytics services.

 

 Michael Wallner is a Global Data Scientist with AWS Professional Services and is passionate about enabling customers on their AI/ML journey in the cloud to become AWSome. Besides having a deep interest in Amazon Connect he likes sports and enjoys cooking.

 

 

Krithivasan Balasubramaniyan is a Principal Consultant at Amazon Web Services. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.

 

Read More

Quality Assessment for SageMaker Ground Truth Video Object Tracking Annotations using Statistical Analysis

Data quality is an important topic for virtually all teams and systems deriving insights from data, especially teams and systems using machine learning (ML) models. Supervised ML is the task of learning a function that maps an input to an output based on examples of input-output pairs. For a supervised ML algorithm to effectively learn this mapping, the input-output pairs must be accurately labeled, which makes data labeling a crucial step in any supervised ML task.

Supervised ML is commonly used in the computer vision space. You can train an algorithm to perform a variety of tasks, including image classification, bounding box detection, and semantic segmentation, among many others. Computer vision annotation tools, like those available in Amazon SageMaker Ground Truth (Ground Truth), simplify the process of creating labels for computer vision algorithms and encourage best practices, resulting in high-quality labels.

To ensure quality, humans must be involved at some stage to either annotate or verify the assets. However, human labelers are often expensive, so it’s important to use them cost-effectively. There is no industry-wide standard for automatically monitoring the quality of annotations during the labeling process of images (or videos or point clouds), so human verification is the most common solution.

The process for human verification of labels involves expert annotators (verifiers) verifying a sample of the data labeled by a primary annotator where the experts correct (overturn) any errors in the labels. You can often find candidate samples that require label verification by using ML methods. In some scenarios, you need the same images, videos, or point clouds to be labeled and processed by multiple labelers to determine ground truth when there is ambiguity. Ground Truth accomplishes this through annotation consolidation to get agreement on what the ground truth is based on multiple responses.

In computer vision, we often deal with tasks that contain a temporal dimension, such as video and LiDAR sensors capturing sequential frames. Labeling this kind of sequential data is complex and time consuming. The goal of this blog post is to reduce the total number of frames that need human review by performing automated quality checks in multi-object tracking (MOT) time series data like video object tracking annotations while maintaining data quality at scale. The quality initiative in this blog post proposes science-driven methods that take advantage of the sequential nature of these inputs to automatically identify potential outlier labels. These methods enable you to a) objectively track data labeling quality for Ground Truth video, b) use control mechanisms to achieve and maintain quality targets, and c) optimize costs to obtain high-quality data.

We will walk through an example situation in which a large video dataset has been labeled by primary human annotators for an ML system and demonstrate how to perform automatic quality assurance (QA) to identify samples that may not be labeled properly. How can this be done without overwhelming a team's limited resources? We'll show you how by using Ground Truth and Amazon SageMaker.

Background

Data annotation is, typically, a manual process in which the annotator follows a set of guidelines and operates in a “best-guess” manner. Discrepancies in labeling criteria between annotators can have an effect on label quality, which may impact algorithm inference performance downstream.

For sequential inputs like video at a high frame rate, it can be assumed that a frame at time t will be very similar to a frame at time t+1. This extends to the labeled objects in the frames and allows large deviations between labels across consecutive frames to be considered outliers, which can be identified with statistical metrics. Auditors can be directed to pay special attention to these outlier frames in the verification process.

A common theme in feedback from customers is the desire to create a standard methodology and framework to monitor annotations from Ground Truth and identify frames with low-quality annotations for auditing purposes. We propose this framework to allow you to measure the quality on a certain set of metrics and take action — for example, by sending those specific frames for relabeling using Ground Truth or Amazon Augmented AI (Amazon A2I).

The following glossary defines terms frequently used in this post:

  • Annotation: The process whereby a human manually captures metadata related to a task. An example would be drawing the outline of the products in a still image.
  • SageMaker Ground Truth: Ground Truth handles the scheduling of various annotation tasks and collecting the results. It also supports defining labor pools and labor requirements for performing the annotation tasks.
  • IoU: The intersection over union (IoU) ratio measures the overlap between two regions of interest in an image. This measures how good our object detector prediction is against the ground truth (the real object boundary).
  • Detection rate: The number of detected boxes divided by the number of ground truth boxes.
  • Annotation pipeline: The complete end-to-end process of capturing a dataset for annotation, submitting the dataset for annotation, performing the annotation, performing quality checks, and adjusting incorrect annotations.
  • Source data: The MOT17 dataset.
  • Target data: The unified ground truth dataset.

Evaluation metrics

Quality validation of annotations using statistical approaches is an exciting open area of research, and the following quality metrics are often used to perform statistical validation.

Intersection over union (IoU)

IoU measures the overlap between two bounding boxes: the area of their intersection divided by the area of their union. A high IoU combined with a low Hausdorff distance indicates that a source bounding box corresponds well with a target bounding box in geometric space. These parameters may also indicate a skew in imagery. A low IoU may indicate quality conflicts between bounding boxes.

IoU(bp, bgt) = area(bp ∩ bgt) / area(bp ∪ bgt)

In the preceding equation, bp is the predicted bounding box and bgt is the ground truth bounding box.
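The rolling IoU code later in this post assumes a helper named bb_int_over_union (defined in the accompanying notebook); a minimal version could look like the following sketch, with boxes given as [x_min, y_min, x_max, y_max].

def bb_int_over_union(boxA, boxB):
    # coordinates of the intersection rectangle
    x1, y1 = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    x2, y2 = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # areas of the two boxes and of their union
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    union = areaA + areaB - inter
    return inter / union if union > 0 else 0.0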

Center Loss

Center loss is the distance between bounding box centers:

center_loss = sqrt((xp - xgt)^2 + (yp - ygt)^2)

In the preceding equation, (xp, yp) is the center of the predicted bounding box and (xgt, ygt) is the center of the ground truth bounding box.

IoU distribution

If the mean, median, and mode of an object's IoU are drastically different from those of other objects, we may want to flag the object in question for manual auditing. We can use visualizations like heat maps for a quick understanding of object-level IoU variance.

MOT17 Dataset

The Multi Object Tracking Benchmark is a commonly used benchmark for multiple target tracking evaluation. It offers a variety of datasets for training and evaluating multi-object tracking models. For this post, we use the MOT17 dataset as our source data, which is based around detecting and tracking a large number of vehicles.

Solution

To run and customize the code used in this blog post, use the notebook Ground_Truth_Video_Quality_Metrics.ipynb in the Amazon SageMaker Examples tab of a notebook instance, under Ground Truth Labeling Jobs. You can also find the notebook on GitHub.

Download MOT17 dataset

Our first step is to download the data, which takes a few minutes, unzip it, and send it to Amazon Simple Storage Service (Amazon S3) so we can launch audit jobs. See the following code:

# Grab our data this will take ~5 minutes
!wget https://motchallenge.net/data/MOT17.zip -O /tmp/MOT17.zip
    
# unzip our data
!unzip -q /tmp/MOT17.zip -d MOT17
!rm /tmp/MOT17.zip

View MOT17 annotations

Now let’s look at what the existing MOT17 annotations look like.

In the following image, we have a scene with a large number of cars and pedestrians on a street. The labels include both bounding box coordinates as well as unique IDs for each object, or in this case cars, being tracked.

Evaluate our labels

For demonstration purposes, we’ve labeled three vehicles in one of the videos and inserted a few labeling anomalies into the annotations. Although human labelers tend to be accurate, they’re subject to conditions like distraction and fatigue, which can affect label quality. If we use automated methods to identify annotator mistakes and send directed recommendations for frames and objects to fix, we can make the label auditing process more accurate and efficient. If a labeler only has to focus on a few frames instead of a deep review of the entire scene, they can drastically improve speed and reduce cost.

Analyze our tracking data

Let’s put our tracking data into a form that’s easier to analyze.

We use a function to take the output JSON from Ground Truth and turn our tracking output into a dataframe. We can use this to plot values and metrics that will help us understand how the object labels move through our frames. See the following code:

# generate dataframes
lab_frame_real = create_annot_frame(tlabels['tracking-annotations'])
lab_frame_real.head()

Plot progression

Let’s start with some simple plots. The following plots illustrate how the coordinates of a given object progress through the frames of your video. Each bounding box has a left and top coordinate, representing the top-left point of the bounding box. We also have height and width values that let us determine the other three points of the box.

In the following plots, the blue lines represent the progression of our four values (top coordinate, left coordinate, width, and height) through the video frames and the orange lines represent a rolling average of the values from the previous five frames. Because a video is a sequence of frames, if we have a video that has five frames per second or more, the objects within the video (and the bounding boxes drawn around them) should have some amount of overlap between frames. In our video, we have vehicles driving at a normal pace so our plots should show a relatively smooth progression.

We can also plot the deviation between the rolling average and the actual values of bounding box coordinates. We’ll likely want to look at frames where the actual value deviates substantially from the rolling average.
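As a rough sketch of that check, the following function flags frames where a coordinate drifts away from its five-frame rolling average by more than a pixel threshold; the column name and threshold here are illustrative assumptions.

# Flag frames where a coordinate deviates sharply from its rolling average.
# Assumes a pandas dataframe with one row per frame, like lab_frame_real above.
def rolling_deviation_frames(annot_frame, col='top', window=5, pixel_thresh=20):
    rolling = annot_frame[col].rolling(window, min_periods=1).mean()
    deviation = (annot_frame[col] - rolling).abs()
    return annot_frame.index[deviation > pixel_thresh].tolist()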

Plot box sizes

Let’s combine the width and height values to look at how the size of the bounding box for a given object progresses through the scene. For Vehicle 1, we intentionally reduced the size of the bounding box on frame 139 and restored it on frame 141. We also removed a bounding box on frame 217. We can see both of these flaws reflected in our size progression plots.

Box size differential

Let’s now look at how the size of the box changes from frame to frame by plotting the actual size differential. This allows us to get a better idea of the magnitude of these changes. We can also normalize the magnitude of the size changes by dividing the size differentials by the sizes of the boxes. This lets us express the differential as a percentage change from the original size of the box. This makes it easier to set thresholds beyond which we can classify this frame as potentially problematic for this object bounding box. The following plots visualize both the absolute size differential and the size differential as a percentage. We can also add lines representing where the bounding box changed by more than 20% in size from one frame to the next.

View the frames with the largest size differential

Now that we have the indexes for the frames with the largest size differential, we can view them in sequence. If we look at the following frames, we can see for Vehicle 1 we were able to identify frames where our labeler made a mistake. Frame 217 was flagged because there was a large difference between frame 216 and the subsequent frame, frame 217.

Rolling IoU

IoU is a commonly used evaluation metric for object detection. We calculate it by dividing the area of overlap between two bounding boxes by the area of union for two bounding boxes. Although it’s typically used to evaluate the accuracy of a predicted box against a ground truth box, we can use it to evaluate how much overlap a given bounding box has from one frame of a video to the next.

Because our frames differ, we don't expect a given bounding box for a single object to have 100% overlap with the corresponding bounding box from the next frame. However, depending on the frames per second for the video, there often is only a small amount of change from one frame to the next because the time elapsed between frames is only a fraction of a second. For higher FPS video, we can expect a substantial amount of overlap between frames. The MOT17 videos are all shot at 25 FPS, so these videos qualify. Operating with this assumption, we can use IoU to identify outlier frames where we see substantial differences between a bounding box in one frame and the next. See the following code:

# calculate rolling intersection over union
def calc_frame_int_over_union(annot_frame, obj, i):
    lframe_len = max(annot_frame['frameid'])
    annot_frame = annot_frame[annot_frame.obj==obj]
    annot_frame.index = list(np.arange(len(annot_frame)))
    coord_vec = np.zeros((lframe_len+1,4))
    coord_vec[annot_frame['frameid'].values, 0] = annot_frame['left']
    coord_vec[annot_frame['frameid'].values, 1] = annot_frame['top']
    coord_vec[annot_frame['frameid'].values, 2] = annot_frame['width']
    coord_vec[annot_frame['frameid'].values, 3] = annot_frame['height']
    boxA = [coord_vec[i,0], coord_vec[i,1], coord_vec[i,0] + coord_vec[i,2], coord_vec[i,1] + coord_vec[i,3]]
    boxB = [coord_vec[i+1,0], coord_vec[i+1,1], coord_vec[i+1,0] + coord_vec[i+1,2], coord_vec[i+1,1] + coord_vec[i+1,3]]
    return bb_int_over_union(boxA, boxB)
# create list of objects
objs = list(np.unique(label_frame.obj))
# iterate through our objects to get rolling IoU values for each
iou_dict = {}
for obj in objs:
    iou_vec = np.ones(len(np.unique(label_frame.frameid)))
    ious = []
    for i in label_frame[label_frame.obj==obj].frameid[:-1]:
        iou = calc_frame_int_over_union(label_frame, obj, i)
        ious.append(iou)
        iou_vec[i] = iou
    iou_dict[obj] = iou_vec
    
fig, ax = plt.subplots(nrows=1,ncols=3, figsize=(24,8), sharey=True)
ax[0].set_title(f'Rolling IoU {objs[0]}')
ax[0].set_xlabel('frames')
ax[0].set_ylabel('IoU')
ax[0].plot(iou_dict[objs[0]])
ax[1].set_title(f'Rolling IoU {objs[1]}')
ax[1].set_xlabel('frames')
ax[1].set_ylabel('IoU')
ax[1].plot(iou_dict[objs[1]])
ax[2].set_title(f'Rolling IoU {objs[2]}')
ax[2].set_xlabel('frames')
ax[2].set_ylabel('IoU')
ax[2].plot(iou_dict[objs[2]])

The following plots show our results:

Identify and visualize low overlap frames

Now that we have calculated our intersection over union for our objects, we can identify objects below an IoU threshold we set. Let’s say we want to identify frames where the bounding box for a given object has less than 50% overlap. We can use the following code:

## ID problem indices
iou_thresh = 0.5
vehicle = 1 # because index starts at 0, 0 -> vehicle:1, 1 -> vehicle:2, etc.
# use np.where to identify frames below our threshold.
inds = np.where(np.array(iou_dict[objs[vehicle]]) < iou_thresh)[0]
worst_ind = np.argmin(np.array(iou_dict[objs[vehicle]]))
print(objs[vehicle],'worst frame:', worst_ind)

Visualize low overlap frames

Now that we have identified our low overlap frames, let’s view them. We can see for Vehicle:2, there is an issue on frame 102, compared to frame 101.

The annotator made a mistake and the bounding box for Vehicle:2 does not go low enough and clearly needs to be extended.

Thankfully our IoU metric was able to identify this!

Embedding comparison

The two preceding methods work because they’re simple and are based on the reasonable assumption that objects in high FPS video don’t move too much from frame to frame. They can be considered more classical methods of comparison. Can we improve upon them? Let’s try something more experimental.

One deep learning method we can use to identify outliers is to generate embeddings for our bounding box crops with an image classification model like ResNet and compare these across frames. Convolutional neural network image classification models have a final fully connected layer that uses a softmax or scaling activation function to output probabilities. If we remove the final layer of our network, our predictions will instead be the image embedding, which is essentially the neural network's representation of the image. If we isolate our objects by cropping our images, we can compare the representations of these objects across frames to see if we can identify any outliers.

We can use a ResNet18 model from PyTorch Hub that was trained on ImageNet. Because ImageNet is a very large and generic dataset, the network over time was able to learn information about images that allows it to classify them into different categories. While a neural network more finely tuned on vehicles would likely perform better, a network trained on a large dataset like ImageNet should have learned enough information to give us some indication of whether images are similar.
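A minimal sketch of that idea, assuming a single HxWx3 crop array as input, could look like the following.

import numpy as np
import torch
import torch.nn as nn
from PIL import Image

# Load ResNet18 from PyTorch Hub and drop the final classification layer,
# leaving a network that outputs a 512-dimensional embedding per image
model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
model.eval()
embedder = nn.Sequential(*list(model.children())[:-1])

def embed_crop(crop_array):
    # crop_array: HxWx3 uint8 array for a single object crop
    crop = np.array(Image.fromarray(crop_array).resize((224, 224)), dtype=np.float32)
    tensor = torch.tensor(crop).permute(2, 0, 1).unsqueeze(0)  # 1x3x224x224
    with torch.no_grad():
        return embedder(tensor).squeeze()  # embedding vector for the crop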

The following code shows our crops:

def plot_crops(obj = 'Vehicle:1', start=0):
    fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20,12))
    for i,a in enumerate(ax):
        a.imshow(img_crops[i+start][obj])
        a.set_title(f'Frame {i+start}')
plot_crops(start=1)

The following image compares the crops in each frame:

Let’s compute the distance between our sequential embeddings for a given object:

def compute_dist(img_embeds, dist_func=distance.euclidean, obj='Vehicle:1'):
    dists = []
    inds = []
    for i in img_embeds:
        if (i>0)&(obj in list(img_embeds[i].keys())):
            if (obj in list(img_embeds[i-1].keys())):
                dist = dist_func(img_embeds[i-1][obj],img_embeds[i][obj]) # distance  between frame at t0 and t1
                dists.append(dist)
                inds.append(i)
    return dists, inds
obj = 'Vehicle:2'
dists, inds = compute_dist(img_embeds, obj=obj)
    
# look for distances that are 2 standard deviation greater than the mean distance
prob_frames = np.where(dists>(np.mean(dists)+np.std(dists)*2))[0]
prob_inds = np.array(inds)[prob_frames]
print(prob_inds)
print('The frame with the greatest distance is frame:', inds[np.argmax(dists)])

Let’s look at the crops for our problematic frames. We can see we were able to catch the issue on frame 102 where the bounding box was off-center.

Combine the metrics

Now that we have explored several methods for identifying anomalous and potentially problematic frames, let’s combine them and identify all of those outlier frames (see the following code). Although we might have a few false positives, these tend to be areas with a lot of action that we might want our annotators to review regardless.

def get_problem_frames(lab_frame, flawed_labels, size_thresh=.25, iou_thresh=.4, embed=False, imgs=None, verbose=False, embed_std=2):
    """
    Function for identifying potentially problematic frames using bounding box size, rolling IoU, and optionally embedding comparison.
    """
    if embed:
        model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
        model.eval()
        modules=list(model.children())[:-1]
        model=nn.Sequential(*modules)
        
    frame_res = {}
    for obj in list(np.unique(lab_frame.obj)):
        frame_res[obj] = {}
        lframe_len = max(lab_frame['frameid'])
        ann_subframe = lab_frame[lab_frame.obj==obj]
        size_vec = np.zeros(lframe_len+1)
        size_vec[ann_subframe['frameid'].values] = ann_subframe['height']*ann_subframe['width']
        size_diff = np.array(size_vec[:-1])- np.array(size_vec[1:])
        norm_size_diff = size_diff/np.array(size_vec[:-1])
        norm_size_diff[np.where(np.isnan(norm_size_diff))[0]] = 0
        norm_size_diff[np.where(np.isinf(norm_size_diff))[0]] = 0
        frame_res[obj]['size_diff'] = [int(x) for x in size_diff]
        frame_res[obj]['norm_size_diff'] = [float(x) for x in norm_size_diff]  # keep fractional values
        try:
            problem_frames = [int(x) for x in np.where(np.abs(norm_size_diff)>size_thresh)[0]]
            if verbose:
                worst_frame = np.argmax(np.abs(norm_size_diff))
                print('Worst frame for', obj, 'is:', worst_frame)
        except:
            problem_frames = []
        frame_res[obj]['size_problem_frames'] = problem_frames
        iou_vec = np.ones(len(np.unique(lab_frame.frameid)))
        for i in lab_frame[lab_frame.obj==obj].frameid[:-1]:
            iou = calc_frame_int_over_union(lab_frame, obj, i)
            iou_vec[i] = iou
            
        frame_res[obj]['iou'] = iou_vec.tolist()
        inds = [int(x) for x in np.where(iou_vec<iou_thresh)[0]]
        frame_res[obj]['iou_problem_frames'] = inds
        
        if embed:
            img_crops = {}
            img_embeds = {}
            for j,img in tqdm(enumerate(imgs)):
                img_arr = np.array(img)
                img_embeds[j] = {}
                img_crops[j] = {}
                for i,annot in enumerate(flawed_labels['tracking-annotations'][j]['annotations']):
                    try:
                        crop = img_arr[annot['top']:(annot['top']+annot['height']),annot['left']:(annot['left']+annot['width']),:]                    
                        new_crop = np.array(Image.fromarray(crop).resize((224,224)))
                        img_crops[j][annot['object-name']] = new_crop
                        new_crop = np.transpose(new_crop, (2, 0, 1))[np.newaxis, :]  # convert to NCHW layout
                        torch_arr = torch.tensor(new_crop, dtype=torch.float)
                        with torch.no_grad():
                            emb = model(torch_arr)
                        img_embeds[j][annot['object-name']] = emb.squeeze()
                    except:
                        pass
                    
            dists, dist_inds = compute_dist(img_embeds, obj=obj)
            # look for distances that are embed_std+ standard deviations greater than the mean distance
            prob_frames = np.where(np.array(dists)>(np.mean(dists)+np.std(dists)*embed_std))[0]
            frame_res[obj]['embed_prob_frames'] = np.array(dist_inds)[prob_frames].tolist()
        
    return frame_res
    
# if you want to add in embedding comparison, set embed=True
num_images_to_validate = 300
embed = False
frame_res = get_problem_frames(label_frame, flawed_labels, size_thresh=.25, iou_thresh=.5, embed=embed, imgs=imgs[:num_images_to_validate])
        
prob_frame_dict = {}
all_prob_frames = []
for obj in frame_res:
    prob_frames = list(frame_res[obj]['size_problem_frames'])
    prob_frames.extend(list(frame_res[obj]['iou_problem_frames']))
    if embed:
        prob_frames.extend(list(frame_res[obj]['embed_prob_frames']))
    all_prob_frames.extend(prob_frames)
    
prob_frame_dict = [int(x) for x in np.unique(all_prob_frames)]
prob_frame_dict

Launch a directed audit job

Now that we’ve identified our problematic annotations, we can launch a new audit labeling job to review identified outlier frames. We can do this via the SageMaker console, but when we want to launch jobs in a more automated fashion, using the boto3 API is very helpful.

Generate manifests

SageMaker Ground Truth operates using manifests. When using a modality like image classification, a single image corresponds to a single entry in a manifest, and a given manifest contains paths for all of the images to be labeled. For videos, because we have multiple frames per video and can have multiple videos in a single manifest, the manifest instead references a JSON sequence file for each video that contains all the paths for its frames. This allows a single manifest to contain multiple videos for a single job, as in the following code:

# create manifest
man_dict = {}
for vid in all_vids:
    source_ref = f"s3://{bucket}/tracking_manifests/{vid.split('/')[-1]}_seq.json"
    annot_labels = f"s3://{bucket}/tracking_manifests/SeqLabel.json"
    manifest = {
        "source-ref": source_ref,
        'Person':annot_labels, 
        "Person-metadata":{"class-map": {"1": "Pedestrian"},
                         "human-annotated": "yes",
                         "creation-date": "2020-05-25T12:53:54+0000",
                         "type": "groundtruth/video-object-tracking"}
    }
    man_dict[vid] = manifest
    
# save videos as individual jobs
for vid in all_vids:
    with open(f"tracking_manifests/{vid.split('/')[-1]}.manifest", 'w') as f:
        json.dump(man_dict[vid],f)
        
# put multiple videos in a single manifest, with each job as a line
# with open(f"/home/ec2-user/SageMaker/tracking_manifests/MOT17.manifest", 'w') as f:
#     for vid in all_vids:    
#         f.write(json.dumps(man_dict[vid]))
#         f.write('\n')
        
print('Example manifest: ', manifest)

The following is our manifest file:

Example manifest:  {'source-ref': 's3://smgt-qa-metrics-input-322552456788-us-west-2/tracking_manifests/MOT17-13-SDP_seq.json', 'Person': 's3://smgt-qa-metrics-input-322552456788-us-west-2/tracking_manifests/SeqLabel.json', 'Person-metadata': {'class-map': {'1': 'Vehicle'}, 'human-annotated': 'yes', 'creation-date': '2020-05-25T12:53:54+0000', 'type': 'groundtruth/video-object-tracking'}}
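For reference, the frame sequence file that source-ref points to lists the individual frames of a single video. The following sketch shows the general shape we'd expect such a file to take; the exact field names are an assumption here, so confirm them against the Ground Truth video frame input manifest documentation.

# Assumed shape of the *_seq.json frame sequence file referenced by "source-ref";
# confirm the field names against the Ground Truth documentation before relying on them
seq = {
    "seq-no": 1,
    "prefix": f"s3://{bucket}/tracking_manifests/MOT17-13-SDP/",
    "number-of-frames": 750,
    "frames": [{"frame-no": i + 1, "frame": f"{i + 1:06d}.jpg"} for i in range(750)],
}
with open("tracking_manifests/MOT17-13-SDP_seq.json", "w") as f:
    json.dump(seq, f)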

Launch jobs

We can use this template for launching labeling jobs (see the following code). For the purposes of this post, we already have labeled data, so this isn’t necessary, but if you want to label the data yourself, you can do so using a private workteam.

# generate jobs
job_names = []
outputs = []
# for vid in all_vids:
LABELING_JOB_NAME = f"mot17-tracking-adjust-{int(time.time())}"
task = 'AdjustmentVideoObjectTracking'
job_names.append(LABELING_JOB_NAME)
INPUT_MANIFEST_S3_URI = f's3://{bucket}/tracking_manifests/MOT20-01.manifest'
createLabelingJob_request = {
  "LabelingJobName": LABELING_JOB_NAME,
  "HumanTaskConfig": {
    "AnnotationConsolidationConfig": {
      "AnnotationConsolidationLambdaArn": f"arn:aws:lambda:us-east-1:432418664414:function:ACS-{task}"
    }, # changed us-west-2 to us-east-1
    "MaxConcurrentTaskCount": 200,
    "NumberOfHumanWorkersPerDataObject": 1,
    "PreHumanTaskLambdaArn": f"arn:aws:lambda:us-east-1:432418664414:function:PRE-{task}",
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "TaskDescription": f"Please draw boxes around vehicles, with a specific focus on the following frames {prob_frame_dict}",
    "TaskKeywords": [
      "Image Classification",
      "Labeling"
    ],
    "TaskTimeLimitInSeconds": 7200,
    "TaskTitle": LABELING_JOB_NAME,
    "UiConfig": {
      "HumanTaskUiArn": f'arn:aws:sagemaker:us-east-1:394669845002:human-task-ui/VideoObjectTracking'
    },
    "WorkteamArn": WORKTEAM_ARN
  },
  "InputConfig": {
    "DataAttributes": {
      "ContentClassifiers": [
        "FreeOfPersonallyIdentifiableInformation",
        "FreeOfAdultContent"
      ]
    },
    "DataSource": {
      "S3DataSource": {
        "ManifestS3Uri": INPUT_MANIFEST_S3_URI
      }
    }
  },
  "LabelAttributeName": "Person-ref",
  "LabelCategoryConfigS3Uri": LABEL_CATEGORIES_S3_URI,
  "OutputConfig": {
    "S3OutputPath": f"s3://{bucket}/gt_job_results"
  },
  "RoleArn": role,
  "StoppingConditions": {
    "MaxPercentageOfInputDatasetLabeled": 100
  }
}
print(createLabelingJob_request)
out = sagemaker_cl.create_labeling_job(**createLabelingJob_request)
outputs.append(out)
print(out)

Conclusion

In this post, we introduced how to measure the quality of sequential annotations, namely video multi-frame object tracking annotations, using statistical analysis and various quality metrics (IoU, rolling IoU, and embedding comparisons). In addition, we walked through how to flag frames that aren't labeled properly using these quality metrics and send those frames for verification or audit jobs using SageMaker Ground Truth, generating a new version of the dataset with more accurate annotations. You can perform quality checks on annotations for video data using this approach, or on 3D point cloud data using similar approaches such as 3D IoU, in an automated manner at scale while reducing the number of frames that require human audit.

Try out the notebook and add your own quality metrics for different task types supported by SageMaker Ground Truth. With this process in place, you can generate high-quality datasets for a wide range of business use cases in a cost-effective manner without compromising the quality of annotations.

For more information about labeling with Ground Truth, see Easily perform bulk label quality assurance using Amazon SageMaker Ground Truth.

References

  1. https://en.wikipedia.org/wiki/Hausdorff_distance
  2. https://aws.amazon.com/blogs/machine-learning/easily-perform-bulk-label-quality-assurance-using-amazon-sagemaker-ground-truth/

About the Authors

 Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.

 

 

Isaac Privitera is a Machine Learning Specialist Solutions Architect and helps customers design and build enterprise-grade computer vision solutions on AWS. Isaac has a background in using machine learning and accelerated computing for computer vision and signals analysis. Isaac also enjoys cooking, hiking, and keeping up with the latest advancements in machine learning in his spare time.

Read More

It’s here! Join us for Amazon SageMaker Month, 30 days of content, discussion, and news

Want to accelerate machine learning (ML) innovation in your organization? Join us for 30 days of new Amazon SageMaker content designed to help you build, train, and deploy ML models faster. On April 20, we're kicking off 30 days of hands-on workshops, Twitch sessions, Slack chats, and partner perspectives. Our goal is to connect you with AWS experts so you can learn hints and tips for success with ML. These experts include Greg Coquillio, the second-most influential speaker according to LinkedIn Top Voices 2020: Data Science & AI, and Julien Simon, the number one AI evangelist according to AI magazine.

We built SageMaker from the ground up to provide every developer and data scientist with the ability to build, train, and deploy ML models quickly and at lower cost by providing the tools required for every step of the ML development lifecycle in one integrated, fully managed service. We have launched over 50 SageMaker capabilities in the past year alone, all aimed at making this process easier for our customers. The customer response to what we’re building has been incredible, making SageMaker one of the fastest growing services in AWS history.

To help you dive deep into these SageMaker innovations, we're dedicating April 20 – May 21, 2021 to SageMaker education. Here are some must-dos to add to your calendar:

Besides these virtual hands-on opportunities, we will have regular blog posts from AWS experts and our partners, including Snowflake, Tableau, Genesys, and DOMO. Bookmark the SageMaker Month webpage or sign up to our weekly newsletters so you don’t miss any of the planned activities.

But we aren’t stopping there!

To coincide with SageMaker Month, we launched new Savings Plans. The SageMaker Savings Plans offer a flexible, usage-based pricing model for SageMaker. The goal of the savings plans is to offer you the flexibility to save up to 64% on SageMaker ML instance usage in exchange for a commitment of consistent usage for a 1 or 3-year term. For more information, read the launch blog. Further, to help you save even more, we also just announced a price drop on several instance families in SageMaker.

The SageMaker Savings Plans are on top of the productivity and cost-optimizing capabilities already available in SageMaker Studio. You can improve your data science team’s productivity up to 10 times using SageMaker Studio. SageMaker Studio provides a single web-based visual interface where you can perform all your ML development steps. SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, which boosts productivity.

You can also optimize costs through capabilities such as Managed Spot Training, in which you use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for your SageMaker training jobs (see Optimizing and Scaling Machine Learning Training with Managed Spot Training for Amazon SageMaker), and Amazon Elastic Inference, which allows you to attach just the right amount of GPU-powered inference acceleration to any SageMaker instance type.

We are also excited to see continued customer momentum with SageMaker. Just in the first quarter of 2021, we launched 15 new SageMaker case studies and references, spanning a wide range of industries and including SNCF, Mueller, Bundesliga, University of Oxford, and Latent Space. Some highlights include:

  • The data science team at SNCF reduced model training time from 3 days to 10 hours.
  • Mueller Water Products automated the daily collection of more than 5 GB of data and used ML to improve leak-detection performance.
  • Latent Space scaled model training beyond 1 billion parameters.

We would love for you to join the thousands of customers who are seeing success with Amazon SageMaker. We want to add you to our customer reference list, and we can’t wait to work with you this month!


About the Author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

Read More

Enforce VPC rules for Amazon Comprehend jobs and CMK encryption for custom models

You can now control the Amazon Virtual Private Cloud (Amazon VPC) and encryption settings for your Amazon Comprehend APIs using AWS Identity and Access Management (IAM) condition keys, and encrypt your Amazon Comprehend custom models using customer managed keys (CMK) via AWS Key Management Service (AWS KMS). IAM condition keys enable you to further refine the conditions under which an IAM policy statement applies. You can use the new condition keys in IAM policies when granting permissions to create asynchronous jobs and to create custom classification or custom entity training jobs.

Amazon Comprehend now supports five new condition keys:

  • comprehend:VolumeKmsKey
  • comprehend:OutputKmsKey
  • comprehend:ModelKmsKey
  • comprehend:VpcSecurityGroupIds
  • comprehend:VpcSubnets

The keys allow you to ensure that users can only create jobs that meet your organization’s security posture, such as jobs that are connected to the allowed VPC subnets and security groups. You can also use these keys to enforce encryption settings for the storage volumes where the data is pulled down for computation and on the Amazon Simple Storage Service (Amazon S3) bucket where the output of the operation is stored. If users try to use an API with VPC settings or encryption parameters that aren’t allowed, Amazon Comprehend rejects the operation synchronously with a 403 Access Denied exception.

Solution overview

The following diagram illustrates the architecture of our solution.

We want to enforce a policy to do the following:

  • Make sure that all custom classification training jobs are specified with VPC settings
  • Have encryption enabled for the classifier training job, the classifier output, and the Amazon Comprehend model

This way, when someone starts a custom classification training job, the training data that is pulled in from Amazon S3 is copied to the storage volumes in your specified VPC subnets and is encrypted with the specified VolumeKmsKey. The solution also makes sure that the results of the model training are encrypted with the specified OutputKmsKey. Finally, the Amazon Comprehend model itself is encrypted with the AWS KMS key specified by the user when it’s stored within the VPC. The solution uses three different keys for the data, output, and the model, respectively, but you can choose to use the same key for all three tasks.

Additionally, this new functionality enables you to audit model usage in AWS CloudTrail by tracking the model encryption key usage.
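As an illustration of how the three keys come together, a boto3 call to create the classifier could look like the following sketch; all ARNs, bucket names, subnets, and security groups are placeholders.

import boto3

comprehend = boto3.client('comprehend')

# Create a custom classifier with volume, output, and model encryption keys plus VPC settings
response = comprehend.create_document_classifier(
    DocumentClassifierName='testModel',
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::111122223333:role/testDataAccessRole',
    InputDataConfig={'S3Uri': 's3://S3Bucket/docclass/filename'},
    OutputDataConfig={
        'S3Uri': 's3://S3Bucket/output/',
        'KmsKeyId': 'arn:aws:kms:us-east-1:111122223333:key/output-key-id',
    },
    VolumeKmsKeyId='arn:aws:kms:us-east-1:111122223333:key/volume-key-id',
    ModelKmsKeyId='arn:aws:kms:us-east-1:111122223333:key/model-key-id',
    VpcConfig={
        'SecurityGroupIds': ['sg-11a111111a1exmaple'],
        'Subnets': ['subnet-11aaa111111example'],
    },
)
print(response['DocumentClassifierArn'])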

Encryption with IAM policies

The following policy makes sure that users must specify VPC subnets and security groups for VPC settings and AWS KMS keys for both the classifier and output:

{
   "Version": "2012-10-17",
   "Statement": [{
    "Action": ["comprehend:CreateDocumentClassifier"],
    "Effect": "Allow",
    "Resource": "*",
    "Condition": {
      "Null": {
        "comprehend:VolumeKmsKey": "false",
        "comprehend:OutputKmsKey": "false",
        "comprehend:ModelKmsKey": "false",
        "comprehend:VpcSecurityGroupIds": "false",
        "comprehend:VpcSubnets": "false"
      }
    }
  }]
}

For example, in the following code, User 1 provides both the VPC settings and the encryption keys, and can successfully complete the operation:

aws comprehend create-document-classifier \
--region region \
--document-classifier-name testModel \
--language-code en \
--input-data-config S3Uri=s3://S3Bucket/docclass/filename \
--data-access-role-arn arn:aws:iam::[your account number]:role/testDataAccessRole \
--volume-kms-key-id arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--model-kms-key-id arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--output-data-config S3Uri=s3://S3Bucket/output/filename,KmsKeyId=arn:aws:kms:region:[your account number]:alias/ExampleAlias \
--vpc-config SecurityGroupIds=sg-11a111111a1example,Subnets=subnet-11aaa111111example

User 2, on the other hand, doesn’t provide any of these required settings and isn’t allowed to complete the operation:

aws comprehend create-document-classifier \
--region region \
--document-classifier-name testModel \
--language-code en \
--input-data-config S3Uri=s3://S3Bucket/docclass/filename \
--data-access-role-arn arn:aws:iam::[your account number]:role/testDataAccessRole \
--output-data-config S3Uri=s3://S3Bucket/output/filename

In the preceding code examples, as long as the VPC settings and the encryption keys are set, you can run the custom classifier training job. Leaving the VPC and encryption settings in their default state results in a 403 Access Denied exception.

In the next example, we enforce an even stricter policy, in which we have to set the VPC and encryption settings to also include specific subnets, security groups, and KMS keys. This policy applies these rules for all Amazon Comprehend APIs that start new asynchronous jobs, create custom classifiers, and create custom entity recognizers. See the following code:

{
   "Version": "2012-10-17",
   "Statement": [{
    "Action":
     [
    "comprehend:CreateDocumentClassifier",
    "comprehend:CreateEntityRecognizer",
    "comprehend:Start*Job"
    ],
    "Effect": "Allow",
    "Resource": "*",
    "Condition": {
      "ArnEquals": {
        "comprehend:VolumeKmsKey": "arn:aws:kms:region:[your account number]:key/key_id",
        "comprehend:ModelKmsKey": "arn:aws:kms:region:[your account number]:key/key_id1",
        "comprehend:OutputKmsKey": "arn:aws:kms:region:[your account number]:key/key_id2"
      },
      "ForAllValues:StringLike": {
        "comprehend:VpcSecurityGroupIds": [
          "sg-11a111111a1exmaple"
        ],
        "comprehend:VpcSubnets": [
          "subnet-11aaa111111example"
        ]
      }
    }
  }]
}

In the next example, we first create a custom classifier on the Amazon Comprehend console without specifying the encryption option. Because we have the IAM conditions specified in the policy, the operation is denied.

When you enable classifier encryption, Amazon Comprehend encrypts the data in the storage volume while your job is being processed. You can use an AWS KMS customer managed key from your own account or from a different account. You can specify the encryption settings for the custom classifier job as in the following screenshot.

Output encryption enables Amazon Comprehend to encrypt the output results from your analysis. Similar to Amazon Comprehend job encryption, you can use an AWS KMS customer managed key from your own account or from a different account.

Because our policy also enforces the jobs to be launched with VPC and security group access enabled, you can specify these settings in the VPC settings section.

Amazon Comprehend API operations and IAM condition keys

For the Amazon Comprehend API operations and the IAM condition keys that each supports as of this writing, see Actions, resources, and condition keys for Amazon Comprehend.

Model encryption with a CMK

Along with encrypting your training data, you can now encrypt your custom models in Amazon Comprehend using a CMK. In this section, we go into more detail about this feature.

Prerequisites

You need to add an IAM policy to allow a principal to use or manage CMKs. CMKs are specified in the Resource element of the policy statement. When writing your policy statements, it’s a best practice to limit CMKs to those that the principals need to use, rather than give the principals access to all CMKs.

In the following example, we use an AWS KMS key (1234abcd-12ab-34cd-56ef-1234567890ab) to encrypt an Amazon Comprehend custom model.

When you use AWS KMS encryption, kms:CreateGrant and kms:RetireGrant permissions are required for model encryption.

For example, the following IAM policy statement in the dataAccessRole provided to Amazon Comprehend allows the principal to call these AWS KMS operations only on the CMKs listed in the Resource element of the policy statement:

{"Version": "2012-10-17",
  "Statement": {"Effect": "Allow",
    "Action": [
      "kms:CreateGrant",
      "kms:RetireGrant",
      "kms:GenerateDataKey",
      "kms:Decrypt"
    ],
    "Resource": [
      "arn:aws:kms:us-west-2:[your account number]:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    ]
  }
}

Specifying CMKs by key ARN, which is a best practice, makes sure that the permissions are limited only to the specified CMKs.

Enable model encryption

As of this writing, custom model encryption is available only via the AWS Command Line Interface (AWS CLI). The following example creates a custom classifier with model encryption:

aws comprehend create-document-classifier \
--document-classifier-name my-document-classifier \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--language-code en --region us-west-2 \
--model-kms-key-id arn:aws:kms:us-west-2:[your account number]:key/[your key Id] \
--input-data-config S3Uri=s3://path-to-data/multiclass_train.csv
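After the training job starts, you can verify that the classifier is using your CMK by describing it and checking the ModelKmsKeyId field in the response. The classifier ARN below is a placeholder for the ARN returned by the create call:

# Describe the classifier; the response includes the ModelKmsKeyId used to encrypt the model
aws comprehend describe-document-classifier \
--region us-west-2 \
--document-classifier-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier/my-document-classifier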

The next example trains a custom entity recognizer with model encryption:

aws comprehend create-entity-recognizer \
--recognizer-name my-entity-recognizer \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--language-code "en" --region us-west-2 \
--model-kms-key-id arn:aws:kms:us-west-2:[your account number]:key/[your key Id] \
--input-data-config '{
      "EntityTypes": [{"Type": "PERSON"}, {"Type": "LOCATION"}],
      "Documents": {
            "S3Uri": "s3://path-to-data/documents"
      },
      "Annotations": {
          "S3Uri": "s3://path-to-data/annotations"
      }
}'

Finally, you can also create an endpoint for your encrypted custom model. The data access role you pass must be able to use the CMK so that Amazon Comprehend can read the encrypted model:

aws comprehend create-endpoint \
--endpoint-name myendpoint \
--model-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier/my-document-classifier \
--data-access-role-arn arn:aws:iam::[your account number]:role/mydataaccessrole \
--desired-inference-units 1 --region us-west-2
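When the endpoint status changes to In service, you can run real-time inference against the encrypted model. The following sketch classifies a short sample text with the endpoint created above; the endpoint ARN and text are placeholders:

# Classify a sample document in real time using the endpoint
aws comprehend classify-document \
--region us-west-2 \
--endpoint-arn arn:aws:comprehend:us-west-2:[your account number]:document-classifier-endpoint/myendpoint \
--text "Sample text to classify with the encrypted custom model"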

Conclusion

You can now enforce security controls, such as encryption and VPC settings, for your Amazon Comprehend jobs using IAM condition keys. The IAM condition keys are available in all AWS Regions where Amazon Comprehend is available. You can also encrypt your Amazon Comprehend custom models using customer managed keys.

To learn more about the new condition keys and view policy examples, see Using IAM condition keys for VPC settings and Resource and Conditions for Amazon Comprehend APIs. To learn more about using IAM condition keys, see IAM JSON policy elements: Condition.


About the Authors

Sam Palani is an AI/ML Specialist Solutions Architect at AWS. He enjoys working with customers to help them architect machine learning solutions at scale. When not helping customers, he enjoys reading and exploring the outdoors.

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

Read More

AWS launches free digital training courses to empower business leaders with ML knowledge

Today, we’re pleased to launch Machine Learning Essentials for Business and Technical Decision Makers, a series of three free, on-demand, digital-training courses from AWS Training and Certification. These courses are intended to empower business leaders and technical decision makers with the foundational knowledge needed to begin shaping a machine learning (ML) strategy for their organization, even if they have no prior ML experience. Each 30-minute course includes real-world examples from Amazon’s 20+ years of experience scaling ML within its own operations, as well as lessons learned through countless successful customer implementations. These new courses are based on content delivered through the AWS Machine Learning Embark program, an exclusive, hands-on ML accelerator that brings together executives and technologists at an organization to solve business problems with ML via a holistic learning experience. After completing the three courses, business leaders and technical decision makers will be better able to assess their organization’s readiness, identify the areas of the business where ML will be most impactful, and define concrete next steps.

Last year, Amazon announced that we’re committed to helping 29 million individuals around the world grow their tech skills with free cloud computing skills training by 2025. The new Machine Learning Essentials for Business and Technical Decision Makers series presents one more step in this direction, with three courses:

  • Machine Learning: The Art of the Possible is the first course in the series. Using clear language and specific examples, this course helps you understand the fundamentals of ML, common use cases, and even potential challenges.
  • Planning a Machine Learning Project – the second course – breaks down how you can help your organization plan for an ML project. Starting with the process of assessing whether ML is the right fit for your goals and progressing through the key questions you need to ask during deployment, this course helps you understand important issues, such as data readiness, project timelines, and deployment.
  • Building a Machine Learning Ready Organization – the final course – offers insights into how to prepare your organization to successfully implement ML, from data-strategy evaluation, to culture, to starting an ML pilot, and more.

Democratizing access to free ML training

ML has the potential to transform nearly every industry, but most organizations struggle to adopt and implement ML at scale. Recent Gartner research shows that only 53% of ML projects make it from prototype to production. The most common barriers we see today are business and culture related. For instance, organizations often struggle to identify the right use cases to start their ML journey; this is often exacerbated by a shortage of skilled talent to execute on an organization’s ML ambitions. In fact, as an additional Gartner study shows, “skills of staff” is the number one challenge or barrier to the adoption of artificial intelligence (AI) and ML. Business leaders play a critical role in addressing these challenges by driving a culture of continuous learning and innovation; however, many lack the resources to develop their own knowledge of ML and its use cases.

With the new Machine Learning Essentials for Business and Technical Decision Makers course, we’re making a portion of the AWS Machine Learning Embark curriculum available globally as free, self-paced, digital-training courses.

The AWS Machine Learning Embark program has already helped many organizations harness the power of ML at scale. For example, the Met Office (the UK’s national weather service) is a great example of how organizations can accelerate their team’s ML knowledge using the program. As a research- and science-based organization, the Met Office develops custom weather-forecasting and climate-projection models that rely on very large observational data sets that are constantly being updated. As one of its many data-driven challenges, the Met Office was looking to develop an approach using ML to investigate how the Earth’s biosphere could alter in response to climate change. The Met Office partnered with the Amazon ML Solutions Lab through the AWS Machine Learning Embark program to explore novel approaches to solving this. “We were excited to work with colleagues from the AWS ML Solutions Lab as part of the Embark program,” said Professor Albert Klein-Tank, head of the Met Office’s Hadley Centre for Climate Science and Services. “They provided technical skills and experience that enabled us to explore a complex categorization problem that offers improved insight into how Earth’s biosphere could be affected by climate change. Our climate models generate huge volumes of data, and the ability to extract added value from it is essential for the provision of advice to our government and commercial stakeholders. This demonstration of the application of machine learning techniques to research projects has supported the further development of these skills across the Met Office.”

In addition to giving access to ML Embark content through the Machine Learning Essentials for Business and Technical Decision Makers, we’re also expanding the availability of the full ML Embark program through key strategic AWS Partners, including Slalom Consulting. We’re excited to jointly offer this exclusive program to all enterprise customers looking to jump-start their ML journey.

We invite you to expand your ML knowledge and help lead your organization to innovate with ML. Learn more and get started today.


About the Author

Michelle K. Lee is vice president of the Machine Learning Solutions Lab at AWS.

Read More