How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker

This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry.

Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, more than the trade in illegal drugs and human trafficking. Although traditional methods of counterfeit prevention like unique barcodes and product verification can be very effective, new machine learning (ML) technologies such as object detection seem very promising. With object detection, you can now snap a picture of a product and know almost instantly whether that product is likely to be legitimate or fraudulent.

The Barcode Registry (in conjunction with its partner Buyabarcode.com) is a full-service solution that helps customers prevent product fraud and counterfeiting. It does this by selling unique GS1-registered barcodes, verifying product ownership, and registering users’ products and barcodes in a comprehensive database. Their latest offering, which we discuss in this post, uses Amazon SageMaker to create object detection models to help instantly recognize counterfeit products.

Overview of solution

To use these object detection models, you first need to collect data to train them. Companies upload annotated pictures of their products to The Barcode Registry website. After this data is uploaded to Amazon Simple Storage Service (Amazon S3) and processed by AWS Lambda functions, you can use it to train a SageMaker object detection model. The model is hosted on a SageMaker endpoint, which the website connects to the end user.

There are three key steps The Barcode Registry uses to create a custom object detection model with SageMaker:

  1. Create a training script for SageMaker to run.
  2. Build a Docker container from the training script and upload it to Amazon ECR.
  3. Use the SageMaker console to train a model with the custom algorithm.

Product data

As a prerequisite, to train an object detection model you need an AWS account and training images: at least 100 high-quality (high-resolution, captured in multiple lighting conditions) pictures of your object. As with any ML model, high-quality data is paramount. To train an object detection model, we need images containing the relevant products as well as bounding boxes describing where the products are in the images, as shown in the following example.

To train an effective model, pictures of each of a brand’s products with different backgrounds and lighting conditions are needed—approximately 30–100 unique annotated images for each product.

After the images are uploaded to the web server, they’re uploaded to Amazon S3 using the AWS SDK for PHP. A Lambda event is triggered each time an image is uploaded. The function removes the Exif metadata from the images, which can sometimes cause them to appear rotated when they’re opened by the ML libraries later used to train the model. The associated bounding box data is stored in JSON files and uploaded to Amazon S3 to accompany the images.
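
The following is a minimal sketch of such an Exif-stripping Lambda function, assuming Pillow is packaged with the function or provided through a layer; the bucket handling and the processed/ output prefix are illustrative rather than The Barcode Registry's actual implementation.

import io
import boto3
from PIL import Image  # Pillow must be included in the deployment package or a layer

s3 = boto3.client('s3')

def handler(event, context):
    # Triggered by an S3 upload event
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Download the uploaded image
    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj['Body'].read())).convert('RGB')

    # Copy only the pixel data into a new image, dropping Exif and other metadata
    clean = Image.new('RGB', image.size)
    clean.putdata(list(image.getdata()))

    # Write the cleaned image back to Amazon S3 under a separate prefix
    # so the upload trigger doesn't fire again on the processed object
    buffer = io.BytesIO()
    clean.save(buffer, format='JPEG')
    s3.put_object(Bucket=bucket, Key=f'processed/{key}', Body=buffer.getvalue())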

SageMaker for object detection models

SageMaker is a managed ML service that includes a variety of tools for building, training, and hosting models in the cloud. In particular, The Barcode Registry uses SageMaker for its object detection service because of SageMaker's reliable and scalable ML model training and hosting services. This means that many brands can have their own object detection models trained and hosted, and even if usage spikes unpredictably, there won't be any downtime.

The Barcode Registry uses custom Docker containers uploaded to Amazon Elastic Container Registry (Amazon ECR) in order to have more fine-grained control of the object detection algorithm employed for training and inference, as well as support for Multi Model Server (MMS). MMS is very important for the counterfeit detection use case because it allows multiple brands' models to be cost-effectively hosted on the same server. Alternatively, you can use the built-in object detection algorithm to quickly deploy standard models developed by AWS.

Train a custom object detection model with SageMaker

First, you need to add your object detection algorithm. In this case, upload a Docker container featuring scripts to train a Yolov5 object detection model to Amazon ECR:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter a name for the notebook instance and under Permissions and encryption choose an AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. Open the Git repositories menu.
  5. Select Clone a public Git repository to this notebook instance only and paste the following Git repository URL: https://github.com/portoaj/SageMakerObjectDetection
  6. Click Create notebook instance and wait about five minutes for the instance’s status to update from Pending to InService in the Notebook instance menu.
  7. Once the notebook is InService, select it and click Actions and Open Jupyter to launch the notebook instance in a new tab.
  8. Select the SageMakerObjectDetection directory and then click on sagemakerobjectdetection.ipynb to launch the Jupyter notebook.
  9. Select the conda_python3 kernel and click Set Kernel.
  10. Select the code cell and set the aws_account_id variable to your AWS Account ID.
  11. Click Run to begin the process of building a Docker container and uploading it to Amazon ECR. This process may take about 20 minutes to complete.
  12. Once the Docker container has been uploaded, return to the Notebook instances menu, select your instance, and click Actions and Stop to shut your notebook instance down.

After the algorithm is built and pushed to Amazon ECR, you can use it to train a model via the SageMaker console.

  1. On the SageMaker console, under Training in the navigation pane, choose Training jobs.
  2. Choose Create training job.
  3. Enter a name for the job and choose the AWS Identity and Access Management (IAM) role with the necessary permissions.
  4. For Algorithm source, select Your own algorithm container in ECR.
  5. For Container, enter the registry path.
  6. Under Resource configuration, choose a single ml.p2.xlarge instance, which should be sufficient for training a Yolov5 model.
  7. Specify Amazon S3 locations for your input data and output path, and configure any other settings, such as a VPC via Amazon Virtual Private Cloud (Amazon VPC) or Managed Spot Training.
  8. Choose Create training job.

You can track the model’s training progress on the SageMaker console.

Automated model training

The following diagram illustrates the automated model training workflow:

To make SageMaker start training the object detection model as soon as a user finishes uploading their data, the web server uses Amazon API Gateway to notify a Lambda function that the brand has finished and to begin a training job.
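
A Lambda function along the lines of the following sketch could start such a training job with the AWS SDK for Python (Boto3); the container image URI, role, S3 paths, and the brand_id payload field are placeholders rather than the production values.

import os
import boto3

sagemaker = boto3.client('sagemaker')

def handler(event, context):
    brand = event['brand_id']  # hypothetical field passed through API Gateway
    sagemaker.create_training_job(
        TrainingJobName=f'yolov5-{brand}',
        AlgorithmSpecification={
            'TrainingImage': os.environ['TRAINING_IMAGE_URI'],  # custom container in Amazon ECR
            'TrainingInputMode': 'File'
        },
        RoleArn=os.environ['SAGEMAKER_ROLE_ARN'],
        InputDataConfig=[{
            'ChannelName': 'training',
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': f's3://training-data-bucket/{brand}/',  # placeholder bucket
                'S3DataDistributionType': 'FullyReplicated'
            }}
        }],
        OutputDataConfig={'S3OutputPath': 's3://model-output-bucket/'},  # placeholder bucket
        ResourceConfig={'InstanceType': 'ml.p2.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 50},
        StoppingCondition={'MaxRuntimeInSeconds': 86400}
    )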

When a brand’s model is successfully trained, Amazon EventBridge calls a Lambda function that moves the trained model into the live endpoint’s S3 bucket, where it’s finally ready for inference. A newer alternative to using Amazon EventBridge to move models through the MLOps lifecycle that you should consider is SageMaker Pipelines.
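
As a rough sketch, the EventBridge-triggered function might look as follows; the event fields come from the SageMaker training job state change event, and the destination bucket name and key layout are placeholders.

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    detail = event['detail']
    if detail.get('TrainingJobStatus') != 'Completed':
        return

    # Location of the trained model artifact, e.g. s3://model-output-bucket/job-name/output/model.tar.gz
    artifact = detail['ModelArtifacts']['S3ModelArtifacts']
    source_bucket, source_key = artifact.replace('s3://', '').split('/', 1)

    # Copy the artifact into the bucket backing the live multi-model endpoint
    s3.copy_object(
        Bucket='live-endpoint-models-bucket',          # placeholder bucket name
        Key=source_key.split('/')[0] + '.tar.gz',      # one model archive per brand
        CopySource={'Bucket': source_bucket, 'Key': source_key}
    )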

Host the model for inference

The following diagram illustrates the inference workflow:

To use the trained models, SageMaker requires an inference model to be hosted by an endpoint. The endpoint is the server or array of servers that are used to actually host the inference model. Similar to the training container that we created, a Docker container for inference is hosted in Amazon ECR. The inference model uses that Docker container and takes the input image the user took with their phone, runs it through the trained object detection model, and outputs the result.

Again, The Barcode Registry uses custom Docker containers for the inference model to enable the use of Multi Model Server, but if only one model is needed, it can easily be hosted through the built-in object detection algorithm.
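
For completeness, here is a minimal sketch of how the website's backend could call such an endpoint with the SageMaker runtime API; on a multi-model endpoint, the TargetModel parameter selects the brand-specific model. The endpoint and model names are placeholders.

import boto3

runtime = boto3.client('sagemaker-runtime')

with open('product-photo.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName='counterfeit-detection-endpoint',  # placeholder endpoint name
    ContentType='application/x-image',
    Body=payload,
    TargetModel='brand-123.tar.gz'  # selects the brand's model on the multi-model endpoint
)
detections = response['Body'].read()
# detections contains the bounding boxes and confidence scores returned by the container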

Conclusion

The Barcode Registry (in conjunction with its partner Buyabarcode.com) uses AWS for its entire object detection pipeline. The web server reliably stores data in Amazon S3 and uses API Gateway and Lambda functions to connect the web server to the cloud. SageMaker readily trains and hosts ML models, which means a user can take a picture of a product on their phone and see if the product is a counterfeit. This post shows how to create and host an object detection model using SageMaker, as well as how to automate the process.

In testing, the model was able to achieve over 90% accuracy with a training set of 62 images and a testing set of 32 images, which is impressive for a model trained without any human intervention. To get started training object detection models yourself, check out the official documentation or learn how to deploy an object detection model to the edge using AWS IoT Greengrass.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Andrew Masek, Software Engineer at The Barcode Registry.

Erik Quisling, CEO of The Barcode Registry.


Build a cold start time series forecasting engine using AutoGluon

Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as a new product that just entered the market in the retail industry. Traditional time series forecasting methods such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ES) rely heavily on historical time series of each individual product, and therefore aren’t effective for cold start forecasting.

In this post, we demonstrate how to build a cold start forecasting engine using AutoGluon AutoML for time series forecasting, an open-source Python package to automate machine learning (ML) on image, text, tabular, and time series data. AutoGluon provides an end-to-end automated machine learning (AutoML) pipeline for beginners to experienced ML developers, making it the most accurate and easy-to-use fully automated solution. We use the free Amazon SageMaker Studio Lab service for this demonstration.

Introduction to AutoGluon time series

AutoGluon is a leading open-source library for AutoML for text, image, and tabular data, allowing you to produce highly accurate models from raw data with just one line of code. Recently, the team has been working to extend these capabilities to time series data, and has developed an automated forecasting module that is publicly available on GitHub. The autogluon.forecasting module automatically processes raw time series data into the appropriate format, and then trains and tunes various state-of-the-art deep learning models to produce accurate forecasts. In this post, we demonstrate how to use autogluon.forecasting and apply it to cold start forecasting tasks.

Solution overview

Because AutoGluon is an open-source Python package, you can implement this solution locally on your laptop or on Amazon SageMaker Studio Lab. We walk through the following steps:

  1. Set up AutoGluon for Amazon SageMaker Studio Lab.
  2. Prepare the dataset.
  3. Define training parameters using AutoGluon.
  4. Train a cold start forecasting engine for time series forecasting.
  5. Visualize cold start forecasting predictions.

The key assumption of cold start forecasting is that items with similar characteristics should have similar time series trajectories, which is what allows cold start forecasting to make predictions on items without historical data, as illustrated in the following figure.

In our walkthrough, we use a synthetic dataset based on electricity consumption, which consists of the hourly time series for 370 items, each with an item_id from 0–369. Within this synthetic dataset, each item_id is also associated with a static feature (a feature that doesn’t change over time). We train a DeepAR model using AutoGluon to learn the typical behavior of similar items, and transfer such behavior to make predictions on new items (item_id 370–373) that don’t have historical time series data. Although we’re demonstrating the cold start forecasting approach with only one static feature, in practice, having informative and high-quality static features is the key for a good cold start forecast.

The following diagram provides a high-level overview of our solution. The open-source code is available on the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Log in to your Amazon SageMaker Studio Lab account and set up the environment using the terminal:

cd sagemaker-studiolab-notebooks/ 
git clone https://github.com/whosivan/amazon-sagemaker-studio-lab-cold-start-forecasting-using-autogluon
conda env create -f autogluon.yml
conda activate autogluon
git clone https://github.com/yx1215/autogluon.git
cd autogluon/
git checkout --track origin/add_forecasting_predictor

These instructions should also work from your laptop if you don’t have access to Amazon SageMaker Studio Lab (we recommend installing Anaconda on your laptop first).

When you have the virtual environment fully set up, launch the notebook AutoGluon-cold-start-demo.ipynb and select the custom environment .conda-autogluon:Python kernel.

Prepare the target time series and item meta dataset

Download the following datasets to your notebook instance if they’re not included, and save them under the directory data/. You can find these datasets on our GitHub repo:

  • Test.csv.gz
  • coldStartTargetData.csv
  • itemMetaData.csv

Run the following snippet to load the target time series dataset into the kernel:

import pandas as pd  # the repo's util module is imported earlier in the notebook

zipLocalFilePath = "data/test.csv.gz"
localFilePath = "data/test.csv"
util.extract_gz(zipLocalFilePath, localFilePath)

tdf = pd.read_csv(localFilePath, dtype=object)
tdf['target_value'] = tdf['target_value'].astype('float')
tdf.head()

AutoGluon time series requires static features to be represented in numerical format. This can be achieved through applying LabelEncoder() on our static feature type, where we encode A=0, B=1, C=2, D=3 (see the following code). By default, AutoGluon infers the static feature to be either ordinal or categorical. You can also overwrite this by converting the static feature column to be the object/string data type for categorical features, or integer/float data type for ordinal features.

from sklearn.preprocessing import LabelEncoder

localItemMetaDataFilePath = "data/itemMetaData.csv"
imdf = pd.read_csv(localItemMetaDataFilePath, dtype=object)

# Encode the static feature numerically (A=0, B=1, C=2, D=3)
labelencoder = LabelEncoder()
imdf['type'] = labelencoder.fit_transform(imdf['type'])

# Split the item metadata into items with history and the cold start items
imdf_without_coldstart_item = imdf[imdf.item_id.isin(tdf.item_id.tolist())]
imdf_without_coldstart_item['type'] = imdf_without_coldstart_item['type'].astype(str)
imdf_without_coldstart_item.to_csv('data/itemMetaDatawithoutColdstart.csv', index=False)

imdf_with_coldstart_item = imdf[~imdf.item_id.isin(tdf.item_id.tolist())]
imdf_with_coldstart_item.to_csv('data/itemMetaDataOnlyColdstart.csv', index=False)

Set up and start AutoGluon model training

We need to specify save_path = ‘autogluon-coldstart-demo’ as the model artifact folder name (see the following code). We also set our eval_metric to mean absolute percentage error (‘MAPE’ for short) and define prediction_length as 24 hours. If not specified, AutoGluon by default produces probabilistic forecasts and scores them via the weighted quantile loss. We only look at the DeepAR model in our demo, because we know the DeepAR algorithm allows cold start forecasting by design. We set one of the DeepAR hyperparameters arbitrarily and pass it to the ForecastingPredictor().fit() call, which makes AutoGluon look only into the specified model. For a full list of tunable hyperparameters, refer to the gluonts.model.deepar package.

save_path = 'autogluon-coldstart-demo'
eval_metric = 'MAPE'
deepar_params = {
    "scaling": True
}

ag_predictor = ForecastingPredictor(path=save_path, eval_metric=eval_metric).fit(
    tdf,
    static_features=imdf_without_coldstart_item,
    prediction_length=24,  # how far into the future we wish to forecast
    index_column="item_id",
    target_column="target_value",
    time_column="timestamp",
    quantiles=[0.1, 0.5, 0.9],
    hyperparameters={"DeepAR": deepar_params},
)

The training takes 30–45 minutes. You can get the model summary by calling the following function:

ag_predictor.fit_summary()

Forecast on the cold start item

Now we're ready to generate forecasts for the cold start items. We recommend having at least five rows for each item_id. Therefore, for any item_id that has fewer than five observations, we fill in with NaNs. In our demo, both item_id 370 and 372 have zero observations, a pure cold start problem, whereas the other two have five target values.
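
A minimal sketch of this padding step follows, assuming the cold start frame uses the same item_id/timestamp/target_value columns as the training data; the fallback start timestamp for items with no observations is arbitrary.

import numpy as np
import pandas as pd

def pad_item(df, item_id, min_rows=5, freq='H'):
    # Add NaN target rows so that item_id has at least min_rows observations
    item = df[df.item_id == item_id]
    missing = min_rows - len(item)
    if missing <= 0:
        return df
    last = pd.to_datetime(item.timestamp).max() if not item.empty else pd.Timestamp('2014-12-31 23:00')
    pad = pd.DataFrame({
        'item_id': item_id,
        'timestamp': pd.date_range(last + pd.Timedelta(hours=1), periods=missing, freq=freq).astype(str),
        'target_value': np.nan,
    })
    return pd.concat([df, pad], ignore_index=True)

cstdf = pad_item(cstdf, '370')  # repeat for any other item_id with fewer than five rows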

Load in the cold start target time series dataset with the following code:

localColdStartDataFilePath = "data/coldStartTargetData.csv"
cstdf = pd.read_csv(localColdStartDataFilePath, dtype = object)
cstdf.head(20)

We feed the cold start target time series into our AutoGluon model, along with the item meta dataset for the cold start item_id:

cold_start_prediction = ag_predictor.predict(cstdf, static_features=imdf_with_coldstart_item)

Visualize the predictions

We can create a plotting function to generate a visualization of the cold start forecasts, as shown in the following graph.
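
A plotting helper could look like the following sketch; it assumes the prediction object can be indexed by item_id and yields a DataFrame with '0.1', '0.5', and '0.9' quantile columns, which may differ depending on your AutoGluon version.

import matplotlib.pyplot as plt

def plot_cold_start_forecast(prediction, item_id):
    # prediction[item_id] is assumed to be a DataFrame indexed by timestamp,
    # with one column per requested quantile
    forecast = prediction[item_id]
    plt.figure(figsize=(10, 4))
    plt.plot(forecast.index, forecast['0.5'], label='median forecast')
    plt.fill_between(forecast.index, forecast['0.1'], forecast['0.9'],
                     alpha=0.3, label='10%-90% interval')
    plt.title(f'Cold start forecast for item {item_id}')
    plt.xlabel('timestamp')
    plt.ylabel('target_value')
    plt.legend()
    plt.show()

plot_cold_start_forecast(cold_start_prediction, '370')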

Clean up

To optimize resource usage, consider stopping the runtime on Amazon SageMaker Studio Lab after you have fully explored the notebook.

Conclusion

In this post, we showed how to build a cold start forecasting engine using AutoGluon AutoML for time series data on Amazon SageMaker Studio Lab. For those wondering about the difference between Amazon Forecast and AutoGluon (time series): Amazon Forecast is a fully managed and supported service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience, whereas AutoGluon is a community-supported open-source project that incorporates the latest research contributions. We walked through an end-to-end example to demonstrate what AutoGluon for time series is capable of, and provided a dataset and use case.

AutoGluon for time series data is an open-source Python package, and we hope that this post, together with our code example, gives you a straightforward solution to tackle challenging cold start forecasting problems. You can access the entire example on our GitHub repo. Try it out, and let us know what you think!


About the Authors

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Jonas Mueller is a Senior Applied Scientist in the AI Research and Education group at AWS, where he develops new algorithms to improve deep learning and develop automated machine learning. Before joining AWS to democratize ML, he completed his PhD at the MIT Computer Science and Artificial Intelligence Lab. In his free time, he enjoys exploring mountains and the outdoors.

Wenming Ye is a Research Product Manager at AWS AI. He is passionate about helping researchers and enterprise customers rapidly scale their innovations through open-source and state-of-the-art machine learning technology. Wenming has diverse R&D experience from Microsoft Research, the SQL engineering team, and successful startups.


Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly

At the 2021 AWS re:Invent conference in Las Vegas, we demoed Read For Me at the AWS Builders Fair—a website that helps the visually impaired hear documents.


Adaptive technology and accessibility features are often expensive, if they’re available at all. Audio books help the visually impaired read. Audio description makes movies accessible. But what do you do when the content isn’t already digitized?

This post focuses on the AWS AI services Amazon Textract and Amazon Polly, which empower those with impaired vision. Read For Me was co-developed by Jack Marchetti, who is visually impaired.

Solution overview

Through an event-driven, serverless architecture and a combination of multiple AI services, we can create natural-sounding audio files in multiple languages from a picture of a document, or any image with text. For example, a letter from the IRS, a holiday card from family, or even the opening titles to a film.

The following reference architecture, published in the AWS Architecture Center, shows the workflow of a user taking a picture with their phone and playing an MP3 of the content found within that document.

The workflow includes the following steps:

  1. Static content (HTML, CSS, JavaScript) is hosted on AWS Amplify.
  2. Temporary access is granted for anonymous users to backend services via an Amazon Cognito identity pool.
  3. The image files are stored in Amazon Simple Storage Service (Amazon S3).
  4. A user makes a POST request through Amazon API Gateway to the audio service, which proxies to an Express AWS Step Functions workflow.
  5. The Step Functions workflow includes the following steps:
    1. Amazon Textract extracts text from the image.
    2. Amazon Comprehend detects the language of the text.
    3. If the target language differs from the detected language, Amazon Translate translates to the target language.
    4. Amazon Polly creates an audio file as output using the text.
  6. The AWS Step Functions workflow creates an audio file as output and stores it in Amazon S3 in MP3 format.
  7. A pre-signed URL with the location of the audio file stored in Amazon S3 is sent back to the user’s browser through API Gateway. The user’s mobile device plays the audio file using the pre-signed URL.

In the following sections, we discuss the reasons for why we chose the specific services, architecture pattern, and service features for this solution.

AWS AI services

Several AI services are wired together to power Read For Me:

  • Amazon Textract identifies the text in the uploaded picture.
  • Amazon Comprehend determines the language.
  • If the user chooses a different spoken language than the language in the picture, we translate it using Amazon Translate.
  • Amazon Polly creates the MP3 file. We take advantage of the Amazon Polly neural engine, which creates a more natural, lifelike audio recording.

One of the main benefits of using these AI services is the ease of adoption with little or no core machine learning experience required. The services expose APIs that clients can invoke using SDKs made available in multiple programming languages, such as Python and Java.

With Read For Me, we wrote the underlying AWS Lambda functions in Python.

AWS SDK for Python (Boto3)

The AWS SDK for Python (Boto3) makes interacting with AWS services simple. For example, the following lines of Python code return the text found in the image or document you provide:

import boto3
client = boto3.client('textract')
response = client.detect_document_text(
Document={
'S3Object': {
'Bucket': 'bucket-name',
'Name': 's3-key'
}
})
#do something with the response
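
Similarly, creating the MP3 with the Amazon Polly neural engine is a single call; the voice and text below are illustrative placeholders, not the exact values Read For Me uses.

import boto3

polly = boto3.client('polly')

response = polly.synthesize_speech(
    Engine='neural',        # the neural engine produces more natural, lifelike audio
    OutputFormat='mp3',
    VoiceId='Joanna',       # any neural voice supported for the target language
    Text='Hello from Read For Me!'
)

with open('output.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())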

All Python code is run within individual Lambda functions. There are no servers to provision and no infrastructure to maintain.

Architecture patterns

In this section, we discuss the different architecture patterns used in the solution.

Serverless

We implemented a serverless architecture for two main reasons: speed to build and cost. With no underlying hardware to maintain or infrastructure to deploy, we focused entirely on the business logic code and nothing else. This allowed us to get a functioning prototype up and running in a matter of days. If users aren’t actively uploading pictures and listening to recordings, nothing is running, and therefore nothing is incurring costs outside of storage. An S3 lifecycle management rule deletes uploaded images and MP3 files after 1 day, so storage costs are low.
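
Such a lifecycle rule can be defined once with Boto3 (or in the console); the bucket name here is a placeholder.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='read-for-me-uploads',  # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-uploads-after-one-day',
            'Filter': {'Prefix': ''},      # apply to every object in the bucket
            'Status': 'Enabled',
            'Expiration': {'Days': 1},     # delete uploaded images and MP3 files after 1 day
        }]
    }
)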

Synchronous workflow

When you’re building serverless workflows, it’s important to understand when a synchronous call makes more sense from the architecture and user experience than an asynchronous process. With Read For Me, we initially went down the asynchronous path and planned on using WebSockets to bi-directionally communicate with the front end. Our workflow would include a step to find the connection ID associated with the Step Functions workflow and upon completion, alert the front end. For more information about this process, refer to From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

We ultimately chose not to do this and used Express Step Functions workflows, which are synchronous. Users understand that processing an image won't be instant, but also know it won't take 30 seconds or a minute. We were in a space where a few seconds was satisfactory to the end user and didn't need the benefit of WebSockets. This simplified the workflow overall.

Express Step Functions workflow

The ability to break out your code into smaller, isolated functions allows for fine-grained control, easier maintenance, and the ability to scale more accurately. For instance, if we determined that the Lambda function that triggered Amazon Polly to create the audio file was running slower than the function that determined the language, we could vertically scale that function, adding more memory, without having to do so for the others. Similarly, you limit the blast radius of what your Lambda function can do or access when you limit its scope and reach.

One of the benefits of orchestrating your workflow with Step Functions is the ability to introduce decision flow logic without having to write any code.

Our Step Functions workflow isn’t complex. It’s linear until the translation step. If we don’t need to call a translation Lambda function, that’s less cost to us, and a faster experience for the user. We can use the visual designer on the Step Functions console to find the specific key in the input payload and, if it’s present, call one function over the other using JSONPath. For example, our payload includes a key called translate:

{
    "extracted_text": "hello world",
    "target_language": "es",
    "source_language": "en",
    "translate": true
}

Within the Step Functions visual designer, we find the translate key, and set up rules to match.

Headless architecture

Amplify hosts the front-end code. The front end is written in React and the source code is checked into AWS CodeCommit. Amplify solves a few problems for users trying to deploy and manage static websites. If you were doing this manually (using an S3 bucket set up for static website hosting and fronting that with Amazon CloudFront), you’d have to expire the cache yourself each time you did deployments. You’d also have to write up your own CI/CD pipeline. Amplify handles this for you.

This allows for a headless architecture, where front-end code is decoupled from the backend and each layer can be managed and scaled independently of the other.

Analyze ID

In the preceding section, we discussed the architecture patterns for processing the uploaded picture and creating an MP3 file from it. Having a document read back to you is a great first step, but what if you only want to know something specific without having the whole thing read back to you? For instance, you need to fill out a form online and provide your state ID or passport number, or perhaps its expiration date. You then have to take a picture of your ID and, while having it read back to you, wait for that specific part. Alternatively, you could use Analyze ID.

Analyze ID is a feature of Amazon Textract that enables you to query documents. Read For Me contains a drop-down menu where you can specifically ask for the expiration date, date of issue, or document number. You can use the same workflow to create an MP3 file that provides an answer to your specific question.
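
The following sketch shows what a call to the Amazon Textract AnalyzeID API might look like with Boto3; the bucket, key, and the exact response fields you pick out are assumptions for illustration.

import boto3

textract = boto3.client('textract')

response = textract.analyze_id(
    DocumentPages=[{
        'S3Object': {'Bucket': 'bucket-name', 'Name': 'drivers-license.jpg'}
    }]
)

# Pull out a single field, such as the expiration date, to feed into the audio workflow
for document in response['IdentityDocuments']:
    for field in document['IdentityDocumentFields']:
        if field['Type']['Text'] == 'EXPIRATION_DATE':
            print(field['ValueDetection']['Text'])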

You can demo the Analyze ID feature at readforme.io/analyze.

Additional Polly Features

  • Read For Me offers multiple neural voices utilizing different languages and dialects. Note that there are several other voices you can choose from, which we did not implement. When a new voice is available, an update to the front-end code and a Lambda function is all it takes to take advantage of it.
  • The Amazon Polly service also offers other options that we have yet to include in Read For Me, such as adjusting the speed of the voices and speech marks.

Conclusion

In this post, we discussed how to use numerous AWS services, including AI and serverless, to aid the visually impaired. You can learn more about the Read For Me project and use it by visiting readforme.io. You can also find Amazon Textract examples on the GitHub repo. To learn more about Analyze ID, check out Announcing support for extracting data from identity documents using Amazon Textract.

The source code for this project will be open-sourced and added to AWS’s public GitHub soon.


About the Authors

Jack Marchetti is a Senior Solutions architect at AWS. With a background in software engineering, Jack is primarily focused on helping customers implement serverless, event-driven architectures. He built his first distributed, cloud-based application in 2013 after attending the second AWS re:Invent conference and has been hooked ever since. Prior to AWS Jack spent the bulk of his career in the ad agency space building experiences for some of the largest brands in the world. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He also is a screenwriter, and director with a primary focus on Christmas movies and horror. View Jack’s filmography at his IMDb page.

Alak Eswaradass is a Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures utilizing AWS services to solve business challenges. She has a Master’s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud native services and machine learning. Outside of work, Swagat enjoys travel, reading and meditating.


Bundesliga Match Fact Set Piece Threat: Evaluating team performance in set pieces on AWS

The importance of set pieces in football (or soccer in the US) has been on the rise in recent years: now more than one quarter of all goals are scored via set pieces. Free kicks and corners generally create the most promising situations, and some professional teams have even hired specific coaches for those parts of the game.

In this post, we share how the Bundesliga Match Fact Set Piece Threat helps evaluate performance in set pieces. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat will help the viewer understand how well teams are leveraging these situations. In addition, this post explains how AWS services can be used to compute these statistics in real time.

Bundesliga's Union Berlin is a great example of the relevance of set pieces. The team managed to rise from Bundesliga 2 to qualification for a European competition in just 2 years. They finished third in Bundesliga 2 during the 18/19 season, earning themselves a slot in the relegation playoffs to the Bundesliga. In that season, they scored 28 goals from open play, ranking just ninth in the league. However, they ranked second for goals scored through set pieces (16 goals).

Tellingly, in the first relegation playoff match against VfB Stuttgart, Union secured a 2:2 draw, scoring a header after a corner. And in the return match, Stuttgart was disallowed a free kick goal due to a passive offside, allowing Union to enter the Bundesliga with a 0:0 draw.

The relevance of set pieces for Union's success doesn't end there. Union finished their first two Bundesliga seasons in a strong eleventh and seventh place, ranking third and first in the number of set piece goals (scoring 15 goals from set pieces in both seasons). For comparison, FC Bayern München—the league champion—only managed to score 10 goals from set pieces in both seasons. The success that Union Berlin has had with their set pieces allowed them to secure seventh place in the 20/21 Bundesliga season, which meant qualification for the UEFA Europa Conference League, going from Bundesliga 2 to Europe just 2 years after having earned promotion. Unsurprisingly, in the deciding match, they scored one of their two goals after a corner. At the time of this writing, Union Berlin ranks fourth in the Bundesliga (matchday 20) and first in corner performance, a statistic we explain later.

Union Berlin’s path to Europe clearly demonstrates the influential role of offensive and defensive performance during set pieces. Until now however, it was difficult for fans and broadcasters to properly quantify this performance, unless they wanted to dissect massive tables on analytics websites. Bundesliga and AWS have worked together to illustrate the threat that a team produces and the threat that is produced by set pieces against the team, and came up with the new Bundesliga Match Fact: Set Piece Threat.

How does Set Piece Threat work?

To determine the threat a team poses with their set pieces, we take into account different facets of their set piece performance. It’s important to note that we only consider corners and free kicks as set pieces, and compute the threat for each category independently.

Facet 1: Outcome of a set piece: Goals, shots, or nothing

First, we consider the outcome of a set piece. That is, we observe if it results in a goal. However, the outcome is generally influenced by fine margins, such as a great save by the goalkeeper or a shot brushing the post instead of going in, so we also categorize the quality of a shot that results from the set piece. Shots are grouped into the following categories.

Category | Explanation
Goal | A successful shot that led to a goal
Outstanding | Shots that almost led to a goal, such as a shot at the post
Decent | Other noteworthy goal scenes
Average | The rest of the chances that would be included in a chances ratio, with relevant threat of a goal
None | No real goal threat; should not be considered a real chance, such as a header that barely touched the ball or a blocked shot
No shot | No shots taken at all

The above video shows examples of shot outcome categories in the following order: outstanding, decent, average, none.

Facet 2: Potential of a shot

Second, our algorithm considers the potential of a shot. This incorporates how likely it should have resulted in a goal, taking the actual performance of the shot-taker out of the equation. In other words, we quantify the goal potential of the situation in which the shot was taken. This is captured by the expected goal (xGoals) value of the shot. We remove not only the occurrence of luck or lack thereof, but also the quality of the strike or header.

Facet 3: Quantity of set pieces

Next, we consider the aspect of pure quantity of set pieces that a team gets. Our definition of Set Piece Threat measures the threat on a per-set-piece basis. Instead of summing up all outcomes and xGoal values of a team over the course of a season, the values are aggregated such that they represent the average threat per set piece. That way, the corner threat, for example, represents the team's danger for each corner and doesn't consider a team more dangerous simply because they have more corners than other teams (and therefore potentially more shots or goals).

Facet 4: Development over time

The last aspect to consider is the development of a team’s threat over time. Consider for example a team that scored three goals from corners in the first three matchdays but fails to deliver any considerable threat over the next 15 matchdays. This team should not be considered to pose a significant threat from corners on matchday 19, despite it already having scored three times, which may still be a good return. We account for this (positive or negative) development of a team’s set piece quality by assigning a discount to each set piece, depending on how long ago it occurred. In other words, a free kick that was taken 10 matchdays ago has less influence on the computed threat than one that was taken during the last or even current game.

Score: Per set piece aggregation

All four facets we’ve described are aggregated into two values for each team, one for corners and one for free kicks, which describe the danger that a corresponding set piece by that team would currently pose. The value is defined as the weighted average of the scores of each set piece, where the score of a set piece is defined as (0.7 * shot-outcome + 0.3 * xG-value) if the set piece resulted in a shot and 0 otherwise. The shot-outcome is 1 if the team scored and lower for other outcomes, such as a shot that went wide, depending on its quality. The weight for each set piece is determined by how long ago it was taken, as described earlier. Overall, the values are defined between 0–1, where 1 is the perfect score.

Set piece threat

Next, the values for each team are compared to the league average. The exact formula is score(team)/avg_score(league) - 1. This value is what we call the Set Piece Threat value. A team has a threat value of 0 if it’s exactly as good as the league average. A value of -1 (or -100%) describes a team that poses no threat at all, and a value of +1 (+100%) describes a team that is twice as dangerous as the league average. With those values, we compute a ranking that orders the teams from 1–18 according to their offensive threat of corners and free kicks, respectively.

We use the same data and similar calculations to also compute a defensive threat that measures the defensive performance of a team with regard to how they defend set pieces. Now, instead of computing a score per own set piece, the algorithm computes a score per opponent set piece. Just like for the offensive threat, the score is compared to the league average, but the value is reversed: -score(team)/avg_score(league) + 1. This way, a threat of +1 (+100%) is achieved if a team allows opponents no shots at all, whereas a team with a defensive threat of -1 (-100%) is twice as susceptible to opponents' set pieces as the league average. Again, a team with a threat of 0 is as good as the league average.
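
Putting the pieces together, the scoring and normalization described above can be expressed in a few lines of Python; the recency weights in this sketch are purely illustrative, because the exact decay used by the Match Fact is not published.

def set_piece_score(shot_outcome, xg, had_shot):
    # Score of a single set piece: 0.7 * shot outcome + 0.3 * xGoals value, 0 if no shot resulted
    return 0.7 * shot_outcome + 0.3 * xg if had_shot else 0.0

def team_score(set_pieces, decay=0.95):
    # Weighted average over a team's set pieces; older set pieces get a smaller weight
    weights = [decay ** sp['matchdays_ago'] for sp in set_pieces]
    scores = [set_piece_score(sp['shot_outcome'], sp['xg'], sp['had_shot']) for sp in set_pieces]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def offensive_threat(team_avg, league_avg):
    return team_avg / league_avg - 1          # 0 = league average, +1 = twice as dangerous

def defensive_threat(conceded_avg, league_avg_conceded):
    return 1 - conceded_avg / league_avg_conceded   # +1 = opponents get no shots at all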

Set Piece Threat findings

An important aspect of Set Piece Threat is that we focus on an estimation of threat instead of goals scored and conceded via set pieces. If we take SC Freiburg and Union Berlin at matchday 21 as an example, over the course of this season Freiburg has scored seven goals via corners, compared to four from Union Berlin. Our threat ranking still rates both teams as roughly equal. In fact, we predict a corner by Freiburg (Rank 3) to even be 7% less threatening than a corner by Union Berlin (Rank 1). The main reason for this is that Union Berlin created a similar number of great chances out of their corners, but failed to convert these chances into goals. Freiburg on the other hand was vastly more efficient with their chances. Such a discrepancy between chance quality and actual goals can happen in a high-variance sport like football.

The following graph shows Union Berlin’s set piece offensive corner ranking (blue) and score (red) from matchdays 6–21. At matchday 12, Union scored a goal from a corner and additionally had a great chance from a second corner that didn’t result in a goal but was perceived as a high threat by our algorithm. In addition, Union had a shot on target in five of seven corner kicks on matchday 12. Union immediately jumped in the ranking from twelfth to fifth place as a result of this, and the score value for Union increased as well as the league average. As Union saw more and more high threat chances in the later matchdays from corners, they step by step claimed first place of the corner threat ranking. The score is always relative to the current league average, meaning that Union’s threat at matchday 21 is 50% higher from corners than the average threat coming from all teams in the league.

Implementation and architecture

Bundesliga Match Facts are independently running AWS Fargate containers inside Amazon Elastic Container Service (Amazon ECS). Previous Bundesliga Match Facts consume raw event and positional data to calculate advanced statistics. This changes with the release of Set Piece Threat, which analyzes data produced by an existing Bundesliga Match Fact (xGoals) to calculate its rankings. Therefore, we created an architecture to exchange messages between different Bundesliga Match Facts during live matches in real time.

To guarantee the latest data is reflected in the set piece threat calculations, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK). This message broker service allows different Bundesliga Match Facts to send and receive the newest events and updates in real time. By consuming a match and Bundesliga Match Fact-specific topic from Kafka, we can receive the most up-to-date data from all systems involved while retaining the ability to replay and reprocess messages sent earlier.

The following diagram illustrates the solution architecture:

We introduced Amazon MSK to this project to generally replace all internal message passing for the Bundesliga Match Facts platform. It handles the injection of positional and event data, which can aggregate to over 3.6 million data points per match. With Amazon MSK, we can use the underlying persistent storage of messages, which allows us to replay games from any point in time. However, for Set Piece Threat, the focus lies on the specific use case of passing events produced by Bundesliga Match Facts to other Bundesliga Match Facts that are running in parallel.

To facilitate this, we distinguish between two types of Kafka topics: global and match-specific. First, each Bundesliga Match Fact has its own global topic, which handles all messages created by that Bundesliga Match Fact. In addition, there is a match-specific topic for each Bundesliga Match Fact and each match, which handles all messages created by that Bundesliga Match Fact for a specific match. When multiple live matches run in parallel, each message is first produced and sent to the Bundesliga Match Fact-specific global topic.

A dispatcher AWS Lambda function is subscribed to every Bundesliga Match Fact-specific global topic and has two tasks:

  1. Write the incoming data to a database provisioned through Amazon Relational Database Service (Amazon RDS).
  2. Redistribute the messages that can be consumed by other Bundesliga Match Facts to a Bundesliga Match Fact-specific topic.
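
A dispatcher handler along these lines is sketched below, assuming kafka-python for producing and an MSK event source mapping delivering the consumed records; authentication, error handling, the topic naming scheme, and the actual database writes are assumptions or omitted.

import base64
import json
from kafka import KafkaProducer  # kafka-python, packaged with the function

producer = KafkaProducer(
    bootstrap_servers=['broker-1:9092', 'broker-2:9092'],  # placeholder MSK brokers
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def handler(event, context):
    # The MSK event source mapping delivers batches of records per topic-partition
    for records in event['records'].values():
        for record in records:
            message = json.loads(base64.b64decode(record['value']))

            # Task 1: persist the Match Fact output to the Amazon RDS database (omitted here)
            # save_to_rds(message)

            # Task 2: redistribute the message to the match-specific topic
            match_topic = f"{message['match_fact']}-{message['match_id']}"  # hypothetical naming scheme
            producer.send(match_topic, message)
    producer.flush()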

The left side of the architecture diagram shows the different Bundesliga Match Facts running independently from each other for every match and producing messages to the global topic. The new Set Piece Threat Bundesliga Match Fact now can consume the latest xGoal values for each shot for a specific match (right side of the diagram) to immediately compute the threat produced by the set piece that resulted in one or more shots.

Summary

We’re excited about the launch of Set Piece Threat and the patterns commentators and fans will uncover using this brand-new insight. As teams look to capitalize more and more on these dead ball situations, Set Piece Threat will help the viewer understand which team is doing this successfully and which team still has some ground to cover, which adds additional suspense before each of these set piece situations. The new Bundesliga Match Fact is available to Bundesliga’s broadcasters to uncover new perspectives and stories of a match, and team rankings can be viewed at any time in the Bundesliga app.

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals and won 26 caps for Germany. Currently Rolfes serves as Sporting Director at Bayer 04 Leverkusen where he oversees and develops the pro player roster, the scouting department and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Jan Bauer is a Cloud Application Architect at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing. He works with clients across industries to help them be successful on their cloud journey.

Pascal Kühner is a Cloud Application Developer in the AWS Professional Services Team. He works with customers across industries to help them achieve their business outcomes via application development, DevOps, and infrastructure. He loves ball sports and in his spare time likes to play basketball and football.

Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music and AI in his spare time.


Bundesliga Match Fact Skill: Quantifying football player qualities using machine learning on AWS

In football, as in many sports, discussions about individual players have always been part of the fun. “Who is the best scorer?” or “Who is the king of defenders?” are questions perennially debated by fans, and social media amplifies this debate. Just consider that Erling Haaland, Robert Lewandowski, and Thomas Müller alone have a combined 50 million followers on Instagram. Many fans are aware of the incredible statistics star players like Lewandowski and Haaland create, but stories like this are just the tip of the iceberg.

Consider that almost 600 players are under contract in the Bundesliga, and each team has their own champions—players that get introduced to bring a specific skill to bear in a match. Look for example at Michael Gregoritsch of FC Augsburg. As of this writing (matchday 21), he has scored five goals in the 21/22 season, not something that would make anybody mention him in a conversation about the great goal scorers. But let’s look closer: if you accumulate the expected goal (xGoals) values of all scoring chances Gregoritsch had this season, the figure you get is 1.7. This means he over-performed on his shots on goal by +194%, scoring 3.2 more goals than expected. In comparison, Lewandowski over-performed by only 1.6 goals (+7%). What a feat! Clearly Gregoritsch brings a special skill to Augsburg.

So how do we shed light on all the hidden stories about individual Bundesliga players, their skills, and impact on match outcomes? Enter the new Bundesliga Match Fact powered by AWS called Skill. Skill has been developed through in-depth analysis by the DFL and AWS to identify players with skills in four specific categories: initiator, finisher, ball winner, and sprinter. This post provides a deep dive into these four skills and discusses how they are implemented on AWS infrastructure.

Another interesting point is that until now, Bundesliga Match Facts have been developed independently of one another. Skill is the first Bundesliga Match Fact that combines the output of multiple Bundesliga Match Facts in real time using a streaming architecture built on Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Initiator

An initiator is a player who performs a high number of valuable first and second assists. To identify and quantify the value of those assists, we introduced the new metric xAssist. It’s calculated by tracking the last and second-last pass before a shot at goal, and assigning the respective xGoals value to those actions. A good initiator creates opportunities under challenging circumstances by successfully completing passes with a rate of high difficulty. To evaluate how hard it is to complete a given pass, we use our existing xPass model. In this metric, we purposely exclude crosses and free kicks to focus on players who generate scoring chances with their precise assists from open play.
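
Conceptually, the xAssist and xSecondAssist bookkeeping could look like the following sketch, where each shot event is assumed to carry its xGoals value and references to the last and second-last open-play passes that preceded it; the field names are hypothetical.

from collections import defaultdict

xassist = defaultdict(float)
xsecond_assist = defaultdict(float)

def credit_shot(shot):
    # shot is a hypothetical dict: {'xg': 0.31, 'last_pass': {...}, 'second_last_pass': {...}}
    last_pass = shot.get('last_pass')
    second_last_pass = shot.get('second_last_pass')

    # Crosses and free kicks are excluded on purpose (open play only)
    if last_pass and not last_pass['is_cross'] and not last_pass['is_free_kick']:
        xassist[last_pass['player']] += shot['xg']
    if second_last_pass and not second_last_pass['is_cross'] and not second_last_pass['is_free_kick']:
        xsecond_assist[second_last_pass['player']] += shot['xg']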

The skill score is calculated with the following formula:

Let’s look at the current Rank 1 initiator, Thomas Müller, as an example. He has collected an xAssist value of 9.23 as of this writing (matchday 21), meaning that his passes for the next players who shot at the goal have generated a total xGoal value of 9.23. The xAssist per 90 minutes ratio is 0.46. This can be calculated from his total playing time of the current season, which is remarkable—over 1,804 minutes of playing time. As a second assist, he generated a total value of 3.80, which translates in 0.19 second assists per 90 minutes. In total, 38 of his 58 first assists were difficult passes. And as a second assist, 11 of his 28 passes were also difficult passes. With these statistics, Thomas Müller has catapulted himself into first place in the initiator ranking. For comparison, the following table presents the values of the current top three.

Player | xAssist | xAssist per 90 | xSecondAssist | xSecondAssist per 90 | Difficult passes assisted (first) | Difficult passes assisted (second) | Final score
Thomas Müller – Rank 1 | 9.23 | 0.46 | 3.80 | 0.18 | 38 | 11 | 0.948
Serge Gnabry – Rank 2 | 3.94 | 0.25 | 2.54 | 0.16 | 15 | 11 | 0.516
Florian Wirtz – Rank 3 | 6.41 | 0.37 | 2.45 | 0.14 | 21 | 1 | 0.510

Finisher

A finisher is a player who is exceptionally good at scoring goals. He has a high shot efficiency and scores many goals relative to his playing time. The skill is based on actual goals scored and their difference to expected goals (xGoals). This allows us to evaluate whether chances are being well exploited. Let's assume that two strikers have the same number of goals. Are they equally strong? Or does one of them score from easy circumstances while the other one finishes in challenging situations? With shot efficiency, this can be answered: if the goals scored exceed the number of xGoals, a player is over-performing and is a more efficient shooter than average. Through the magnitude of this difference, we can quantify the extent to which a shooter's efficiency beats the average.

The skill score is calculated with the following formula:

For the finisher, we focus more on goals. The following table gives a closer look at the current top three.

Player | Goals | Goals per 90 | Shot efficiency | Final score
Robert Lewandowski – Rank 1 | 24 | 1.14 | 1.55 | 0.813
Erling Haaland – Rank 2 | 16 | 1.18 | 5.32 | 0.811
Patrik Schick – Rank 3 | 18 | 1.10 | 4.27 | 0.802

Robert Lewandowski has scored 24 goals this season, which puts him in first place. Although Haaland has a higher shot efficiency, it's still not enough for him to be ranked first, because we give a higher weighting to goals scored. This indicates that Lewandowski profits highly from both the quality and quantity of received assists, even though he scores exceptionally well. Patrik Schick has scored two more goals than Haaland, but has a lower goals per 90 minutes rate and a lower shot efficiency.

Sprinter

The sprinter has the physical ability to reach high top speeds, and does so more often than others. For this purpose, we evaluate average top speeds across all games of a player's current season and include the frequency of sprints per 90 minutes, among other metrics. A sprint is counted if a player runs at a minimum pace of 4.0 m/s for more than two seconds, and reaches a peak velocity of at least 6.3 m/s during this time. The duration of the sprint is characterized by the time between the first and last time the 6.3 m/s threshold is reached, and needs to be at least 1 second long to be acknowledged. A new sprint can only be considered to have occurred after the pace has fallen below the 4.0 m/s threshold again.
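
The sprint definition above translates naturally into a small routine over a player's speed trace; the sketch below assumes a list of (timestamp, speed) pairs in seconds and m/s, which is an illustrative data layout rather than the actual tracking format.

def count_sprints(samples, run_threshold=4.0, sprint_threshold=6.3):
    # samples: list of (timestamp_seconds, speed_m_per_s), ordered in time
    sprints = 0
    run = []  # current stretch where the player stays at or above 4.0 m/s
    for t, v in samples + [(float('inf'), 0.0)]:  # sentinel to flush the final run
        if v >= run_threshold:
            run.append((t, v))
            continue
        if run:
            fast = [ts for ts, speed in run if speed >= sprint_threshold]
            run_duration = run[-1][0] - run[0][0]
            # Count a sprint if the run lasts more than 2 s, the player reached 6.3 m/s,
            # and the time between the first and last 6.3 m/s reading is at least 1 s
            if run_duration > 2.0 and fast and (fast[-1] - fast[0]) >= 1.0:
                sprints += 1
            run = []  # a new sprint requires the pace to drop below 4.0 m/s first
    return sprints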

The skill score is calculated with the following formula:

The formula allows us to evaluate the many ways we can look at sprints by players, and go further than just looking at the top speeds these players produce. For example, Jeremiah St. Juste has the current season record of 36.65 km/h. However, if we look at the frequency of his sprints, we find he only sprints nine times on average per match! Alphonso Davies on the other hand might not be as fast as St. Juste (top speed 36.08 km/h), but performs a staggering 31 sprints per match! He sprints much more frequently, opening up space for his team on the pitch.

Ball winner

A player with this ability causes ball losses for the opposing team, both in total and relative to his playing time. He wins a high number of ground and aerial duels, and he steals or intercepts the ball often, establishing safe ball control himself and creating a possibility for his team to counterattack.

The skill score is calculated with the following formula:

As of this writing, the first-place ball winner is Danilo Soares. He has contested a total of 235 defensive duels and won 75 of them, defeating opponents in a face-off, for a win rate of about 32%. He has also intercepted 51 balls this season in his playing position as a defensive back, an average of 2.4 interceptions per 90 minutes.

Skill example

The Skill Bundesliga Match Fact enables us to unveil abilities and strengths of Bundesliga players. The Skill rankings put players in the spotlight that might have gone unnoticed before in rankings of conventional statistics like goals. For example, take a player like Michael Gregoritsch. Gregoritsch is a striker for FC Augsburg who placed sixth in the finisher ranking as of matchday 21. He has scored five goals so far, which wouldn’t put him at the top of any goal scoring ranking. However, he managed to do this in only 663 minutes played! One of these goals was the late equalizer in the 97th minute that helped Augsburg to avoid the away loss in Berlin.

Through the Skill Bundesliga Match Fact, we can also recognize various qualities of each player. One example of this is the Dortmund star Erling Haaland, who has also earned the badge of sprinter and finisher, and is currently placed sixth amongst Bundesliga sprinters.

All of these metrics are based on player movement data, goal-related data, ball action-related data, and pass-related data. We process this information in data pipelines and extract the necessary relevant statistics per skill, allowing us to calculate the development of all metrics in real time. Many of the aforementioned statistics are normalized by time on the pitch, allowing for the consideration of players who have little playing time but perform amazingly well when they play. The combinations and weights of the metrics are combined into a single score. The result is a ranking for all players on the four player skills. Players ranking in the top 10 receive a skill badge to help fans quickly identify the exceptional qualities they bring to their squads.

Implementation and architecture

Bundesliga Match Facts developed up to this point have been independent from one another, relying only on the ingestion of positional and event data and on their own calculations. This changes with the new Bundesliga Match Fact Skill, which calculates skill rankings based on data produced by existing Match Facts, such as xGoals or xPass. The outcome of a single event, perhaps an incredible goal with a low chance of going in, can have a significant impact on the finisher skill ranking. Therefore, we built an architecture that always provides the most up-to-date skill rankings whenever there is an update to the underlying data. To achieve real-time updates to the skills, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a data streaming and messaging solution. This way, different Bundesliga Match Facts can communicate the latest events and updates in real time.

The underlying architecture for Skill consists of four main parts:

  • An Amazon Aurora Serverless cluster stores all outputs of existing match facts. This includes, for example, data for each pass (such as xPass, player, intended receiver) or shot (xGoal, player, goal) that has happened since the introduction of Bundesliga Match Facts.
  • A central AWS Lambda function writes the Bundesliga Match Fact outputs into the Aurora database and notifies other components that there has been an update.
  • A Lambda function for each individual skill computes the skill ranking. These functions run whenever new data is available for the calculation of the specific skill.
  • An Amazon MSK Kafka cluster serves as a central point of communication between all these components.

The following diagram illustrates this workflow. Each Bundesliga Match Fact immediately sends an event message to Kafka whenever there is an update to an event (such as an updated xGoals value for a shot event). The central dispatcher Lambda function is automatically triggered whenever a Bundesliga Match Fact sends such a message, and writes this data to the database. It then sends another message containing the new data back to Kafka, which serves as a trigger for the individual skill calculation functions. These functions use data from this trigger event, as well as the underlying Aurora cluster, to calculate and publish the newest skill rankings. For a more in-depth look at the use of Amazon MSK within this project, refer to the Set Piece Threat blog post.
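To sketch what the central dispatcher could look like (topic names, helper modules, and payload shapes below are assumptions, not the production code), a Lambda function triggered by Amazon MSK might process updates as follows.

    # Simplified dispatcher sketch: consume Match Fact updates from the MSK trigger,
    # persist them to Aurora, and re-publish a notification for the skill functions.
    # The helper imports and topic name are hypothetical.
    import base64
    import json

    from app.db import write_match_fact_output     # hypothetical Aurora writer
    from app.kafka import publish                   # hypothetical Kafka producer

    SKILL_UPDATE_TOPIC = "skill-updates"             # assumed topic name

    def handler(event, context):
        # Lambda delivers MSK records grouped by topic-partition, base64-encoded.
        for records in event["records"].values():
            for record in records:
                update = json.loads(base64.b64decode(record["value"]))
                write_match_fact_output(update)      # keep the Aurora cluster current
                publish(SKILL_UPDATE_TOPIC, update)  # trigger the per-skill functions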

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Skill makes it possible to objectively compare Bundesliga players on four core player dimensions, building on and combining formerly independent Bundesliga Match Facts in real time. This allows commentators and fans alike to uncover previously unnoticed player abilities and shed light on the roles that various Bundesliga players fulfill.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists to distill and categorize football player qualities based on objective performance data. Player skill badges are shown in the lineup and on player detail pages in the Bundesliga app. In the broadcast, player skills are provided to commentators through the data story finder and visually shown to fans at player substitution and when a player moves up into the respective top 10 ranking.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals and won 26 caps for Germany. Currently Rolfes serves as Sporting Director at Bayer 04 Leverkusen where he oversees and develops the pro player roster, the scouting department and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS

Luuk Figdor is a Senior Sports Technology Specialist in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Pascal Kühner is a Cloud Application Developer in the AWS Professional Services Team. He works with customers across industries to help them achieve their business outcomes via application development, DevOps, and infrastructure. He is very passionate about sports and enjoys playing basketball and football in his spare time.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. Based in Hamburg, he supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Jakub Michalczyk is a Data Scientist at Sportec Solutions AG. Several years ago, he chose math studies over playing football, as he came to the conclusion, he was not good enough at the latter. Now he combines both these passions in his professional career by applying machine learning methods to gain a better insight into this beautiful game. In his spare time, he still enjoys playing seven-a-side football, watching crime movies, and listening to film music.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.

Read More

Announcing the AWS DeepRacer League 2022

Unleash the power of machine learning (ML) through hands-on learning and compete for prizes and glory. The AWS DeepRacer League is the world’s first global autonomous racing competition driven by reinforcement learning, bringing together students, professionals, and enthusiasts from almost every continent.

I’m Tomasz Ptak, a senior software engineer at Duco, an AWS Machine Learning Hero, an AWS DeepRacer competitor (named Breadcentric), a hobbyist baker, and a leader of the AWS Machine Learning Community on Slack, where we learn, race, and help each other start and grow our adventures in the cloud. It’s my pleasure to unveil the exciting details of the upcoming 2022 AWS DeepRacer League season.

What is AWS DeepRacer?

AWS DeepRacer is a 1/18th scale autonomous race car—but also much more. It’s a complete program that has helped over 175,000 individuals from over 700 businesses, educational institutions, and organizations begin their educational journey into machine learning through fun and rivalry.

AWS DeepRacer League 2022

Over 100,000 developers took part in the 2021 season of the AWS DeepRacer League. They got hands-on with reinforcement learning by participating in a workshop or training a model to achieve the fastest lap on the eight monthly global leaderboards. The monthly winners from these races joined us for the fully virtual 2021 Championship during the month of November, with the head-to-head live finale taking place on November 23, 2021.

Due to restrictions on global travel in 2021, the entire AWS DeepRacer League was hosted within the Virtual Circuit. Although it made for entertaining livestream events that brought racers together each month from around the globe, it’s my opinion that nothing compares to the excitement of bringing the community together to race in person in the same room. That’s why I’m so thrilled that in 2022, the AWS DeepRacer League Summit Circuit will be back with 17 stops across five continents. We will learn the exact number of races after the AWS Summits dates are announced.

New in 2022, the AWS Summit Circuit structure is fully revamped, adding new regional playoffs this fall on the AWS Twitch channel. The top two competitors at each AWS Summit event will advance to the regional playoff round, where they will compete with other developers from their region for a chance to win a trip to Las Vegas to compete in the Championship Cup at AWS re:Invent 2022. In total, 15 participants from the AWS Summit Circuit regional playoffs will advance to the AWS DeepRacer League Championships. More on that later in this post.

That’s not all: anyone can participate in the AWS DeepRacer League Virtual Circuit races, the first of which starts today, March 1, 2022. The Virtual Circuit consists of 8 months of racing in two divisions: Open and Pro. All racers start in the Open Division (except for the top three champions from 2021), competing in time trial races. At the end of each month, the top 10% of racers in the Open Division advance to the Pro Division and win a welcome kit including swag, stickers, and a “Pro Driver’s License” certificate. Once in the Pro Division, racers take part in car-to-bot racing to qualify for the Pro Finale, an entertaining live virtual race where you can tune in to watch, hear from the pros as they compete, and get exclusive offers and prizes. The top three each month advance to the AWS DeepRacer League Championships. Each Pro Finale race is recapped in an AWS DeepRacer community blog post and streamed live on Twitch.

The 10 best racers each month win an AWS DeepRacer car or cash equivalent in countries where the car is unavailable for shipping. Also, every month participants receive new digital rewards such as a new car shell revealed on the AWS DeepRacer console.

In March, Open Division racers will take on the Rogue Circuit, named in honor of the 2021 DeepRacer Championship Cup winner Sairam Naragoni (a.k.a. JPMC-Rogue-Hyderabad). The Rogue Circuit is a moderately short loop (48.06 m) featuring a classic kidney-style shape reminiscent of the AWS re:Invent 2019 track. Meanwhile, the Pros will attempt to navigate the more challenging Rogue Raceway this month, based on a brand-new real-world track near Sairam’s hometown of Hyderabad, India.

Also new for 2022 is AWS DeepRacer Student, a league dedicated to students. Over the last two seasons, participation by students from institutions such as Canberra Grammar School in Australia, NYCU in Taiwan, and the Hong Kong Institute of Vocational Education (IVE) has catapulted students to the top of the global league. This year, in addition to being able to race in the Open and Pro divisions, students get a league of their own with AWS DeepRacer Student. Students can access 10 hours of free training and compete each month for a chance to win prizes, scholarships, and even place as one of three wildcards in the global championships.

The 15 AWS Summit Circuit, 24 Virtual Circuit, and 3 student racers will be joined by the 2021 AWS DeepRacer League Champion Sairam Naragoni (racing as JPMC-Rogue-Hyderabad), 2021 AWS DeepRacer re:Invent Live Stream Champion Eric Morrison, and six enterprise wildcards to form a group of 50 finalists.

For the first time since 2019, the qualification will include a trip to re:Invent in Las Vegas, where racers can compete for the title of 2022 AWS DeepRacer League Champion, which comes with bragging rights, prizes, and glory.

So join the competition virtually now in either the Student League or the Open Division, join the AWS DeepRacer community, and plan a trip to a nearby AWS Summit to start your adventure. You can also learn more about how to create AWS DeepRacer events for your organization on the AWS DeepRacer enterprise events page. I hope to see you on the track and in the community.


About the Author

Tomasz Ptak is a senior software engineer at Duco, an AWS Machine Learning Hero, an AWS DeepRacer competitor (named Breadcentric), a hobbyist baker, and a leader of the AWS Machine Learning Community on Slack, where we learn, race, and help each other start and grow our adventures in the cloud.

Read More

Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker

The last few years have seen rapid development in the field of natural language processing (NLP). While hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly run into issues scaling their large language models across multiple GPUs.

In this blog post, we briefly summarize the rise of large- and small-scale NLP models, primarily through the abstraction provided by Hugging Face and the modular backend of Amazon SageMaker. In particular, we highlight the launch of four additional features within the SageMaker model parallel library that unlock 175-billion-parameter NLP model pretraining and fine-tuning for customers.

We used this library on the SageMaker training platform and achieved a throughput of 32 samples per second on 120 ml.p4d.24xlarge instances with a 175-billion-parameter model. We anticipate that if we scaled up to 240 instances, the full model would take 25 days to train.

For more information about model parallelism, see the paper Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training.

You can also see the GPT2 notebook we used to generate these performance numbers on our GitHub repository.

To learn more about how to use the new features within SageMaker model parallel, refer to Extended Features of the SageMaker Model Parallel Library for PyTorch, and Use with the SageMaker Python SDK.

NLP on Amazon SageMaker – Hugging Face and model parallelism

If you’re new to Hugging Face and NLP, the biggest highlight you need to know is that applications using natural language processing (NLP) are starting to achieve human-level performance. This is largely driven by a learning mechanism called attention, which gave rise to a deep learning model called the transformer that is much more scalable than previous sequential deep learning methods. The now-famous BERT model was developed to capitalize on the transformer, and developed several useful NLP tactics along the way. Transformers, and the suite of models both within and outside of NLP that BERT inspired, are the primary engine behind your Google search results, your Google Translate results, and a host of new startups.

SageMaker and Hugging Face partnered to make this easier for customers than ever before. We’ve launched Hugging Face deep learning containers (DLCs) for you to train and host pre-trained models directly from Hugging Face’s repository of over 26,000 models. We’ve launched the SageMaker Training Compiler to speed up the runtime of your Hugging Face training loops by up to 50%. We’ve also integrated the Hugging Face flagship Transformers SDK with our distributed training libraries to make scaling out your NLP models easier than ever before.
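For example, a Hugging Face model can be fine-tuned on SageMaker through the Hugging Face estimator in the SageMaker Python SDK. The versions, script name, and hyperparameters below are placeholders for illustration, not an official recipe.

    from sagemaker.huggingface import HuggingFace

    # Placeholder versions and script; pick a supported Hugging Face DLC combination.
    huggingface_estimator = HuggingFace(
        entry_point="train.py",
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        transformers_version="4.17",
        pytorch_version="1.10",
        py_version="py38",
        hyperparameters={"model_name_or_path": "bert-base-uncased", "epochs": 3},
    )
    huggingface_estimator.fit({"train": "s3://<your-bucket>/train"})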

For more information about Hugging Face Transformer models on Amazon SageMaker, see Support for Hugging Face Transformer Models.

New features for large-scale NLP model training with the SageMaker model parallel library 

At AWS re:Invent 2020, SageMaker launched distributed libraries that provide the best performance on the cloud for training computer vision models like Mask-RCNN and NLP models like T5-3B. This is possible through enhanced communication primitives that are 20-40% faster than NCCL on AWS, and model distribution techniques that enable extremely large language models to scale across tens to hundreds to thousands of GPUs.

The SageMaker model parallel library (SMP) has always given you the ability to take your predefined NLP model in PyTorch, be that through Hugging Face or elsewhere, and partition that model onto multiple GPUs in your cluster. Said another way, SMP breaks up your model into smaller chunks so you don’t experience out of memory (OOM) errors. We’re pleased to add additional memory-saving techniques that are critical for large scale models, namely:

  • Tensor parallelism
  • Optimizer state sharding
  • Activation checkpointing
  • Activation offloading

You can combine these four features to utilize memory more efficiently and train the next generation of extreme-scale NLP models.

Distributed training and tensor parallelism

To understand tensor parallelism, it’s helpful to know that there are many kinds of distributed training, or parallelism. You’re probably already familiar with the most common type, data parallelism. The core of data parallelism works like this: you add an extra node to your cluster, such as going from one to two EC2 ML instances in your SageMaker estimator. Then you use a data parallel framework like Horovod, PyTorch Distributed Data Parallel, or the SageMaker data parallel library. This creates replicas of your model, one per accelerator, and handles sharding the data out to each node, along with bringing all the results together during the backpropagation step of your neural network. Think distributed gradient descent. Data parallelism is also popular within servers; you’re sharding data across all the GPUs, and occasionally CPUs, on all of your nodes. The following diagram illustrates data parallelism.
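In code, a minimal way to turn on data parallelism in a SageMaker estimator is through its distribution argument. This is a sketch; the instance types, versions, and paths below are placeholders.

    from sagemaker.pytorch import PyTorch

    # Sketch: enable the SageMaker data parallel library; values are placeholders.
    estimator = PyTorch(
        entry_point="train.py",
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p4d.24xlarge",
        instance_count=2,      # adding instances scales out data parallelism
        framework_version="1.10",
        py_version="py38",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://<your-bucket>/training-data")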

Model parallelism is slightly different. Instead of making copies of the same model, we split your model into pieces. Then we manage running it, so your data still flows through your neural network in exactly the same way mathematically, but different pieces of your model sit on different GPUs. If you’re using an ml.p3.8xlarge, you’ve got four NVIDIA V100s, so you’d probably want to shard your model into four pieces, one piece per GPU. If you jump up to two ml.p4d.24xlarge instances, that’s 16 A100s total in your cluster, so you might break your model into 16 pieces. This is also sometimes called pipeline parallelism, because the set of layers in the network is partitioned across GPUs and run in a pipelined manner to maximize GPU utilization. The following diagram illustrates model parallelism.

To make model parallelism happen at scale, we need a third type of distribution: tensor parallelism. Tensor parallelism takes the same concept one step further: we break apart the largest layers of your neural network and place parts of the layers themselves on different devices. This becomes relevant when you’re working with 175 billion parameters or more and trying to fit even a few records into RAM, along with parts of your model, to train that transformer. The following diagram illustrates tensor parallelism.

To enable tensor parallelism, set it within the smp options you pass to your estimator.
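The following sketch shows what those options can look like when passed to a SageMaker PyTorch estimator; the degrees, versions, and script name are illustrative, not prescriptive.

    from sagemaker.pytorch import PyTorch

    # Illustrative smp configuration; adjust the degrees to your model and cluster.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 1,   # number of model partitions (pipeline parallelism)
            "tensor_parallel_degree": 4,     # split individual layers across 4 GPUs
            "ddp": True,                     # use distributed data parallel underneath
        },
    }

    mpi_options = {
        "enabled": True,
        "processes_per_host": 8,             # one process per GPU on an ml.p4d.24xlarge
    }

    smp_estimator = PyTorch(
        entry_point="train_gpt_simple.py",   # assumed training script name
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
        framework_version="1.10",
        py_version="py38",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": mpi_options,
        },
    )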

In the preceding code, pipeline_parallel_degree describes how many segments your model should be sharded into, based on the pipeline parallelism we discussed earlier. Another word for this is partitions.

To enable tensor parallelism, set tensor_parallel_degree to your desired level. Make sure you pick a number equal to or smaller than the number of GPUs per instance, so no greater than 8 for ml.p4d.24xlarge machines. For additional script changes, refer to Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism.

The ddp parameter refers to distributed data parallel. You typically enable this if you’re using data parallelism or tensor parallelism, because the model parallelism library relies on DDP for these features.

Optimizer state sharding, activation offloading and checkpoints

If you have an extremely large model, you also need an extremely large optimizer state. Prepping your optimizer for SMP is straightforward: define it as usual in your training script and wrap it with the smp.DistributedOptimizer() object.

Make sure you enable this at the estimator by setting shard_optimizer_state to True in the smp_options you use to configure SMP:
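The following sketch shows both sides of this; the degrees, model, and optimizer choice are illustrative.

    # Estimator side (sketch): add the flag to the smp parameters shown earlier.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 1,
            "tensor_parallel_degree": 4,
            "ddp": True,
            "shard_optimizer_state": True,   # shard optimizer state across data-parallel ranks
        },
    }

    # Training script side (sketch): wrap the model and optimizer with SMP.
    import smdistributed.modelparallel.torch as smp
    import torch
    import torch.nn as nn

    smp.init()                                            # reads the config passed by the estimator
    model = smp.DistributedModel(nn.Linear(1024, 1024))   # wrap your real model here
    optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=2e-5))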

Similar to tensor and pipeline parallelism, SMP profiles your model and your world size (the total number of GPUs across all of your training nodes) to find the best placement strategies.

In deep learning, the intermediate layer outputs are also called activations, and they need to be stored during the forward pass because they’re used for gradient computation in the backward pass. In a large model, storing all these activations simultaneously in memory can create significant memory bottlenecks. To address this bottleneck, you can use activation checkpointing, the third new feature in the SageMaker model parallelism library. Activation checkpointing, or gradient checkpointing, is a technique that reduces memory usage by clearing the activations of certain layers and recomputing them during the backward pass. This effectively trades extra computation time for reduced memory usage.
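As a generic PyTorch illustration of the idea (not the SMP-specific API), a checkpointed block discards its intermediate activations in the forward pass and recomputes them in the backward pass.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlock(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, x):
            # Activations inside self.block are not stored; they are recomputed on backward.
            return checkpoint(self.block, x)

    x = torch.randn(4, 1024, requires_grad=True)
    loss = CheckpointedBlock()(x).sum()
    loss.backward()   # triggers recomputation of the checkpointed activations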

Lastly, activation offloading builds directly on activation checkpointing. It’s a strategy that keeps only a few tensor activations in GPU memory during model training. Specifically, we move the checkpointed activations to CPU memory during the forward pass and load them back to the GPU for the backward pass of a specific micro-batch.

Micro-batches and placement strategies

Other topics that sometimes cause confusion for customers are micro-batches and placement strategies. Both are hyperparameters you can supply to the SageMaker model parallel library. Specifically, micro-batches are relevant when implementing models that rely on pipeline parallelism, such as those of 30 billion parameters or more.

Micro-batches are subsets of minibatches. When your model is in its training loop, you define a certain number of records to pick up and pass forward and backward through the layers; this is called a minibatch, or sometimes just a batch. A full pass through your dataset is called an epoch. To run forward and backward passes with pipeline parallelism, the SageMaker model parallel library splits each minibatch into smaller subsets called micro-batches, which are run one at a time to maximize GPU utilization. In our GPT-2 example, we added a default of one micro-batch directly to the training script.

As you scale up your training configuration, we strongly recommend that you adjust your batch size and micro-batch size accordingly. This is the only way to ensure good performance: you must consider batch size and micro-batch size as a function of your overall world size when relying on pipeline parallelism.
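As a sketch (the values here are illustrative, and you should check the parameter names against the SMP documentation for your library version), the pipeline degree, micro-batch count, and batch size are typically scaled together.

    # Illustrative scaling of batch size and micro-batches with pipeline parallelism.
    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 4,
            "microbatches": 8,        # each minibatch is split into 8 micro-batches
            "ddp": True,
        },
    }

    hyperparameters = {
        "train_batch_size": 64,       # keep divisible by the number of micro-batches
    }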

Placement strategies tell SageMaker where to physically place your model partitions. If you’re using both model parallelism and data parallelism, setting placement_strategy to “cluster” places model replicas on device IDs (GPUs) that are physically close to each other. However, if you want to be more prescriptive about your parallelism strategy, you can break it down into a single string with different combinations of three letters: D for data parallelism, P for pipeline parallelism, and T for tensor parallelism. We generally recommend keeping the default placement of “cluster”, because this is most appropriate for large-scale model training. The “cluster” placement corresponds to “DPT”.

For more information about placement strategies, see Placement Strategy with Tensor Parallelism.

Example use case

Let’s imagine you have one ml.p3.16xlarge in your training job. That gives you 8 NVIDIA V100s per node. Remember, every time you add an extra instance, you incur additional bandwidth overhead, so it’s always better to have more GPUs on a single node. In this case, you’re better off with one ml.p3.16xlarge than, for example, two ml.p3.8xlarge instances. Even though the number of GPUs is the same, the extra bandwidth overhead of the extra node slows down your throughput.

The following diagram illustrates four-way model parallelism combined with two-way data parallelism. This means you actually have two replicas of your model (think data parallel), each of which is partitioned across four GPUs (model parallel).

If any of those model partitions are too large to fit onto a single GPU, you can add an extra type of distribution, tensor parallelism, to split them across multiple devices.

Conclusion

In this blog post, we discussed the SageMaker distributed training libraries, focusing especially on model parallelism. We shared performance benchmarks from our latest test: 32 samples per second across 120 ml.p4d.24xlarge instances with a 175-billion-parameter model on Amazon SageMaker. We anticipate that if we increased this to 240 p4 instances, we could train a 175-billion-parameter model in 25 days.

We also discussed the newest features that enable large-scale training, namely tensor parallelism, optimizer state sharding, activation checkpointing, and activation offloading. We shared some tips and tricks for enabling these through training on Amazon SageMaker.

Try it out yourself using the same notebook that generated our numbers, which is available on GitHub. You can also request more GPUs for your AWS account by requesting a service limit increase.


About the Authors

Emily Webber joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.

Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to train deep learning models on AWS. In his spare time, he enjoys spending time with his daughter, playing tennis, reading historical fiction, and traveling.

Luis Quintela is the Software Developer Manager for the AWS SageMaker model parallel library. In his spare time, he can be found riding his Harley in the SF Bay Area.

Read More