Get inspired in 2023 with new machine learning solutions for web developers with MediaPipe

Posted by Jen Person, Senior Developer Relations Engineer

I’m the type of person to say I don’t like to make New Year’s resolutions, but then I still quietly resolve to make some changes anyway. After overindulging over the holidays, I resolve to eat healthier, exercise more, spend more time with friends and family, and prioritize my mental health…but they’re not *New Year’s* resolutions I swear! Because whether you like to make New Year’s resolutions or not, the start of a new year can give you a feeling of inspiration. It’s like a blank slate full of possibilities!

What kind of changes are you resolving to make this year? If you’re looking to create an exciting new web project or take your work to the next level, then I recommend adding machine learning (ML)!

New year, new solutions

MediaPipe has been a great go-to solution for web developers interested in adding ML to their web applications. In 2022, the MediaPipe hands NPM package had around 70K downloads, the pose package had about 90K downloads, and the selfie segmentation package had over 130K downloads!

This year, MediaPipe has expanded to include MediaPipe Tasks, Model Maker, and Studio! Tasks are aptly named because they can be used to perform common ML tasks like image classification and object detection. Model Maker is a low-code solution for customizing your MediaPipe Tasks to fit your app’s needs. With MediaPipe Studio, you can view interactive demos of MediaPipe Tasks. In the future, you will be able to customize your tasks in MediaPipe Studio without writing any code.

MediaPipe’s solutions are special because they are available across multiple platforms, including Android, web, and Python, but given my background in JavaScript, I want to take this opportunity to shine the spotlight on web.

When compared to server-side ML, web ML has some unique benefits:

Lower latency – Predictions are done right on your users’ devices, so there is no waiting for server calls to complete. This is essential for applications that use a streaming component like the webcam.

User privacy – With predictions taking place on-device, your users’ data never leaves their device.

Click and go – Your users don’t have to download any additional applications or plugins. Just navigate to the desired URL and your ML experience is good to go!

MediaPipe is updating its offerings, including more solutions and opportunities for customization. Check out these new MediaPipe Tasks:

Image Classification – identify what an image represents among a set of categories defined at training time.

Photo of an American Flamingo facing left, labeled 'Flamingo 95%'.

Object Detection – detect the presence and location of multiple classes of object.

Image of a dog on the left and a cat on the right, with object detection labels 'dog' and 'cat'.

Text Classification – classify text into a set of defined categories, such as positive or negative sentiment.

Input text in a white bubble reads 'Great movie with a classic plot. I will recommend this to everyone.' The output shows five turquoise stars and a thumbs up against a green background.

Gesture Recognition – recognize specific hand gestures from a user, and invoke application features that correspond to those gestures.

Image of a hand giving a thumbs-up gesture, labeled 'Thumbs up 63%'.

Hand Landmark Detection – localize key points of the hands and render visual effects over the hands.

Image of a hand holding an egg, with white lines and blue nodes indicating the detected hand landmarks.

MediaPipe is adding more exciting solutions in 2023, so keep an eye out for what’s next!

Customize for your needs

Many of these solutions offer customization using MediaPipe Model Maker. The MediaPipe Model Maker package is a simple, low-code solution for customizing on-device ML models, including models for the web. And with MediaPipe Studio, you can prototype and benchmark solutions in-browser!

Resolve to make something great!

By now, a lot of our New Year’s resolutions have already been abandoned. But it’s definitely not too late to make a new one! Why not resolve to build something amazing with MediaPipe solutions for the web?

Create a rock paper scissors game

At the Women in ML Symposium, the MediaPipe team hosted a workshop walking through creating a rock paper scissors game using the MediaPipe solutions Gesture Recognizer task. Learn how to train a custom gesture recognizer by following along with the workshop on YouTube using the corresponding Colab notebook. You can also view a complete version of the game on Codepen.

Categorize your images

When uploading images, run image classification to automatically add relevant tags. Check out the image classification task documentation and the Codepen demo to see how to get started. You can even customize your model to add your own tags to suit your needs.

Screen grab of the MediaPipe Image Classifier Codepen demo, showing the image of a dog under the text 'Demo: Classify images. Click an image below to see its classification.'
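
MediaPipe Tasks is also available in Python (as noted in the platforms above), and the web JavaScript API follows the same create-options-then-classify pattern. Here's a minimal sketch of the Image Classifier task in Python; the model and image file names are placeholders:

# Minimal sketch of the MediaPipe Tasks Image Classifier (Python flavor).
# 'classifier.tflite' and 'photo.jpg' are placeholder file names.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.ImageClassifierOptions(
    base_options=python.BaseOptions(model_asset_path='classifier.tflite'),
    max_results=3)

with vision.ImageClassifier.create_from_options(options) as classifier:
    image = mp.Image.create_from_file('photo.jpg')
    result = classifier.classify(image)
    for category in result.classifications[0].categories:
        print(category.category_name, category.score)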

Run sentiment analysis

Want to get an idea how your users are feeling? Run sentiment analysis on text to classify it as positive or negative. See the documentation and the Codepen demo to find out how it’s done. The best part is that you can also customize your model to classify text in whatever category you need!

Screen grab of the MediaPipe Text Classifier Codepen demo.
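
The flow for text is analogous. Here's a rough Python sketch, assuming a sentiment model file named 'sentiment.tflite' (the web API mirrors this pattern):

# Sketch of the MediaPipe Tasks Text Classifier; 'sentiment.tflite' is a placeholder.
from mediapipe.tasks import python
from mediapipe.tasks.python import text

options = text.TextClassifierOptions(
    base_options=python.BaseOptions(model_asset_path='sentiment.tflite'))

with text.TextClassifier.create_from_options(options) as classifier:
    result = classifier.classify("Great movie with a classic plot.")
    top = result.classifications[0].categories[0]
    print(top.category_name, top.score)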

[Your idea here]

Let’s face it: you’re much more creative than I am! So when you build something amazing with MediaPipe Solutions, share it with us on the TensorFlow forum, LinkedIn, or Twitter!

Real-time tracking of wildfire boundaries using satellite imagery

As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke.

As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Canada and Australia. They are displayed, with additional information from local authorities, on Google Search and Google Maps, allowing people to keep safe and stay informed about potential dangers near them, their homes or loved ones.

Real-time boundary tracking of the 2021-2022 Wrattonbully bushfire, shown as a red polygon in Google Maps.

Inputs

Wildfire boundary tracking requires balancing spatial resolution and update frequency. The most scalable method to obtain frequent boundary updates is to use geostationary satellites, i.e., satellites that orbit the earth once every 24 hours. These satellites remain at a fixed point above Earth, providing continual coverage of the area surrounding that point. Specifically, our wildfire tracker models use the GOES-16 and GOES-18 satellites to cover North America, and the Himawari-9 and GK2A satellites to cover Australia. These provide continent-scale images every 10 minutes. The spatial resolution is 2km at nadir (the point directly below the satellite), and lower as one moves away from nadir. The goal here is to provide people with warnings as soon as possible, and refer them to authoritative sources for spatially precise, on-the-ground data, as necessary.

Smoke plumes obscuring the 2018 Camp Fire in California. [Image from NASA Worldview]

Determining the precise extent of a wildfire is nontrivial, since fires emit massive smoke plumes, which can spread far from the burn area and obscure the flames. Clouds and other meteorological phenomena further obscure the underlying fire. To overcome these challenges, it is common to rely on infrared (IR) frequencies, particularly in the 3–4 μm wavelength range. This is because wildfires (and similar hot surfaces) radiate considerably at this frequency band, and these emissions diffract with relatively minor distortions through smoke and other particulates in the atmosphere. This is illustrated in the figure below, which shows a multispectral image of a wildfire in Australia. The visible channels (blue, green, and red) mostly show the triangular smoke plume, while the 3.85 μm IR channel shows the ring-shaped burn pattern of the fire itself. Even with the added information from the IR bands, however, determining the exact extent of the fire remains challenging, as the fire has variable emission strength, and multiple other phenomena emit or reflect IR radiation.

Himawari-8 hyperspectral image of a wildfire. Note the smoke plume in the visible channels (blue, green, and red), and the ring indicating the current burn area in the 3.85 μm band.

Model

Prior work on fire detection from satellite imagery is typically based on physics-based algorithms for identifying hotspots from multispectral imagery. For example, the National Oceanic and Atmospheric Administration (NOAA) fire product identifies potential wildfire pixels in each of the GOES satellites, primarily by relying on the 3.9 μm and 11.2 μm frequencies (with auxiliary information from two other frequency bands).

In our wildfire tracker, the model is trained on all satellite inputs, allowing it to learn the relative importance of different frequency bands. The model receives a sequence of the three most recent images from each band so as to compensate for temporary obstructions such as cloud cover. Additionally, the model receives inputs from two geostationary satellites, achieving a super-resolution effect whereby the detection accuracy improves upon the pixel size of either satellite. In North America, we also supply the aforementioned NOAA fire product as input. Finally, we compute the relative angles of the sun and the satellites, and provide these as additional input to the model.

All inputs are resampled to a uniform 1 km–square grid and fed into a convolutional neural network (CNN). We experimented with several architectures and settled on a CNN followed by a 1×1 convolutional layer to yield separate classification heads for fire and cloud pixels (shown below). The number of layers and their sizes are hyperparameters, which are optimized separately for Australia and North America. When a pixel is identified as a cloud, we override any fire detection since heavy clouds obscure underlying fires. Even so, separating the cloud classification task improves the performance of fire detection as we incentivize the system to better identify these edge cases.

CNN architecture for the Australia model; a similar architecture was used for North America. Adding a cloud classification head improves fire classification performance.
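
To make this design concrete, here is a minimal Keras sketch of a convolutional trunk with separate 1×1-convolution heads for fire and cloud pixels. The input shape, depth, and filter counts are illustrative assumptions, since the actual hyperparameters are tuned per region:

import tensorflow as tf
from tensorflow.keras import layers

def build_fire_cloud_model(input_shape=(64, 64, 30), filters=32, n_blocks=4):
    # Input channels would hold the stacked multispectral frames and angle features.
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for _ in range(n_blocks):  # placeholder depth; tuned per region in practice
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Separate 1x1-convolution classification heads for fire and cloud pixels.
    fire = layers.Conv2D(1, 1, activation="sigmoid", name="fire")(x)
    cloud = layers.Conv2D(1, 1, activation="sigmoid", name="cloud")(x)
    return tf.keras.Model(inputs, [fire, cloud])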

To train the network, we used thermal anomalies data from the MODIS and VIIRS polar-orbiting satellites as labels. MODIS and VIIRS have higher spatial accuracy (750–1000 meters) than the geostationary satellites we use as inputs. However, they cover a given location only once every few hours, which occasionally causes them to miss rapidly-advancing fires. Therefore, we use MODIS and VIIRS to construct a training set, but at inference time we rely on the high-frequency imagery from geostationary satellites.

Even when limiting attention to active fires, most pixels in an image are not currently burning. To reduce the model’s bias towards non-burning pixels, we upsampled fire pixels in the training set and applied focal loss to encourage improvements in the rare misclassified fire pixels.
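
For reference, a standard binary focal loss looks like the sketch below; the focusing parameter gamma is an assumed value, not one reported here:

import tensorflow as tf

def binary_focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    # Down-weights well-classified pixels so rare fire pixels dominate the gradient.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    pt = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    return -tf.reduce_mean(tf.pow(1.0 - pt, gamma) * tf.math.log(pt))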

The progressing boundary of the 2022 McKinney fire, and a smaller nearby fire.

Evaluation

High-resolution fire signals from polar-orbiting satellites are a plentiful source for training data. However, such satellites use sensors that are similar to geostationary satellites, which increases the risk of systemic labeling errors (e.g., cloud-related misdetections) being incorporated into the model. To evaluate our wildfire tracker model without such bias, we compared it against fire scars (i.e., the shape of the total burnt area) measured by local authorities. Fire scars are obtained after a fire has been contained and are more reliable than real-time fire detection techniques. We compare each fire scar to the union of all fire pixels detected in real time during the wildfire to obtain an image such as the one shown below. In this image, green represents correctly identified burn areas (true positive), yellow represents unburned areas detected as burn areas (false positive), and red represents burn areas that were not detected (false negative).

Example evaluation for a single fire. Pixel size is 1 km × 1 km.

We compare our models to official fire scars using the precision and recall metrics. To quantify the spatial severity of classification errors, we take the maximum distance between a false positive or false negative pixel and the nearest true positive fire pixel. We then average each metric across all fires. The results of the evaluation are summarized below. Most severe misdetections were found to be a result of errors in the official data, such as a missing scar for a nearby fire.
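
Here is a sketch of how these metrics could be computed on the 1 km grid; the exact implementation isn't described here, and distances are measured in pixels (kilometers on this grid):

import numpy as np
from scipy.ndimage import distance_transform_edt

def scar_metrics(pred, truth):
    # pred, truth: boolean 2D arrays of fire pixels on the 1 km grid.
    tp = pred & truth
    precision = tp.sum() / max(pred.sum(), 1)
    recall = tp.sum() / max(truth.sum(), 1)
    # Distance from every pixel to the nearest true-positive pixel.
    dist_to_tp = distance_transform_edt(~tp)
    errors = pred ^ truth  # false positives and false negatives
    max_error_km = dist_to_tp[errors].max() if errors.any() else 0.0
    return precision, recall, max_error_km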

Test set metrics comparing our models to official fire scars.

We performed two additional experiments on wildfires in the United States (see table below). First, we evaluated an earlier model that relies only on NOAA’s GOES-16 and GOES-17 fire products. Our model outperforms this approach in all metrics considered, demonstrating that the raw satellite measurements can be used to enhance the existing NOAA fire product.

Next, we collected a new test set consisting of all large fires in the United States in 2022. This test set was not available during training because the model launched before the fire season began. Evaluating the performance on this test set shows performance in line with expectations from the original test set.

Comparison between models on fires in the United States.

Conclusion

Boundary tracking is part of Google’s wider commitment to bring accurate and up-to-date information to people in critical moments. This demonstrates how we use satellite imagery and ML to track wildfires, and provide real time support to affected people in times of crisis. In the future, we plan to keep improving the quality of our wildfire boundary tracking, to expand this service to more countries and continue our work helping fire authorities access critical information in real time.

Acknowledgements

This work is a collaboration between teams from Google Research, Google Maps and Crisis Response, with support from our partnerships and policy teams. We would also like to thank the fire authorities whom we partner with around the world.

TensorFlow Lite Micro with ML acceleration

Posted by Scott Main, Technical Writer, and the Coral team

In just a few years, ML models for mobile and embedded systems have come a very long way. With TensorFlow Lite (TFLite), you can now run sophisticated models that perform pose estimation and object segmentation, but these models still require a relatively powerful processor and a high-level OS in a mobile device or small computer like a Raspberry Pi. Alternatively, you can use TensorFlow Lite Micro (TFLM) on low-power microcontrollers (MCUs) to run simple models such as image and audio classification. However, the models for MCUs are much smaller, so they have limited capabilities and accuracy.

So there’s an opportunity cost when you must select between TFLM (low power but limited model performance) and regular TFLite (great model performance but higher power cost). Wouldn’t it be nice if you could get both on one board? Well, we’re happy to announce that the Coral Dev Board Micro is now available to provide exactly that.

A tiny board with big muscle

The Dev Board Micro is a microcontroller board (with a dual-core Cortex-M7 and Cortex-M4), so it’s small and power efficient, but it also includes the Coral Edge TPU™ on board, so it offers outstanding inferencing speeds for larger TFLite models. Plus, it has an on-board camera (324×324) and microphone. Naturally, there are plenty of GPIO pins and high-density connectors for add-on boards (such as our own Wireless Add-on and PoE Add-on).

Photo of a hand holding up the Dev Board Micro, with the Coral logo visible on the board.

The Dev Board Micro executes your models using TFLM, which supports only a subset of operations in TFLite. Even if TFLM did support all the same ops, the MCU would still be much too slow for practical applications that use complex models such as for object detection and pose estimation. However, when you compile a TFLite model for the Edge TPU, all the MCU needs to do is set the model’s input, delegate the model ops to the Edge TPU, and then read the output.

As such, even though you’re still using the smaller TFLM interpreter, you can run sophisticated TFLite models that otherwise are not compatible with the TFLM interpreter, because they actually execute on the Edge TPU. For example, with the Dev Board Micro, you can run PoseNet for pose estimation, BodyPix for body segmentation, SSD MobileNet for object detection, and much more, at real-time speeds:

Table of models with their corresponding inference times on the Dev Board Micro with the Edge TPU.

Of course, running the Edge TPU demands more power, but the beauty of this board’s dual-core MCU is that you can run low-power apps on the M4 (which supports tiny TFLM models) and then activate the M7 and Edge TPU only as needed to run more sophisticated TFLite models.

To better understand how this board compares to our other Coral boards, here’s a brief comparison of our different developer boards:

Table comparing price (USD), size, processor, RAM, camera, microphone, Wi-Fi/Bluetooth, Ethernet, and operating system across the Dev Board Micro, Dev Board Mini, and Dev Board.

Get started

We built a new platform for the Dev Board Micro based on FreeRTOS and included compatibility with the Arduino programming language. So you can build a C++ app with CMake and flash it to the board with our command line tools, or you can write and upload an Arduino sketch with the Arduino IDE. We call this new platform coralmicro and it’s fully open sourced on GitHub.

If you choose to code with FreeRTOS, coralmicro includes all the core FreeRTOS APIs you need to build multi-tasking apps on the MCU, plus custom coralmicro APIs for interacting with GPIOs, capturing photos, listening to audio, performing multi-core processing, and much more.

Because coralmicro uses TensorFlow Lite for Microcontrollers for inferencing, running a TensorFlow Lite model on the Dev Board Micro works almost exactly the way you expect, if you’ve used TensorFlow Lite on other platforms. One difference with TFLM, compared to TFLite, is that you need to specify the ops used by your model by adding them to the MicroMutableOpResolver. For example, if your model uses 2D convolution, then you need to call AddConv2D(). This way, you conserve memory by compiling only the op kernels you actually need to run your model on the MCU. However, if your model is compiled to run on the Edge TPU, then you also need to add the Edge TPU custom op, which accounts for all the ops that run on the Edge TPU. For example, when using SSD MobileNet for object detection on the Edge TPU, only the dequantize and post-processing ops run on the MCU, and the rest are delegated to the Edge TPU custom op, so the code to set up the MicroInterpreter looks like this:

auto tpu_context = coralmicro::EdgeTpuManager::GetSingleton()->OpenDevice();
if (!tpu_context) {
  printf("ERROR: Failed to get EdgeTpu context\r\n");
  vTaskSuspend(nullptr);
}

tflite::MicroErrorReporter error_reporter;
tflite::MicroMutableOpResolver<3> resolver;
resolver.AddDequantize();
resolver.AddDetectionPostprocess();
resolver.AddCustom(coralmicro::kCustomOp, coralmicro::RegisterCustomOp());

tflite::MicroInterpreter interpreter(tflite::GetModel(model.data()), resolver,
                                     tensor_arena, kTensorArenaSize,
                                     &error_reporter);

Notice that you also need to turn on the Edge TPU with OpenDevice(). Other than that and AddCustom(), the code to run an inference on the Dev Board Micro is pretty standard TensorFlow code. For more details, see our API reference for TFLM, and check out our code examples for FreeRTOS.

If you prefer to code with the Arduino IDE, we offer Arduino-style APIs for most of the same features available in FreeRTOS (multi-core processing is not available in Arduino). All you need to do is install the “Coral” boards package in the Arduino IDE’s Board Manager, select the Dev Board Micro board, and then you can browse all our examples for the Dev Board Micro in File > Examples.


You can learn more about the board and find a seller here, and start running the code examples by following our get started guide.

Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS

Today, the NFL is continuing its journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their knowledge of the games within the game of football. For the 2022 season, the NFL aimed to leverage player-tracking data and new advanced analytics techniques to better understand special teams.

The goal of the project was to predict how many yards a returner would gain on a punt or kickoff play. One of the challenges when building predictive models for punt and kickoff returns is the presence of very rare events, such as touchdowns, that have significant importance in the dynamics of a game. Data distributions with fat tails are common in real-world applications, where rare events have a significant impact on the overall performance of a model. Using a robust method to accurately model the distribution over extreme events is crucial for better overall performance.

In this post, we demonstrate how to use Spliced Binned-Pareto distribution implemented in GluonTS to robustly model such fat-tailed distributions.

We first describe the dataset used. Next, we present the data preprocessing and other transformation methods applied to the dataset. We then explain the details of the ML methodology and model training procedures. Finally, we present the model performance results.

Dataset

In this post, we used two datasets to build separate models for punt and kickoff returns. The player tracking data contains each player's position, direction, acceleration, and more (in x,y coordinates). There are around 3,000 and 4,000 plays from four NFL seasons (2018–2021) for punt and kickoff plays, respectively. In addition, there are very few punt and kickoff-related touchdowns in the datasets: only 0.23% and 0.8%, respectively. The data distributions for punts and kickoffs are different. For example, the true yardage distributions for kickoffs and punts are similar in shape but shifted, as shown in the following figure.

Punts and kickoff return yards distribution

Data preprocessing and feature engineering

First, the tracking data was filtered for just the data related to punts and kickoff returns. The player data was used to derive features for model development:

  • X – Player position along the long axis of the field
  • Y – Player position along the short axis of the field
  • S – Speed in yards/second; replaced by Dis*10 to make it more accurate (Dis is the distance in the past 0.1 seconds)
  • Dir – Angle of player motion (degrees)

From the preceding data, each play was transformed into a 10×11×14 tensor with 10 offensive players (excluding the ball carrier), 11 defenders, and 14 derived features:

  • sX – x speed of a player
  • sY – y speed of a player
  • s – Speed of a player
  • aX – x acceleration of a player
  • aY – y acceleration of a player
  • relX – x distance of player relative to ball carrier
  • relY – y distance of player relative to ball carrier
  • relSx – x speed of player relative to ball carrier
  • relSy – y speed of player relative to ball carrier
  • relDist – Euclidean distance of player relative to ball carrier
  • oppX – x distance of offense player relative to defense player
  • oppY – y distance of offense player relative to defense player
  • oppSx – x speed of offense player relative to defense player
  • oppSy – y speed of offense player relative to defense player

To augment the data, the X and Y position values were also mirrored to account for right- and left-side field positions. The data preprocessing and feature engineering were adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
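
The following sketch illustrates how such a tensor could be assembled for one play; the array shapes and the way accelerations are supplied are assumptions for illustration:

import numpy as np

def play_tensor(off_xy, off_v, off_a, def_xy, def_v, carrier_xy, carrier_v):
    # off_*: (10, 2) arrays for offense (excluding the ball carrier);
    # def_*: (11, 2) arrays for defense; carrier_*: (2,) arrays.
    n_off, n_def = len(off_xy), len(def_xy)
    feats = np.zeros((n_off, n_def, 14), dtype=np.float32)
    for i in range(n_off):
        for j in range(n_def):
            feats[i, j, 0:2] = off_v[i]                       # sX, sY
            feats[i, j, 2] = np.linalg.norm(off_v[i])         # s
            feats[i, j, 3:5] = off_a[i]                       # aX, aY
            feats[i, j, 5:7] = off_xy[i] - carrier_xy         # relX, relY
            feats[i, j, 7:9] = off_v[i] - carrier_v           # relSx, relSy
            feats[i, j, 9] = np.linalg.norm(off_xy[i] - carrier_xy)  # relDist
            feats[i, j, 10:12] = off_xy[i] - def_xy[j]        # oppX, oppY
            feats[i, j, 12:14] = off_v[i] - def_v[j]          # oppSx, oppSy
    return feats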

ML methodology and model training

Because we’re interested in all possible outcomes from the play, including the probability of a touchdown, we can’t simply predict the average yards gained as a regression problem. We need to predict the full probability distribution of all possible yard gains, so we framed the problem as a probabilistic prediction.

One way to implement probabilistic predictions is to assign the yards gained to several bins (such as less than 0, from 0–1, from 1–2, …, from 14–15, more than 15) and predict the bin as a classification problem. The downside of this approach is that we want small bins to have a high definition picture of the distribution, but small bins mean fewer data points per bin and our distribution, especially the tails, may be poorly estimated and irregular.

Another way to implement probabilistic predictions is to model the output as a continuous probability distribution with a limited number of parameters (for example, a Gaussian or Gamma distribution) and predict the parameters. This approach gives a very high definition and regular picture of the distribution, but is too rigid to fit the true distribution of yards gained, which is multi-modal and heavy tailed.

To get the best of both methods, we use Spliced Binned-Pareto distribution (SBP), which has bins for the center of the distribution where a lot of data is available, and Generalized Pareto distribution (GPD) at both ends, where rare but important events can happen, like a touchdown. The GPD has two parameters: one for scale and one for tail heaviness, as seen in the following graph (source: Wikipedia).
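
For reference, with shape (tail-heaviness) parameter ξ > 0 and scale σ, the GPD has the cumulative distribution function

F(x) = 1 − (1 + ξx/σ)^(−1/ξ),  x ≥ 0,

so larger values of ξ produce heavier tails.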

By splicing the GPD with the binned distribution (see the following left graph) on both sides, we obtain the following SBP on the right. The lower and upper thresholds where splicing is done are hyperparameters.

Binned and SBP distributions

As a baseline, we used the model that won the NFL Big Data Bowl competition on Kaggle. This model uses CNN layers to extract features from the prepared data, and predicts the outcome as a “1 yard per bin” classification problem. For our model, we kept the feature extraction layers from the baseline and only modified the last layer to output SBP parameters instead of probabilities for each bin, as shown in the following figure (image adapted from the 1st place solution post by The Zoo).

Model Architecture

We used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time series modeling, but the SBP distribution is not specific to time series, and we were able to repurpose it for regression. For more information on how to use GluonTS SBP, see the following demo notebook.

Models were trained and cross-validated on the 2018, 2019, and 2020 seasons and tested on the 2021 season. To avoid leakage during cross-validation, we grouped all plays from the same game into the same fold.

For evaluation, we kept the metric used in the Kaggle competition, the continuous ranked probability score (CRPS), which can be seen as an alternative to the log-likelihood that is more robust to outliers. We also used the Pearson correlation coefficient and the RMSE as general and interpretable accuracy metrics. Furthermore, we looked at the probability of a touchdown and probability plots to evaluate calibration.
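
For a binned predictive distribution, the CRPS can be approximated by integrating the squared difference between the predicted CDF and the observation's step-function CDF. A rough numpy sketch:

import numpy as np

def crps_binned(bin_probs, bin_edges, y_true):
    # bin_probs: (n_bins,) probabilities; bin_edges: (n_bins + 1,) yard boundaries.
    cdf = np.cumsum(bin_probs)
    # Step CDF of the observation, evaluated at the right edge of each bin.
    step = (bin_edges[1:] >= y_true).astype(float)
    widths = np.diff(bin_edges)
    return np.sum((cdf - step) ** 2 * widths)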

The model was trained on the CRPS loss using Stochastic Weight Averaging and early stopping.

To deal with the irregularity of the binned part of the output distributions, we used two techniques:

  • A smoothness penalty proportional to the squared difference between two consecutive bins
  • Ensembling models trained during cross-validation
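
For illustration, the first technique could be implemented as a regularizer on the predicted bin probabilities; the weight corresponds to the smoothness penalty value (this is a sketch, not the exact training code):

import torch

def smoothness_penalty(bin_logits, weight):
    # Penalize squared differences between consecutive bin probabilities.
    probs = torch.softmax(bin_logits, dim=-1)
    return weight * ((probs[..., 1:] - probs[..., :-1]) ** 2).sum(dim=-1).mean()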

Model performance results

For each dataset, we performed a grid search over the following options:

  • Probabilistic models
    • Baseline was one probability per yard
    • SBP was one probability per yard in the center, with a generalized Pareto distribution in the tails
  • Distribution smoothing
    • No smoothing (smoothness penalty = 0)
    • Smoothness penalty = 5
    • Smoothness penalty = 10
  • Training and inference procedure
    • 10-fold cross-validation and ensemble inference (k10)
    • Training on train and validation data for 10 epochs or 20 epochs

Then we looked at the metrics for the top five models sorted by CRPS (lower is better).

For kickoff data, the SBP model slightly outperforms the baseline in terms of CRPS, but more importantly it estimates the touchdown probability better (the true probability is 0.80% in the test set). We see that the best models use 10-fold ensembling (k10) and no smoothness penalty, as shown in the following table.

| Training | Model    | Smoothness | CRPS  | RMSE  | CORR % | P(touchdown) % |
|----------|----------|------------|-------|-------|--------|----------------|
| k10      | SBP      | 0          | 4.071 | 9.641 | 47.15  | 0.78           |
| k10      | Baseline | 0          | 4.074 | 9.62  | 47.585 | 0.306          |
| k10      | Baseline | 5          | 4.075 | 9.626 | 47.43  | 0.274          |
| k10      | SBP      | 5          | 4.079 | 9.656 | 46.977 | 0.682          |
| k10      | Baseline | 10         | 4.08  | 9.621 | 47.519 | 0.265          |

The following plot of the observed frequencies and predicted probabilities indicates good calibration of our best model, with an RMSE of 0.27 between the two distributions. Note the high-yardage occurrences (for example, 100 yards) in the tail of the true (blue) empirical distribution, whose probabilities the SBP captures better than the baseline method.

Kickoff observed frequencies and predicted probability distribution

For punt data, the baseline outperforms the SBP, perhaps because extreme yardage events are even rarer, making it a better trade-off to capture the peaks between 0–10 yards. Contrary to the kickoff data, the best punt model uses a smoothness penalty. The following table summarizes our findings.

| Training | Model    | Smoothness | CRPS  | RMSE  | CORR % | P(touchdown) % |
|----------|----------|------------|-------|-------|--------|----------------|
| k10      | Baseline | 5          | 3.961 | 8.313 | 35.227 | 0.547          |
| k10      | Baseline | 0          | 3.972 | 8.346 | 34.227 | 0.579          |
| k10      | Baseline | 10         | 3.978 | 8.351 | 34.079 | 0.555          |
| k10      | SBP      | 5          | 3.981 | 8.342 | 34.971 | 0.723          |
| k10      | SBP      | 0          | 3.991 | 8.378 | 33.437 | 0.677          |

The following plot of observed frequencies (in blue) and predicted probabilities for the two best punt models indicates that the non-smoothed model (in orange) is slightly better calibrated than the smoothed model (in green) and may be a better choice overall.

Punt true and predicted probabilities

Conclusion

In this post, we showed how to build predictive models with fat-tailed data distribution. We used Spliced Binned-Pareto distribution, implemented in GluonTS, which can robustly model such fat-tailed distributions. We used this technique to build models for punt and kickoff returns. We can apply this solution to similar use cases where there are very few events in the data, but those events have significant impact on the overall performance of the models.

If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.


About the Authors

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps AWS customers across various industries such as healthcare and life sciences, manufacturing, automotive, and sports and media, accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab team at Amazon Web Services. He works with AWS customers to solve business problems with artificial intelligence and machine learning. Outside of work you may find him at the beach, playing with his children, surfing or kitesurfing.

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.

Kyeong Hoon (Jonathan) Jung is a senior software engineer at the National Football League. He has been with the Next Gen Stats team for the last seven years helping to build out the platform from streaming the raw data, building out microservices to process the data, to building API’s that exposes the processed data. He has collaborated with the Amazon Machine Learning Solutions Lab in providing clean data for them to work with as well as providing domain knowledge about the data itself. Outside of work, he enjoys cycling in Los Angeles and hiking in the Sierras.

Michael Chi is a Senior Director of Technology overseeing Next Gen Stats and Data Engineering at the National Football League. He has a degree in Mathematics and Computer Science from the University of Illinois at Urbana Champaign. Michael first joined the NFL in 2007 and has primarily focused on technology and platforms for football statistics. In his spare time, he enjoys spending time with his family outdoors.

  Mike Band is a Senior Manager of Research and Analytics for Next Gen Stats at the National Football League. Since joining the team in 2018, he has been responsible for ideation, development, and communication of key stats and insights derived from player-tracking data for fans, NFL broadcast partners, and the 32 clubs alike. Mike brings a wealth of knowledge and experience to the team with a master’s degree in analytics from the University of Chicago, a bachelor’s degree in sport management from the University of Florida, and experience in both the scouting department of the Minnesota Vikings and the recruiting department of Florida Gator Football.

Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab

Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab

The National Football League (NFL) is one of the most popular sports leagues in the United States and the most valuable sports league in the world. The NFL, Biocore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More information regarding the NFL Player Health and Safety efforts is available on the NFL website.

The AWS Professional Services team has partnered with the NFL and Biocore to provide machine learning (ML)-based solutions for identifying helmet impacts from game footage using computer vision (CV) techniques. With multiple camera views available from each game, we have developed solutions to identify helmet impacts from each of these views and merge the helmet impact results.

The motivation for utilizing multiple camera views is the limited information available when impact events are captured with only one view. With only one perspective, some players might occlude each other or be blocked by other objects on the field. Adding more perspectives therefore allows our ML system to identify more impacts that aren’t visible in a single view. To showcase the results of our fusion process and how the team uses visualization tools to help evaluate model performance, we have developed a codebase to visually overlay the detection results from multiple views. This process helps identify the actual number of impacts individual players experience by removing duplicate impacts detected in multiple views.

In this post, we use the publicly available dataset from the NFL – Impact Detection Kaggle competition and show results for merging two views. The dataset includes helmet bounding boxes at every frame and impact labels found in each video. In particular, we focus on deduplicating and visualizing videos with the ID 57583_000082 in endzone and sideline views. You can download the endzone and sideline videos, and also the ground truth labels.

Prerequisites

The solution requires the following:

  • A free Amazon SageMaker Studio Lab account (or another Jupyter environment)
  • The endzone and sideline videos and the ground truth labels from the Kaggle competition, linked above

Get started on SageMaker Studio Lab and install the required packages

You can run the notebook from the GitHub repository or from SageMaker Studio Lab. In this post, we run the notebook from a SageMaker Studio Lab environment. We chose SageMaker Studio Lab because it is free, provides powerful CPU and GPU user sessions, and includes 15 GB of persistent storage that automatically saves your environment, enabling you to pick up where you left off. To use SageMaker Studio Lab, request and set up a new account. After the account is approved, complete the following steps:

  1. Visit the aws-samples GitHub repo.
  2. In the README section, choose Open Studio Lab.


This redirects you to your SageMaker Studio Lab environment.

  3. Select your CPU compute type, then choose Start Runtime.
  4. After the runtime starts, choose Copy to Project, which opens a new window with the Jupyter Lab environment.

Now you’re ready to use the notebook!

  5. Open fuse_and_visualize_multiview_impacts.ipynb and follow the instructions in the notebook.

The first cell in the notebook installs the necessary Python packages such as pandas and OpenCV:

%pip install pandas
%pip install opencv-contrib-python-headless

Import all the necessary Python packages and set pandas options for better visualization experience:

import os
import cv2
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)

We use pandas for ingesting and parsing through the CSV file with the annotated helmet bounding boxes as well as impacts. We use NumPy mainly for manipulating arrays and matrices. We use OpenCV for reading, writing, and manipulating image data in Python.

Prepare the data by fusing results from two views

To fuse the two perspectives together, we use the train_labels.csv from the Kaggle competition as an example because it contains ground truth impacts from both the endzone and sideline. The following function takes the input dataset and outputs a fused dataframe that is deduplicated for all the plays in the input dataset:

def prep_data(df):
    df['game_play'] = df['gameKey'].astype('str') + '_' + df['playID'].astype('str').str.zfill(6)
    return df

def dedup_view(df, windows):
    # define view
    df = df.sort_values(by='frame')
    view_columns = ['frame', 'left', 'width', 'top', 'height', 'video']
    common_columns = ['game_play', 'label', 'view', 'impactType']
    label_cleaned = df[view_columns + common_columns]
    
    # rename columns
    sideline_column_rename = {col: 'Sideline_' + col for col in view_columns}
    endzone_column_rename = {col: 'Endzone_' + col for col in view_columns}
    sideline_columns = list(sideline_column_rename.values())

    # create two dataframes, one for sideline, one for endzone
    label_endzone = label_cleaned.query('view == "Endzone"')
    label_endzone.rename(columns=endzone_column_rename, inplace=True)
    label_sideline = label_cleaned.query('view == "Sideline"')
    label_sideline.rename(columns=sideline_column_rename, inplace=True)

    # prepare sideline labels
    label_sideline['is_dup'] = False
    for columns in sideline_columns:
        label_endzone[columns] = np.nan
    label_endzone['is_dup'] = False

    # iterate over endzone rows to find matches and dedup
    for index, row in label_endzone.iterrows():
        player = row['label']
        frame = row['Endzone_frame']
        impact_type = row['impactType']
        sideline_row = label_sideline[(label_sideline['label'] == player) & 
                                      ((label_sideline['Sideline_frame'] >= frame - windows // 2) &
                                       (label_sideline['Sideline_frame'] <= frame + windows // 2 + 1)) &
                                      (label_sideline['is_dup'] == False) & 
                                      (label_sideline['impactType'] == impact_type)]

        if len(sideline_row) > 0:
            sideline_index = sideline_row.index[0]
            label_sideline['is_dup'].loc[sideline_index] = True

            for col in sideline_columns:
                label_endzone[col].loc[index] = sideline_row.iloc[0][col]
            label_endzone['is_dup'].loc[index] = True

    # combine the remaining (non-duplicate) sideline rows with the endzone rows
    not_dup_sideline = label_sideline[label_sideline['is_dup'] == False]
    final_output = pd.concat([not_dup_sideline, label_endzone])
    return final_output

def fuse_df(raw_df, windows):
    outputs = []
    all_game_play = raw_df['game_play'].unique()
    for game_play in all_game_play:
        df = raw_df.query('game_play ==@game_play')
        output = dedup_view(df, windows)
        outputs.append(output)

    output_df = pd.concat(outputs)
    output_df['gameKey'] = output_df['game_play'].apply(lambda x: x.split('_')[0]).map(int)
    output_df['playID'] = output_df['game_play'].apply(lambda x: x.split('_')[1]).map(int)
    return output_df

To run the function, we run the following code block to provide the location of the train_labels.csv data and then perform data preparation to add an additional column and extract only the impact rows. After running the function, we save the output to a dataframe variable called fused_df.

# read the annotated impact data from train_labels.csv
ground_truth = pd.read_csv('train_labels.csv')

# prepare game_play column using pipe(prep_data) function in pandas then filter the dataframe for just rows with impacts
ground_truth = ground_truth.pipe(prep_data).query('impact == 1')

# loop over all the unique game_plays and deduplicate the impact results from sideline and endzone
fused_df = fuse_df(ground_truth, windows=30)

The following screenshot shows the ground truth.

The following screenshot shows the fused dataframe examples.

Graph and video code

After we fuse the impact results, we use the generated fused_df to overlay the results onto our endzone and sideline videos and merge the two views together. We use the following functions for this; the inputs needed are the paths to the endzone video, the sideline video, the fused_df dataframe, and the final output path for the newly generated video. The functions used in this section are described in the markdown section of the notebook used in SageMaker Studio Lab.

def get_video_and_metadata(vid_path): 
    vid = cv2.VideoCapture(vid_path)
    total_frame_number = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = vid.get(cv2.CAP_PROP_FPS)
    return vid, total_frame_number, width, height, fps

def overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1):
    # look for duplicates 
    duplicates = fused_df.query(f"gameKey == {int(game_key)} and "
                                f"playID == {int(play_id)} and "
                                "is_dup == True and "
                                "Sideline_frame == @frame_cnt")

    frame_has_impact = False 
    
    if len(duplicates) > 0: 
        for duplicate in duplicates.itertuples(index=False): 
            if frame_cnt == duplicate.Sideline_frame: 
                frame_has_impact = True 

            if frame_has_impact: 
                cv2.rectangle(frame, #frame to be edited 
                              (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of top left corner 
                              (int(duplicate.Sideline_left) + int(duplicate.Sideline_width), int(duplicate.Sideline_top) + int(duplicate.Sideline_height)), #(x,y) of bottom right corner 
                              (0,0,255), #RED boxes
                              thickness=3)

                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Endzone_left), int(duplicate.Endzone_top)+ h1), #(x,y) of top left corner
                              (int(duplicate.Endzone_left) + int(duplicate.Endzone_width), int(duplicate.Endzone_top) + int(duplicate.Endzone_height) + h1), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
 
                cv2.line(frame, #frame to be edited
                         (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of point 1 in a line
                         (int(duplicate.Endzone_left), int(duplicate.Endzone_top) + h1), #(x,y) of point 2 in a line
                         (255, 255, 255), # WHITE lines
                         thickness=4)
 
    else:
        # if there are no duplicates, look for sideline-only and endzone-only impacts
        sl_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Sideline' and "
                                    "Sideline_frame == @frame_cnt")

        if len(sl_impacts) > 0:
            for impact in sl_impacts.itertuples(index=False):
                if frame_cnt == impact.Sideline_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Sideline_left), int(impact.Sideline_top)), #(x,y) of top left corner
                                  (int(impact.Sideline_left) + int(impact.Sideline_width), int(impact.Sideline_top) + int(impact.Sideline_height)), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW boxes
                                  thickness=3)

        ez_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Endzone' and "
                                    "Endzone_frame == @frame_cnt")

        if len(ez_impacts) > 0:
            for impact in ez_impacts.itertuples(index=False):
                if frame_cnt == impact.Endzone_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Endzone_left), int(impact.Endzone_top) + h1), #(x,y) of top left corner
                                  (int(impact.Endzone_left) + int(impact.Endzone_width), int(impact.Endzone_top) + int(impact.Endzone_height) + h1), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW boxes
                                  thickness=3)

    return frame, frame_has_impact

def generate_impact_video(ez_vid_path:str,
                          sl_vid_path:str,
                          fused_df:pd.DataFrame,
                          output_path:str,
                          freeze_impacts=True):
    
    # define the video codec for the output file
    VIDEO_CODEC = "MP4V"

    # parse game_key and play_id information from the name of the files
    game_key = os.path.basename(ez_vid_path).split('_')[0] # parse game_key
    play_id = os.path.basename(ez_vid_path).split('_')[1] # parse play_id
 
    # get metadata such as total frame number, width, height and frames per second (FPS) from endzone (ez) and sideline (sl) videos
    ez_vid, ez_total_frame_number, ez_width, ez_height, ez_fps = get_video_and_metadata(ez_vid_path)
    sl_vid, sl_total_frame_number, sl_width, sl_height, sl_fps = get_video_and_metadata(sl_vid_path)

    # define a video writer for the output video
    output_video = cv2.VideoWriter(output_path, #output file name
                                   cv2.VideoWriter_fourcc(*VIDEO_CODEC), #Video codec
                                   ez_fps, #frames per second in the output video
                                  (ez_width, ez_height+sl_height)) # frame size with stacking video vertically

    # find shorter video and use the total frame number from the shorter video for the output video
    total_frame_number = int(min(ez_total_frame_number, sl_total_frame_number))

    # iterate through each frame from endzone and sideline
    for frame_cnt in range(total_frame_number):
        frame_has_impact = False
        frame_near_impact = False

        # reading frames from both endzone and sideline
        ez_ret, ez_frame = ez_vid.read()
        sl_ret, sl_frame = sl_vid.read()

        # creating strings to be added to the output frames
        img_name = f"Game key: {game_key}, Play ID: {play_id}, Frame: {frame_cnt}"
        video_frame = f'{game_key}_{play_id}_{frame_cnt}'

        if ez_ret == True and sl_ret == True:
            h, w, c = ez_frame.shape
            h1,w1,c1 = sl_frame.shape
 
            if h != h1 or w != w1: # resize images if they're different
                ez_frame = cv2.resize(ez_frame,(w1,h1))
 
            frame = np.concatenate((sl_frame, ez_frame), axis=0) # stack the frames vertically

            frame, frame_has_impact = overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1)

            cv2.putText(frame, #image frame to be modified
                        img_name, #string to be inserted
                        (30, 30), #(x,y) location of the string
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), #WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1-20), #(x,y) location of the string in the top view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1+h-20), #(x,y) location of the string in the bottom view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            output_video.write(frame)

            # Freeze for 60 frames on impacts
            if frame_has_impact and freeze_impacts:
                for _ in range(60):
                    output_video.write(frame)
        else:
            break


    output_video.release()
    return

To run these functions, we can provide an input as shown in the following code, which generates a video called output.mp4:

generate_impact_video('57583_000082_Endzone.mp4',
                      '57583_000082_Sideline.mp4',
                      fused_df,
                      'output.mp4')

This generates a video as shown in the following example, where the red bounding boxes are impacts found in both endzone and sideline views, and the yellow bounding boxes are impacts that are found in just one view in either the endzone or sideline.

Conclusion

In this post, we demonstrated how the NFL, Biocore, and the AWS ProServe teams are working together to improve impact detection by fusing results from multiple views. This allows the teams to debug and visualize how the model is performing qualitatively. This process can easily be scaled up to three or more views; in our projects, we have utilized up to seven different views. Detecting helmet impacts by watching videos from only one view can be difficult due to view obstruction, but detecting impacts from multiple views and fusing the results allows us to improve our model performance.

To experiment with this solution, visit the aws-samples GitHub repo and refer to the fuse_and_visualize_multiview_impacts.ipynb notebook. Similar techniques can also be applied to other industries such as manufacturing, retail, and security, where having multiple views would benefit the ML system to better identify targets with a more comprehensive view.

For more information regarding NFL Player Health and Safety, visit the NFL website and NFL Explained: Innovation in Player Health & Safety.


About the authors

Chris Boomhower is a Machine Learning Engineer at AWS Professional Services. Chris has over 6 years experience developing supervised and unsupervised Machine Learning solutions across various industries. Today, he spends most his time helping customers in sports, healthcare, and agriculture industries design and build scalable, end-to-end, Machine Learning solutions.

Ben Fenker is a Senior Data Scientist in AWS Professional Services and has helped customers build and deploy ML solutions in industries ranging from sports to healthcare to manufacturing. He has a Ph.D. in physics from Texas A&M University and 6 years of industry experience. Ben enjoys baseball, reading, and raising his kids.

Sam Huddleston is a Principal Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.

Jarvis Lee is a Senior Data Scientist with AWS Professional Services. He has been with AWS for over five years, working with customers on machine learning and computer vision problems. Outside of work, he enjoys riding bicycles.

Tyler Mullenbach is the Global Practice Lead for ML with AWS Professional Services. He is responsible for driving the strategic direction of ML for Professional Services and ensuring that customers realize transformative business achievements through the adoption of ML technologies.

Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has over 5 years of industry experience in building computer vision and machine learning solutions.

Betty Zhang is a data scientist with 10 years of experience in data and technology. Her passion is to build innovative machine learning solutions to drive transformational changes for companies. In her spare time, she enjoys traveling, reading and learning about new technologies.

Google Research, 2022 & beyond: ML & computer systems

(This is Part 3 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Great machine learning (ML) research requires great systems. With the increasing sophistication of the algorithms and hardware in use today and with the scale at which they run, the complexity of the software necessary to carry out day-to-day tasks only increases. In this post, we provide an overview of the numerous advances made across Google this past year in systems for ML that enable us to support the serving and training of complex models while easing the complexity of implementation for end users. This blog post also highlights our research on leveraging ML itself to help improve and design the next generations of system stacks.

Distributed systems for ML

This year, we’ve made significant strides in improving our systems to better support large-scale computation in ML and scientific computing in general. The Google TPU hardware has been designed with scaling in mind since its inception, and each year we strive to push the boundaries even further. This year, we designed state-of-the-art serving techniques for large models, improved automatic partitioning of tensor programs and reworked the APIs of our libraries to make sure all of those developments are accessible to a wide audience of users.

One of our biggest efficiency improvements this year is the CollectiveEinsum strategy for evaluating the large-scale matrix multiplication operations that are at the heart of neural networks. Unlike previously popular SPMD partitioning strategies, which separate communication from device-local computation, this approach uses the fast TPU ICI links to overlap them, delivering performance improvements of up to 1.38x. This algorithm was also a key component of our work on efficiently scaling Transformer inference, which presents a wide variety of strategies that trade off between latency and hardware utilization, reaching a state-of-the-art model FLOPs utilization (MFU) of 76% in throughput-optimized configurations.

An illustration of AllGather-Einsum with 2-way intra-layer model parallelism, proposed in CollectiveEinsum strategy. Top: Illustration of non-overlapped execution. Bottom: Illustration of the CollectiveEinsum technique.

We have also integrated SPMD-style partitioning as a first-class concept into both TensorFlow, with the DTensor extension, and JAX, with the redesigned array type. In both libraries, tensors that seem complete to the programmer can be transparently sharded over a number of devices just by attaching declarative layout annotations. In fact, both approaches are compatible with existing code written for single-device computations, which can now scale into a multi-device program, usually without any code modifications!
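As a minimal sketch (assuming a recent JAX release with the jax.sharding module; the mesh axis name and array shapes here are arbitrary), a declarative layout annotation looks like this:

    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.sharding import Mesh, NamedSharding, PartitionSpec

    # Arrange all available devices into a 1D mesh with a named axis.
    mesh = Mesh(np.array(jax.devices()), axis_names=('data',))

    # Declarative layout: shard rows over the 'data' axis, replicate columns.
    sharding = NamedSharding(mesh, PartitionSpec('data', None))
    x = jax.device_put(jnp.arange(32.0).reshape(8, 4), sharding)

    # Code written for a single device runs unchanged; the compiler
    # partitions the computation according to the layout annotations.
    y = jnp.sin(x) @ jnp.cos(x).T
    print(y.sharding)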

With SPMD partitioning integrated into the core of our ML frameworks, the ability to infer and optimize how array programs are mapped onto a larger set of devices becomes critical for performance. In the past, this motivated the development of GSPMD, an important milestone in this area. However, GSPMD relies heavily on heuristics, and it still sometimes requires non-trivial decisions to be made manually, which often results in suboptimal performance. To make partitioning inference fully automatic, we collaborated with external colleagues to develop Alpa, a fully automated system that explores strategies for both operator-level (model) parallelism and pipeline parallelism between larger sub-computations. It successfully matches hand-tuned performance on popular models such as Transformers, but is also capable of scaling up other models, such as convolutional networks and mixture-of-experts models, that often cause existing automated methods to struggle.

Alpa overview. The inter-operator pass identifies the best way to assign a subgraph to a submesh. The intra-operator pass finds the best intra-operator parallelism plan for each pipeline stage. Finally, the runtime orchestration pass generates a static plan that orders the computation and communication.

In a similar vein, the recently published Pathways system adds an additional layer of virtualization on top of the usual TPU runtime — accelerators are managed by long-lived processes instead of being allocated directly to users. A single end user can then connect to an arbitrary number of Pathways-controlled devices and write their program as if all the devices were attached directly to their process, even though in reality they may even span multiple data centers. Thanks to Pathways: (1) job startup time can be reduced, (2) it is easier to achieve fault tolerance, and (3) multitenancy becomes a viable option, enabling multiple jobs to be executed simultaneously for even more efficient hardware utilization. The ease with which Pathways enables computation spanning multiple TPU pods is crucial, as it lets us avoid future scaling bottlenecks.

Pathways overview. Top Left: Distributed computation expressed as a Directed Acyclic Graph. Top Right: The resource manager allocates virtual slices of accelerator meshes for each compiled function (e.g., A, B, and C). Bottom: Centralized schedulers gang-schedule computations that are then dispatched by per-shard executors. (See paper for details.)

Another notable release is TensorStore, a new library for multi-dimensional array storage. TensorStore is particularly useful for training large language models (LLMs) with multi-controller runtimes, where every process only manages a subset of all parameters, all of which must be collated into a consistent checkpoint. TensorStore provides database-grade guarantees (ACID) for efficient and concurrent multi-dimensional array serialization into many storage backends (e.g., Google Cloud Storage, various filesystems, HTTP servers) and has been successfully used for compute-intensive workloads such as PaLM and reconstructions of the human cortex and fruit fly brain.
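As a small illustration of the Python API (the path, array shape, and values below are made up), a process can open a Zarr-backed TensorStore and read or write just the slices it owns:

    import tensorstore as ts

    # Create (or open) a Zarr array on the local filesystem; Google Cloud
    # Storage and other backends are selected via a different 'kvstore'.
    store = ts.open({
        'driver': 'zarr',
        'kvstore': {'driver': 'file', 'path': '/tmp/params'},
        'metadata': {'shape': [1024, 1024], 'dtype': '<f4'},
        'create': True,
        'delete_existing': True,
    }).result()

    # Each process writes only the shard of parameters it manages.
    store[0:512, :].write(1.0).result()
    print(store[0:2, 0:3].read().result())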

A fly brain reconstruction for which the underlying data can be easily accessed and manipulated using TensorStore.


Programming languages for ML

The robustness and correctness of our technical infrastructure are vital for ML efforts, which is why we remain committed to ensuring that it is built on a sound technical and theoretical basis, backed by cutting-edge research in programming languages and compiler construction.

We continued investing in the open-source MLIR compiler infrastructure, building a more controllable, composable and modular compiler stack. In addition, much progress has been made in code generation for sparse linear algebra and it is now possible to generate both dense and sparse code from almost identical MLIR programs. Finally, we also continued the development of the IREE compiler, preparing it for use on both powerful computers located in data centers and mobile devices such as smartphones.

On the more theoretical side, we explored ways to formalize and verify the code-generation techniques we use. We also published a novel approach used to implement and formalize automatic differentiation (AD) systems, which are central to ML libraries. We decomposed the reverse-mode AD algorithm into three independent program transformations, which are significantly simpler and easier to verify, highlighting the unique features of JAX’s implementation.
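The formal decomposition lives in the paper, but its user-facing surface in JAX is the pairing of forward-mode Jacobian-vector products (JVPs) with reverse-mode vector-Jacobian products (VJPs), where reverse mode is obtained by linearizing the function and transposing the linear part. A tiny sketch:

    import jax

    def f(x):
        return x ** 3 + 2.0 * x

    # Forward mode: a Jacobian-vector product (one building block).
    y, tangent = jax.jvp(f, (2.0,), (1.0,))  # tangent == 3*x**2 + 2 == 14.0

    # Reverse mode: linearize the function, then transpose the linear part.
    y, f_vjp = jax.vjp(f, 2.0)
    (grad,) = f_vjp(1.0)

    assert grad == tangent == jax.grad(f)(2.0) == 14.0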

Leveraging programming language techniques, such as abstract interpretation and program synthesis, we successfully reduced the number of resources required to perform a neural architecture search (NAS). This effort, 𝛼NAS, led to the discovery of more efficient models without degradation in accuracy.

In the past year, we published a number of new open-source libraries in the JAX ecosystem, Rax and T5X being just two examples. With the continued effort around jax2tf, JAX models can now be deployed on mobile devices using TensorFlow Lite and on the web using TensorFlow.js.
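For example, a JAX function can be staged into a TensorFlow SavedModel via jax2tf and handed to the TensorFlow Lite converter; the function, shapes, and paths below are illustrative:

    import jax.numpy as jnp
    import tensorflow as tf
    from jax.experimental import jax2tf

    def predict(x):
        return jnp.tanh(x @ jnp.ones((4, 2)))

    # Wrap the JAX function as a traced TensorFlow function and export it.
    module = tf.Module()
    module.predict = tf.function(
        jax2tf.convert(predict),
        autograph=False,
        input_signature=[tf.TensorSpec([1, 4], tf.float32)])
    tf.saved_model.save(
        module, '/tmp/jax_saved_model',
        signatures={'serving_default': module.predict.get_concrete_function()})

    # The SavedModel can then feed the TFLite (or TensorFlow.js) converter.
    converter = tf.lite.TFLiteConverter.from_saved_model('/tmp/jax_saved_model')
    tflite_model = converter.convert()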


Hardware accelerators & ML

Hardware design for ML

The use of customized hardware, such as TPUs and GPUs, has shown tremendous benefits in terms of both performance gain and energy efficiency (hence reducing the carbon footprint). In a recent MLPerf competition, we set new performance records on five benchmarks using TPU v4, achieving speedups averaging 1.42x over the next-fastest submission. However, in order to keep up with recent advances, we are also developing customized hardware architectures for specific popular models.

TPUs demonstrated significant speedup in all five published benchmarks (MLPerf 2.0) over the fastest non-Google submission (NVIDIA on-premises). Taller bars are better. The numbers inside the bars represent the quantity of chips / accelerators used for each of the submissions.

However, building a new hardware accelerator incurs high initial cost and requires significant development and deployment time. To make single-workload accelerators viable, the design cycle time has to be reduced. The Full-stack Accelerator Search Technique (FAST) addresses this problem by introducing a hardware accelerator search framework that simultaneously optimizes the data path, scheduling, and important compiler decisions. FAST introduces an approximate template capable of describing diverse types of architectures and a versatile memory hierarchy, resulting in accelerators that improve single-workload performance per Thermal Design Power (known to correlate highly with performance per Total Cost of Ownership) by 3.7x compared to TPU v3. This shows that single-workload accelerators could be practical for moderate-sized datacenter deployments.

ML for hardware design

To automate the chip design process as much as possible, we continue to push the capabilities of ML at various stages of the hardware design, including high-level architectural exploration, verification, and placement and routing.

We recently open-sourced a distributed RL infrastructure called Circuit Training, along with a circuit environment described in our recent Nature paper. We used this infrastructure in production to produce macro placements for the latest generation of TPU chips. Tackling architectural exploration, PRIME introduces an ML-based approach for searching hardware design space that utilizes only existing data (e.g., from traditional accelerator design efforts) without any further hardware simulation. This approach alleviates the need to run time-consuming simulations, even when the set of target applications changes. PRIME improves performance over state-of-the-art simulation-driven methods by about 1.2x–1.5x while reducing the simulation time by 93%–99%. AutoApprox automatically generates approximate low-power deep learning accelerators without any accuracy loss by mapping each neural network layer to an appropriate approximation level.

PRIME uses logged accelerator data, consisting of both feasible and infeasible accelerators, to train a conservative model, which is used to design accelerators while meeting design constraints. PRIME designs accelerators with up to 1.5x smaller latency, while reducing the required hardware simulation time by up to 99%.

Hardware-dependent model design

While NAS has shown tremendous capability in discovering state-of-the-art models in terms of accuracy and efficiency, it is still limited by a lack of hardware knowledge. Platform-aware NAS addresses this gap by incorporating knowledge of the hardware architecture into the design of the NAS search space. The resulting EfficientNet-X model is 1.5x–2x faster than EfficientNet on TPU v3 and GPU v100, respectively, with similar accuracy. Both platform-aware NAS and EfficientNet-X have been deployed in production, demonstrating significant accuracy gains and up to ~40% efficiency improvement for various production vision models. NaaS goes even further by searching for neural network architectures and hardware architectures together. Using this approach on Edge TPUs, NaaS discovers vision models that are 2x more energy efficient with the same accuracy.

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.


ML for navigating constrained search spaces

Apart from changing the hardware and the workload for better efficiency, we can also optimize the middle layer, including the partitioner, which maps the workload onto multiple devices, and the compiler, which translates the workload into a low-level representation understood by the hardware. In previous years, we demonstrated how we can apply ML to find better device placement and compiler decisions. In the past year, we further explored this direction and found that many optimization search spaces are heavily constrained, where valid solutions are quite sparse.

To address this challenge, we developed several techniques to enable a learned model to effectively navigate a constrained search space. Telamalloc employs a combination of an ML model and heuristics to make a decision when multiple options are available, and leverages a constraint solver to infer further dependent decisions. Telamalloc speeds up the memory allocation pass in the Edge TPU compiler compared to a production Integer Linear Programming approach and enables important real-world models that could not otherwise be supported.

“A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules” proposes a slightly different approach. It applies reinforcement learning (RL) to propose the decisions in a single step, and asks the constraint solver to adjust the proposed solution to be valid. For a BERT model on an Edge TPU-based multi-chip mesh, this approach discovers a better distribution of the model across devices using a much smaller time budget compared to non-learned search strategies.


ML for large-scale production systems

We also deployed ML to improve the efficiency of various large-scale systems running in production. We recently released MLGO, the first industrial-grade general framework for integrating ML techniques systematically into the LLVM infrastructure. MLGO can replace heuristics in LLVM with an RL policy to make optimization decisions. When testing on a set of internal large-scale applications, we found that the trained policy can reduce binary size by 3%–7% when optimizing inlining decisions and can improve throughput by 0.3%–1.5% when optimizing register allocation decisions. Within our production ML compiler, XLA, a learned cost model published a few years back was recently deployed to guide the selection of optimal tile sizes of TPU kernels for top ML workloads, saving ~2% of the total TPU compute time in our data centers overall. We also recently replaced an existing heuristic in YouTube’s cache replacement algorithm with a new hybrid algorithm that combines a simple heuristic with a learned model, improving the byte miss ratio at peak by ~9%.

Illustration of MLGO during inlining. “#bbs”, “#users”, and “callsite height” are example caller-callee pair features.


AI & sustainability

Given the global climate change crisis, there has been understandable concern about the environmental impact of ML. In a recent paper, we showed that by following best practices, ML practitioners can reduce carbon dioxide equivalent emissions (CO2e) from training by orders of magnitude. We call these practices the “4Ms”:

  1. Model. The first step is to select the most efficient ML model architecture. For example, Primer runs ~4x faster on the same hardware while achieving the same quality scores as the popular Transformer developed four years earlier.
  2. Machine. The second practice is to use the most energy-efficient computer available. For example, when the Transformer model was first published in 2017, a popular GPU was the Nvidia P100. Using a recent processor optimized for ML training, such as TPU v4, improves performance per Watt by ~15x.
  3. Mechanization. Computers for training need to be housed in a data center. Large cloud data centers are typically ~1.4x more energy-efficient than the typical smaller on-premise data center.
  4. Map. The biggest surprise in our investigation was the impact that picking a location with a cleaner energy supply has on emissions. Moreover, in the cloud, location is the easiest of the four factors to change. The difference between a typical location and a well-chosen location can be ~9x, even within the same country.

In this example, multiplying the 4Ms together (4x × 15x × 1.4x × 9x) yields a ~750x reduction in CO2e over four years, compared to training the original Transformer model on the GPUs of 2017.

We are continuing to explore this space and in 2023 we will be releasing a further study that demonstrates how to reduce the CO2e of current model training by up to 20x by carefully selecting the machine, mechanization and location of training.


Concluding thoughts

As the field of ML advances, we continue our investment in developing high-performance, energy-efficient, and easy-to-use systems and infrastructure to enable rapid exploration of new ideas. At the same time, we continue to explore the capability of ML to improve the performance of complex systems and automate labor-intensive tasks in system design.

Google Research, 2022 & beyond

This was the third blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed below:

  • Language Models
  • Computer Vision
  • Multimodal Models
  • Generative Models
  • Responsible AI
  • ML & Computer Systems
  • Algorithms*
  • Robotics
  • Health
  • General Science & Quantum
  • Community Engagement

* Articles will be linked as they are released.

Read More

Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization

Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization

Google Vizier is the de facto system for blackbox optimization over objective functions and hyperparameters across Google, having serviced some of Google’s largest research efforts and optimized a wide range of products (e.g., Search, Ads, YouTube). For research, it has not only reduced language model latency for users, designed computer architectures, accelerated hardware, assisted protein discovery, and enhanced robotics, but also provided a reliable backend interface for users to search for neural architectures and evolve reinforcement learning algorithms. To operate at the scale of optimizing thousands of users’ critical systems and tuning millions of machine learning models, Google Vizier solved key design challenges in supporting diverse use cases and workflows, while remaining strongly fault-tolerant.

Today we are excited to announce Open Source (OSS) Vizier (with an accompanying systems whitepaper published at AutoML Conference 2022), a standalone Python package based on Google Vizier. OSS Vizier is designed for two main purposes: (1) managing and optimizing experiments at scale in a reliable and distributed manner for users, and (2) developing and benchmarking algorithms for automated machine learning (AutoML) researchers.

System design

OSS Vizier works by having a server provide services, namely the optimization of blackbox objective functions, to multiple clients. In the main workflow, a client sends a remote procedure call (RPC) asking for a suggestion (i.e., a proposed input for the client’s blackbox function); in response, the service spawns a worker that runs an algorithm (i.e., a Pythia policy) to compute the next suggestions. The clients then evaluate the suggestions to produce the corresponding objective values and measurements, which are sent back to the service. This pipeline is repeated many times to form an entire tuning trajectory.
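In code, that client loop looks roughly like the following sketch, based on the OSS Vizier Python client API (the objective function here is a stand-in for a real blackbox):

    from vizier.service import clients
    from vizier.service import pyvizier as vz

    def evaluate(w: float, x: int) -> float:
        return -(w - 2.5) ** 2 + x  # Stand-in for a real blackbox objective.

    # Define the search space, metric, and algorithm.
    study_config = vz.StudyConfig(algorithm='GAUSSIAN_PROCESS_BANDIT')
    root = study_config.search_space.root
    root.add_float_param('w', 0.0, 5.0)
    root.add_int_param('x', -2, 2)
    study_config.metric_information.append(
        vz.MetricInformation('objective', goal=vz.ObjectiveMetricGoal.MAXIMIZE))

    # The client repeatedly asks for suggestions and reports measurements back.
    study = clients.Study.from_study_config(
        study_config, owner='demo', study_id='example')
    for _ in range(10):
        for suggestion in study.suggest(count=1):
            params = suggestion.parameters
            objective = evaluate(params['w'], params['x'])
            suggestion.complete(vz.Measurement(metrics={'objective': objective}))

Because all state lives on the service side, several such clients can run this loop concurrently against the same study.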

The use of the ubiquitous gRPC library, which is compatible with most programming languages, such as C++ and Rust, allows maximum flexibility and customization, where the user can also write their own custom clients and even algorithms outside of the default Python interface. Since the entire process is saved to an SQL datastore, a smooth recovery is ensured after a crash, and usage patterns can be stored as valuable datasets for research into meta-learning and multitask transfer-learning methods such as the OptFormer and HyperBO.

In the distributed pipeline, multiple clients each send a “Suggest” request to the Service API, which produces Suggestions for the clients using Pythia. The clients evaluate these suggestions and return measurements. All transactions are stored to allow fault-tolerance.

Usage

Because OSS Vizier is designed as a service, with clients able to send requests to the server at any point in time, it supports a broad range of scenarios: the budget of evaluations, or trials, can range from tens to millions, and the evaluation latency can range from seconds to weeks. Evaluations can be done asynchronously (e.g., tuning an ML model) or in synchronous batches (e.g., wet lab settings involving multiple simultaneous experiments). Furthermore, evaluations may fail due to transient errors and be retried, or may fail due to persistent errors (e.g., the evaluation is impossible) and should not be retried.

This flexibility supports a wide variety of applications, from hyperparameter tuning of deep learning models to optimizing non-computational objectives, which can be physical, chemical, biological, mechanical, or even human-evaluated, such as cookie recipes.

The OSS Vizier API allows (1) developers to integrate other packages, with PyGlove and Vertex Vizier already included, and (2) users to optimize their experiments, such as machine learning pipelines and cookie recipes.

Integrations, algorithms, and benchmarks

Just as Google Vizier is heavily integrated with many of Google’s internal frameworks and products, OSS Vizier will be deeply integrated with many of Google’s open-source and external frameworks. Most prominently, OSS Vizier will serve as a distributed backend for PyGlove to allow large-scale evolutionary searches over combinatorial primitives such as neural architectures and reinforcement learning algorithms. Furthermore, OSS Vizier shares the same client-based API with Vertex Vizier, allowing users to quickly swap between open-source and production-quality services.

For AutoML researchers, OSS Vizier is also outfitted with a useful collection of algorithms and benchmarks (i.e., objective functions) unified under common APIs for assessing the strengths and weaknesses of proposed methods. Most notably, via TensorFlow Probability, researchers can now use the JAX-based Gaussian Process Bandit algorithm, based on the default algorithm in Google Vizier that tunes internal users’ objectives.

Resources and future direction

We provide links to the codebase, documentation, and systems whitepaper. We plan to allow user contributions, especially in the form of algorithms and benchmarks, and further integrate with the open-source AutoML ecosystem. Going forward, we hope to see OSS Vizier as a core tool for expanding research and development over blackbox optimization and hyperparameter tuning.

Acknowledgements

OSS Vizier was developed by members of the Google Vizier team in collaboration with the TensorFlow Probability team: Setareh Ariafar, Lior Belenki, Emily Fertig, Daniel Golovin, Tzu-Kuo Huang, Greg Kochanski, Chansoo Lee, Sagi Perel, Adrian Reyes, Xingyou (Richard) Song, and Richard Zhang.

In addition, we thank Srinivas Vasudevan, Jacob Burnim, Brian Patton, Ben Lee, Christopher Suter, and Rif A. Saurous for further TensorFlow Probability integrations, Daiyi Peng and Yifeng Lu for PyGlove integrations, Hao Li for Vertex/Cloud integrations, Yingjie Miao for AutoRL integrations, Tom Hennigan, Varun Godbole, Pavel Sountsov, Alexey Volkov, Mihir Paradkar, Richard Belleville, Bu Su Kim, Vytenis Sakenas, Yujin Tang, Yingtao Tian, and Yutian Chen for open source and infrastructure help, and George Dahl, Aleksandra Faust, Claire Cui, and Zoubin Ghahramani for discussions.

Finally we thank Tom Small for designing the animation for this post.

Read More

Three Cheers: GFN Thursday Celebrates Third Anniversary With 25 New Games

Three Cheers: GFN Thursday Celebrates Third Anniversary With 25 New Games

Cheers to another year of cloud gaming! GeForce NOW celebrates its third anniversary with a look at how far cloud gaming has come, a community celebration and 25 new games supported in February.

Members can celebrate all month long, starting with a sweet Dying Light 2 reward and support for nine more games this week, including Deliver Us Mars with RTX ON.

It’s Sweet to Be Three

Three years ago, GeForce NOW launched out of a beta period to let anyone sign up to experience PC gaming from the cloud. Since then, members have streamed more than 700 million hours from the cloud, bringing home the victory on devices that could never stand up to the action on their own.

Gamers have experienced the unparalleled cinematic quality of RTX ON, with more than 50 titles taking advantage of real-time ray tracing and NVIDIA DLSS. And with 1,500+ games supported in the GeForce NOW library, the action never has to stop.

GeForce NOW Third Anniversary
Three years of cloud gaming by the numbers.

Members across the globe have the gaming power they need, with GeForce NOW technology in more than 30 global data centers, including in regions powered by GeForce NOW Alliance partners in Japan, Turkey and Latin America.

The performance available to members has expanded in the past three years, too. Members could initially stream gameplay at up to 1080p and 60 frames per second with the Priority membership. In 2021, an upgrade to RTX 3080-class performance at up to 4K 60 fps became available.

Now, the new Ultimate membership unlocks unrivaled performance at up to 4K 120 fps streaming, or up to 1080p 240 fps in NVIDIA Reflex-supported games on rigs powered by GeForce RTX 4080 GPUs.

Ultimate members can stream at ultrawide resolutions — a first for cloud gaming. And with the NVIDIA Ada Lovelace architecture in the cloud, members can experience full ray tracing and NVIDIA DLSS 3 in supported games across their devices for a truly cinematic experience.

GeForce NOW Ultimate membership
The Ultimate membership brings RTX 4080 performance to the cloud.

As the cloud gaming service has evolved, so have the devices members can use to keep the gaming going. GeForce NOW runs on PC and macOS from the native app, or on iOS Safari and Android for on-the-go gaming. Members can also choose to stream from Chromebooks and the Chrome browser.

Last year brought touch controls for Fortnite, Genshin Impact and more titles, removing the need to carry a gamepad everywhere. And with support for handheld devices like the Logitech G Cloud and Razer Edge 5G, plus support for the latest smart TVs from LG and Samsung, nearly any screen can become a PC-gaming battlestation.

New Ways to Play GeForce NOW
More ways to play from the cloud.

None of this would be possible without the GeForce NOW community and its more than 25 million members. Their feedback and passion are invaluable, and they believe in the future of gaming powered by the cloud: a future in which everyone’s a PC gamer, even if they’re not on a PC.

Celebrate all month on Twitter and Facebook by sharing the best ways to play from the cloud using #3YearsOfGFN for a chance to be featured in a GeForce NOW highlight reel.

It’s been a packed three years, and we’re just getting started. Cheers to all of the cloud gamers, and here’s to the future of GeForce NOW!

Rewards to Light Up Your Day

The anniversary celebration wouldn’t be complete without giving back to the community. Starting next week, GeForce NOW members can score free Dying Light 2 rewards: a new outfit dubbed “Post-Apo,” complete with a Rough Duster, Bleak Pants, Well-Worn Boots, Tattered Leather Gauntlets, Dystopian Mask and Spiked Bracers to scavenge around and parkour in.

Dying Light 2 Reward on GeForce NOW
Survive in this post-apocalyptic wasteland with a new outfit, paraglider and weapon.

Gamers who upgrade to Ultimate and Priority memberships get additional rewards, including the Patchy Paraglider and Scrap Slicer weapon.

Claim this reward beginning Thursday, Feb. 9. Make sure to visit the GeForce NOW Rewards portal and update your settings to start receiving special offers and in-game goodies. Better hurry — these rewards are available for a limited time on a first-come, first-served basis.

The February Lineup

Deliver Us Mars on GeForce NOW
Blast off!

Take a leap to another planet with Deliver Us Mars from Frontier Foundry. In this atmospheric sci-fi adventure, members set out to recover the lost ARK colony ships on Mars, traversing the hazardous environments of the red planet with out-of-this-world, ray-traced shadows and lighting that make the dangerous terrain feel strikingly realistic.

If space isn’t your jam, check out the list of nine games that will be available to play this week:

February also brings support for 18 more games:

  • Dark and Darker playtest (Available on Steam, Feb. 6-13)
  • Labyrinth of Galleria: The Moon Society (New release on Steam, Feb. 14)
  • Wanted: Dead (New release on Steam and Epic Games, Feb. 14)
  • Elderand (New release on Steam, Feb. 16)
  • Wild West Dynasty (New release on Steam, Feb. 16)
  • The Settlers: New Allies (New release on Ubisoft Store, Feb. 17)
  • Atomic Heart (New release on Steam, Feb. 20)
  • Chef Life — A Restaurant Simulator (New release on Steam, Feb. 23)
  • Blood Bowl 3 (New release on Steam and Epic Games Store, Feb. 23)
  • Scars Above (New release on Steam, Feb. 28)
  • Heads Will Roll: Reforged (Steam)
  • Above Snakes (Steam)
  • Across the Obelisk (Steam)
  • Captain of Industry (Steam)
  • Cartel Tycoon (Steam and Epic Games Store)
  • Ember Knights (Steam)
  • Inside the Backrooms (Steam)
  • SimRail — The Railway Simulator (Steam)

January’s a Wrap

January came to a close with eight extra games on top of the 19 announced. It’s like finding an extra french fry in the bag:

Two games announced last month, Occupy Mars: The Game (Steam) and Grimstar: Crystals are the New Oil! (Steam), didn’t make it due to shifts in their release dates.

What will you play first this weekend? Let us know in the comments below or on Twitter and Facebook, and make sure to check out #3YearsOfGFN!

Read More