Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center
This blog post is co-written with Bruno Mateus, Jonathan Diedrich, and Crispim Tribuna at Talkdesk.
Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots.
This is the first of a two-part series dedicated to the integration of Amazon Lex with the Talkdesk CX Cloud contact center. In this post, we describe a solution architecture that combines the powerful resources of Amazon Lex and Talkdesk CX Cloud for the voice channel. In the second part of this series, we describe how to use the Amazon Lex chatbot UI with Talkdesk CX Cloud to allow customers to transition from a chatbot conversation to a live agent within the same chat window.
The benefits of Amazon Lex and Talkdesk CX Cloud are exemplified by WaFd Bank, a full-service commercial US bank with 200 locations and $20 billion in assets under management. The bank has invested in a digital transformation of its contact center to provide exceptional service to its clients. WaFd has pioneered an omnichannel banking experience that combines the advanced conversational AI capabilities of Amazon Lex voice and chat bots with Talkdesk Financial Services Experience Cloud for Banking.
“We wanted to combine the power of Amazon Lex’s conversational AI capabilities with the Talkdesk modern, unified contact center solution. This gives us the best of both worlds, enabling WaFd to serve its clients in the best way possible.”
-Dustin Hubbard, Chief Technology Officer at WaFd Bank.
To support WaFd’s vision, Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. Additionally, the combination of Talkdesk Identity voice authentication with an Amazon Lex voicebot allows WaFd clients to resolve common banking transactions on their own. Tasks like account balance lookups are completed in seconds, a 90% reduction in time compared to WaFd’s legacy system. The newly designed Amazon Lex website chatbot has led to a substantial decrease in voicemail volume as its chatbot UI seamlessly integrates with Talkdesk systems.
In the following sections, we provide an overview of the components that make this integration possible. We then present the solution architecture, highlight its main components, and describe the customer journey from interacting with Amazon Lex to escalating to an agent. We end by explaining how contact centers can keep AI models up to date using Talkdesk AI Trainer.
Solution overview
The solution consists of the following key components:
- Amazon Lex – Amazon Lex combines with Amazon Polly to automate customer service interactions by adding conversational AI capabilities to your contact center. Amazon Lex delivers fast responses to customers’ most common questions and seamlessly hands over complex cases to a human agent. Augmenting your contact center operations with Amazon Lex bots provides an enhanced customer experience and helps you build an omnichannel experience, allowing customers to engage across phone lines, websites, and messaging platforms.
- Talkdesk CX Cloud contact center – Talkdesk, Inc. is a global cloud contact center leader for customer-obsessed companies. Talkdesk CX Cloud offers enterprise scale with consumer simplicity to deliver speed, agility, reliability, and security. As an AWS Partner, Talkdesk is using AI capabilities like Amazon Transcribe, a speech-to-text service, with the Talkdesk Agent Assist and Talkdesk Customer Experience Analytics products across a number of languages and accents. Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. These virtual agents can automate routine tasks as well as seamlessly elevate complex interactions to a live agent.
- Authentication and voice biometrics with Talkdesk Identity – Talkdesk Identity provides fraud protection through self-service authentication using voice biometrics. Voice biometrics solutions provide contact centers with improved levels of security while streamlining the authentication process for the customer. This secure and efficient authentication experience allows contact centers to handle a wide range of self-service functionalities. For example, customers can check their balance, schedule a funds transfer, or activate/deactivate a card using a banking bot.
The following diagram illustrates our solution architecture.
The voice authentication call flow implemented in Talkdesk interacts with Amazon Lex as follows:
- When a phone call is initiated, a customer lookup is performed using the incoming caller’s phone number. If multiple customers are retrieved, further information, like date of birth, is requested in order to narrow down the list to a unique customer record.
- If the caller is identified and has previously enrolled in voice biometrics, the caller will be prompted to say their voice pass code. If successful, the caller is offered an authenticated Amazon Lex experience.
- If a caller is identified and not enrolled in voice biometrics, they can work with an agent to verify their identity and record their voice print as the password. For more information, visit the Talkdesk Voice Biometric documentation.
- If the caller is not identified or not enrolled in voice biometrics, the caller can interact with Amazon Lex to perform tasks that don’t require authentication, or they can request a transfer to an agent.
How Talkdesk integrates with Amazon Lex
When the call reaches Talkdesk Virtual Agent, Talkdesk uses the continuous streaming capability of the Amazon Lex API to enable conversation with the Amazon Lex bot. Talkdesk Virtual Agent has an Amazon Lex adapter that initiates an HTTP/2 bidirectional event stream through the StartConversation API operation. Talkdesk Virtual Agent and the Amazon Lex bot start exchanging information in real time following the sequence of events for an audio conversation. For more information, refer to Starting a stream to a bot.
All the context data from Talkdesk Studio is sent to Amazon Lex through session attributes established on the initial ConfigurationEvent. The Amazon Lex voicebot has been equipped with a welcome intent, which is invoked by Talkdesk to initiate the conversation and play a welcome message. In Amazon Lex, a session attribute is set to ensure the welcome intent and its message are used only once in any conversation. The greeting message can be customized to include the name of the authenticated caller, if provided from the Talkdesk system in session attributes.
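To make this concrete, the following is a minimal sketch of the shape such a configuration event can take. The field names follow the Amazon Lex V2 streaming API, but the specific session attribute keys (callerName, authenticated, welcomePlayed) are illustrative assumptions rather than the exact keys exchanged between Talkdesk and Amazon Lex.

```python
# A minimal sketch of the configuration event sent at the start of a streaming
# conversation. Session attribute names are illustrative assumptions; substitute
# the keys your Talkdesk Studio flow actually provides.
configuration_event = {
    "responseContentType": "audio/pcm",
    "sessionState": {
        "sessionAttributes": {
            "callerName": "Jane",      # used to personalize the welcome message
            "authenticated": "true",   # set after Talkdesk Identity voice biometrics
            "welcomePlayed": "false",  # flipped to "true" once the welcome intent runs
        }
    },
    "welcomeMessages": [
        {
            "contentType": "PlainText",
            "content": "Welcome to the bank. How can I help you today?",
        }
    ],
}
```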
The following diagram shows the basic components and events used to enable communications.
Agent escalation from Amazon Lex
If a customer requests agent assistance, all necessary information to ensure the customer is routed to the correct agent is made available by Amazon Lex to Talkdesk Studio through session attributes.
Examples of session attributes include the following (a short illustrative sketch follows the list):
- A flag to indicate the customer requests agent assistance
- The reason for the escalation, used by Talkdesk to route the call appropriately
- Additional data regarding the call to provide the agent with contextual information about the customer and their earlier interaction with the bot
- The sentiment of the interaction
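For example, a fulfillment AWS Lambda function attached to the Amazon Lex bot can return these values as session attributes when the caller asks for an agent. The following is a minimal sketch using the Lex V2 Lambda response format; the attribute names (escalateToAgent, escalationReason, conversationSentiment) are illustrative assumptions, not the exact keys used in this integration.

```python
def close_with_escalation(intent_name, reason, sentiment):
    """Return a Lex V2 Lambda response that closes the intent and passes
    escalation context to the contact center via session attributes.
    Attribute names are illustrative, not the exact keys Talkdesk expects."""
    return {
        "sessionState": {
            "sessionAttributes": {
                "escalateToAgent": "true",
                "escalationReason": reason,          # used by Talkdesk for routing
                "conversationSentiment": sentiment,  # for example, POSITIVE or NEGATIVE
            },
            "dialogAction": {"type": "Close"},
            "intent": {"name": intent_name, "state": "Fulfilled"},
        },
        "messages": [
            {"contentType": "PlainText", "content": "Let me connect you with an agent."}
        ],
    }
```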
Training
Talkdesk AI Trainer is a human-in-the-loop tool that is included in the operational flow of Talkdesk CX Cloud. It performs the continuous training and improvement of AI models by real agents without the need for specialized data science teams.
Talkdesk developed a connector that allows AI Trainer to automatically collect intent data from Amazon Lex intent models. Non-technical users can easily fine-tune these models to support Talkdesk AI products such as Talkdesk Virtual Agent. The connector was built using the Amazon Lex Model Building API with the AWS SDK for Java 2.x (a Python sketch of the equivalent data collection follows the list below).
It is possible to train intent data from Amazon Lex using real-world conversations between customers and (virtual) agents by:
- Requesting feedback of intent classifications with a low confidence level
- Adding new training phrases to intents
- Adding synonyms or regular expressions to slot types
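The following is a minimal Python sketch of that data collection under stated assumptions: it uses the Amazon Lex V2 model building API through boto3 rather than the Java SDK the connector is built on, and the bot ID, version, and locale are placeholders.

```python
import boto3

# Pull intents (with their training phrases) and slot types from a Lex V2 bot.
# Bot ID, version, and locale are placeholders.
lex_models = boto3.client("lexv2-models")
BOT_ID, BOT_VERSION, LOCALE_ID = "YOUR_BOT_ID", "DRAFT", "en_US"

intents = lex_models.list_intents(
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE_ID
)["intentSummaries"]

for intent in intents:
    detail = lex_models.describe_intent(
        intentId=intent["intentId"],
        botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE_ID,
    )
    # Training phrases that AI Trainer can review, extend, and publish back
    utterances = [u["utterance"] for u in detail.get("sampleUtterances", [])]
    print(intent["intentName"], utterances)

slot_types = lex_models.list_slot_types(
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE_ID
)["slotTypeSummaries"]
```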
AI Trainer receives data from Amazon Lex, namely intents and slot types. This data is then displayed and managed on Talkdesk AI Trainer, along with all the events that are part of the conversational orchestration taking place in Talkdesk Virtual Agent. Through the AI Trainer quality system or agreement, supervisors or administrators decide which improvements will be introduced in the Amazon Lex model and reflected in Talkdesk Virtual Agent.
Adjustments to production can be easily published on AI Trainer and sent to Amazon Lex. Continuously training AI models ensures that AI products reflect the evolution of the business and the latest needs of customers. This in turn helps increase the automation rate through self-service and resolve cases faster, resulting in higher customer satisfaction.
Conclusion
In this post, we presented how the power of Amazon Lex conversational AI capabilities can be combined with the Talkdesk modern, unified contact center solution through the Amazon Lex API. We explained how Talkdesk voice biometrics offers the caller a self-service authenticated experience and how Amazon Lex provides contextual information to the agent to assist the caller more efficiently.
We are excited about the new possibilities that the integration of Amazon Lex and Talkdesk CX Cloud solutions offers to our clients. We at AWS Professional Services and Talkdesk are available to help you and your team implement your vision of an omnichannel experience.
The next post in this series will provide guidance on how to integrate an Amazon Lex chatbot with Talkdesk Studio, and how to enable customers to interact with a live agent from the chatbot.
About the authors
Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and spending time with family.
Cecil Patterson is a Natural Language AI consultant with AWS Professional Services based in North Texas. He has many years of experience working with large enterprises to enable and support global infrastructure solutions. Cecil uses his experience and diverse skill set to build exceptional conversational solutions for customers of all types.
Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specializes in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.
Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with his family.
Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.
Image classification model selection using Amazon SageMaker JumpStart
Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may ask questions like: “What model should I fine-tune and then deploy to achieve the best performance on my dataset?” And an ML researcher may ask questions like: “How can I generate my own fair comparison of multiple model architectures against a specified dataset while controlling training hyperparameters and computer specifications, such as GPUs, CPUs, and RAM?” The former question addresses model selection across model architectures, while the latter question concerns benchmarking trained models against a test dataset.
In this post, you will see how the TensorFlow image classification algorithm of Amazon SageMaker JumpStart can simplify the implementations required to address these questions. Together with the implementation details in a corresponding example Jupyter notebook, you will have tools available to perform model selection by exploring Pareto frontiers, where improving one performance metric, such as accuracy, is not possible without worsening another metric, such as throughput.
Solution overview
The following figure illustrates the model selection trade-off for a large number of image classification models fine-tuned on the Caltech-256 dataset, which is a challenging set of 30,607 real-world images spanning 256 object categories. Each point represents a single model, point sizes are scaled with respect to the number of parameters comprising the model, and the points are color-coded based on their model architecture. For example, the light green points represent the EfficientNet architecture; each light green point is a different configuration of this architecture with unique fine-tuned model performance measurements. The figure shows the existence of a Pareto frontier for model selection, where higher accuracy is exchanged for lower throughput. Ultimately, the selection of a model along the Pareto frontier, or the set of Pareto efficient solutions, depends on your model deployment performance requirements.
Focusing on the test accuracy and test throughput frontier, the set of Pareto efficient solutions from the preceding figure is extracted in the following table. Rows are sorted such that test throughput is increasing and test accuracy is decreasing. A short sketch of how to extract such a frontier from your own results follows the table.
Model Name | Number of Parameters | Test Accuracy | Test Top 5 Accuracy | Throughput (images/s) | Duration per Epoch (s) |
--- | --- | --- | --- | --- | --- |
swin-large-patch4-window12-384 | 195.6M | 96.4% | 99.5% | 0.3 | 2278.6 |
swin-large-patch4-window7-224 | 195.4M | 96.1% | 99.5% | 1.1 | 698.0 |
efficientnet-v2-imagenet21k-ft1k-l | 118.1M | 95.1% | 99.2% | 4.5 | 1434.7 |
efficientnet-v2-imagenet21k-ft1k-m | 53.5M | 94.8% | 99.1% | 8.0 | 769.1 |
efficientnet-v2-imagenet21k-m | 53.5M | 93.1% | 98.5% | 8.0 | 765.1 |
efficientnet-b5 | 29.0M | 90.8% | 98.1% | 9.1 | 668.6 |
efficientnet-v2-imagenet21k-ft1k-b1 | 7.3M | 89.7% | 97.3% | 14.6 | 54.3 |
efficientnet-v2-imagenet21k-ft1k-b0 | 6.2M | 89.0% | 97.0% | 20.5 | 38.3 |
efficientnet-v2-imagenet21k-b0 | 6.2M | 87.0% | 95.6% | 21.5 | 38.2 |
mobilenet-v3-large-100-224 | 4.6M | 84.9% | 95.4% | 27.4 | 28.8 |
mobilenet-v3-large-075-224 | 3.1M | 83.3% | 95.2% | 30.3 | 26.6 |
mobilenet-v2-100-192 | 2.6M | 80.8% | 93.5% | 33.5 | 23.9 |
mobilenet-v2-100-160 | 2.6M | 80.2% | 93.2% | 40.0 | 19.6 |
mobilenet-v2-075-160 | 1.7M | 78.2% | 92.8% | 41.8 | 19.3 |
mobilenet-v2-075-128 | 1.7M | 76.1% | 91.1% | 44.3 | 18.3 |
mobilenet-v1-075-160 | 2.0M | 75.7% | 91.0% | 44.5 | 18.2 |
mobilenet-v1-100-128 | 3.5M | 75.1% | 90.7% | 47.4 | 17.4 |
mobilenet-v1-075-128 | 2.0M | 73.2% | 90.0% | 48.9 | 16.8 |
mobilenet-v2-075-96 | 1.7M | 71.9% | 88.5% | 49.4 | 16.6 |
mobilenet-v2-035-96 | 0.7M | 63.7% | 83.1% | 50.4 | 16.3 |
mobilenet-v1-025-128 | 0.3M | 59.0% | 80.7% | 50.8 | 16.2 |
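If you want to reproduce this kind of frontier from your own benchmarking results, a simple sweep is enough: sort the candidate models by throughput and keep each one only if it improves on the best accuracy seen so far. The following sketch assumes a pandas DataFrame with test_accuracy and test_throughput columns, which are names chosen for illustration rather than part of the JumpStart API.

```python
import pandas as pd

def pareto_frontier(results: pd.DataFrame,
                    acc_col: str = "test_accuracy",
                    thr_col: str = "test_throughput") -> pd.DataFrame:
    """Return the rows not dominated in (accuracy, throughput).

    A model is kept only if no other model has both higher throughput and
    higher accuracy. Column names are assumptions about your results table.
    """
    ordered = results.sort_values(thr_col, ascending=False)
    keep, best_acc = [], float("-inf")
    for idx, row in ordered.iterrows():
        # Models earlier in this loop are faster; this one survives only if it
        # is more accurate than every faster model seen so far.
        if row[acc_col] > best_acc:
            keep.append(idx)
            best_acc = row[acc_col]
    # Present the frontier as in the preceding table: throughput increasing,
    # accuracy decreasing.
    return results.loc[keep].sort_values(thr_col, ascending=True)
```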
This post provides details on how to implement large-scale Amazon SageMaker benchmarking and model selection tasks. First, we introduce JumpStart and the built-in TensorFlow image classification algorithms. We then discuss high-level implementation considerations, such as JumpStart hyperparameter configurations, metric extraction from Amazon CloudWatch Logs, and launching asynchronous hyperparameter tuning jobs. Finally, we cover the implementation environment and parameterization leading to the Pareto efficient solutions in the preceding table and figure.
Introduction to JumpStart TensorFlow image classification
JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that solve common business problems. These features remove the heavy lifting from each step of the ML process, making it easier to develop high-quality models and reducing time to deployment. The JumpStart APIs allow you to programmatically deploy and fine-tune a vast selection of pre-trained models on your own datasets.
The JumpStart model hub provides access to a large number of TensorFlow image classification models that enable transfer learning and fine-tuning on custom datasets. As of this writing, the JumpStart model hub contains 135 TensorFlow image classification models across a variety of popular model architectures from TensorFlow Hub, including residual networks (ResNet), MobileNet, EfficientNet, Inception, Neural Architecture Search Networks (NASNet), Big Transfer (BiT), shifted window (Swin) transformers, Class-Attention in Image Transformers (CaiT), and Data-Efficient Image Transformers (DeiT).
The internal structures of these model architectures vary vastly. For instance, ResNet models utilize skip connections to allow for substantially deeper networks, whereas transformer-based models use self-attention mechanisms that eliminate the intrinsic locality of convolution operations in favor of more global receptive fields. In addition to the diverse feature sets these different structures provide, each model architecture has several configurations that adjust the model size, shape, and complexity within that architecture. This results in hundreds of unique image classification models available on the JumpStart model hub. Combined with built-in transfer learning and inference scripts that encompass many SageMaker features, the JumpStart API is a great launching point for ML practitioners to get started training and deploying models quickly.
Refer to Transfer learning for TensorFlow image classification models in Amazon SageMaker and the following example notebook to learn about SageMaker TensorFlow image classification in more depth, including how to run inference on a pre-trained model as well as fine-tune the pre-trained model on a custom dataset.
Large-scale model selection considerations
Model selection is the process of selecting the best model from a set of candidate models. This process may be applied across models of the same type with different parameter weights and across models of different types. Examples of model selection across models of the same type include fitting the same model with different hyperparameters (for example, learning rate) and early stopping to prevent the overfitting of model weights to the train dataset. Model selection across models of different types includes selecting the best model architecture (for example, Swin vs. MobileNet) and selecting the best model configurations within a single model architecture (for example, mobilenet-v1-025-128 vs. mobilenet-v3-large-100-224).
The considerations outlined in this section enable all of these model selection processes on a validation dataset.
Select hyperparameter configurations
TensorFlow image classification in JumpStart has a large number of available hyperparameters that can adjust the transfer learning script behaviors uniformly for all model architectures. These hyperparameters relate to data augmentation and preprocessing, optimizer specification, overfitting controls, and trainable layer indicators. You are encouraged to adjust the default values of these hyperparameters as necessary for your application.
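As a starting point, you can retrieve the default hyperparameter dictionary for any JumpStart model ID and override only what you need. The following is a minimal sketch; the model ID is only an example, and the exact hyperparameter key names should be checked against the retrieved dictionary because they can differ across models and SDK versions.

```python
from sagemaker import hyperparameters

# Example JumpStart model ID; swap in the model you want to fine-tune.
model_id, model_version = (
    "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"
)

# Retrieve the full set of default hyperparameters and inspect them before overriding.
hp = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
print(hp)

# Settings used in this analysis (key names assumed to match the retrieved dictionary)
hp["epochs"] = "10"
hp["early_stopping"] = "True"
hp["early_stopping_patience"] = "3"
hp["train_only_on_top_layer"] = "True"
```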
For this analysis and the associated notebook, all hyperparameters are set to default values except for learning rate, number of epochs, and early stopping specification. Learning rate is adjusted as a categorical parameter by the SageMaker automatic model tuning job. Because each model has unique default hyperparameter values, the discrete list of possible learning rates includes the default learning rate as well as one-fifth the default learning rate. This launches two training jobs for a single hyperparameter tuning job, and the training job with the best reported performance on the validation dataset is selected. Because the number of epochs is set to 10, which is greater than the default hyperparameter setting, the selected best training job doesn’t always correspond to the default learning rate. Finally, an early stopping criterion is utilized with a patience, or the number of epochs to continue training with no improvement, of three epochs.
One default hyperparameter setting of particular importance is train_only_on_top_layer, where, if set to True, the model's feature extraction layers are not fine-tuned on the provided training dataset. The optimizer will only train parameters in the top fully connected classification layer with output dimensionality equal to the number of class labels in the dataset. By default, this hyperparameter is set to True, which is a setting targeted for transfer learning on small datasets. You may have a custom dataset where the feature extraction from the pre-training on the ImageNet dataset is not sufficient. In these cases, you should set train_only_on_top_layer to False. Although this setting will increase training time, you will extract more meaningful features for your problem of interest, thereby increasing accuracy.
Extract metrics from CloudWatch Logs
The JumpStart TensorFlow image classification algorithm reliably logs a variety of metrics during training that are accessible to SageMaker Estimator and HyperparameterTuner objects. The constructor of a SageMaker Estimator has a metric_definitions keyword argument, which can be used to evaluate the training job by providing a list of dictionaries with two keys: Name for the name of the metric, and Regex for the regular expression used to extract the metric from the logs. The accompanying notebook shows the implementation details. The following table lists the available metrics and associated regular expressions for all JumpStart TensorFlow image classification models.
Metric Name | Regular Expression |
--- | --- |
number of parameters | "- Number of parameters: ([0-9\.]+)" |
number of trainable parameters | "- Number of trainable parameters: ([0-9\.]+)" |
number of non-trainable parameters | "- Number of non-trainable parameters: ([0-9\.]+)" |
train dataset metric | f"- {metric}: ([0-9\.]+)" |
validation dataset metric | f"- val_{metric}: ([0-9\.]+)" |
test dataset metric | f"- Test {metric}: ([0-9\.]+)" |
train duration | "- Total training duration: ([0-9\.]+)" |
train duration per epoch | "- Average training duration per epoch: ([0-9\.]+)" |
test evaluation latency | "- Test evaluation latency: ([0-9\.]+)" |
test latency per sample | "- Average test latency per sample: ([0-9\.]+)" |
test throughput | "- Average test throughput: ([0-9\.]+)" |
The built-in transfer learning script provides a variety of train, validation, and test dataset metrics within these definitions, as represented by the f-string replacement values. The exact metrics available vary based on the type of classification being performed. All compiled models have a loss metric, which is represented by a cross-entropy loss for either a binary or categorical classification problem. The former is used when there is one class label; the latter is used if there are two or more class labels. If there is only a single class label, then the following metrics are computed, logged, and extractable via the f-string regular expressions in the preceding table: number of true positives (true_pos), number of false positives (false_pos), number of true negatives (true_neg), number of false negatives (false_neg), precision, recall, area under the receiver operating characteristic (ROC) curve (auc), and area under the precision-recall (PR) curve (prc). Similarly, if there are six or more class labels, a top-5 accuracy metric (top_5_accuracy) is also computed, logged, and extractable via the preceding regular expressions.
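In code, these definitions translate directly into the metric_definitions argument of the Estimator (or the metric definitions attached to a tuner's objective). The following is a minimal sketch using a subset of the regular expressions from the preceding table; the metric names assume a multi-class dataset, so swap in the metrics emitted for your own problem.

```python
# A subset of the metric definitions from the preceding table, expressed as the
# list of dictionaries expected by the SageMaker Estimator. Metric names assume
# a multi-class problem; see the accompanying notebook for the full list.
metric_definitions = [
    {"Name": "num_parameters", "Regex": r"- Number of parameters: ([0-9\.]+)"},
    {"Name": "val_accuracy", "Regex": r"- val_accuracy: ([0-9\.]+)"},
    {"Name": "test_accuracy", "Regex": r"- Test accuracy: ([0-9\.]+)"},
    {"Name": "test_throughput", "Regex": r"- Average test throughput: ([0-9\.]+)"},
]
# These are then passed to the Estimator constructor, for example:
# estimator = Estimator(..., metric_definitions=metric_definitions)
```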
During training, metrics specified to a SageMaker Estimator are emitted to CloudWatch Logs. When the training is complete, you can invoke the SageMaker DescribeTrainingJob API and inspect the FinalMetricDataList key in the JSON response.
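A minimal sketch of that call with boto3 follows; the training job name is a placeholder:

```python
import boto3

sm_client = boto3.client("sagemaker")

# Pull the final metric values for a completed training job (name is a placeholder).
response = sm_client.describe_training_job(TrainingJobName="your-training-job-name")
for metric in response["FinalMetricDataList"]:
    print(metric["MetricName"], metric["Value"])
```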
This API requires only the job name to be provided in the query, so metrics can be obtained in future analyses as long as the training job name is appropriately logged and recoverable. For this model selection task, hyperparameter tuning job names are stored, and subsequent analyses reattach a HyperparameterTuner object given the tuning job name, extract the best training job name from the attached hyperparameter tuner, and then invoke the DescribeTrainingJob API as described earlier to obtain metrics associated with the best training job.
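The following is a minimal sketch of that reattach-and-extract flow; the tuning job name is a placeholder.

```python
import boto3
from sagemaker.tuner import HyperparameterTuner

sm_client = boto3.client("sagemaker")

# Reattach a previously launched tuning job by name (placeholder), find its best
# training job, and fetch that job's final metrics.
tuner = HyperparameterTuner.attach("your-tuning-job-name")
best_job_name = tuner.best_training_job()
metrics = sm_client.describe_training_job(
    TrainingJobName=best_job_name
)["FinalMetricDataList"]
```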
Launch asynchronous hyperparameter tuning jobs
Refer to the corresponding notebook for implementation details on asynchronously launching hyperparameter tuning jobs, which uses the Python standard library's concurrent.futures module, a high-level interface for asynchronously running callables. Several SageMaker-related considerations are implemented in this solution:
- Each AWS account is subject to SageMaker service quotas. You should view your current limits to fully utilize your resources and potentially request resource limit increases as needed.
- Frequent API calls to create many simultaneous hyperparameter tuning jobs may exceed the API rate limits and raise throttling exceptions. One resolution is to create a SageMaker Boto3 client with a custom retry configuration.
- What happens if your script encounters an error or the script is stopped before completion? For such a large model selection or benchmarking study, you can log tuning job names and provide convenience functions to reattach hyperparameter tuning jobs that already exist:
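The following sketch illustrates the second and third considerations under stated assumptions: a custom retry configuration on the SageMaker client, and a convenience function that reattaches a tuning job by name if it already exists. The job names are placeholders for the names logged by the benchmarking script, and the actual job-launching logic lives in the notebook.

```python
import boto3
import sagemaker
from botocore.config import Config
from concurrent.futures import ThreadPoolExecutor
from sagemaker.tuner import HyperparameterTuner

# A more forgiving retry policy helps avoid throttling exceptions when many
# tuning jobs are created in quick succession.
boto_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
sagemaker_session = sagemaker.Session(
    sagemaker_client=boto3.client("sagemaker", config=boto_config)
)

def attach_if_exists(tuning_job_name: str):
    """Reattach a previously launched tuning job, or return None if it isn't found.

    Called before launching anything, so an interrupted driver script can be rerun
    without duplicating work; the launch logic itself lives in the notebook.
    """
    try:
        return HyperparameterTuner.attach(
            tuning_job_name, sagemaker_session=sagemaker_session
        )
    except Exception:
        return None

# Reattach (or later launch) many tuning jobs concurrently; names are placeholders.
tuning_job_names = ["example-tuning-job-1", "example-tuning-job-2"]
with ThreadPoolExecutor(max_workers=8) as pool:
    tuners = list(pool.map(attach_if_exists, tuning_job_names))
```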
Analysis details and discussion
The analysis in this post performs transfer learning for model IDs in the JumpStart TensorFlow image classification algorithm on the Caltech-256 dataset. All training jobs were performed on the SageMaker training instance ml.g4dn.xlarge, which contains a single NVIDIA T4 GPU.
The test dataset is evaluated on the training instance at the end of training. Model selection is performed prior to the test dataset evaluation to set model weights to the epoch with the best validation set performance. Test throughput is not optimized: the dataset batch size is set to the default training hyperparameter batch size, which isn’t adjusted to maximize GPU memory usage; reported test throughput includes data loading time because the dataset isn’t pre-cached; and distributed inference across multiple GPUs isn’t utilized. For these reasons, this throughput is a good relative measurement, but actual throughput would depend heavily on your inference endpoint deployment configurations for the trained model.
Although the JumpStart model hub contains many image classification architecture types, this Pareto frontier is dominated by select Swin, EfficientNet, and MobileNet models. Swin models are larger and relatively more accurate, whereas MobileNet models are smaller, relatively less accurate, and suitable for the resource constraints of mobile devices. It's important to note that this frontier is conditioned on a variety of factors, including the exact dataset used and the fine-tuning hyperparameters selected. You may find that your custom dataset produces a different set of Pareto efficient solutions, and you may desire longer training times with different hyperparameters, such as more data augmentation or fine-tuning more than just the top classification layer of the model.
Conclusion
In this post, we showed how to run large-scale model selection or benchmarking tasks using the JumpStart model hub. This solution can help you choose the best model for your needs. We encourage you to try out and explore this solution on your own dataset.
References
More information is available at the following resources:
- Image Classification – TensorFlow
- Run image classification with Amazon SageMaker JumpStart
- Build high performing image classification models using Amazon SageMaker JumpStart
About the authors
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS
Today, the NFL is continuing its journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their knowledge of the games within the game of football. For the 2022 season, the NFL aimed to leverage player-tracking data and new advanced analytics techniques to better understand special teams.
The goal of the project was to predict how many yards a returner would gain on a punt or kickoff play. One of the challenges when building predictive models for punt and kickoff returns is the presence of very rare events, such as touchdowns, that have significant importance in the dynamics of a game. A data distribution with fat tails is common in real-world applications, where rare events have significant impact on the overall performance of the models. Using a robust method to accurately model the distribution over extreme events is crucial for better overall performance.
In this post, we demonstrate how to use Spliced Binned-Pareto distribution implemented in GluonTS to robustly model such fat-tailed distributions.
We first describe the dataset used. Next, we present the data preprocessing and other transformation methods applied to the dataset. We then explain the details of the ML methodology and model training procedures. Finally, we present the model performance results.
Dataset
In this post, we used two datasets to build separate models for punt and kickoff returns. The player tracking data contains the player's position, direction, acceleration, and more (in x,y coordinates). There are around 3,000 and 4,000 plays from four NFL seasons (2018–2021) for punt and kickoff plays, respectively. In addition, there are very few punt and kickoff-related touchdowns in the datasets—only 0.23% and 0.8%, respectively. The data distributions for punts and kickoffs are different. For example, the true yardage distributions for kickoffs and punts are similar but shifted, as shown in the following figure.
Data preprocessing and feature engineering
First, the tracking data was filtered for just the data related to punts and kickoff returns. The player data was used to derive features for model development:
- X – Player position along the long axis of the field
- Y – Player position along the short axis of the field
- S – Speed in yards/second; replaced by Dis*10 to make it more accurate (Dis is the distance in the past 0.1 seconds)
- Dir – Angle of player motion (degrees)
From the preceding data, each play was transformed into a 10x11x14 tensor with 10 offensive players (excluding the ball carrier), 11 defenders, and 14 derived features (a small NumPy sketch of the relative features follows the list):
- sX – x speed of a player
- sY – y speed of a player
- s – Speed of a player
- aX – x acceleration of a player
- aY – y acceleration of a player
- relX – x distance of player relative to ball carrier
- relY – y distance of player relative to ball carrier
- relSx – x speed of player relative to ball carrier
- relSy – y speed of player relative to ball carrier
- relDist – Euclidean distance of player relative to ball carrier
- oppX – x distance of offense player relative to defense player
- oppY – y distance of offense player relative to defense player
- oppSx – x speed of offense player relative to defense player
- oppSy – y speed of offense player relative to defense player
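To make the relative features concrete, the following sketch computes a few of them for a single frame with NumPy. The array shapes and column order are assumptions made for illustration, not the exact preprocessing code used in the pipeline.

```python
import numpy as np

# One frame of tracking data: rows are players, columns are (x, y, sx, sy).
# The ball carrier is a single (x, y, sx, sy) vector. Shapes and column order
# are illustrative assumptions only.
players = np.random.rand(21, 4) * [120, 53.3, 10, 10]  # 10 offense + 11 defense
carrier = np.array([60.0, 26.0, 4.0, 1.0])

rel_x = players[:, 0] - carrier[0]           # relX
rel_y = players[:, 1] - carrier[1]           # relY
rel_sx = players[:, 2] - carrier[2]          # relSx
rel_sy = players[:, 3] - carrier[3]          # relSy
rel_dist = np.sqrt(rel_x**2 + rel_y**2)      # relDist (Euclidean distance)
```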
To augment the data, the X and Y position values were also mirrored to account for the right and left field positions. The data preprocessing and feature engineering were adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
ML methodology and model training
Because we’re interested in all possible outcomes from the play, including the probability of a touchdown, we can’t simply predict the average yards gained as a regression problem. We need to predict the full probability distribution of all possible yard gains, so we framed the problem as a probabilistic prediction.
One way to implement probabilistic predictions is to assign the yards gained to several bins (such as less than 0, from 0–1, from 1–2, …, from 14–15, more than 15) and predict the bin as a classification problem. The downside of this approach is that we want small bins to have a high definition picture of the distribution, but small bins mean fewer data points per bin and our distribution, especially the tails, may be poorly estimated and irregular.
Another way to implement probabilistic predictions is to model the output as a continuous probability distribution with a limited number of parameters (for example, a Gaussian or Gamma distribution) and predict the parameters. This approach gives a very high definition and regular picture of the distribution, but is too rigid to fit the true distribution of yards gained, which is multi-modal and heavy tailed.
To get the best of both methods, we use Spliced Binned-Pareto distribution (SBP), which has bins for the center of the distribution where a lot of data is available, and Generalized Pareto distribution (GPD) at both ends, where rare but important events can happen, like a touchdown. The GPD has two parameters: one for scale and one for tail heaviness, as seen in the following graph (source: Wikipedia).
By splicing the GPD with the binned distribution (see the following left graph) on both sides, we obtain the following SBP on the right. The lower and upper thresholds where splicing is done are hyperparameters.
As a baseline, we used the model that won our NFL Big Data Bowl competition on Kaggle. This model uses CNN layers to extract features from the prepared data, and predicts the outcome as a “1 yard per bin” classification problem. For our model, we kept the feature extraction layers from the baseline and only modified the last layer to output SBP parameters instead of probabilities for each bin, as shown in the following figure (image edited from the post 1st place solution The Zoo).
We used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time series modeling, but the SBP distribution is not specific to time series, and we were able to repurpose it for regression. For more information on how to use GluonTS SBP, see the following demo notebook.
Models were trained and cross-validated on the 2018, 2019, and 2020 seasons and tested on the 2021 season. To avoid leakage during cross-validation, we grouped all plays from the same game into the same fold.
For evaluation, we kept the metric used in the Kaggle competition, the continuous ranked probability score (CRPS), which can be seen as an alternative to the log-likelihood that is more robust to outliers. We also used the Pearson correlation coefficient and the RMSE as general and interpretable accuracy metrics. Furthermore, we looked at the probability of a touchdown and probability plots to evaluate calibration.
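For a binned predictive distribution, the CRPS can be computed directly from the predicted cumulative distribution function. The following is an illustrative NumPy approximation with evenly spaced one-yard bins assumed, not the exact competition scoring code.

```python
import numpy as np

def binned_crps(bin_probs: np.ndarray, bin_edges: np.ndarray, observed: float) -> float:
    """Discrete CRPS for a binned predictive distribution.

    bin_probs sums to 1 over the bins defined by bin_edges
    (len(bin_edges) == len(bin_probs) + 1). Illustrative approximation only,
    not the exact Kaggle scoring implementation.
    """
    cdf = np.cumsum(bin_probs)                        # predicted CDF at each bin's upper edge
    widths = np.diff(bin_edges)
    step = (bin_edges[1:] >= observed).astype(float)  # empirical CDF of the observation
    return float(np.sum((cdf - step) ** 2 * widths))

# Example: a uniform toy prediction over -10 to 40 yards, observed gain of 7 yards
edges = np.arange(-10, 41)   # one-yard bins
probs = np.ones(50) / 50
print(binned_crps(probs, edges, 7))
```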
The model was trained on the CRPS loss using Stochastic Weight Averaging and early stopping.
To deal with the irregularity of the binned part of the output distributions, we used two techniques:
- A smoothness penalty proportional to the squared difference between two consecutive bins (see the sketch after this list)
- Ensembling models trained during cross-validation
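The penalty itself is a one-liner. The following NumPy sketch shows it for a vector of bin probabilities; during training, this term would be added to the CRPS loss with a weight such as the 5 or 10 swept in the grid search below.

```python
import numpy as np

def smoothness_penalty(bin_probs: np.ndarray, weight: float = 5.0) -> float:
    """Sum of squared differences between consecutive bin probabilities.

    The weight corresponds to the smoothness values (0, 5, 10) swept in the
    grid search. Illustrative NumPy version; during training this term is added
    to the CRPS loss computed on the predicted binned distribution.
    """
    return float(weight * np.sum(np.diff(bin_probs) ** 2))

probs = np.array([0.05, 0.20, 0.10, 0.40, 0.25])  # a toy, jagged binned distribution
print(smoothness_penalty(probs))
```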
Model performance results
For each dataset, we performed a grid search over the following options:
- Probabilistic models
- Baseline was one probability per yard
- SBP was one probability per yard in the center, generalized Pareto distribution in the tails
- Distribution smoothing
- No smoothing (smoothness penalty = 0)
- Smoothness penalty = 5
- Smoothness penalty = 10
- Training and inference procedure
- 10 folds cross-validation and ensemble inference (k10)
- Training on train and validation data for 10 epochs or 20 epochs
Then we looked at the metrics for the top five models sorted by CRPS (lower is better).
For kickoff data, the SBP model slightly outperforms the baseline in terms of CRPS, but more importantly it estimates the touchdown probability better (the true probability is 0.80% in the test set). We see that the best models use 10-fold ensembling (k10) and no smoothness penalty, as shown in the following table.
Training | Model | Smoothness | CRPS | RMSE | CORR % | P(touchdown)% |
--- | --- | --- | --- | --- | --- | --- |
k10 | SBP | 0 | 4.071 | 9.641 | 47.15 | 0.78 |
k10 | Baseline | 0 | 4.074 | 9.62 | 47.585 | 0.306 |
k10 | Baseline | 5 | 4.075 | 9.626 | 47.43 | 0.274 |
k10 | SBP | 5 | 4.079 | 9.656 | 46.977 | 0.682 |
k10 | Baseline | 10 | 4.08 | 9.621 | 47.519 | 0.265 |
The following plot of the observed frequencies and predicted probabilities indicates a good calibration of our best model, with an RMSE of 0.27 between the two distributions. Note the occurrences of high yardage (for example, 100) in the tail of the true (blue) empirical distribution, whose probabilities are captured better by the SBP than by the baseline method.
For punt data, the baseline outperforms the SBP, perhaps because the tails of extreme yardage have fewer realizations, so it's a better trade-off to capture the modality of the 0–10 yard peaks. Contrary to the kickoff data, the best model uses a smoothness penalty. The following table summarizes our findings.
Training | Model | Smoothness | CRPS | RMSE | CORR % | P(touchdown)% |
--- | --- | --- | --- | --- | --- | --- |
k10 | Baseline | 5 | 3.961 | 8.313 | 35.227 | 0.547 |
k10 | Baseline | 0 | 3.972 | 8.346 | 34.227 | 0.579 |
k10 | Baseline | 10 | 3.978 | 8.351 | 34.079 | 0.555 |
k10 | SBP | 5 | 3.981 | 8.342 | 34.971 | 0.723 |
k10 | SBP | 0 | 3.991 | 8.378 | 33.437 | 0.677 |
The following plot of observed frequencies (in blue) and predicted probabilities for the two best punt models indicates that the non-smoothed model (in orange) is slightly better calibrated than the smoothed model (in green) and may be a better choice overall.
Conclusion
In this post, we showed how to build predictive models with fat-tailed data distribution. We used Spliced Binned-Pareto distribution, implemented in GluonTS, which can robustly model such fat-tailed distributions. We used this technique to build models for punt and kickoff returns. We can apply this solution to similar use cases where there are very few events in the data, but those events have significant impact on the overall performance of the models.
If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.
About the Authors
Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps AWS customers across various industries such as healthcare and life sciences, manufacturing, automotive, and sports and media, accelerate their use of machine learning and AWS cloud services to solve their business challenges.
Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab team at Amazon Web Services. He works with AWS customers to solve business problems with artificial intelligence and machine learning. Outside of work you may find him at the beach, playing with his children, surfing or kitesurfing.
Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.
Kyeong Hoon (Jonathan) Jung is a senior software engineer at the National Football League. He has been with the Next Gen Stats team for the last seven years helping to build out the platform from streaming the raw data, building out microservices to process the data, to building API’s that exposes the processed data. He has collaborated with the Amazon Machine Learning Solutions Lab in providing clean data for them to work with as well as providing domain knowledge about the data itself. Outside of work, he enjoys cycling in Los Angeles and hiking in the Sierras.
Michael Chi is a Senior Director of Technology overseeing Next Gen Stats and Data Engineering at the National Football League. He has a degree in Mathematics and Computer Science from the University of Illinois at Urbana Champaign. Michael first joined the NFL in 2007 and has primarily focused on technology and platforms for football statistics. In his spare time, he enjoys spending time with his family outdoors.
Mike Band is a Senior Manager of Research and Analytics for Next Gen Stats at the National Football League. Since joining the team in 2018, he has been responsible for ideation, development, and communication of key stats and insights derived from player-tracking data for fans, NFL broadcast partners, and the 32 clubs alike. Mike brings a wealth of knowledge and experience to the team with a master’s degree in analytics from the University of Chicago, a bachelor’s degree in sport management from the University of Florida, and experience in both the scouting department of the Minnesota Vikings and the recruiting department of Florida Gator Football.
Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab
The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, Biocore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More information regarding the NFL Player Health and Safety efforts is available on the NFL website.
The AWS Professional Services team has partnered with the NFL and Biocore to provide machine learning (ML)-based solutions for identifying helmet impacts from game footage using computer vision (CV) techniques. With multiple camera views available from each game, we have developed solutions to identify helmet impacts from each of these views and merge the helmet impact results.
The motivation behind utilizing multiple camera views comes from the limitation of information when the impact events are captured with only one view. With only one perspective, some players might occlude each other or be blocked by other objects on the field. Therefore, adding more perspectives allows our ML system to identify more impacts that aren’t visible in a single view. To showcase the results of our fusion process and how the team uses visualization tools to help evaluate the model performance, we have developed a codebase to visually overlay the multiple view detection results. This process helps identify the actual number of impacts individual players experience by removing duplicate impacts detected in multiple views.
In this post, we use the publicly available dataset from the NFL – Impact Detection Kaggle competition and show results for merging two views. The dataset includes helmet bounding boxes at every frame and impact labels found in each video. In particular, we focus on deduplicating and visualizing videos with the ID 57583_000082 in endzone and sideline views. You can download the endzone and sideline videos, and also the ground truth labels.
Prerequisites
The solution requires the following:
- An Amazon SageMaker Studio Lab account
- A Kaggle account for downloading the data
Get started on SageMaker Studio Lab and install the required packages
You can run the notebook from the GitHub repository or from SageMaker Studio Lab. In this post, we run the notebook from a SageMaker Studio Lab environment. We chose SageMaker Studio Lab because it is free, provides powerful CPU and GPU user sessions, and includes 15 GB of persistent storage that automatically saves your environment, enabling you to pick up where you left off. To use SageMaker Studio Lab, request and set up a new account. After the account is approved, complete the following steps:
- Visit the aws-samples GitHub repo.
- In the README section, choose Open Studio Lab.
This redirects you to your SageMaker Studio Lab environment.
- Select your CPU compute type, then choose Start Runtime.
- After the runtime starts, choose Copy to Project, which opens a new window with the Jupyter Lab environment.
Now you’re ready to use the notebook!
- Open fuse_and_visualize_multiview_impacts.ipynb and follow the instructions in the notebook.
The first cell in the notebook installs the necessary Python packages such as pandas and OpenCV:
%pip install pandas
%pip install opencv-contrib-python-headless
Import all the necessary Python packages and set pandas options for better visualization experience:
import os
import cv2
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)
We use pandas for ingesting and parsing through the CSV file with the annotated helmet bounding boxes as well as impacts. We use NumPy mainly for manipulating arrays and matrices. We use OpenCV for reading, writing, and manipulating image data in Python.
Prepare the data by fusing results from two views
To fuse the two perspectives together, we use the train_labels.csv from the Kaggle competition as an example because it contains ground truth impacts from both the endzone and sideline. The following function takes the input dataset and outputs a fused dataframe that is deduplicated for all the plays in the input dataset:
def prep_data(df):
    df['game_play'] = df['gameKey'].astype('str') + '_' + df['playID'].astype('str').str.zfill(6)
    return df

def dedup_view(df, windows):
    # define view
    df = df.sort_values(by='frame')
    view_columns = ['frame', 'left', 'width', 'top', 'height', 'video']
    common_columns = ['game_play', 'label', 'view', 'impactType']
    label_cleaned = df[view_columns + common_columns]
    # rename columns
    sideline_column_rename = {col: 'Sideline_' + col for col in view_columns}
    endzone_column_rename = {col: 'Endzone_' + col for col in view_columns}
    sideline_columns = list(sideline_column_rename.values())
    # create two dataframes, one for sideline, one for endzone
    label_endzone = label_cleaned.query('view == "Endzone"')
    label_endzone.rename(columns=endzone_column_rename, inplace=True)
    label_sideline = label_cleaned.query('view == "Sideline"')
    label_sideline.rename(columns=sideline_column_rename, inplace=True)
    # prepare sideline labels
    label_sideline['is_dup'] = False
    for columns in sideline_columns:
        label_endzone[columns] = np.nan
    label_endzone['is_dup'] = False
    # iterate endzone rows to find matches and dedup
    for index, row in label_endzone.iterrows():
        player = row['label']
        frame = row['Endzone_frame']
        impact_type = row['impactType']
        sideline_row = label_sideline[(label_sideline['label'] == player) &
                                      ((label_sideline['Sideline_frame'] >= frame - windows // 2) &
                                       (label_sideline['Sideline_frame'] <= frame + windows // 2 + 1)) &
                                      (label_sideline['is_dup'] == False) &
                                      (label_sideline['impactType'] == impact_type)]
        if len(sideline_row) > 0:
            sideline_index = sideline_row.index[0]
            label_sideline['is_dup'].loc[sideline_index] = True
            for col in sideline_columns:
                label_endzone[col].loc[index] = sideline_row.iloc[0][col]
            label_endzone['is_dup'].loc[index] = True
    # combine the non-duplicated sideline labels with the endzone labels
    not_dup_sideline = label_sideline[label_sideline['is_dup'] == False]
    final_output = pd.concat([not_dup_sideline, label_endzone])
    return final_output

def fuse_df(raw_df, windows):
    outputs = []
    all_game_play = raw_df['game_play'].unique()
    for game_play in all_game_play:
        df = raw_df.query('game_play ==@game_play')
        output = dedup_view(df, windows)
        outputs.append(output)
    output_df = pd.concat(outputs)
    output_df['gameKey'] = output_df['game_play'].apply(lambda x: x.split('_')[0]).map(int)
    output_df['playID'] = output_df['game_play'].apply(lambda x: x.split('_')[1]).map(int)
    return output_df
To run the function, we run the following code block to provide the location of the train_labels.csv data and then perform data preparation to add an additional column and extract only the impact rows. After running the function, we save the output to a dataframe variable called fused_df.
# read the annotated impact data from train_labels.csv
ground_truth = pd.read_csv('train_labels.csv')
# prepare game_play column using pipe(prep_data) function in pandas then filter the dataframe for just rows with impacts
ground_truth = ground_truth.pipe(prep_data).query('impact == 1')
# loop over all the unique game_plays and deduplicate the impact results from sideline and endzone
fused_df = fuse_df(ground_truth, windows=30)
The following screenshot shows the ground truth.
The following screenshot shows the fused dataframe examples.
Graph and video code
After we fuse the impact results, we use the generated fused_df to overlay the results onto our endzone and sideline videos and merge the two views together. We use the following function for this, and the inputs needed are the paths to the endzone video, sideline video, fused_df dataframe, and the final output path for the newly generated video. The functions used in this section are described in the markdown section of the notebook used in SageMaker Studio Lab.
def get_video_and_metadata(vid_path):
    vid = cv2.VideoCapture(vid_path)
    total_frame_number = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = vid.get(cv2.CAP_PROP_FPS)
    return vid, total_frame_number, width, height, fps
def overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1):
    # look for duplicates
    duplicates = fused_df.query(f"gameKey == {int(game_key)} and "
                                f"playID == {int(play_id)} and "
                                f"is_dup == True and "
                                f"Sideline_frame == @frame_cnt")
    frame_has_impact = False
    if len(duplicates) > 0:
        for duplicate in duplicates.itertuples(index=False):
            if frame_cnt == duplicate.Sideline_frame:
                frame_has_impact = True
            if frame_has_impact:
                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of top left corner
                              (int(duplicate.Sideline_left) + int(duplicate.Sideline_width), int(duplicate.Sideline_top) + int(duplicate.Sideline_height)), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Endzone_left), int(duplicate.Endzone_top)+ h1), #(x,y) of top left corner
                              (int(duplicate.Endzone_left) + int(duplicate.Endzone_width), int(duplicate.Endzone_top) + int(duplicate.Endzone_height) + h1), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
                cv2.line(frame, #frame to be edited
                         (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of point 1 in a line
                         (int(duplicate.Endzone_left), int(duplicate.Endzone_top) + h1), #(x,y) of point 2 in a line
                         (255, 255, 255), # WHITE lines
                         thickness=4)
    else:
        # if no duplicates, look for sideline then endzone and add to the view
        sl_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    f"is_dup == False and "
                                    f"view == 'Sideline' and "
                                    f"Sideline_frame == @frame_cnt")
        if len(sl_impacts) > 0:
            for impact in sl_impacts.itertuples(index=False):
                if frame_cnt == impact.Sideline_frame:
                    frame_has_impact = True
                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Sideline_left), int(impact.Sideline_top)), #(x,y) of top left corner
                                  (int(impact.Sideline_left) + int(impact.Sideline_width), int(impact.Sideline_top) + int(impact.Sideline_height)), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)
        ez_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    f"is_dup == False and "
                                    f"view == 'Endzone' and "
                                    f"Endzone_frame == @frame_cnt")
        if len(ez_impacts) > 0:
            for impact in ez_impacts.itertuples(index=False):
                if frame_cnt == impact.Endzone_frame:
                    frame_has_impact = True
                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Endzone_left), int(impact.Endzone_top)+ h1), #(x,y) of top left corner
                                  (int(impact.Endzone_left) + int(impact.Endzone_width), int(impact.Endzone_top) + int(impact.Endzone_height) + h1), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)
    return frame, frame_has_impact
def generate_impact_video(ez_vid_path:str,
                          sl_vid_path:str,
                          fused_df:pd.DataFrame,
                          output_path:str,
                          freeze_impacts=True):
    # define video codec to be used for the output video
    VIDEO_CODEC = "MP4V"
    # parse game_key and play_id information from the name of the files
    game_key = os.path.basename(ez_vid_path).split('_')[0] # parse game_key
    play_id = os.path.basename(ez_vid_path).split('_')[1] # parse play_id
    # get metadata such as total frame number, width, height and frames per second (FPS) from endzone (ez) and sideline (sl) videos
    ez_vid, ez_total_frame_number, ez_width, ez_height, ez_fps = get_video_and_metadata(ez_vid_path)
    sl_vid, sl_total_frame_number, sl_width, sl_height, sl_fps = get_video_and_metadata(sl_vid_path)
    # define a video writer for the output video
    output_video = cv2.VideoWriter(output_path, #output file name
                                   cv2.VideoWriter_fourcc(*VIDEO_CODEC), #Video codec
                                   ez_fps, #frames per second in the output video
                                   (ez_width, ez_height+sl_height)) # frame size with stacking video vertically
    # find shorter video and use the total frame number from the shorter video for the output video
    total_frame_number = int(min(ez_total_frame_number, sl_total_frame_number))
    # iterate through each frame from endzone and sideline
    for frame_cnt in range(total_frame_number):
        frame_has_impact = False
        frame_near_impact = False
        # reading frames from both endzone and sideline
        ez_ret, ez_frame = ez_vid.read()
        sl_ret, sl_frame = sl_vid.read()
        # creating strings to be added to the output frames
        img_name = f"Game key: {game_key}, Play ID: {play_id}, Frame: {frame_cnt}"
        video_frame = f'{game_key}_{play_id}_{frame_cnt}'
        if ez_ret == True and sl_ret == True:
            h, w, c = ez_frame.shape
            h1,w1,c1 = sl_frame.shape
            if h != h1 or w != w1: # resize images if they're different
                ez_frame = cv2.resize(ez_frame,(w1,h1))
            frame = np.concatenate((sl_frame, ez_frame), axis=0) # stack the frames vertically
            frame, frame_has_impact = overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1)
            cv2.putText(frame, #image frame to be modified
                        img_name, #string to be inserted
                        (30, 30), #(x,y) location of the string
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), #WHITE letters
                        thickness=2)
            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1-20), #(x,y) location of the string in the top view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)
            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1+h-20), #(x,y) location of the string in the bottom view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)
            output_video.write(frame)
            # Freeze for 60 frames on impacts
            if frame_has_impact and freeze_impacts:
                for _ in range(60):
                    output_video.write(frame)
        else:
            break
        frame_cnt += 1
    output_video.release()
    return
To run these functions, we can provide an input as shown in the following code, which generates a video called output.mp4:
generate_impact_video('57583_000082_Endzone.mp4',
'57583_000082_Sideline.mp4',
fused_df,
'output.mp4')
This generates a video like the following example, where the red bounding boxes mark impacts found in both the endzone and sideline views, and the yellow bounding boxes mark impacts found in only one of the two views.
Conclusion
In this post, we demonstrated how the NFL, Biocore, and the AWS ProServe teams are working together to improve impact detection by fusing results from multiple views. This allows the teams to debug and visualize how the model is performing qualitatively. This process can easily be scaled up to three or more views; in our projects, we have utilized up to seven different views. Detecting helmet impacts by watching videos from only one view can be difficult due to view obstruction, but detecting impacts from multiple views and fusing the results allows us to improve our model performance.
To experiment with this solution, visit the aws-samples GitHub repo and refer to the fuse_and_visualize_multiview_impacts.ipynb notebook. Similar techniques can also be applied to other industries such as manufacturing, retail, and security, where having multiple views would benefit the ML system to better identify targets with a more comprehensive view.
For more information regarding NFL Player Health and Safety, visit the NFL website and NFL Explained: Innovation in Player Health & Safety.
About the authors
Chris Boomhower is a Machine Learning Engineer at AWS Professional Services. Chris has over 6 years of experience developing supervised and unsupervised machine learning solutions across various industries. Today, he spends most of his time helping customers in the sports, healthcare, and agriculture industries design and build scalable, end-to-end machine learning solutions.
Ben Fenker is a Senior Data Scientist in AWS Professional Services and has helped customers build and deploy ML solutions in industries ranging from sports to healthcare to manufacturing. He has a Ph.D. in physics from Texas A&M University and 6 years of industry experience. Ben enjoys baseball, reading, and raising his kids.
Sam Huddleston is a Principal Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.
Jarvis Lee is a Senior Data Scientist with AWS Professional Services. He has been with AWS for over five years, working with customers on machine learning and computer vision problems. Outside of work, he enjoys riding bicycles.
Tyler Mullenbach is the Global Practice Lead for ML with AWS Professional Services. He is responsible for driving the strategic direction of ML for Professional Services and ensuring that customers realize transformative business achievements through the adoption of ML technologies.
Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has over 5 years of industry experience in building computer vision and machine learning solutions.
Betty Zhang is a data scientist with 10 years of experience in data and technology. Her passion is to build innovative machine learning solutions to drive transformational changes for companies. In her spare time, she enjoys traveling, reading and learning about new technologies.
Amazon’s quantum computing papers at QIP 2023
Research on “super-Grover” optimization, quantum algorithms for topological data analysis, and simulation of physical systems displays the range of Amazon’s interests in quantum computing.
How to decide between Amazon Rekognition image and video API for video moderation
Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. This user-generated content helps consumers make decisions, from buying a new pair of jeans to securing a home loan. In a recent survey, 79% of consumers stated they rely on user videos, comments, and reviews more than ever, 78% said that brands are responsible for moderating such content, and 40% said they would disengage from a brand after a single exposure to toxic content.
Amazon Rekognition has two sets of APIs that help you moderate images or videos to keep digital communities safe and engaged.
One approach to moderate videos is to model video data as a sample of image frames and use image content moderation models to process the frames individually. This approach allows the reuse of image-based models. Some customers have asked if they could use this approach to moderate videos by sampling image frames and sending them to the Amazon Rekognition image moderation API. They are curious about how this solution compares with the Amazon Rekognition video moderation API.
We recommend using the Amazon Rekognition video moderation API to moderate video content. It’s designed and optimized for video moderation, offering better performance and lower costs. However, there are specific use cases where the image API solution is optimal.
This post compares the two video moderation solutions in terms of accuracy, cost, performance, and architecture complexity to help you choose the best solution for your use case.
Moderate videos using the video moderation API
The Amazon Rekognition video content moderation API is the standard solution for detecting inappropriate or unwanted content in videos. It runs as an asynchronous operation on video content stored in an Amazon Simple Storage Service (Amazon S3) bucket. The analysis results are returned as an array of moderation labels, each with a confidence score and a timestamp indicating when the label was detected.
The video content moderation API uses the same machine learning (ML) model as the image moderation API. Its output is filtered to suppress noisy false positive results, and the workflow is optimized for latency by parallelizing operations such as decoding, frame extraction, and inference.
The following diagram shows the logical steps of how to use the Amazon Rekognition video moderation API to moderate videos.
The steps are as follows:
- Upload videos to an S3 bucket.
- Call the video moderation API from an AWS Lambda function (or a customized script on premises) with the video file location as a parameter. The API manages the heavy lifting of video decoding, sampling, and inference. You can either implement heartbeat logic that checks the moderation job status until it completes, or use Amazon Simple Notification Service (Amazon SNS) to implement an event-driven pattern. For detailed examples of the video moderation API, refer to the following Jupyter notebook. A minimal sketch of the polling pattern follows this list.
- Store the moderation result as a file in an S3 bucket or database.
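The following is a minimal sketch of this pattern using the AWS SDK for Python (Boto3). The function name, bucket, object key, and polling interval are illustrative, and in production you would typically replace the polling loop with an Amazon SNS notification channel:
import time
import boto3

rekognition = boto3.client("rekognition")

def moderate_video(bucket, key, min_confidence=50.0):
    # start the asynchronous moderation job on a video stored in Amazon S3
    job = rekognition.start_content_moderation(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence)
    job_id = job["JobId"]
    # heartbeat loop: poll the job status until it succeeds or fails
    while True:
        result = rekognition.get_content_moderation(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)
    # each label includes a timestamp (ms), a label name, and a confidence score;
    # pagination via NextToken is omitted for brevity
    return result.get("ModerationLabels", [])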
Moderate videos using the image moderation API
Instead of using the video content moderation API, some customers choose to independently sample frames from videos and detect inappropriate content by sending the images to the Amazon Rekognition DetectModerationLabels API. Image results are returned in real time, with labels for inappropriate or offensive content along with a confidence score.
The following diagram shows the logical steps of the image API solution.
The steps are as follows:
1. Use a customized application or script as an orchestrator to load the video to the local file system.
2. Decode the video.
3. Sample image frames from the video at a chosen interval, such as two frames per second, then iterate through the sampled images (a sketch follows these steps) to:
3.a. Send each image frame to the image moderation API.
3.b. Store the moderation results in a file or database.
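As a rough illustration of steps 2 and 3, the following sketch uses OpenCV to decode a local video, sample frames at a configurable rate, and send each sampled frame to the DetectModerationLabels API. The function name, sampling rate, and fallback frame rate are assumptions made for this example, and error handling and batching are omitted:
import cv2
import boto3

rekognition = boto3.client("rekognition")

def moderate_video_by_frames(video_path, frames_per_second=2.0, min_confidence=50.0):
    video = cv2.VideoCapture(video_path)  # decode the local video
    native_fps = video.get(cv2.CAP_PROP_FPS) or 30.0  # fall back to 30 FPS if metadata is missing
    step = max(int(round(native_fps / frames_per_second)), 1)
    results = []
    frame_idx = 0
    while True:
        ret, frame = video.read()
        if not ret:
            break
        if frame_idx % step == 0:
            # encode the sampled frame as JPEG bytes for the image moderation API
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                response = rekognition.detect_moderation_labels(
                    Image={"Bytes": jpeg.tobytes()},
                    MinConfidence=min_confidence)
                results.append({"TimestampMs": int(frame_idx / native_fps * 1000),
                                "Labels": response["ModerationLabels"]})
        frame_idx += 1
    video.release()
    return results  # store this list in a file or database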
Compare this with the video API solution, which requires only a lightweight Lambda function to orchestrate API calls. The image sampling solution is CPU intensive and requires more compute resources. You can host the application using AWS services such as Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or Amazon Elastic Compute Cloud (Amazon EC2).
Evaluation dataset
To evaluate both solutions, we use a sample dataset consisting of 200 short-form videos. The videos range from 10 seconds to 45 minutes. 60% of the videos are less than 2 minutes long. This sample dataset is used to test the performance, cost, and accuracy metrics for both solutions. The results compare the Amazon Rekognition image API sampling solution to the video API solution.
To test the image API solution, we use open-source libraries (ffmpeg and OpenCV) to sample images at a rate of two frames per second (one frame every 500 milliseconds). This rate mimics the sampling frequency used by the video content moderation API. Each image is sent to the image content moderation API to generate labels.
To test the video sampling solution, we send the videos directly to the video content moderation API to generate labels.
Results summary
We focus on the following key results:
- Accuracy – Both solutions offer similar accuracy (false positive and false negative percentages) when using the same sampling frequency of two frames per second
- Cost – The image API sampling solution is more expensive than the video API solution at the same sampling frequency of two frames per second; its cost can be reduced by sampling fewer frames per second
- Performance – On average, the video API has a 425% faster processing time than the image API solution for the sample dataset; the image API solution performs better with a high frame sample interval and on videos shorter than 90 seconds
- Architecture complexity – The video API solution has low architecture complexity, whereas the image API sampling solution has medium architecture complexity
Accuracy
We tested both solutions using the sample set and the same sampling frequency of two frames per second. The results demonstrated that both solutions provide a similar false positive and true positive ratio. This result is expected because under the hood, Amazon Rekognition uses the same ML model for both the video and image moderation APIs.
To learn more about metrics for evaluating content moderation, refer to Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services.
Cost
The cost analysis demonstrates that the image API solution is more expensive than the video API solution if you use the same sampling frequency of two frames per second. The image API solution can be more cost effective if you reduce the number of frames sampled per second.
The two primary factors that impact the cost of a content moderation solution are the Amazon Rekognition API costs and the compute costs. The default pricing is $0.10 per minute for the video content moderation API and $0.001 per image for the image content moderation API. At a rate of two frames per second, a 60-second video produces 120 frames, so the video API costs $0.10 to moderate a 60-second video, whereas the image API costs $0.12.
The price calculation is based on the official price in Region us-east-1 at the time of writing this post. For more information, refer to Amazon Rekognition pricing.
The cost analysis looks at the total cost to generate content moderation labels for the 200 videos in the sample set. The calculations are based on us-east-1 pricing. If you’re using another Region, modify the parameters with the pricing for that Region. The 200 videos contain 4271.39 minutes of content and generate 512,567 image frames at a sampling rate of two frames per second.
This comparison doesn’t consider other costs, such as Amazon S3 storage. We use Lambda as an example to calculate the AWS compute cost. The compute cost takes into account the number of requests to Lambda and AWS Step Functions needed to run the analysis. The Lambda memory/CPU setting is estimated based on Amazon EC2 specifications, and this estimate assumes one 4 GB, 2-second Lambda invocation per image API call. Lambda functions have a maximum invocation timeout of 15 minutes, so for longer videos you may need to implement iteration logic with Step Functions to reduce the number of frames processed per Lambda call. The actual Lambda settings and cost patterns may differ depending on your requirements; we recommend testing the solution end to end for a more accurate cost estimate.
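As a back-of-the-envelope check on the Amazon Rekognition API portion of these numbers (compute costs are estimated separately), the following sketch reproduces the API cost calculation for the sample set using the us-east-1 list prices quoted earlier:
# Rekognition API costs only; compute costs are excluded
VIDEO_PRICE_PER_MINUTE = 0.10    # video moderation API, us-east-1
IMAGE_PRICE_PER_IMAGE = 0.001    # image moderation API, us-east-1

total_minutes = 4271.39          # total video duration in the sample set
frames_at_2_fps = 512_567        # image frames sampled at two frames per second

video_api_cost = total_minutes * VIDEO_PRICE_PER_MINUTE          # ~$427.14
image_api_cost_2_fps = frames_at_2_fps * IMAGE_PRICE_PER_IMAGE   # ~$512.57
image_api_cost_1_fps = image_api_cost_2_fps / 2                  # ~$256.28 at one frame per second

print(f"Video API:         ${video_api_cost:,.2f}")
print(f"Image API (2 fps): ${image_api_cost_2_fps:,.2f}")
print(f"Image API (1 fps): ${image_api_cost_1_fps:,.2f}")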
The following table summarizes the costs.
Type | Amazon Rekognition Costs | Compute Costs | Total Cost |
Video API Solution | $427.14 | $0 (Free tier) | $427.14 |
Image API Solution: Two frames per second | $512.57 | $164.23 | $676.80 |
Image API Solution: One frame per second | $256.28 | $82.12 | $338.40 |
Performance
On average, the video API solution processes videos about four times faster than the image API solution. The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds.
This analysis measures performance as the average processing time in seconds per video. It looks at the total and average time to generate content moderation labels for the 200 videos in the sample set. The processing time is measured from the video upload to the result output and includes each step in the image sampling and video API process.
The video API solution has an average processing time of 35.2 seconds per video for the sample set. This is compared to the image API solution with an average processing time of 156.24 seconds per video for the sample set. On average, the video API performs four times faster than the image API solution. The following table summarizes these findings.
Type | Average Processing Time (All Videos) | Average Processing Time (Videos Under 1.5 Minutes) |
Video API Solution | 35.2 seconds | 24.05 seconds |
Image API Solution: Two frames per second | 156.24 seconds | 8.45 seconds |
Difference | 425% | -185% |
The image API performs better than the video API when the video is shorter than 90 seconds. This is because the video API queues moderation jobs, which adds lead time before processing begins. The image API can also perform better if you use a lower sampling frequency: increasing the frame interval to over 5 seconds can decrease the processing time by 6–10 times. It’s important to note that longer intervals increase the risk of missing inappropriate content that appears between sampled frames.
Architecture complexity
The video API solution has low architecture complexity. You can set up a serverless pipeline or run a script to retrieve content moderation results, and Amazon Rekognition manages the heavy computing and inference. The application orchestrating the Amazon Rekognition APIs can be hosted on a lightweight machine.
The image API solution has medium architecture complexity. The application logic has to orchestrate additional steps to store videos on the local drive, run image processing to capture frames, and call the image API. The server hosting the application requires higher compute capacity to support the local image processing; for the evaluation, we launched an EC2 instance with 4 vCPUs and 8 GB of RAM to support two parallel threads. Higher compute requirements may lead to additional operational overhead.
Optimal use cases for the image API solution
The image API solution is ideal for three specific use cases when processing videos.
The first is real-time video streaming. You can capture image frames from a live video stream and send the images to the image moderation API.
The second use case is content moderation with a low frame sampling rate requirement. The image API solution is more cost-effective and performant if you sample frames at a low frequency. It’s important to note that there will be a trade-off between cost and accuracy. Sampling frames at a lower rate may increase the risk of missing frames with inappropriate content.
The third use case is the early detection of inappropriate content in a video. The image API solution is flexible and allows you to stop processing and flag the video early on, saving cost and time, as in the sketch that follows.
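Building on the same frame-sampling pattern shown earlier, the following sketch stops at the first sampled frame whose moderation labels meet an illustrative confidence threshold; the function name and threshold are hypothetical:
import cv2
import boto3

rekognition = boto3.client("rekognition")

def flag_video_early(video_path, frames_per_second=2.0, stop_confidence=80.0):
    # return True as soon as any sampled frame yields a moderation label at or
    # above stop_confidence, so the rest of the video is never processed
    video = cv2.VideoCapture(video_path)
    native_fps = video.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / frames_per_second)), 1)
    frame_idx = 0
    try:
        while True:
            ret, frame = video.read()
            if not ret:
                return False  # reached the end without flagging anything
            if frame_idx % step == 0:
                ok, jpeg = cv2.imencode(".jpg", frame)
                if ok:
                    labels = rekognition.detect_moderation_labels(
                        Image={"Bytes": jpeg.tobytes()},
                        MinConfidence=stop_confidence)["ModerationLabels"]
                    if labels:
                        return True  # stop processing and flag the video early
            frame_idx += 1
    finally:
        video.release()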
Conclusion
The video moderation API is ideal for most video moderation use cases. It’s more cost-effective and performant than the image API solution when you sample frames at a frequency such as two frames per second. Additionally, it has low architectural complexity and reduced operational overhead.
The following table summarizes our findings to help you maximize the use of the Amazon Rekognition image and video APIs for your specific video moderation use cases. Although these results are averages achieved during testing and by some of our customers, they should give you ideas to balance the use of each API.
Criterion | Video API Solution | Image API Solution |
Accuracy | Same accuracy | Same accuracy |
Cost | Lower cost using the default image sampling interval | Lower cost if you reduce the number of frames sampled per second (sacrifice accuracy) |
Performance | Faster for videos longer than 90 seconds | Faster for videos less than 90 seconds |
Architecture Complexity | Low complexity | Medium complexity |
Amazon Rekognition content moderation can not only help your business keep customers safe and engaged, but also contribute to your ongoing efforts to maximize the return on your content moderation investment. Learn more about Content Moderation on AWS and our Content Moderation ML use cases.
About the authors
Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team, with expertise in AI and ML for content moderation and computer vision. She is passionate about promoting AWS AI services and helping customers transform their business solutions.
Brigit Brown is a Solutions Architect at Amazon Web Services. Brigit is passionate about helping customers find innovative solutions to complex business challenges using machine learning and artificial intelligence. Her core areas of depth are natural language processing and content moderation.
How Amazon’s AZ3 chip makes neural networks run more efficiently
Specialized circuitry for compression and for both inline and on-the-fly decompression minimizes data movement.