Facebook hosts virtual 2020 Fellowship Summit

The Facebook Fellowship Program supports top PhD students from around the world in fields related to computer science and engineering. The program includes an invitation to the annual Fellowship Summit, which is an opportunity for Fellows to network with one another, present their work, meet Facebook researchers and recruiters, and more.


Due to COVID-19, the Fellowship Summit was fully virtual and spanned September 8 to September 18. “We’ve embraced this new challenge of planning virtual events, which has provided unique opportunities for the summit,” says Alisa Futriski, Program Coordinator for the Fellowship Program. “For example, because of increased scheduling flexibility and no travel restriction issues, we were able to bring together a particularly robust group of presenters from the Facebook Research community.”

One of the presenters was Facebook Chief Technology Officer Mike Schroepfer, who kicked off the virtual summit with a welcome video. Fellows also heard from research area experts and executives, such as VP of AI Jérôme Pesenti, Novi Head Economist Christian Catalini, Probability Research Scientist Mark Harman, Data for Good Public Policy Research Manager Kelsey Mulcahy, and many more.

“In previous years, Fellows have been given the opportunity to present their research in poster sessions during the summit,” says Sharon Ayalde, Fellowship Program Manager. “This year, in an effort to increase engagement virtually, we asked the Fellows to record presentations of their current research for the summit. These videos were available for all attendees and are now featured for anyone to browse on their Fellow profiles.”

The two-week event also included several Q&As with research-area-specific recruiters for Fellows interested in internships and full-time positions. To complement these Q&As, we organized a panel of several past Fellows who went on to work at Facebook as research interns, full-time researchers, or both: Mark Jeffrey (2017), Eden Litt (2014), Moses Namara (2020), Brandon Schlinker (2016), and Greg Steinbrecher (2017).

Applications for the 2021 Fellowship cohort opened on August 10 with a deadline of October 1, and winners are typically announced in the January after applications close. For more information and to apply, visit the Fellowship page.

The post Facebook hosts virtual 2020 Fellowship Summit appeared first on Facebook Research.


Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker

Many organizations use Amazon SageMaker for their machine learning (ML) requirements and source data from a data lake stored on Amazon Simple Storage Service (Amazon S3). The petabyte-scale source data on Amazon S3 may not always be clean because data lakes ingest data from several source systems, such as flat files, external feeds, databases, and Hadoop. It may contain extreme values in source attributes, which are considered outliers in the data. Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, missing data, or simply through natural deviations in populations. Outliers in training data can easily reduce the accuracy of many ML models, such as linear and logistic regression, leaving ML scientists and analysts with skewed results. In the worst case, outliers change the fitted model equation completely and lead to bad predictions or estimations.

Data scientists and analysts are looking for a way to remove outliers. Analysts typically come from a strong data background and are fluent in writing SQL queries and working with programming languages. The following tools are a natural choice for ML scientists to remove outliers and carry out data visualization:

  • Amazon Athena – An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Pandas – An open-source, high-performance, easy-to-use library that provides data structures and data analysis tools for the Python programming language, and works with plotting libraries like matplotlib
  • Amazon SageMaker – A fully managed service that provides you with the ability to build, train, and deploy ML models quickly

To illustrate how to use Athena with Pandas for anomaly detection and visualization using Amazon SageMaker, we clean a set of New York City Taxi and Limousine Commission (TLC) Trip Record Data by removing outlier records. In this dataset, a record is an outlier when the trip duration spans multiple days, is 0 seconds, or is less than 0 seconds. We then use Pandas together with the matplotlib library to plot graphs that visualize trip duration values.

Solution overview

To implement this solution, you perform the following high-level steps:

  1. Create an AWS Glue Data Catalog and browse the data on the Athena console.
  2. Create an Amazon SageMaker Jupyter notebook and install PyAthena.
  3. Identify anomalies using Athena SQL-Pandas from the Jupyter notebook.
  4. Visualize data and remove outliers using Athena SQL-Pandas.

The following diagram illustrates the architecture of this solution.


To follow this post, you should be familiar with the following:

  • The Amazon S3 file upload process
  • AWS Glue crawlers and the Data Catalog
  • Basic SQL queries
  • Jupyter notebooks
  • Assigning a basic AWS Identity and Access Management (IAM) policy to a role

Preparing the data

For this post, we use New York City Taxi and Limousine Commission (TLC) Trip Record Data, which is a publicly available dataset.

  1. Download the file yellow_tripdata_2019-01.csv to your local machine.
  2. Create the S3 bucket s3-yellow-cab-trip-details (your name will be different).
  3. Upload the file to your bucket using the Amazon S3 console.

Creating the Data Catalog and browsing the data

After you upload the data to Amazon S3, you create the Data Catalog in AWS Glue. This allows you to run SQL queries using Athena.

  1. On the AWS Glue console, create a new database.
  2. For Database name, enter db_yellow_cab_trip_details.
  3. Create an AWS Glue crawler to gather the metadata in the file and catalog it.

For this post, I use the database (db_yellow_cab_trip_details) to save tables with the added prefix src_.

  4. Run the crawler.

The crawler can take 2–3 minutes to complete. You can check the status on Amazon CloudWatch.

The following screenshot shows the crawler details on the AWS Glue console.

When the crawler is complete, the table is available in the Data Catalog. All the metadata and column-level information is displayed with corresponding data types.

We can now check the data on the Athena console to make sure we can read the file as a table and run a SQL query.

Run your query with the following code:

SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10;

The following screenshot shows your output on the Athena console.

Creating a Jupyter notebook and installing PyAthena

You now create a new notebook instance from Amazon SageMaker and install PyAthena using Jupyter.

Amazon SageMaker has managed built-in Jupyter notebooks that allow you to write code in Python, Julia, R, or Scala to explore, analyze, and do modeling with a small set of data.

Make sure the role used for your notebook has access to Athena (use IAM policies to verify and add AmazonS3FullAccess and AmazonAthenaFullAccess).

To create your notebook, complete the following steps:

  1. On the Amazon SageMaker console, under Notebook, choose Notebook instances.
  2. Choose Create notebook instance.
  3. On the Create notebook instance page, enter a name and choose an instance type.

We recommend using an ml.m4.10xlarge instance, due to the size of the dataset. You should choose an appropriate instance depending on your data; costs vary for different instances.

Wait until the Notebook instance status shows as InService (this step can take up to 5 minutes).

  4. When the instance is ready, choose Open Jupyter.
  5. Open the conda_python3 kernel from the notebook instance.
  6. Enter the following commands to install PyAthena:
! pip install --upgrade pip
! pip install PyAthena

The following screenshot shows the output.

You can also install PyAthena when you create the notebook instance by using lifecycle configurations. See the following code:

sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pip
pip install --upgrade PyAthena
source /home/ec2-user/anaconda3/bin/deactivate
EOF

The following screenshot shows where you enter the preceding code in the Scripts section when creating a lifecycle configuration.

You can run a SQL query from the notebook to validate connectivity to Athena and pull data for visualization.

To import the libraries, enter the following code:

from pyathena import connect
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

To connect to Athena, enter the following code:

conn = connect(s3_staging_dir='s3://<your-Query-result-location>',region_name='us-east-1')

To check the sample data, enter the following query:

df_sample = pd.read_sql("SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10", conn)

The following screenshot shows the output.

Detecting anomalies with Athena, Pandas, and Amazon SageMaker

Now that we can connect to Athena, we can run SQL queries to find the records that have unusual trip_duration values.

The following Athena query checks anomalies in the trip_duration data to find the top 50 records with the maximum duration:

# Triple quotes let the SQL statement span several lines
df_anomaly_duration = pd.read_sql("""
select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 desc limit 50""", conn)


The following screenshot shows the output; there are many outliers (trips with a duration greater than 1 day).

The output shows the duration in seconds, minutes, hours, and days.

The following query checks for anomalies and shows the 50 records with the shortest duration (negative value or 0 seconds):

# Same query as before, sorted ascending to surface zero and negative durations
df_anomaly_duration = pd.read_sql("""
select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 asc limit 50""", conn)


The following screenshot shows the output; multiple trips have a negative value or duration of 0.

Similarly, we can use different SQL queries to analyze the data and find other outliers. We can also clean the data by using SQL queries and, if needed, save the data in Amazon S3 with CTAS queries.
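As a rough illustration of the CTAS option, the following sketch builds a CREATE TABLE AS SELECT statement that writes only the plausible trips to a new table. The target table name, the output location, and the choice of Parquet are assumptions made for this example, not settings from this post; the helper function build_ctas_query is also hypothetical.

```python
def build_ctas_query(target_table, external_location):
    """Return an Athena CTAS statement that keeps trips longer than 0 seconds
    and shorter than 24 hours (86,400 seconds)."""
    duration = (
        "date_diff('second', cast(tpep_pickup_datetime as timestamp), "
        "cast(tpep_dropoff_datetime as timestamp))"
    )
    return f"""
CREATE TABLE {target_table}
WITH (format = 'PARQUET', external_location = '{external_location}') AS
SELECT *, {duration} AS duration_second
FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details
WHERE {duration} > 0
  AND {duration} < 86400
""".strip()


query = build_ctas_query(
    "db_yellow_cab_trip_details.clean_yellow_cab_trip_details",  # hypothetical table name
    "s3://<your-clean-data-location>/",  # placeholder: an empty S3 prefix you own
)
print(query)

# To run the statement, reuse the PyAthena connection created earlier:
# cursor = conn.cursor()
# cursor.execute(query)
```

Building the statement as a string first makes it easy to review the SQL before anything is written to Amazon S3.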

Visualizing the data and removing outliers

Pull the data using the following Athena query in a Pandas DataFrame, and use matplotlib.pyplot to create a visual graph to see the outliers:

# Pull every trip duration (in seconds) into a DataFrame
df_full = pd.read_sql("""
SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details""", conn)
# Plot the sorted durations so extreme values stand out at both ends
plt.scatter(range(len(df_full["duration_second"])), np.sort(df_full["duration_second"]))


The process of plotting the full dataset can take 7–10 minutes. To reduce time, add a limit to the number of records in the query:

SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 100000 

The following screenshot shows the output.

To ignore outliers, run the following query. You replot the graph after removing the outlier records in which the duration is less than or equal to 0 seconds or longer than 1 day:

# Keep only trips that last more than 0 seconds and less than 1 day
df_clean_data = pd.read_sql("""
SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
where date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) > 0
and date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) < 1""", conn)
plt.scatter(range(len(df_clean_data["duration_second"])), np.sort(df_clean_data["duration_second"]))

The following screenshot shows the output.
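If the durations are already loaded in a DataFrame, a similar filter can be applied in Pandas without another round trip to Athena. The following is a minimal sketch on a small made-up sample; the function name remove_duration_outliers and the sample values are assumptions for illustration, and the 24-hour cutoff stands in for the 1-day condition in the query.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60


def remove_duration_outliers(df, column="duration_second"):
    """Keep rows whose duration is positive and shorter than one day."""
    keep = (df[column] > 0) & (df[column] < SECONDS_PER_DAY)
    return df.loc[keep].reset_index(drop=True)


# Made-up durations: negative, zero, two normal trips, and a two-day trip
sample = pd.DataFrame({"duration_second": [-60, 0, 420, 1800, 172800]})
clean = remove_duration_outliers(sample)
print(clean["duration_second"].tolist())  # [420, 1800]
```

Filtering in Athena is usually the better choice for the full dataset because less data is transferred to the notebook; the Pandas version is convenient for quick experiments on a sample.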

Cleaning up

When you’re done, delete the notebook instance to avoid recurring deployment costs.

  1. On the Amazon SageMaker console, choose your notebook instance.
  2. Choose Stop.
  3. When the status shows as Stopped, choose Delete.


This post walked you through finding and removing outliers from your dataset and visualizing the data. We used an Amazon SageMaker notebook to run analytical queries using Athena SQL, and used Athena to read the dataset, which is saved in Amazon S3 with the metadata catalog in AWS Glue. We used queries in Athena to find anomalies in the data and ignore these outliers. We also used the notebook instance to plot graphs with Pandas and the matplotlib.pyplot library.

You can try this solution for your own use cases to remove outliers using Athena SQL and an Amazon SageMaker notebook. If you have comments or feedback, please leave them below.

About the Authors

Rahul Sonawane is a Senior Consultant, Big Data at the Shared Delivery Teams at Amazon Web Services.





Behram Irani is a Senior Solutions Architect, Data & Analytics at Amazon Web Services.


Advancing Instance-Level Recognition Research

Posted by Cam Askew and André Araujo, Software Engineers, Google Research

Instance-level recognition (ILR) is the computer vision task of recognizing a specific instance of an object, rather than simply the category to which it belongs. For example, instead of labeling an image as “post-impressionist painting”, we’re interested in instance-level labels like “Starry Night Over the Rhone by Vincent van Gogh”, or “Arc de Triomphe de l’Étoile, Paris, France”, instead of simply “arch”. Instance-level recognition problems exist in many domains, like landmarks, artwork, products, or logos, and have applications in visual search apps, personal photo organization, shopping and more. Over the past several years, Google has been contributing to research on ILR with the Google Landmarks Dataset and Google Landmarks Dataset v2 (GLDv2), and novel models such as DELF and Detect-to-Retrieve.

Three types of image recognition problems, with different levels of label granularity (basic, fine-grained, instance-level), for objects from the artwork, landmark and product domains. In our work, we focus on instance-level recognition.

Today, we highlight some results from the Instance-Level Recognition Workshop at ECCV’20. The workshop brought together experts and enthusiasts in this area, with many fruitful discussions, some of which included our ECCV’20 paper “DEep Local and Global features” (DELG), a state-of-the-art image feature model for instance-level recognition, and a supporting open-source codebase for DELG and other related ILR techniques. Also presented were two new landmark challenges (on recognition and retrieval tasks) based on GLDv2, and future ILR challenges that extend to other domains: artwork recognition and product retrieval. The long-term goal of the workshop and challenges is to foster advancements in the field of ILR and push forward the state of the art by unifying research workstreams from different domains, which so far have mostly been tackled as separate problems.

DELG: DEep Local and Global Features
Effective image representations are the key components required to solve instance-level recognition problems. Often, two types of representations are necessary: global and local image features. A global feature summarizes the entire contents of an image, leading to a compact representation but discarding information about spatial arrangement of visual elements that may be characteristic of unique examples. Local features, on the other hand, comprise descriptors and geometry information about specific image regions; they are especially useful to match images depicting the same objects.

Currently, most systems that rely on both of these types of features need to separately adopt each of them using different models, which leads to redundant computations and lowers overall efficiency. To address this, we proposed DELG, a unified model for local and global image features.

The DELG model leverages a fully-convolutional neural network with two different heads: one for global features and the other for local features. Global features are obtained using pooled feature maps of deep network layers, which in effect summarize the salient features of the input images making the model more robust to subtle changes in input. The local feature branch leverages intermediate feature maps to detect salient image regions, with the help of an attention module, and to produce descriptors that represent associated localized contents in a discriminative manner.

Our proposed DELG model (left). Global features can be used in the first stage of a retrieval-based system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top, right), increasing the precision of the system.

This novel design allows for efficient inference since it enables extraction of global and local features within a single model. For the first time, we demonstrated that such a unified model can be trained end-to-end and deliver state-of-the-art results for instance-level recognition tasks. When compared to previous global features, this method outperforms other approaches by up to 7.5% mean average precision; and for the local feature re-ranking stage, DELG-based results are up to 7% better than previous work. Overall, DELG achieves 61.2% average precision on the recognition task of GLDv2, which outperforms all except two methods of the 2019 challenge. Note that all top methods from that challenge used complex model ensembles, while our results use only a single model.
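To make the two-stage use of global and local features more concrete, the following toy sketch shortlists index images by cosine similarity of global features and then re-ranks the shortlist by counting mutual nearest-neighbor matches between local descriptors. This is only an illustration under simplifying assumptions: the function names and the tiny hand-made feature arrays are hypothetical, and the plain match count stands in for whatever local matching and verification DELG actually performs.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors (along the last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def shortlist_by_global_features(query_global, index_globals, top_k):
    """Stage 1: indices of the top_k index images by cosine similarity."""
    sims = l2_normalize(index_globals) @ l2_normalize(query_global)
    return np.argsort(-sims)[:top_k]


def count_mutual_matches(query_local, candidate_local):
    """Stage 2 score: number of mutual nearest-neighbor descriptor pairs."""
    sims = l2_normalize(query_local) @ l2_normalize(candidate_local).T
    q_to_c = sims.argmax(axis=1)  # best candidate descriptor per query descriptor
    c_to_q = sims.argmax(axis=0)  # best query descriptor per candidate descriptor
    return int(np.sum(c_to_q[q_to_c] == np.arange(len(query_local))))


def retrieve(query_global, query_local, index_globals, index_locals, top_k=2):
    """Shortlist with global features, then re-rank by local match count."""
    shortlist = shortlist_by_global_features(query_global, index_globals, top_k)
    return sorted(
        shortlist.tolist(),
        key=lambda i: -count_mutual_matches(query_local, index_locals[i]),
    )


# Hand-made toy features for three index images (not real model outputs)
index_globals = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
index_locals = [
    np.array([[1.0, 0.0]]),              # image 0: one local descriptor
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # image 1: two local descriptors
    np.array([[0.0, 1.0]]),              # image 2: never shortlisted here
]
query_global = np.array([1.0, 0.0])
query_local = np.array([[1.0, 0.0], [0.0, 1.0]])

ranking = retrieve(query_global, query_local, index_globals, index_locals)
print(ranking)  # [1, 0]
```

In this toy case image 0 wins the global stage, but image 1 moves to the top after re-ranking because both of its local descriptors match the query.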

TensorFlow 2 Open-Source Codebase
To foster research reproducibility, we are also releasing a revamped open-source codebase that includes DELG and other techniques relevant to instance-level recognition, such as DELF and Detect-to-Retrieve. Our code adopts the latest TensorFlow 2 releases, and makes available reference implementations for model training and inference, as well as image retrieval and matching functionalities. We invite the community to use and contribute to this codebase in order to develop strong foundations for research in the ILR field.

New Challenges for Instance-Level Recognition
Focused on the landmarks domain, the Google Landmarks Dataset v2 (GLDv2) is the largest available dataset for instance-level recognition, with 5 million images spanning 200 thousand categories. By training landmark retrieval models on this dataset, we have demonstrated improvements of up to 6% mean average precision, compared to models trained on earlier datasets. We have also recently launched a new browser interface for visually exploring the GLDv2 dataset.

This year, we also launched two new challenges within the landmark domain, one focusing on recognition and the other on retrieval. These competitions feature newly-collected test sets, and a new evaluation methodology: instead of uploading a CSV file with pre-computed predictions, participants have to submit models and code that are run on Kaggle servers, to compute predictions that are then scored and ranked. The compute restrictions of this environment put an emphasis on efficient and practical solutions.

The challenges attracted over 1,200 teams, a 3x increase over last year, and participants achieved significant improvements over our strong DELG baselines. On the recognition task, the highest scoring submission achieved a relative increase of 43% average precision score and on the retrieval task, the winning team achieved a 59% relative improvement of the mean average precision score. This latter result was achieved via a combination of more effective neural networks, pooling methods and training protocols (see more details on the Kaggle competition site).

In addition to the landmark recognition and retrieval challenges, our academic and industrial collaborators discussed their progress on developing benchmarks and competitions in other domains. A large-scale research benchmark for artwork recognition is under construction, leveraging The Met’s Open Access image collection, and with a new test set consisting of guest photos exhibiting various photometric and geometric variations. Similarly, a new large-scale product retrieval competition will capture various challenging aspects, including a very large number of products, a long-tailed class distribution and variations in object appearance and context. More information on the ILR workshop, including slides and video recordings, is available on its website.

With this research, open source code, data and challenges, we hope to spur progress in instance-level recognition and enable researchers and machine learning enthusiasts from different communities to develop approaches that generalize across different domains.

The main Google contributors of this project are André Araujo, Cam Askew, Bingyi Cao, Jack Sim and Tobias Weyand. We’d like to thank the co-organizers of the ILR workshop Ondrej Chum, Torsten Sattler, Giorgos Tolias (Czech Technical University), Bohyung Han (Seoul National University), Guangxing Han (Columbia University), Xu Zhang (Amazon), collaborators on the artworks dataset Nanne van Noord, Sarah Ibrahimi (University of Amsterdam), Noa Garcia (Osaka University), as well as our collaborators from the Metropolitan Museum of Art: Jennie Choi, Maria Kessler and Spencer Kiser. For the open-source TensorFlow codebase, we’d like to thank recent contributors for their help: Dan Anghel, Barbara Fusinska, Arun Mukundan, Yuewei Na and Jaeyoun Kim. We are grateful to Will Cukierski, Phil Culliton and Maggie Demkin for their support with the landmarks Kaggle competitions. We’d also like to thank Ralph Keller and Boris Bluntschli for their help with data collection.


New Earth Simulator to Take on Planet’s Biggest Challenges

A new supercomputer under construction is designed to tackle some of the planet’s toughest life sciences challenges by speedily crunching vast quantities of environmental data.

The Japan Agency for Marine-Earth Science and Technology, or JAMSTEC, has commissioned tech giant NEC to build the fourth generation of its Earth Simulator. The new system, scheduled to become operational in March, will be based around SX-Aurora TSUBASA vector processors from NEC and NVIDIA A100 Tensor Core GPUs, all connected with NVIDIA Mellanox HDR 200Gb/s InfiniBand networking.

This will give it a maximum theoretical performance of 19.5 petaflops, putting it in the highest echelons of the TOP500 supercomputer ratings.

The new system will benefit from a multi-architecture design, making it suited to various research and development projects in the earth sciences field. In particular, it will act as an execution platform for efficient numerical analysis and information creation, coordinating data relating to the global environment.

Its work will span marine resources, earthquakes and volcanic activity. Scientists will gain deeper insights into cause-and-effect relationships in areas such as crustal movement and earthquakes.

The Earth Simulator will be deployed to predict and mitigate natural disasters, potentially minimizing loss of life and damage in the event of another natural disaster like the earthquake and tsunami that hit Japan in 2011.

Earth Simulator will achieve this by running large-scale simulations at high speed in ways previous generations of Earth Simulator couldn’t. The intent is also to have the system play a role in helping governments develop a sustainable socio-economic system.

The new Earth Simulator promises to deliver a multitude of vital environmental information. It also represents a quantum leap in terms of its own environmental footprint.

Earth Simulator 3, launched in 2015, offered a performance of 1.3 petaflops. It was a world beater at the time, outstripping Earth Simulators 1 and 2, launched in 2002 and 2009, respectively.

The fourth-generation model will deliver more than 15x the performance of its predecessor, while keeping the same level of power consumption and requiring around half the footprint. It’s able to achieve these feats thanks to major research and development efforts from NVIDIA and NEC.

The latest processing developments are also integral to the Earth Simulator’s ability to keep up with rising data levels.

Scientific applications used for earth and climate modelling are generating increasing amounts of data that require the most advanced computing and network acceleration to give researchers the power they need to simulate and predict our world.

NVIDIA Mellanox HDR 200Gb/s InfiniBand networking with in-network compute acceleration engines combined with NVIDIA A100 Tensor Core GPUs and NEC SX-Aurora TSUBASA provides JAMSTEC a world-leading marine research platform critical for expanding earth and climate science and accelerating discoveries.

The post New Earth Simulator to Take on Planet’s Biggest Challenges appeared first on The Official NVIDIA Blog.


Football tracking in the NFL with Amazon SageMaker

With the 2020 football season kicking off, Amazon Web Services (AWS) is continuing its work with the National Football League (NFL) on several ongoing game-changing initiatives. Specifically, the NFL and AWS are teaming up to develop state-of-the-art cloud technology using machine learning (ML) aimed at aiding the officiating process through real-time football detection. As a first step in this process, the Amazon Machine Learning Solutions Lab developed a computer vision model for the challenge of football detection. In this post, we provide in-depth examples including code snippets and visualizations to demonstrate the key components of the football detection pipeline, starting with data labeling and following up with training and deployment using Amazon SageMaker and Apache MXNet Gluon.

Detecting the football in NFL Broadcast videos

The following video illustrates detecting the football frame by frame.

Football Tracking

Computer vision-based object detection techniques use deep learning algorithms to predict the location of objects in images and videos. Today, object detection has many far-reaching, high business value use cases, such as in self-driving car technology, where detecting pedestrians and vehicles is of paramount importance in ensuring safety on the roads. For the NFL, object detection technology like this is crucial as the game continues to evolve at a rapid pace. For example, they can use real-time object identification to generate new advanced analytics around player and team performance, in addition to aiding game officials in ball spotting. This technology is part of the larger suite of innovations in the AWS/NFL partnership.

The following sections of the post outline how we used NFL broadcast video data to train object detection models that analyze thousands of images to locate and classify the football from background objects.

Creating an object detection dataset with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. With the help of Ground Truth, we created a custom object detection dataset by breaking the NFL play segments into images. It offers a user interface (UI) that allowed us to quickly spin up a bounding box labeling job, in which human annotators can quickly draw bounding box labels around thousands of football image sequences stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following screenshot illustrates the object detection UI.

The labeling job outputs a manifest file that contains the S3 file path of the image, the (x, y) bounding box coordinates, and the class label (for this use case, football) needed for the model training process.
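As a rough sketch of how such a manifest can be read, the following code parses one JSON line into an image path and a list of corner-format boxes. It assumes the commonly documented Ground Truth bounding box layout (a source-ref key, a label attribute holding annotations with left, top, width, and height, and a matching -metadata entry with a class-map). The label attribute name football-labeling-job, the S3 path, and the sample values are made up for illustration.

```python
import json


def parse_manifest_line(line, label_attribute):
    """Return (image_s3_path, boxes) for one line of an output manifest."""
    record = json.loads(line)
    class_map = record[f"{label_attribute}-metadata"]["class-map"]
    boxes = []
    for ann in record[label_attribute]["annotations"]:
        xmin, ymin = ann["left"], ann["top"]
        boxes.append({
            "label": class_map[str(ann["class_id"])],
            "xmin": xmin,
            "ymin": ymin,
            "xmax": xmin + ann["width"],   # convert width/height to corners
            "ymax": ymin + ann["height"],
        })
    return record["source-ref"], boxes


# Made-up manifest line; the attribute name depends on your labeling job
sample_line = json.dumps({
    "source-ref": "s3://<your-bucket>/frames/frame_0001.jpg",
    "football-labeling-job": {
        "annotations": [
            {"class_id": 0, "left": 100, "top": 50, "width": 20, "height": 12}
        ],
        "image_size": [{"width": 1280, "height": 720, "depth": 3}],
    },
    "football-labeling-job-metadata": {"class-map": {"0": "football"}},
})

image_path, boxes = parse_manifest_line(sample_line, "football-labeling-job")
print(image_path, boxes)
```

Converting to corner coordinates at parse time is a design choice for this sketch; many detection toolkits expect (xmin, ymin, xmax, ymax), but you should check what your training code requires.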

The following screenshot shows the manifest file that is automatically updated in the S3 path.

The following screenshot shows the contents of the output.manifest file.

Supercharging training with Apache MXNet Gluon and Amazon SageMaker Script Mode

Our approach to model development relied on an ML technique called transfer learning. In it, we take neural networks previously trained on similar applications with strong results and fine-tune these models on our annotated data. We converted the annotations from the labeling job to RecordIO format for compact storage and faster disk access. Neural networks have the tendency to overfit training data, leading to poor out-of-sample results. The MXNet Gluon toolkit we added provides image normalization and image augmentations, such as randomized image flipping and cropping, to help reduce overfitting during training.
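To illustrate why augmentation for object detection has to treat images and labels together, the following NumPy sketch performs a randomized horizontal flip and mirrors the bounding boxes accordingly. It is not the Gluon pipeline used in this post; the function name, the (xmin, ymin, xmax, ymax) box layout, and the sample values are assumptions for illustration.

```python
import numpy as np


def random_horizontal_flip(image, boxes, rng, probability=0.5):
    """Flip an H x W (x C) image left to right with the given probability.

    boxes is an (N, 4) array of (xmin, ymin, xmax, ymax) in pixel units.
    """
    if rng.random() >= probability:
        return image, boxes
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    flipped_boxes = boxes.copy()
    # Mirroring swaps the roles of the left and right box edges
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes


rng = np.random.default_rng(0)
image = np.arange(8).reshape(2, 4)        # tiny 2 x 4 "image"
boxes = np.array([[0.0, 0.0, 1.0, 2.0]])  # box hugging the left edge

# probability=1.0 forces the flip so the effect is easy to inspect
flipped_image, flipped_boxes = random_horizontal_flip(image, boxes, rng, probability=1.0)
print(flipped_boxes.tolist())  # [[3.0, 0.0, 4.0, 2.0]]
```

After the flip, the box that was on the left edge sits on the right edge, which is what keeps the label aligned with the football in the mirrored frame.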

Amazon SageMaker provides a simple UI to train object detection models with no code, offering the Single Shot Detector (SSD) pre-trained model with several out-of-the-box configurations. For a more customized architecture, we use Amazon SageMaker Script Mode, which allows you to bring your own training algorithms and directly train models while staying within the user-friendly confines of Amazon SageMaker. We could train larger, more accurate models directly from Amazon SageMaker notebooks by combining Script Mode with pre-trained models like YOLOv3 and Faster-RCNN with several backbone network combinations from the Gluon Model Zoo for object detection. See the following code:

import os
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

s3_output_path = "s3://<path to bucket where model weights will be saved>/"

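model_estimator = MXNet(
    entry_point="<your training script>.py",  # placeholder: Script Mode training script
    role=role,
    train_instance_count=1,  # value can be more than 1 for multi node training
    train_instance_type="<training instance type>",  # placeholder
    framework_version="<MXNet version>",  # placeholder
    py_version="py3",  # assumption: Python 3 training container
    output_path=s3_output_path,
    sagemaker_session=sagemaker_session,
    distributions={"parameter_server": {"enabled": True}},
    hyperparameters={"epochs": 15},
)

model_estimator.fit("s3://<bucket path for train and validation record-io files>/")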

Object detection algorithm: Background

Object detectors typically combine two key components: detection of objects in images and regression for estimating bounding box coordinates of objects. During training, object detectors are optimized to reduce both detection error and localization error (bounding box prediction error) via a loss function.

Current state-of-the-art object detectors contain deep learning architectures that use pre-trained convolutional neural networks (CNNs) like VGG-16 or ResNet-50 as base networks to perform rich feature extraction from input images. SSD predicts the relative offsets to a fixed set of boxes at every location of a convolutional feature map. Empirically, SSD underperforms other object detector algorithms on small objects like a football. In contrast, YOLOv3 uses DarkNet-53 for feature extraction, which concatenates multiple feature maps together to make predictions, leading to improved performance on smaller objects.

Faster-RCNN in comparison to both SSD and YOLOv3 uses an additional shared deep neural network to predict region proposals of the input image feature maps, which is aggregated in the feed downstream in the model for object classification and bounding box prediction. Faster-RCNN empirically outperformed other networks on small objects in our use case. One major consideration in addition to performance, when choosing object detectors, is model inference time. SSD and YOLOv3 tend to have fast inference times as measured in frames per second, which is a key consideration for real-time applications; larger networks like Faster-RCNN have slower inference time.

Hyperparameter optimization on Amazon SageMaker

A standard metric for evaluating object detectors is mean average precision (mAP). mAP is based on the model's precision-recall (PR) curve and provides a single number that can be compared directly across models. You can generate a PR curve by setting the model confidence score threshold to different levels, each of which yields a precision and recall pair. Plotting these pairs with a bit of interpolation results in a PR curve. Average precision (AP) is then defined as the area under this PR curve. When you want to detect multiple object classes in an image (K > 1 classes), mAP is the mean AP across all K classes.
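As a rough illustration of the metric described above, the following sketch computes AP as the area under an interpolated PR curve and mAP as the mean over classes. It assumes one common interpolation convention (precision at a given recall is the highest precision at any equal or higher recall); the function names, class names, and sample values are hypothetical, and evaluation toolkits may use a different convention.

```python
def average_precision(pr_pairs):
    """Area under an interpolated precision-recall curve.

    pr_pairs: list of (precision, recall) tuples, one per confidence threshold.
    """
    # Sort by recall and add end points at recall 0 and recall 1.
    points = sorted((recall, precision) for precision, recall in pr_pairs)
    recalls = [0.0] + [r for r, _ in points] + [1.0]
    precisions = [0.0] + [p for _, p in points] + [0.0]

    # Interpolate: precision at recall r becomes the max precision at recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangle areas between consecutive recall values.
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


def mean_average_precision(pr_pairs_by_class):
    """Mean of the per-class AP values (K classes)."""
    aps = [average_precision(pairs) for pairs in pr_pairs_by_class.values()]
    return sum(aps) / len(aps)


# Hypothetical PR pairs for two classes.
example = {
    "football": [(1.0, 0.5), (0.5, 1.0)],
    "player": [(1.0, 1.0)],
}
example_map = mean_average_precision(example)
```

With a single class, mAP reduces to AP for that class.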

Automatic model tuning in Amazon SageMaker, also known as hyperparameter optimization, allowed us to try over 100 models with unique parameter configurations to achieve the best model possible given our data. Hyperparameter optimization uses strategies like random search and Bayesian search to help tune hyperparameters in the ML algorithm. See the following sample code:

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

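hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "network": CategoricalParameter(["resnet50_v1b", "resnet101_v1d"]),
}

# Objective metric regex based on print statements in the training script
objective_metric_name = "Validation: "
metric_definitions = [{"Name": "Validation: ", "Regex": r"Validation: ([0-9\.]+)"}]

tuner = HyperparameterTuner(
    model_estimator,  # assumed: the estimator defined earlier
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=metric_definitions,
    # The remaining arguments (such as the number of tuning jobs) are truncated
    # in the source and are not reconstructed here.
)

tuner.fit("s3://<bucket path for train and validation record-io files>/")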

To do this, we specified the location of our data and manifest file on Amazon S3 and chose our Amazon SageMaker instance type and object detection algorithm to use (SSD with ResNet50). Amazon SageMaker hyperparameter optimization then launched several configurations of the base model with unique hyperparameter configurations, using Bayesian search to determine which configuration achieves the best model based on a preset test metric. In our case, we optimized towards the highest mean average precision (mAP) on our held-out test data. The following graph shows a visualization of a sample set of hyperparameter optimization jobs from the hyperparameter optimization tuner object.

Deploying the model

Deploying the model required only a few additional lines of code (hosting methods) within our Amazon SageMaker notebook instance. We can simply call tuner.deploy on our hyperparameter optimization tuner to deploy the best model based on the evaluation metric that was set for the hyperparameter optimization training job. The code below demonstrates a proof-of-concept deployment on Amazon SageMaker:

predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

The model weights for each training job are stored in Amazon S3. We can deploy any of the jobs or Amazon SageMaker trained models by passing its model artifact path to an Amazon SageMaker estimator object. To do this, we referred to the preconfigured container optimized to perform inference and linked it to the model weights. After this model-container pair was created on our account, we could configure an endpoint with the instance type and number of instances needed by the NFL. See the following code:

from sagemaker.mxnet.model import MXNetModel

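sagemaker_model = MXNetModel(
    model_data="s3://<path to training job model file>/model.tar.gz",
    # The remaining arguments (such as the role and inference script) are
    # truncated in the source and are not reconstructed here.
)

predictor = sagemaker_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge"
)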

Model inference for football detection

At runtime, a client sends a request to the endpoint hosting the model container on an Amazon Elastic Compute Cloud (Amazon EC2) instance, and the endpoint returns the output (inference). In production, this pipeline significantly simplifies scaling endpoints for large-scale inference on NFL broadcast videos.

Sensitivity and error analysis

When exploring strategies to improve model performance, you can scale up (use larger architectures) or scale out (acquire more data). After scaling up, which we discussed earlier during model exploration, data scientists commonly collect additional data in hopes of improving model generalizability. For our use case, we specifically aimed to reduce localization error. To do this, we created several test sets that helped us understand, quantitatively and qualitatively, how mAP relates to specific characteristics of the input video: occlusion of the football (high vs. low), bounding box size (small vs. large) and aspect ratio (tall vs. wide), camera angle (endzone vs. sideline), and contrast between the football and its background area (high vs. low). From this, we understood which qualitative aspects of the image the model was struggling with, and these findings led us to strategically gather additional data to target and improve upon these areas.


The NFL uses cloud computing to create innovative experiences that introduce additional ways for fans to enjoy football while making the game more efficient and fast-paced. By combining football detection with additional new technologies, the NFL can reduce game stoppage, support officiating, and bring real-time insight into what’s happening on the field, leading to a greater connection with the game that fans love. While fans take delight in “America’s game,” they can rest assured that the NFL in collaboration with AWS is utilizing the newest and best technologies to make the game more enjoyable with a broader range of data points.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, implementing HPO, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.

About the Authors

Michael Lopez is the Director of Football Data and Analytics at the National Football League and a Lecturer of Statistics and Research Associate at Skidmore College. At the National Football League, his work centers on how to use data to enhance and better understand the game of football.




Colby Wise is a Senior Data Scientist and manager at the Amazon Machine Learning Solutions Lab where he works with customers across different verticals to accelerate their use of machine learning and AWS cloud services to solve their business challenges.





Divya Bhargavi is a Data Scientist at the Amazon Machine Learning Solutions Lab where she develops machine learning models to address customers’ business problems. Most recently, she worked on Computer Vision solutions involving both classical and deep learning methods for a sports customer.

Bringing the Mona Lisa Effect to Life with TensorFlow.js

A guest post by Emily Xie, Software Engineer


Urban legend says that Mona Lisa’s eyes will follow you as you move around the room. This is known as the “Mona Lisa effect.” For fun, I recently programmed an interactive digital portrait that brings this phenomenon to life through your browser and webcam.

At its core, the project leverages TensorFlow.js, deep learning, and some image processing techniques. The general idea is as follows: first, we must generate a sequence of images of Mona Lisa’s head, with eyes gazing from the left to right. From this pool, we’ll continuously select and display a single frame in real-time based on the viewer’s location.

In this post, I’ll walk through the specifics of the project’s technical design and implementation.

Animating the Mona Lisa with Deep Learning

Image animation is a technique that allows one to puppeteer a still image through a driving video. Using a deep-learning-based approach, I was able to generate a highly convincing animation of Mona Lisa’s gaze.

Specifically, I used the First Order Motion Model (FOMM), released by Aliaksandr Siarohin et al. in 2019. At a very high level, this method is composed of two modules: one for motion extraction, and another for image generation. The motion module detects keypoints and local affine transformations from the driving video. Diffs of these values between consecutive frames are then used as input to a network that predicts a dense motion field, along with an occlusion mask which specifies the image regions that either need to be modified or contextually inferred. The image generation network, then, detects facial landmarks and produces the final output––the source image, warped and in-painted according to the results of the motion module.

I chose FOMM in particular because of its ease of use. Prior models in this domain had been “object-specific”, meaning that they required detailed data of the object to be animated, whereas FOMM operated agnostically to this. More importantly, the authors released an open-source, out-of-the-box implementation with pre-trained weights for facial animation. Because of this, applying the model to the Mona Lisa became a surprisingly straightforward endeavor: I simply cloned the repo into a Colab notebook, produced a short driving video of me with my eyes moving around, and fed it through the model along with a screenshot of La Gioconda’s head. The resulting movie was stellar. From this, I ultimately sampled just 33 images to constitute the final animation.

Example of a driving video and the image animation predictions generated by FOMM.
A subsample of the final animation frames, produced using the First Order Motion Model.

Image Blending

While I could’ve re-trained the model for my project’s purposes, I decided to work within the constraints of Siarohin’s weights in order to avoid the time and computational resources that would’ve been otherwise required. This, however, meant that the resulting frames were fixed at a lower resolution than desired, and consisted of just the subject’s head. But since I wanted the final visual to include the entirety of Mona Lisa––hands, torso, and background included––my plan was to simply superimpose the output head frames onto an image of the painting.

Mona Lisa
An example of a head frame overlaid on top of the underlying image. To best illustrate the problem, the version shown here is from an earlier iteration of the project where there was further resolution loss in the head frame.

This, however, produced its own set of challenges. If you look at the example above, you’ll notice that the lower-resolution output of the model––coupled with some subtle collateral background changes due to FOMM’s warping procedure––causes the head frame to visually jut out. In other words, it was a bit obvious that this was just a picture on top of another picture. To address this, I did some image processing in Python to “blend” the head image into the underlying one.

First, I resized the head frame back to its original resolution, which left its pixels blurred. From there, I created a new frame using a weighted average of these blurred-out pixels and the corresponding pixels in the underlying image, where the weight (or alpha) of a pixel in the head frame decreases as it moves away from the midpoint.

The function to determine alpha was adapted from a 2D sigmoid, and is expressed as:

Where j determines the logistic function’s slope, k is the inflection point, and m is the midpoint of the input values. Graphed out, the function looks like:

Function graph
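As a rough illustration of the blending step, the sketch below assumes one plausible form of the alpha function: a logistic function of a pixel's distance from the midpoint, using the same parameter names j, k, and m. This form, the function names, and the sample values are assumptions for illustration and may differ from the project's actual code.

```python
import math


def alpha(x, y, j, k, m):
    """Assumed alpha: a logistic function of the distance from the midpoint (m, m).

    j sets the slope, k is the distance at which alpha equals 0.5 (the
    inflection point), and m is the midpoint of the input values.
    """
    distance = math.hypot(x - m, y - m)
    return 1.0 / (1.0 + math.exp(j * (distance - k)))


def blend(head_pixel, base_pixel, a):
    """Weighted average of a head-frame pixel and the underlying pixel."""
    return a * head_pixel + (1.0 - a) * base_pixel


# Hypothetical values for a 256 x 256 head frame.
center_weight = alpha(128, 128, j=0.1, k=100, m=128)  # close to 1
corner_weight = alpha(0, 0, j=0.1, k=100, m=128)      # close to 0
```

Pixels near the midpoint keep almost all of the head frame's value, while pixels near the corners take almost all of the underlying image's value, which is what hides the seam.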

After I applied the above procedure to all 33 frames in the animation set, the resulting superimpositions each appeared to be a single image to the unsuspecting eye:

Tracking the Viewer’s Head via BlazeFace

All that was left at this point was to determine how to track the user via the webcam and display the corresponding frame.

Naturally, I turned to TensorFlow.js for the job. The library offered a fairly robust set of models to detect the presence of a human given visual input, but after some research and thinking, I landed on BlazeFace as my method of choice.

BlazeFace is a deep-learning-based object recognition model that detects human faces and facial landmarks. It is trained specifically for mobile camera input. This worked well for my use case, as I expected most viewers to be using their webcam in a similar manner (with their heads in frame, front-facing, and fairly close to the camera), whether through their mobile devices or on their laptops.

My foremost consideration in selecting this model, however, was its extraordinary speed of detection. To make this project convincing, I needed to be able to run the entire animation in real time, including the facial recognition step. BlazeFace adapts the single-shot detection (SSD) model, a deep-learning-based object detection algorithm that simultaneously proposes bounding boxes and detects objects in just one forward pass of the network. BlazeFace’s lightweight detector is capable of recognizing facial landmarks at speeds as fast as 200 frames per second.

A demo of what BlazeFace can capture given an input image: bounding boxes for a human head, along with facial landmarks.

Having settled on the model, I then wrote code to continually pipe the user’s webcam data into BlazeFace. On each run, the model outputted an array of facial landmarks and their corresponding 2D coordinate positions. Using this, I approximated the X coordinate of the face’s center by calculating the midpoint between the eyes.

Finally, I mapped this result to an integer between 0 and 32. These values, as you may recall, each represented a frame in the animated sequence––with 0 depicting Mona Lisa with her eyes to the left, and 32 with her eyes to the right. From there, it was just a matter of displaying the frame on the screen.
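The mapping from face position to frame number can be sketched as follows. The project itself runs in the browser with TensorFlow.js; this Python version only illustrates the arithmetic. The function names, the landmark format (an (x, y) pair per eye), and the linear mapping are assumptions, and the sketch does not account for whether the webcam image is mirrored.

```python
def face_center_x(right_eye, left_eye):
    """Approximate the face's horizontal center as the midpoint between the eyes."""
    return (right_eye[0] + left_eye[0]) / 2.0


def frame_index(center_x, video_width, num_frames=33):
    """Map an X coordinate in [0, video_width] to a frame number in [0, num_frames - 1]."""
    ratio = min(max(center_x / video_width, 0.0), 1.0)  # clamp to [0, 1]
    return min(int(ratio * num_frames), num_frames - 1)


# Hypothetical eye landmarks from one detection on a 640-pixel-wide video.
center = face_center_x((300, 200), (340, 205))
index = frame_index(center, video_width=640)
```

Clamping keeps the index valid even if a landmark is reported slightly outside the video bounds.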

Try it out!

You can play with the project at monalisaeffect.com. To follow more of my work, feel free to check out my personal website, Github, or Twitter.


Thanks to Andrew Fu for reading this post and providing me feedback, to Nick Platt for lending his ear and thoughts on a frontend bug, and to Jason Mayes along with the rest of the team at Google for their work in reaching out and amplifying this project.

Improved OCR and structured data extraction with Amazon Textract

Optical character recognition (OCR) technology, which enables extracting text from an image, has been around since the mid-20th century, and continues to be a research topic today. OCR and document understanding are still vibrant areas of research because they’re both valuable and hard problems to solve.

AWS has been investing in improving OCR and document understanding technology, and our research scientists continue to publish research papers in these areas. For example, the research paper Can you read me now? Content aware rectification using angle supervision describes how to tackle the problem of document rectification which is fundamental to the OCR process on documents. Additionally, the paper SCATTER: Selective Context Attentional Scene Text Recognizer introduces a novel way to perform scene text recognition, which is the task of recognizing text against complex image backgrounds. For more recent publications in this area, see Computer Vision.

Amazon scientists also incorporate these research findings into best-of-breed technologies such as Amazon Textract, a fully managed service that uses machine learning (ML) to identify text and data from tables and forms in documents—such as tax information from a W2, or values from a table in a scanned inventory report—and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring customization or human intervention.

One of the advantages of a fully managed service is the automatic and periodic improvement to the underlying ML models to improve accuracy. You may need to extract information from documents that have been scanned or pictured in different lighting conditions, a variety of angles, and numerous document types. As the models are trained using data inputs that encompass these different conditions, they become better at detecting and extracting data.

In this post, we discuss a few recent updates to Amazon Textract that improve the overall accuracy of document detection and extraction.

Currency symbols

Amazon Textract now detects a set of currency symbols (Chinese yuan, Japanese yen, Indian rupee, British pound, and US dollar) and the degree symbol with more precision without much regression on existing symbol detection.

For example, the following is a sample table in a document from a company’s annual report.

The following screenshot shows the output on the Amazon Textract console before the latest update.

Amazon Textract detects all the text accurately. However, the Indian rupee symbol is recognized as an “R” instead of “₹”. The following screenshot shows the output using the updated model.

The rupee symbol is detected and extracted accurately. Similarly, the degree symbol and the other currency symbols (yuan, yen, pound, and dollar) are now supported in Amazon Textract.

Detecting rows and columns in large tables

Amazon Textract released a new table model update that more accurately detects rows and columns of large tables that span an entire page. Overall table detection and extraction of data and text within tables has also been improved.

The following is an example of a table in a personal investment account statement.

The following screenshot shows the Amazon Textract output prior to the new model update.

Even though all the rows, columns, and text are detected properly, the output also contains empty columns. The original table didn’t have a clear separation for columns, so the model included extra columns.

The following screenshot shows the output after the model update.

The output is now much cleaner. Amazon Textract still extracts all the data accurately from this table and now includes the correct number of columns. Similar improvements can be seen in tables that span an entire page, with no columns omitted.

Improved accuracy in forms

Amazon Textract now has higher accuracy on a variety of forms, especially income verification documents such as pay stubs, bank statements, and tax documents. The following screenshot shows an example of such a form.

The preceding form is not of high-quality resolution. Regardless, you may have to process such documents in your organization. The following screenshot is the Amazon Textract output using one of the previous models.

Although the older model detected many of the check boxes, it didn’t capture all of them. The following screenshot shows the output using the new model.

With this new model, Amazon Textract accurately detected all the check boxes in the document.


The improvements to the currency symbols and the degree symbol detection will be launched in the Asia Pacific (Singapore) region on September 24th, 2020, followed over the next few days by the other regions where Amazon Textract is available. With the latest improvements to Amazon Textract, you can retrieve information from documents with more accuracy. Tables spanning the entire page are detected more accurately, currency symbols (yuan, yen, rupee, pound, and dollar) and the degree symbol are now supported, and key-value pairs and check boxes in financial forms are detected with more precision. To start extracting data from your documents and images, try Amazon Textract for yourself.

About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Preventing customer churn by optimizing incentive programs using stochastic programming

In recent years, businesses have increasingly looked for ways to integrate the power of machine learning (ML) into business decision-making. This post demonstrates the use case of creating an optimal incentive program to offer customers identified as being at risk of leaving for a competitor, or churning. It extends a popular ML use case, predicting customer churn, and shows how to optimize an incentive program to address the real business goal of preventing customer churn. We use a large phone company for our use case.

Although it’s usual to treat this as a binary classification problem, the real world is less binary: people become likely to churn for some time before they actually churn. Loss of brand loyalty occurs some time before someone actually buys from a competitor. There’s frequently a slow rise in dissatisfaction over time before someone is finally driven to act. Providing the right incentive at the right time can reset a customer’s satisfaction.

This post builds on the post Gain customer insights using Amazon Aurora machine learning. There we met a telco CEO and heard his concern about customer churn. In that post, we moved from predicting customer churn to intervening in time to prevent it. We built a solution that integrates Amazon Aurora machine learning with the Amazon SageMaker built-in XGBoost algorithm to predict which customers will churn. We then integrated Amazon Comprehend to identify the customer’s sentiment when they called customer service. Lastly, we created a naïve incentive to offer customers identified as being at risk at the time they called.

In this post, we focus on replacing this naïve incentive with an optimized incentive program. Rather than using an abstract cost function, we optimize using the actual economic value of each customer and a limited incentive budget. We use a mathematical optimization approach to calculate the optimal incentive to offer each customer, based on our estimate of the probability that they’ll churn, and the probability that they’ll accept our incentive to stay.

Solution overview

Our incentive program is intended to be used in a system such as that described in the post Gain customer insights using Amazon Aurora machine learning. For simplicity, we’ve built this post so that it can run separately.

We use a Jupyter notebook running on an Amazon SageMaker notebook instance. Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. In the Jupyter notebook, we first build and host an XGBoost model, identical to the one in the prior post. Then we run the optimization based on a given marketing budget and review the expected results.

Setting up the solution infrastructure

To set up the environment necessary to run this example in your own AWS account, follow Steps 0 and 1 in the post Simulate quantum systems on Amazon SageMaker to set up an Amazon SageMaker instance.

Then, as in Step 2, open a terminal. Enter the following command to copy the notebook to your Amazon SageMaker notebook instance:

wget https://aws-ml-blog.s3.amazonaws.com/artifacts/prevent_churn_by_optimizing_incentives/preventing_customer_churn_by_optimizing_incentives.ipynb

Alternatively, you can review a pre-run version of the notebook.

Building the XGBoost model

The first sections of the notebook (Setup, Data Exploration, Train, and Host) are the same as in the sample notebook Amazon SageMaker Examples – Customer Churn. The exception is that we capture a copy of the data for later use and add a column to calculate each customer’s total spend.

At the end of these sections, we have a running XGBoost model on an Amazon SageMaker endpoint. We can use it to predict which customers will churn.

Assessing and optimizing

In this section of the post and accompanying notebook, we focus on assessing the XGBoost model and creating our optimal incentive program.

We can assess model performance by looking at the prediction scores, as shown in the original customer churn prediction post, Amazon SageMaker Examples – Customer Churn.

So how do we calculate the minimum incentive that will give the desired result? Rather than providing a single program to all customers, can we save money and gain a better outcome by using variable incentives, customized to a customer’s churn probability and value? And if so, how?

We can do so by building on components we’ve already developed.

Assigning costs to our predictions

The costs of churn for the mobile operator depend on the specific actions that the business takes. One common approach is to assign costs, treating each customer’s prediction as binary: they churn or don’t, we predicted correctly or we didn’t. To demonstrate this approach, we must make some assumptions. We assign the true negatives the cost of $0. Our model essentially correctly identified a happy customer in this case, and we won’t offer them an incentive. An alternative is to assign the actual value of the customer’s spend to the true negatives, because this is the customer’s contribution to our overall revenue.

False negatives are the most problematic because they incorrectly predict that a churning customer will stay. We lose the customer and have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. Such costs typically run in the hundreds of dollars, so for this use case, we assume $500 to be the cost for each false negative. For a better estimate, our marketing department should be able to give us a value to use for the overhead, and we have the actual customer spend for each customer in our dataset.

Finally, we give an incentive to customers that our model identifies as churning. For this post, we assume a one-time retention incentive of $50. This is the cost we apply to both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we waste the concession. We probably could have spent those dollars more effectively, but it’s possible we increased the loyalty of an already loyal customer, so that’s not so bad. We revise this approach later in this post.

Mapping the customer churn threshold

In previous versions of this notebook, we’ve shown the effect of false negatives that are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we’ve used a cost function that looks like the following equation:

cost_of_replacing_customer * FN(C) + customer_value * TN(C) + incentive_offered * FP(C) + incentive_offered * TP(C)

FN(C) means that the false negative percentage is a function of the cutoff, C, and similarly for TN, FP, and TP. We want to find the cutoff, C, at which the result of the expression is smallest.

We start by using the same values for all customers, to give us a starting point for discussion with the business. With our estimates, this equation becomes the following:

$500 * FN(C) + $0 * TN(C) + $50 * FP(C) + $50 * TP(C)

A straightforward way to understand the impact of these numbers is to simply run a simulation over a large number of possible cutoffs. We test 100 possible values, and produce the following graph.

The following output summarizes our results:

Cost is minimized near a cutoff of: 0.21000000000000002 for a cost of: $ 25800 for these 1500 customers.
Incentive is paid to 246 customers, for a total outlay of $ 12300
Total customer spend of these customers is $ 16324.36

The preceding chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high (such as 0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large grey area, where perhaps some more nuanced incentives would create better outcomes.

The overall cost can be minimized at $25,800 by setting the cutoff to approximately 0.21, which is substantially better than the $100,000 or more we would expect to lose by not taking any action.

We can also calculate the dollar outlay of the program and compare it to the total spend of the customers. Here we can see that paying the incentive to all predicted churn customers costs $12,300, and that these customers spend $16,324.36. (Your numbers may vary, depending on the specific customers chosen for the sample.)
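The cutoff sweep described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the cost is summed over customer counts in each outcome category, uses the $500, $0, and $50 figures from this post as defaults, and runs on a tiny made-up dataset. The function and variable names are hypothetical.

```python
def cost_at_cutoff(scores, labels, cutoff,
                   cost_fn=500, cost_tn=0, cost_fp=50, cost_tp=50):
    """Total cost when customers with score > cutoff are treated as churners."""
    total = 0
    for score, churned in zip(scores, labels):
        predicted_churn = score > cutoff
        if predicted_churn and churned:
            total += cost_tp   # incentive paid to a real churner
        elif predicted_churn:
            total += cost_fp   # incentive paid to a happy customer
        elif churned:
            total += cost_fn   # missed churner, customer must be replaced
        else:
            total += cost_tn   # happy customer, no action
    return total


def best_cutoff(scores, labels, steps=100):
    """Evaluate evenly spaced cutoffs in [0, 1) and return the cheapest one."""
    cutoffs = [i / steps for i in range(steps)]
    costs = [cost_at_cutoff(scores, labels, c) for c in cutoffs]
    lowest = min(costs)
    return cutoffs[costs.index(lowest)], lowest


# Toy data (hypothetical): churn scores and whether each customer actually churned.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
cutoff, cost = best_cutoff(scores, labels)
```

When several cutoffs tie for the lowest cost, this sketch returns the smallest of them.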

What happens if we instead have a smaller budget for our campaign? We choose a budget of 1% of total customer monthly spend. The following output shows our results:

Total budget is: $895.90
Per customer incentive is $0.60

We can see that our cost changes. But it’s pretty clear that an incentive of approximately $0.60 is unlikely to change many people’s minds.
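The budget figures above follow from simple arithmetic, sketched below. It assumes the total customer spend of $89,589.73 for 1,500 customers that this post reports for the sample, and spreads the budget evenly across all customers; the variable names are illustrative.

```python
total_customer_spend = 89589.73  # total spend for the sample, as reported in this post
num_customers = 1500

budget = 0.01 * total_customer_spend             # 1% of total customer spend
per_customer_incentive = budget / num_customers  # spread evenly over all customers

print(f"Total budget is: ${budget:.2f}")
print(f"Per customer incentive is ${per_customer_incentive:.2f}")
```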

Can we do better? We could offer a range of incentives to customers that meet different criteria. For example, it’s worth more to the business to prevent a high spend customer from churning than a low spend customer. We could also target the grey area of customers that have less loyalty and could be swayed by another company’s advertising. We explore this in the following section.

Preventing customer churn using mathematical optimization of incentive programs

In this section, we use a more sophisticated approach to developing our customer retention program. We want to tailor our incentives to target the customers most likely to reconsider a churn decision.

Intuitively, we know that we don’t need to offer an incentive to customers with a low churn probability. Also, above some threshold, we’ve already lost the customer’s heart and mind, even if they haven’t actually left yet. So the best target for our incentive is between those two thresholds—these are the customers we can convince to stay.

The problem under investigation is inherently stochastic in that each customer might churn or not, and might accept the incentive (offer) or not. Stochastic programming [1, 2] is an approach for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real-world problems almost invariably include parameters that are unknown at the time a decision should be made. An example would be the construction of an investment portfolio to maximize return. An efficient portfolio would be defined as the portfolio that maximizes the expected return for a given amount of risk (such as standard deviation), or the portfolio that minimizes the risk subject to a given expected return [3].

Our use case has the following elements:

  • We know the number of customers, 𝑁.
  • We can use the customer’s current spend as the (upper bound) estimate of the profit they generate, P.
  • We can use the churn score from our ML model as an estimate of the probability of churn, alpha.
  • We use 1% of our total revenue as our campaign budget, C.
  • The probability that the customer is swayed, beta, depends on how convincing the incentive is to the customer, which we represent as 𝛾.
  • The incentive, c, is what we want to calculate.

We set up our inputs: P (profit), alpha (our churn probabilities, from our preceding model), and C, our campaign budget. We then define the function we wish to optimize, f(𝑐𝑖), which is the expected total profit across the 𝑁 customers.

Our goal is to optimally allocate the discount 𝑐𝑖 across the 𝑁 customers to maximize the expected total profit. Mathematically this is equivalent to the following optimization problem:

Now we can specify how likely we think each customer is to accept the offer and not churn—that is, how convincing they’ll find the incentive. We represent this as 𝛾 in the formulae.

Although this is a matter of business judgment, we can use the preceding graph to inform that judgment. In this case, the business believes that if the churn probability is below 0.55, they are unlikely to churn, even without an incentive; on the other hand, if the customer’s churn probability is above 0.95, the customer has little loyalty and is unlikely to be convinced. The real targets for the incentives are the customers with churn probability between 0.55–0.95.

We could include that business insight into the optimization by setting the value for the convincing factor 𝛾𝑖 as follows:

  • 𝛾𝑖 = 100. This is equivalent to giving less importance as a deciding factor to the discount for customers whose churn probability is below 0.55 (they are loyal and less likely to churn), or greater than 0.95 (they will most likely leave despite the retention campaign).
  • 𝛾𝑖 = 1. This is equivalent to saying that the probability customer i will accept the discount is equal to 𝛽 = 1 − 𝑒^(−𝑐𝑖) for customers with churn probability between 0.55 and 0.95.
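The threshold rule in the list above can be sketched in a few lines of Python. This is only an illustration, not the code behind the post: the function names are hypothetical, and the acceptance probability is assumed to take the form 1 − exp(−𝛾·c), which is one reading of the formula in the list.

```python
import numpy as np

def convincing_factor(alpha, low=0.55, high=0.95):
    """Assign gamma_i from the churn probability (thresholds taken from the text)."""
    alpha = np.asarray(alpha, dtype=float)
    in_target = (alpha >= low) & (alpha <= high)
    # gamma = 1 inside the target band, gamma = 100 outside it
    return np.where(in_target, 1.0, 100.0)

def acceptance_probability(c, gamma):
    """Assumed form (not confirmed by the post): beta = 1 - exp(-gamma * c)."""
    return 1.0 - np.exp(-np.asarray(gamma) * np.asarray(c, dtype=float))

alpha = np.array([0.10, 0.60, 0.90, 0.99])   # illustrative churn scores
gamma = convincing_factor(alpha)
beta = acceptance_probability(np.full(4, 0.5), gamma)
```

Under this assumed form, a large 𝛾 pushes the acceptance probability close to 1 for almost any discount, so the size of the discount matters much less for customers outside the target band.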

When we start to offer these incentives, we can log whether or not each customer accepts the offer and remains a customer. With that information, we can learn this function from experience, and use that learned function to develop the next set of incentives.

Solving the optimization problem

A variety of open-source solvers are available that can solve this optimization problem for us. Examples include SciPy’s scipy.optimize.minimize, or faster open-source solvers like GEKKO, which is what we use for this post. For large-scale problems, we would recommend using commercial optimization solvers like CPLEX or Gurobi.
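As a rough sketch of what such a solve could look like with scipy.optimize.minimize (the post itself uses GEKKO), consider the following. Everything here is an assumption made for illustration: the objective function, the acceptance probability 1 − exp(−𝛾·c), the customer data, and the choice of the SLSQP method are not taken from the post.

```python
import numpy as np
from scipy.optimize import minimize

def expected_profit(c, P, alpha, gamma):
    """Hypothetical objective: spend net of discount, weighted by the stay probability."""
    beta = 1.0 - np.exp(-gamma * c)       # assumed acceptance probability
    stay = (1.0 - alpha) + alpha * beta   # stays anyway, or would churn but is swayed
    return float(np.sum(stay * (P - c)))

P = np.array([40.0, 55.0, 70.0, 30.0, 90.0])      # illustrative monthly spend
alpha = np.array([0.20, 0.60, 0.80, 0.90, 0.97])  # illustrative churn scores
gamma = np.where((alpha >= 0.55) & (alpha <= 0.95), 1.0, 100.0)
C = 0.01 * P.sum()                                # budget: 1% of total revenue

c0 = np.full(P.size, C / P.size)                  # start from a uniform split
result = minimize(
    lambda c: -expected_profit(c, P, alpha, gamma),  # maximize by minimizing the negative
    c0,
    method="SLSQP",
    bounds=[(0.0, C)] * P.size,                      # each incentive is non-negative
    constraints=[{"type": "ineq", "fun": lambda c: C - np.sum(c)}],  # stay within budget
)
c_opt = result.x
```

The inequality constraint keeps the total incentive within the budget C, and the bounds keep each incentive non-negative. A different objective or solver would change the allocation, so treat the result as a shape check rather than a recommendation.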

After the optimization task has run, we check how much of our budget has been allocated.

The total spend is $1,000.00 compared to our budget of $1,000.00, and the total customer spend is $89,589.73 for 1,500 customers.

Now we evaluate the expected total profit for the following scenarios:

  1. Optimal discount allocation, as calculated by our optimization algorithm
  2. Uniform discount allocation: every customer is offered the same incentive
  3. No discount

The following graph shows our outcomes.

Expected total profit compared to no campaign: 17%

Expected total profit compared to uniform discount allocation: 5%

Lastly, we add the discount to our customer data. We can see how the discount we offer varies across our customer base. The red vertical line shows the value of the uniform discount. The pattern of discounts we offer closely mirrors the pattern in the prediction scores, where many customers aren’t identified as likely churners, and a few are identified as highly likely to churn.

We can also see a sample of the discounts we’d be offering to individual customers. See the following table.

For each customer, we can see their total monthly spend and the optimal incentive to offer them. We can see that the discount varies by churn probability, and we’re assured that the incentive campaign fits within our budget.

Depending on the size of the total budget we allocate, we may occasionally find that we’re offering all customers a discount. This discount allocation problem reminds us of the water-filling algorithm in wireless communications [4,5], where the problem is of maximizing the mutual information between the input and the output of a channel composed of several subchannels (such as a frequency-selective channel, a time-varying channel, or a set of parallel subchannels arising from the use of multiple antennas at both sides of the link) with a global power constraint at the transmitter. More power is allocated to the channels with higher gains to maximize the sum of data rates or the capacity of all the channels. The solution to this class of problems can be interpreted as pouring a limited volume of water into a tank, the bottom of which has the stair levels determined by the inverse of the subchannel gains.
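For readers unfamiliar with water-filling, the following is a minimal sketch of the textbook version, which maximizes the sum of log(1 + gain × power) over subchannels under a total power constraint. The function name and the example gains are illustrative only.

```python
import numpy as np

def water_filling(gains, total_power):
    """Classic water-filling: maximize sum(log(1 + g_i * p_i)) with sum(p_i) = total_power."""
    floors = 1.0 / np.asarray(gains, dtype=float)   # stair levels: inverse subchannel gains
    order = np.sort(floors)
    # Try using the k best subchannels, starting with all of them.
    for k in range(len(order), 0, -1):
        level = (total_power + order[:k].sum()) / k  # candidate water level
        if level > order[k - 1]:                     # every used floor is under water
            break
    # Power per subchannel is the depth of water above its floor.
    return np.maximum(level - floors, 0.0)

power = water_filling([2.0, 1.0, 0.1], total_power=1.0)
```

In this example the weakest subchannel sits above the water level and receives no power, which mirrors the observation that only some customers receive an incentive when the budget is small.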

Unfortunately, our problem doesn’t have an intuitive interpretation like the water-filling problem does. Because of the nature of the objective function, the system of equations and inequalities corresponding to the KKT conditions [6] doesn’t admit a closed-form solution.

The optimal incentives calculated here are the result of an optimization routine designed to maximize an economic figure, which is the expected total profit. Although this approach provides a principled way for marketing teams to make systematic, quantitative, and analytics-driven decisions, it’s also important to recall that the objective function to be optimized is a proxy measure to the actual total profit. It goes without saying that we can’t compute the actual profit based on future decisions (this would paradoxically imply maximizing the actual return based on future values of the stocks). But we can explore new ideas using techniques such as the potential outcomes work [7], which we could use to design strategies for back-testing our solution.


We’ve now taken another step towards preventing customer churn. We built on a prior post in which we integrated our customer data with our ML model to predict churn. We can now experiment with variations on this optimization equation, and see the effect of different campaign budgets or even different theories of how they should be modeled.

To gather more data on effective incentives and customer behavior, we could also test several campaigns against different subsets of our customers. We can collect their responses—do they churn after being offered this incentive, or not—and use that data in a future ML model to further refine the incentives offered. We can use this data to learn what kinds of incentives convince customers with different characteristics to stay, and use that new function within this optimization.

We’ve empowered marketing teams with the tools to make data-driven decisions that they can quickly turn into action. This approach can drive fast iterations on incentive programs, moving at the speed with which our customers make decisions. Over to you, marketing!



[1] S. Uryasev and P. M. Pardalos. Stochastic Optimization: Algorithms and Applications. Kluwer Academic, Norwell, MA, USA, 2001.

[2] J. R. Birge and F. V. Louveaux. Introduction to Stochastic Programming. Springer Verlag, New York, 1997.

[3] J. C. Francis and D. Kim. Modern Portfolio Theory: Foundations, Analysis, and New Developments (Vol. 795). John Wiley & Sons, 2013.

[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.

[5] D. P. Palomar and J. R. Fonollosa. “Practical algorithms for a family of water-filling solutions.” IEEE Trans. Signal Process., vol. 53, no. 2, pp. 686–695, Feb. 2005.

[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] G. W. Imbens and D. B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

About the Authors

Marco Guerriero, PhD, is a Practice Manager for Emergent Technologies and Intelligence Platform for AWS Professional Services. He loves working on ways for emergent technologies such as AI/ML, Big Data, IoT, Quantum, and more to help businesses across different industry verticals succeed within their innovation journey.




Veronika Megler, PhD, is Principal Data Scientist for Amazon.com Consumer Packaging. Until recently, she was the Principal Data Scientist for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively. Her work has lately been focused more heavily on economic impacts of ML models and exploring causality.



Oliver Boom is a London-based consultant for the Emerging Technologies and Intelligent Platforms team at AWS. He enjoys solving large-scale analytics problems using big data, data science, and DevOps, and loves working at the intersection of business and technology. In his spare time, he enjoys language learning, music production, and surfing.




Dr Sokratis Kartakis is a UK-based Data Science Consultant for AWS. He works with enterprise customers to help them adopt and productionize innovative Machine Learning (ML) solutions at scale solving challenging business problems. His focus areas are ML algorithms, ML Industrialization, and AI/MLOps. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.




Selecting the right metadata to build high-performing recommendation models with Amazon Personalize

In this post, we show you how to select the right metadata for your use case when building a recommendation engine using Amazon Personalize. The aim is to help you optimize your models to generate more user-relevant recommendations. We look at which metadata is most relevant to include for different use cases, and where you may get better results by excluding other metadata. We also highlight a specific use case from Pulselive, one of our customers that recently used Amazon Personalize to enhance the recommendation capabilities of their customer’s websites, resulting in a 20% increase in video consumption.

Introducing Amazon Personalize

Amazon Personalize is a managed service that enables you to improve customer engagement by powering personalized product and content recommendations, and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create high-quality recommendations that you can use to personalize your user experience across digital channels such as websites, applications, and email systems. You can get started without any prior ML experience using simple APIs to easily build sophisticated personalization capabilities in just a few clicks. Amazon Personalize automatically processes and examines your metadata, identifies what is meaningful, allows you to pick an ML algorithm, and trains and optimizes a custom model based on your metadata. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

Let’s dive into how important that metadata is to get a performant model.

The role metadata selection plays in recommendations

The goal of metadata selection in recommendation engines is to select the right data to aid the training algorithm to discover valuable information about the similarities in user preferences and behavior, in addition to the properties and similarity of the items you’re trying to recommend through the engine. The ultimate goal is to provide a personalized experience, uniquely tailored for each user, and present them with the items that are the most relevant to them.

Nowadays, there are so many sources of data that a company could potentially use to capture user behavior and understand which items to present to them that it has become challenging to accurately select which metadata to consider and which to ignore. Irrespective of the use case, a commercial website can use large amounts of data about every aspect of each user’s behavior on the website, such as which items they’re frequently interacting with (watching a video or ordering an item), how long they spend on each item’s page, or even how erratic or smooth the movement of their cursor is while scrolling through a page. All this information can reveal a lot about a user’s preferences and what would be the ideal items to recommend to them.

There are two main categories of approaches to recommendation engines: collaborative filtering and content-based filtering.

Collaborative filtering compares the behavior of the users with each other and tries to calculate the similarity between them to find shared interests. Therefore, the recommendation engine knows that if user A has very similar behavior to user B, then user A would likely be interested in some of the items that user B has interacted with, and vice versa.
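To make the idea concrete, here is a toy sketch of user-to-user similarity using cosine similarity on a small interaction matrix. This is not how Amazon Personalize works internally; the data and function names are invented for illustration.

```python
import numpy as np

# Rows are users, columns are items; 1 means the user interacted with the item.
interactions = np.array([
    [1, 1, 0, 1],   # user A
    [1, 1, 1, 1],   # user B
    [0, 0, 1, 0],   # user C
], dtype=float)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend_from_neighbor(user, interactions):
    """Suggest items that the most similar other user touched and `user` has not."""
    sims = [cosine_similarity(interactions[user], row) if i != user else -1.0
            for i, row in enumerate(interactions)]
    neighbor = int(np.argmax(sims))
    unseen = (interactions[neighbor] > 0) & (interactions[user] == 0)
    return neighbor, np.flatnonzero(unseen).tolist()

neighbor, items = recommend_from_neighbor(0, interactions)
```

Here user A’s closest neighbor is user B, so the one item B has interacted with that A has not becomes the candidate recommendation.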

Content-based filtering looks at the actual items the users interact with. If a user has interacted with items A and B, and item C is very similar to A and B, then item C will likely be of interest to the user.
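A toy sketch of the content-based idea follows: unseen items are scored by how similar their features are to the items a user has already interacted with. The feature vectors and function name are hypothetical and only serve to illustrate the principle.

```python
import numpy as np

# Hypothetical item features: [is_interview, is_match_highlight, features_player_x]
item_features = {
    "A": np.array([1.0, 0.0, 1.0]),
    "B": np.array([1.0, 0.0, 0.0]),
    "C": np.array([1.0, 0.0, 1.0]),
    "D": np.array([0.0, 1.0, 0.0]),
}

def rank_unseen_items(seen, item_features):
    """Rank unseen items by cosine similarity to the average of the seen items."""
    profile = np.mean([item_features[i] for i in seen], axis=0)
    scores = {}
    for item, vec in item_features.items():
        if item not in seen:
            scores[item] = float(
                profile @ vec / (np.linalg.norm(profile) * np.linalg.norm(vec))
            )
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_unseen_items({"A", "B"}, item_features)
```

Item C shares features with A and B and ranks first, while item D has nothing in common with them and ranks last.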

We also have hybrid models that use both user behavior and item-related data to find the underlying patterns that reveal the ideal items to recommend to each user.

Each method requires a different approach to metadata selection because they require different types of data to be collected and used for the training. For example, building collaborative filtering engines requires data related to the behavior of users on the website, whereas building a content-based engine requires more data related to the items (item-specific metadata and which users interacted with which items). A hybrid solution requires data related to both the users and the items.

As a general rule, authenticated experiences work best. When your users have personal accounts that they log in to, you can provide them with a more personalized experience tailored to their needs because you can easily track and record every aspect of their behavior (along with additional metadata), whereas it’s harder to track anonymous or guest users and map them to their previous sessions.

The problems that can occur if metadata selection isn’t done right

If metadata selection isn’t done correctly, it can potentially lead to poor recommendations that are either too generic (showing most users the most popular and commonly interacted products) or not relevant (showing items that are completely irrelevant to the unique user).

When too much information is included in training a recommendation model, it can lead to noise in the model. Metadata that has no correlation with user preferences but was included in training skews the model and makes it harder for the algorithm to find the valuable underlying patterns that allow for a successful recommender system.

This can also apply to the depth (amount of history) of the data that is used to train a model. Perhaps relevant metadata has been selected, but the freshness of the data in many cases is a stronger indicator of relevance—the most recent metadata is more relevant than historical data for the same kind of interactions. This is because user behavior and preferences vary over time and people’s interests can change rather quickly; therefore, presenting a user with a recommendation that was considered relevant to them a few months ago doesn’t guarantee that the recommendation is relevant to them today. This is why it’s important to keep your recommender system up to date with current user behavior.
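One simple way to act on this is to drop interactions older than a cutoff before training. The sketch below assumes interaction records are dictionaries with a Unix-epoch TIMESTAMP field; the 90-day window is an arbitrary choice for illustration, not a recommendation from this post.

```python
from datetime import datetime, timedelta, timezone

def recent_interactions(interactions, now, max_age_days=90):
    """Keep only interactions newer than the cutoff (90 days is an arbitrary choice)."""
    cutoff = (now - timedelta(days=max_age_days)).timestamp()
    return [row for row in interactions if row["TIMESTAMP"] >= cutoff]

now = datetime(2020, 10, 1, tzinfo=timezone.utc)
rows = [
    {"USER_ID": "u1", "ITEM_ID": "v1",
     "TIMESTAMP": datetime(2020, 9, 20, tzinfo=timezone.utc).timestamp()},
    {"USER_ID": "u1", "ITEM_ID": "v2",
     "TIMESTAMP": datetime(2020, 1, 5, tzinfo=timezone.utc).timestamp()},
]
fresh = recent_interactions(rows, now)  # only the September interaction remains
```

The right window depends on how quickly preferences shift in your domain, so it is worth comparing models trained on several history depths.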

Conversely, if too little information is included, the recommendation model under-performs. If you don’t include valuable information that can aid the performance of the model, the recommendation model makes suboptimal suggestions.

A wrong approach to metadata selection can make it harder for the algorithms to find the underlying patterns that connect users and items. This means that the recommendations that the users are presented with aren’t personalized as expected.

Terminology of recommendation engines

To introduce the topic further, let’s dive into some of the terminology associated with Amazon Personalize:

  • Datasets and dataset groups – Datasets contain the data used to train a recommendation model. You can use different dataset groups to serve different purposes. For example, separate applications, with their own users and items, can have their own dataset groups.
  • Recipes and solutions – Amazon Personalize uses recipes, which are the combination of the learning algorithm with the hyperparameters and datasets used. Training a model with different recipes leads to different results. A resulting trained model is referred to as a solution version.
  • Campaigns – A deployed solution version is known as a campaign. A campaign allows Amazon Personalize to make recommendations for your users.

Metadata types are dictated in the datasets used to train a model. In the following section, we look at how to do that.

Selecting metadata

Amazon Personalize uses different recipes that are aimed towards either of the two main categories of recommendation engines—collaborative filtering and content-based filtering—and also the hybrid methods. For more information about pre-defined recipes, see Choosing a Recipe.

No matter which recipe you choose to work with, Amazon Personalize has three main types of datasets that it can use to build models (solutions), and each is related to one of the following categories:

  • Users
  • Items
  • Interactions

The users and items dataset types are known as metadata types, and are only used by certain recipes. As their names imply, their metadata has unique fields that describe each individual user or item. User metadata could be age, gender, and geography. Typical item metadata is color, category, shape, and price in the case of physical products, or content category, ratings, and genre if the type of item we’re trying to recommend is a video or movie.

The interactions metadata is the direct interactions of a user with an item, which is usually the most revealing information for the relationship between users and items. Some examples of interactions data can be clicks (user A clicked on item X), purchases (user actually purchased an item), amount of time spent on an item’s webpage, the addition of an item to a user’s wishlist, or even the fact that the user hovered their cursor for a few milliseconds more than usual over a certain item.

The minimum number of interactions Amazon Personalize expects in order to start making recommendations is 1,000 interactions from a minimum of 25 users. User and item metadata datasets are optional, and their importance depends on your use case and the algorithm (recipe) you’re using.
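A quick pre-flight check along these lines can save a failed import. The helper below is hypothetical and simply encodes the two thresholds stated above; it assumes interaction records are dictionaries with a USER_ID key.

```python
def meets_minimum(interactions, min_interactions=1000, min_users=25):
    """Check the thresholds quoted in the text: 1,000 interactions from at least 25 users."""
    users = {row["USER_ID"] for row in interactions}
    return len(interactions) >= min_interactions and len(users) >= min_users

# Synthetic example: 1,200 interactions spread across 30 users
sample = [{"USER_ID": f"user-{i % 30}", "ITEM_ID": f"item-{i % 7}"} for i in range(1200)]
ready = meets_minimum(sample)
```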

The following screenshot shows the Datasets page on the Amazon Personalize console.

What data types are supported by each category?

Each dataset has a set of required fields, reserved keywords, and required datatypes, as shown in the following table.

Dataset Type | Required Fields | Reserved Keywords
Users | USER_ID (string), one metadata field |
Items | ITEM_ID (string), one metadata field |
Interactions | USER_ID (string), ITEM_ID (string) | EVENT_TYPE (string), EVENT_VALUE (float, null)

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Each dataset type has specific requirements. Schemas in Amazon Personalize are defined in the Avro format.

The following example code shows an interactions schema. The EVENT_TYPE and EVENT_VALUE fields are optional, and are reserved keywords recognized by Amazon Personalize. LOCATION and DEVICE are optional contextual metadata fields.

{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "USER_ID",
          "type": "string"
      },
      {
          "name": "ITEM_ID",
          "type": "string"
      },
      {
          "name": "EVENT_TYPE",
          "type": "string"
      },
      {
          "name": "EVENT_VALUE",
          "type": "float"
      },
      {
          "name": "LOCATION",
          "type": "string",
          "categorical": true
      },
      {
          "name": "DEVICE",
          "type": "string",
          "categorical": true
      },
      {
          "name": "TIMESTAMP",
          "type": "long"
      }
  ],
  "version": "1.0"
}
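If you prefer to generate the schema rather than write it by hand, a small helper can build the same structure and serialize it to JSON. This is a convenience sketch; the function name is hypothetical, and the field list simply mirrors the example schema above.

```python
import json

def build_interactions_schema(categorical_fields=()):
    """Build an interactions schema dict with optional categorical context fields."""
    fields = [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
    ]
    # Contextual metadata fields such as LOCATION or DEVICE are marked categorical
    fields += [{"name": name, "type": "string", "categorical": True}
               for name in categorical_fields]
    fields.append({"name": "TIMESTAMP", "type": "long"})
    return {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": fields,
        "version": "1.0",
    }

schema_json = json.dumps(build_interactions_schema(["LOCATION", "DEVICE"]), indent=2)
```

Generating the schema this way makes it easy to try variants with different contextual fields while keeping the JSON valid.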

Creating a schema using the AWS Python SDK

To create a schema using the AWS Python SDK, complete the following steps:

  1. Define the Avro format schema that you want to use.
  2. Save the schema in a JSON file (schema.json in this example) in the working directory of your Python script.
  3. Create the schema using the following code:
import boto3

personalize = boto3.client('personalize')

# Read the Avro schema definition saved in step 2
with open('schema.json') as f:
    createSchemaResponse = personalize.create_schema(
        name = 'YourSchema',
        schema = f.read()
    )

schema_arn = createSchemaResponse['schemaArn']

print('Schema ARN:' + schema_arn )

Amazon Personalize returns the ARN of the new schema.

  4. Store the ARN for later use.

Filtering your metadata

Amazon Personalize allows you to experiment with building different models (or solutions) based on different metadata by enabling you to filter records from your interactions dataset and set a threshold for each event type, or simply select and leave out certain event types. You can filter records from an interactions dataset in two ways:

  • Set a threshold to exclude records based on a specific value by specifying an event value in your recipe. If the records include a value that is associated with a specific event—for example, the price a user paid is associated with the purchase of an item—you can set a specific value in a recipe as a threshold to exclude records from training. The amount is called an event value.
  • Exclude records of a certain type by specifying an event type in your recipe. A dataset often includes specific types of activities, for example, purchase, click, or wishlisted. These are called event types. To include only records for specific event types in training, filter your dataset by event type in your recipe.
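Before training, it can help to preview locally which records a given filter would keep. The sketch below is a hypothetical helper, not part of Amazon Personalize. It assumes records are dictionaries with EVENT_TYPE and EVENT_VALUE keys, and that records with a value greater than or equal to the threshold are kept.

```python
def preview_filter(records, event_type=None, event_value_threshold=None):
    """Return the records that would remain after filtering by type and value."""
    if event_value_threshold is not None and event_type is None:
        # A threshold on its own is not a valid combination
        raise ValueError("event_value_threshold requires event_type")
    kept = []
    for record in records:
        if event_type is not None and record["EVENT_TYPE"] != event_type:
            continue
        if (event_value_threshold is not None
                and record["EVENT_VALUE"] < event_value_threshold):
            continue
        kept.append(record)
    return kept

records = [
    {"EVENT_TYPE": "purchase", "EVENT_VALUE": 25.0},
    {"EVENT_TYPE": "purchase", "EVENT_VALUE": 4.0},
    {"EVENT_TYPE": "click", "EVENT_VALUE": 0.0},
]
kept = preview_filter(records, event_type="purchase", event_value_threshold=10)
```

Checking how many records survive a filter is a cheap way to confirm you still have enough interactions to train on.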

To filter your metadata, call the CreateSolution API. To specify an event type, for example purchase, set it in the eventType parameter. To specify an event value, for example 10, set it in the eventValueThreshold parameter. You can specify an eventType alone, both an eventType and an eventValueThreshold, or neither, but you can’t specify an eventValueThreshold on its own. See the following code:

import boto3
personalize = boto3.client('personalize')

# dataset_group_arn and recipe_arn are assumed to be defined earlier

# Create the solution, filtering on purchase events with an event value threshold of 10
create_solution_response = personalize.create_solution(
    name = "your-solution-name",
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
    eventType = "purchase",
    solutionConfig = {
        "eventValueThreshold": "10"
    }
)

# Store the solution ARN
solution_arn = create_solution_response['solutionArn']

# Use the solution ARN to get the solution status
solution_description = personalize.describe_solution(solutionArn = solution_arn)['solution']
print('Solution status: ' + solution_description['status'])

When selecting metadata for a recommendation engine, it’s helpful to ask the following questions to help guide your decisions:

  • What is likely to be the strongest indicator of a good recommendation—similar users, similar items, or their combined interactions? This can help determine which metadata to select and tag in the datasets. As described, the interactions dataset is the minimum that Amazon Personalize expects, so you have to choose wisely which types of interactions (or events) you want to capture. A combination of interactions and metadata is typically recommended, but choosing which types of interactions to record is important.
  • What is the temporal value of the data? Is old data less potent? How much less? How can you use real-time APIs with real-time data to get the most relevant recommendations that reflect the users’ change of preferences over time?
  • Which metrics best show whether the recommendation engine is working well? Can you align Amazon Personalize metrics with your own KPIs? Can you construct an A/B test with live customers?

The answers to these questions can be a good guide to improve the recommendation system.

Applying metadata selection: Pulselive use case

In a recent engagement with Pulselive, an AWS customer that builds and hosts solutions for large sports organizations, we were asked to aid them in prototyping a personalized recommendation engine for one of their customers, a renowned European football club, to suggest videos to the visitors of their website according to their preferences and past behavior. Their goal was to use all the data they could to provide the website’s visitors with a tailored, highly personalized experience by recommending videos relevant to each user to increase engagement with the content.

Our initial approach was to use some of their existing recorded historical data to extract the minimum required information needed to start building Amazon Personalize solutions that can recommend the right videos to the right users. Therefore, the metadata we initially selected was the simplest form of user-video interactions—clicks—from a historical dataset of which users had clicked on which videos and at what time. We started with 30,000 user interactions.

That allowed us to build a baseline solution that used that information to evaluate the relevance of each video to each user and considered it as our starting point. The next goal was to enrich the dataset by selecting the right metadata and observing the impact that the new models had on user engagement.

At this point, we have to mention that it’s somewhat challenging to predict how well the recommendation system will do when deployed into production. Amazon Personalize provides some standard out-of-the-box metrics when a model has finished training to give you an idea of how well it did at recommending the most relevant items higher on the recommendations list (such as having a high precision or coverage). But you can only evaluate the true impact on your customers when deploying the system into production.

Pulselive chose to do A/B testing, comparing the results of their existing recommendation methods to those produced from an Amazon Personalize campaign. They started with redirecting 5% of their traffic through the Amazon Personalize campaign. After seeing good results, they eventually rolled out to 50% of the traffic being redirected to Amazon Personalize. For more information, see Increasing engagement with personalized online sports content.
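A common way to implement this kind of gradual rollout is to assign each user to a stable bucket and route a percentage of buckets to the new system. The sketch below is an assumption about how such a split could be done, not a description of Pulselive’s implementation; the function name is hypothetical.

```python
import hashlib

def in_treatment(user_id, percent):
    """Deterministically route `percent`% of users to the new recommender."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return bucket < percent

users = [f"user-{i}" for i in range(10000)]
share_at_5 = sum(in_treatment(u, 5) for u in users) / len(users)
```

Because the bucket depends only on the user ID, a user who saw the new recommendations at 5% keeps seeing them when the rollout grows to 50%, which keeps the two groups comparable over time.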

Regarding metadata selection, we quickly realized that the users and items in the initial historical dataset weren’t very recent, and most of their IDs didn’t correspond to users and items that had recent activity on their production website.

Luckily, apart from an initial historical dataset, Amazon Personalize can also enrich its models in real time by allowing you to feed in interaction data from your live website. Through the use of the Amazon Personalize PutEvents API, you can record any action users take on the website and feed it into Amazon Personalize in near-real time, updating the model with the most recent user behavior and preferences. This is an important capability because it’s natural for user preferences to change over time, and you don’t want to risk presenting them with items that are either out of date or not relevant to them anymore.
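The following sketch shows roughly what recording a click could look like with the boto3 personalize-events client. The helper names, the tracking ID, and the property key are placeholders, and the mapping of property keys to your schema fields should be checked against the Amazon Personalize documentation.

```python
import json
from datetime import datetime, timezone

def build_event(item_id, event_type, properties=None, sent_at=None):
    """Build one entry for the eventList parameter of PutEvents."""
    event = {
        "eventType": event_type,
        "itemId": item_id,
        "sentAt": sent_at or datetime.now(timezone.utc),
    }
    if properties:
        # Extra event metadata is passed as a JSON-encoded string
        event["properties"] = json.dumps(properties)
    return event

def send_event(tracking_id, user_id, session_id, event):
    """Send the event to Amazon Personalize (requires AWS credentials)."""
    import boto3  # imported here so build_event can be used without AWS access
    client = boto3.client("personalize-events")
    return client.put_events(
        trackingId=tracking_id,
        userId=user_id,
        sessionId=session_id,
        eventList=[event],
    )

event = build_event(
    "video-123", "click",
    properties={"device": "mobile"},  # hypothetical contextual field
    sent_at=datetime(2020, 10, 1, tzinfo=timezone.utc),
)
```

Calling send_event with the tracking ID of your event tracker, a user ID, and a session ID would submit the event; it is left uncalled here because it needs live AWS resources.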

This also means that you can directly connect Amazon Personalize to your website, with no historical data or any models trained, and start feeding in events. After a while, Amazon Personalize has gathered enough data to start making accurate recommendations. For more information, see Recording Events.

We spent some time discussing what other relevant user behavior metadata we could capture, and decided to start recording some to observe whether these would result in a more accurate recommendation system that would impact user engagement on the site. Two simple measures for this were seeing if recommended videos were more frequently visited and watched for longer periods.

We started recording the source of the clicks (recommended list vs. other links in the website), the amount of time a user spent on a clicked video in seconds, and the percentage of the video that time represented (because it’s different for someone to spend 1 minute on a 20-minute video, compared to spending the same time on a 1-minute video to watch it in its entirety). These additions proved to be very important because after a while, user engagement started improving. We discussed and investigated providing more detailed information about user behavior on the website, but decided to pay more attention to the metadata.
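The percentage-watched measure is simple to compute. The helper below is a hypothetical illustration of the calculation, capped at 100% in case a user replays part of a video.

```python
def watched_fraction(seconds_watched, video_length_seconds):
    """Fraction of a video watched, capped at 1.0."""
    if video_length_seconds <= 0:
        raise ValueError("video_length_seconds must be positive")
    return min(seconds_watched / video_length_seconds, 1.0)

long_video = watched_fraction(60, 20 * 60)   # 1 minute of a 20-minute video
short_video = watched_fraction(60, 60)       # 1 minute of a 1-minute video
```

The same 60 seconds of viewing represents 5% of the long video but the whole of the short one, which is why the fraction carries more signal than raw watch time.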

Items metadata was important because it allowed Amazon Personalize to have more context on the nature of each video. This ranged from general and broad video categories, such as interviews and games, to more specific categories, such as “Leagues” and “Friendly games,” to more specific metadata, such as which players are featured in a video. Adding metadata about the content for each video significantly improved the personalized recommendations because the solution had a notion of context that helped determine what type of content each user preferred to watch.

Equally, on the user metadata side, more detailed information was provided, trying to capture the demographics and preferences of each user. Of course, in the case of the users, we had to deal with the cold-start problem (new users or guest users for which the system didn’t have any information yet). Luckily, the Amazon Personalize HRNN-Coldstart recipe proved to be very effective in solving this problem by quickly linking the new user’s behavior to that of existing users. The more time a guest or new user spends on the platform, the more Amazon Personalize understands about their preferences and adjusts its recommendations accordingly.

We had many options of what type of metadata to include in the interactions dataset, but it’s important to make sure we only use relevant metadata, and we had to pay attention to the balance between providing too much information to a model and providing too little.

For example, we considered recording the movement of each user’s cursor on the website and sending this to Amazon Personalize as well, which in theory could provide a marginal improvement to the performance of the recommendation system. But doing so proved to be expensive and taxing both on the front end (it impacted website performance) and the back end (the volume of data the system had to record, store, and send to Amazon Personalize significantly increased). Therefore, after careful consideration, we decided that cursor movement metadata wasn’t worth keeping.

After a few months, Pulselive rolled out the Amazon Personalize-based recommendation system to nearly half of their customer’s website visitors, and saw that that group’s engagement with their videos increased by 20%.


Recommendation engines can provide more pertinent results to users based on metadata about a user’s historical selections, or on the types of items of interest.

In this post, we looked at how to select the right metadata to get the best results when training a recommendation engine on Amazon Personalize by evaluating which metadata to include and which to exclude. We also looked at a specific use case and how an AWS customer, Pulselive, increased engagement with videos on their customer’s website by providing personalized recommendations to users.

For more information on creating recommendation engines with Amazon Personalize and metadata selection, see the following:

About the Authors

Andrew Hood is a Prototyping Engagement Manager at AWS.





Ion Kleopas is an ML Prototyping Architect at AWS.


Streamline modeling with Amazon SageMaker Studio and the Amazon Experiments SDK

The modeling phase is a highly iterative process in machine learning (ML) projects, where data scientists experiment with various data preprocessing and feature engineering strategies, intertwined with different model architectures, which are then trained with disparate sets of hyperparameter values. This highly iterative process with many moving parts can, over time, manifest into a tremendous headache in terms of keeping track of the design decisions applied in each iteration and how the training and evaluation metrics of each iteration compare to the previous versions of the model.

While your head may be spinning by now, fear not! Amazon SageMaker has a solution!

This post walks you through an end-to-end example of using Amazon SageMaker Studio and the Amazon SageMaker Experiments SDK to organize, track, visualize, and compare our iterative experimentation with a Keras model. Although this use case is specific to the Keras framework, you can extend the same approach to other deep learning frameworks and ML algorithms.

Amazon SageMaker is a fully managed service, created with the goal of democratizing ML by empowering developers and data scientists to quickly and cost-effectively build, train, deploy, and monitor ML models.

What Is Amazon SageMaker Experiments?

Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you effortlessly organize, track, compare, and evaluate your ML experiments. Before we dive into the hands-on exercise, let’s first take a step back and review the building blocks of an experiment and their referential relationships. The following diagram illustrates these building blocks.

Figure 1. The building blocks of Amazon SageMaker Experiments

Amazon SageMaker Experiments is composed of the following components:

  • Experiment – An ML problem that we want to solve. Each experiment consists of a collection of trials.
  • Trial – An iteration of a data science workflow related to an experiment. Each trial consists of several trial components.
  • Trial component – A stage in a given trial. For instance, as we see in our example, we create one trial component for the data preprocessing stage and one trial component for model training. In a similar fashion, we can also add a trial component for any data postprocessing.
  • Tracker – A mechanism that records various metadata about a particular trial component, including any parameters, inputs, outputs, artifacts, and metrics. A tracker can be linked to a particular trial component to assign the collected metadata to it.
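
To make these referential relationships concrete, here is a minimal sketch of the hierarchy in plain Python. These dataclasses are purely illustrative and are not the SDK's own classes; the epochs value is a hypothetical placeholder, while the other names are taken from the example later in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrialComponent:
    """One stage of a trial, plus the metadata a tracker recorded for it."""
    display_name: str
    parameters: Dict[str, float] = field(default_factory=dict)
    inputs: Dict[str, str] = field(default_factory=dict)
    outputs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trial:
    """One iteration of the workflow; holds several trial components."""
    name: str
    components: List[TrialComponent] = field(default_factory=list)


@dataclass
class Experiment:
    """The ML problem being solved; holds a collection of trials."""
    name: str
    trials: List[Trial] = field(default_factory=list)


# A trial with one preprocessing stage and one training stage
preprocessing = TrialComponent("Pre-processing", parameters={"train_test_split": 0.8})
training = TrialComponent("Training", parameters={"epochs": 10})  # hypothetical value

experiment = Experiment("predict-abalone-age")
experiment.trials.append(Trial("abalone-trial-0", [preprocessing, training]))

for trial in experiment.trials:
    print(trial.name, [component.display_name for component in trial.components])
```

The real SDK stores these entities in the Amazon SageMaker service rather than in memory, but the containment structure is the same: an experiment holds trials, and a trial holds trial components.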

Now that we’ve set a rock-solid foundation on the key building blocks of the Amazon SageMaker Experiments SDK, let’s dive into the fun hands-on component.


Prerequisites

You should have an AWS account and a sufficient level of access to create resources in the AWS services used in this post.

Solution overview

As part of this post, we walk through the following high-level steps:

  1. Environment setup
  2. Data preprocessing and feature engineering
  3. Modeling with Amazon SageMaker Experiments
  4. Training and evaluation metric exploration
  5. Environment cleanup

Setting up the environment

We can set up our environment in a few simple steps:

  1. Clone the source code from the GitHub repo, which contains the complete demo, into your Amazon SageMaker Studio environment.
  2. Open the included Jupyter notebook and choose the Python 3 (TensorFlow 2 CPU Optimized) kernel.
  3. When the kernel is ready, install the sagemaker-experiments package, which enables us to work with the Amazon SageMaker Experiments SDK, and the s3fs package, which enables our pandas dataframes to easily integrate with objects in Amazon S3.
  4. Import all required packages and initialize the variables.

The following screenshot shows the environment setup.

Figure 2. Environment Setup

Data preprocessing and feature engineering

Excellent! Now, let’s dive into data preprocessing and feature engineering. In our use case, we use the abalone dataset from the UCI Machine Learning Repository.

Run the steps in the provided Jupyter notebook to complete all data preprocessing and feature engineering. After your data is preprocessed, it’s time for us to seamlessly capture our preprocessing strategy! Let’s create an experiment with the following code:

sm = boto3.client('sagemaker')
ts = datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')

# Create the experiment; the timestamp suffix keeps the name unique
abalone_experiment = Experiment.create(
    experiment_name = 'predict-abalone-age-' + ts,
    description = 'Predicting the age of an abalone based on a set of features describing it',
    sagemaker_boto_client=sm)

Now, we can create a Tracker to describe the Pre-processing Trial Component, including the location of the artifacts:

with Tracker.create(display_name='Pre-processing', sagemaker_boto_client=sm, artifact_bucket=sm_bucket, artifact_prefix=artifacts_path) as tracker:
    # Record the preprocessing parameters, data locations, and fitted preprocessors
    tracker.log_parameters({
        'train_test_split': 0.8
    })
    tracker.log_input(name='raw data', media_type='s3/uri', value=source_url)
    tracker.log_output(name='preprocessed data', media_type='s3/uri', value=processed_data_path)
    tracker.log_artifact(name='preprocessors', media_type='s3/uri', file_path='preprocessors.pickle')
processing_component = tracker.trial_component

Fantastic! We now have our experiment ready and we’ve already done our due diligence to capture our data preprocessing strategy. Next, let’s dive into the modeling phase.

Modeling with Amazon SageMaker Experiments

Our Keras model has two fully connected hidden layers with a variable number of neurons and variable activation functions. This flexibility enables us to pass these values as arguments to a training job and quickly parallelize our experimentation with several model architectures.

We have mean squared logarithmic error defined as the loss function, and the model is using the Adam optimization algorithm. Finally, the model tracks mean squared logarithmic error as our metric, which automatically propagates into our training trial component in our experiment, as we see shortly:

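def model(x_train, y_train, x_test, y_test, args):
    """Generate a simple model"""
    model = Sequential([
                Dense(args.l1_size, activation=args.l1_activation, kernel_initializer='normal'),
                Dense(args.l2_size, activation=args.l2_activation, kernel_initializer='normal'),
                Dense(1, activation='linear')
    ])

    # Adam optimizer, with mean squared logarithmic error as both loss and tracked metric
    model.compile(optimizer='adam',
                  loss='mean_squared_logarithmic_error',
                  metrics=['mean_squared_logarithmic_error'])

    model.fit(x_train, y_train, batch_size=args.batch_size, epochs=args.epochs, verbose=1)

    return model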
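
For readers less familiar with the loss used here, the following is a minimal sketch of mean squared logarithmic error in plain Python, using the standard definition mean((log(1 + y_true) - log(1 + y_pred))²). It is not Keras's implementation, which also clips predictions before taking the logarithm, and the sample values are made up for illustration.

```python
import math


def mean_squared_log_error(y_true, y_pred):
    """Return mean((log(1 + y_true) - log(1 + y_pred)) ** 2)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return sum(squared_errors) / len(squared_errors)


# Made-up ages and predictions, for illustration only
age_true = [9.0, 10.0, 7.0]
age_pred = [9.0, 12.0, 6.0]
print(mean_squared_log_error(age_true, age_pred))
```

Because the error is computed on logarithms, it penalizes relative rather than absolute differences, which suits a strictly positive target such as age.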

Fantastic! Follow the steps in the provided notebook to define the hyperparameters for experimentation and instantiate the TensorFlow estimator. Finally, let’s start our training jobs and supply the names of our experiment and trial via the experiment_config dictionary:

# Passed as the experiment_config argument when starting each training job
experiment_config = {
    'ExperimentName': abalone_experiment.experiment_name,
    'TrialName': abalone_trial.trial_name,
    'TrialComponentDisplayName': 'Training',
}
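
The hyperparameter variations mentioned above can be enumerated with a small grid, one entry per trial. The sketch below is only an illustration: the values in the search space are hypothetical (the notebook defines its own), while the argument names mirror those the model function reads from args.

```python
from itertools import product

# Hypothetical search space; the notebook defines its own values
l1_sizes = [8, 16]
l2_activations = ["relu", "tanh"]

trial_hyperparameters = {}
for index, (l1_size, l2_activation) in enumerate(product(l1_sizes, l2_activations)):
    trial_name = f"abalone-trial-{index}"
    trial_hyperparameters[trial_name] = {
        "l1_size": l1_size,
        "l1_activation": "relu",   # held fixed in this sketch
        "l2_size": 8,              # held fixed in this sketch
        "l2_activation": l2_activation,
        "batch_size": 32,
        "epochs": 10,
    }

for name, params in trial_hyperparameters.items():
    print(name, params)
```

Two values for each of two hyperparameters yield four combinations, so each combination maps to one trial and one training job.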

Exploring the training and evaluation metrics

Upon completion of the training jobs, we can quickly visualize how different variations of the model compare in terms of the metrics collected during model training. For instance, let’s see how the loss has been decreasing by epoch for each variation of the model and observe the model architecture that is most effective in decreasing the loss:

  1. Choose the Amazon SageMaker Experiments List icon on the left sidebar.
  2. Choose your experiment to open it and press Shift to select all four trials.
  3. Choose any of the highlighted trials (right-click) and choose Open in trial component list.
  4. Press Shift to select the four trial components representing the training jobs and choose Add chart.
  5. Choose New chart and customize it to plot the collected metrics that you want to analyze. For our use case, choose the following:
    1. For Data type, choose Time series.
    2. For Chart type, choose Line.
    3. For X-axis dimension, choose epoch.
    4. For Y-axis, choose loss_TRAIN_last.

Figure 3. Generating plots based on the collected model training metrics

Wow! How quick and effortless was that?! I encourage you to further explore plotting various other metrics on your own. For instance, you can choose the Summary data type to generate a scatter plot and explore if there is a relationship between the size of the first hidden layer in your neural network and the mean squared logarithmic error. See the following screenshot.

Figure 4. Plot of the relationship between the size of the first hidden layer in the neural network and Mean-Squared Logarithmic Error during model evaluation

Next, let’s choose our best-performing trial (abalone-trial-0). As expected, we see two trial components. One represents our data Pre-processing, and the other reflects our model Training. When we open the Training trial component, we see that it contains all the hyperparameters, input data location, Amazon S3 location of this particular version of the model, and more.

Figure 5. Metadata about model training, automatically collected by Amazon SageMaker Experiments

Similarly, when we open the Pre-processing component, we see that it captures where the source data came from, where the processed data was stored in Amazon S3, and where we can easily find our trained encoder and scalers, which we’ve packaged into the preprocessors.pickle artifact.

Figure 6. Metadata about data pre-processing and feature engineering, automatically collected by Amazon SageMaker Experiments

Cleaning up

What a fun exploration this has been! Let’s now clean up after ourselves by running the cleanup function provided at the end of the notebook to hierarchically delete all elements of the experiment that we created in this post:
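
The notebook's own function is not reproduced here. As a minimal sketch of what hierarchical deletion involves, the function below removes and deletes each trial component, then each trial, then the experiment. It assumes objects that expose list_trials, list_trial_components, remove_trial_component, and delete methods, as the Experiments SDK entities do; load_trial and load_component are hypothetical callables that you supply to load an entity by name.

```python
import time


def cleanup_experiment(experiment, load_trial, load_component, pause=0.5):
    """Delete trial components, then trials, then the experiment itself."""
    for trial_summary in experiment.list_trials():
        trial = load_trial(trial_summary.trial_name)
        for component_summary in trial.list_trial_components():
            component = load_component(component_summary.trial_component_name)
            # A component must be detached from the trial before it can be deleted
            trial.remove_trial_component(component)
            try:
                component.delete()
            except Exception:
                # Assumption: the component is still attached to another trial,
                # so it is left in place until that trial releases it
                continue
            time.sleep(pause)  # brief pause between successive delete calls
        trial.delete()
    experiment.delete()
```

With the sagemaker-experiments package, load_trial could plausibly be a small wrapper around Trial.load and load_component a wrapper around TrialComponent.load, each passing the sm client; treat those exact calls as an assumption and check them against the SDK version you have installed.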



You have now learned to seamlessly track the design decisions that you made during data preprocessing and model training, as well as rapidly compare and analyze the performance of various iterations of your model by using the tracked metrics of the trials in your experiment.

I hope that you enjoyed diving into the intricacies of the Amazon SageMaker Experiments SDK and exploring how Amazon SageMaker Studio smoothly integrates with it, enabling you to lose yourself in experimentation with your ML model without losing track of the hard work you’ve done! I highly encourage you to leverage the Amazon SageMaker Experiments Python SDK in your next ML engagement and I invite you to consider contributing to the further evolution of this open-sourced project.

About the Author

Ivan Kopas is a Machine Learning Engineer for AWS Professional Services, based out of the United States. Ivan is passionate about working closely with AWS customers from a variety of industries and helping them leverage AWS services to spearhead their toughest AI/ML challenges. In his spare time, he enjoys spending time with his family, working out, hanging out with friends and diving deep into the fascinating realms of economics, psychology and philosophy.


