September 2020 – Page 3

Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)

Posted by Konstantinos (Gus) Katsiapis on behalf of the TFX Team

Abstract

Software Engineering, as a discipline, has matured over the past 5+ decades. The modern world heavily depends on it, so the increased maturity of Software Engineering was an eventuality. Practices like testing and reliable technologies help make Software Engineering reliable enough to build industries upon. Meanwhile, Machine Learning (ML) has also grown over the past 2+ decades. ML is used more and more for research, experimentation and production workloads. ML now commonly powers widely-used products integral to our lives.

But ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering?

In this article we will give a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering. We argue that in order to unlock the gains ML can bring, organizations should advance the maturity of their ML teams by investing in robust ML infrastructure and promoting ML Engineering education. We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations. In closing, we will also share a glimpse into the future of TFX.

Where We Are Coming From

Applied ML has been an integral part of Google products and services over the last decade, and is becoming more so over time. We discovered early on from our endeavors to apply ML in production that while ML algorithms are important, they are usually insufficient in realizing the successful application of ML in a product. In particular, E2E ML platforms, which help with all aspects of the ML lifecycle, are usually needed to both accelerate ML adoption and make its use durable and sustainable.

Sibyl (2007 – 2020)

E2E ML platforms are not a new thing at Google. Sibyl, founded in 2007, was a platform that enabled massive-scale ML, catered to production use. Sibyl offered a decent amount of modeling flexibility on top of “wide” models (linear, logistic, poisson regression and later factorization machines) coupled with non-linear transformations and customizable loss functions and regularization. Importantly, Sibyl also offered tools for several aspects of the ML workflow including Data Ingestion, Data Analysis and Validation, Training (of course), Model Analysis, and Training-Serving Skew Detection. All these were packaged as a single integrated product that allowed for iterative experimentation. This holistic product offering, coupled with the Sibyl team’s user focus, rendered Sibyl to, once upon a time, be one of the most widely used E2E ML platforms at Google. Sibyl has since been decommissioned. It was in production for ~14 years, and the vast majority of its workloads migrated to TFX.

TFX (2017 – ?)

While several of us were still working on Sibyl, a notable revolution was happening in the ML algorithms fields with the popularization of Deep Learning (DL). In 2015, Google publicly released TensorFlow (which was itself a successor to a previous system called DistBelief). Since its inception, TensorFlow supported a variety of applications with a focus on DL training and inference. Its flexible programming model allowed it to be used for a lot more than DL and its popularity in both research and production positioned it as the lingua franca for authoring ML algorithms. While TensorFlow offered flexibility, it lacked a complete end-to-end production system. On the other hand, Sibyl had robust end-to-end capabilities, but lacked flexibility. It became apparent that we needed an E2E ML platform for TensorFlow in order to accelerate ML at Google; in 2017, nearly a decade after the birth of Sibyl, we launched TFX within Google. TFX is now the most widely used, general purpose E2E ML platform at Alphabet, including Google.

In the 3 years since its launch, TFX has enabled Alphabet to realize what might be described as “industrial-scale” ML: TFX is used by thousands of users within Alphabet, and it powers hundreds of popular Alphabet products, including Cloud AI services on Google Cloud Platform (GCP). On any given day there are thousands of TFX pipelines running, which are processing exabytes of data and producing tens of thousands of models, which in turn are performing hundreds of millions of inferences per second. TFX’s widespread adoption helps Alphabet realize the flow of research into production and enables very diverse use cases for both direct and indirect TFX users. This widespread adoption also enables teams to focus on model development rather than ML platform development, allowing ML to be more easily used in novel product areas, and creating a virtuous cycle of ML platform evolution from ML applications.

Based on our internal success, and the expectation that equivalents of ML engineering will be needed by organizations and individuals everywhere in the world, we decided to publicly describe the design and initial deployment of TFX within Google and to, step by step, make more of our learnings and our technology publicly available (including open source), while we continue building more of each. We were able to accomplish this in part because, like Sibyl, TFX built upon robust infrastructural dependencies. For example, Sibyl made heavy use of MapReduce and its successor Flume for its distributed data processing, and now TFX heavily uses their portable successor, Apache Beam, for the same.

Following in TensorFlow’s footsteps, the public TFX offering was released in early 2019 and widely adopted in under a year across environments including on-premises and GCP with Cloud AI Platform Pipelines. Some of our partners have also publicly shared their use cases powered by TFX, including how it radically improved their applied ML velocity.

Lessons From Our 10+ Year Journey Of ML Platform Evolution

Though the journey of ML Platform(s) evolution at Google has been a long and exciting one, we expect that the majority of excitement is yet to come! To that end, we want to share a summary of our learnings, some of which were more painfully gained than others. The learnings fall into two categories, namely what remained the same as part of the evolution, but also what changed, and why! We present the learnings in the context of two successive platforms, Sibyl and TFX, though we believe them to be widely applicable.

What Remains The Same And Why

The areas discussed in this section capture a few examples of things that seem enduring and pass the test of time. As such, we expect these to also remain applicable in the future, across different incarnations of ML platforms and frameworks. We look at these from both an applied ML perspective and an infrastructure perspective.

Applied ML

The Rules Of Machine Learning

Successfully applying ML to a product is very much a discipline. It involves a steep learning curve and necessitates some mental model shifts (or perhaps augmentations). To make this challenging task easier, we have publicly shared The Rules of Machine Learning. These are rules that represent learnings from iteratively applying ML to a lot of products at Google. Notably, the adoption of ML in Google products illustrates a common evolution:

Start with simple rules and heuristics, and generate data to learn from; this journey usually starts from the serving side.
Move to simple ML (i.e., simple models) and realize large gains; this is usually the entry point for introduction of ML pipelines.
Move to ML with more features and more advanced models to realize decent gains.
Move to state-of-the-art ML, manage refinement and complexity (for solutions to the problems that are worth it), and realize small gains.
Apply the above launch-and-iterate cycle to more aspects of products and to solve more problems, bearing in mind return on investment (and diminishing returns).

We have found The Rules of Machine Learning to be steadfast across platforms and time and we hope they end up being as valuable to others as they have been to us and our users. In particular, we believe that following the rules will help others be better at the discipline of ML engineering, including helping them avoid the mistakes that we and our users have made in the past. TFX is an attempt to codify these rules, quite literally, in code. We hope to benefit ourselves but also accelerate ML, done well, for the entire industry.

The Discipline Of ML Engineering

In developing The Rules of Machine Learning, we realized that the discipline for building robust systems where the core logic is produced by complex processes involving both code and data requires additional scrutiny beyond that which software engineering provides. As such, we define ML Engineering as a superset of the discipline of software engineering designed to handle the unique complexities of the practical application of ML.

Attempting to summarize the totality of the discipline of ML engineering would be somewhat difficult, if not impossible, especially given how our understanding of it is still limited, and the discipline itself continues to evolve. We do take solace in the following though:

The limited understanding we do have seems to be enduring across platforms and time.
Analogy can be a powerful tool, so several aspects of the better understood discipline of software engineering have helped us draw parallels of how ML engineering could evolve from ML programming, much like how software engineering evolved from programming.

An early realization we had was the following: artifacts are first class citizens in ML, on par with the processes that produce and consume them.

This realization affected the implementation and evolution of Sibyl; it was entrenched in TFX by the time we publicly wrote about it and was ultimately generalized and formalized in ML Metadata, now powering TFX.

Below we present fundamental elements of ML engineering, some examples of ML artifacts and their first class citizenship, and make an attempt to draw analogies with software engineering where possible.

Data

Similarly to how code is at the heart of software, data is at the heart of ML. Data management represents serious challenges in production ML. Perhaps the simplest analogy would be to think about what constitutes a unit test for data. Unit tests verify expectations on how code should behave, by testing the contracts of the pertinent code and instilling trustworthiness in said contracts. Similarly, setting explicit expectations on the form of the data (including its schema, invariants and value distributions), and checking that the data agrees with implicit expectations embedded in the training code can, more so together, make the data trustworthy enough to train models with. Though unit tests can be exhaustive and verify strong contracts, data contracts are in general a lot weaker even if they are necessary. Though unit tests can be exhaustively consumed and verified by humans, data can usually be meaningful to humans only in summarized fashion.

Just as code repositories and version control are pillars for managing code evolution in software engineering, systems for managing data evolution and understanding are pillars of ML engineering.

TFX’s ExampleGen, StatisticsGen, SchemaGen and ExampleValidator components help one treat data as first class citizens, by enabling data management, analysis and validation in (continuous) ML pipelines.

Models

Similarly to how a software engineer produces code that is compiled into programs, an ML engineer produces data and code which is “compiled” into ML programs, more commonly known as models. These two kinds of programs are however very different in nature. Though programs that come out of software usually have strong contracts, models have much weaker contracts. These weak contracts are usually statistical in nature and as such only verifiable in some summarized form (such as a model having sufficient accuracy on a subset of labeled data). This is not at all surprising since models are the product of code and data, and the latter itself doesn’t have strong contracts and is also only digestible in summarized form.

Just as code and data evolve over time, models also evolve over time. However, model evolution is more complicated than the evolution of its constituent code and data. For example, high test coverage (with fuzzing) can give good confidence in both the correctness and the correct evolution of a piece of code, but out-of-distribution and counterfactual yet realistic data for model evaluation can be notoriously difficult to produce.

In the same way that putting together multiple programs in a system necessitates integration testing which is a pillar of software engineering, putting together code and data necessitates end-to-end model validation and understanding which is a pillar of ML engineering.

TFX’s Evaluator and InfraValidator components provide validation and understanding of models, treating them as first class citizens of ML engineering.

Mergeable Fragments

Similarly to how a software engineer merges together pre-existing libraries (or systems) with their code in order to build useful programs, an ML engineer merges together code fragments, data fragments, analysis fragments and model fragments on a regular basis in order to build useful ML pipelines. A notable difference between software engineering and ML engineering is that even when the code is fixed for the latter, data is usually volatile for it (e.g. new data arrives on a regular basis) and as such the downstream artifacts need to be produced frequently and efficiently. For example, a new version of a model usually needs to be produced if any part of its input data has changed. As such, it is important for ML pipelines to produce artifacts that are mergeable. For example, a summary of statistics from one dataset should be easily mergeable with that of another dataset such that it is easy to summarize the statistics of the union of the two datasets. Similarly, it should be easy to transfer the learnings of one model to another model in general, and the learnings of a previous version of a model to the next version of the same model in particular.

There is however a catch, which relates to the previous discussion regarding the equivalents of test coverage for models. Merging new fragments into a model could necessitate creation of novel out-of-distribution and counterfactual evaluation data, contributing to the difficulty of (efficient) model evolution, thus rendering it a lot harder than pure code evolution.

TFX’s ExampleGen, Transform, Trainer and Tuner components, together with TensorFlow Hub, help one treat artifacts as first class citizens by enabling production and consumption of mergeable fragments in workflows that perform data caching, analyzer caching, warmstarting and transfer learning.

Artifact Lineage

Despite all the advanced methodology and tooling that exists for software engineering, the programs and systems that are built invariably need to be debugged. The same holds for ML programs, but debugging them is notoriously harder because non-proximal effects are a lot more prevalent for ML programs due to the plethora of artifacts involved. A model might be inaccurate due to bad artifacts from several sources of error, including flaws in the code, the learning algorithm, the training data, the serving path, or the serving data, to name a few. Much like how stack traces are invaluable for identifying root causes of defects in software programs, the lineage of all artifacts produced and consumed by an ML pipeline is invaluable for identifying root causes of defects in ML models. Additionally, by knowing which downstream artifacts were produced from a problematic artifact, we can identify all impacted systems and users and take mitigating actions.

TFX’s use of ML Metadata (MLMD) helps treat artifacts as first class citizens. MLMD enables advanced cataloging and querying of metadata and lineage associated with artifacts which can together increase the confidence of sharing artifacts even outside the boundaries of a pipeline. MLMD also helps with advanced debugging and, when coupled with the underlying data storage layer, forms the foundation of TFX’s ML compliance mechanisms.

Continuous Learning And Unlearning

ML production pipelines operate in a dynamic environment:

New data can arrive continuously.
The modeling code can change, particularly in the early stages of model development.
The surrounding infrastructure can change, e.g., a new version of some underlying (ML) library.

When changes happen, a pipeline needs to react, often by rerunning its steps in the new environment. This dynamicity increases the importance of provenance tracking in order to facilitate debugging and root-cause analysis. As a simple example, to debug a model failure, it is necessary to know not only which data was used to train the model, but also the versions of the modeling code and any surrounding infrastructure.

ML pipelines must also support low-friction mechanisms to handle these changes. Consider for example the arrival of new data, which necessitates retraining the model. This is a natural requirement in rapidly changing environments, like recommender systems or adversarial systems. Requiring the user to manually retrain the model can be unrealistic, given that the data can arrive at a regular and frequent rate. Instead, we can employ automation by way of “continuous training”, where the pipeline detects the presence of new data and automatically schedules the generation of updated models. In turn, this functionality requires automatically: orchestrating work based on the presence of artifacts (including data), recovering from intermittent failures, and catching up to real-time when recovering. It is common for ML pipelines to run for years ingesting code and data, continuously producing models that make predictions that inform decisions.

Another example of a low-friction mechanism is support for “backfilling” an ML pipeline. In this case, the user might need to rerun the pipeline on existing artifacts but using updated versions of the components, such as rerunning the trainer on existing data using a new version of the modeling code/library. Another use of backfilling is rerunning the pipeline with new versions of existing data, say, to fix an error in the data. These backfills are orthogonal to continuous training and can be used together. For instance, the user can manually trigger a rerun of the trainer, and the generated model artifact can then automatically trigger model evaluation and validation.

TFX was built from the ground up in a way that enables continuous learning (and unlearning) which fundamentally shaped its design. At the same time, these advanced capabilities also allow it to be used in a “one-shot”, discontinuous, fashion. In fact, within Alphabet, both modes of deployment are widely used. Moreover, TFX also supports different types of backfill operations to enable fine-grained interventions during normal pipeline execution.

Even though the public TFX offering doesn’t yet offer continuous ML pipelines, we are actively working on making our existing technology portable so that it can be made publicly available (e.g RFC).

Infrastructure

Building On The Shoulders Of Giants

Realizing ambitious goals necessitates building on top of solid foundations, collaborating with others and leveraging each other’s work. TFX reuses many of Sibyl’s system designs, hardened over a decade of Sibyl’s production ML experience. Additionally, TFX incorporates new technologies in areas where robust standards emerged:

Similarly to how Sibyl built its algorithms and workflows on top of MapReduce, TFX leverages both TensorFlow and Apache Beam for its distributed training and data processing workflows.
Similarly to how Sibyl was columnar, TFX adopted Apache Arrow as the columnar in-memory representation for its compute-intensive libraries.

Taking dependencies where robust standards have emerged has allowed TFX and its users to achieve seamless performance and scalability. It also enables TFX to focus its energy on building the deltas of what is needed for applied ML, as opposed to re-implementing difficult-to-get-right technology. Some of our dependencies, like Kubeflow Pipelines or Apache Airflow, are selected by TFX’s users themselves when the value / features they get from them outweigh the costs that the additional dependencies entail.

Taking dependencies unfortunately incurs costs. We have found that taking dependencies requires effort that is super-linear to the number of dependencies. Said costs are often absorbed by us and our sister teams but can (and sometimes do) leak to our users, usually in the form of conflicting (version) dependencies or incompatibilities between environments and dependencies.

Interoperability And Positive Externalities

ML platforms do not operate in a vacuum. They instead operate within the context of a bigger system or infrastructure, connecting to data producing sources upstream and model consuming sinks downstream, which in turn frequently produce the data that feeds the ML platform, thereby closing the loop. Strong adoption of a platform usually necessitates interoperability with other important technologies in its environment.

Similarly to how Sibyl interoperated with Google’s Ads technology stack for data ingestion and model serving, TFX offers a plethora of connectors for data ingestion and allows serving the produced model in multiple deployment environments and devices.
Similarly to how Sibyl interoperated with Google’s compute stack, TFX leverages Apache Beam to execute on Apache Flink and Apache Spark clusters as well as serverless offerings like Google Cloud Dataflow.
TFX built an orchestration abstraction on top of MLMD and provides orchestration options on top of Apache Airflow, Apache Beam, Kubeflow Pipelines as well as the primitives to integrate with one’s custom orchestrator. MLMD itself works with several relational databases like SQLite and MySQL.

Interoperability necessitates some amount of abstraction and standardization and usually enables sum-greater-than-its-parts effects. TFX is both a beneficiary and a benefactor of the positive externalities created by said interoperability, both within and outside of Alphabet. TFX’s users are also beneficiaries of the interoperability as they can more easily deploy and use TFX on top of their existing installed base.

Interoperability also comes with costs. The combination of multiple technology stacks can lead to an exponential number of distinct deployment configurations. While we test some of the distinct deployment configurations end-to-end and at-scale, like for example TFX on GCP, we have neither the expertise nor the resources to do so for the combinatorial explosion of all possible deployment options. We thus encourage the community to work with us on the deployment configurations that are most useful for them.

What Is Different And Why

The areas discussed in this section capture a few examples of things that needed to change in order for our ML platform to adapt to a new reality and as such remain useful and impactful.

Environment And Device Portability

Sibyl was a massive scale ML platform designed to be deployed on Google’s large-scale cluster, namely Borg. This made sense as applied ML at Google was, originally, primarily used in products that were widely used. As ML expertise grew across the world, and ML could be applied to more use cases (large and small) across environments both within and outside of Google, the need for portability gradually but surely became a hard constraint.

While Sibyl ran only on Google’s datacenters, TFX runs on laptops, workstations, servers, datacenters, and public Clouds. In particular, when TFX runs on Google’s Cloud, it leverages automation and optimizations offered by GCP Services, enabled by Google’s unique infrastructure.
While Sibyl ran only on CPUs, TFX leverages TensorFlow to run on different kinds of hardware including CPUs, GPUs and Google’s TPUs.
While Sibyl’s models ran on servers, TFX leverages TensorFlow to produce models that run on laptops, workstations, and servers via TensorFlow Serving and Apache Beam, on mobile and IoT devices via TensorFlow Lite, and on browsers via TensorFlow JS.

TFX’s portability enabled it to be used in a very diverse set of environments and devices, in order to solve problems from small scale to massive scale.

Unfortunately, portability comes with costs. We have found that maintaining a portable core with environment-specific and device-specific specialization requires effort that is super-linear to the number of environments / devices. Said costs are however largely absorbed by us and our sister teams and as such are frequently not visible to our users.

Modularity And Layering

Even though Sibyl’s offering as an integrated product was immensely valuable, its structure and interface were somewhat monolithic, limiting it to a specific set of “direct” users who would have to adopt it wholesale. In contrast, TFX evolved to be a modular and layered architecture, and became more so over time as partnerships with other teams and products grew. Notable layers (with examples) in TFX include:

Layer	Examples
ML Services	Cloud AutoML Cloud Recommendations AI Cloud AI Platform Cloud Dataflow Cloud BigQuery
Pipelines (of composable Components)	TensorFlow Extended (TFX)
Binaries	TensorFlow Serving (TFS)
Libraries	TensorFlow Data Validation (TFDV) TensorFlow Transform (TFT) TensorFlow Hub (TFH) TensorFlow Model Analysis (TFMA) TFX Basic Shared Libraries (TFX_BSL) ML Metadata (MLMD)

TFX’s layered architecture enables it to be used by a very diverse set of users whether that’s piecemeal via its libraries, wholesale via its pipelines (with or without the pertinent services), or in a fashion that’s completely oblivious to the end users (e.g. by them using ML services which TFX powers under the hood)!

Unfortunately, layering comes with costs. We have found that maintaining multiple publicly accessible layers of our product requires effort that is roughly linear to the number of layers. Said costs occasionally leak to our users in the form of confusion regarding what layer makes the most sense for them to use.

Multi-faceted Flexibility

Even though Sibyl was more flexible in terms of modeling capabilities compared to available alternatives at the time, aspects of its flexibility across several parts of the ML workflow fell short of Google’s needs for accelerating ML for novel use cases, which led to the development of TFX.

While Sibyl only offered specific kinds of data analysis, TFX’s StatisticGen component offers more built-in capabilities and the ability to realize custom analyses, via TensorFlow Data Validation.
While Sibyl only offered transformations that were pure composable mappers, TFX’s Transform component offers more mappers, custom mappers, more analyzers, custom analyzers, as well as arbitrarily composed (custom) mappers and (custom) analyzers, via TensorFlow Transform.
While Sibyl only offered “wide” models, TFX’s Trainer component offers any model that can be realized on top of TensorFlow, including models that can be shared and can transfer-learn, via TensorFlow Hub.
While Sibyl only offered automatic feature crossing (a.k.a. feature conjunctions) on top of “wide” models, TFX’s Tuner component allows for arbitrary hyper parameter optimization based on state of the art algorithms.
While Sibyl only offered specific kinds of model analysis, TFX’s Evaluator component offers more built-in metrics, custom metrics, confidence intervals and fairness indicators, via TensorFlow Model Analysis.
While Sibyl’s pipeline topology was fixed (albeit somewhat customizable), TFX’s SDK allows one to create custom (optionally containerized) components and use them together with standard components in a flexible and fully customizable pipeline topology.

The increase of flexibility in all these dimensions enabled improved experimentation, wider reach, more use cases, as well as accelerated flow from research to production.

Flexibility does not come without costs. A more flexible system is one that is harder to get right in the first place as well as harder for us to maintain and to evolve as developers of the ML platform. Users may also have to manage increased complexity as they take advantage of this flexibility. Furthermore, we might not be able to offer as strong of a support story on top of an ML platform that is Turing complete.

Where We Are Going

Armed with the knowledge of the past, we present a glimpse of what we plan for the future of TFX, as of 2020. We will continue our work on enabling ML Engineering in order to democratize applied ML, and help everyone practice responsible AI and apply it in a fashion that upholds Google’s AI Principles.

Drive Interoperability And Standards

In order to meet the demand for the burgeoning variety of ML solutions, we will continue to increase our technology’s interoperability. Our work on interoperability and standards as well as open-sourcing more of our technology, reflects our principle to “be socially beneficial” as well as to “be made available for uses that accord with these principles” by making it easier for everyone to follow these practices. As part of this mission, we will empower the industry to build advanced ML systems by open-sourcing more of our technology, and by standardizing ML artifacts and metadata. Some select examples of this work include:

TFX Standardized Inputs.
Advanced TFX DSL semantics, Data Model and IR.
Standardization of ML artifacts and metadata.
Standardization of distributed workloads on heterogeneous runtime environments.
Inference on distributed and streaming models.
Improvements to interoperability with mobile and edge ML deployments.
Improvements for ML framework interoperability and artifact sharing.

Increase Automation

Automation is the backbone of reliable production systems, and TFX is heavily invested in improving and expanding its use of automation. Our work in increased automation reflects our principles of helping make ML deployments “be built and tested for safety” and “avoid creating or reinforcing unfair bias”. Some upcoming efforts include a TFX Pipeline testing framework, automated model improvement in the TFX Tuner, auto-detecting surprising model behavior on multidimensional slices, facilitating automatic production of Model Cards and improving our training-serving skew detection capabilities. TFX on GCP will also continue driving requirements for new (and will better make use of existing) advanced automation features of pertinent services.

Improve ML Understanding

ML understanding is an important aspect of deploying production ML, and TFX is well positioned to provide significant gains in this field. Our work on improving ML understanding reflects our principles to help “avoid creating or reinforcing unfair bias” and help make ML deployments “be accountable to people”. Critical to understanding is to be able to track the lineage of artifacts used to produce a model, an area TFX will continue to invest in. Improvements to TFX technologies like struct2tensor will further enable training, serving, and analyzing models on structured data, thus allowing reasoning about models closer to the original data semantics. We also plan to utilize TFX as a vehicle to expand support for fairness evaluation, remediation, and documentation.

Uphold High Standards And Best Practices

As a vehicle for amplification of ML technology, TFX must continue to “uphold high standards of scientific excellence” and promote best practices. The team will continue publishing scientific papers and conducting public outreach via our existing channels, as well as offer educational courses in partnership with established institutions. We will also improve trust in our model analysis tools using integrated uncertainty measures by, for example, enabling scalable computation of confidence intervals for model metrics, and we will improve our training-serving skew detection capabilities. It’s also critical for research and production to be able to have reproducible ML artifacts, enabled by our work in precise provenance tracking for auditing and reproducing models. Also key is reproducibility of measurements, driven by efforts like NitroML, which will provide tooling for benchmarking AutoML pipelines.

Given that several of the areas where we expand our technology are new to us, we will make an effort to distinguish the battle-tested from the experimental aspects of our technology, in order to enable our users to confidently choose the set of capabilities that meet their desires and needs.

Improve Tooling

Despite TFX providing tools for aspects of ML engineering and several phases of the ML lifecycle, we believe this is still a nascent area. While improving tooling is a natural fit for TFX, it also reflects our principle of helping ML deployments “be made available for uses that accord with these principles”, “supporting scientific excellence,” and being “built and tested for safety” .

One area of improvement is applying ML to the data itself, be it through sensing anomalies or finding patterns in data or enriching data with predictions from ML models. Making it easy to enrich large volumes of data (especially critical streaming data used for low-latency, high volume actions) has always been a challenge. Bringing TFX capabilities into data processing frameworks is our first step here. We have already made it possible to enrich streaming events with labels or make predictions in Apache Beam and, by extension, Cloud Dataflow. We plan to follow this work by leveraging pre-built models (served out of Cloud AI Pipelines and TensorFlow Serving) to make adding a new field in a distributed dataset representing predictions from streams of models trivially easy.

Furthermore, while there are many tools for detecting, discovering, and auditing ML workflows, there is still a need for automated (or assisted) mitigation of discovered issues, and we will invest in this area. For example, proactively predicting which pipeline runs won’t result in better models based on the currently-executing pipeline, perhaps even before training, can significantly reduce time and resources spent on creating poor models.

A Joint Journey

Building TFX and exploring the fundamentals of ML engineering was the cumulative effort of many people over many years. As we continue to make strides and further develop this field, it’s important we recognize the collaborative effort of those who got us here.

Of course, it will take many more collaborations to drive the future of this field, and as such, we invite you to join us on this journey “Towards ML Engineering”!

The TFX Team

The TFX project is realized via collaboration of multiple organizations within Google. Different organizations usually focus on different technology and product layers, though there is a lot of overlap on the portable parts of our technology. Overall we consider ourselves a single team and below we present an alphabetically sorted list of current TFX team members who are contributors to the ideation, research, design, implementation, execution, deployment, management, and advocacy (to name a few) aspects of TFX; they continue to inspire, help, teach, and challenge each other to advance our field:

Abhijit Karmarkar, Adam Wood, Aleksandr Zaks, Alina Shinkarsky, Neoklis Polyzotis, Amy Jang, Amy McDonald Sandjideh, Amy Skerry-Ryan, Andrew Audibert, Andrew Brown, Andy Lou, Anh Tuan Nguyen, Anirudh Sriram, Anna Ukhanova, Anusha Ramesh, Archana Jain, Arun Venkatesan, Ashley Oldacre, Baishun Wu, Ben Mathes, Billy Lamberta, Chandni Shah, Chansoo Lee, Chao Xie, Charles Chen, Chi Chen, Chloe Chao, Christer Leusner, Christina Greer, Christina Sorokin, Chuan Yu Foo, CK Luk, Connie Huang, Daisy Wong, David Smalling, David Zats, Dayeong Lee, Dhruvesh Talati, Doojin Park, Elias Moradi, Emily Caveness, Eric Johnson, Evan Rosen, Florian Feldhaus, Gal Oshri, Gautam Vasudevan, Gene Huang, Goutham Bhat, Guanxin Qiao, Gus Katsiapis, Gus Martins, Haiming Bao, Huanming Fang, Hui Miao, Hyeonji Lee, Ian Nappier, Ihor Indyk, Irene Giannoumis, Jae Chung, Jan Pfeifer, Jarek Wilkiewicz, Jason Mayes, Jay Shi, Jiayi Zhao, Jingyu Shao, Jiri Simsa, Jiyong Jung, Joana Carrasqueira, Jocelyn Becker, Joe Liedtke, Jongbin Park, Jordan Grimstad, Josh Gordon, Josh Yellin, Jungshik Jang, Juram Park, Justin Hong, Karmel Allison, Kemal El Moujahid, Kenneth Yang, Khanh LeViet, Kostik Shtoyk, Lance Strait, Laurence Moroney, Li Lao, Liam Crawford, Magnus Hyttsten, Makoto Uchida, Manasi Joshi, Mani Varadarajan, Marcus Chang, Mark Daoust, Martin Wicke, Megha Malpani, Mehadi Hassen, Melissa Tang, Mia Roh, Mig Gerard, Mike Dreves, Mike Liang, Mingming Liu, Mingsheng Hong, Mitch Trott, Muyang Yu, Naveen Kumar, Ning Niu, Noah Hadfield-Menell, Noé Lutz, Nomi Felidae, Olga Wichrowska, Paige Bailey, Paul Suganthan, Pavel Dournov, Pedram Pejman, Peter Brandt, Priya Gupta, Quentin de Laroussilhe, Rachel Lim, Rajagopal Ananthanarayanan, Rene van de Veerdonk, Robert Crowe, Romina Datta, Ron Yang, Rose Liu, Ruoyu Liu, Sagi Perel, Sai Ganesh Bandiatmakuri, Sandeep Gupta, Sanjana Woonna, Sanjay Kumar Chotakur, Sarah Sirajuddin, Sheryl Luo, Shivam Jindal, Shohini Ghosh, Sina Chavoshi, Sydney Lin, Tanya Grunina, Thea Lamkin, Tianhao Qiu, Tim Davis, Tris Warkentin, Varshaa Naganathan, Vilobh Meshram, Volodya Shtenovych, Wei Wei, Wolff Dobson, Woohyun Han, Xiaodan Song, Yash Katariya, Yifan Mai, Yiming Zhang, Yuewei Na, Zhitao Li, Zhuo Peng, Zhuoshu Li, Ziqi Huang, Zoey Sun, Zohar Yahav

Thank you, all!

The TFX Team … Extended

Beyond the current TFX team members, there have been many collaborators both within and outside of Alphabet whose discussions, technology, as well as direct and indirect contributions, have materially influenced our journey. Below we present an alphabetically sorted list of these collaborators:

Abdulrahman Salem, Ahmet Altay, Ajay Gopinathan‎, Alexandre Passos, Alexey Volkov, Anand Iyer, Andrew Bernard‎, Andrew Pritchard‎, Chary Aasuri, Chenkai Kuang, Chenyu Zhao, Chiu Yuen Koo, Chris Harris, Chris Olston, Christine Robson, Clemens Mewald, Corinna Cortes, Craig Chambers, Cyril Bortolato, D. Sculley, Daniel Duckworth‎, Daniel Golovin, David Soergel, Denis Baylor, Derek Murray, Devi Krishna, Ed Chi, Fangwei Li, Farhana Bandukwala, Gal Elidan, Gary Holt, George Roumpos, Glen Anderson, Greg Steuck, Grzegorz Czajkowski, Haakan Younes, Heng-Tze Cheng, Hossein Attar, Hubert Pham, Hussein Mehanna, Irene Cai, James L. Pine, James Pine, James Wu, Jeffrey Hetherly, Jelena Pjesivac-Grbovic, Jeremiah Harmsen, Jessie Zhu, Jiaxiao Zheng, Joe Lee, Jordan Soyke, Josh Cai, Judah Jacobson, Kaan Ege Ozgun‎, Kenny Song, Kester Tong, Kevin Haas, Kevin Serafini, Kiril Gorovoy, Kostik Steuck, Kristen LeFevre, Kyle Weaver, Kym Hines, Lana Webb, Lichan Hong, Lukasz Lew, Mark Omernick, Martin Zinkevich, Matthieu Monsch, Michel Adar, Michelle Tsai‎, Mike Gunter, Ming Zhong, Mohamed Hammad, Mona Attariyan, Mustafa Ispir, Neda Mirian, Nicholas Edelman‎, Noah Fiedel, Panagiotis Voulgaris‎, Paul Yang, Peter Dolan, Pushkar Joshi‎, Rajat Monga, Raz Mathias‎, Reiner Pope, Rezsa Farahani, Robert Bradshaw, Roberto Bayardo, Rohan Khot, Salem Haykal, Sam McVeety, Sammy Leong, Samuel Ieong, Shahar Jamshy, Slaven Bilac, Sol Ma, Stan Jedrus, Steffen Rendle, Steven Hemingray‎, Steven Ross, Steven Whang, Sudip Roy, Sukriti Ramesh, Susan Shannon, Tal Shaked, Tushar Chandra, Tyler Akidau, Venkat Basker, Vic Liu, Vinu Rajashekhar, Xin Zhang, Yan Zhu‎, Yaxin Liu, Younghee Kwon, Yury Bychenkov‎, Zhenyu Tan

Thank you, all!

Active learning workflow for Amazon Comprehend custom classification models – Part 1.

Amazon Comprehend Custom Classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use Custom Classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, routing customer phone calls to the right agents, and categorizing social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. Upon completing the job, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it predicts what it has learned from the training data. Sometimes you may not have a training dataset with various language patterns, or once you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can we ensure continuous model training to keep it up to date with new data and patterns?

In this two part blog series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. The first post will describe a workflow comprising real-time classification, feedback pipelines and human review workflows using Amazon Augmented AI (Amazon A2I). The second post will cover the automated model building using the human reviewed data, selecting the best model, and automated deployment of an endpoint of the chosen model.

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn about their misclassifications and learn the right ones. This process of teaching the models continuously through feedback and deploying them is called active learning.

For every prediction Amazon Comprehend Custom Classification makes, it also gives a confidence score associated with its prediction. This architecture proposes that you set an acceptable threshold and only accept the predictions with a confidence score that exceeds the threshold. All the predictions that have a confidence score less than the desired threshold are flagged for human review. The human decides whether to accept the model’s prediction or correct it.

In some instances, the model may be confident about its predictions, but the classification might be wrong. In these scenarios, the end-user applications that receive the model predictions can request explicit feedback from its users on the prediction quality. A human moderator reviews this explicit feedback and reclassifies instances where the feedback was negative. This process of generating human-verified data and using it for model retraining helps keep the models up to date, reduce data drift, and achieve higher model accuracy.

Feedback Workflow Architecture.

In this section, we discuss an architectural pattern for implementing an end-to-end active learning workflow for custom classification models in Amazon Comprehend using Amazon A2I. The active learning workflow comprises the following components:

Real-time classification
Feedback loops
Human classification
Model building
Model selection
Model deployment

The following diagram illustrates this architecture covering the first three components. In the following sections, we walk you through each step in the workflow.

Real-time classification

To use custom classification in Amazon Comprehend, you need to create a custom classification job that reads a ground truth dataset from an Amazon Simple Storage Service (Amazon S3) bucket and builds a classification model. After the model builds successfully, you can create an endpoint that allows you to make real-time classifications of unlabeled text. This stage is represented by steps 1–3 in the preceding architecture:

The end-user application calls an API Gateway endpoint with a text that needs to be classified.
The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
The Lambda function calls the Amazon Comprehend endpoint, which returns the unlabeled text classification and a confidence score.

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low-confidence scores to human review. This type of feedback is called implicit feedback.

The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose.

The other type of feedback is called explicit feedback and comes from the application’s end-users that use the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. Explicit feedback can be sent either in real-time through an API or a batch process.

End-users of the application submit explicit real-time feedback through an API Gateway endpoint.
The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream.
End-users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
A trigger configured on the S3 bucket triggers a Lambda function.
The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
Both the implicit and explicit feedback data gets sent to a delivery stream in a standard format. All this data is buffered and written to an S3 bucket.

Human classification

The human classification stage includes the following steps:

A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
The Lambda function creates Amazon A2I human review tasks for all the feedback data received.
Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
After the human review, all these instances are stored in an S3 bucket and used for retraining the models. Part 2 of this series covers the retraining workflow.

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We classify news into four categories: World, Sports, Business, and Sci/Tech, using the AG News dataset for custom classification, and set up the implicit and explicit feedback loop. You need to complete two manual steps:

Create an Amazon Comprehend custom classifier and an endpoint.
Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.

Prerequisites

Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

Creating a custom classifier and an endpoint

To create your classifier for classifying news, complete the following steps:

On the Amazon Comprehend console, choose Custom Classification.
Choose Train classifier.
For Name, enter news-classifier-demo.
Select Using Multi-class mode.
For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
For IAM role, select Create an IAM role.
For Permissions to access, choose Input and output (if specified) S3 bucket.
For Name suffix, enter ComprehendCustom.

Scroll down and choose Train Classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Creating a private workforce, worker task template, and human review workflow.

Creating a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

On the Amazon Comprehend console, choose Custom Classification.
From the Classifiers list, choose the name of the custom model for which you want to create the endpoint and select your model news-classifier-demo.
From the Actions drop-down menu, choose Create endpoint.
For Endpoint name, enter classify-news-endpoint and give it one inference unit.
Choose Create endpoint
Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Creating a private workforce, worker task template, and human review workflow.

This section walks you through creating a private workforce in Amazon SageMaker, a worker task template, and your human review workflow.

Creating a labeling workforce

For this post, you will create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).
Once the user accepts the invitation, you will have to add him to the workforce. For instructions, see the Add a Worker to a Work Team section the Manage a Workforce (Amazon SageMaker Console)

Creating a worker task template

To create a worker task template, complete the following steps:

On the Amazon A2I console, choose Worker task templates.
Choose to Create a template.
For Template name, enter custom-classification-template.
For Template type, choose Custom,
In the Template editor, enter the following GitHub UI template code.
Choose Create.

Creating a human review workflow

To create your human review workflow, complete the following steps:

On the Amazon A2I console, choose Human review workflows.
Choose Create human review workflow.
For Name, enter classify-workflow.
Specify an S3 bucket to store output: s3://<your bucketname>/.

Use the same bucket where you downloaded your train.csv in the prerequisite step.

For IAM role, select Create a new role.
For Task type, choose Custom.
Under Worker task template creation, select the custom classification template you created.
For Task description, enter Read the instructions and review the document.
Under Workers, select Private.
Use the drop-down list to choose the private team that you created.
Choose Create.
Copy the workflow ARN (see the following screenshot) to use when initializing the CloudFormation parameters.

Deploying the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

Custom classifier endpoint ARN
Amazon A2I workflow ARN

Choose Launch Stack:

Enter the following parameters:
1. ComprehendEndpointARN – The endpoint ARN you copied.
2. HumanReviewWorkflowARN – The workflow ARN you copied.
3. ComrehendClassificationScoreThreshold – Enter 0.5, which means a 50% threshold for low confidence score.

Choose Next until the Capabilities
Select the check-box to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

On the Outputs tab of the stack (see the following screenshot), copy the value for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.
Both the TextClassificationAPI and FeedbackAPI will require and API key to interact with them. The Cloudformtion output ApiGWKey refers to the name of the API key. Currently this API key is associated with a usage plan that allows 2000 requests per month.
On the API Gateway console, choose either the TextClassification API or the the FeedbackAPI. Choose API Keys from the left navigation. Choose your API key from step 7. Expand the API key section in the right pane and copy the value.

You can manage the usage plan by following the instructions on, Create, configure, and test usage plans with the API Gateway console.
You can also add fine grained authentication and authorization to your APIs. For more information on securing your APIs, you can follow instructions on Controlling and managing access to a REST API in API Gateway.

Testing the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact and test these APIs, you need to download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

Open Postman and enter the TextClassificationAPIGateway URL in POST method.
In the Headers section, configure the API key. x-api-key : << Your API key >>
In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):

{"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}

Choose Send.

You get a response back with a confidence score and class, as seen in the following screenshot.

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold as 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end-users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send the predicted label by the model’s explicit feedback through the following methods:

Real time through an API, which is usually triggered through a like/dislike button on a UI.
Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team.

Invoking the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

Open Postman and enter the FeedbackAPIGatewayID value from your CloudFormation stack output in POST method.
In the Headers section, configure the API key. x-api-key : << Your API key >>
In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):

{"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}

Choose Send.

Submitting explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. The following code shows some sample data in the file:

{
   "classifier":"news-classifier-demo",
   "sentences":[
      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."
   ]
}

Uploading the file triggers the Lambda function that starts your human review loop.

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the Amazon SageMaker console, on the Labeling workforces page, on the Private tab.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

You see the tasks displayed based on the worker template used when creating the human workflow.

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. Go to Amazon Sagemaker-> Human review workflows->output location:

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. Below is screenshot of the human annotated output file output.json in S3 bucket:

The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow. We cover these remaining steps in Part 2 of this series.

Cleaning up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
On the Amazon Comprehend console, delete the endpoint and the classifier.
On the Amazon A2I console, delete the human review workflow, worker template, and the private workforce.
On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)

Conclusion

Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any machine learning experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. The second part of this series covers the automatic model building, selection, and deployment of custom classification models.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.

About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and develop products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI/ML.

Joyson Neville Lewis obtained his master’s in Information Technology from Rutgers University in 2018. He has worked as a Software/Data engineer before diving into the Conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February of 2018 as a Big Data Consultant for AWS Professional Services team in NYC.

Facebook hosts virtual 2020 Fellowship Summit

The Facebook Fellowship Program supports top PhD students from around the world in fields related to computer science and engineering. The program includes an invitation to the annual Fellowship Summit, which is an opportunity for Fellows to network with one another, present their work, meet Facebook researchers and recruiters, and more.

Apply

Due to COVID-19, the Fellowship Summit was fully virtual and spanned September 8 to September 18. “We’ve embraced this new challenge of planning virtual events, which has provided unique opportunities for the summit,” says Alisa Futriski, Program Coordinator for the Fellowship Program. “For example, because of increased scheduling flexibility and no travel restriction issues, we were able to bring together a particularly robust group of presenters from the Facebook Research community.”

Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker

Many organizations use Amazon SageMaker for their machine learning (ML) requirements and source data from a data lake stored on Amazon Simple Storage Service (Amazon S3). The petabyte scale source data on Amazon S3 may not always be clean because data lakes ingest data from several source systems, such as like flat files, external feeds, databases, and Hadoop. It may contain extreme values in source attributes, considered as outliers in the data. Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, missing data, or simply through natural deviations in populations. Outliers in training data can easily impact model accuracy of many ML models, like linear and logistic regression. These anomalies result in ML scientists and analysts facing skewed results. Outliers can dramatically impact ML models and change the model equation completely with bad predictions or estimations.

Data scientists and analysts are looking for a way to remove outliers. Analysts come from a strong data background, and are very fluent in writing SQL queries with programming languages. The following tools are a natural choice for ML scientists to remove outliers and carry out data visualization:

Amazon Athena – An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
Pandas – An open-source, high-performance, easy-to-use library that provides for data structures and data analysis library like matplotlib for Python programming language
Amazon SageMaker – A fully managed service that provides you with the ability to build, train, and deploy ML models quickly

To illustrate how to use Athena with Pandas for anomaly detection and visualization using Amazon SageMaker, we clean a set of New York City Taxi and Limousine Commission (TLC) Trip Record Data by removing outlier records. In this dataset, outliers are when a taxi trip’s duration is for multiple days, 0 seconds, or less than 0 seconds. Then we use the Pandas matplotlib library to plot graphs to visualize trip duration values.

Solution overview

To implement this solution, you perform the following high-level steps:

Create an AWS Glue Data Catalog and browse the data on the Athena console.
Create an Amazon SageMaker Jupyter notebook and install PyAthena.
Identify anomalies using Athena SQL-Pandas from the Jupyter notebook.
Visualize data and remove outliers using Athena SQL-Pandas.

The following diagram illustrates the architecture of this solution.

Prerequisites

To follow this post, you should be familiar with the following:

The Amazon S3 file upload process
AWS Glue crawlers and the Data Catalog
Basic SQL queries
Jupyter notebooks
Assigning a basic AWS Identity and Access Management (IAM) policy to a role

Preparing the data

For this post, we use New York City Taxi and Limousine Commission (TLC) Trip Record Data, which is a publicly available dataset.

Download the file yellow_tripdata_2019-01.csv to your local machine.
Create the S3 bucket s3-yellow-cab-trip-details (your name will be different).
Upload the file to your bucket using the Amazon S3 console.

Creating the Data Catalog and browsing the data

After you upload the data to Amazon S3, you create the Data Catalog in AWS Glue. This allows you to run SQL queries using Athena.

On the AWS Glue console, create a new database.
For Database name, enter db_yellow_cab_trip_details.

Create an AWS Glue crawler to gather the metadata in the file and catalog it.

For this post, I use the database (db_yellow_cab_trip_details) to save tables with the added pre-fix as src_.

Run the crawler.

The crawler can take 2–3 minutes to complete. You can check the status on Amazon CloudWatch.

The following screenshot shows the crawler details on the AWS Glue console.

When the crawler is complete, the table is available in the Data Catalog. All the metadata and column-level information is displayed with corresponding data types.

We can now check the data on the Athena console to make sure we can read the file as a table and run a SQL query.

Run your query with the following code:

SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10;

The following screenshot shows your output on the Athena console.

Creating a Jupyter notebook and installing PyAthena

You now create a new notebook instance from Amazon SageMaker and install PyAthena using Jupyter.

Amazon SageMaker has managed built-in Jupyter notebooks that allow you to write code in Python, Julia, R, or Scala to explore, analyze, and do modeling with a small set of data.

Make sure the role used for your notebook has access on Athena (use IAM policies to verify and add S3FullAccess and AmazonAthenaFullAccess).

To create your notebook, complete the following steps:

On the Amazon SageMaker console, under Notebook, choose Notebook instances.
Choose Create notebook instance.

On the Create notebook instance page, enter a name and choose an instance type.

We recommend using an ml.m4.10xlarge instance, due to the size of the dataset. You should choose an appropriate instance depending on your data; costs vary for different instances.

Wait until the Notebook instance status shows as InService (this step can take up to 5 minutes).

When the instance is ready, choose Open Jupyter.

Open the conda_python3 kernel from the notebook instance.
Enter the following commands to install PyAthena:

! pip install --upgrade pip
! pip install PyAthena

The following screenshot shows the output.

You can also install PyAthena when you create the notebook instance by using lifecycle configurations. See the following code:

#!/bin/bash
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pip
pip install --upgrade  PyAthena
source /home/ec2-user/anaconda3/bin/deactivate
EOF

The following screenshot shows where you enter the preceding code in the Scripts section when creating a lifecycle configuration.

You can run a SQL query from the notebook to validate connectivity to Athena and pull data for visualization.

To import the libraries, enter the following code:

from pyathena import connect
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

To connect to Athena, enter the following code:

conn = connect(s3_staging_dir='s3://<your-Query-result-location>',region_name='us-east-1')

To check the sample data, enter the following query:

df_sample = pd.read_sql("SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10", conn)
df_sample

The following screenshot shows the output.

Detecting anomalies with Athena, Pandas, and Amazon SageMaker

Now that we can connect to Athena, we can run SQL queries to find the records that have unusual trip_duration values.

The following Athena query checks anomalies in the trip_duration data to find the top 50 records with the maximum duration:

df_anomaly_duration= pd.read_sql("select tpep_dropoff_datetime,tpep_pickup_datetime, 
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second, 
date_diff('minute', cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_minute, 
date_diff('hour',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_hour, 
date_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_day 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details 
order by 3 desc limit 50", conn)

df_anomaly_duration

The following screenshot shows the output; there are many outliers (trips with a duration greater than 1 day).

The output shows the duration in seconds, minutes, hours, and days.

The following query checks for anomalies and shows the top 50 records with the lowest minimum duration (negative value or 0 seconds):

df_anomaly_duration= pd.read_sql("select tpep_dropoff_datetime,tpep_pickup_datetime, 
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second, 
date_diff('minute', cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_minute, 
date_diff('hour',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_hour, 
date_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_day 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details 
order by 3 asc limit 50", conn)

df_anomaly_duration

The following screenshot shows the output; multiple trips have a negative value or duration of 0.

Similarly, we can use different SQL queries using to analyze the data and find other outliers. We can also clean the data by using SQL queries and, if needed, save the data in Amazon S3 with CTAS queries.

Visualizing the data and removing outliers

Pull the data using the following Athena query in a Pandas DataFrame, and use matplotlib.pyplot to create a visual graph to see the outliers:

df_full = pd.read_sql("SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details ", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_full["duration_second"])), np.sort(df_full["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')

plt.show()

The process of plotting the full dataset can take 7–10 minutes. To reduce time, add a limit to the number of records in the query:

SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 100000

The following screenshot shows the output.

To ignore outliers, run the following query. You replot the graph after removing the outlier records in which the duration is equal or less than 0 seconds or longer than 1 day:

df_clean_data = pd.read_sql("SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second from db_yellow_cab_trip_details.src_yellow_cab_trip_details where 
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp))  > 0 and date_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) < 1 ", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_clean_data["duration_second"])), np.sort(df_clean_data["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')
plt.show()

The following screenshot shows the output.

Cleaning up

When you’re done, delete the notebook instance to avoid recurring deployment costs.

On the Amazon SageMaker notebook, choose your notebook instance.
Choose Stop.

When the status shows as Stopped, choose Delete.

Conclusion

This post walked you through finding and removing outliers from your dataset and data visualization. We used an Amazon SageMaker notebook to run analytical queries using Athena SQL, and used Athena to read the dataset, which is saved in Amazon S3 with the metadata catalog in AWS Glue. We used queries in Athena to find anomalies in the data and ignore these outliers. We also used notebook instances to visualize graphs using Pandas’ matplotlib.pyplot library.

You can try this solution for your use-cases to remove outliers using Athena SQL and SageMaker notebook. If you have comments or feedback, please leave them below.

About the Authors

Rahul Sonawane is a Senior Consultant, Big Data at the Shared Delivery Teams at Amazon Web Services.

Behram Irani is a Senior Solutions Architect, Data & Analytics at Amazon Web Services.

Advancing Instance-Level Recognition Research

Posted by Cam Askew and André Araujo, Software Engineers, Google Research

Instance-level recognition (ILR) is the computer vision task of recognizing a specific instance of an object, rather than simply the category to which it belongs. For example, instead of labeling an image as “post-impressionist painting”, we’re interested in instance-level labels like “Starry Night Over the Rhone by Vincent van Gogh”, or “Arc de Triomphe de l’Étoile, Paris, France”, instead of simply “arch”. Instance-level recognition problems exist in many domains, like landmarks, artwork, products, or logos, and have applications in visual search apps, personal photo organization, shopping and more. Over the past several years, Google has been contributing to research on ILR with the Google Landmarks Dataset and Google Landmarks Dataset v2 (GLDv2), and novel models such as DELF and Detect-to-Retrieve.

Three types of image recognition problems, with different levels of label granularity (basic, fine-grained, instance-level), for objects from the artwork, landmark and product domains. In our work, we focus on instance-level recognition.

Today, we highlight some results from the Instance-Level Recognition Workshop at ECCV’20. The workshop brought together experts and enthusiasts in this area, with many fruitful discussions, some of which included our ECCV’20 paper “DEep Local and Global features” (DELG), a state-of-the-art image feature model for instance-level recognition, and a supporting open-source codebase for DELG and other related ILR techniques. Also presented were two new landmark challenges (on recognition and retrieval tasks) based on GLDv2, and future ILR challenges that extend to other domains: artwork recognition and product retrieval. The long-term goal of the workshop and challenges is to foster advancements in the field of ILR and push forward the state of the art by unifying research workstreams from different domains, which so far have mostly been tackled as separate problems.

DELG: DEep Local and Global Features
Effective image representations are the key components required to solve instance-level recognition problems. Often, two types of representations are necessary: global and local image features. A global feature summarizes the entire contents of an image, leading to a compact representation but discarding information about spatial arrangement of visual elements that may be characteristic of unique examples. Local features, on the other hand, comprise descriptors and geometry information about specific image regions; they are especially useful to match images depicting the same objects.

Currently, most systems that rely on both of these types of features need to separately adopt each of them using different models, which leads to redundant computations and lowers overall efficiency. To address this, we proposed DELG, a unified model for local and global image features.

The DELG model leverages a fully-convolutional neural network with two different heads: one for global features and the other for local features. Global features are obtained using pooled feature maps of deep network layers, which in effect summarize the salient features of the input images making the model more robust to subtle changes in input. The local feature branch leverages intermediate feature maps to detect salient image regions, with the help of an attention module, and to produce descriptors that represent associated localized contents in a discriminative manner.

Our proposed DELG model (left). Global features can be used in the first stage of a retrieval-based system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top, right), increasing the precision of the system.

This novel design allows for efficient inference since it enables extraction of global and local features within a single model. For the first time, we demonstrated that such a unified model can be trained end-to-end and deliver state-of-the-art results for instance-level recognition tasks. When compared to previous global features, this method outperforms other approaches by up to 7.5% mean average precision; and for the local feature re-ranking stage, DELG-based results are up to 7% better than previous work. Overall, DELG achieves 61.2% average precision on the recognition task of GLDv2, which outperforms all except two methods of the 2019 challenge. Note that all top methods from that challenge used complex model ensembles, while our results use only a single model.

Tensorflow 2 Open-Source Codebase
To foster research reproducibility, we are also releasing a revamped open-source codebase that includes DELG and other techniques relevant to instance-level recognition, such as DELF and Detect-to-Retrieve. Our code adopts the latest Tensorflow 2 releases, and makes available reference implementations for model training & inference, besides image retrieval and matching functionalities. We invite the community to use and contribute to this codebase in order to develop strong foundations for research in the ILR field.

New Challenges for Instance Level Recognition
Focused on the landmarks domain, the Google Landmarks Dataset v2 (GLDv2) is the largest available dataset for instance-level recognition, with 5 million images spanning 200 thousand categories. By training landmark retrieval models on this dataset, we have demonstrated improvements of up to 6% mean average precision, compared to models trained on earlier datasets. We have also recently launched a new browser interface for visually exploring the GLDv2 dataset.

This year, we also launched two new challenges within the landmark domain, one focusing on recognition and the other on retrieval. These competitions feature newly-collected test sets, and a new evaluation methodology: instead of uploading a CSV file with pre-computed predictions, participants have to submit models and code that are run on Kaggle servers, to compute predictions that are then scored and ranked. The compute restrictions of this environment put an emphasis on efficient and practical solutions.

The challenges attracted over 1,200 teams, a 3x increase over last year, and participants achieved significant improvements over our strong DELG baselines. On the recognition task, the highest scoring submission achieved a relative increase of 43% average precision score and on the retrieval task, the winning team achieved a 59% relative improvement of the mean average precision score. This latter result was achieved via a combination of more effective neural networks, pooling methods and training protocols (see more details on the Kaggle competition site).

In addition to the landmark recognition and retrieval challenges, our academic and industrial collaborators discussed their progress on developing benchmarks and competitions in other domains. A large-scale research benchmark for artwork recognition is under construction, leveraging The Met’s Open Access image collection, and with a new test set consisting of guest photos exhibiting various photometric and geometric variations. Similarly, a new large-scale product retrieval competition will capture various challenging aspects, including a very large number of products, a long-tailed class distribution and variations in object appearance and context. More information on the ILR workshop, including slides and video recordings, is available on its website.

With this research, open source code, data and challenges, we hope to spur progress in instance-level recognition and enable researchers and machine learning enthusiasts from different communities to develop approaches that generalize across different domains.

Acknowledgements
The main Google contributors of this project are André Araujo, Cam Askew, Bingyi Cao, Jack Sim and Tobias Weyand. We’d like to thank the co-organizers of the ILR workshop Ondrej Chum, Torsten Sattler, Giorgos Tolias (Czech Technical University), Bohyung Han (Seoul National University), Guangxing Han (Columbia University), Xu Zhang (Amazon), collaborators on the artworks dataset Nanne van Noord, Sarah Ibrahimi (University of Amsterdam), Noa Garcia (Osaka University), as well as our collaborators from the Metropolitan Museum of Art: Jennie Choi, Maria Kessler and Spencer Kiser. For the open-source Tensorflow codebase, we’d like to thank the help of recent contributors: Dan Anghel, Barbara Fusinska, Arun Mukundan, Yuewei Na and Jaeyoun Kim. We are grateful to Will Cukierski, Phil Culliton, Maggie Demkin for their support with the landmarks Kaggle competitions. Also we’d like to thank Ralph Keller and Boris Bluntschli for their help with data collection.

New Earth Simulator to Take on Planet’s Biggest Challenges

A new supercomputer under construction is designed to tackle some of the planet’s toughest life sciences challenges by speedily crunching vast quantities of environmental data.

The Japan Agency for Marine-Earth Science and Technology, or JAMSTEC, has commissioned tech giant NEC to build the fourth generation of its Earth Simulator. The new system, scheduled to become operational in March, will be based around SX-Aurora TSUBASA vector processors from NEC and NVIDIA A100 Tensor Core GPUs, all connected with NVIDIA Mellanox HDR 200Gb/s InfiniBand networking.

This will give it a maximum theoretical performance of 19.5 petaflops, putting it in the highest echelons of the TOP500 supercomputer ratings.

The new system will benefit from a multi-architecture design, making it suited to various research and development projects in the earth sciences field. In particular, it will act as an execution platform for efficient numerical analysis and information creation, coordinating data relating to the global environment.

Its work will span marine resources, earthquakes and volcanic activity. Scientists will gain deeper insights into cause-and-effect relationships in areas such as crustal movement and earthquakes.

The Earth Simulator will be deployed to predict and mitigate natural disasters, potentially minimizing loss of life and damage in the event of another natural disaster like the earthquake and tsunami that hit Japan in 2011.

Earth Simulator will achieve this by running large-scale simulations at high speed in ways previous generations of Earth Simulator couldn’t. The intent is also to have the system play a role in helping governments develop a sustainable socio-economic system.

The new Earth Simulator promises to deliver a multitude of vital environmental information. It also represents a quantum leap in terms of its own environmental footprint.

Earth Simulator 3, launched in 2015, offered a performance of 1.3 petaflops. It was a world beater at the time, outstripping Earth Simulators 1 and 2, launched in 2002 and 2009, respectively.

The fourth-generation model will deliver more than 15x the performance of its predecessor, while keeping the same level of power consumption and requiring around half the footprint. It’s able to achieve these feats thanks to major research and development efforts from NVIDIA and NEC.

The latest processing developments are also integral to the Earth Simulator’s ability to keep up with rising data levels.

Scientific applications used for earth and climate modelling are generating increasing amounts of data that require the most advanced computing and network acceleration to give researchers the power they need to simulate and predict our world.

NVIDIA Mellanox HDR 200Gb/s InfiniBand networking with in-network compute acceleration engines combined with NVIDIA A100 Tensor Core GPUs and NEC SX-Aurora TSUBASA provides JAMSTEC a world-leading marine research platform critical for expanding earth and climate science and accelerating discoveries.

The post New Earth Simulator to Take on Planet’s Biggest Challenges appeared first on The Official NVIDIA Blog.

Football tracking in the NFL with Amazon SageMaker

With the 2020 football season kicking off, Amazon Web Services (AWS) is continuing its work with the National Football League (NFL) on several ongoing game-changing initiatives. Specifically, the NFL and AWS are teaming up to develop state-of-the-art cloud technology using machine learning (ML) aimed at aiding the officiating process through real-time football detection. As a first step in this process, the Amazon Machine Learning Solutions Lab developed a computer vision model for the challenge of football detection. In this post, we provide in-depth examples including code snippets and visualizations to demonstrate the key components of the football detection pipeline, starting with data labeling and following up with training and deployment using Amazon SageMaker and Apache MXNet Gluon.

Detecting the football in NFL Broadcast videos

The following video illustrates detecting the football frame by frame.

Computer vision-based object detection techniques use deep learning algorithms to predict the location of objects in images and videos. Today, object detection has many far-reaching, high business value use cases, such as in self-driving car technology, where detecting pedestrians and vehicles is of paramount importance in ensuring safety on the roads. For the NFL, object detection technology like this is crucial as the game continues to evolve at a rapid pace. For example, they can use real-time object identification to generate new advanced analytics around player and team performance, in addition to aiding game officials in ball spotting. This technology is part of the larger suite of innovations in the AWS/NFL partnership.

The following sections of the post outline how we used NFL broadcast video data to train object detection models that analyze thousands of images to locate and classify the football from background objects.

Creating an object detection dataset with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. With the help of Ground Truth, we created a custom object detection dataset by breaking the NFL play segments into images. It offers a user interface (UI) that allowed us to quickly spin up a bounding box labeling job, in which human annotators can quickly draw bounding box labels around thousands of football image sequences stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following screenshot illustrates the object detection UI.

The labeling job outputs a manifest file that contains the S3 file path of the image, the (x, y) bounding box coordinates, and the class label (for this use case, football) needed for the model training process.

The following screenshot shows the manifest file that is automatically updated in the S3 path.

The following screenshot shows the contents of the output.manifest file.

Supercharging training with Apache MXNet Gluon and Amazon SageMaker Script Mode

Our approach to model development relied on an ML technique called transfer learning. In it, we take neural networks previously trained on similar applications with strong results and fine-tune these models on our annotated data. We converted the annotations from the labeling job to RecordIO format for compact storage and faster disk access. Neural networks have the tendency to overfit training data, leading to poor out-of-sample results. The MXNet Gluon toolkit we added provides image normalization and image augmentations, such as randomized image flipping and cropping, to help reduce overfitting during training.

Amazon SageMaker provides a simple UI to train object detection models with no code, offering the Single Shot Detector (SSD) pre-trained model with several out-of-the-box configurations. For a more customized architecture, we use Amazon SageMaker Script Mode, which allows you to bring your own training algorithms and directly train models while staying within the user-friendly confines of Amazon SageMaker. We could train larger, more accurate models directly from Amazon SageMaker notebooks by combining Script Mode with pre-trained models like Yolov3 and Faster-RCNN with several backbone network combinations from the Gluon Model Zoo for object detection. See the following code:

import os
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

s3_output_path = "s3://<path to bucket where model weights will be saved>/"

model_estimator = MXNet(
    entry_point="train.py",
    role=role,
    train_instance_count=1,  # value can be more than 1 for multi node training
    train_instance_type="ml.p3.16xlarge",
    framework_version="1.6.0",
    output_path=s3_output_path,
    py_version="py3",
    distributions={"parameter_server": {"enabled": True}},
    hyperparameters={"epochs": 15},
)

m.fit("s3://<bucket path for train and validation record-io files>/")

Object detection algorithm: Background

Object detectors typically combine two key components: detection of objects in images and regression for estimating bounding box coordinates of objects. During training, object detectors are optimized to reduce both detection error and localization error (bounding box prediction error) via a loss function.

Current state-of-the-art object detectors contain deep learning architectures that use pre-trained convolutional neural networks (CNNs) like VGG-16 or ResNet-50 as base networks to perform rich feature extraction from input images. SSD predicts the relative offsets to a fixed set of boxes at every location of a convolutional feature map. Empirically, SSD underperforms other object detector algorithms on small objects like football. In contrast, YOLOv3 uses DarkNet-53 for feature extraction, which concatenates multiple feature maps together to make predictions, leading to improved performance on smaller objects.

Faster-RCNN in comparison to both SSD and YOLOv3 uses an additional shared deep neural network to predict region proposals of the input image feature maps, which is aggregated in the feed downstream in the model for object classification and bounding box prediction. Faster-RCNN empirically outperformed other networks on small objects in our use case. One major consideration in addition to performance, when choosing object detectors, is model inference time. SSD and YOLOv3 tend to have fast inference times as measured in frames per second, which is a key consideration for real-time applications; larger networks like Faster-RCNN have slower inference time.

Hyperparameter optimization on Amazon SageMaker

A standard metric for evaluating object detectors is mean average precision (mAP). mAP is based on the model precision recall (PR) curve and provides a numerical metric that can be directly used across models. You can generate PR curves by setting model confidence score thresholds to different levels, resulting in precision and recall pairs. Plotting these pairs with a bit of interpolation results in a PR curve. Average precision (AP) is then defined as the area under this PR curve. Similarly, you may want to detect multiple objects in an image, such as K > 1 objects: mAP is the mean AP across all K classes.

Automatic model tuning in Amazon SageMaker, also known as hyperparameter optimization, allowed us to try over 100 models with unique parameter configurations to achieve the best model possible given our data. Hyperparameter optimization uses strategies like random search and Bayesian search to help tune hyper parameters in the ML algorithm. See the following sample code:

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    " network": CategoricalParameter(["resnet50_v1b", "resnet101_v1d"])
}

### Objective metric Regex based on print statements in script
objective_metric_name = "Validation: "
metric_definitions = [{"Name": "Validation: ", "Regex": "Validation: ([0-9\.]+)"}]

tuner = HyperparameterTuner(
    model_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=100,
    max_parallel_jobs=10,
)

tuner.fit("s3://<bucket path for train and validation record-io files>/")

To do this, we specified the location of our data and manifest file on Amazon S3 and chose our Amazon SageMaker instance type and object detection algorithm to use (SSD with ResNet50). Amazon SageMaker hyperparameter optimization then launched several configurations of the base model with unique hyperparameter configurations, using Bayesian search to determine which configuration achieves the best model based on a preset test metric. In our case, we optimized towards the highest mean average precision (mAP) on our held-out test data. The following graph shows a visualization of a sample set of hyperparameter optimization jobs from the hyperparameter optimization tuner object.

Deploying the model

Deploying the model required only a few additional lines of code (hosting methods) within our Amazon SageMaker notebook instance. We can simply call tuner.deploy on our hyperparameter optimization tuner to deploy the best model based on the evaluation metric that was set for the hyperparameter optimization training job. The code below demonstrates a proof-of-concept deployment on Amazon SageMaker:

predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

The model weights for each training job are stored in Amazon S3. We can deploy any of the jobs or Amazon SageMaker trained models by passing its model artifact path to an Amazon SageMaker estimator object. To do this, we referred to the preconfigured container optimized to perform inference and linked it to the model weights. After this model-container pair was created on our account, we could configure an endpoint with the instance type and number of instances needed by the NFL. See the following code:

from sagemaker.mxnet.model import MXNetModel

sagemaker_model = MXNetModel(
    model_data="s3://<path to training job model file>/model.tar.gz",
    role=role,
    py_version="py3",
    framework_version="1.6.0",
    entry_point="train.py",
)

predictor = sagemaker_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge"
)

Model inference for football detection

At runtime, a client sends a request to the endpoint hosting the model container on an Amazon Elastic Compute Cloud (Amazon EC2) instance and returns the output (inference). In production, scaling endpoints for large-scale inference on the NFL broadcast videos is significantly simplified with this pipeline.

Sensitivity and error analysis

When exploring strategies to improve model performance, you can scale up (use larger architectures) or scale out (acquire more data). After scaling up, which we discussed earlier during model exploration, data scientists commonly collect additional data in hopes of improving model generalizability. For our use case, we specifically aimed at reducing localization error. To do this, we created several test sets that quantitatively and qualitatively helped us understand mAP in relation to specific characteristics of the input video: occlusion (high vs. low) of the football, bounding box size (small vs. large) and aspect ratio (tall vs. wide) effects, camera angle (endzone vs. sideline), and contrast (high vs. low) between the football and its background area. From this, we understood which qualitative aspect of image the model was struggling to predict, and these findings led us to strategically gather additional data to target and improve upon these areas.

Summary

The NFL uses cloud computing to create innovative experiences that introduce additional ways for fans to enjoy football while making the game more efficient and fast-paced. By combining football detection with additional new technologies, the NFL can reduce game stoppage, support officiating, and bring real-time insight into what’s happening on the field, leading to a greater connection with the game that fans love. While fans take delight in “America’s game,” they can rest assured that the NFL in collaboration with AWS is utilizing the newest and best technologies to make the game more enjoyable with a broader range of data points.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, implementing HPO, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.

About the Authors

Michael Lopez is the Director of Football Data and Analytics at the National Football League and a Lecturer of Statistics and Research Associate at Skidmore College. At the National Football League, his work centers on how to use data to enhance and better understand the game of football.

Colby Wise is a Senior Data Scientist and manager at the Amazon Machine Learning Solutions Lab where he works with customers across different verticals to accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Divya Bhargavi is a Data Scientist at the Amazon Machine Learning Solutions Lab where she develops machine learning models to address customers’ business problems. Most recently, she worked on Computer Vision solutions involving both classical and deep learning methods for a sports customer.

Bringing the Mona Lisa Effect to Life with TensorFlow.js

A guest post by Emily Xie, Software Engineer

Background

Urban legend says that Mona Lisa’s eyes will follow you as you move around the room. This is known as the “Mona Lisa effect.” For fun, I recently programmed an interactive digital portrait that brings this phenomenon to life through your browser and webcam.

At its core, the project leverages TensorFlow.js, deep learning, and some image processing techniques. The general idea is as follows: first, we must generate a sequence of images of Mona Lisa’s head, with eyes gazing from the left to right. From this pool, we’ll continuously select and display a single frame in real-time based on the viewer’s location.

In this post, I’ll walk through the specifics of the project’s technical design and implementation.

Animating the Mona Lisa with Deep Learning

Image animation is a technique that allows one to puppeteer a still image through a driving video. Using a deep-learning-based approach, I was able to generate a highly convincing animation of Mona Lisa’s gaze.

Specifically, I used the First Order Motion Model (FOMM), released by Aliaksandr Siarohin et al. in 2019. At a very high level, this method is composed of two modules: one for motion extraction, and another for image generation. The motion module detects keypoints and local affine transformations from the driving video. Diffs of these values between consecutive frames are then used as input to a network that predicts a dense motion field, along with an occlusion mask which specifies the image regions that either need to be modified or contextually inferred. The image generation network, then, detects facial landmarks and produces the final output––the source image, warped and in-painted according to the results of the motion module.

I chose FOMM in particular because of its ease of use. Prior models in this domain had been “object-specific”, meaning that they required detailed data of the object to be animated, whereas FOMM operated agnostically to this. More importantly, the authors released an open-source, out-of-the-box implementation with pre-trained weights for facial animation. Because of this, applying the model to the Mona Lisa became a surprisingly straightforward endeavor: I simply cloned the repo into a Colab notebook, produced a short driving video of me with my eyes moving around, and fed it through the model along with a screenshot of La Gioconda’s head. The resulting movie was stellar. From this, I ultimately sampled just 33 images to constitute the final animation.

Example of a driving video and the image animation predictions generated by FOMM.

A subsample of the final animation frames, produced using the First Order Motion Model.

Image Blending

While I could’ve re-trained the model for my project’s purposes, I decided to work within the constraints of Siarohin’s weights in order to avoid the time and computational resources that would’ve been otherwise required. This, however, meant that the resulting frames were fixed at a lower resolution than desired, and consisted of just the subject’s head. But since I wanted the final visual to include the entirety of Mona Lisa––hands, torso, and background included––my plan was to simply superimpose the output head frames onto an image of the painting.

An example of a head frame overlaid on top of the underlying image. To best illustrate the problem, the version shown here is from an earlier iteration of the project where there was further resolution loss in the head frame.

This, however, produced its own set of challenges. If you look at the example above, you’ll notice that the lower-resolution output of the model––coupled with some subtle collateral background changes due to FOMM’s warping procedure––causes the head frame to visually jut out. In other words, it was a bit obvious that this was just a picture on top of another picture. To address this, I did some image processing in Python to “blend” the head image into the underlying one.

First, I resized the head frame to its original resolution. From there, I created a new frame using a weighted average of these blurred out pixels and the corresponding pixels in the underlying image, where the weight––or alpha––of a pixel in the head frame decreases as it moves away from the midpoint.

The function to determine alpha was adapted from a 2D sigmoid, and is expressed as:

Where j determines the logistic function’s slope, k is the inflection point, and m is the midpoint of the input values. Graphed out, the function looks like:

After I applied the above procedure to all 33 frames in the animation set, the resulting superimpositions each appeared to be a single image to the unsuspecting eye:

Tracking the Viewer’s Head via BlazeFace

All that was left at this point was to determine how to track the user via the webcam and display the corresponding frame.

Naturally, I turned to TensorFlow.js for the job. The library offered a fairly robust set of models to detect the presence of a human given visual input, but after some research and thinking, I landed on BlazeFace as my method of choice.

BlazeFace is a deep-learning based object recognition model that detects human faces and facial landmarks. It is specifically trained for using mobile camera input. This worked well for my use case, as I expected most viewers to be using their webcam in a similar manner––with their heads in frame, front-facing, and fairly close to the camera––whether through their mobile devices or on their laptops.

My foremost consideration in selecting this model, however, was its extraordinary speed of detection. To make this project convincing, I needed to be able to run the entire animation in real time, including the facial recognition step. BlazeFace adapts the single-shot detection (SSD) model, a deep-learning-based object detection algorithm that simultaneously proposes bounding boxes and detects objects in just one forward pass of the network. BlazeFace’s lightweight detector is capable of recognizing facial landmarks at speeds as fast as 200 frames per second.

A demo of what BlazeFace can capture given an input image: bounding boxes for a human head, along with facial landmarks.

Having settled on the model, I then wrote code to continually pipe the user’s webcam data into BlazeFace. On each run, the model outputted an array of facial landmarks and their corresponding 2D coordinate positions. Using this, I approximated the X coordinate of the face’s center by calculating the midpoint between the eyes.

Finally, I mapped this result to an integer between 0 and 32. These values, as you may recall, each represented a frame in the animated sequence––with 0 depicting Mona Lisa with her eyes to the left, and 32 with her eyes to the right. From there, it was just a matter of displaying the frame on the screen.

Try it out!

You can play with the project at monalisaeffect.com. To follow more of my work, feel free to check out my personal website, Github, or Twitter.

Acknowledgements

Thanks to Andrew Fu for reading this post and providing me feedback, to Nick Platt for lending his ear and thoughts on a frontend bug, and to Jason Mayes along with the rest of the team at Google for their work in reaching out and amplifying this project.

Improved OCR and structured data extraction with Amazon Textract

Optical character recognition (OCR) technology, which enables extracting text from an image, has been around since the mid-20th century, and continues to be a research topic today. OCR and document understanding are still vibrant areas of research because they’re both valuable and hard problems to solve.

AWS has been investing in improving OCR and document understanding technology, and our research scientists continue to publish research papers in these areas. For example, the research paper Can you read me now? Content aware rectification using angle supervision describes how to tackle the problem of document rectification which is fundamental to the OCR process on documents. Additionally, the paper SCATTER: Selective Context Attentional Scene Text Recognizer introduces a novel way to perform scene text recognition, which is the task of recognizing text against complex image backgrounds. For more recent publications in this area, see Computer Vision.

Amazon scientists also incorporate these research findings into best-of-breed technologies such as Amazon Textract, a fully managed service that uses machine learning (ML) to identify text and data from tables and forms in documents—such as tax information from a W2, or values from a table in a scanned inventory report—and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring customization or human intervention.

One of the advantages of a fully managed service is the automatic and periodic improvement to the underlying ML models to improve accuracy. You may need to extract information from documents that have been scanned or pictured in different lighting conditions, a variety of angles, and numerous document types. As the models are trained using data inputs that encompass these different conditions, they become better at detecting and extracting data.

In this post, we discuss a few recent updates to Amazon Textract that improve the overall accuracy of document detection and extraction.

Currency symbols

Amazon Textract now detects a set of currency symbols (Chinese yuan, Japanese yen, Indian rupee, British pound, and US dollar) and the degree symbol with more precision without much regression on existing symbol detection.

For example, the following is a sample table in a document from a company’s annual report.

The following screenshot shows the output on the Amazon Textract console before the latest update.

Amazon Textract detects all the text accurately. However, the Indian rupee symbol is recognized as an “R” instead of “₹”. The following screenshot shows the output using the updated model.

The rupee symbol is detected and extracted accurately. Similarly, the degree symbol and the other currency symbols (yuan, yen, pound, and dollar) are now supported in Amazon Textract.

Detecting rows and columns in large tables

Amazon Textract released a new table model update that more accurately detects rows and columns of large tables that span an entire page. Overall table detection and extraction of data and text within tables has also been improved.

The following is an example of a table in a personal investment account statement.

The following screenshot shows the Amazon Textract output prior to the new model update.

Even though all the rows, columns, and text is detected properly, the output also contains empty columns. The original table didn’t have a clear separation for columns, so the model included extra columns.

The following screenshot shows the output after the model update.

The output now is much cleaner. Amazon Textract still extracts all the data accurately from this table and now includes the correct number of columns. Similar performance improvement can be seen in tables that span an entire page and columns are not omitted.

Improved accuracy in forms

Amazon Textract now has higher accuracy on a variety of forms, especially income verification documents such as pay stubs, bank statements, and tax documents. The following screenshot shows an example of such a form.

The preceding form is not of high-quality resolution. Regardless, you may have to process such documents in your organization. The following screenshot is the Amazon Textract output using one of the previous models.

Although the older model detected many of the check boxes, it didn’t capture all of them. The following screenshot shows the output using the new model.

With this new model, Amazon Textract accurately detected all the check boxes in the document.

Summary

The improvements to the currency symbols and the degree symbol detection will be launched in the Asia Pacific (Singapore) region on September 24th, 2020, followed by other regions where Amazon Textract is available in the next few days. With the latest improvements to Amazon Textract, you can retrieve information from documents with more accuracy. Tables spanning the entire page are detected more accurately, currency symbols (yuan, yen, rupee, pound, and dollar) and the degree symbol are now supported, and key-value pairs and check boxes in financial forms are detected with more precision. To start extracting data from your documents and images, try Amazon Textract for yourself.

About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Preventing customer churn by optimizing incentive programs using stochastic programming

In recent years, businesses are increasingly looking for ways to integrate the power of machine learning (ML) into business decision-making. This post demonstrates the use case of creating an optimal incentive program to offer customers identified as being at risk of leaving for a competitor, or churning. It extends a popular ML use case, predicting customer churn, and shows how to optimize an incentive program to address the real business goal of preventing customer churn. We use a large phone company for our use case.

Although it’s usual to treat this as a binary classification problem, the real world is less binary: people become likely to churn for some time before they actually churn. Loss of brand loyalty occurs some time before someone actually buys from a competitor. There’s frequently a slow rise in dissatisfaction over time before someone is finally driven to act. Providing the right incentive at the right time can reset a customer’s satisfaction.

This post builds on the post Gain customer insights using Amazon Aurora machine learning. There we met a telco CEO and heard his concern about customer churn. In that post, we moved from predicting customer churn to intervening in time to prevent it. We built a solution that integrates Amazon Aurora machine learning with the Amazon SageMaker built-in XGBoost algorithm to predict which customers will churn. We then integrated Amazon Comprehend to identify the customer’s sentiment when they called customer service. Lastly, we created a naïve incentive to offer customers identified as being at risk at the time they called.

In this post, we focus on replacing this naïve incentive with an optimized incentive program. Rather than using an abstract cost function, we optimize using the actual economic value of each customer and a limited incentive budget. We use a mathematical optimization approach to calculate the optimal incentive to offer each customer, based on our estimate of the probability that they’ll churn, and the probability that they’ll accept our incentive to stay.

Solution overview

Our incentive program is intended to be used in a system such as that described in the post Gain customer insights using Amazon Aurora machine learning. For simplicity, we’ve built this post so that it can run separately.

We use a Jupyter notebook running on an Amazon SageMaker notebook instance. Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. In the Jupyter notebook, we first build and host an XGBoost model, identical to the one in the prior post. Then we run the optimization based on a given marketing budget and review the expected results.

Setting up the solution infrastructure

To set up the environment necessary to run this example in your own AWS account, follow Steps 0 and 1 in the post Simulate quantum systems on Amazon SageMaker to set up an Amazon SageMaker instance.

Then, as in Step 2, open a terminal. Enter the following command to copy the notebook to your Amazon SageMaker notebook instance:

wget https://aws-ml-blog.s3.amazonaws.com/artifacts/prevent_churn_by_optimizing_incentives/preventing_customer_churn_by_optimizing_incentives.ipynb

Alternatively, you can review a pre-run version of the notebook.

Building the XGBoost model

The first sections of the notebook—Setup, Data Exploration, Train, and Host, are the same as the sample notebook Amazon SageMaker Examples – Customer Churn. The exception is that we capture a copy of the data for later use and add a column to calculate each customer’s total spend.

At the end of these sections, we have a running XGBoost model on an Amazon SageMaker endpoint. We can use it to predict which customers will churn.

Assessing and optimizing

In this section of the post and accompanying notebook, we focus on assessing the XGBoost model and creating our optimal incentive program.

We can assess model performance by looking at the prediction scores, as shown in the original customer churn prediction post, Amazon SageMaker Examples – Customer Churn.

So how do we calculate the minimum incentive that will give the desired result? Rather than providing a single program to all customers, can we save money and gain a better outcome by using variable incentives, customized to a customer’s churn probability and value? And if so, how?

We can do so by building on components we’ve already developed.

Assigning costs to our predictions

The costs of churn for the mobile operator depend on the specific actions that the business takes. One common approach is to assign costs, treating each customer’s prediction as binary: they churn or don’t, we predicted correctly or we didn’t. To demonstrate this approach, we must make some assumptions. We assign the true negatives the cost of $0. Our model essentially correctly identified a happy customer in this case, and we won’t offer them an incentive. An alternative is to assign the actual value of the customer’s spend to the true negatives, because this is the customer’s contribution to our overall revenue.

False negatives are the most problematic because they incorrectly predict that a churning customer will stay. We lose the customer and have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. Such costs typically run in the hundreds of dollars, so for this use case, we assume $500 to be the cost for each false negative. For a better estimate, our marketing department should be able to give us a value to use for the overhead, and we have the actual customer spend for each customer in our dataset.

Finally, we give an incentive to customers that our model identifies as churning. For this post, we assume a one-time retention incentive of $50. This is the cost we apply to both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we waste the concession. We probably could have spent those dollars more effectively, but it’s possible we increased the loyalty of an already loyal customer, so that’s not so bad. We revise this approach later in this post.

Mapping the customer churn threshold

In previous versions of this notebook, we’ve shown the effect of false negatives that are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we’ve used a cost function that looks like the following equation:

cost_of_replacing_customer * FN(C) + customer_value * TN(C) + incentive_offered * FP(C) + incentive_offered * TP(C)

FN(C) means that the false negative percentage is a function of the cutoff, C, and similar for TN, FP, and TP. We want to find the cutoff, C, where the result of the expression is smallest.

We start by using the same values for all customers, to give us a starting point for discussion with the business. With our estimates, this equation becomes the following:

$500 * FN(C) + $0 * TN(C) + $50 * FP(C) + $50 * TP(C)

A straightforward way to understand the impact of these numbers is to simply run a simulation over a large number of possible cutoffs. We test 100 possible values, and produce the following graph.

The following output summarizes our results:

Cost is minimized near a cutoff of: 0.21000000000000002 for a cost of: $ 25800 for these 1500 customers.
Incentive is paid to 246 customers, for a total outlay of $ 12300
Total customer spend of these customers is $ 16324.36

The preceding chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high (such as 0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large grey area, where perhaps some more nuanced incentives would create better outcomes.

The overall cost can be minimized at $25,750 by setting the cutoff to 0.13, which is substantially better than the $100,000 or more we would expect to lose by not taking any action.

We can also calculate the dollar outlay of the program and compare to the total spend of the customers. Here we can see that paying the incentive to all predicted churn customers costs $13,750, and that these customers spend $183,700. (Your numbers may vary, depending on the specific customers chosen for the sample.)

What happens if we instead have a smaller budget for our campaign? We choose a budget of 1% of total customer monthly spend. The following output shows our results:

Total budget is: $895.90
Per customer incentive is $0.60

We can see that our cost changes. But it’s pretty clear that an incentive of approximately $0.60 is unlikely to change many people’s minds.

Can we do better? We could offer a range of incentives to customers that meet different criteria. For example, it’s worth more to the business to prevent a high spend customer from churning than a low spend customer. We could also target the grey area of customers that have less loyalty and could be swayed by another company’s advertising. We explore this in the following section.

Preventing customer churn using mathematical optimization of incentive programs

In this section, we use a more sophisticated approach to developing our customer retention program. We want to tailor our incentives to target the customers most likely to reconsider a churn decision.

Intuitively, we know that we don’t need to offer an incentive to customers with a low churn probability. Also, above some threshold, we’ve already lost the customer’s heart and mind, even if they haven’t actually left yet. So the best target for our incentive is between those two thresholds—these are the customers we can convince to stay.

The problem under investigation is inherently stochastic in that each customer might churn or not, and might accept the incentive (offer) or not. Stochastic programming [1, 2] is an approach for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real-world problems almost invariably include parameters that are unknown at the time a decision should be made. An example would be the construction of an investment portfolio to maximize return. An efficient portfolio would be defined as the portfolio that maximizes the expected return for a given amount of risk (such as standard deviation), or the portfolio that minimizes the risk subject to a given expected return [3].

Our use case has the following elements:

We know the number of customers, 𝑁.
We can use the customer’s current spend as the (upper bound) estimate of the profit they generate, P.
We can use the churn score from our ML model as an estimate of the probability of churn, alpha.
We use 1% of our total revenue as our campaign budget, C.
The probability that the customer is swayed, beta, depends on how convincing the incentive is to the customer, which we represent as 𝛾.
The incentive, c, is what we want to calculate.

We set up our inputs: P (profit), alpha (our churn probabilities, from our preceding model), and C, our campaign budget. We then define the function we wish to optimize, f(c_i) being the expected total profit across the 𝑁 customers.

Our goal is to optimally allocate the discount 𝑐_𝑖 across the 𝑁 customers to maximize the expected total profit. Mathematically this is equivalent to the following optimization problem:

Now we can specify how likely we think each customer is to accept the offer and not churn—that is, how convincing they’ll find the incentive. We represent this as 𝛾 in the formulae.

Although this is a matter of business judgment, we can use the preceding graph to inform that judgment. In this case, the business believes that if the churn probability is below 0.55, they are unlikely to churn, even without an incentive; on the other hand, if the customer’s churn probability is above 0.95, the customer has little loyalty and is unlikely to be convinced. The real targets for the incentives are the customers with churn probability between 0.55–0.95.

We could include that business insight into the optimization by setting the value for the convincing factor 𝛾_𝑖 as follows:

𝛾_𝑖 = 100. This is equivalent to giving less importance as a deciding factor to the discount for customers whose churn probability is below 0.55 (they are loyal and less likely to churn), or greater than 0.95 (they will most likely leave despite the retention campaign).
𝛾_𝑖 = 1. This is equivalent to saying that the probability customer i will accept the discount is equal to 𝛽=1−𝑒^−C^𝑖 for customers with churn probability between 0.55 and 0.95.

When we start to offer these incentives, we can log whether or not each customer accepts the offer and remains a customer. With that information, we can learn this function from experience, and use that learned function to develop the next set of incentives.

Solving the optimization problem

A variety of open-source solvers are available that can solve this optimization problem for us. Examples include SciPy scipy.optimize.minimize, or faster open-source solvers like GEKKO, which is what we use for this post. For large-scale problems, we would recommend using commercial optimization solvers like CPLEX or GUROBI.

After the optimization task has run, we check how much of our budget has been allocated.

The total spend is $1,000.00 compared to our budget of $1,000.00, and the total customer spend is $89,589.73 for 1,500 customers.

Now we evaluate the expected total profit for the following scenarios:

Optimal discount allocation, as calculated by our optimization algorithm
Uniform discount allocation: every customer is offered the same incentive
No discount

The following graph shows our outcomes.

Expected total profit compared to no campaign: 17%

Expected total profit compared to uniform discount allocation: 5%

Lastly, we add the discount to our customer data. We can see how the discount we offer varies across our customer base. The red vertical line shows the value of the uniform discount. The pattern of discounts we offer closely mirrors the pattern in the prediction scores, where many customers aren’t identified as likely churners, and a few are identified as highly likely to churn.

We can also see a sample of the discounts we’d be offering to individual customers. See the following table.

For each customer, we can see their total monthly spend and the optimal incentive to offer them. We can see that the discount varies by churn probability, and we’re assured that the incentive campaign fits within our budget.

Depending on the size of the total budget we allocate, we may occasionally find that we’re offering all customers a discount. This discount allocation problem reminds us of the water-filling algorithm in wireless communications [4,5], where the problem is of maximizing the mutual information between the input and the output of a channel composed of several subchannels (such as a frequency-selective channel, a time-varying channel, or a set of parallel subchannels arising from the use of multiple antennas at both sides of the link) with a global power constraint at the transmitter. More power is allocated to the channels with higher gains to maximize the sum of data rates or the capacity of all the channels. The solution to this class of problems can be interpreted as pouring a limited volume of water into a tank, the bottom of which has the stair levels determined by the inverse of the subchannel gains.

Unfortunately, our problem doesn’t have an intuitive explanation as for the water-filling problem. This is due to the fact that, because of the nature of the objective function, the system of equations and inequalities corresponding to the KKT conditions [6] doesn’t admit a closed form solution.

The optimal incentives calculated here are the result of an optimization routine designed to maximize an economic figure, which is the expected total profit. Although this approach provides a principled way for marketing teams to make systematic, quantitative, and analytics-driven decisions, it’s also important to recall that the objective function to be optimized is a proxy measure to the actual total profit. It goes without saying that we can’t compute the actual profit based on future decisions (this would paradoxically imply maximizing the actual return based on future values of the stocks). But we can explore new ideas using techniques such as the potential outcomes work [7], which we could use to design strategies for back-testing our solution.

Conclusion

We’ve now taken another step towards preventing customer churn. We built on a prior post in which we integrated our customer data with our ML model to predict churn. We can now experiment with variations on this optimization equation, and see the effect of different campaign budgets or even different theories of how they should be modeled.

To gather more data on effective incentives and customer behavior, we could also test several campaigns against different subsets of our customers. We can collect their responses—do they churn after being offered this incentive, or not—and use that data in a future ML model to further refine the incentives offered. We can use this data to learn what kinds of incentives convince customers with different characteristics to stay, and use that new function within this optimization.

We’ve empowered marketing teams with the tools to make data-driven decisions that they can quickly turn into action. This approach can drive fast iterations on incentive programs, moving at the speed with which our customers make decisions. Over to you, marketing!

Bibliography

[1] S. Uryasev, P. M. Pardalos, Stochastic Optimization: Algorithm and Applications, Kluwer Academic: Norwell, MA, USA, 2001.

[2] John R. Birge and François V. Louveaux. Introduction to Stochastic Programming. Springer Verlag, New York, 1997.

[3] Francis, J. C. and Kim, D. (2013). Modern portfolio theory: Foundations, analysis, and new developments (Vol. 795). John Wiley & Sons.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[5] D. P. Palomar and J. R. Fonollosa, “Practical algorithms for a family of water-filling solutions,” IEEE Trans. Signal Process., vol. 53, no. 2, pp. 686–695, Feb. 2005.

[6] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] Imbens, G. W. and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.

About the Authors

Marco Guerriero, PhD, is a Practice Manager for Emergent Technologies and Intelligence Platform for AWS Professional Services. He loves working on ways for emergent technologies such as AI/ML, Big Data, IoT, Quantum, and more to help businesses across different industry verticals succeed within their innovation journey.

Veronika Megler, PhD, is Principal Data Scientist for Amazon.com Consumer Packaging. Until recently she, was the Principal Data Scientist for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively. Her work has lately been focused more heavily on economic impacts of ML models and exploring causality.

Oliver Boom is a London based consultant for the Emerging Technologies and Intelligent Platforms team at AWS. He enjoys solving large-scale analytics problems using big data, data science and dev ops, and loves working at the intersection of business and technology. In his spare time he enjoys language learning, music production and surfing.

Dr Sokratis Kartakis is a UK-based Data Science Consultant for AWS. He works with enterprise customers to help them adopt and productionize innovative Machine Learning (ML) solutions at scale solving challenging business problems. His focus areas are ML algorithms, ML Industrialization, and AI/MLOps. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.

Table of Contents

Abstract

Where We Are Coming From

Sibyl (2007 – 2020)

TFX (2017 – ?)

Lessons From Our 10+ Year Journey Of ML Platform Evolution

What Remains The Same And Why

Applied ML

The Rules Of Machine Learning

The Discipline Of ML Engineering

Infrastructure

Building On The Shoulders Of Giants

Interoperability And Positive Externalities

What Is Different And Why

Environment And Device Portability

Modularity And Layering

Multi-faceted Flexibility

Where We Are Going

Drive Interoperability And Standards

Increase Automation

Improve ML Understanding

Uphold High Standards And Best Practices

Improve Tooling

A Joint Journey

The TFX Team

The TFX Team … Extended

Feedback Workflow Architecture.

Real-time classification

Feedback collection

Human classification

Solution overview

Prerequisites

Creating a custom classifier and an endpoint

Creating a custom classifier real-time endpoint

Creating a private workforce, worker task template, and human review workflow.

Creating a labeling workforce

Creating a worker task template

Creating a human review workflow

Deploying the CloudFormation template to set up active learning feedback

Testing the feedback loop

Real-time classification

Implicit feedback

Explicit feedback

Invoking the explicit real-time feedback loop

Submitting explicit feedback as a batch file

Human review tasks

Cleaning up

Conclusion

About the Authors

Solution overview

Prerequisites

Preparing the data

Creating the Data Catalog and browsing the data

Creating a Jupyter notebook and installing PyAthena

Detecting anomalies with Athena, Pandas, and Amazon SageMaker

Visualizing the data and removing outliers

Cleaning up

Conclusion

About the Authors

Detecting the football in NFL Broadcast videos

Creating an object detection dataset with Amazon SageMaker Ground Truth

Supercharging training with Apache MXNet Gluon and Amazon SageMaker Script Mode

Object detection algorithm: Background

Hyperparameter optimization on Amazon SageMaker

Deploying the model

Model inference for football detection

Sensitivity and error analysis

Summary

About the Authors

Background

Animating the Mona Lisa with Deep Learning

Image Blending

Tracking the Viewer’s Head via BlazeFace

Try it out!

Acknowledgements

Currency symbols

Detecting rows and columns in large tables

Improved accuracy in forms

Summary

About the Author