The Power of Two: VMware, NVIDIA Bring AI to the Virtual Data Center

Two key components of enterprise AI just snapped in place thanks to longtime partners who pioneered virtual desktops, virtual graphics workstations and more.

Taking their partnership to a new level, VMware and NVIDIA are uniting accelerated computing and virtualization to bring the power of AI to every company.

It’s a collaboration that will enable users to run data analytics and machine learning workloads in containers or virtual machines, secured and managed with familiar VMware tools. It will create a new sweet spot in hybrid cloud computing with greater control, lowered costs and expanded performance.

The partnership puts the AI power that public clouds deliver from the world’s largest AI data centers behind the firewalls of private companies.

The two companies will demonstrate these capabilities this week at VMworld.

Welcome to the Modern, Accelerated Data Center

Thanks to this collaboration, users will be able to run AI and data science software from NGC Catalog, NVIDIA’s hub for GPU-optimized AI software, using containers or virtual machines in a hybrid cloud based on VMware Cloud Foundation. It’s the kind of accelerated computing that’s a hallmark of the modern data center.

NVIDIA and VMware also launched a related effort enabling users to build a more secure and powerful hybrid cloud accelerated by NVIDIA BlueField-2 DPUs. These data processing units are built to offload and accelerate software-defined storage, security and networking tasks, freeing up CPU resources for enterprise applications.

Enterprises Gear Up for AI

Machine learning lets computers write software humans never could. It’s a capability born in research labs that’s rapidly spreading to data centers across every industry from automotive and banking to healthcare, retail and more.

The partnership will let VMware users train and run neural networks across multiple GPUs in public and private clouds. It also will enable them to share a single GPU across multiple jobs or users thanks to the multi-instance capabilities in the latest NVIDIA A100 GPUs.

To achieve these goals, the two companies will bring GPU acceleration to VMware vSphere to run AI and data-science jobs at near bare-metal performance next to existing enterprise apps on standard enterprise servers. In addition, software and models in NGC will support VMware Tanzu.

With these links, AI workloads can be virtualized and virtual environments become AI-ready without sacrificing system performance. And users can create hybrid clouds that give them the choice to run jobs in private or public data centers.

Companies will no longer need standalone AI systems for machine learning or big data analytics that are separate from their IT resources. Now a single enterprise infrastructure can run AI and traditional workloads managed by VMware tools and administrators.

“We’re providing the best of both worlds by bringing mature management capabilities to bare-metal systems and great performance to virtualized AI workloads,” said Kit Colbert, vice president and CTO of VMware’s cloud platform group.

Demos Show the Power of Two

Demos at VMworld will show a platform that delivers AI results as fast as the public cloud and is robust enough to tackle critical jobs like fighting COVID-19. They will run containers from NVIDIA NGC, managed by Tanzu, on VMware Cloud Foundation.

We’ll show those same VMware environments also tapping into the power of BlueField-2 DPUs to secure and accelerate hybrid clouds that let remote designers collaborate in an immersive, real-time environment.

That’s just the beginning. NVIDIA is committed to giving VMware the support to be a first-class platform for everything we build. In the background, VMware and NVIDIA engineers are driving a multi-year effort to deliver game-changing capabilities.

Colbert of VMware agreed. “We view the two initiatives we’re announcing today as initial steps, and there is so much more we can do. We invite customers to tell us what they need most to help prioritize our work,” he said.

To learn more, register for the early-access program and tune in to VMware sessions at GTC 2020 next week.



The post The Power of Two: VMware, NVIDIA Bring AI to the Virtual Data Center appeared first on The Official NVIDIA Blog.

Networks on Steroids: VMware, NVIDIA Power the Data Center with DPUs

The data center’s grid is about to plug in to a new source of power.

It rides a kind of network interface card called a SmartNIC. Its smarts and speed spring from an ASIC called a data processing unit.

In short, the DPU packs the power of data center infrastructure on a chip.

DPU-enabled SmartNICs will be available for millions of virtualized servers thanks to a collaboration between VMware and NVIDIA. They bring advances in security and storage as well as networking that will stretch from the core to the edge of the corporate network.

What’s more, the companies announced a related initiative that will put the power of the public AI cloud behind the corporate firewall. It enables enterprise AI managed with familiar VMware tools.

Lighting Up the Modern Data Center

Together, these efforts will give users the choice to run machine learning workloads in containers or virtual machines, secured and managed with familiar VMware tools. And they will create a new sweet spot in hybrid cloud computing with greater control, lowered costs and the highest performance.

Laying the foundation for these capabilities, the partnership will help users build more secure and powerful distributed networks inside VMware Cloud Foundation, powered by the NVIDIA BlueField-2 DPU. It’s the Swiss Army knife of data center infrastructure that can accelerate security, storage, networking, and management tasks, freeing up CPUs to focus on enterprise applications.

The DPU’s jobs include:

  • Blocking malware
  • Advanced encryption
  • Network virtualization
  • Load balancing
  • Intrusion detection and prevention
  • Data compression
  • Packet switching
  • Packet inspection
  • Managing pools of solid-state and hard-disk storage

Our DPUs can run these tasks today across two ports, each carrying traffic at 100 Gbit/second. That’s an order of magnitude faster than CPUs geared for enterprise apps. The DPU is taking on these jobs so CPU cores can run more apps, boosting vSphere and data center efficiency.

As a result, data centers can handle more apps and their networks will run faster, too.

“The BlueField-2 SmartNIC is a fundamental building block for us because we can take advantage of its DPU hardware for better network performance and dramatically reduced cost to operate data center infrastructure,” said Kit Colbert, vice president and CTO of VMware’s cloud platform group.

NVIDIA BlueField-2 DPU in VMware's Project Monterey
Running VMware Cloud Foundation on the NVIDIA BlueField-2 DPU provides security isolation and lets CPUs support more apps per server.

Securing the Data Center with DPUs

DPUs also will usher in a new era of advanced security.

Today, most companies run their security policies on the same CPUs that run their applications. That kind of multitasking leaves IT departments vulnerable to malware or attacks in the guise of a new app.

With the BlueField DPU, all apps and requests can be vetted on a processor isolated from the application domain, enforcing security and other policies. Many cloud computing services already use this approach to create so-called zero-trust environments where software authenticates everything.

VMware is embracing SmartNICs in its products as part of an initiative called Project Monterey. With SmartNICs, corporate data centers can take advantage of the same advances Web giants enjoy.

“These days the traditional security perimeter is gone. So, we believe you need to root security in the hardware of the SmartNIC to monitor servers and network traffic very fast and without performance impacts,” said Colbert.

BlueField-2 DPU demo with VMware
A demo shows an NVIDIA BlueField-2 DPU preventing a DDoS attack that swamps a CPU.

See DPUs in Action at VMworld

The companies are demonstrating these capabilities this week at VMworld. For example, the demo below shows how virtual servers running VMware ESXi clients can use BlueField-2 DPUs to stop a distributed denial-of-service attack in a server cluster.

Leading OEMs are already preparing to bring the capabilities of DPUs to market. NVIDIA also plans to support BlueField-2 SmartNICs across its portfolio of platforms including its EGX systems for enterprise and edge computing.

You wouldn’t hammer a nail with a monkey wrench or pound in a screw with a hammer — you need to use the right tool for the job. To build the modern data center network, that means using an NVIDIA DPU enabled by VMware.

The post Networks on Steroids: VMware, NVIDIA Power the Data Center with DPUs appeared first on The Official NVIDIA Blog.

Drug Discovery in the Age of COVID-19

Drug discovery is like searching for the right jigsaw tile in a puzzle box with 10^60 molecular-size pieces. AI and HPC tools help researchers more quickly narrow down the options, like picking out a subset of correctly shaped and colored puzzle pieces to experiment with.

An effective small-molecule drug will bind to a target enzyme, receptor or other critical protein along the disease pathway. Like the perfect puzzle piece, a successful drug will be the ideal fit, possessing the right shape, flexibility and interaction energy to attach to its target.

But it’s not enough just to interact strongly with the target. An effective therapeutic must modify the function of the protein in just the right way, and also possess favorable absorption, distribution, metabolism, excretion and toxicity properties — creating a complex optimization problem for scientists.

Researchers worldwide are racing to find effective vaccine and drug candidates to inhibit infection with and replication of SARS-CoV-2, the virus that causes COVID-19. Using NVIDIA GPUs, they’re accelerating this lengthy discovery process — whether for structure-based drug design, molecular docking, generative AI models, virtual screening or high-throughput screening.

Identifying Protein Targets with Genomics

To develop an effective drug, researchers have to know where to start. A disease pathway — a chain of signals between molecules that trigger different cell functions — may involve thousands of interacting proteins. Genomic analyses can provide invaluable insights for researchers, helping them identify promising proteins to target with a specific drug.

With the NVIDIA Clara Parabricks genome analysis toolkit, researchers can sequence and analyze genomes up to 50x faster. Given the unprecedented spread of the COVID pandemic, getting results in hours versus days can have an extraordinary impact on understanding the virus and developing treatments.

To date, hundreds of institutions, including hospitals, universities and supercomputing centers, in 88 countries have downloaded the software to accelerate their work — to sequence the viral genome itself, as well as to sequence the DNA of COVID patients and investigate why some are more severely affected by the virus than others.

Another method, cryo-EM, uses electron microscopes to directly observe flash-frozen proteins — and can harness GPUs to shorten processing time for the complex, massive datasets involved.

Using CryoSPARC, a GPU-accelerated software built by Toronto startup Structura Biotechnology, researchers at the National Institutes of Health and the University of Texas at Austin created the first 3D, atomic-scale map of the coronavirus, providing a detailed view into the virus’ spike proteins, a key target for vaccines, therapeutic antibodies and diagnostics.

GPU-Accelerated Compound Screening

Once a target protein has been identified, researchers search for candidate compounds that have the right properties to bind with it. To evaluate how effective drug candidates will be, researchers can screen drug candidates virtually, as well as in real-world labs.

New York-based Schrödinger creates drug discovery software that can model the properties of potential drug molecules. Used by the world’s biggest biopharma companies, the Schrödinger platform allows its users to determine the binding affinity of a candidate molecule on NVIDIA Tensor Core GPUs in under an hour and with just a few dollars of compute cost — instead of many days and thousands of dollars using traditional methods.

Generative AI Models for Drug Discovery

Rather than evaluating a dataset of known drug candidates, a generative AI model starts from scratch. Tokyo-based startup Elix, Inc., a member of the NVIDIA Inception virtual accelerator program, uses generative models trained on NVIDIA DGX Station systems to come up with promising molecular structures. Some of the AI’s proposed molecules may be unstable or difficult to synthesize, so additional neural networks are used to determine the feasibility for these candidates to be tested in the lab.

With DGX Station, Elix achieves up to a 6x speedup on training the generative models, which would otherwise take a week or more to converge, or to reach the lowest possible error rate.

Molecular Docking for COVID-19 Research

With the inconceivable size of the chemical space, researchers couldn’t possibly test every possible molecule to figure out which will be effective to combat a specific disease. But based on what’s known about the target protein, GPU-accelerated molecular dynamics applications can be used to approximate molecular behavior and simulate target proteins at the atomic level.

Software like AutoDock-GPU, developed by the Center for Computational Structural Biology at the Scripps Research Institute, enables researchers to calculate the interaction energy between a candidate molecule and the protein target. Known as molecular docking, this computationally complex process simulates millions of different configurations to find the most favorable arrangement of each molecule for binding. Using the more than 27,000 NVIDIA GPUs on Oak Ridge National Laboratory’s Summit supercomputer, scientists were able to screen 1 billion drug candidates for COVID-19 in just 12 hours. Even using a single NVIDIA GPU provides more than 230x speedup over using a single CPU.
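As a rough back-of-the-envelope check, the Summit figures above can be turned into throughput numbers. This is only a sketch: it treats "more than 27,000" as exactly 27,000 GPUs (an assumption), so the per-GPU rate is an upper bound, and the variable names are illustrative.

```python
# Back-of-the-envelope throughput from the figures quoted in the text.
CANDIDATES = 1_000_000_000   # drug candidates screened
HOURS = 12                   # wall-clock time for the screen
GPUS = 27_000                # assumption: "more than 27,000" taken as exactly 27,000

seconds = HOURS * 3600
cluster_rate = CANDIDATES / seconds      # candidates per second, whole system
per_gpu_rate = cluster_rate / GPUS       # candidates per second per GPU (upper bound)

print(f"Whole system: ~{cluster_rate:,.0f} candidates/second")
print(f"Per GPU:      ~{per_gpu_rate:.2f} candidates/second")
```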

Argonne deployed one of the first DGX-A100 systems. Courtesy of Argonne National Laboratory.

In Illinois, Argonne National Laboratory is accelerating COVID-19 research using an NVIDIA A100 GPU-powered system based on the DGX SuperPOD reference architecture. Argonne researchers are combining AI and advanced molecular modelling methods to perform accelerated simulations of the viral proteins, and to screen billions of potential drug candidates, determining the most promising molecules to pursue for clinical trials.

Accelerating Biological Image Analysis

The drug discovery process involves significant high-throughput lab experiments as well. Phenotypic screening is one method of testing, in which a diseased cell is exposed to a candidate drug. With microscopes, researchers can observe and record subtle changes in the cell to determine if it starts to more closely resemble a healthy cell. Using AI to automate the process, thousands of possible drugs can be screened.

Digital biology company Recursion, based in Salt Lake City, uses AI and NVIDIA GPUs to observe these subtle changes in cell images, analyzing terabytes of data each week. The company has released an open-source COVID dataset, sharing human cellular morphological data with researchers working to create therapies for the virus.

Future Directions in AI for Drug Discovery

As AI and accelerated computing continue to accelerate genomics and drug discovery pipelines, precision medicine — personalizing individual patients’ treatment plans based on insights about their genome and their phenotype — will become more attainable.

Increasingly powerful NLP models will be applied to organize and understand massive datasets of scientific literature, helping connect the dots between independent investigations. Generative models will learn the fundamental equations of quantum mechanics and be able to suggest the optimal molecular therapy for a given target.

To learn more about how NVIDIA GPUs are being used to accelerate drug discovery, check out talks by Schrödinger, Oak Ridge National Laboratory and Atomwise at the GPU Technology Conference next week.

For more on how AI and GPUs are advancing COVID research, read our blog stories and visit the COVID-19 research hub.

The post Drug Discovery in the Age of COVID-19 appeared first on The Official NVIDIA Blog.

AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers using models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to the EC2 G4 GPU-based instances.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively from popular ML frameworks such as TensorFlow, PyTorch, and MXNet. The AWS Neuron SDK is integrated with these frameworks, so you can move existing models to Amazon EC2 Inf1 instances with minimal code changes. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.

Since its launch, the Neuron SDK has seen a dramatic improvement in the breadth of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput to GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a performance improvement of a 72% reduction in cost than the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or use Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.

About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years’ experience in engineering and business disciplines. He started his career as an embedded software engineer and later moved to product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration software and hardware products such as the EC2 FPGA F1 instances and AWS Inferentia, alongside its Neuron SDK, accelerating machine learning in the cloud.

Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added.

In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

  • Amazon SageMaker – A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
  • AWS Glue – A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use case

For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We train the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes.

This post doesn’t go into the details of the model, but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution overview

The following diagram summarizes the approach for the retraining pipeline.

The workflow contains the following elements:

  • AWS Glue crawler – You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
  • AWS Glue triggers – Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
  • AWS Glue job – An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location.
  • AWS Glue workflow – An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image.
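A workflow like the one described above can also be started and monitored from code rather than from the console. The following is a minimal sketch, assuming a boto3 Glue client is passed in (for example, boto3.client("glue")); the helper name run_workflow and the polling interval are illustrative and not part of the original pipeline.

```python
import time


def run_workflow(glue_client, workflow_name, poll_seconds=60, sleep=time.sleep):
    """Start a Glue workflow run and block until it is no longer RUNNING.

    glue_client is expected to behave like boto3.client("glue").
    Returns the run ID and the last observed status (for example COMPLETED or ERROR).
    """
    run_id = glue_client.start_workflow_run(Name=workflow_name)["RunId"]
    while True:
        run = glue_client.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        if run["Status"] != "RUNNING":
            return run_id, run["Status"]
        sleep(poll_seconds)  # wait before polling again
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object before pointing it at a real account.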

The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training job runs on a Python shell running in AWS Glue jobs, which starts a training job in Amazon SageMaker based on a set of hyperparameters.
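To make the preprocessing step more concrete, here is a small sketch of what it might look like. It assumes the DBpedia CSV rows have the form (class index, title, content) and that the target is BlazingText's supervised text format, where each line starts with a __label__ prefix. The function names, the lowercase-and-whitespace tokenization, and the 80/10/10 split are illustrative assumptions, not the post's actual script.

```python
import csv
import io
import random


def to_blazingtext_line(row):
    """Turn one (class_index, title, content) row into '__label__<k> <text>'."""
    label, title, content = row
    tokens = f"{title} {content}".lower().split()  # naive tokenization (assumption)
    return f"__label__{label} " + " ".join(tokens)


def split_rows(rows, train_frac=0.8, validate_frac=0.1, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_validate = int(len(rows) * validate_frac)
    train = rows[:n_train]
    validate = rows[n_train:n_train + n_validate]
    test = rows[n_train + n_validate:]
    return train, validate, test


# Made-up sample row, used only to show the output format.
sample_csv = io.StringIO('3,"Example Title","Some example text."\n')
lines = [to_blazingtext_line(row) for row in csv.reader(sample_csv)]
print(lines[0])
```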

When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status.
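The wait for InService status could be implemented as a simple polling loop. The sketch below assumes a boto3 SageMaker client is passed in (for example, boto3.client("sagemaker")); the helper name, polling interval, and retry limit are assumptions made for illustration.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30,
                          max_polls=60, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint reports InService.

    sm_client is expected to behave like boto3.client("sagemaker").
    Raises RuntimeError if the endpoint fails, TimeoutError if it never becomes ready.
    """
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)  # still creating or updating; try again later
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")
```

boto3 also ships a built-in waiter for this purpose (get_waiter("endpoint_in_service")), which may be preferable when custom handling is not needed.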

At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.
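The final notification step might look like the following sketch. It assumes a boto3 SQS client is passed in (for example, boto3.client("sqs")); the message fields and the helper name are hypothetical, since the post does not specify the message format.

```python
import json


def notify_completion(sqs_client, queue_url, endpoint_name, status):
    """Send a small JSON message announcing that the pipeline run finished.

    sqs_client is expected to behave like boto3.client("sqs").
    The message schema below is a made-up example.
    """
    body = json.dumps({"endpoint": endpoint_name, "status": status})
    response = sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]
```

A downstream consumer can then read the queue and, for example, send an email or write a log record.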

Setting up the environment

To set up the environment, complete the following steps:

  1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
  2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
  3. Download the following code into your local directory.

Organization of code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration

The code directory is divided into three parts:

  • AWS CloudFormation templates – The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to ARNs further in the workflow.
  • AWS Glue scripts – The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
  • Bash script – A wrapper script is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.
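Because the templates write resource names and ARNs to Parameter Store, later steps can look them up by name instead of hard-coding them. The sketch below assumes a boto3 Systems Manager client is passed in (for example, boto3.client("ssm")); the "/<stage>/<resource>" naming scheme and the helper name are hypothetical and may differ from what the templates actually use.

```python
def get_resource_value(ssm_client, stage, resource):
    """Look up a resource name or ARN stored by the CloudFormation templates.

    ssm_client is expected to behave like boto3.client("ssm").
    The parameter naming scheme is an assumption made for this example.
    """
    name = f"/{stage}/{resource}"  # hypothetical parameter name, e.g. /dev/...
    return ssm_client.get_parameter(Name=name)["Parameter"]["Value"]
```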

Implementing the solution

Complete the following steps:

  1. Go to the file and replace algorithm_image name with <ecr_path> based on your Region.

The following code example is a path for Region us-west-2:


For more information about BlazingText parameters, see Common parameters for built-in algorithms.

  2. Enter the following code in your terminal:
    sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.

  3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
  4. On the AWS Glue console, manually start the pipeline.

In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.

  5. To begin the workflow, in the Workflow section, select DevMLWorkflow.
  6. From the Actions drop-down menu, choose Run.
  7. View the progress of your workflow on the History tab and select the latest RUN ID.

The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

  8. After the workflow is successful, open the Amazon SageMaker console.
  9. Under Inference, choose Endpoint.

The following screenshot shows that the endpoint the workflow deployed is ready.

Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.

Cleaning up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

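    import boto3

    def delete_resources(self):
        # Assumption: the endpoint, its configuration, and the model all share
        # the endpoint's name, as the messages below suggest. The delete calls
        # use the standard boto3 SageMaker client.
        sagemaker_client = boto3.client("sagemaker")
        endpoint_name = self.endpoint

        # Remove the hosted endpoint first.
        try:
            sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
            print("Deleted Test Endpoint ", endpoint_name)
        except Exception as e:
            print('Model endpoint deletion failed:', e)

        # Then remove the endpoint configuration.
        try:
            sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
            print("Deleted Test Endpoint Configuration ", endpoint_name)
        except Exception as e:
            print('Endpoint config deletion failed:', e)

        # Finally remove the model itself.
        try:
            sagemaker_client.delete_model(ModelName=endpoint_name)
            print("Deleted Test Endpoint Model ", endpoint_name)
        except Exception as e:
            print('Model deletion failed:', e)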

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.

About the Authors

Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.




Inchara B Diwakar is a Data Scientist at AWS Professional Services. She designs and engineers ML solutions at scale, with experience across healthcare, manufacturing and retail verticals. Outside of work, she enjoys the outdoors, traveling and a good read.

BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents

Bidirectional Encoder Representations from Transformers (BERT) [1] has become one of the most popular models for natural language processing (NLP) applications. BERT can outperform other models in several NLP tasks, including question answering and sentence classification.

Training the BERT model on large datasets is expensive and time consuming, and achieving low latency when performing inference on this model is challenging. Latency and throughput are key factors when deploying a model in production. In this post, we focus on optimizing these factors for BERT inference tasks. We also compare the cost of deploying BERT on different Amazon Elastic Compute Cloud (Amazon EC2) instances.

When running inference on the BERT-base model, the g4dn.xlarge GPU instance achieves 2.6–5 times lower latency (3.8 times on average) than a c5.24xlarge CPU instance. The g4dn.xlarge instance also achieves the best cost-effectiveness (cost per request) compared to c5.xlarge, c5.24xlarge, and m5.xlarge CPU instances. Specifically, the cost of processing 1 million BERT-inference requests with sequence length 128 is $0.20 on g4dn.xlarge, whereas on c5.xlarge (the best of these CPU instances), the cost is $3.31, which makes the GPU instance about 16.5 times more cost-efficient.

We achieved these results after a set of GPU optimizations on MXNet, described in the section Optimizing BERT model performance on MXNet 1.6 and 1.7 of this post.

Amazon EC2 G4 instances

G4 instances are optimized for machine learning application deployments. They’re equipped with NVIDIA T4 GPUs, powered by Tensor Cores, and deliver groundbreaking AI performance: up to 65 TFLOPS in FP16 precision and up to 130 TOPS in INT8 precision.

Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference below 5 milliseconds on a single T4 GPU with 16 GB, such as on a g4dn.xlarge instance (the cost of this instance at the time of writing is $0.526 per hour on demand in the US East (N. Virginia) Region).

For more information about G4 instances, see Amazon EC2 G4 Instances.

GluonNLP and MXNet

GluonNLP is a deep learning framework built on top of MXNet and designed specifically for NLP applications. It extends MXNet, providing NLP models, datasets, and examples.

GluonNLP includes an efficient implementation of the BERT model, scripts for training and performing inference, and several datasets (such as GLUE benchmark and SQuAD). For more information, see GluonNLP: NLP made easy.

For this post, we use the GluonNLP BERT implementation to perform inference on NLP tasks. Specifically, we use MXNet version 1.7 and GluonNLP version 0.10.0.

BERT-base inference results

We present results for two different BERT tasks: question answering and classification (sentiment analysis using the Stanford Sentiment Treebank (SST2) dataset). We achieved the results after a set of GPU optimizations on MXNet.

In the following graphs, we compare latency achieved by a single GPU on a g4dn.xlarge instance with FP16 precision vs. the most efficient CPU instance in terms of latency, c5.24xlarge with INT8 precision, MKL BLAS and 24 OpenMP threads.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a classification inference task (SST2 dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a question answering inference task (SQuAD dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.


In the following two graphs, we present a cost comparison between several instances based on the throughput (sentences/s) and the cost of each instance on demand (cost per hour) in the US East (N. Virginia) Region.

The following graph shows dollars per 1 million sequence classification requests, for different instances, batch size 128, and several sequence lengths (80, 128 and 384). The price on demand of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

The following graph shows dollars per 1 million question answering requests, for different instances, batch size 128, and several sequence lengths (80, 128 and 384). The price on demand of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.
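The cost figures follow from two inputs: the hourly on-demand price and the sustained throughput. The following sketch shows the arithmetic. The function names are hypothetical, and the throughputs are back-computed from the rounded dollar figures quoted earlier in this post ($0.20 and $3.31 per million requests at sequence length 128), so treat them as rough implied values, not measured ones.

```python
def cost_per_million(price_per_hour, throughput_per_s):
    """Dollars needed to serve 1 million requests at a sustained throughput."""
    seconds_needed = 1_000_000 / throughput_per_s
    return price_per_hour * seconds_needed / 3600.0


def implied_throughput(price_per_hour, dollars_per_million):
    """Invert cost_per_million: requests per second implied by a quoted cost."""
    return price_per_hour * 1_000_000 / (3600.0 * dollars_per_million)


# Hourly prices and per-million costs as quoted in this post.
g4_throughput = implied_throughput(0.526, 0.20)   # g4dn.xlarge
c5_throughput = implied_throughput(0.17, 3.31)    # c5.xlarge

print(f"g4dn.xlarge: ~{g4_throughput:.0f} requests/s implied")
print(f"c5.xlarge:   ~{c5_throughput:.1f} requests/s implied")
print(f"cost ratio:  {3.31 / 0.20:.2f}x")
```

Because cost scales with price divided by throughput, an instance with a higher hourly price can still be the cheaper option per request when its throughput is high enough.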

Deploying BERT on G4 instances

You can easily reproduce the results in the preceding section on a g4dn.xlarge instance. You can start from a pretrained model and fine-tune it for a specific task before running inference, or you can download one of the following fine-tuned models:

Then complete the following steps:

  1. To initialize a G4 instance, on the Amazon EC2 console, choose Deep Learning AMI (Ubuntu 18.04) Version 28.1 (or later) and a G4 instance.
  2. Connect to the instance and install MXNet 1.7 and GluonNLP 0.10.x:
pip install mxnet-cu102==1.7.0
git clone --branch v0.10.x
cd gluon-nlp; pip install -e .; cd scripts/bert
python install

 The command python install generates a custom graph pass that optimizes the graph, and therefore performance. It can be passed to the inference script as an argument.

  3. If you didn’t download any fine-tuned parameters, you can now fine-tune your model to specify a sequence length and use a GPU.
    • For a question answering task, run the following script (approximately 180 minutes):
python3 --max_seq_length 128 --gpu
    • For a classification task, run the following script:
python3 --task_name [task_name] --max_len 128 --gpu 0

In the preceding code, task choices include ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’ (refers to SST2), ‘XNLI’, ‘LCQMC’, and ‘ChnSentiCorp’. Computation time depends on the specific task. For SST, it should take less than 15 minutes.

By default, these scripts run 3 epochs (to achieve the published accuracy in [1]).

They generate an output file, output_dir/net.params, where the fine-tuned parameters are stored and from where they can be loaded at the inference step. The scripts also perform a prediction test to check accuracy.

You should get an F1 score of 85 or higher in question answering, and a validation metric higher than 0.92 in the SST classification task.

You can now perform inference using validation datasets.

  4. Force MXNet to use FP32 precision in Softmax and LayerNorm layers for better accuracy when using FP16.

These two layers are susceptible to overflow, so we recommend always using FP32. MXNet takes care of it if you set the following:

  5. Activate True FP16 computation for performance purposes.

General matrix multiply operations don’t present accuracy issues in this model. By default, they’re computed using FP32 accumulation (for more information, see the section Optimizing BERT model performance on MXNet 1.6 and 1.7 in this post), but you can activate the FP16 accumulation setting:

 export MXNET_FC_TRUE_FP16=1
  6. Run inference:
python3 --model_parameters [path_to_finetuned_params] --task [task] --gpu 0 --dtype float16

In the preceding code, the task can be one of ‘QA’, ‘embedding’, ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’, ‘XNLI’, ‘LCQMC’, or ‘ChnSentiCorp’ [1].

This command exports the model (JSON and parameter files) into the output directory (output_dir/[task_name]), and performs inference using the validation dataset corresponding to each task.

It reports the average latency and throughput.
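The following is a minimal sketch of how average latency and throughput relate to each other. It is not the GluonNLP script's actual measurement code: `benchmark` and `run_batch` are hypothetical names, and the warm-up and iteration counts are arbitrary. With an asynchronous framework, `run_batch` must block until the result is ready (in MXNet 1.x, for example, by calling `mx.nd.waitall()`); otherwise the timer only measures how long it takes to enqueue the work.

```python
import time


def benchmark(run_batch, batch_size, n_iters=10, n_warmup=2):
    """Time run_batch() and return (average latency in ms, samples per second)."""
    for _ in range(n_warmup):          # warm-up iterations are not timed
        run_batch()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_batch()
    elapsed = time.perf_counter() - start
    avg_latency_ms = 1000.0 * elapsed / n_iters
    throughput = batch_size * n_iters / elapsed
    return avg_latency_ms, throughput


# Stand-in workload; a real run would invoke the model on one batch.
latency_ms, samples_per_s = benchmark(lambda: sum(range(1000)), batch_size=8)
print(f"{latency_ms:.3f} ms per batch, {samples_per_s:.0f} samples/s")
```

Throughput is batch size divided by per-batch latency, so larger batches usually raise throughput at the price of higher latency per request.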

The second time you run it, you can skip the export step by adding the flag --only_infer and specifying the exported model to use by adding --exported_model followed by the prefix name of the JSON or parameter files.

Optimal latency is achieved on G4 instances with FP16 precision. We recommend adding the flag --dtype float16 and activating MXNET_FC_TRUE_FP16 when performing inference. These flags shouldn’t reduce the final accuracy in your results.
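If you launch inference from Python instead of a shell, one option is to pass the variable through the child process environment. This is only a sketch: `fp16_inference_env` is a hypothetical helper, and it assumes the variable must be set before the MXNet process starts.

```python
import os


def fp16_inference_env(base_env=None):
    """Return a copy of the environment with true-FP16 GEMMs enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_FC_TRUE_FP16"] = "1"   # same effect as `export MXNET_FC_TRUE_FP16=1`
    return env


env = fp16_inference_env()
print(env["MXNET_FC_TRUE_FP16"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`, together with the inference command and the `--dtype float16` flag.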

By default, all these scripts use BERT-base (12 transformer-encoder layers). If you want to use BERT-large, use the flag --bert_model bert_24_1024_16 when calling the scripts.

Optimizing BERT model performance on MXNet 1.6 and 1.7

Computationally, the BERT model is dominated by general matrix multiply operations (GEMMs). They represent up to 56% of the time consumed when performing inference. The following chart shows the percentage of computational time spent on each operation type performing BERT-base inference (sequence length 128 and batch size 128).

MXNet uses the cuBLAS library to efficiently compute these GEMMs on the GPU. These GEMMs belong to the multi-head self-attention part of the model (4 GEMMs per transformer layer), and the feed-forward network (2 GEMMs per transformer layer).

In this section, we discuss optimizing the most computationally expensive operations.

The following table shows the improvement of each optimization. The performance improvements were achieved by the different GPU BERT optimizations implemented on MXNet and GluonNLP, performing a question answering inference task (SQuAD dataset), and using a sequence length of 128. Speedup achieved is shown for different batch sizes.

LayerNorm, Softmax and AddBias

Although LayerNorm was already optimized for GPUs on MXNet 1.5, the implementation of Softmax was optimized in MXNet 1.6. The new implementation improves inference performance on GPUs by optimizing the device memory accesses and using the CUDA registers and shared memory during reduction operations more efficiently. Additionally, you have the option to apply a max_length mask within the C++ Softmax operator, which removes the need to apply the mask at the Python level.
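To illustrate what a length mask inside Softmax does, the following pure-Python sketch normalizes only the first `length` scores of a row and assigns zero probability to the padded positions. It is a conceptual reference, not the MXNet kernel; `masked_softmax` is a hypothetical name and it assumes `length` is at least 1.

```python
import math


def masked_softmax(scores, length):
    """Softmax over scores[:length]; positions beyond length get probability 0."""
    valid = scores[:length]
    peak = max(valid)                              # subtract the max for stability
    exps = [math.exp(s - peak) for s in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - length)


# The third score belongs to padding, so it is ignored however large it is.
print(masked_softmax([1.0, 1.0, 5.0], length=2))
```

Doing the masking inside the operator avoids building and applying a separate mask tensor before the Softmax call.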

The addition of bias terms following GEMMs was also optimized. Instead of using an mshadow broadcast summation, a custom CUDA kernel is now attached to the FullyConnected layer, which includes efficient device memory accesses.

Multi-head self-attention

The following equation defines the attention mechanism used in the BERT model [2], where Q represents the query, K the key, V the value, and d_k the inner dimension of these three matrices:
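For reference, scaled dot-product attention as defined in [2] is usually written as follows. This is restated from that paper using the symbols defined above:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
```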

Three different linear projections (FullyConnected: GEMMs and Bias-Addition) are performed to obtain Q, K, and V from the same input (when the same input is employed, the mechanism is denominated self-attention), but with different weights:

    • Q = input · W_q^T
    • K = input · W_k^T
    • V = input · W_v^T

The input size is (BatchSize, SeqLength, EmbeddingDim), and each weight tensor W size is (ProjectionDim, EmbeddingDim).

In multi-head attention, as many projections and attention functions are applied to the input as there are heads, augmenting the dimensions of the weights so that each W size is ((NumHeads x ProjectionDim), EmbeddingDim).

All these projections are independent, so we can compute them in parallel within the same operation, producing an output whose size is (BatchSize, SeqLength, 3 x NumHeads x ProjectionDim). That is, GluonNLP uses a single FullyConnected layer to compute Q, K, and V.
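The equivalence between three separate projections and one joint projection can be checked with a toy example. The sketch below uses plain nested lists and made-up sizes (2 tokens, EmbeddingDim = 2, ProjectionDim = 1, a single head, no bias). The helper `matmul_t` is hypothetical and stands in for the FullyConnected GEMM.

```python
def matmul_t(x, w):
    """Return x @ w^T for nested lists: x is (n, d), w is (p, d), result is (n, p)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


tokens = [[1.0, 2.0], [3.0, 4.0]]   # (tokens, EmbeddingDim)
w_q = [[1.0, 0.0]]                  # each weight is (ProjectionDim, EmbeddingDim)
w_k = [[0.0, 1.0]]
w_v = [[1.0, 1.0]]

# Three separate projections, concatenated per token afterwards.
separate = [
    q + k + v
    for q, k, v in zip(matmul_t(tokens, w_q), matmul_t(tokens, w_k), matmul_t(tokens, w_v))
]

# One projection with the weight rows stacked: (3 * ProjectionDim, EmbeddingDim).
w_qkv = w_q + w_k + w_v
joint = matmul_t(tokens, w_qkv)

print(joint == separate)
```

Stacking the weight rows turns three small GEMMs into one larger GEMM, which typically makes better use of the GPU.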

To compute the attention function (the preceding equation), we first need to compute the dot product QK^T. We need to perform this computation independently for each head, with m=SeqLength number of Q rows, n=SeqLength number of K columns, and k=ProjectionDim size of vectors in the dot product. We can use a batched dot product operation, where the number of batches is (BatchSize x NumHeads), to compute all the dot products within the same operation.

However, to perform such an operation in cuBLAS, we need to have the batches and heads dimensions contiguous (in order to have a regular pattern to express distance between batches), but that isn’t the case by default (SeqLength dimension is between them). To avoid rearranging Q, K, and V, GluonNLP transposes the input so that its shape is (SeqLength, BatchSize, EmbeddingDim), and Q, K, and V are directly projected into a tensor with shape (SeqLength, BatchSize, 3 x NumHeads x ProjectionDim).

Moreover, to avoid splitting the joint QKV output, we can compute the projections in an interleaved fashion, placing the weights W_q, W_k, and W_v of each individual head contiguously. The following diagram depicts the interleaved projection operation, where P is the projection size, and we end with a joint QKV output with shape (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim).

This strategy allows us to compute QK^T from a single joint input tensor with cuBLASGEMMStridedBatched, setting the number of batches to (BatchSize x NumHeads) and the stride to (3 x ProjectionDim). We also use a strided batched GEMM to compute the dot product of V (same stride as before) with the output of the Softmax function. We implemented MXNet operators that deal with this cuBLAS configuration.
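The index arithmetic behind the interleaved layout can be sketched as follows. For one sequence position, the helper lists where the Q, K, and V blocks of each (batch, head) pair start in a flat row-major buffer of shape (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim). `qkv_offsets` is a hypothetical name, and the row stride is an inference from that shape and not something stated in this post.

```python
def qkv_offsets(batch_size, num_heads, proj_dim):
    """Element offsets of the Q, K, V blocks within one sequence position."""
    batch_stride = 3 * proj_dim                    # distance between (batch, head) pairs
    row_stride = batch_size * num_heads * 3 * proj_dim  # distance between positions
    offsets = []
    for i in range(batch_size * num_heads):        # one strided batch per (batch, head)
        base = i * batch_stride
        offsets.append({"q": base, "k": base + proj_dim, "v": base + 2 * proj_dim})
    return offsets, batch_stride, row_stride


# BERT-base style head sizes (12 heads of dimension 64) with a batch of 2.
offsets, batch_stride, row_stride = qkv_offsets(batch_size=2, num_heads=12, proj_dim=64)
print(len(offsets), batch_stride, row_stride)
print(offsets[0], offsets[1])
```

Because every (batch, head) pair sits at a constant distance from the previous one, a single strided batched GEMM can walk through all of them without first copying Q, K, and V into separate tensors.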

True FP16

Since MXNet 1.7, you can compute GEMMs completely in FP16 precision. By default, when the data type is FP16, MXNet sets cuBLAS to internally use FP32 accumulation. You can now set the environment variable MXNET_FC_TRUE_FP16 to 1 to force MXNet to use FP16 as the cuBLAS internal computation type.
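To show what the accumulation type changes, the following toy emulation rounds the running sum of a dot product to either half or single precision after every step. It is a deliberately extreme case (4,096 products of exactly 1.0) chosen to make the rounding visible, not a model of how cuBLAS computes. As noted earlier, the GEMMs in this model did not show accuracy issues with FP16 accumulation. The function names are hypothetical.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot_fp16(a, b, accum_fmt):
    """Dot product of FP16 inputs with the running sum kept in accum_fmt."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to(accum_fmt, acc + x * y)
    return round_to("e", acc)      # the output is stored as FP16 either way


ones = [1.0] * 4096
print(dot_fp16(ones, ones, "f"))   # FP32 accumulation
print(dot_fp16(ones, ones, "e"))   # FP16 accumulation stalls once the sum hits 2048
```

In half precision the spacing between representable values at 2048 is 2, so adding 1.0 no longer changes the sum. This is the kind of effect the default FP32 accumulation guards against.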

Pointwise fusion and prearrangement of MHA weights and bias using a custom graph pass

Finally, the feed-forward part of the model, which happens after each transformer layer, uses Gaussian Error Linear Unit (GELU) as its activation function. This operation follows a feed-forward FullyConnected operation, which includes bias addition. We use the MXNet functionality of custom graph passes to detach the bias addition from the FullyConnected operation and fuse it with GELU through the pointwise fusion mechanism.
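Conceptually, the fusion replaces two elementwise passes over the GEMM output (add bias, then apply GELU) with a single pass. The sketch below shows that the two give the same result. It assumes the erf-based GELU formulation and uses hypothetical function names; the real fusion happens inside generated GPU kernels, not in Python.

```python
import math


def gelu(x):
    """Erf-based GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_then_gelu(values, bias):
    """Unfused: one pass writes the biased values, a second pass applies GELU."""
    biased = [v + b for v, b in zip(values, bias)]
    return [gelu(v) for v in biased]


def fused_bias_gelu(values, bias):
    """Fused: bias addition and GELU happen in the same pass over the data."""
    return [gelu(v + b) for v, b in zip(values, bias)]


gemm_out = [-1.0, 0.0, 0.5, 2.0]
bias = [0.1, 0.0, 0.5, -1.0]
print(fused_bias_gelu(gemm_out, bias) == bias_then_gelu(gemm_out, bias))
```

On a GPU, the benefit comes from reading and writing the intermediate tensor once instead of twice.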

In our custom graph pass for BERT, we also prearrange the weights and bias terms for the multi-head self-attention computation so that we avoid any overhead at runtime. As explained earlier, weights need to be interleaved, and bias terms need to be joined into a single tensor. We do this before exporting the model. This strategy shows benefits in small batch size cases.


In this post, we presented an efficient solution for performing BERT inference tasks on EC2 G4 GPU instances. We showed how a set of MXNet optimizations boost GPU performance, achieving speeds up to twice as fast in both question answering and classification tasks.

We have shown that g4dn.xlarge instances offer lower latency (below 4 milliseconds with batch size 1) than any EC2 CPU instance, and g4dn.xlarge is 3.8 times better than c5.24xlarge on average. Finally, g4dn.xlarge offers the best cost per million requests ratio—16 times better than CPU instances (c5.xlarge) on average.




We would like to thank Triston Cao, Murat Guney from NVIDIA, Sandeep Krishnamurthy from Amazon, the Amazon-MXNet team, and the NVIDIA MXNet team for their feedback and support.


The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.


  1. Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  2. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

About the Authors

Moises Hernandez Fernandez is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research into brain connectivity, optimizing the analysis of diffusion MRI using GPUs. Moises received a PhD in Neurosciences from Oxford University.

Haibin Lin is a former Applied Scientist at Amazon Web Services. He works on distributed systems, deep learning, and NLP. He is a PPMC and committer of Apache MXNet, and a major contributor to the GluonNLP toolkit. He received his M.S. in Computer Science from Carnegie Mellon University, where he was advised by Andy Pavlo. Prior to that, he received a B.Eng. in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.

Przemyslaw Tredak is a senior developer technology engineer on the Deep Learning Frameworks team at NVIDIA. He is a committer of Apache MXNet and leads the MXNet team at NVIDIA.

Anish Mohan is a Machine Learning Architect at NVIDIA and the technical lead for ML/DL engagements with key NVIDIA customers in the greater Seattle region. Before NVIDIA, he was at Microsoft’s AI Division, working to develop and deploy AI/ML algorithms and solutions.

AI in Schools: Sony Reimagines Remote Learning with Artificial Intelligence

Back to school was destined to look different this year.

With the world adapting to COVID-19, safety measures are preventing a return to in-person teaching in many places. Also, students learning through conventional video conferencing systems often find the content difficult to read, or find that teachers block the words written on presentation boards.

Faced with these challenges, educators at Prefectural University of Hiroshima in Japan envisioned a high-quality remote learning system with additional features not possible with traditional video conferencing.

They chose a distance-learning solution from Sony that links lecturers and students across their three campuses. It uses AI to make it easy for presenters anywhere to engage their audiences and impart information using captivating video. Thanks to these innovations, lecturers at Prefectural University can now teach students simultaneously on three campuses linked by a secure virtual private network.

Sony’s remote learning solution in action, with Edge Analytics Appliance, remote cameras and projectors.

AI Helps Lecturers Get Smarter About Remote Learning

At the heart of Prefectural’s distance learning system is Sony’s REA-C1000 Edge Analytics Appliance, which was developed using the NVIDIA Jetson Edge AI platform. The appliance lets teachers and speakers quickly create dynamic video presentations without using expensive video production gear or learning sophisticated software applications.

Sony’s exclusive AI algorithms run inside the appliance. These deep learning models employ techniques such as automatic tracking, zooming and cropping to allow non-specialists to produce engaging, professional-quality video in real time.

Users simply connect the Edge Analytics Appliance to a camera that can pan, tilt and zoom; a PC; and a display or recording device. In Prefectural’s case, multiple cameras capture what a lecturer writes on the board, questions and contributions from students, and up to full HD images depending on the size of the lecture hall.

Managing all of this technology is made simple for the lecturers. A touchscreen panel facilitates intuitive operation of the system without the need for complex adjustment of camera settings.

Teachers Achieve New Levels of Transparency

One of the landmark applications in the Edge Analytics Appliance is handwriting extraction, which lets students experience lectures more fully, rather than having to jot down notes.

The application uses a camera to record text and figures as an instructor writes them by hand on a whiteboard or blackboard, and then immediately draws them as if they are floating in front of the instructor.

Students viewing the lecture live from a remote location or from a recording afterward can see and recognize the text and diagrams, even if the original handwriting is unclear or hidden by the instructor’s body. The combined processing power of the compact, energy-efficient Jetson TX2 and Sony’s moving/unmoving object detection technology makes the transformation from the board to the screen seamless.

Handwriting extraction is also customizable: the transparency of the floating text and figures can be adjusted, so that characters that are faint or hard to read can be highlighted in color, making them more legible — and even more so than the original content written on the board.

Create Engaging Content Without Specialist Resources


Another innovative application is Chroma key-less CG overlay, using state-of-the-art algorithms from Sony, like moving-object detection, to produce class content without the need for large-scale video editing equipment.

Like a personal greenscreen for presenters, the application seamlessly places the speaker in front of any animations, diagrams or graphs being presented.

Previously, moving-object detection algorithms required for this kind of compositing could only be run on professional workstations. With Jetson TX2, Sony was able to include this powerful deep learning-based feature within the compact, simple design of the Edge Analytics Appliance.

A Virtual Camera Operator

Numerous additional algorithms within the appliance include those for color-pattern matching, shape recognition, pose recognition and more. These enable features such as:

  • PTZ Auto Tracking — automatically tracks an instructor’s movements and ensures they stay in focus.
  • Focus Area Cropping — crops a specified portion from a video recorded on a single camera and creates effects as if the cropped portion were recorded on another camera. This can be used to generate, for example, a picture-in-picture effect, where an audience can simultaneously see a close-up of the presenter speaking against a wide shot of the rest of the stage.
  • Close Up by Gesture — automatically zooms in on and records students or audience members who stand up in preparation to ask a question.

With the high-performance Jetson platform, the Edge Analytics Appliance can easily handle a wide range of applications like these. The result is like a virtual camera operator that allows people to create engaging, professional-looking video presentations without the expertise or expense previously required to do so.

Officials at Prefectural University of Hiroshima say the new distance learning initiative has already led to greater student and teacher satisfaction with remote learning. Linking the university’s three campuses through the system is also fostering a sense of unity among the campuses.

“We chose Sony’s Edge Analytics Appliance for our new distance learning design because it helps us realize a realistic and comfortable learning environment for students by clearly showing the contents on the board and encouraging discussion. It was also appealing as a cost-effective solution as teachers can simply operate without additional staff,” said Kyousou Kurisu, director of public university corporation, Prefectural University of Hiroshima.

Sony plans to continually update applications available on the Edge Analytics Appliance. So, like any student, the system will only get better over time.

The post AI in Schools: Sony Reimagines Remote Learning with Artificial Intelligence appeared first on The Official NVIDIA Blog.

Whether It’s Rembrandt or Toilets, ‘Curiosity About How Things Work’ Is Key to Innovation, CGI Legend Pat Hanrahan Says

You may have never heard of Pat Hanrahan, but you have almost certainly seen his work.

His list of credits includes three Academy Awards, and his work on Pixar’s RenderMan rendering technology enabled Hollywood megahits Toy Story, Finding Nemo, Cars and Jurassic Park.

Hanrahan also founded Tableau Software — snatched up by Salesforce last year for nearly $16 billion — and has mentored countless technology companies as a Stanford professor.

Hanrahan is the most recent winner of the Turing Award, along with his longtime friend and collaborator Ed Catmull, a former president at Pixar and Disney Animation Studios. The award, a Nobel Prize of sorts in computer science, was for their work in 3D computer graphics and computer-generated imagery.

He spoke Thursday at NTECH, NVIDIA’s annual internal engineering conference. The digital event was followed by a virtual chat between NVIDIA CEO Jensen Huang and Hanrahan, who taught a computer graphics course at NVIDIA’s Silicon Valley campus during its early days.

While the theme of his address was “You Can Be an Innovator,” the main takeaway is that a “curiosity about how things work” is a prerequisite.

Hanrahan said his own curiosity for art and studying how Rembrandt painted flesh tones led to a discovery. Artists of that Baroque period, he said, applied a technique in oil painting with layers, called impasto, for depth of skin tone. This led to his own deeper study of light’s interaction with translucent surfaces.

“Artists, they sort of instinctively figured it out,” he said. “They don’t know about the physics of light transport. Inspired by this whole idea of Rembrandt’s, I came up with a mathematical model.”

Hanrahan said innovative people need to be instinctively curious. He tested that out himself when interviewing job candidates in the early days of Pixar. “I asked everybody that I wanted to hire into the engineering team, ‘How does a toilet work?’ To be honest, most people did not know how their toilet worked,” he said, “and these were engineers.”

At the age of seven, he’d already lifted the back cover of the toilet to find out what made it work.

Hanrahan worked with Steve Jobs at Pixar. Jobs’s curiosity and excitement about touch-capacitive sensors — technology that dated back to the 1970s — would eventually lead to the touch interface of the iPhone, he said.

After the talk, Huang joined the video feed from his increasingly familiar kitchen at home and interviewed Hanrahan. The wide-ranging conversation was like a time machine, with questions and reminiscences looking back 20 years and discussions peering forward to the next 20.

The post Whether It’s Rembrandt or Toilets, ‘Curiosity About How Things Work’ Is Key to Innovation, CGI Legend Pat Hanrahan Says appeared first on The Official NVIDIA Blog.

Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)

Posted by Konstantinos (Gus) Katsiapis on behalf of the TFX Team


Software Engineering, as a discipline, has matured over the past 5+ decades. The modern world heavily depends on it, so the increased maturity of Software Engineering was an eventuality. Practices like testing and reliable technologies help make Software Engineering reliable enough to build industries upon. Meanwhile, Machine Learning (ML) has also grown over the past 2+ decades. ML is used more and more for research, experimentation and production workloads. ML now commonly powers widely-used products integral to our lives.

But ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering?

In this article we will give a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering. We argue that in order to unlock the gains ML can bring, organizations should advance the maturity of their ML teams by investing in robust ML infrastructure and promoting ML Engineering education. We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations. In closing, we will also share a glimpse into the future of TFX.

Where We Are Coming From

Applied ML has been an integral part of Google products and services over the last decade, and is becoming more so over time. We discovered early on from our endeavors to apply ML in production that while ML algorithms are important, they are usually insufficient in realizing the successful application of ML in a product. In particular, E2E ML platforms, which help with all aspects of the ML lifecycle, are usually needed to both accelerate ML adoption and make its use durable and sustainable.

Sibyl (2007 – 2020)

E2E ML platforms are not a new thing at Google. Sibyl, founded in 2007, was a platform that enabled massive-scale ML, catered to production use. Sibyl offered a decent amount of modeling flexibility on top of “wide” models (linear, logistic, Poisson regression and later factorization machines) coupled with non-linear transformations and customizable loss functions and regularization. Importantly, Sibyl also offered tools for several aspects of the ML workflow including Data Ingestion, Data Analysis and Validation, Training (of course), Model Analysis, and Training-Serving Skew Detection. All these were packaged as a single integrated product that allowed for iterative experimentation. This holistic product offering, coupled with the Sibyl team’s user focus, made Sibyl, at one time, one of the most widely used E2E ML platforms at Google. Sibyl has since been decommissioned. It was in production for ~14 years, and the vast majority of its workloads migrated to TFX.

TFX (2017 – ?)

While several of us were still working on Sibyl, a notable revolution was happening in the field of ML algorithms with the popularization of Deep Learning (DL). In 2015, Google publicly released TensorFlow (which was itself a successor to a previous system called DistBelief). Since its inception, TensorFlow supported a variety of applications with a focus on DL training and inference. Its flexible programming model allowed it to be used for a lot more than DL and its popularity in both research and production positioned it as the lingua franca for authoring ML algorithms. While TensorFlow offered flexibility, it lacked a complete end-to-end production system. On the other hand, Sibyl had robust end-to-end capabilities, but lacked flexibility. It became apparent that we needed an E2E ML platform for TensorFlow in order to accelerate ML at Google; in 2017, nearly a decade after the birth of Sibyl, we launched TFX within Google. TFX is now the most widely used, general purpose E2E ML platform at Alphabet, including Google.

In the 3 years since its launch, TFX has enabled Alphabet to realize what might be described as “industrial-scale” ML: TFX is used by thousands of users within Alphabet, and it powers hundreds of popular Alphabet products, including Cloud AI services on Google Cloud Platform (GCP). On any given day there are thousands of TFX pipelines running, which are processing exabytes of data and producing tens of thousands of models, which in turn are performing hundreds of millions of inferences per second. TFX’s widespread adoption helps Alphabet realize the flow of research into production and enables very diverse use cases for both direct and indirect TFX users. This widespread adoption also enables teams to focus on model development rather than ML platform development, allowing ML to be more easily used in novel product areas, and creating a virtuous cycle of ML platform evolution from ML applications.

Based on our internal success, and the expectation that equivalents of ML engineering will be needed by organizations and individuals everywhere in the world, we decided to publicly describe the design and initial deployment of TFX within Google and to, step by step, make more of our learnings and our technology publicly available (including open source), while we continue building more of each. We were able to accomplish this in part because, like Sibyl, TFX built upon robust infrastructural dependencies. For example, Sibyl made heavy use of MapReduce and its successor Flume for its distributed data processing, and now TFX heavily uses their portable successor, Apache Beam, for the same.

Following in TensorFlow’s footsteps, the public TFX offering was released in early 2019 and widely adopted in under a year across environments including on-premises and GCP with Cloud AI Platform Pipelines. Some of our partners have also publicly shared their use cases powered by TFX, including how it radically improved their applied ML velocity.

Lessons From Our 10+ Year Journey Of ML Platform Evolution

Though the journey of ML platform evolution at Google has been a long and exciting one, we expect that the majority of the excitement is yet to come! To that end, we want to share a summary of our learnings, some of which were more painfully gained than others. The learnings fall into two categories: what remained the same through the evolution, and what changed and why. We present them in the context of two successive platforms, Sibyl and TFX, though we believe them to be widely applicable.

What Remains The Same And Why

The areas discussed in this section capture a few examples of things that seem enduring and pass the test of time. As such, we expect these to also remain applicable in the future, across different incarnations of ML platforms and frameworks. We look at these from both an applied ML perspective and an infrastructure perspective.

Applied ML

The Rules Of Machine Learning

Successfully applying ML to a product is very much a discipline. It involves a steep learning curve and necessitates some mental model shifts (or perhaps augmentations). To make this challenging task easier, we have publicly shared The Rules of Machine Learning. These are rules that represent learnings from iteratively applying ML to a lot of products at Google. Notably, the adoption of ML in Google products illustrates a common evolution:

  • Start with simple rules and heuristics, and generate data to learn from; this journey usually starts from the serving side.
  • Move to simple ML (i.e., simple models) and realize large gains; this is usually the entry point for introduction of ML pipelines.
  • Move to ML with more features and more advanced models to realize decent gains.
  • Move to state-of-the-art ML, manage refinement and complexity (for solutions to the problems that are worth it), and realize small gains.
  • Apply the above launch-and-iterate cycle to more aspects of products and to solve more problems, bearing in mind return on investment (and diminishing returns).

We have found The Rules of Machine Learning to be steadfast across platforms and time and we hope they end up being as valuable to others as they have been to us and our users. In particular, we believe that following the rules will help others be better at the discipline of ML engineering, including helping them avoid the mistakes that we and our users have made in the past. TFX is an attempt to codify these rules, quite literally, in code. We hope to benefit ourselves but also accelerate ML, done well, for the entire industry.

The Discipline Of ML Engineering

In developing The Rules of Machine Learning, we realized that the discipline for building robust systems where the core logic is produced by complex processes involving both code and data requires additional scrutiny beyond that which software engineering provides. As such, we define ML Engineering as a superset of the discipline of software engineering designed to handle the unique complexities of the practical application of ML.

Attempting to summarize the totality of the discipline of ML engineering would be somewhat difficult, if not impossible, especially given how our understanding of it is still limited, and the discipline itself continues to evolve. We do take solace in the following though:

  • The limited understanding we do have seems to be enduring across platforms and time.
  • Analogy can be a powerful tool, so several aspects of the better understood discipline of software engineering have helped us draw parallels of how ML engineering could evolve from ML programming, much like how software engineering evolved from programming.

An early realization we had was the following: artifacts are first class citizens in ML, on par with the processes that produce and consume them.

This realization affected the implementation and evolution of Sibyl; it was entrenched in TFX by the time we publicly wrote about it and was ultimately generalized and formalized in ML Metadata, now powering TFX.

Below we present fundamental elements of ML engineering, some examples of ML artifacts and their first class citizenship, and make an attempt to draw analogies with software engineering where possible.

Data

Similarly to how code is at the heart of software, data is at the heart of ML. Data management poses serious challenges in production ML. Perhaps the simplest analogy is to think about what constitutes a unit test for data. Unit tests verify expectations of how code should behave by testing the contracts of the pertinent code and instilling trust in those contracts. Similarly, setting explicit expectations on the form of the data (including its schema, invariants and value distributions) and checking that the data agrees with the implicit expectations embedded in the training code can, especially in combination, make the data trustworthy enough to train models with. There are differences, though. Whereas unit tests can be exhaustive and verify strong contracts, data contracts are in general much weaker, even if they are necessary. And whereas unit tests can be exhaustively read and verified by humans, data is usually meaningful to humans only in summarized form.

Just as code repositories and version control are pillars for managing code evolution in software engineering, systems for managing data evolution and understanding are pillars of ML engineering.
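To make the "unit test for data" analogy concrete, here is a minimal sketch in plain Python. It is not the API of TFX or TensorFlow Data Validation; the names `FeatureSpec` and `validate`, and the particular checks chosen, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class FeatureSpec:
    """Explicit expectations for one feature (illustrative fields only)."""
    dtype: type
    value_range: Optional[Tuple[float, float]] = None  # inclusive bounds
    max_missing_fraction: float = 0.0


def validate(examples: List[Dict[str, Any]],
             schema: Dict[str, FeatureSpec]) -> List[str]:
    """Return human-readable anomalies; an empty list means the data passed."""
    anomalies = []
    for name, spec in schema.items():
        values = [ex.get(name) for ex in examples]
        missing = sum(v is None for v in values)
        if examples and missing / len(examples) > spec.max_missing_fraction:
            anomalies.append(f"{name}: {missing}/{len(examples)} values missing")
        present = [v for v in values if v is not None]
        if any(not isinstance(v, spec.dtype) for v in present):
            anomalies.append(f"{name}: unexpected type, wanted {spec.dtype.__name__}")
            continue  # range checks are meaningless on wrongly typed values
        if spec.value_range is not None:
            lo, hi = spec.value_range
            if any(not lo <= v <= hi for v in present):
                anomalies.append(f"{name}: value outside [{lo}, {hi}]")
    # Features that appear in the data but not in the schema are anomalies too.
    extra = sorted({key for ex in examples for key in ex} - set(schema))
    anomalies.extend(f"{name}: feature not in schema" for name in extra)
    return anomalies


schema = {
    "age": FeatureSpec(int, value_range=(0, 130)),
    "country": FeatureSpec(str, max_missing_fraction=0.5),
}
good = [{"age": 31, "country": "GR"}, {"age": 45, "country": None}]
bad = [{"age": -4, "country": "GR"}, {"age": 200}]
print(validate(good, schema))
print(validate(bad, schema))
```

Note that the result is a summary of anomalies rather than a per-row report, which mirrors the point that data is usually digestible by humans only in summarized form.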

TFX’s ExampleGen, StatisticsGen, SchemaGen and ExampleValidator components help one treat data as a first class citizen, by enabling data management, analysis and validation in (continuous) ML pipelines.

Models

Similarly to how a software engineer produces code that is compiled into programs, an ML engineer produces data and code which is “compiled” into ML programs, more commonly known as models. These two kinds of programs are however very different in nature. Though programs that come out of software usually have strong contracts, models have much weaker contracts. These weak contracts are usually statistical in nature and as such only verifiable in some summarized form (such as a model having sufficient accuracy on a subset of labeled data). This is not at all surprising since models are the product of code and data, and the latter itself doesn’t have strong contracts and is also only digestible in summarized form.

Just as code and data evolve over time, models also evolve over time. However, model evolution is more complicated than the evolution of its constituent code and data. For example, high test coverage (with fuzzing) can give good confidence in both the correctness and the correct evolution of a piece of code, but out-of-distribution and counterfactual yet realistic data for model evaluation can be notoriously difficult to produce.

In the same way that putting together multiple programs in a system necessitates integration testing which is a pillar of software engineering, putting together code and data necessitates end-to-end model validation and understanding which is a pillar of ML engineering.

TFX’s Evaluator and InfraValidator components provide validation and understanding of models, treating them as first class citizens of ML engineering.
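The weak, statistical contract of a model can be sketched as a check on summarized behavior over labeled data. The snippet below is an illustration under stated assumptions, not the Evaluator's implementation: the function names, the accuracy metric, and both thresholds are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]   # (features, label)
Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, labeled: List[Example]) -> float:
    """Fraction of labeled examples the model predicts correctly."""
    return sum(model(x) == y for x, y in labeled) / len(labeled)


def passes_validation(candidate: Model, baseline: Model, labeled: List[Example],
                      min_accuracy: float = 0.8,
                      max_regression: float = 0.01) -> bool:
    """A statistical contract: good enough in absolute terms, and not
    meaningfully worse than the model currently being served."""
    cand = accuracy(candidate, labeled)
    base = accuracy(baseline, labeled)
    return cand >= min_accuracy and cand >= base - max_regression


labeled = [([0.1], 0), ([0.4], 0), ([0.55], 1), ([0.6], 1), ([0.9], 1)]
baseline = lambda x: int(x[0] > 0.7)    # toy threshold "models"
candidate = lambda x: int(x[0] > 0.5)
print(passes_validation(candidate, baseline, labeled))
```

Unlike a unit test, this check says nothing about any individual prediction; it only holds for the labeled subset it was computed on, which is why evaluation data that covers out-of-distribution cases matters so much.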

Mergeable Fragments

Similarly to how a software engineer merges pre-existing libraries (or systems) with their own code in order to build useful programs, an ML engineer regularly merges code fragments, data fragments, analysis fragments and model fragments in order to build useful ML pipelines. A notable difference between software engineering and ML engineering is that, even when the code of an ML pipeline is fixed, its data is usually volatile (e.g. new data arrives on a regular basis), so the downstream artifacts need to be produced frequently and efficiently. For example, a new version of a model usually needs to be produced if any part of its input data has changed. As such, it is important for ML pipelines to produce artifacts that are mergeable. For example, a summary of statistics from one dataset should be easily mergeable with that of another dataset such that it is easy to summarize the statistics of the union of the two datasets. Similarly, it should be easy to transfer the learnings of one model to another model in general, and the learnings of a previous version of a model to the next version of the same model in particular.
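As a small illustration of a mergeable artifact, the sketch below keeps a summary of count, mean and variance that can be combined with another summary without revisiting the raw data. It uses the standard pairwise update for mean and sum of squared deviations; the `Summary` class is a hypothetical stand-in, not a TFX data structure.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Summary:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @staticmethod
    def of(values: Iterable[float]) -> "Summary":
        """Summarize a dataset by merging one single-value summary at a time."""
        summary = Summary()
        for value in values:
            summary = summary.merge(Summary(1, float(value), 0.0))
        return summary

    def merge(self, other: "Summary") -> "Summary":
        """Summary of the union of two datasets, computed from summaries only."""
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of everything summarized so far."""
        return self.m2 / self.count if self.count else 0.0


day1 = [1.0, 2.0, 3.0, 4.0]
day2 = [10.0, 20.0]
merged = Summary.of(day1).merge(Summary.of(day2))
print(merged.count, merged.mean, merged.variance)
```

When a new day of data arrives, only that day needs to be summarized; the cached summaries of earlier days are merged in, which is the kind of saving that makes frequent recomputation of downstream artifacts affordable.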

There is however a catch, which relates to the previous discussion regarding the equivalents of test coverage for models. Merging new fragments into a model could necessitate creation of novel out-of-distribution and counterfactual evaluation data, contributing to the difficulty of (efficient) model evolution, thus rendering it a lot harder than pure code evolution.

TFX’s ExampleGen, Transform, Trainer and Tuner components, together with TensorFlow Hub, help one treat artifacts as first class citizens by enabling production and consumption of mergeable fragments in workflows that perform data caching, analyzer caching, warmstarting and transfer learning.

Artifact Lineage

Despite all the advanced methodology and tooling that exists for software engineering, the programs and systems that are built invariably need to be debugged. The same holds for ML programs, but debugging them is notoriously harder because non-proximal effects are a lot more prevalent for ML programs due to the plethora of artifacts involved. A model might be inaccurate due to bad artifacts from several sources of error, including flaws in the code, the learning algorithm, the training data, the serving path, or the serving data, to name a few. Much like how stack traces are invaluable for identifying root causes of defects in software programs, the lineage of all artifacts produced and consumed by an ML pipeline is invaluable for identifying root causes of defects in ML models. Additionally, by knowing which downstream artifacts were produced from a problematic artifact, we can identify all impacted systems and users and take mitigating actions.
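The two lineage questions raised above (what produced this artifact, and what was produced from it) can be sketched with a small in-memory graph. This is an illustrative assumption, not the ML Metadata API; the `LineageGraph` class and the artifact names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


class LineageGraph:
    """Records which artifacts each execution consumed and produced."""

    def __init__(self) -> None:
        self._derived: Dict[str, Set[str]] = defaultdict(set)  # input -> outputs
        self._sources: Dict[str, Set[str]] = defaultdict(set)  # output -> inputs

    def record(self, inputs: List[str], outputs: List[str]) -> None:
        for out in outputs:
            for inp in inputs:
                self._derived[inp].add(out)
                self._sources[out].add(inp)

    @staticmethod
    def _walk(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
        """Breadth-first traversal collecting everything reachable from start."""
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            for nxt in edges.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, artifact: str) -> Set[str]:
        """Everything impacted if this artifact turns out to be bad."""
        return self._walk(artifact, self._derived)

    def upstream(self, artifact: str) -> Set[str]:
        """Everything that could be a root cause of a defect in this artifact."""
        return self._walk(artifact, self._sources)


graph = LineageGraph()
graph.record(["examples/day1"], ["stats/day1"])
graph.record(["examples/day1", "trainer_code@v3"], ["model/7"])
graph.record(["model/7"], ["evaluation/7"])
graph.record(["examples/day2", "trainer_code@v3"], ["model/8"])
print(sorted(graph.downstream("examples/day1")))
print(sorted(graph.upstream("evaluation/7")))
```

The upstream query plays the role of the stack trace in the analogy, while the downstream query supports mitigation by listing every artifact derived from a problematic one.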

TFX’s use of ML Metadata (MLMD) helps treat artifacts as first class citizens. MLMD enables advanced cataloging and querying of metadata and lineage associated with artifacts which can together increase the confidence of sharing artifacts even outside the boundaries of a pipeline. MLMD also helps with advanced debugging and, when coupled with the underlying data storage layer, forms the foundation of TFX’s ML compliance mechanisms.

Continuous Learning And Unlearning

ML production pipelines operate in a dynamic environment:

  • New data can arrive continuously.
  • The modeling code can change, particularly in the early stages of model development.
  • The surrounding infrastructure can change, e.g., a new version of some underlying (ML) library.

When changes happen, a pipeline needs to react, often by rerunning its steps in the new environment. This dynamicity increases the importance of provenance tracking in order to facilitate debugging and root-cause analysis. As a simple example, to debug a model failure, it is necessary to know not only which data was used to train the model, but also the versions of the modeling code and any surrounding infrastructure.

ML pipelines must also support low-friction mechanisms to handle these changes. Consider for example the arrival of new data, which necessitates retraining the model. This is a natural requirement in rapidly changing environments, like recommender systems or adversarial systems. Requiring the user to manually retrain the model can be unrealistic, given that the data can arrive at a regular and frequent rate. Instead, we can employ automation by way of “continuous training”, where the pipeline detects the presence of new data and automatically schedules the generation of updated models. In turn, this functionality requires automatically: orchestrating work based on the presence of artifacts (including data), recovering from intermittent failures, and catching up to real-time when recovering. It is common for ML pipelines to run for years ingesting code and data, continuously producing models that make predictions that inform decisions.
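A minimal sketch of the continuous-training idea follows, assuming data arrives in named chunks (called "spans" here) whose names sort chronologically, and that each model is trained on a rolling window of the latest spans. The `ContinuousTrainer` class and its polling interface are hypothetical; a real orchestrator would be considerably more involved.

```python
from typing import Callable, Iterable, List


class ContinuousTrainer:
    """Hypothetical trigger: trains whenever previously unseen spans appear."""

    def __init__(self, train: Callable[[List[str]], str], window: int = 3):
        self._train = train          # user-supplied training function (assumed)
        self._window = window        # how many of the latest spans a model sees
        self._seen: List[str] = []   # spans already trained on, oldest first
        self.models: List[str] = []  # produced model artifacts, oldest first

    def poll(self, available_spans: Iterable[str]) -> int:
        """Train once per unseen span, oldest first; return how many ran."""
        new = sorted(set(available_spans) - set(self._seen))
        for span in new:
            inputs = (self._seen + [span])[-self._window:]
            model = self._train(inputs)  # if this raises, the span stays unseen
            self._seen.append(span)      # and is retried on the next poll
            self.models.append(model)
        return len(new)


trainer = ContinuousTrainer(lambda spans: "model(" + ",".join(spans) + ")", window=2)
first = trainer.poll(["2020-01-01"])
idle = trainer.poll(["2020-01-01"])  # nothing new, so nothing runs
caught_up = trainer.poll(["2020-01-01", "2020-01-02", "2020-01-03"])
print(first, idle, caught_up, trainer.models[-1])
```

The last call shows the "catching up" behavior: two spans arrived while the trainer was not polling, and it produces one model per missed span rather than skipping to the latest.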

Another example of a low-friction mechanism is support for “backfilling” an ML pipeline. In this case, the user might need to rerun the pipeline on existing artifacts but using updated versions of the components, such as rerunning the trainer on existing data using a new version of the modeling code/library. Another use of backfilling is rerunning the pipeline with new versions of existing data, say, to fix an error in the data. These backfills are orthogonal to continuous training and can be used together. For instance, the user can manually trigger a rerun of the trainer, and the generated model artifact can then automatically trigger model evaluation and validation.
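One way to picture backfilling is through a cache keyed on the component version together with its input artifacts: unchanged steps are reused, while a new component version or corrected data forces a rerun. The sketch below is an assumption for illustration (the `CachingRunner` class is hypothetical) and does not describe how TFX implements backfills.

```python
import hashlib
from typing import Callable, Dict, Sequence


class CachingRunner:
    """Reruns a step only when its name, version or inputs have changed."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}
        self.executions = 0  # how many times a step actually ran

    def run(self, name: str, version: str, inputs: Sequence[str],
            fn: Callable[[Sequence[str]], str]) -> str:
        key = hashlib.sha256("|".join([name, version, *inputs]).encode()).hexdigest()
        if key not in self._cache:
            self.executions += 1
            self._cache[key] = fn(inputs)
        return self._cache[key]


runner = CachingRunner()
data = ["examples/day1", "examples/day2"]
fixed = ["examples/day1", "examples/day2-fixed"]

m1 = runner.run("trainer", "v1", data, lambda xs: f"model-v1[{len(xs)} spans]")
again = runner.run("trainer", "v1", data, lambda xs: f"model-v1[{len(xs)} spans]")
# Backfill 1: same data, new version of the modeling code.
m2 = runner.run("trainer", "v2", data, lambda xs: f"model-v2[{len(xs)} spans]")
# Backfill 2: same code, corrected version of existing data.
m3 = runner.run("trainer", "v2", fixed, lambda xs: f"model-v2[{','.join(xs)}]")
print(runner.executions)
```

Because each rerun produces a new artifact, any downstream step keyed on that artifact (evaluation, validation) would be triggered in the same way.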

TFX was built from the ground up in a way that enables continuous learning (and unlearning) which fundamentally shaped its design. At the same time, these advanced capabilities also allow it to be used in a “one-shot”, discontinuous, fashion. In fact, within Alphabet, both modes of deployment are widely used. Moreover, TFX also supports different types of backfill operations to enable fine-grained interventions during normal pipeline execution.

Even though the public TFX offering doesn’t yet offer continuous ML pipelines, we are actively working on making our existing technology portable so that it can be made publicly available (e.g., RFC).

Infrastructure

Building On The Shoulders Of Giants

Realizing ambitious goals necessitates building on top of solid foundations, collaborating with others and leveraging each other’s work. TFX reuses many of Sibyl’s system designs, hardened over a decade of Sibyl’s production ML experience. Additionally, TFX incorporates new technologies in areas where robust standards emerged:

  • Similarly to how Sibyl built its algorithms and workflows on top of MapReduce, TFX leverages both TensorFlow and Apache Beam for its distributed training and data processing workflows.
  • Similarly to how Sibyl was columnar, TFX adopted Apache Arrow as the columnar in-memory representation for its compute-intensive libraries.

Taking dependencies where robust standards have emerged has allowed TFX and its users to achieve seamless performance and scalability. It also enables TFX to focus its energy on building the deltas of what is needed for applied ML, as opposed to re-implementing difficult-to-get-right technology. Some of our dependencies, like Kubeflow Pipelines or Apache Airflow, are selected by TFX’s users themselves when the value / features they get from them outweigh the costs that the additional dependencies entail.

Taking dependencies unfortunately incurs costs. We have found that taking dependencies requires effort that is super-linear in the number of dependencies. These costs are often absorbed by us and our sister teams but can (and sometimes do) leak to our users, usually in the form of conflicting (version) dependencies or incompatibilities between environments and dependencies.

Interoperability And Positive Externalities

ML platforms do not operate in a vacuum. They instead operate within the context of a bigger system or infrastructure, connecting to data producing sources upstream and model consuming sinks downstream, which in turn frequently produce the data that feeds the ML platform, thereby closing the loop. Strong adoption of a platform usually necessitates interoperability with other important technologies in its environment.

  • Similarly to how Sibyl interoperated with Google’s Ads technology stack for data ingestion and model serving, TFX offers a plethora of connectors for data ingestion and allows serving the produced model in multiple deployment environments and devices.
  • Similarly to how Sibyl interoperated with Google’s compute stack, TFX leverages Apache Beam to execute on Apache Flink and Apache Spark clusters as well as serverless offerings like Google Cloud Dataflow.
  • TFX built an orchestration abstraction on top of MLMD and provides orchestration options on top of Apache Airflow, Apache Beam, Kubeflow Pipelines as well as the primitives to integrate with one’s custom orchestrator. MLMD itself works with several relational databases like SQLite and MySQL.

Interoperability necessitates some amount of abstraction and standardization and usually enables sum-greater-than-its-parts effects. TFX is both a beneficiary and a benefactor of the positive externalities created by said interoperability, both within and outside of Alphabet. TFX’s users are also beneficiaries of the interoperability as they can more easily deploy and use TFX on top of their existing installed base.

Interoperability also comes with costs. The combination of multiple technology stacks can lead to an exponential number of distinct deployment configurations. While we test some of the distinct deployment configurations end-to-end and at scale (for example, TFX on GCP), we have neither the expertise nor the resources to do so for the combinatorial explosion of all possible deployment options. We thus encourage the community to work with us on the deployment configurations that are most useful for them.
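The growth in deployment configurations is easy to see with a little arithmetic. The sketch below counts combinations using only the options named in this section; treating every combination as a valid deployment, and adding hardware as a further dimension, are simplifying assumptions for illustration.

```python
from itertools import product

options = {
    "orchestrator": ["Apache Airflow", "Apache Beam", "Kubeflow Pipelines"],
    "beam_runner": ["Apache Flink", "Apache Spark", "Google Cloud Dataflow"],
    "mlmd_backend": ["SQLite", "MySQL"],
}
configurations = list(product(*options.values()))
print(len(configurations))  # 3 * 3 * 2

# Each additional dimension multiplies the total rather than adding to it.
options["hardware"] = ["CPU", "GPU", "TPU"]
with_hardware = list(product(*options.values()))
print(len(with_hardware))   # 3 * 3 * 2 * 3
```

With d independent dimensions of k choices each, the count is k to the power d, which is why exhaustive end-to-end testing of every configuration quickly becomes impractical.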

What Is Different And Why

The areas discussed in this section capture a few examples of things that needed to change in order for our ML platform to adapt to a new reality and as such remain useful and impactful.

Environment And Device Portability

Sibyl was a massive scale ML platform designed to be deployed on Google’s large-scale cluster, namely Borg. This made sense as applied ML at Google was, originally, primarily used in products that were widely used. As ML expertise grew across the world, and ML could be applied to more use cases (large and small) across environments both within and outside of Google, the need for portability gradually but surely became a hard constraint.

  • While Sibyl ran only on Google’s datacenters, TFX runs on laptops, workstations, servers, datacenters, and public Clouds. In particular, when TFX runs on Google’s Cloud, it leverages automation and optimizations offered by GCP Services, enabled by Google’s unique infrastructure.
  • While Sibyl ran only on CPUs, TFX leverages TensorFlow to run on different kinds of hardware including CPUs, GPUs and Google’s TPUs.
  • While Sibyl’s models ran on servers, TFX leverages TensorFlow to produce models that run on laptops, workstations, and servers via TensorFlow Serving and Apache Beam, on mobile and IoT devices via TensorFlow Lite, and in browsers via TensorFlow.js.

TFX’s portability enabled it to be used in a very diverse set of environments and devices, in order to solve problems from small scale to massive scale.

Unfortunately, portability comes with costs. We have found that maintaining a portable core with environment-specific and device-specific specialization requires effort that is super-linear in the number of environments / devices. These costs are, however, largely absorbed by us and our sister teams and as such are frequently not visible to our users.

Modularity And Layering

Even though Sibyl’s offering as an integrated product was immensely valuable, its structure and interface were somewhat monolithic, limiting it to a specific set of “direct” users who would have to adopt it wholesale. In contrast, TFX evolved to be a modular and layered architecture, and became more so over time as partnerships with other teams and products grew. Notable layers (with examples) in TFX include:

[Table with columns "Layer" and "Examples". Only two layer entries survive: "ML Services" and "(of composable Components)".]


TFX’s layered architecture enables it to be used by a very diverse set of users whether that’s piecemeal via its libraries, wholesale via its pipelines (with or without the pertinent services), or in a fashion that’s completely oblivious to the end users (e.g. by them using ML services which TFX powers under the hood)!

Unfortunately, layering comes with costs. We have found that maintaining multiple publicly accessible layers of our product requires effort that is roughly linear in the number of layers. These costs occasionally leak to our users in the form of confusion about which layer makes the most sense for them to use.

Multi-faceted Flexibility

Even though Sibyl was more flexible in terms of modeling capabilities compared to available alternatives at the time, aspects of its flexibility across several parts of the ML workflow fell short of Google’s needs for accelerating ML for novel use cases, which led to the development of TFX.

  • While Sibyl only offered specific kinds of data analysis, TFX’s StatisticsGen component offers more built-in capabilities and the ability to realize custom analyses, via TensorFlow Data Validation.
  • While Sibyl only offered transformations that were pure composable mappers, TFX’s Transform component offers more mappers, custom mappers, more analyzers, custom analyzers, as well as arbitrarily composed (custom) mappers and (custom) analyzers, via TensorFlow Transform.
  • While Sibyl only offered “wide” models, TFX’s Trainer component offers any model that can be realized on top of TensorFlow, including models that can be shared and can transfer-learn, via TensorFlow Hub.
  • While Sibyl only offered automatic feature crossing (a.k.a. feature conjunctions) on top of “wide” models, TFX’s Tuner component allows for arbitrary hyperparameter optimization based on state-of-the-art algorithms.
  • While Sibyl only offered specific kinds of model analysis, TFX’s Evaluator component offers more built-in metrics, custom metrics, confidence intervals and fairness indicators, via TensorFlow Model Analysis.
  • While Sibyl’s pipeline topology was fixed (albeit somewhat customizable), TFX’s SDK allows one to create custom (optionally containerized) components and use them together with standard components in a flexible and fully customizable pipeline topology.
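The distinction between mappers and analyzers in the Transform bullet above can be sketched in plain Python, assuming that an analyzer is a full pass over the dataset that reduces it to a few constants and a mapper is a per-example function that may use those constants. The function names are hypothetical and this is not the TensorFlow Transform API.

```python
import math
from typing import Callable, Dict, List

Example = Dict[str, float]


def analyze_mean_std(examples: List[Example], feature: str) -> Dict[str, float]:
    """Analyzer: one full pass over the dataset, reduced to two constants."""
    values = [ex[feature] for ex in examples]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {"mean": mean, "std": std}


def make_z_score_mapper(feature: str,
                        constants: Dict[str, float]) -> Callable[[Example], Example]:
    """Mapper: a per-example function that closes over the analyzer output."""
    mean = constants["mean"]
    std = constants["std"] or 1.0  # guard against a constant feature

    def mapper(example: Example) -> Example:
        return {**example, feature: (example[feature] - mean) / std}

    return mapper


training = [{"x": 2.0}, {"x": 4.0}, {"x": 6.0}]
constants = analyze_mean_std(training, "x")
to_z = make_z_score_mapper("x", constants)

transformed = [to_z(ex) for ex in training]  # applied at training time
served = to_z({"x": 4.0})                    # the same mapper at serving time
print(constants, served)
```

Composing the two is what allows the constants computed during training to be reused unchanged at serving time, so that both paths apply the same transformation.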

The increase of flexibility in all these dimensions enabled improved experimentation, wider reach, more use cases, as well as accelerated flow from research to production.

Flexibility does not come without costs. A more flexible system is harder to get right in the first place, as well as harder for us, as developers of the ML platform, to maintain and evolve. Users may also have to manage increased complexity as they take advantage of this flexibility. Furthermore, we might not be able to offer as strong a support story on top of an ML platform that is Turing complete.

Where We Are Going

Armed with the knowledge of the past, we present a glimpse of what we plan for the future of TFX, as of 2020. We will continue our work on enabling ML Engineering in order to democratize applied ML, and help everyone practice responsible AI and apply it in a fashion that upholds Google’s AI Principles.

Drive Interoperability And Standards

In order to meet the demand for the burgeoning variety of ML solutions, we will continue to increase our technology’s interoperability. Our work on interoperability and standards as well as open-sourcing more of our technology, reflects our principle to “be socially beneficial” as well as to “be made available for uses that accord with these principles” by making it easier for everyone to follow these practices. As part of this mission, we will empower the industry to build advanced ML systems by open-sourcing more of our technology, and by standardizing ML artifacts and metadata. Some select examples of this work include:

  • TFX Standardized Inputs.
  • Advanced TFX DSL semantics, Data Model and IR.
  • Standardization of ML artifacts and metadata.
  • Standardization of distributed workloads on heterogeneous runtime environments.
  • Inference on distributed and streaming models.
  • Improvements to interoperability with mobile and edge ML deployments.
  • Improvements for ML framework interoperability and artifact sharing.

Increase Automation

Automation is the backbone of reliable production systems, and TFX is heavily invested in improving and expanding its use of automation. Our work in increased automation reflects our principles of helping make ML deployments “be built and tested for safety” and “avoid creating or reinforcing unfair bias”. Some upcoming efforts include a TFX Pipeline testing framework, automated model improvement in the TFX Tuner, auto-detecting surprising model behavior on multidimensional slices, facilitating automatic production of Model Cards and improving our training-serving skew detection capabilities. TFX on GCP will also continue driving requirements for new (and will better make use of existing) advanced automation features of pertinent services.
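To illustrate what training-serving skew detection involves, the sketch below compares the distribution of a categorical feature in training data against serving data, using the largest per-value difference in frequency (the L-infinity distance). The function names and the threshold are assumptions for illustration, not a description of TFX's detector.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution(values: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each distinct value (input must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in frequency over all observed values."""
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def has_skew(training: Iterable[str], serving: Iterable[str],
             threshold: float = 0.1) -> bool:
    """Flag skew when any value's frequency moved by more than the threshold."""
    return l_infinity_distance(distribution(training), distribution(serving)) > threshold


training = ["mobile"] * 50 + ["desktop"] * 50
serving_ok = ["mobile"] * 55 + ["desktop"] * 45
serving_bad = ["mobile"] * 80 + ["desktop"] * 15 + ["tablet"] * 5
print(has_skew(training, serving_ok), has_skew(training, serving_bad))
```

A check like this can run automatically on every new batch of serving logs, turning a class of silent model-quality problems into an explicit, actionable signal.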

Improve ML Understanding

ML understanding is an important aspect of deploying production ML, and TFX is well positioned to provide significant gains in this field. Our work on improving ML understanding reflects our principles to help “avoid creating or reinforcing unfair bias” and help make ML deployments “be accountable to people”. Critical to understanding is to be able to track the lineage of artifacts used to produce a model, an area TFX will continue to invest in. Improvements to TFX technologies like struct2tensor will further enable training, serving, and analyzing models on structured data, thus allowing reasoning about models closer to the original data semantics. We also plan to utilize TFX as a vehicle to expand support for fairness evaluation, remediation, and documentation.

Uphold High Standards And Best Practices

As a vehicle for amplification of ML technology, TFX must continue to “uphold high standards of scientific excellence” and promote best practices. The team will continue publishing scientific papers and conducting public outreach via our existing channels, as well as offer educational courses in partnership with established institutions. We will also improve trust in our model analysis tools using integrated uncertainty measures by, for example, enabling scalable computation of confidence intervals for model metrics, and we will improve our training-serving skew detection capabilities. It’s also critical for research and production to be able to have reproducible ML artifacts, enabled by our work in precise provenance tracking for auditing and reproducing models. Also key is reproducibility of measurements, driven by efforts like NitroML, which will provide tooling for benchmarking AutoML pipelines.
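As an illustration of attaching uncertainty to a model metric, the sketch below computes a percentile bootstrap confidence interval for accuracy. It is a single-machine toy under stated assumptions (the function name, the resample count and the fixed seed are all hypothetical) and says nothing about how a scalable implementation would be built.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[bool], n_resamples: int = 1000,
                          confidence: float = 0.95,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for accuracy.

    `correct` holds one flag per evaluated example: was the prediction right?
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(correct)
    # Resample the evaluation set with replacement and recompute the metric.
    estimates = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return sum(correct) / n, lower, upper


outcomes = [True] * 80 + [False] * 20
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(point, lower, upper)
```

Reporting the interval alongside the point estimate makes it easier to tell whether a difference between two models is larger than the noise in the evaluation set.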

Given that several of the areas where we expand our technology are new to us, we will make an effort to distinguish the battle-tested from the experimental aspects of our technology, in order to enable our users to confidently choose the set of capabilities that meet their desires and needs.

Improve Tooling

Although TFX provides tools for aspects of ML engineering and several phases of the ML lifecycle, we believe this is still a nascent area. While improving tooling is a natural fit for TFX, it also reflects our principles of helping ML deployments “be made available for uses that accord with these principles”, “supporting scientific excellence”, and being “built and tested for safety”.

One area of improvement is applying ML to the data itself, be it through sensing anomalies or finding patterns in data or enriching data with predictions from ML models. Making it easy to enrich large volumes of data (especially critical streaming data used for low-latency, high volume actions) has always been a challenge. Bringing TFX capabilities into data processing frameworks is our first step here. We have already made it possible to enrich streaming events with labels or make predictions in Apache Beam and, by extension, Cloud Dataflow. We plan to follow this work by leveraging pre-built models (served out of Cloud AI Pipelines and TensorFlow Serving) to make adding a new field in a distributed dataset representing predictions from streams of models trivially easy.

Furthermore, while there are many tools for detecting, discovering, and auditing ML workflows, there is still a need for automated (or assisted) mitigation of discovered issues, and we will invest in this area. For example, proactively predicting which pipeline runs won’t result in better models based on the currently-executing pipeline, perhaps even before training, can significantly reduce time and resources spent on creating poor models.

A Joint Journey

Building TFX and exploring the fundamentals of ML engineering was the cumulative effort of many people over many years. As we continue to make strides and further develop this field, it’s important we recognize the collaborative effort of those who got us here.

Of course, it will take many more collaborations to drive the future of this field, and as such, we invite you to join us on this journey “Towards ML Engineering”!

The TFX Team

The TFX project is realized via collaboration of multiple organizations within Google. Different organizations usually focus on different technology and product layers, though there is a lot of overlap on the portable parts of our technology. Overall we consider ourselves a single team and below we present an alphabetically sorted list of current TFX team members who are contributors to the ideation, research, design, implementation, execution, deployment, management, and advocacy (to name a few) aspects of TFX; they continue to inspire, help, teach, and challenge each other to advance our field:

Abhijit Karmarkar, Adam Wood, Aleksandr Zaks, Alina Shinkarsky, Neoklis Polyzotis, Amy Jang, Amy McDonald Sandjideh, Amy Skerry-Ryan, Andrew Audibert, Andrew Brown, Andy Lou, Anh Tuan Nguyen, Anirudh Sriram, Anna Ukhanova, Anusha Ramesh, Archana Jain, Arun Venkatesan, Ashley Oldacre, Baishun Wu, Ben Mathes, Billy Lamberta, Chandni Shah, Chansoo Lee, Chao Xie, Charles Chen, Chi Chen, Chloe Chao, Christer Leusner, Christina Greer, Christina Sorokin, Chuan Yu Foo, CK Luk, Connie Huang, Daisy Wong, David Smalling, David Zats, Dayeong Lee, Dhruvesh Talati, Doojin Park, Elias Moradi, Emily Caveness, Eric Johnson, Evan Rosen, Florian Feldhaus, Gal Oshri, Gautam Vasudevan, Gene Huang, Goutham Bhat, Guanxin Qiao, Gus Katsiapis, Gus Martins, Haiming Bao, Huanming Fang, Hui Miao, Hyeonji Lee, Ian Nappier, Ihor Indyk, Irene Giannoumis, Jae Chung, Jan Pfeifer, Jarek Wilkiewicz, Jason Mayes, Jay Shi, Jiayi Zhao, Jingyu Shao, Jiri Simsa, Jiyong Jung, Joana Carrasqueira, Jocelyn Becker, Joe Liedtke, Jongbin Park, Jordan Grimstad, Josh Gordon, Josh Yellin, Jungshik Jang, Juram Park, Justin Hong, Karmel Allison, Kemal El Moujahid, Kenneth Yang, Khanh LeViet, Kostik Shtoyk, Lance Strait, Laurence Moroney, Li Lao, Liam Crawford, Magnus Hyttsten, Makoto Uchida, Manasi Joshi, Mani Varadarajan, Marcus Chang, Mark Daoust, Martin Wicke, Megha Malpani, Mehadi Hassen, Melissa Tang, Mia Roh, Mig Gerard, Mike Dreves, Mike Liang, Mingming Liu, Mingsheng Hong, Mitch Trott, Muyang Yu, Naveen Kumar, Ning Niu, Noah Hadfield-Menell, Noé Lutz, Nomi Felidae, Olga Wichrowska, Paige Bailey, Paul Suganthan, Pavel Dournov, Pedram Pejman, Peter Brandt, Priya Gupta, Quentin de Laroussilhe, Rachel Lim, Rajagopal Ananthanarayanan, Rene van de Veerdonk, Robert Crowe, Romina Datta, Ron Yang, Rose Liu, Ruoyu Liu, Sagi Perel, Sai Ganesh Bandiatmakuri, Sandeep Gupta, Sanjana Woonna, Sanjay Kumar Chotakur, Sarah Sirajuddin, Sheryl Luo, Shivam Jindal, Shohini Ghosh, Sina Chavoshi, Sydney Lin, Tanya Grunina, Thea Lamkin, Tianhao Qiu, Tim Davis, Tris Warkentin, Varshaa Naganathan, Vilobh Meshram, Volodya Shtenovych, Wei Wei, Wolff Dobson, Woohyun Han, Xiaodan Song, Yash Katariya, Yifan Mai, Yiming Zhang, Yuewei Na, Zhitao Li, Zhuo Peng, Zhuoshu Li, Ziqi Huang, Zoey Sun, Zohar Yahav

Thank you, all!

The TFX Team … Extended

Beyond the current TFX team members, there have been many collaborators both within and outside of Alphabet whose discussions, technology, as well as direct and indirect contributions, have materially influenced our journey. Below we present an alphabetically sorted list of these collaborators:

Abdulrahman Salem, Ahmet Altay, Ajay Gopinathan‎, Alexandre Passos, Alexey Volkov, Anand Iyer, Andrew Bernard‎, Andrew Pritchard‎, Chary Aasuri, Chenkai Kuang, Chenyu Zhao, Chiu Yuen Koo, Chris Harris, Chris Olston, Christine Robson, Clemens Mewald, Corinna Cortes, Craig Chambers, Cyril Bortolato, D. Sculley, Daniel Duckworth‎, Daniel Golovin, David Soergel, Denis Baylor, Derek Murray, Devi Krishna, Ed Chi, Fangwei Li, Farhana Bandukwala, Gal Elidan, Gary Holt, George Roumpos, Glen Anderson, Greg Steuck, Grzegorz Czajkowski, Haakan Younes, Heng-Tze Cheng, Hossein Attar, Hubert Pham, Hussein Mehanna, Irene Cai, James L. Pine, James Pine, James Wu, Jeffrey Hetherly, Jelena Pjesivac-Grbovic, Jeremiah Harmsen, Jessie Zhu, Jiaxiao Zheng, Joe Lee, Jordan Soyke, Josh Cai, Judah Jacobson, Kaan Ege Ozgun‎, Kenny Song, Kester Tong, Kevin Haas, Kevin Serafini, Kiril Gorovoy, Kostik Steuck, Kristen LeFevre, Kyle Weaver, Kym Hines, Lana Webb, Lichan Hong, Lukasz Lew, Mark Omernick, Martin Zinkevich, Matthieu Monsch, Michel Adar, Michelle Tsai‎, Mike Gunter, Ming Zhong, Mohamed Hammad, Mona Attariyan, Mustafa Ispir, Neda Mirian, Nicholas Edelman‎, Noah Fiedel, Panagiotis Voulgaris‎, Paul Yang, Peter Dolan, Pushkar Joshi‎, Rajat Monga, Raz Mathias‎, Reiner Pope, Rezsa Farahani, Robert Bradshaw, Roberto Bayardo, Rohan Khot, Salem Haykal, Sam McVeety, Sammy Leong, Samuel Ieong, Shahar Jamshy, Slaven Bilac, Sol Ma, Stan Jedrus, Steffen Rendle, Steven Hemingray‎, Steven Ross, Steven Whang, Sudip Roy, Sukriti Ramesh, Susan Shannon, Tal Shaked, Tushar Chandra, Tyler Akidau, Venkat Basker, Vic Liu, Vinu Rajashekhar, Xin Zhang, Yan Zhu‎, Yaxin Liu, Younghee Kwon, Yury Bychenkov‎, Zhenyu Tan

Thank you, all!

Active learning workflow for Amazon Comprehend custom classification models – Part 1.

The Amazon Comprehend Custom Classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use Custom Classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. Upon completing the job, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it predicts a label based on what it has learned from the training data. Sometimes your training dataset may not cover enough language patterns, or once you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can we ensure continuous model training to keep the model up to date with new data and patterns?

In this two-part blog series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. The first post describes a workflow comprising real-time classification, feedback pipelines, and human review workflows using Amazon Augmented AI (Amazon A2I). The second post covers automated model building using the human-reviewed data, selecting the best model, and automated deployment of an endpoint for the chosen model.

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn from their misclassifications and pick up the correct labels. This process of continuously teaching the models through feedback and deploying them is called active learning.

For every prediction Amazon Comprehend Custom Classification makes, it also gives a confidence score associated with its prediction. This architecture proposes that you set an acceptable threshold and only accept the predictions with a confidence score that exceeds the threshold. All the predictions that have a confidence score less than the desired threshold are flagged for human review. The human decides whether to accept the model’s prediction or correct it.
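The thresholding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code deployed by this post's CloudFormation template: the function name `route_prediction` and the returned field names are hypothetical, and the input mirrors the list of `Name` and `Score` pairs that a real-time classification returns. Treating a score exactly equal to the threshold as accepted is an assumption.

```python
from typing import Dict, List


def route_prediction(classes: List[Dict], threshold: float = 0.5) -> Dict:
    """Pick the top-scoring class and flag it for human review when the
    confidence score falls below the threshold (hypothetical helper)."""
    top = max(classes, key=lambda c: c["Score"])
    return {
        "label": top["Name"],
        "score": top["Score"],
        # Assumption: a score equal to the threshold counts as accepted.
        "needs_human_review": top["Score"] < threshold,
    }


confident = route_prediction(
    [{"Name": "Sports", "Score": 0.91}, {"Name": "World", "Score": 0.05}]
)
uncertain = route_prediction(
    [{"Name": "Business", "Score": 0.41}, {"Name": "Sci/Tech", "Score": 0.39}]
)
print(confident["needs_human_review"], uncertain["needs_human_review"])
```

A higher threshold sends more predictions to reviewers, which improves label quality at the cost of more human effort.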

In some instances, the model may be confident about its predictions, but the classification might be wrong. In these scenarios, the end-user applications that receive the model predictions can request explicit feedback from their users on the prediction quality. A human moderator reviews this explicit feedback and reclassifies instances where the feedback was negative. This process of generating human-verified data and using it for model retraining helps keep the models up to date, reduce data drift, and achieve higher model accuracy.

Feedback Workflow Architecture

In this section, we discuss an architectural pattern for implementing an end-to-end active learning workflow for custom classification models in Amazon Comprehend using Amazon A2I. The active learning workflow comprises the following components:

  1. Real-time classification
  2. Feedback loops
  3. Human classification
  4. Model building
  5. Model selection
  6. Model deployment

The following diagram illustrates this architecture covering the first three components. In the following sections, we walk you through each step in the workflow.

Architecture Diagram for Feedback Loops

Real-time classification

To use custom classification in Amazon Comprehend, you need to create a custom classification job that reads a ground truth dataset from an Amazon Simple Storage Service (Amazon S3) bucket and builds a classification model. After the model builds successfully, you can create an endpoint that allows you to make real-time classifications of unlabeled text. This stage is represented by steps 1–3 in the preceding architecture:

  1. The end-user application calls an API Gateway endpoint with a text that needs to be classified.
  2. The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
  3. The Lambda function calls the Amazon Comprehend endpoint, which returns the classification of the unlabeled text and a confidence score.
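Step 3 above might look roughly like the following sketch. It is not the Lambda code shipped with this post; `classify_text` is a hypothetical helper. It relies on the boto3 Comprehend client's `classify_document` call, which takes `Text` and `EndpointArn` and, for a multi-class classifier, returns a `Classes` list of `Name` and `Score` entries. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def classify_text(comprehend_client, endpoint_arn: str, text: str) -> dict:
    """Call a custom classification endpoint and return the top class.

    comprehend_client is expected to behave like boto3.client("comprehend").
    """
    response = comprehend_client.classify_document(
        Text=text, EndpointArn=endpoint_arn
    )
    # Pick the class with the highest confidence score.
    top = max(response["Classes"], key=lambda c: c["Score"])
    return {"sentence": text, "label": top["Name"], "score": top["Score"]}


# Usage with a real client (requires AWS credentials and a live endpoint):
#   import boto3
#   result = classify_text(boto3.client("comprehend"), "<endpoint ARN>", "some text")
```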

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low-confidence scores to human review. This type of feedback is called implicit feedback.

  4. The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose.

The other type of feedback is called explicit feedback and comes from the application’s end-users that use the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. Explicit feedback can be sent either in real-time through an API or a batch process.

  5. End-users of the application submit explicit real-time feedback through an API Gateway endpoint.
  6. The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream.
  7. End-users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
  8. A trigger configured on the S3 bucket triggers a Lambda function.
  9. The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
  10. Both the implicit and explicit feedback data are sent to a delivery stream in a standard format. All this data is buffered and written to an S3 bucket.
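The feedback steps above can be sketched as follows. The post does not spell out the standard feedback format, so the field names in `build_feedback_record` are purely hypothetical; only the `put_record` call (with `DeliveryStreamName` and a `Record` containing `Data` bytes) reflects the boto3 Firehose client. The client is injected so the sketch runs without AWS access.

```python
import json
from datetime import datetime, timezone


def build_feedback_record(sentence, predicted_label, score, feedback_type):
    """Serialize one feedback item (hypothetical schema) as a JSON line."""
    if feedback_type not in ("implicit", "explicit"):
        raise ValueError("feedback_type must be 'implicit' or 'explicit'")
    record = {
        "sentence": sentence,
        "predicted_label": predicted_label,
        "score": score,
        "feedback_type": feedback_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Firehose concatenates records as-is, so a trailing newline keeps
    # each item on its own line in the delivered S3 object.
    return (json.dumps(record) + "\n").encode("utf-8")


def send_feedback(firehose_client, stream_name, data: bytes):
    """Write one record to a delivery stream.

    firehose_client is expected to behave like boto3.client("firehose").
    """
    return firehose_client.put_record(
        DeliveryStreamName=stream_name, Record={"Data": data}
    )
```

Using one serializer for both implicit and explicit feedback keeps the objects in the feedback bucket uniform, which simplifies the downstream human review step.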

Human classification

The human classification stage includes the following steps:

  11. A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
  12. The Lambda function creates Amazon A2I human review tasks for all the feedback data received.
  13. Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
  14. After the human review, all these instances are stored in an S3 bucket and used for retraining the models. Part 2 of this series covers the retraining workflow.
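As a rough illustration of how a Lambda function can create a human review task, the sketch below uses the `start_human_loop` call of the boto3 `sagemaker-a2i-runtime` client, which takes a `HumanLoopName`, a `FlowDefinitionArn`, and a `HumanLoopInput` whose `InputContent` is a JSON string. The helper name `start_review` and the shape of the feedback dictionary are assumptions, not the code deployed by this post.

```python
import json
import uuid


def start_review(a2i_client, flow_definition_arn: str, feedback: dict) -> str:
    """Create one Amazon A2I human loop for a feedback item.

    a2i_client is expected to behave like
    boto3.client("sagemaker-a2i-runtime").
    """
    # Human loop names must be unique within a workflow; a UUID suffix
    # (lowercase hex and hyphens) is one simple way to achieve that.
    loop_name = f"feedback-{uuid.uuid4()}"
    a2i_client.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=flow_definition_arn,
        HumanLoopInput={"InputContent": json.dumps(feedback)},
    )
    return loop_name
```

The keys inside `InputContent` need to match whatever variables the worker task template reads, so the feedback dictionary and the template have to be designed together.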

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We classify news into four categories: World, Sports, Business, and Sci/Tech, using the AG News dataset for custom classification, and set up the implicit and explicit feedback loop. You need to complete two manual steps:

  1. Create an Amazon Comprehend custom classifier and an endpoint.
  2. Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.


Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

CSV file representing the Training data set

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

Creating a custom classifier and an endpoint

To create your classifier for classifying news, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. Choose Train classifier.
  3. For Name, enter news-classifier-demo.
  4. Select Using Multi-class mode.
  5. For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
  6. For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
  7. For IAM role, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter ComprehendCustom.

Comprehend Custom Classification Model Creation

  10. Scroll down and choose Train Classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Creating a private workforce, worker task template, and human review workflow.
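If you prefer scripting to the console, the same training job can be requested through the boto3 Comprehend client's `create_document_classifier` call. The sketch below only builds the request arguments; the helper name is hypothetical, the language code `en` is an assumption based on the English AG News dataset, and unlike the console flow you need to supply the ARN of an existing IAM role that Comprehend can assume.

```python
def classifier_request(name, role_arn, train_s3_uri, output_s3_uri):
    """Build keyword arguments for create_document_classifier
    (multi-class mode, mirroring the console steps above)."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": train_s3_uri},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "LanguageCode": "en",  # assumption: English training data
        "Mode": "MULTI_CLASS",
    }


request = classifier_request(
    "news-classifier-demo",
    "<your IAM role ARN>",
    "s3://<your-bucketname>/train.csv",
    "s3://<your-bucketname>/",
)

# To submit it (requires AWS credentials):
#   import boto3
#   boto3.client("comprehend").create_document_classifier(**request)
```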

Creating a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. From the Classifiers list, choose the name of your custom model, news-classifier-demo.
  3. From the Actions drop-down menu, choose Create endpoint.
  4. For Endpoint name, enter classify-news-endpoint and give it one inference unit.
  5. Choose Create endpoint.
  6. Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Custom Classification Model Endpoint Page
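The endpoint can also be created programmatically once training has finished. This sketch assumes the boto3 Comprehend client's `create_endpoint` call, which takes `EndpointName`, `ModelArn`, and `DesiredInferenceUnits` and returns the `EndpointArn`; the wrapper name `create_news_endpoint` is hypothetical, and the client is injected so the function can be tested offline.

```python
def create_news_endpoint(comprehend_client, model_arn: str,
                         name: str = "classify-news-endpoint",
                         inference_units: int = 1) -> str:
    """Create a real-time endpoint for a trained classifier and return its ARN.

    comprehend_client is expected to behave like boto3.client("comprehend").
    """
    response = comprehend_client.create_endpoint(
        EndpointName=name,
        ModelArn=model_arn,
        DesiredInferenceUnits=inference_units,
    )
    return response["EndpointArn"]
```

The returned ARN is the value you pass to the CloudFormation template later in this post.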

Creating a private workforce, worker task template, and human review workflow

This section walks you through creating a private workforce in Amazon SageMaker, a worker task template, and your human review workflow.

Creating a labeling workforce

  1. For this post, you create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).
  2. After the user accepts the invitation, add them to the work team. For instructions, see the Add a Worker to a Work Team section of Manage a Workforce (Amazon SageMaker Console).

Creating a worker task template

To create a worker task template, complete the following steps:

  1. On the Amazon A2I console, choose Worker task templates.
  2. Choose Create template.
  3. For Template name, enter custom-classification-template.
  4. For Template type, choose Custom.
  5. In the Template editor, enter the UI template code provided on GitHub.
  6. Choose Create.

Worker Task Template

Creating a human review workflow

To create your human review workflow, complete the following steps:

  1. On the Amazon A2I console, choose Human review workflows.
  2. Choose Create human review workflow.
  3. For Name, enter classify-workflow.
  4. Specify an S3 bucket to store output: s3://<your-bucketname>/.

Use the same bucket where you uploaded your train.csv in the prerequisite step.

  5. For IAM role, select Create a new role.
  6. For Task type, choose Custom.
  7. Under Worker task template creation, select the custom classification template you created.
  8. For Task description, enter Read the instructions and review the document.
  9. Under Workers, select Private.
  10. Use the drop-down list to choose the private team that you created.
  11. Choose Create.
  12. Copy the workflow ARN (see the following screenshot) to use when initializing the CloudFormation parameters.

Human Review Workflow Page

Deploying the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

  • Custom classifier endpoint ARN
  • Amazon A2I workflow ARN

  1. Choose Launch Stack:

  2. Enter the following parameters:
    1. ComprehendEndpointARN – The endpoint ARN you copied.
    2. HumanReviewWorkflowARN – The workflow ARN you copied.
    3. ComrehendClassificationScoreThreshold – Enter 0.5, which treats predictions with a confidence score below 50% as low confidence.

CloudFormation Required Parameters

  3. Choose Next until you reach the Capabilities section.
  4. Select the check box to acknowledge that AWS CloudFormation will create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

  5. Choose Create stack.

Acknowledgement section of the CloudFormation Page

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

CloudFormation Outputs

  6. On the Outputs tab of the stack (see the following screenshot), copy the values for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.
  7. Both the TextClassificationAPI and FeedbackAPI require an API key to interact with them. The CloudFormation output ApiGWKey refers to the name of the API key. Currently, this API key is associated with a usage plan that allows 2000 requests per month.
  8. On the API Gateway console, choose either the TextClassification API or the FeedbackAPI. Choose API Keys from the left navigation. Choose your API key from step 7. Expand the API key section in the right pane and copy the value.

API Key page

  9. You can manage the usage plan by following the instructions in Create, configure, and test usage plans with the API Gateway console.
  10. You can also add fine-grained authentication and authorization to your APIs. For more information on securing your APIs, see Controlling and managing access to a REST API in API Gateway.
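The stack parameters and outputs from the steps above can also be handled from a script. The sketch below is an illustration under stated assumptions: the two helper names are hypothetical, the parameter keys are copied as they appear in this post, and the list-of-dictionaries shape matches what the boto3 CloudFormation client expects for `create_stack` parameters and returns from `describe_stacks`. The template location is not shown because the post provides it only through the Launch Stack button.

```python
def stack_parameters(endpoint_arn: str, workflow_arn: str,
                     threshold: float = 0.5) -> list:
    """Build the Parameters list for the stack described in this post."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    values = {
        "ComprehendEndpointARN": endpoint_arn,
        "HumanReviewWorkflowARN": workflow_arn,
        # Key spelled exactly as it appears in the post.
        "ComrehendClassificationScoreThreshold": str(threshold),
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


def stack_outputs(cfn_client, stack_name: str) -> dict:
    """Return the stack outputs as a {key: value} dictionary.

    cfn_client is expected to behave like boto3.client("cloudformation").
    """
    stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
```

With the outputs in a dictionary, values such as BatchUploadS3Bucket can be looked up by name instead of being copied from the console.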

Testing the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact and test these APIs, you need to download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

  1. Open Postman, select the POST method, and enter the TextClassificationAPIGateway URL.
  2. In the Headers section, configure the API key: x-api-key: <<Your API key>>
  3. In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}
  4. Choose Send.

You get a response back with a confidence score and class, as seen in the following screenshot.

Sample JSON request to the Classify Text API endpoint.
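If you would rather script the call than use Postman, the same request can be assembled with Python's standard library. This is a sketch under assumptions: the helper name `build_api_request` is hypothetical, and the URL has to be the invoke URL of your own API Gateway deployment, which this post does not spell out. The JSON body and the x-api-key header follow the Postman steps above.

```python
import json
import urllib.request


def build_api_request(url: str, api_key: str,
                      classifier: str, sentence: str) -> urllib.request.Request:
    """Build a POST request carrying the JSON body and API key header."""
    body = json.dumps({"classifier": classifier, "sentence": sentence})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


req = build_api_request(
    "https://example.invalid/classify",  # placeholder, not a real endpoint
    "<<Your API key>>",
    "news-classifier-demo",
    "MS Dhoni retires and a billion people had mixed feelings.",
)

# To send it against your own deployment:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

The Feedback API request later in this post uses the same JSON shape, so the same helper can be pointed at that API's URL.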

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold as 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end-users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send explicit feedback on the model’s predicted label through the following methods:

  • Real time through an API, which is usually triggered through a like/dislike button on a UI.
  • Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team.

Invoking the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

  1. Open Postman, select the POST method, and enter the FeedbackAPIGatewayID value from your CloudFormation stack output.
  2. In the Headers section, configure the API key: x-api-key: <<Your API key>>
  3. In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}
  4. Choose Send.

Sample JSON request to the Feedback API endpoint.

Submitting explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. The following code shows some sample data in the file:

      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."

Uploading the file triggers the Lambda function that starts your human review loop.

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the Amazon SageMaker console, on the Labeling workforces page, on the Private tab.

Private Workforce section of the SageMaker console.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

Human Review Task Page

You see the tasks displayed based on the worker template used when creating the human workflow.

Human Review Task

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. To find it, go to Amazon SageMaker > Human review workflows > Output location:

Human Review Task Output Location

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. The following screenshot shows the human-annotated output file output.json in the S3 bucket:

Human Review Task Output payload
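To turn the reviewed output into retraining data, you need to pull the original input and the reviewer's answer out of each output.json. The sketch below assumes the general Amazon A2I output layout, with a top-level `inputContent` object and a `humanAnswers` list whose entries carry an `answerContent` object. The key inside `answerContent` depends on the worker task template, so the default `category` used here is purely hypothetical, as is the helper name.

```python
import json


def reviewed_examples(output_json: str, answer_key: str = "category") -> list:
    """Extract (input, human answer) pairs from one A2I output document."""
    doc = json.loads(output_json)
    source = doc.get("inputContent", {})
    rows = []
    for answer in doc.get("humanAnswers", []):
        content = answer.get("answerContent", {})
        # answer_key must match the field name defined in your template.
        rows.append({"input": source, "answer": content.get(answer_key)})
    return rows


sample = json.dumps({
    "humanLoopName": "feedback-example",
    "inputContent": {"sentence": "A gamer spends $26,500 on a virtual island."},
    "humanAnswers": [{"answerContent": {"category": "Sci/Tech"}}],
})
rows = reviewed_examples(sample)
print(rows[0]["answer"])
```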

The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow. We cover these remaining steps in Part 2 of this series.

Cleaning up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
  2. On the Amazon Comprehend console, delete the endpoint and the classifier.
  3. On the Amazon A2I console, delete the human review workflow, worker template, and the private workforce.
  4. On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)


Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any machine learning experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. The second part of this series covers the automatic model building, selection, and deployment of custom classification models.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.

About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy and architecture, and develops products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.




Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability.





Joyson Neville Lewis obtained his master’s in Information Technology from Rutgers University in 2018. He worked as a software/data engineer before diving into the Conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February of 2018 as a Big Data Consultant for the AWS Professional Services team in NYC.
