How CEVA uses TensorFlow Lite for Always-On Speech Recognition on the Edge

How CEVA uses TensorFlow Lite for Always-On Speech Recognition on the Edge

A guest article by Ido Gus of CEVA

CEVA is a leading licensor of wireless connectivity and smart sensing technologies. Our products help OEMs design power-efficient, intelligent and connected devices for a range of end markets, including mobile, consumer, automotive, robotics, industrial and IoT.

In this article, we’ll describe how we used TensorFlow Lite for Microcontrollers (TFLM) to deploy a speech recognition engine and frontend, called WhisPro, on a bare-metal development board based on our CEVA-BX DSP core. WhisPro detects always-on wake words and speech commands efficiently, on-device.

Figure 1 CEVA Multi-microphone DSP Development Board

About WhisPro

WhisPro is a speech recognition engine and frontend targeted to run on low power, resource constrained edge devices. It is designed to handle the entire data flow from processing audio samples to detection.

WhisPro supports two use cases for edge devices:

  • Always-on wake word detection engine. In this use case, WhisPro’s role is to wake a device in sleep mode when a predefined phrase is detected.
  • Speech commands. In this use case, WhisPro’s role is to enable a voice-based interface. Users can control the device using their voice. Typical commands can be: volume up, volume down, play, stop, etc.

WhisPro enables voice interface on any SoC that has a CEVA BX DSP core integrated into it, lowering entry barriers to OEMs and ODM interested in joining the voice interface revolution.

Our Motivation

Originally, WhisPro was implemented using an in-house neural network library called CEVA NN Lib. Although that implementation achieved excellent performance, the development process was quite involved. We realized that, if we ported the TFLM runtime library and optimized it for our target hardware, the entire model porting process would become transparent and more reliable (far fewer lines of code would need to be written, modified, and maintained).

Building TFLM for CEVA-BX DSP Family

The first thing we had to do is to figure out how to port TFLM to our own platform. We found that following this porting to a new platform guide to be quite useful.
Following the guide, we:

  • Verified DebugLog() implementation is supported by our platform.
  • Created a TFLM runtime library project in CEVA’s Eclipse-based IDE:
    • Created a new CEVA-BX project in CEVA’s IDE
    • Added all the required source files to the project
  • Built the TFLM runtime library for the CEVA-BX core.
    This required the usual fiddling with compiler flags, including paths (not all required files are under the “micro” directory), linker script, and so on.

Model Porting Process

Our starting point is a Keras implementation of our model. Let’s look at the steps we took to deploy our model on our bare-metal target hardware:

Converted theTensorFlow model to TensorFlow Lite using the TF built-in converter:

$ python3 -m [options] notebook.ipynb

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.experimental_new_converter = True
tflite_model = converter.convert()
open("converted_to_tflite_model.tflite", "wb").write(tflite_model)

Used quantization:

$ python3 -m [options] notebook.ipynb

converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.representative_dataset = representative_data_gen

Converted the TensorFlow Lite model to TFLM using xxd:

$ python3 -m [options] notebook.ipynb

$> xxd –I model.tflite >

Here we found that some of the model layers (for example, GRU) were not properly supported (at the time) by TFLM. It is very reasonable to assume that, as TFLM continues to mature and Google and the TFLM community invest more in it, issues like this will become rarer.
In our case, though, we opted to re-implement the GRU layers in terms of Fully Connected layers, which was surprisingly easy.


The next step was to integrate the TFLM runtime library and the converted model into our existing embedded C frontend, which handles audio preprocessing and feature extraction.

Even though our frontend was not written with TFLM in mind, it was modular enough to allow easy integration by implementation of a single simple wrapper function, as follows:

  1. Linked the TFLM runtime library into our embedded C application (WhisPro frontend)
  2. Implemented a wrapper-over-setup function for mapping the model into a usable data structure, allocating the interpreter and tensors
  3. Implemented a wrapper-over-execute function for mapping data passed from the WhisPro frontend into tflite tensors used by the actual execute function
  4. Replaced the call to the original model execute function with a call to the TFLM implementation

Process Visualization

The process we described is performed by two components:

  • The microcontroller supplier, in this case, CEVA – is responsible for optimizing TFLM for its hardware architecture.
  • The microcontroller user, in this case, CEVA WhisPro developer – is responsible for deploying a neural network based model, using an optimized TFLM runtime library, on the target microcontroller.

What’s Next

This work has proven the importance of the TFLM platform to us, and the significant value supporting TFLM can add to our customers and partners by enabling easy neural network model deployment on edge devices. We are committed to further support TFLM on the CEVA-BX DSP family by:

  • Active contribution to the TFLM project, with the goal of improving layer coverage and overall platform maturity.
  • Investing in TFLM operator optimization for execution on CEVA-BX cores, aiming for full coverage.

Final Thoughts

While the porting process had some bumps along the way, at the end it was a great success, and took about 4-5 days’ worth of work. Implementing a model in C from scratch, and handcrafting model conversion scripts from Python to C, could take 2-3 weeks (and lots of debugging).

CEVA Technology Virtual Seminar

To learn more, you are welcome to watch CEVA’s virtual seminar – Wireless Audio session, covering TFLM, amongst other topics.

Read More

How NetEase Yanxuan uses TensorFlow for customer service chat bots

How NetEase Yanxuan uses TensorFlow for customer service chat bots

Posted by Liu Huiyun, a senior algorithm engineer at NetEase

With the development of natural language processing (NLP) technology, Intelligent customer service has become an important use case in the e-commerce field. In recent years, this use case has received more and more attention. This is because, in the purchasing process, users need to be transferred to a customer services system for consultation and support if they encounter any problems or have questions. If the customer service system is able to provide accurate and effective responses, this will directly improve the user experience and have a positive impact on purchase conversion. For example:

  • In pre-sales scenarios, users may ask for more detailed information about the products or promotional activities that they are interested in.
  • In post-sales scenarios, users often have questions about returning and exchanging products, shipping fees, and logistics issues.

During actual business operations, NetEase Yanxuan, a large eCommerce platform in China, produces and accumulates large volumes of information, such as product attributes, activity operations, aftersales policies. In the meantime, the corresponding business logic is complicated. Intelligent customer service is an intelligent dialog system that leverages this information to automatically answer user questions or help human customer service representatives do so.

However, the e-commerce field involves many detailed and complicated business aspects, and users may ask their questions in many different ways and in a colloquial manner. These features require Intelligent customer service systems to possess strong semantic understanding. To this end, we have combined general customer scenarios with Yanxuan’s businesses and designed a deep learning based system. Check Yanxuan Intelligent customer service Framework full picture

in Figure 1 and Figure 2.

  • As a user inputs a question, the input text and its contextual information are first sent to the intent recognition (IR) module.
  • The intent recognition module analyzes the user’s multi-layered intents and then distributes them to different sub-modules.
  • The sub-modules are responsible for more targeted business Q&A, and different sub-modules apply different technical solutions.

As you can see, deep learning algorithms are applied to different modules in the framework. Because of the advanced NLP algorithms, we can extract more general and multi-granular semantic information from the user’s utterance.

Figure 3 shows the Xiaoxuan bot answering questions in a real dialog scenario. Next, I will introduce the different sub-modules that apply deep learning technology.

Xiaoxuan bot answering questions
Figure 3. Online Conversation Example

Intent Recognition Module — Multilayer Classification Model

As the user inputs text, we use a multilayer classification intent recognition model built with TensorFlow to analyze the input text, its context, and the historical behavior of the user. We divide first-level intents into four main categories: pre-sales product questions, aftersales questions, casual chatting, and the rest. When users ask common policy-related a ftersales questions, the input is summarized into more detailed sub-level intents. Click here (Figure 4) to check the structure of the intent recognition process.

In essence, intent recognition can be viewed as a classification problem. When building a classification system, we use the Attention+BiLSTM (ABL) model structure as a preliminary baseline. Except for the raw input text, we further design more features fed to the deep model, such as n-grams and positional encoding in the Transformer model. Ultimately, more manually crafted features improves the model accuracy by three percentage points. In addition, we also use a fine-tuned BERT model to train a classification model with less labeled data, and it performs as good as an ABL model. Pretrained models have better generalization, and can learn more semantic information based on fewer labeled data. However, this approach requires more computing resources.

FAQ Module — Text Matching Model

Answering FAQs is a key function of Intelligent customer service systems. This module is composed of two components, recall and re-rank.

  • The recall stage adopts discrete searches at the word granularity as well as semantic searches based on dense sentence vectors.
  • The re-rank stage uses a text matching model built with TensorFlow to re-rank the recalled candidature Q&A pairs.
  • Then, filter by the final mixed strategies, the module returns the final answer.

In the automatic Q&A field, text matching algorithms are commonly applied to sentence similarity task and natural language inference task. From the most basic Siamese-LSTM networks, the structure of matching modules has evolved through InferNet, Decomposable Attention, ESIM, and finally to BERT models. Generally speaking, matching algorithms can be categorized into two kinds, one is representation-based and the other is interaction-based. Representation methods are focused on the encoding of single sentences, regardless of the interactive semantics between sentences which is used in interaction methods.

At the service layer, we adopt a variety of question matching solutions:

  1. Perform association matching between input question Q and answer A.
  2. Perform similarity matching between input question Q₁ and standard question Q₂.
  3. Perform similar question matching between input question Q and standard question Qs.

These three methods perform question relevance recall and Q&A association matching in different ways. In the match and rank stages, we can use flexible weighted discrimination.

We built a Siamese-LSTM model to use as our baseline model and then implemented the following model iteration solutions:

  • We converted the LSTM units into the encoders of the Transformer model and replaced the cosine distance characterization module with the sentence-pair vector feature: to connect to the MLP layer.
  • We integrated an ESIM model with ELMo features.
  • We fine tuned the BERT model.

Tests showed that these optimizations improved these models. For example, the encoders of the Transformer model showed better accuracy in tasks (1) and (3), increasing performance by nearly 5 percentage points.

In addition, we found that, without any additional feature construction or techniques, BERT could provide stable and outstanding matching performance. This is because, in the pretraining stage, BERT aims to predict whether a contextual relationship exists between two sentences, so it can learn the relationships between sentences. In addition, the self-attention mechanism is adept at capturing deep semantics and can obtain fine-grained matching results for a word in sentence A and any word in sentence B. This is crucial for text matching tasks.

KBQA Module — NER Module

In the product knowledge-base Q&A (KBQA) and shopping guide modules, we built a named-entity recognition (NER) model for the e-commerce field based on TensorFlow. The model can recognize product names, product attribute names, product attribute values, and other key product information in the questions asked by users, as shown in Figure 5. Then, entity names are sent to downstream modules, where Q&A knowledge graph techniques are used to generate a final answer.

Figure 5. E-commerce NER Example

Generally, NER algorithm models use a bidirectional LSTM with a Conditional Random Field (CRF) layer. The former captures the before and after features, understands the context, and fully extracts contextual information. The latter focuses on the probabilistic transfer constructed from the local and global features of the current dialogue text, effectively mining the semantic information of the text. Yanxuan uses a BiLSTM-CRF model as a word-granularity baseline model, which serves the Intelligent customer service system. In later experiments, we tested feature extraction and fine-tuned BERT models.

In bert-based model optimization, we tried to use bert to extract sentence vector features and incorporate them into bilstm and crf, as well as two methods of bert-based fine-tuning: the last layer of embedding prediction, and the embedding method of weighted hidden layers. On the test set, the feature fusion performed best, with F1 as high as 0.92, followed by the multi-hidden layer fusion method (0.90), and finally the single high-layer method (0.88). In terms of the time efficiency of online inference, feature fusion takes about 100ms, and fine-tuning the model takes about 10ms.

The performance results using Yanxuan’s dataset are shown in Table 1. These results tell us the following:

  • Feature extraction provides better performance than fine tuning. In addition to using BiLSTM for semantic and structure information extraction, by introducing BERT features into a feature extraction model, we obtain a wider range of semantic and structural representations. The performance boost obtained by adding additional parameters, as in feature extraction, is significantly higher than that of normal fine tuning.
  • Multilayer feature fusion provides better performance than high-level features. This is because, for sequence tagging tasks, we need to consider both the semantic representation and the fusion of other granular representations of the sentence, such as syntactic structure information.
  • In terms of response time, feature extraction, which adds additional parameters, is well-suited to offline systems, but cannot meet the needs of online systems. Fine-tuned models, however, can meet the timeliness requirements of online systems.

Casual Chat Module — Generative Model

A standalone customer service bot must be able to answer difficult questions from users. At the same time, it must also have the ability to chat casually so as to demonstrate both its humanity and intelligence.

To give our bot this capability, we built a casual chat module capable of handling routine chatting. This module includes two key models: retrieval-based QA and generative QA.

  • The retrieval-based QA model first recalls answers from a prepared corpus and then uses a text matching model to re-rank the answers.
  • The generative QA model uses the Transformer generative model trained using TensorFlow’s tensor2tensor to generate responses in an end-to-end (E2E) manner.

However, a purely E2E approach to response generation is difficult to control. Therefore, we decided to fuse the two models in our online system to ensure more reliable responses.

Model Deployment

Figure 6 shows an online service flow based on the BERT model. Thanks to the open-source TensorFlow versions of language models such as BERT, only a small number of labeled samples need to be used to build various text models that feature high accuracy. Then, we can use GPUs to accelerate computation in order to meet the QPS requirements of online services. Finally, we can quickly deploy and launch the model based on TensorFlow Serving (TFS). Therefore, it is the support provided by TensorFlow that allows us to deploy and iterate online services in a stable and efficient manner.

Figure 6. BERT-based Online Service Flow


As deep learning technology continues to develop, new models will make new breakthroughs in the NLP field. By continuing to apply academic advances in the industry, we can achieve outstanding business results. However, this would not be possible without the work of TensorFlow. In Yanxuan’s business scenarios, TensorFlow provides flexible and refined APIs that enables engineers to deal with agile development and test new models, greatly facilitating algorithm model iteration.

Read More

Neural Structured Learning in TFX

Neural Structured Learning in TFX

Posted by Arjun Gopalan, Software Engineer, Google Research

Edited by Robert Crowe, TensorFlow Developer Advocate, Google Research


Neural Structured Learning (NSL) is a framework in TensorFlow that can be used to train neural networks with structured signals. It handles structured input in two ways: (i) as an explicit graph, or (ii) as an implicit graph where neighbors are dynamically generated during model training. NSL with an explicit graph is typically used for Neural Graph Learning while NSL with an implicit graph is typically used for Adversarial Learning. Both of these techniques are implemented as a form of regularization in the NSL framework. As a result, they only affect the training workflow and so, the model serving workflow remains unchanged. In the rest of this post, we will mostly focus on how graph regularization can be implemented using the NSL framework in TFX.

The high-level workflow for building a graph-regularized model using NSL entails the following steps:

  1. Build a graph, if one is not available.
  2. Use the graph and the input example features to augment the training data.
  3. Use the augmented training data to apply graph regularization to a given model.

These steps don’t immediately map onto existing TFX pipeline components. However, TFX supports custom components which allow users to implement custom processing within their TFX pipelines. See this blog post for an introduction to custom components in TFX. So, to create a graph-regularized model in TFX incorporating the above steps, we will make use of additional custom TFX components.

To illustrate an example TFX pipeline with NSL, let’s consider the task of sentiment classification on the IMDB dataset. A colab-based tutorial demonstrating the use of NSL for this task with native TensorFlow is available here, which we will use as the basis for our TFX pipeline example.

Graph Regularization With Custom TFX Components

To build a graph-regularized NSL model in TFX for this task, we will define three custom components using the custom Python functions approach. Here is a TFX pipeline schematic for our example using these custom components. For brevity, we have skipped components that typically come after the Trainer component like the Evaluator, Pusher, etc.

example chart

Figure 1: Example TFX pipeline for text classification using graph regularization

In this figure, only the custom components (in pink) and the Graph-regularized Trainer component have NSL-related logic. It’s worth noting that the custom components shown here are only illustrative and it may be possible to build a functionally equivalent pipeline in other ways. We now describe each of the custom components in further detail and show code snippets for them.


This custom component assigns a unique ID to each training example that is used to associate each training example with its corresponding neighbors from the graph.

def IdentifyExamples(
orig_examples: InputArtifact[Examples],
identified_examples: OutputArtifact[Examples],
id_feature_name: Parameter[str],
component_name: Parameter[str]
) -> None:

# Compute the input and output URIs.

# For each input split, update the TF.Examples to include a unique ID.
with beam.Pipeline() as pipeline:
| 'ReadExamples' >>
os.path.join(input_dir, '*'),
| 'AddUniqueId' >> beam.Map(make_example_with_unique_id, id_feature_name)
| 'WriteIdentifiedExamples' >>
file_path_prefix=os.path.join(output_dir, 'data_tfrecord'),

identified_examples.split_names = orig_examples.split_names

The make_example_with_unique_id() function updates a given example to include an additional feature containing a unique ID.


As mentioned above, in the IMDB dataset, no explicit graph is given as an input. So, we will build one before we can demonstrate graph regularization. For this example, we will use a pre-trained text embedding model to convert raw text in the movie reviews to embeddings, and then use the resulting embeddings to build a graph.

The SynthesizeGraph custom component handles graph building for our example and notice that it defines a new Artifact called SynthesizedGraph, which will be the output of this custom component.

"""Custom Artifact type"""
class SynthesizedGraph(tfx.types.artifact.Artifact):
"""Output artifact of the SynthesizeGraph component"""
TYPE_NAME = 'SynthesizedGraphPath'
'span': standard_artifacts.SPAN_PROPERTY,
'split_names': standard_artifacts.SPLIT_NAMES_PROPERTY,

def SynthesizeGraph(
identified_examples: InputArtifact[Examples],
synthesized_graph: OutputArtifact[SynthesizedGraph],
similarity_threshold: Parameter[float],
component_name: Parameter[str]
) -> None:

# Compute the input and output URIs

# We build a graph only based on the 'train' split which includes both
# labeled and unlabeled examples.
create_embeddings(train_input_examples_uri, output_graph_uri)
build_graph(output_graph_uri, similarity_threshold)
synthesized_graph.split_names = artifact_utils.encode_split_names(

The create_embeddings() function involves converting the text in movie reviews to corresponding embeddings using some pre-trained model on TensorFlow Hub. The build_graph() function involves invoking the build_graph() API in NSL.


The purpose of this custom component is to combine the example features (text in the movie reviews) with the graph built from embeddings to produce an augmented training dataset. The resulting training examples will include features from their corresponding neighbors as well.

def GraphAugmentation(
identified_examples: InputArtifact[Examples],
synthesized_graph: InputArtifact[SynthesizedGraph],
augmented_examples: OutputArtifact[Examples],
num_neighbors: Parameter[int],
component_name: Parameter[str]
) -> None:

# Compute the input and output URIs

# Separate out the labeled and unlabeled examples from the 'train' split.
train_path, unsup_path = split_train_and_unsup(train_input_uri)

# Augment training data with neighbor features.
train_path, unsup_path, graph_path, output_path, add_undirected_edges=True,

# Copy the 'test' examples from input to output without modification.

augmented_examples.split_names = identified_examples.split_names

The split_train_and_unsup() function involves splitting the input Examples into labeled and unlabeled examples and the pack_nbrs() NSL API creates the augmented training dataset.

Graph-regularized Trainer

Now that all of our custom components are implemented, the remaining NSL-specific addition to the TFX pipeline is in the Trainer component. Below is a simplified view of the graph-regularized Trainer component.


estimator = tf.estimator.Estimator(
model_fn=feed_forward_model_fn, config=run_config, params=HPARAMS)

# Create a graph regularization config.
graph_reg_config = nsl.configs.make_graph_reg_config(

# Invoke the Graph Regularization Estimator wrapper to incorporate
# graph-based regularization for training.
graph_nsl_estimator = nsl.estimator.add_graph_regularization(


As you can see, once a base model has been created (in this case a feed-forward neural network), it’s straightforward to convert it to a graph-regularized model by invoking the NSL wrapper API.

And that’s it! We now have all of the missing pieces that are required to build a graph-regularized NSL model in TFX. A colab-based tutorial that demonstrates this example end-to-end in TFX is available here. Feel free to try it and customize it as you want!

Adversarial Learning

As mentioned in the introduction above, another aspect of Neural Structured Learning is adversarial learning where instead of using explicit neighbors from a graph for regularization, implicit neighbors are created dynamically and adversarially to confuse the model. So, regularizing using adversarial examples is an effective way to improve a model’s robustness. Adversarial learning using NSL can be easily integrated into a TFX pipeline. It does not require any custom components and only the trainer component needs to be updated to invoke the adversarial regularization wrapper API in NSL.


We have demonstrated how to build a graph-regularized model with NSL in TFX using custom components. It’s certainly possible to build graphs in other ways as well as structure the overall pipeline differently. We hope that this example provides a basis for your own NSL workflows.

Additional Links

For more information on NSL, check out the following resources:


We’d like to thank the Neural Structured Learning and TFX teams at Google as well as Aurélien Geron for their support and contributions.

Read More

How TensorFlow docs uses Jupyter notebooks

How TensorFlow docs uses Jupyter notebooks

Posted by Billy Lamberta, TensorFlow Team

Jupyter notebooks are an important part of our TensorFlow documentation infrastructure. With the JupyterCon 2020 conference underway, the TensorFlow docs team would like to share some tools we use to manage a large collection of Jupyter notebooks as a first-class documentation format published on

As the TensorFlow ecosystem has grown, the TensorFlow documentation has grown into a substantial software project in its own right. We publish ~270 notebook guides and tutorials on—all tested and available in GitHub. We also publish an additional ~400 translated notebooks for many languages—all tested like their English counterpart. The tooling we’ve developed to work with Jupyter notebooks helps us manage all this content.

Graph showing Notebooks published

When we published our first notebook on over two years ago for the 2018 TensorFlow Developer Summit, the community response was fantastic. Users love that they can immediately jump from webpage documentation to an interactive computing experience in Google Colab. This setup allows you to run—and experiment with—our guides and tutorials right in the browser, without installing any software on your machine. This integration with Colab made it much easier to get started and changed how we could teach TensorFlow using Jupyter notebooks. Other machine learning projects soon followed. Notebooks can be loaded directly from GitHub into Google Colab with just the URL:<repo>/blob/<branch>/<path>/notebook.ipynb

For compute-intensive tasks, Colab provides TPUs and GPUs at no cost. The TensorFlow documentation, such as this quickstart tutorial, has buttons that link to both its notebook source in GitHub and to load in Colab.

Better collaboration

Software documentation is a team effort, and notebooks are an expressive, education-focused format that allows engineers and writers to build up an interactive demonstration. Jupyter notebooks are JSON-formatted files that contain text cells and code cells, typically executed in sequential order from top-to-bottom. They are an excellent way to communicate programming ideas, and, with some discipline, a way to share reproducible results.

On the TensorFlow team, notebooks allow engineers, technical writers, and open source contributors to collaborate on the same document without the tension that exists between a separate code example and its published explanation. We write TensorFlow notebooks so that the documentation is the code—self-contained, easily shared, and tested.

Notebook translations with GitLocalize

Documentation needs to reach everyone around the world—something the TensorFlow team values. The TensorFlow community translation project has grown to 10 languages over the past two years. Translation sprints are a great way to engage with the community on open source documentation projects.

To make TensorFlow documentation accessible to even more developers, we worked with Alconost to add Jupyter notebook support to their GitLocalize translation tool. GitLocalize makes it easy to create translated notebooks and sync documentation updates from the source files. Open source contributors can submit pull requests and provide reviews using the TensorFlow GitLocalize project:

Jupyter notebook support in GitLocalize not only benefits TensorFlow, but is now available for all open source translation projects that use notebooks with GitHub.

TensorFlow docs notebook tools

Incorporating Jupyter notebooks into our docs infrastructure allows us to run and test all the published guides and tutorials to ensure everything on the site works for a new TensorFlow release—using stable or nightly packages.

Benefits aside, there are challenges with managing Jupyter notebooks as source code. To make pull requests and reviews easier for contributors and project maintainers, we created the TensorFlow docs notebook tools to automate common fixes and communicate issues to contributors with continuous integration (CI) tests. You can install the tensorflow-docs pip package directly from the tensorflow/docs GitHub repository:

$ python3 -m pip install -U git+


While the Jupyter notebook format is straightforward, notebook authoring environments are often inconsistent with JSON formatting or embed their own metadata in the file. These unnecessary changes can cause diff churn in pull requests that make content reviews difficult. The solution is to use an auto-formatter that outputs consistent notebook JSON.

nbfmt is a notebook formatter with a preference for the TensorFlow docs notebook style. It formats the JSON and strips unneeded metadata except for some Colab-specific fields used for our integration. To run:

$ python3 -m [options] notebook.ipynb

For TensorFlow docs projects, notebooks saved without output cells are executed and tested; notebooks saved with output cells are published as-is. We prefer to remove outputs to test our notebooks, but nbfmt can be used with either format.

The --test flag is available for continuous integration tests. Instead of updating the notebook, it returns an error if the notebook is not formatted. We use this in a CI test for one of our GitHub Actions workflows. And with some further bot integration, formatting patches can be automatically applied to the contributor’s pull request.


The easiest way to scale reviews is to let the machine do it. Every project has recurring issues that pop up in reviews, and style questions are often best settled with a style guide (TensorFlow likes the Google developer docs style guide). For a large project, the more patterns you can catch and fix automatically, the more time you’ll have available for other goals.

nblint is a notebook linting tool that checks documentation style rules. We use it to catch common style and structural issues in TensorFlow notebooks:

>$ python3 -m [options] notebook.ipynb

Lints are assertions that test specific sections of the notebook. These lints are collected into style modules. nblint tests the google and tensorflow styles by default, and other style modules can be loaded at the command-line. Some styles require arguments that are also passed at the command-line, for example, setting a different repo when linting the TensorFlow translation notebooks:

$ python3 -m 

Lint tests can have an associated fix that makes it easy to update notebooks to pass style checks automatically. Use the --fix argument to apply lint fixes that overwrite the notebook, for example:

$ python3 -m --fix 
--arg=repo:tensorflow/docs notebook.ipynb

Learn more

TensorFlow is a big fan of Project Jupyter and Jupyter notebooks. Along with Google Colab, notebooks changed how we teach TensorFlow and scale a large open source documentation project with tested guides, tutorials, and translations. We hope that sharing some of the tools will help other open source projects that want to use notebooks as documentation.

Read a TensorFlow tutorial and then run the notebook in Google Colab. To contribute to the TensorFlow documentation project, submit a pull request or a translation review to our GitLocalize project.

Special thanks to Mark Daoust, Wolff Dobson, Yash Katariya, the TensorFlow docs team, and all TensorFlow docs authors, reviewers, contributors, and supporters.

Read More

Optimizing TensorFlow Lite Runtime Memory

Optimizing TensorFlow Lite Runtime Memory

Posted by Juhyun Lee and Yury Pisarchyk, Software Engineers

Running inference on mobile and embedded devices is challenging due to tight resource constraints; one has to work with limited hardware under strict power requirements. In this article, we want to showcase improvements in TensorFlow Lite’s (TFLite) memory usage that make it even better for running inference at the edge.

Intermediate Tensors

Typically, a neural network can be thought of as a computational graph consisting of operators, such as CONV_2D or FULLY_CONNECTED, and tensors holding the intermediate computation results, called intermediate tensors. These intermediate tensors are typically pre-allocated to reduce the inference latency at the cost of memory space. However, this cost, when implemented naively, can’t be taken lightly in a resource-constrained environment; it can take up a significant amount of space, sometimes even several times larger than the model itself. For example, the intermediate tensors in MobileNet v2 take up 26MB memory (Figure 1) which is about twice as large as the model itself.

Figure 1. The intermediate tensors of MobileNet v2 (top) and a mapping of their sizes onto a 2D memory space (bottom). If each intermediate tensor uses a dedicated memory buffer (depicted with 65 distinct colors), they take up ~26MB of runtime memory.

The good news is that these intermediate tensors don’t have to co-exist in memory thanks to data dependency analysis. This allows us to reuse the memory buffers of the intermediate tensors and reduce the total memory footprint of the inference engine. If the network has the shape of a simple chain, two large memory buffers are sufficient as they can be swapped back and forth interchangeably throughout the network. However, for arbitrary networks forming complicated graphs, this NP-complete resource allocation problem requires a good approximation algorithm.

We have devised a number of different approximation algorithms for this problem, and they all perform differently depending on the neural network and the properties of memory buffers, but they all use one thing in common: tensor usage records. A tensor usage record of an intermediate tensor is an auxiliary data structure that contains information about how big the tensor is and when it is used for the first and the last time in a given execution plan of the network. With the help of these records, the memory manager is able to compute the intermediate tensor usages at any moment in the network’s execution and optimize its runtime memory for the smallest footprint possible.

Shared Memory Buffer Objects

In TFLite GPU OpenGL backend, we employ GL textures for these intermediate tensors. These come with a couple of interesting restrictions: (a) A texture’s size can’t be modified after its creation, and (b) only one shader program gets exclusive access to the texture object at a given time. In this Shared Memory Buffer Objects mode, the objective is to minimize the sum of the sizes of all created shared memory buffer objects in the object pool. This optimization is similar to the well-known register allocation problem, except that it’s much more complicated due to the variable size of each object.

With the aforementioned tensor usage records, we have devised 5 different algorithms as shown in Table 1. Except for Min-Cost Flow, they are greedy algorithms, each using a different heuristic, but still reaching or getting very close to the theoretical lower bound. Some algorithms perform better than others depending on the network topology, but in general, GREEDY_BY_SIZE_IMPROVED and GREEDY_BY_BREADTH produce the object assignments with the smallest memory footprint.

Table 1. Memory footprint of Shared Objects strategies (in MB; best results highlighted in green). The first 5 rows are our strategies, and the last 2 serve as a baseline (Lower Bound denotes an approximation of the best number possible which may not be achievable, and Naive denotes the worst number possible with each intermediate tensor assigned its own memory buffer).

Coming back to our opening example, GREEDY_BY_BREADTH performs best on MobileNet v2 which leverages each operator’s breadth, i.e. the sum of all tensors in the operator’s profile. Figure 2, especially when compared to Figure 1, highlights how big of a gain one can get when employing a smart memory manager.

Figure 2. The intermediate tensors of MobileNet v2 (top) and a mapping of their sizes onto a 2D memory space (bottom). If the intermediate tensors share memory buffers (depicted with 4 distinct colors), they only take up ~7MB of runtime memory.

Memory Offset Calculation

For TFLite running on the CPU, the memory buffer properties applicable to GL textures don’t apply. Thus, it is more common to allocate a huge memory arena upfront and have it shared among all readers and writers which access it by a given offset that does not interfere with other read and writes. The objective in this Memory Offset Calculation approach is to minimize the size of the memory arena.

We have devised 3 different algorithms for this optimization problem and have also explored prior work (Strip Packing by Sekiyama et al. 2018). Similar to the Shared Objects approach, some algorithms perform better than others depending on the network as shown in Table 2. One takeaway from this investigation is that the Offset Calculation approach has a smaller footprint than the Shared Objects approach in general, and thus, one should prefer the former over the latter if applicable.

Table 2. Memory footprint of Offset Calculation strategies (in MB; best results highlighted in green). The first 3 rows are our strategies, the next 1 is prior work, and the last 2 serve as baseline (Lower Bound denotes an approximation of the best number possible which may not be achievable, and Naive denotes the worst number possible with each intermediate tensor assigned its own memory buffer).

These memory optimizations, for both CPU and GPU, have shipped by default with the last few stable TFLite releases, and have proven valuable in supporting more demanding, state-of-the-art models like MobileBERT. You can find more details about the implementation by looking at the GPU implementation and CPU implementation directly.


Matthias Grundmann, Jared Duke, Sarah Sirajuddin, and special thanks to Andrei Kulik for initial brainstorming and Terry Heo for the final implementation in TFLite.

Read More

Boosting quantum computer hardware performance with TensorFlow

Boosting quantum computer hardware performance with TensorFlow

A guest article by Michael J. Biercuk, Harry Slatyer, and Michael Hush of Q-CTRL

Google recently announced the release of TensorFlow Quantum – a toolset for combining state-of-the-art machine learning techniques with quantum algorithm design. This was an important step to build tools for developers working on quantum applications – users operating primarily at the “top of the stack”.

In parallel we’ve been building a complementary TensorFlow-based toolset working from the hardware level up – from the bottom of the stack. Our efforts have focused on improving the performance of quantum computing hardware through the integration of a set of techniques we call quantum firmware.

In this article we’ll provide an overview of the fundamental driver for this work – combating noise and error in quantum computers – and describe how the team at Q-CTRL uses TensorFlow to efficiently characterize and suppress the impact of noise and imperfections in quantum hardware. These are key challenges in the global effort to make quantum computers useful.

Q-CTRL image

The Achilles heel of quantum computers – noise and error

Quantum computing, simply put, is a new way to process information using the laws of quantum physics – the rules that govern nature on tiny size scales. Through decades of effort in science and engineering we’re now ready to put this physics to work solving problems that are exceptionally difficult for regular computers.

Realizing useful computations on today’s systems requires a recognition that performance is predominantly limited by hardware imperfections and failures, not system size. Susceptibility to noise and error remains the Achilles heel of quantum computers, and ultimately limits the range and utility of algorithms run on quantum computing hardware.

As a broad community average, most quantum computer hardware can run just a few dozen calculations over a time much less than one millisecond before requiring a reset due to the influence of noise. Depending on the specifics that’s about 1024 times worse than the hardware in a laptop!

This is the heart of why quantum computing is really hard. In this context, “noise” describes all of the things that cause interference in a quantum computer. Just like a mobile phone call can suffer interference leading it to break up, a quantum computer is susceptible to interference from all sorts of sources, like electromagnetic signals coming from WiFi or disturbances in the Earth’s magnetic field.

When qubits in a quantum computer are exposed to this kind of noise, the information in them gets degraded just the way sound quality is degraded by interference on a call. In a quantum system this process is known as decoherence. Decoherence causes the information encoded in a quantum computer to become randomized – and this leads to errors when we execute an algorithm. The greater the influence of noise, the shorter the algorithm that can be run.

So what do we do about this? To start, for the past two decades teams have been working to make their hardware more passively stable – shielding it from the noise that causes decoherence. At the same time theorists have designed a clever algorithm called Quantum Error Correction that can identify and fix errors in the hardware, based in large part on classical error correction codes. This is essential in principle, but the downside is that to make it work you have to spread the information in one qubit over lots of qubits; it may take 1000 or more physical qubits to realize just one error-corrected “logical qubit”. Today’s machines are nowhere near capable of getting benefits from this kind of Quantum Error Correction.

Q-CTRL adds something extra – quantum firmware – which can stabilize the qubits against noise and decoherence without the need for extra resources. It does this by adding new solutions at the lowest layer of the quantum computing stack that improve the hardware’s robustness to error.

Building quantum firmware with TensorFlow

Quantum firmware graphic

Quantum firmware describes a set of protocols whose purpose is to deliver quantum hardware with augmented performance to higher levels of abstraction in the quantum computing stack. The choice of the term firmware reflects the fact that the relevant routines are usually software-defined but embedded proximal to the physical layer and effectively invisible to higher layers of abstraction.

Quantum computing hardware generally relies on a form of precisely engineered light-matter interaction in order to enact quantum logic operations. These operations in a sense constitute the native machine language for a quantum computer; a timed pulse of microwaves on resonance with a superconducting qubit can translate to an effective bit-flip operation while another pulse may implement a conditional logic operation between a pair of qubits. An appropriate composition of these electromagnetic signals then implements the target quantum algorithm.

Quantum firmware determines how the physical hardware should be manipulated, redefining the hardware machine language in a way that improves stability against decoherence. Key to this process is the calculation of noise-robust operations using information gleaned from the hardware itself.

Building in TensorFlow was essential to moving beyond “home-built’’ code to commercial-grade products for Q-CTRL. Underpinning these techniques (formally coming from the field of quantum control) are tools allowing us to perform complex gradient-based optimizations. We express all optimization problems as data flow graphs, which describe how optimization variables (variables that can be tuned by the optimizer) are transformed into the cost function (the objective that the optimizer attempts to minimize). We combine custom convenience functions with access to TensorFlow primitives in order to efficiently perform optimizations as used in many different parts of our workflow. And critically, we exploit TensorFlow’s efficient gradient calculation tools to address what is often the weakest link in home-built implementations, especially as the analytic form of the relevant function is often nonlinear and contains many complex dependencies.

For example, consider the case of defining a numerically optimized error-robust quantum bit flip used to manipulate a qubit – the analog of a classical NOT gate. As mentioned above, in a superconducting qubit this is achieved using a pulse of microwaves. We have the freedom to “shape” various aspects of the envelope of the pulse in order to enact the same mathematical transformation in a way that exhibits robustness against common noise sources, such as fluctuations in the strength or frequency of the microwaves.

To do this we first define the data flow graph used to optimize the manipulation of this qubit – it includes objects that describe available “knobs” to adjust, the sources of noise, and the target operation (here a Hadamard gate).

data flow graph

The data flow graph used to optimize quantum controls. The loop at left is run through our TensorFlow optimization engine

Once the graph has been defined inside our context manager, an object must be created that ties together the objective function (in this case minimizing the resultant gate error) and the desired outputs defining the shape of the microwave pulse. With the graph object created, an optimization can be run using a service that returns a new graph object containing the results of the optimization.

This structure allows us to simply create helper functions which enable physically motivated constraints to be built directly into the graph. For instance, these might be symmetry requirements, limits on how a signal changes in time, or even incorporation of characteristics of the electronics systems used to generate the microwave pulses. Any other capabilities not directly covered by this library of helper functions can also be directly coded as TensorFlow primitives.

With this approach we achieve an extremely flexible and high-performance optimization engine; our direct benchmarking has revealed order-of-magnitude benefits in time to solution relative to the best available alternative architectures.

The capabilities enabled by this toolkit span the space of tasks required to stabilize quantum computing hardware and reduce errors at the lowest layer of the quantum computing stack. And importantly they’re experimentally verified on real quantum computing hardware; quantum firmware has been shown to reduce the likelihood of errors, mitigate system performance variations across devices, stabilize hardware against slowly drifting out of calibration, and even make quantum logic operations more compatible with higher level abstractions in quantum computing such as quantum error correction. All of these capabilities and real hardware demonstrations are accessible via our publicly available User Guides and Application Notes in executable Jupyter notebook form.

Ultimately, we believe that building and operating large-scale quantum computing systems will be effectively impossible without the integration of the capabilities encapsulated in quantum firmware. There are many concepts to be drawn from across the fields of machine learning and robotic control in the drive for performance and autonomy, and TensorFlow has proven an efficient language to support the development of the critical toolsets.

A brief history of QC, from Shor to quantum machine learning

The quantum computing boom started in 1994 with the discovery of Shor’s algorithm for factoring large numbers. Public key cryptosystems — which is to say, most encryption — rely on the mathematical complexity of factoring primes to keep messages safe from prying computers. By virtue of their approach to encoding and processing information, however, quantum computers are conjectured to be able to factor primes faster — exponentially faster — than a classical machine. In principle this poses an existential threat not only to national security, but also emerging technologies such as cryptocurrencies.

This realization set in motion the development of the entire field of quantum computing. Shor’s algorithm spurred the NSA to begin one of its first ever open, University-driven research programs asking the question of whether such systems could be built. Fast forward to 2020 and quantum supremacy has been achieved, meaning that a real quantum computing hardware system has performed a task that’s effectively impossible for even the world’s largest supercomputers.

Quantum supremacy is an important technical milestone whose practical importance in solving problems of relevance to end users remains a bit unclear. Our community is continuing to make great progress towards quantum advantage – a threshold indicating that it’s actually cheaper or faster to use a quantum computer for a problem of practical relevance. And for the right problems, we think that within the next 5-10 years we’ll cross that threshold with a quantum computer that isn’t that much bigger than the ones we have today. It just needs to perform much better.

So, which problems are the right problems for quantum computers to address first?

In many respects, Shor’s algorithm has receded in importance as the scale of the challenge emerged. A recent technical analysis suggests that we’re unlikely to see Shor deployed at a useful scale until 2039. Today, small-scale machines with a couple of dozen interacting qubits exist in labs around the world, built from superconducting circuits, individual trapped atoms, or similarly exotic materials. The problem is that these early machines are just too small and too fragile to solve problems relevant to factoring.

To factor a number sufficiently large to be relevant in cryptography, one would need a system composed of thousands of qubits capable of handling trillions of operations each. This is nothing for a conventional machine where hardware can run for a billion years at a billion operations per second and never be likely to suffer a fault. But as we’ve seen it’s quite a different story for quantum computers.

These limits have driven the emergence of a new class of applications in materials science and chemistry that could prove equally impactful, using much smaller systems. Quantum computing in the near term could also help develop new classes of artificial intelligence systems. Recent efforts have demonstrated a strong and unexpected link between quantum computation and artificial neural networks, potentially portending new approaches to machine learning.

This class of problem can often be cast as optimizations where input into a classical machine learning algorithm comes from a small quantum computation, or where data is represented in the quantum domain and a learning procedure implemented. TensorFlow Quantum provides an exciting toolset for developers seeking new and improved ways to exploit the small quantum computers existing now and in the near future.

Still, even those small machines don’t perform particularly well. Q-CTRL’s quantum firmware enables users to extract maximum performance from hardware. Thus we see that TensorFlow has a critical role to play across the emerging quantum computing software stack – from quantum firmware through to algorithms for quantum machine learning.

Resources if you’d like to learn more

We appreciate that members of the TensorFlow community may have varying levels of familiarity with quantum computing, and that this overview was only a starting point. To help readers interested in learning more about quantum computing we’re happy to provide a few resources:

  • For those knowledgeable about machine learning, Q-CTRL has also produced a series of webinars introducing the concept of Robust Control in quantum computing and even demonstrating reinforcement learning to discover gates on real quantum hardware.
  • If you need to start from zero, Q-CTRL has produced a series of introductory video tutorials helping the uninitiated begin their quantum journey via our learning center. We also offer a visual interface enabling new users to discover and build intuition for the core concepts underlying quantum computing – including the impact of noise on quantum hardware.
  • Jack Hidary from X wrote a great text focused on linking the foundations of quantum computing with how teams today write code for quantum machines.
  • The traditional “formal” starting point for those interested in quantum computing is the timeless textbook from “Mike and Ike

Read More

Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)

Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)

Posted by Konstantinos (Gus) Katsiapis on behalf of the TFX Team

Table of Contents

TFX logo


Software Engineering, as a discipline, has matured over the past 5+ decades. The modern world heavily depends on it, so the increased maturity of Software Engineering was an eventuality. Practices like testing and reliable technologies help make Software Engineering reliable enough to build industries upon. Meanwhile, Machine Learning (ML) has also grown over the past 2+ decades. ML is used more and more for research, experimentation and production workloads. ML now commonly powers widely-used products integral to our lives.

But ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering?

In this article we will give a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering. We argue that in order to unlock the gains ML can bring, organizations should advance the maturity of their ML teams by investing in robust ML infrastructure and promoting ML Engineering education. We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations. In closing, we will also share a glimpse into the future of TFX.

Where We Are Coming From

Applied ML has been an integral part of Google products and services over the last decade, and is becoming more so over time. We discovered early on from our endeavors to apply ML in production that while ML algorithms are important, they are usually insufficient in realizing the successful application of ML in a product. In particular, E2E ML platforms, which help with all aspects of the ML lifecycle, are usually needed to both accelerate ML adoption and make its use durable and sustainable.

Sibyl (2007 – 2020)

E2E ML platforms are not a new thing at Google. Sibyl, founded in 2007, was a platform that enabled massive-scale ML, catered to production use. Sibyl offered a decent amount of modeling flexibility on top of “wide” models (linear, logistic, poisson regression and later factorization machines) coupled with non-linear transformations and customizable loss functions and regularization. Importantly, Sibyl also offered tools for several aspects of the ML workflow including Data Ingestion, Data Analysis and Validation, Training (of course), Model Analysis, and Training-Serving Skew Detection. All these were packaged as a single integrated product that allowed for iterative experimentation. This holistic product offering, coupled with the Sibyl team’s user focus, rendered Sibyl to, once upon a time, be one of the most widely used E2E ML platforms at Google. Sibyl has since been decommissioned. It was in production for ~14 years, and the vast majority of its workloads migrated to TFX.

TFX (2017 – ?)

While several of us were still working on Sibyl, a notable revolution was happening in the ML algorithms fields with the popularization of Deep Learning (DL). In 2015, Google publicly released TensorFlow (which was itself a successor to a previous system called DistBelief). Since its inception, TensorFlow supported a variety of applications with a focus on DL training and inference. Its flexible programming model allowed it to be used for a lot more than DL and its popularity in both research and production positioned it as the lingua franca for authoring ML algorithms. While TensorFlow offered flexibility, it lacked a complete end-to-end production system. On the other hand, Sibyl had robust end-to-end capabilities, but lacked flexibility. It became apparent that we needed an E2E ML platform for TensorFlow in order to accelerate ML at Google; in 2017, nearly a decade after the birth of Sibyl, we launched TFX within Google. TFX is now the most widely used, general purpose E2E ML platform at Alphabet, including Google.

In the 3 years since its launch, TFX has enabled Alphabet to realize what might be described as “industrial-scale” ML: TFX is used by thousands of users within Alphabet, and it powers hundreds of popular Alphabet products, including Cloud AI services on Google Cloud Platform (GCP). On any given day there are thousands of TFX pipelines running, which are processing exabytes of data and producing tens of thousands of models, which in turn are performing hundreds of millions of inferences per second. TFX’s widespread adoption helps Alphabet realize the flow of research into production and enables very diverse use cases for both direct and indirect TFX users. This widespread adoption also enables teams to focus on model development rather than ML platform development, allowing ML to be more easily used in novel product areas, and creating a virtuous cycle of ML platform evolution from ML applications.

Based on our internal success, and the expectation that equivalents of ML engineering will be needed by organizations and individuals everywhere in the world, we decided to publicly describe the design and initial deployment of TFX within Google and to, step by step, make more of our learnings and our technology publicly available (including open source), while we continue building more of each. We were able to accomplish this in part because, like Sibyl, TFX built upon robust infrastructural dependencies. For example, Sibyl made heavy use of MapReduce and its successor Flume for its distributed data processing, and now TFX heavily uses their portable successor, Apache Beam, for the same.

Following in TensorFlow’s footsteps, the public TFX offering was released in early 2019 and widely adopted in under a year across environments including on-premises and GCP with Cloud AI Platform Pipelines. Some of our partners have also publicly shared their use cases powered by TFX, including how it radically improved their applied ML velocity.

Lessons From Our 10+ Year Journey Of ML Platform Evolution

Though the journey of ML Platform(s) evolution at Google has been a long and exciting one, we expect that the majority of excitement is yet to come! To that end, we want to share a summary of our learnings, some of which were more painfully gained than others. The learnings fall into two categories, namely what remained the same as part of the evolution, but also what changed, and why! We present the learnings in the context of two successive platforms, Sibyl and TFX, though we believe them to be widely applicable.

What Remains The Same And Why

The areas discussed in this section capture a few examples of things that seem enduring and pass the test of time. As such, we expect these to also remain applicable in the future, across different incarnations of ML platforms and frameworks. We look at these from both an applied ML perspective and an infrastructure perspective.

Applied ML

The Rules Of Machine Learning

Successfully applying ML to a product is very much a discipline. It involves a steep learning curve and necessitates some mental model shifts (or perhaps augmentations). To make this challenging task easier, we have publicly shared The Rules of Machine Learning. These are rules that represent learnings from iteratively applying ML to a lot of products at Google. Notably, the adoption of ML in Google products illustrates a common evolution:

  • Start with simple rules and heuristics, and generate data to learn from; this journey usually starts from the serving side.
  • Move to simple ML (i.e., simple models) and realize large gains; this is usually the entry point for introduction of ML pipelines.
  • Move to ML with more features and more advanced models to realize decent gains.
  • Move to state-of-the-art ML, manage refinement and complexity (for solutions to the problems that are worth it), and realize small gains.
  • Apply the above launch-and-iterate cycle to more aspects of products and to solve more problems, bearing in mind return on investment (and diminishing returns).

We have found The Rules of Machine Learning to be steadfast across platforms and time and we hope they end up being as valuable to others as they have been to us and our users. In particular, we believe that following the rules will help others be better at the discipline of ML engineering, including helping them avoid the mistakes that we and our users have made in the past. TFX is an attempt to codify these rules, quite literally, in code. We hope to benefit ourselves but also accelerate ML, done well, for the entire industry.

The Discipline Of ML Engineering

In developing The Rules of Machine Learning, we realized that the discipline for building robust systems where the core logic is produced by complex processes involving both code and data requires additional scrutiny beyond that which software engineering provides. As such, we define ML Engineering as a superset of the discipline of software engineering designed to handle the unique complexities of the practical application of ML.

Attempting to summarize the totality of the discipline of ML engineering would be somewhat difficult, if not impossible, especially given how our understanding of it is still limited, and the discipline itself continues to evolve. We do take solace in the following though:

  • The limited understanding we do have seems to be enduring across platforms and time.
  • Analogy can be a powerful tool, so several aspects of the better understood discipline of software engineering have helped us draw parallels of how ML engineering could evolve from ML programming, much like how software engineering evolved from programming.

An early realization we had was the following: artifacts are first class citizens in ML, on par with the processes that produce and consume them.

This realization affected the implementation and evolution of Sibyl; it was entrenched in TFX by the time we publicly wrote about it and was ultimately generalized and formalized in ML Metadata, now powering TFX.

Below we present fundamental elements of ML engineering, some examples of ML artifacts and their first class citizenship, and make an attempt to draw analogies with software engineering where possible.


Similarly to how code is at the heart of software, data is at the heart of ML. Data management represents serious challenges in production ML. Perhaps the simplest analogy would be to think about what constitutes a unit test for data. Unit tests verify expectations on how code should behave, by testing the contracts of the pertinent code and instilling trustworthiness in said contracts. Similarly, setting explicit expectations on the form of the data (including its schema, invariants and value distributions), and checking that the data agrees with implicit expectations embedded in the training code can, more so together, make the data trustworthy enough to train models with. Though unit tests can be exhaustive and verify strong contracts, data contracts are in general a lot weaker even if they are necessary. Though unit tests can be exhaustively consumed and verified by humans, data can usually be meaningful to humans only in summarized fashion.

Just as code repositories and version control are pillars for managing code evolution in software engineering, systems for managing data evolution and understanding are pillars of ML engineering.

TFX’s ExampleGen, StatisticsGen, SchemaGen and ExampleValidator components help one treat data as first class citizens, by enabling data management, analysis and validation in (continuous) ML pipelines.


Similarly to how a software engineer produces code that is compiled into programs, an ML engineer produces data and code which is “compiled” into ML programs, more commonly known as models. These two kinds of programs are however very different in nature. Though programs that come out of software usually have strong contracts, models have much weaker contracts. These weak contracts are usually statistical in nature and as such only verifiable in some summarized form (such as a model having sufficient accuracy on a subset of labeled data). This is not at all surprising since models are the product of code and data, and the latter itself doesn’t have strong contracts and is also only digestible in summarized form.

Just as code and data evolve over time, models also evolve over time. However, model evolution is more complicated than the evolution of its constituent code and data. For example, high test coverage (with fuzzing) can give good confidence in both the correctness and the correct evolution of a piece of code, but out-of-distribution and counterfactual yet realistic data for model evaluation can be notoriously difficult to produce.

In the same way that putting together multiple programs in a system necessitates integration testing which is a pillar of software engineering, putting together code and data necessitates end-to-end model validation and understanding which is a pillar of ML engineering.

TFX’s Evaluator and InfraValidator components provide validation and understanding of models, treating them as first class citizens of ML engineering.

Mergeable Fragments

Similarly to how a software engineer merges together pre-existing libraries (or systems) with their code in order to build useful programs, an ML engineer merges together code fragments, data fragments, analysis fragments and model fragments on a regular basis in order to build useful ML pipelines. A notable difference between software engineering and ML engineering is that even when the code is fixed for the latter, data is usually volatile for it (e.g. new data arrives on a regular basis) and as such the downstream artifacts need to be produced frequently and efficiently. For example, a new version of a model usually needs to be produced if any part of its input data has changed. As such, it is important for ML pipelines to produce artifacts that are mergeable. For example, a summary of statistics from one dataset should be easily mergeable with that of another dataset such that it is easy to summarize the statistics of the union of the two datasets. Similarly, it should be easy to transfer the learnings of one model to another model in general, and the learnings of a previous version of a model to the next version of the same model in particular.

There is however a catch, which relates to the previous discussion regarding the equivalents of test coverage for models. Merging new fragments into a model could necessitate creation of novel out-of-distribution and counterfactual evaluation data, contributing to the difficulty of (efficient) model evolution, thus rendering it a lot harder than pure code evolution.

TFX’s ExampleGen, Transform, Trainer and Tuner components, together with TensorFlow Hub, help one treat artifacts as first class citizens by enabling production and consumption of mergeable fragments in workflows that perform data caching, analyzer caching, warmstarting and transfer learning.

Artifact Lineage

Despite all the advanced methodology and tooling that exists for software engineering, the programs and systems that are built invariably need to be debugged. The same holds for ML programs, but debugging them is notoriously harder because non-proximal effects are a lot more prevalent for ML programs due to the plethora of artifacts involved. A model might be inaccurate due to bad artifacts from several sources of error, including flaws in the code, the learning algorithm, the training data, the serving path, or the serving data, to name a few. Much like how stack traces are invaluable for identifying root causes of defects in software programs, the lineage of all artifacts produced and consumed by an ML pipeline is invaluable for identifying root causes of defects in ML models. Additionally, by knowing which downstream artifacts were produced from a problematic artifact, we can identify all impacted systems and users and take mitigating actions.

TFX’s use of ML Metadata (MLMD) helps treat artifacts as first class citizens. MLMD enables advanced cataloging and querying of metadata and lineage associated with artifacts which can together increase the confidence of sharing artifacts even outside the boundaries of a pipeline. MLMD also helps with advanced debugging and, when coupled with the underlying data storage layer, forms the foundation of TFX’s ML compliance mechanisms.

Continuous Learning And Unlearning

ML production pipelines operate in a dynamic environment:

  • New data can arrive continuously.
  • The modeling code can change, particularly in the early stages of model development.
  • The surrounding infrastructure can change, e.g., a new version of some underlying (ML) library.

When changes happen, a pipeline needs to react, often by rerunning its steps in the new environment. This dynamicity increases the importance of provenance tracking in order to facilitate debugging and root-cause analysis. As a simple example, to debug a model failure, it is necessary to know not only which data was used to train the model, but also the versions of the modeling code and any surrounding infrastructure.

ML pipelines must also support low-friction mechanisms to handle these changes. Consider for example the arrival of new data, which necessitates retraining the model. This is a natural requirement in rapidly changing environments, like recommender systems or adversarial systems. Requiring the user to manually retrain the model can be unrealistic, given that the data can arrive at a regular and frequent rate. Instead, we can employ automation by way of “continuous training”, where the pipeline detects the presence of new data and automatically schedules the generation of updated models. In turn, this functionality requires automatically: orchestrating work based on the presence of artifacts (including data), recovering from intermittent failures, and catching up to real-time when recovering. It is common for ML pipelines to run for years ingesting code and data, continuously producing models that make predictions that inform decisions.

Another example of a low-friction mechanism is support for “backfilling” an ML pipeline. In this case, the user might need to rerun the pipeline on existing artifacts but using updated versions of the components, such as rerunning the trainer on existing data using a new version of the modeling code/library. Another use of backfilling is rerunning the pipeline with new versions of existing data, say, to fix an error in the data. These backfills are orthogonal to continuous training and can be used together. For instance, the user can manually trigger a rerun of the trainer, and the generated model artifact can then automatically trigger model evaluation and validation.

TFX was built from the ground up in a way that enables continuous learning (and unlearning) which fundamentally shaped its design. At the same time, these advanced capabilities also allow it to be used in a “one-shot”, discontinuous, fashion. In fact, within Alphabet, both modes of deployment are widely used. Moreover, TFX also supports different types of backfill operations to enable fine-grained interventions during normal pipeline execution.

Even though the public TFX offering doesn’t yet offer continuous ML pipelines, we are actively working on making our existing technology portable so that it can be made publicly available (e.g RFC).


Building On The Shoulders Of Giants

Realizing ambitious goals necessitates building on top of solid foundations, collaborating with others and leveraging each other’s work. TFX reuses many of Sibyl’s system designs, hardened over a decade of Sibyl’s production ML experience. Additionally, TFX incorporates new technologies in areas where robust standards emerged:

  • Similarly to how Sibyl built its algorithms and workflows on top of MapReduce, TFX leverages both TensorFlow and Apache Beam for its distributed training and data processing workflows.
  • Similarly to how Sibyl was columnar, TFX adopted Apache Arrow as the columnar in-memory representation for its compute-intensive libraries.

Taking dependencies where robust standards have emerged has allowed TFX and its users to achieve seamless performance and scalability. It also enables TFX to focus its energy on building the deltas of what is needed for applied ML, as opposed to re-implementing difficult-to-get-right technology. Some of our dependencies, like Kubeflow Pipelines or Apache Airflow, are selected by TFX’s users themselves when the value / features they get from them outweigh the costs that the additional dependencies entail.

Taking dependencies unfortunately incurs costs. We have found that taking dependencies requires effort that is super-linear to the number of dependencies. Said costs are often absorbed by us and our sister teams but can (and sometimes do) leak to our users, usually in the form of conflicting (version) dependencies or incompatibilities between environments and dependencies.

Interoperability And Positive Externalities

ML platforms do not operate in a vacuum. They instead operate within the context of a bigger system or infrastructure, connecting to data producing sources upstream and model consuming sinks downstream, which in turn frequently produce the data that feeds the ML platform, thereby closing the loop. Strong adoption of a platform usually necessitates interoperability with other important technologies in its environment.

  • Similarly to how Sibyl interoperated with Google’s Ads technology stack for data ingestion and model serving, TFX offers a plethora of connectors for data ingestion and allows serving the produced model in multiple deployment environments and devices.
  • Similarly to how Sibyl interoperated with Google’s compute stack, TFX leverages Apache Beam to execute on Apache Flink and Apache Spark clusters as well as serverless offerings like Google Cloud Dataflow.
  • TFX built an orchestration abstraction on top of MLMD and provides orchestration options on top of Apache Airflow, Apache Beam, Kubeflow Pipelines as well as the primitives to integrate with one’s custom orchestrator. MLMD itself works with several relational databases like SQLite and MySQL.

Interoperability necessitates some amount of abstraction and standardization and usually enables sum-greater-than-its-parts effects. TFX is both a beneficiary and a benefactor of the positive externalities created by said interoperability, both within and outside of Alphabet. TFX’s users are also beneficiaries of the interoperability as they can more easily deploy and use TFX on top of their existing installed base.

Interoperability also comes with costs. The combination of multiple technology stacks can lead to an exponential number of distinct deployment configurations. While we test some of the distinct deployment configurations end-to-end and at-scale, like for example TFX on GCP, we have neither the expertise nor the resources to do so for the combinatorial explosion of all possible deployment options. We thus encourage the community to work with us on the deployment configurations that are most useful for them.

What Is Different And Why

The areas discussed in this section capture a few examples of things that needed to change in order for our ML platform to adapt to a new reality and as such remain useful and impactful.

Environment And Device Portability

Sibyl was a massive scale ML platform designed to be deployed on Google’s large-scale cluster, namely Borg. This made sense as applied ML at Google was, originally, primarily used in products that were widely used. As ML expertise grew across the world, and ML could be applied to more use cases (large and small) across environments both within and outside of Google, the need for portability gradually but surely became a hard constraint.

  • While Sibyl ran only on Google’s datacenters, TFX runs on laptops, workstations, servers, datacenters, and public Clouds. In particular, when TFX runs on Google’s Cloud, it leverages automation and optimizations offered by GCP Services, enabled by Google’s unique infrastructure.
  • While Sibyl ran only on CPUs, TFX leverages TensorFlow to run on different kinds of hardware including CPUs, GPUs and Google’s TPUs.
  • While Sibyl’s models ran on servers, TFX leverages TensorFlow to produce models that run on laptops, workstations, and servers via TensorFlow Serving and Apache Beam, on mobile and IoT devices via TensorFlow Lite, and on browsers via TensorFlow JS.

TFX’s portability enabled it to be used in a very diverse set of environments and devices, in order to solve problems from small scale to massive scale.

Unfortunately, portability comes with costs. We have found that maintaining a portable core with environment-specific and device-specific specialization requires effort that is super-linear to the number of environments / devices. Said costs are however largely absorbed by us and our sister teams and as such are frequently not visible to our users.

Modularity And Layering

Even though Sibyl’s offering as an integrated product was immensely valuable, its structure and interface were somewhat monolithic, limiting it to a specific set of “direct” users who would have to adopt it wholesale. In contrast, TFX evolved to be a modular and layered architecture, and became more so over time as partnerships with other teams and products grew. Notable layers (with examples) in TFX include:

Layer Examples
ML Services

(of composable Components)


TFX’s layered architecture enables it to be used by a very diverse set of users whether that’s piecemeal via its libraries, wholesale via its pipelines (with or without the pertinent services), or in a fashion that’s completely oblivious to the end users (e.g. by them using ML services which TFX powers under the hood)!

Unfortunately, layering comes with costs. We have found that maintaining multiple publicly accessible layers of our product requires effort that is roughly linear to the number of layers. Said costs occasionally leak to our users in the form of confusion regarding what layer makes the most sense for them to use.

Multi-faceted Flexibility

Even though Sibyl was more flexible in terms of modeling capabilities compared to available alternatives at the time, aspects of its flexibility across several parts of the ML workflow fell short of Google’s needs for accelerating ML for novel use cases, which led to the development of TFX.

  • While Sibyl only offered specific kinds of data analysis, TFX’s StatisticGen component offers more built-in capabilities and the ability to realize custom analyses, via TensorFlow Data Validation.
  • While Sibyl only offered transformations that were pure composable mappers, TFX’s Transform component offers more mappers, custom mappers, more analyzers, custom analyzers, as well as arbitrarily composed (custom) mappers and (custom) analyzers, via TensorFlow Transform.
  • While Sibyl only offered “wide” models, TFX’s Trainer component offers any model that can be realized on top of TensorFlow, including models that can be shared and can transfer-learn, via TensorFlow Hub.
  • While Sibyl only offered automatic feature crossing (a.k.a. feature conjunctions) on top of “wide” models, TFX’s Tuner component allows for arbitrary hyper parameter optimization based on state of the art algorithms.
  • While Sibyl only offered specific kinds of model analysis, TFX’s Evaluator component offers more built-in metrics, custom metrics, confidence intervals and fairness indicators, via TensorFlow Model Analysis.
  • While Sibyl’s pipeline topology was fixed (albeit somewhat customizable), TFX’s SDK allows one to create custom (optionally containerized) components and use them together with standard components in a flexible and fully customizable pipeline topology.

The increase of flexibility in all these dimensions enabled improved experimentation, wider reach, more use cases, as well as accelerated flow from research to production.

Flexibility does not come without costs. A more flexible system is one that is harder to get right in the first place as well as harder for us to maintain and to evolve as developers of the ML platform. Users may also have to manage increased complexity as they take advantage of this flexibility. Furthermore, we might not be able to offer as strong of a support story on top of an ML platform that is Turing complete.

Where We Are Going

Armed with the knowledge of the past, we present a glimpse of what we plan for the future of TFX, as of 2020. We will continue our work on enabling ML Engineering in order to democratize applied ML, and help everyone practice responsible AI and apply it in a fashion that upholds Google’s AI Principles.

Drive Interoperability And Standards

In order to meet the demand for the burgeoning variety of ML solutions, we will continue to increase our technology’s interoperability. Our work on interoperability and standards as well as open-sourcing more of our technology, reflects our principle to “be socially beneficial” as well as to “be made available for uses that accord with these principles” by making it easier for everyone to follow these practices. As part of this mission, we will empower the industry to build advanced ML systems by open-sourcing more of our technology, and by standardizing ML artifacts and metadata. Some select examples of this work include:

  • TFX Standardized Inputs.
  • Advanced TFX DSL semantics, Data Model and IR.
  • Standardization of ML artifacts and metadata.
  • Standardization of distributed workloads on heterogeneous runtime environments.
  • Inference on distributed and streaming models.
  • Improvements to interoperability with mobile and edge ML deployments.
  • Improvements for ML framework interoperability and artifact sharing.

Increase Automation

Automation is the backbone of reliable production systems, and TFX is heavily invested in improving and expanding its use of automation. Our work in increased automation reflects our principles of helping make ML deployments “be built and tested for safety” and “avoid creating or reinforcing unfair bias”. Some upcoming efforts include a TFX Pipeline testing framework, automated model improvement in the TFX Tuner, auto-detecting surprising model behavior on multidimensional slices, facilitating automatic production of Model Cards and improving our training-serving skew detection capabilities. TFX on GCP will also continue driving requirements for new (and will better make use of existing) advanced automation features of pertinent services.

Improve ML Understanding

ML understanding is an important aspect of deploying production ML, and TFX is well positioned to provide significant gains in this field. Our work on improving ML understanding reflects our principles to help “avoid creating or reinforcing unfair bias” and help make ML deployments “be accountable to people”. Critical to understanding is to be able to track the lineage of artifacts used to produce a model, an area TFX will continue to invest in. Improvements to TFX technologies like struct2tensor will further enable training, serving, and analyzing models on structured data, thus allowing reasoning about models closer to the original data semantics. We also plan to utilize TFX as a vehicle to expand support for fairness evaluation, remediation, and documentation.

Uphold High Standards And Best Practices

As a vehicle for amplification of ML technology, TFX must continue to “uphold high standards of scientific excellence” and promote best practices. The team will continue publishing scientific papers and conducting public outreach via our existing channels, as well as offer educational courses in partnership with established institutions. We will also improve trust in our model analysis tools using integrated uncertainty measures by, for example, enabling scalable computation of confidence intervals for model metrics, and we will improve our training-serving skew detection capabilities. It’s also critical for research and production to be able to have reproducible ML artifacts, enabled by our work in precise provenance tracking for auditing and reproducing models. Also key is reproducibility of measurements, driven by efforts like NitroML, which will provide tooling for benchmarking AutoML pipelines.

Given that several of the areas where we expand our technology are new to us, we will make an effort to distinguish the battle-tested from the experimental aspects of our technology, in order to enable our users to confidently choose the set of capabilities that meet their desires and needs.

Improve Tooling

Despite TFX providing tools for aspects of ML engineering and several phases of the ML lifecycle, we believe this is still a nascent area. While improving tooling is a natural fit for TFX, it also reflects our principle of helping ML deployments “be made available for uses that accord with these principles”, “supporting scientific excellence,” and being “built and tested for safety” .

One area of improvement is applying ML to the data itself, be it through sensing anomalies or finding patterns in data or enriching data with predictions from ML models. Making it easy to enrich large volumes of data (especially critical streaming data used for low-latency, high volume actions) has always been a challenge. Bringing TFX capabilities into data processing frameworks is our first step here. We have already made it possible to enrich streaming events with labels or make predictions in Apache Beam and, by extension, Cloud Dataflow. We plan to follow this work by leveraging pre-built models (served out of Cloud AI Pipelines and TensorFlow Serving) to make adding a new field in a distributed dataset representing predictions from streams of models trivially easy.

Furthermore, while there are many tools for detecting, discovering, and auditing ML workflows, there is still a need for automated (or assisted) mitigation of discovered issues, and we will invest in this area. For example, proactively predicting which pipeline runs won’t result in better models based on the currently-executing pipeline, perhaps even before training, can significantly reduce time and resources spent on creating poor models.

A Joint Journey

Building TFX and exploring the fundamentals of ML engineering was the cumulative effort of many people over many years. As we continue to make strides and further develop this field, it’s important we recognize the collaborative effort of those who got us here.

Of course, it will take many more collaborations to drive the future of this field, and as such, we invite you to join us on this journey “Towards ML Engineering”!

The TFX Team

The TFX project is realized via collaboration of multiple organizations within Google. Different organizations usually focus on different technology and product layers, though there is a lot of overlap on the portable parts of our technology. Overall we consider ourselves a single team and below we present an alphabetically sorted list of current TFX team members who are contributors to the ideation, research, design, implementation, execution, deployment, management, and advocacy (to name a few) aspects of TFX; they continue to inspire, help, teach, and challenge each other to advance our field:

Abhijit Karmarkar, Adam Wood, Aleksandr Zaks, Alina Shinkarsky, Neoklis Polyzotis, Amy Jang, Amy McDonald Sandjideh, Amy Skerry-Ryan, Andrew Audibert, Andrew Brown, Andy Lou, Anh Tuan Nguyen, Anirudh Sriram, Anna Ukhanova, Anusha Ramesh, Archana Jain, Arun Venkatesan, Ashley Oldacre, Baishun Wu, Ben Mathes, Billy Lamberta, Chandni Shah, Chansoo Lee, Chao Xie, Charles Chen, Chi Chen, Chloe Chao, Christer Leusner, Christina Greer, Christina Sorokin, Chuan Yu Foo, CK Luk, Connie Huang, Daisy Wong, David Smalling, David Zats, Dayeong Lee, Dhruvesh Talati, Doojin Park, Elias Moradi, Emily Caveness, Eric Johnson, Evan Rosen, Florian Feldhaus, Gal Oshri, Gautam Vasudevan, Gene Huang, Goutham Bhat, Guanxin Qiao, Gus Katsiapis, Gus Martins, Haiming Bao, Huanming Fang, Hui Miao, Hyeonji Lee, Ian Nappier, Ihor Indyk, Irene Giannoumis, Jae Chung, Jan Pfeifer, Jarek Wilkiewicz, Jason Mayes, Jay Shi, Jiayi Zhao, Jingyu Shao, Jiri Simsa, Jiyong Jung, Joana Carrasqueira, Jocelyn Becker, Joe Liedtke, Jongbin Park, Jordan Grimstad, Josh Gordon, Josh Yellin, Jungshik Jang, Juram Park, Justin Hong, Karmel Allison, Kemal El Moujahid, Kenneth Yang, Khanh LeViet, Kostik Shtoyk, Lance Strait, Laurence Moroney, Li Lao, Liam Crawford, Magnus Hyttsten, Makoto Uchida, Manasi Joshi, Mani Varadarajan, Marcus Chang, Mark Daoust, Martin Wicke, Megha Malpani, Mehadi Hassen, Melissa Tang, Mia Roh, Mig Gerard, Mike Dreves, Mike Liang, Mingming Liu, Mingsheng Hong, Mitch Trott, Muyang Yu, Naveen Kumar, Ning Niu, Noah Hadfield-Menell, Noé Lutz, Nomi Felidae, Olga Wichrowska, Paige Bailey, Paul Suganthan, Pavel Dournov, Pedram Pejman, Peter Brandt, Priya Gupta, Quentin de Laroussilhe, Rachel Lim, Rajagopal Ananthanarayanan, Rene van de Veerdonk, Robert Crowe, Romina Datta, Ron Yang, Rose Liu, Ruoyu Liu, Sagi Perel, Sai Ganesh Bandiatmakuri, Sandeep Gupta, Sanjana Woonna, Sanjay Kumar Chotakur, Sarah Sirajuddin, Sheryl Luo, Shivam Jindal, Shohini Ghosh, Sina Chavoshi, Sydney Lin, Tanya Grunina, Thea Lamkin, Tianhao Qiu, Tim Davis, Tris Warkentin, Varshaa Naganathan, Vilobh Meshram, Volodya Shtenovych, Wei Wei, Wolff Dobson, Woohyun Han, Xiaodan Song, Yash Katariya, Yifan Mai, Yiming Zhang, Yuewei Na, Zhitao Li, Zhuo Peng, Zhuoshu Li, Ziqi Huang, Zoey Sun, Zohar Yahav

Thank you, all!

The TFX Team … Extended

Beyond the current TFX team members, there have been many collaborators both within and outside of Alphabet whose discussions, technology, as well as direct and indirect contributions, have materially influenced our journey. Below we present an alphabetically sorted list of these collaborators:

Abdulrahman Salem, Ahmet Altay, Ajay Gopinathan‎, Alexandre Passos, Alexey Volkov, Anand Iyer, Andrew Bernard‎, Andrew Pritchard‎, Chary Aasuri, Chenkai Kuang, Chenyu Zhao, Chiu Yuen Koo, Chris Harris, Chris Olston, Christine Robson, Clemens Mewald, Corinna Cortes, Craig Chambers, Cyril Bortolato, D. Sculley, Daniel Duckworth‎, Daniel Golovin, David Soergel, Denis Baylor, Derek Murray, Devi Krishna, Ed Chi, Fangwei Li, Farhana Bandukwala, Gal Elidan, Gary Holt, George Roumpos, Glen Anderson, Greg Steuck, Grzegorz Czajkowski, Haakan Younes, Heng-Tze Cheng, Hossein Attar, Hubert Pham, Hussein Mehanna, Irene Cai, James L. Pine, James Pine, James Wu, Jeffrey Hetherly, Jelena Pjesivac-Grbovic, Jeremiah Harmsen, Jessie Zhu, Jiaxiao Zheng, Joe Lee, Jordan Soyke, Josh Cai, Judah Jacobson, Kaan Ege Ozgun‎, Kenny Song, Kester Tong, Kevin Haas, Kevin Serafini, Kiril Gorovoy, Kostik Steuck, Kristen LeFevre, Kyle Weaver, Kym Hines, Lana Webb, Lichan Hong, Lukasz Lew, Mark Omernick, Martin Zinkevich, Matthieu Monsch, Michel Adar, Michelle Tsai‎, Mike Gunter, Ming Zhong, Mohamed Hammad, Mona Attariyan, Mustafa Ispir, Neda Mirian, Nicholas Edelman‎, Noah Fiedel, Panagiotis Voulgaris‎, Paul Yang, Peter Dolan, Pushkar Joshi‎, Rajat Monga, Raz Mathias‎, Reiner Pope, Rezsa Farahani, Robert Bradshaw, Roberto Bayardo, Rohan Khot, Salem Haykal, Sam McVeety, Sammy Leong, Samuel Ieong, Shahar Jamshy, Slaven Bilac, Sol Ma, Stan Jedrus, Steffen Rendle, Steven Hemingray‎, Steven Ross, Steven Whang, Sudip Roy, Sukriti Ramesh, Susan Shannon, Tal Shaked, Tushar Chandra, Tyler Akidau, Venkat Basker, Vic Liu, Vinu Rajashekhar, Xin Zhang, Yan Zhu‎, Yaxin Liu, Younghee Kwon, Yury Bychenkov‎, Zhenyu Tan

Thank you, all!

Read More

Bringing the Mona Lisa Effect to Life with TensorFlow.js

Bringing the Mona Lisa Effect to Life with TensorFlow.js

A guest post by Emily Xie, Software Engineer


Urban legend says that Mona Lisa’s eyes will follow you as you move around the room. This is known as the “Mona Lisa effect.” For fun, I recently programmed an interactive digital portrait that brings this phenomenon to life through your browser and webcam.

At its core, the project leverages TensorFlow.js, deep learning, and some image processing techniques. The general idea is as follows: first, we must generate a sequence of images of Mona Lisa’s head, with eyes gazing from the left to right. From this pool, we’ll continuously select and display a single frame in real-time based on the viewer’s location.

In this post, I’ll walk through the specifics of the project’s technical design and implementation.

Animating the Mona Lisa with Deep Learning

Image animation is a technique that allows one to puppeteer a still image through a driving video. Using a deep-learning-based approach, I was able to generate a highly convincing animation of Mona Lisa’s gaze.

Specifically, I used the First Order Motion Model (FOMM), released by Aliaksandr Siarohin et al. in 2019. At a very high level, this method is composed of two modules: one for motion extraction, and another for image generation. The motion module detects keypoints and local affine transformations from the driving video. Diffs of these values between consecutive frames are then used as input to a network that predicts a dense motion field, along with an occlusion mask which specifies the image regions that either need to be modified or contextually inferred. The image generation network, then, detects facial landmarks and produces the final output––the source image, warped and in-painted according to the results of the motion module.

I chose FOMM in particular because of its ease of use. Prior models in this domain had been “object-specific”, meaning that they required detailed data of the object to be animated, whereas FOMM operated agnostically to this. More importantly, the authors released an open-source, out-of-the-box implementation with pre-trained weights for facial animation. Because of this, applying the model to the Mona Lisa became a surprisingly straightforward endeavor: I simply cloned the repo into a Colab notebook, produced a short driving video of me with my eyes moving around, and fed it through the model along with a screenshot of La Gioconda’s head. The resulting movie was stellar. From this, I ultimately sampled just 33 images to constitute the final animation.

Example of a driving video and the image animation predictions generated by FOMM.
A subsample of the final animation frames, produced using the First Order Motion Model.

Image Blending

While I could’ve re-trained the model for my project’s purposes, I decided to work within the constraints of Siarohin’s weights in order to avoid the time and computational resources that would’ve been otherwise required. This, however, meant that the resulting frames were fixed at a lower resolution than desired, and consisted of just the subject’s head. But since I wanted the final visual to include the entirety of Mona Lisa––hands, torso, and background included––my plan was to simply superimpose the output head frames onto an image of the painting.

Mona Lisa
An example of a head frame overlaid on top of the underlying image. To best illustrate the problem, the version shown here is from an earlier iteration of the project where there was further resolution loss in the head frame.

This, however, produced its own set of challenges. If you look at the example above, you’ll notice that the lower-resolution output of the model––coupled with some subtle collateral background changes due to FOMM’s warping procedure––causes the head frame to visually jut out. In other words, it was a bit obvious that this was just a picture on top of another picture. To address this, I did some image processing in Python to “blend” the head image into the underlying one.

First, I resized the head frame to its original resolution. From there, I created a new frame using a weighted average of these blurred out pixels and the corresponding pixels in the underlying image, where the weight––or alpha––of a pixel in the head frame decreases as it moves away from the midpoint.

The function to determine alpha was adapted from a 2D sigmoid, and is expressed as:

Where j determines the logistic function’s slope, k is the inflection point, and m is the midpoint of the input values. Graphed out, the function looks like:

Function graph

After I applied the above procedure to all 33 frames in the animation set, the resulting superimpositions each appeared to be a single image to the unsuspecting eye:

Tracking the Viewer’s Head via BlazeFace

All that was left at this point was to determine how to track the user via the webcam and display the corresponding frame.

Naturally, I turned to TensorFlow.js for the job. The library offered a fairly robust set of models to detect the presence of a human given visual input, but after some research and thinking, I landed on BlazeFace as my method of choice.

BlazeFace is a deep-learning based object recognition model that detects human faces and facial landmarks. It is specifically trained for using mobile camera input. This worked well for my use case, as I expected most viewers to be using their webcam in a similar manner––with their heads in frame, front-facing, and fairly close to the camera––whether through their mobile devices or on their laptops.

My foremost consideration in selecting this model, however, was its extraordinary speed of detection. To make this project convincing, I needed to be able to run the entire animation in real time, including the facial recognition step. BlazeFace adapts the single-shot detection (SSD) model, a deep-learning-based object detection algorithm that simultaneously proposes bounding boxes and detects objects in just one forward pass of the network. BlazeFace’s lightweight detector is capable of recognizing facial landmarks at speeds as fast as 200 frames per second.

A demo of what BlazeFace can capture given an input image: bounding boxes for a human head, along with facial landmarks.

Having settled on the model, I then wrote code to continually pipe the user’s webcam data into BlazeFace. On each run, the model outputted an array of facial landmarks and their corresponding 2D coordinate positions. Using this, I approximated the X coordinate of the face’s center by calculating the midpoint between the eyes.

Finally, I mapped this result to an integer between 0 and 32. These values, as you may recall, each represented a frame in the animated sequence––with 0 depicting Mona Lisa with her eyes to the left, and 32 with her eyes to the right. From there, it was just a matter of displaying the frame on the screen.

Try it out!

You can play with the project at To follow more of my work, feel free to check out my personal website, Github, or Twitter.


Thanks to Andrew Fu for reading this post and providing me feedback, to Nick Platt for lending his ear and thoughts on a frontend bug, and to Jason Mayes along with the rest of the team at Google for their work in reaching out and amplifying this project.

Read More

Introducing TensorFlow Recommenders

Introducing TensorFlow Recommenders

Posted by Maciej Kula and James Chen, Google Brain

From recommending movies or restaurants to coordinating fashion accessories and highlighting blog posts and news articles, recommender systems are an important application of machine learning, surfacing new discoveries and helping users find what they love.

At Google, we have spent the last several years exploring new deep learning techniques to provide better recommendations through multi-task learning, reinforcement learning, better user representations and fairness objectives. These and other advancements have allowed us to greatly improve our recommendations.

Today, we’re excited to introduce TensorFlow Recommenders (TFRS), an open-source TensorFlow package that makes building, evaluating, and serving sophisticated recommender models easy.

Built with TensorFlow 2.x, TFRS makes it possible to:

TFRS is based on TensorFlow 2.x and Keras, making it instantly familiar and user-friendly. It is modular by design (so that you can easily customize individual layers and metrics), but still forms a cohesive whole (so that the individual components work well together). Throughout the design of TFRS, we’ve emphasized flexibility and ease-of-use: default settings should be sensible; common tasks should be intuitive and straightforward to implement; more complex or custom recommendation tasks should be possible.

TensorFlow Recommenders is open-source and available on Github. Our goal is to make it an evolving platform, flexible enough for conducting academic research and highly scalable for building web-scale recommender systems. We also plan to expand its capabilities for multi-task learning, feature cross modeling, self-supervised learning, and state-of-the-art efficient approximate nearest neighbours computation.

Example: building a movie recommender

To get a feel for how to use TensorFlow Recommenders, let’s start with a simple example. First, install TFRS using pip:

!pip install tensorflow_recommenders

We can then use the MovieLens dataset to train a simple model for movie recommendations. This dataset contains information on what movies a user watched, and what ratings users gave to the movies they watched.

We will use this dataset to build a model to predict which movies a user watched, and which they didn’t. A common and effective pattern for this sort of task is the so-called two-tower model: a neural network with two sub-models that learn representations for queries and candidates separately. The score of a given query-candidate pair is simply the dot product of the outputs of these two towers.

This model architecture is quite flexible. The inputs can be anything: user ids, search queries, or timestamps on the query side; movie titles, descriptions, synopses, lists of starring actors on the candidate side.

In this example, we’re going to keep things simple and stick to user ids for the query tower, and movie titles for the candidate tower.

To start with, let’s prepare our data. The data is available in TensorFlow Datasets.

import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movie_lens/100k-movies", split="train")

Out of all the features available in the dataset, the most useful are user ids and movie titles. While TFRS can use arbitrarily rich features, let’s only use those to keep things simple.

ratings = x: {
"movie_title": x["movie_title"],
"user_id": x["user_id"],
movies = x: x["movie_title"])

When using only user ids and movie titles our simple two-tower model is very similar to a typical matrix factorization model. To build it, we’re going to need the following:

  • A user tower that turns user ids into user embeddings (high-dimensional vector representations).
  • A movie tower that turns movie titles into movie embeddings.
  • A loss that maximizes the predicted user-movie affinity for watches we observed, and minimizes it for watches that did not happen.

TFRS and Keras provide a lot of the building blocks to make this happen. We can start with creating a model class. In the __init__ method, we set up some hyper-parameters as well as the primary components of the model.

class TwoTowerMovielensModel(tfrs.Model):
"""MovieLens prediction model."""

def __init__(self):
# The `__init__` method sets up the model architecture.

# How large the representation vectors are for inputs: larger vectors make
# for a more expressive model but may cause over-fitting.
embedding_dim = 32
num_unique_users = 1000
num_unique_movies = 1700
eval_batch_size = 128

The first major component is the user model: a set of layers that describe how raw user features should be transformed into numerical user representations. Here, we use the Keras preprocessing layers to turn user ids into integer indices, then map those into learned embedding vectors:

 # Set up user and movie representations.
self.user_model = tf.keras.Sequential([
# We first turn the raw user ids into contiguous integers by looking them
# up in a vocabulary.
# We then map the result into embedding vectors.
tf.keras.layers.Embedding(num_unique_users, embedding_dim)

The movie model looks similar, translating movie titles into embeddings:

self.movie_model = tf.keras.Sequential([
tf.keras.layers.Embedding(num_unique_movies, embedding_dim)

Once we have both user and movie models we need to define our objective and its evaluation metrics. In TFRS, we can do this via the Retrieval task (using the in-batch softmax loss):

# The `Task` objects has two purposes: (1) it computes the loss and (2)
# keeps track of metrics.
self.task = tfrs.tasks.Retrieval(
# In this case, our metrics are top-k metrics: given a user and a known
# watched movie, how highly would the model rank the true movie out of
# all possible movies?

We use the compute_loss method to describe how the model should be trained.

def compute_loss(self, features, training=False):
# The `compute_loss` method determines how loss is computed.

# Compute user and item embeddings.
user_embeddings = self.user_model(features["user_id"])
movie_embeddings = self.movie_model(features["movie_title"])

# Pass them into the task to get the resulting loss. The lower the loss is, the
# better the model is at telling apart true watches from watches that did
# not happen in the training data.
return self.task(user_embeddings, movie_embeddings)

We can fit this model using standard Keras fit calls:

model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1)), verbose=False)

To sanity-check the model’s recommendations we can use the TFRS BruteForce layer. The BruteForce layer is indexed with precomputed representations of candidates, and allows us to retrieve top movies in response to a query by computing the query-candidate score for all possible candidates:

index = tfrs.layers.ann.BruteForce(model.user_model)
index.index(movies.batch(100).map(model.movie_model), movies)

# Get recommendations.
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Of course, the BruteForce layer is only suitable for very small datasets. See our full tutorial for an example of using TFRS with Annoy, an approximate nearest neighbours library.

We hope this gave you a taste of what TensorFlow Recommenders offers. To learn more, check out our tutorials or the API reference. If you’d like to get involved in shaping the future of TensorFlow recommender systems, consider contributing! We will also shortly be announcing a TensorFlow Recommendations Special Interest Group, welcoming collaboration and contributions on topics such as embedding learning and distributed training and serving. Stay tuned!


TensorFlow Recommenders is the result of a joint effort of many folks at Google and beyond. We’d like to thank Tiansheng Yao, Xinyang Yi, Ji Yang for their core contributions to the library, and Lichan Hong and Ed Chi for their leadership and guidance. We are also grateful to Zhe Zhao, Derek Cheng, Sagar Jain, Alexandre Passos, Francois Chollet, Sandeep Gupta, Eric Ni, and many, many others for their suggestions and support of this project.Read More

What’s new in TensorFlow Lite for NLP

What’s new in TensorFlow Lite for NLP

Posted by Tian Lin, Yicheng Fan, Jaesung Chung and Chen Cen

TensorFlow Lite has been widely adopted in many applications to provide machine learning features on edge devices such as mobile phones, microcontroller units, and Edge TPUs. Among all popular applications that make people’s life easier and more productive, Natural Language Understanding is one of the key areas that attracts much attention from both the research community and the industry. After the demo of the on-device question-answering use case at TensorFlow World in 2019, we got a lot of interest and feedback from the community on making more such NLP models available for on-device inference.

Inspired by that feedback, today we are delighted to announce an end-to-end support for NLP tasks based on TensorFlow Lite. With this infrastructure work, more and more NLP models are able to run on mobile phones, and users can enjoy the advantage of NLP models, while keeping their personal data on-device. In this blog, we will introduce the new features that allow: (1) Using new pre-trained NLP Models, (2) Creating your own NLP models, (3) Better support for converting TensorFlow NLP Models to TensorFlow Lite format and (4) Deploying these models on mobile devices.

Using new pre-trained NLP models

Reference apps

Reference apps are a set of open-source mobile applications that encapsulate pretrained machine learning models, inference code and runnable demos. We provide a series of NLP reference apps that are integrated with Android Studio and XCode, so developers can build with just one click and deploy on Android or iOS phones.

Using the NLP reference apps below, mobile developers can learn the end to end flow of integrating existing NLP models (powered by BERT, MobileBERT or ALBERT), transforming raw text data, and connecting the model’s inputs and outputs to generate prediction results,

  • Text classification: The model predicts labels based on given text data.
  • Question answering app: Given an article and a user question, the model can answer the question within the article.
  • Smart reply app: Given previous context, the model can predict and generate potential auto replies.

The pretrained models used in the above reference apps are available in TensorFlow Hub. The chart below shows a comparison of the latency, size and F1 score between the models.

Benchmark on Pixel 4 CPU, 4 Threads, March 2020
Model hyper parameters: Sequence length 128, Vocab size 30K

Optimizing NLP Models for on-device use cases

On-device models have different constraints compared to server-side models. They run on devices with less memories and slower chips and hence need to be optimized for size and inference speed.Here are several examples of how we optimize models for NLP tasks.

Quantized MobileBERT

MobileBERT is a compact BERT model open sourced on GitHub. It is 4.3x smaller & 5.5x faster than BERT base (float32) while achieving comparable results on GLUE and SQuAD datasets.
After the initial release, we further improved the model by using quantization to optimize its model size and performance, so that it can utilize accelerators like GPU/DSP if available. The quantized MobileBERT is 16x smaller & 8x faster than the BERT base, with little accuracy loss. The MLPerf for Mobile community is leveraging the quantized MobileBERT model for mobile inference benchmarking, and the model can also run in Chrome using TensorFlow.js.
Compared with the original BERT base model (416MB), the below table shows the performance of quantized MobileBERT under the same setting.

Embedding-free NLP models with projection methods

Language Identification is a type of problems to classify the language of a given text. Recently we open source two models using projection methods, namely SGNN and PRADO.
We used SGNN to show how easy and efficient to use Tensorflow Lite for NLP Tasks. SGNN projects texts to fixed-length features followed by fully connected layers. With annotations to tell TensorFlow Lite converter to fuse TF.Text API, we can get a more efficient model for inference on TensorFlow Lite. Previously, the model took 1332.87 μs to run on benchmark; and after fusion, we see 64.06 μs on the same machine. This brings the magic of 20x speed-up.
We also demonstrate a model architecture called PRADO. PRADO first computes trainable projected features from the sequence of word tokens, then applies convolution and attention to map features to a fixed-length encoding. By combining a projection layer, a convolutional and attention encoder mechanism, PRADO achieves similar accuracy as LSTM, but with 100x smaller model size.
The idea behind these models is to use projection to compute features from texts, so that the model does not need to maintain a big embedding table to convert text features to embeddings. In this way, we’ve proven the model will be much smaller than embedding based models, while maintaining similar performance and inference latency.

Creating your own NLP Models

In addition to using pre-trained models, TensorFlow Lite also provides you with tools such as Model Maker to customize existing models for your own data.

TensorFlow Lite Model Maker: Transfer Learning Toolkit for machine learning beginners

TensorFlow Lite Model Maker is an easy-to-use transfer learning tool to adapt state-of-the-art machine learning models to your dataset. It allows mobile developers to create a model without any machine learning expertise, reduces the required training data and shortens the training time through transfer learning.
After the initial release focusing on vision tasks, we recently added two new NLP tasks to Model Maker. You can follow the colab and guide for Text Classification and Question Answer. To install Model Maker:

pip install tflite-model-maker

To customize the model, developers need to write a few lines of python code as follow:

# Loads Data.
train_data = TextClassifierDataLoader.from_csv(train_csv_file, mode_spec=spec)
test_data = TextClassifierDataLoader.from_csv(test_csv_file, mode_spec=spec)

# Customize the TensorFlow model.
model = text_classifier.create(train_data, model_spec=spec)

# Evaluate the model.
loss, acc = model.evaluate(test_data)

# Export as a TensorFlow Lite model.
model.export(export_dir, quantization_config=config)

Conversion: Better support to convert NLP models to TensorFlow Lite

Since the TensorFlow Lite builtin operator library only supports a subset of TensorFlow operators, you may have run into issues while converting your NLP model to TensorFlow Lite, either due to missing ops or unsupported data types (like RaggedTensor support, hash table support, and asset file handling, etc.). Here are a few tips on how to resolve the conversion issues in such cases.

Run TensorFlow ops and TF.text ops in TensorFlow Lite

We have enhanced Select TensorFlow ops to support various cases. With Select TF ops, developers can leverage TensorFlow ops to run models on TensorFlow Lite, when there are no built-in TensorFlow Lite equivalent ops. For example, it’s common to use TF.Text ops and RaggedTensor when training TensorFlow models, and now those models can be easily converted to TensorFlow Lite and run with necessary ops.
Furthermore, we provide the solution of using op selectively building, so that we get a trimmed binary for mobile deployment. It can select a small set of used ops in the final build target, and thus reduces the binary size in deployment.

More efficient and friendly custom ops

In TensorFlow Lite, we provide a few new mobile-friendly ops for NLP, such as Ngram, SentencePieceTokenizer, WordPieceTokenizer and WhitespaceTokenizer.
Previously, there were several restrictions blocking models with SentencePiece from being converted to TensorFlow Lite. The new SentencePieceTokenizer API for mobile resolves these challenges, and simultaneously optimizes the implementation to make it run faster.
Similarly, Ngram and WhitespaceTokenizer are now not only supported, but will also be executed more efficiently on devices.
TensorFlow Lite recently announced operation fusion with MLIR. We used the same mechanism to fuse TF.Text APIs into custom TensorFlow Lite ops, improving inference efficiency significantly. For example, the WhitespaceTokenizer API was made up of multiple ops, and took 0.9ms to run in the original graph in TensorFlow Lite. After fusing these ops into a single op, it finishes in 0.04ms, a 23x speed-up. This approach has been proven to bring a huge gain in inference latency in the SGNN model mentioned above.

Hash table support

Hash table is important for many NLP models, since we usually need to utilize numeric computation in the language model by transforming words into token IDs and vice versa. Hash table will be enabled in TensorFlow Lite soon. It is supported by handling asset files natively in the TensorFlow Lite format and delivering op kernels as TensorFlow Lite built-in operators.

Deployment: How to run NLP models on-device

Running inference with TensorFlow Lite is now much easier than before. You can use pre-built inference APIs to integrate your model within 5 lines of code, or use utilities to build your own Android/iOS inference APIs.

Simple model deployment using TensorFlow Lite Task Library

The TensorFlow Lite Task Library is a powerful and easy-to-use task-specific library that provides out of the box pre- and post-processing utilities required for ML inference, enabling app developers to easily create machine learning features with TensorFlow Lite. There are three text APIs supported in the Task Library, which correspond to the use cases and models mentioned above:

  • NLClassifier: classifies the input text to a set of known categories.
  • BertNLClassifier: classifies text optimized for BERT-family models.
  • BertQuestionAnswerer: answers questions based on the content of a given passage with BERT-family models.

The Task Library works cross-platform on both Android and iOS. The following example shows inference with a BertQA model in Java/Swift:

// Initialization
BertQuestionAnswerer answerer = BertQuestionAnswerer.createFromFile(androidContext, modelFile);
// Answer a question
List answers = answerer.answer(context, question);
Java code for Android
// Initialization
let mobileBertAnswerer = TFLBertQuestionAnswerer.mobilebertQuestionAnswerer(modelPath: modelPath)
// Answer a question
let answers = mobileBertAnswerer.answer(context: context, question: question)
Swift code for iOS

Customized Inference APIs

If your use case is not supported by the existing task libraries, you can also leverage the Task API Infrastructure and build your own C++/Android/iOS inference APIs using common NLP utilities such as Wordpiece and Sentencepiece tokenizers in the same repo.


In this article, we introduced the new support for NLP tasks in TensorFlow Lite. With the latest update of TensorFlow Lite, developers can easily create, convert and deploy NLP models on-device. We will continue providing more useful tools, and accelerate the development of on-device NLP models from research to production. We would love to hear your feedback, and suggestions for newer NLP tools and utilities. Please email or create a TensorFlow Lite support GitHub issue.


We like to thank Khanh LeViet, Arun Venkatesan, Max Gubin, Robby Neale, Terry Huang, Peter Young, Gaurav Nemade, Prabhu Kaliamoorthi, Ping Yu, Renjie Liu, Lu Wang, Xunkai Zhang, Yuqi Li, Sijia Ma, Thai Nguyen, Xingying Song, Chung-Ching Chang, Shuangfeng Li to contribute to the blogpost.Read More