PyTorch 1.5 released, new and updated APIs including C++ frontend API parity with Python

Today, we’re announcing the availability of PyTorch 1.5, along with new and updated libraries. This release includes several major new API additions and improvements. PyTorch now includes a significant update to the C++ frontend, a ‘channels last’ memory format for computer vision models, and a stable release of the distributed RPC framework used for model-parallel training. The release also has new autograd APIs for computing Hessians and Jacobians, and an API, inspired by pybind11, that allows the creation of custom C++ classes.

You can find the detailed release notes here.

C++ Frontend API (Stable)

The C++ frontend API is now at parity with Python, and the features overall have been moved to ‘stable’ (previously tagged as experimental). Some of the major highlights include:

  • Now with ~100% coverage and docs for C++ torch::nn module/functional, users can easily translate their model from the Python API to the C++ API, making the model authoring experience much smoother.
  • Optimizers in C++ had deviated from the Python equivalent: C++ optimizers couldn’t take parameter groups as input, while the Python ones could. Additionally, the step function implementations were not exactly the same. With the 1.5 release, C++ optimizers always behave the same as their Python equivalents.
  • The lack of a tensor multi-dim indexing API in C++ was a well-known issue and had resulted in many posts in the PyTorch GitHub issue tracker and forum. The previous workaround was to use a combination of narrow / select / index_select / masked_select, which was clunky and error-prone compared to the Python API’s elegant tensor[:, 0, ..., mask] syntax. With the 1.5 release, users can use tensor.index({Slice(), 0, "...", mask}) to achieve the same purpose.
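To make the parameter-group point concrete, here is a minimal pure-Python sketch of plain SGD over parameter groups, where each group carries its own learning rate. It mirrors the idea behind the Python optimizers' parameter groups, but `Param` and `sgd_step` are hypothetical stand-ins rather than the torch API. Real optimizers also handle momentum, weight decay, and other options.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Param:
    """Hypothetical stand-in for a tensor: a scalar value plus its gradient."""
    value: float
    grad: float = 0.0


def sgd_step(param_groups: List[Dict]) -> None:
    """Apply one plain SGD update, using each group's own learning rate."""
    for group in param_groups:
        lr = group["lr"]
        for p in group["params"]:
            p.value -= lr * p.grad


backbone = Param(value=1.0, grad=0.5)
head = Param(value=1.0, grad=0.5)

# Two groups: a small learning rate for the "backbone", a larger one for the "head".
groups = [
    {"params": [backbone], "lr": 0.01},
    {"params": [head], "lr": 0.1},
]
sgd_step(groups)
```

In the Python API, the same idea is expressed by passing a list of dicts to an optimizer, such as `torch.optim.SGD([{"params": a}, {"params": b, "lr": 1e-3}], lr=1e-2)`, where `a` and `b` are iterables of parameters.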

‘Channels last’ memory format for Computer Vision models (Experimental)

‘Channels last’ memory layout unlocks the ability to use performance-efficient convolution algorithms and hardware (NVIDIA’s Tensor Cores, FBGEMM, QNNPACK). Additionally, it is designed to propagate automatically through operators, which allows easy switching between memory layouts.
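The two layouts differ only in their strides. The following pure-Python sketch needs no PyTorch, and the helper names are our own. It computes element strides for a 4-D tensor indexed as (N, C, H, W), first in the default contiguous layout and then in a channels-last (NHWC) layout. It also shows that channels-last places the channel values of one pixel next to each other in memory.

```python
from typing import Sequence, Tuple


def contiguous_strides(n: int, c: int, h: int, w: int) -> Tuple[int, int, int, int]:
    """Element strides for a dense NCHW tensor (the default layout)."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n: int, c: int, h: int, w: int) -> Tuple[int, int, int, int]:
    """Strides, still indexed as (N, C, H, W), when memory is laid out as NHWC."""
    return (h * w * c, 1, w * c, c)


def offset(index: Sequence[int], strides: Sequence[int]) -> int:
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (1, 3, 2, 2)  # N=1, C=3, H=2, W=2
nchw = contiguous_strides(*shape)
nhwc = channels_last_strides(*shape)

# Offsets of the 3 channel values of the pixel at (h=0, w=1) in channels-last memory.
pixel_offsets = [offset((0, ch, 0, 1), nhwc) for ch in range(3)]
```

For a real tensor, `x.contiguous(memory_format=torch.channels_last).stride()` should report the same pattern as `channels_last_strides` for the same shape. The logical shape stays (N, C, H, W).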

Learn more here on how to write memory format aware operators.

Custom C++ Classes (Experimental)

This release adds a new API, torch::class_, for binding custom C++ classes into TorchScript and Python simultaneously. This API is almost identical in syntax to pybind11. It allows users to expose their C++ class and its methods to the TorchScript type system and runtime system such that they can instantiate and manipulate arbitrary C++ objects from TorchScript and Python. An example C++ binding:

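#include <torch/custom_class.h>

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

template <class T>
struct MyStackClass : torch::CustomClassHolder {
  std::vector<T> stack_;
  MyStackClass(std::vector<T> init) : stack_(std::move(init)) {}

  // Push a value onto the top of the stack.
  void push(T x) {
    stack_.push_back(x);
  }

  // Remove and return the top value.
  T pop() {
    auto val = stack_.back();
    stack_.pop_back();
    return val;
  }
};

// Register the class as torch.classes.myclasses.MyStackClass.
static auto testStack =
  torch::class_<MyStackClass<std::string>>("myclasses", "MyStackClass")
      // Constructor binding, needed for MyStackClass(["hi", "mom"]) below.
      .def(torch::init<std::vector<std::string>>())
      .def("push", &MyStackClass<std::string>::push)
      .def("pop", &MyStackClass<std::string>::pop)
      .def("size", [](const c10::intrusive_ptr<MyStackClass<std::string>>& self) {
        return static_cast<int64_t>(self->stack_.size());
      });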
This exposes a class that you can use from Python and TorchScript like so:

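import torch

# Assumes the library that registers myclasses.MyStackClass is already loaded.
@torch.jit.script
def do_stacks(s: torch.classes.myclasses.MyStackClass):
    s2 = torch.classes.myclasses.MyStackClass(["hi", "mom"])
    print(s2.pop())  # "mom"
    s2.push("foobar")
    return s2  # ["hi", "foobar"]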

You can try it out in the tutorial here.

Distributed RPC framework APIs (Now Stable)

The Distributed RPC framework was launched as experimental in the 1.4 release, and with 1.5 it is marked stable. This work involves many enhancements and bug fixes that make the framework more reliable and robust overall. It also adds a few new features, including profiling support, using TorchScript functions in RPC, and several ease-of-use improvements. Below is an overview of the various APIs within the framework:


RPC API

The RPC API allows users to specify functions to run and objects to be instantiated on remote nodes. These functions are transparently recorded so that gradients can backpropagate through remote nodes using Distributed Autograd.

Distributed Autograd

Distributed Autograd connects the autograd graph across several nodes and allows gradients to flow through it during the backward pass. Gradients are accumulated into a context (rather than into the .grad field, as with local autograd). Users must therefore run their model’s forward pass under a with dist_autograd.context() manager so that all RPC communication is recorded properly. Currently, only FAST mode is implemented (see here for the difference between FAST and SMART modes).

Distributed Optimizer

The distributed optimizer creates RRefs to optimizers on each worker with parameters that require gradients, and then uses the RPC API to run the optimizer remotely. The user must collect all remote parameters and wrap them in an RRef, as this is required input to the distributed optimizer. The user must also specify the distributed autograd context_id so that the optimizer knows in which context to look for gradients.

Learn more about distributed RPC framework APIs here.

New High level autograd API (Experimental)

PyTorch 1.5 brings new functions, including jacobian, hessian, jvp, vjp, hvp and vhp, to the torch.autograd.functional submodule. These functions build on the existing autograd API and let users compute Jacobians, Hessians, and their products with vectors in a single call.
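torch.autograd.functional computes these quantities exactly with autograd. To make the definitions concrete without depending on PyTorch, this sketch swaps in central finite differences, so its results are approximate. It shows that jvp is the Jacobian times a vector (J v) and vjp is a vector times the Jacobian (vᵀ J). The function names are our own.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def f(x: Sequence[float]) -> Vector:
    """Example function R^2 -> R^2: f(x, y) = (x * y, x + y)."""
    return [x[0] * x[1], x[0] + x[1]]


def jacobian_fd(func: Callable[[Sequence[float]], Vector],
                x: Sequence[float], eps: float = 1e-6) -> Matrix:
    """Central-difference Jacobian: jac[i][j] approximates d func_i / d x_j."""
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        plus = list(x)
        plus[j] += eps
        minus = list(x)
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def jvp(jac: Matrix, v: Sequence[float]) -> Vector:
    """Jacobian-vector product J @ v (forward-mode style)."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in jac]


def vjp(v: Sequence[float], jac: Matrix) -> Vector:
    """Vector-Jacobian product v^T @ J (what reverse-mode backprop computes)."""
    return [sum(v[i] * jac[i][j] for i in range(len(v))) for j in range(len(jac[0]))]


J = jacobian_fd(f, [2.0, 3.0])
forward = jvp(J, [1.0, 0.0])
backward = vjp([1.0, 0.0], J)
```

With PyTorch installed, `torch.autograd.functional.jacobian` applied to a tensor-valued version of `f` at (2, 3) should give the same matrix, [[3, 2], [1, 1]], up to floating-point error.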

Detailed design discussion on GitHub can be found here.

Python 2 no longer supported

Starting with PyTorch 1.5.0, we no longer support Python 2, specifically version 2.7. Going forward, support for Python will be limited to Python 3, specifically Python 3.5, 3.6, 3.7, and 3.8 (first enabled in PyTorch 1.4.0).

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.


Team PyTorch


Specification gaming: the flip side of AI ingenuity

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold – but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material – and thus exploit a loophole in the task specification.

MIT team races to fill Covid-19-related ventilator shortage

It was clear early on in the unfolding Covid-19 pandemic that a critical need in the coming weeks and months would be for ventilators, the potentially life-saving devices that keep air flowing into a patient whose ability to breathe is failing.

Seeing a potential shortfall of hundreds of thousands of such units, professor of mechanical engineering Alex Slocum Sr. and other engineers at MIT swung into action, rapidly pulling together a team of volunteers with expertise in mechanical design, electronics, and controls, and a team of doctors with clinical experience in treating respiratory conditions. They started working together nonstop to develop an inexpensive alternative and share what they learned along the way. The goal was a design that could be produced quickly enough, potentially worldwide, to make a real difference in the immediate crisis.

In a very short time, they succeeded.

Just four weeks since the team convened, production of the first devices based directly on its work has begun in New York City. A group including 10XBeta, Boyce Technologies, and Newlab has begun production of a version called Spiro Wave, in close collaboration with the MIT team. The consortium expects to quickly deliver hundreds of units to meet the immediate needs of hospitals in New York and, eventually, other hospitals around the country.

Meanwhile, the team, called MIT Emergency Ventilator, has continued its research to develop the design further. The next iteration will be more compact, have a slightly different drive system, and add a key respiratory function. The team’s overarching goal is to focus on safety and straightforward functionality and fabrication. 10XBeta (in both New York and Johannesburg), Vecna Technologies, and NN Life Sciences (in the Boston area) are participating in this effort. 10XBeta was founded by MIT alumnus Marcel Botha SM ’06.

One version of the MIT Emergency Ventilator team’s emergency ventilator design undergoes testing in their lab. Courtesy of MIT Emergency Ventilator Team

A complex design challenge

Alexander Slocum Jr. SB ’08, SM ’10, PhD ’13, a mechanical engineer who is now a surgical resident at the Medical College of Wisconsin, worked closely with his father, Slocum Sr., and MIT research scientist Nevan Hanumara MS ’06, PhD ’12 to help lead the initial ramp up.

“The numbers are frightening, to put it bluntly,” Slocum Jr. says. “This project started around the time of news reports from Italy describing ventilators being rationed due to shortages, and available data at that time suggested about 10 percent of Covid patients would require an ICU.” One of his first tasks was to estimate the potential ventilator shortage, using resources like the CDC’s Pandemic Response Plan, and literature on critical care resource utilization. “We estimated a shortage of around 100,000 to 200,000 ventilators was possible by April or May,” he says.

Hanumara, who is one of the project leads for the Emergency Ventilator team, says the team intends to offer open-source guidelines, rather than detailed plans or kits, that will serve as resources to enable skilled teams around the country and world — such as hospital-based engineering groups, biomedical device manufacturing companies, and industry groups — to develop their own specific versions, taking into account local supply chains.

“There’s a reason we don’t have a single exact plan [on the website],” Hanumara says. “We have information and reference designs, because this isn’t something a home hobbyist should be making. We want to emphasize that it’s not trivial to create a system that can provide ventilation safely.”

“We saw all these designs being posted online, which is awesome that so many people wanted to help,” says Slocum Jr. “We thought the best first step would be to identify the minimum clinical functional requirements for safe ventilation, compare that to reported methods for managing ventilated patients with Covid, and use that to help us choose a design.”

The principle behind the existing device is certainly simple enough: Take an emergency resuscitator bag (Ambu is a common brand), which hospitals already have in large numbers and which is designed to be squeezed by hand. Automating the squeezing — using a pair of curved paddles driven by a motor — would allow rapid scale-up. But there’s a lot more to it, Hanumara says: “The controls are really tricky, and they have required many iterations as our understanding of the clinical and safety challenge grew.”

Slocum Jr. adds, “Covid patients often require ventilation for a week or more, and in longer cases that would mean about a million breaths. The paddles are specifically designed to encourage rolling contact in order to minimize wear on the bag.”

The starting point was a design developed a decade ago as a student team project in MIT class 2.75 (Medical Device Design), taught by Slocum Sr. and Hanumara. The team’s paper gave the new project a significant head start in tackling the design problem now, as they make rapid progress in close consultation with clinical practitioners.

That integral involvement of clinicians “is one key difference between us and a lot of the others” working on this engineering problem, says Kimberly Jung, an MIT master’s student in mechanical engineering.

Jung — who previously served five years in the U.S. Army, earned an MBA at Harvard University, and started a spice business that is currently the largest employer of women in Afghanistan — has been acting as the team’s executive officer as well as part of the engineering team. She says “there’s a lot of individuals and many small companies who are trying to make solutions for low-cost ventilators. The problem is that they just haven’t adhered to clinical guidelines, such as the tidal volume, inspiration-to-expiration ratio, breath per minute rate, maximum pressures, and key monitoring for safety. Developing these clinical requirements and translating them into engineering design requirements takes a lot of time and effort. This is a year-long research and development process that has been condensed into several weeks.”
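Turning a clinical setting into a motor timing target is simple arithmetic, and this small sketch shows it. The breath rate fixes the cycle length, and the inspiration-to-expiration (I:E) ratio splits that cycle. The function names, the 20 breaths per minute rate, and the 1:2 ratio are illustrative assumptions, not the team’s code or clinical settings.

```python
from typing import Tuple


def breath_timing(breaths_per_minute: float, i_ratio: float,
                  e_ratio: float) -> Tuple[float, float, float]:
    """Return (cycle, inspiration, expiration) times in seconds."""
    if breaths_per_minute <= 0 or i_ratio <= 0 or e_ratio <= 0:
        raise ValueError("rate and ratio terms must be positive")
    cycle_s = 60.0 / breaths_per_minute
    inspiration_s = cycle_s * i_ratio / (i_ratio + e_ratio)
    return cycle_s, inspiration_s, cycle_s - inspiration_s


def total_breaths(breaths_per_minute: float, days: float) -> float:
    """Breaths delivered over a number of days at a constant rate."""
    return breaths_per_minute * 60 * 24 * days


# Illustrative values only: 20 breaths per minute with an I:E ratio of 1:2.
cycle, t_in, t_ex = breath_timing(20, 1, 2)
week = total_breaths(20, 7)
```

At the assumed 20 breaths per minute, one week is 201,600 breaths. The “about a million breaths” mentioned earlier therefore corresponds to roughly five weeks of continuous ventilation at that rate.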

A team assembles

Others got pulled into the team as the project ramped up. Coby Unger, an industrial designer and instructor at the MIT Hobby Shop, started building the first prototypes in the machine shop. Jung recruited her classmate and neighbor, Shakti Shaligram SM ’19, to help with machining, and also brought in Michael Detienne, an electrical engineer and member of the MITERS makerspace. Two students at the MIT Maker Workshop helped with initial fabrication with stock borrowed from the MIT Laboratory for Manufacturing and Productivity shop. Looking for pressure sensors, Hanumara reached out to David Hagan PhD ’20, CEO of an MIT spinoff company called QuantAQ, and he joined the team. The website was rapidly deployed by Eric Norman, a communications expert who had worked with Hanumara on another MIT project.

Realizing that feedback and control systems were crucial to the device’s safe operation, the team early on decided they needed help from specialists in that area. Daniela Rus, head of MIT’s Computer Science and Artificial Intelligence Laboratory, joined the team and took responsibility for the control system along with several members of her research group. Rus also suggested that research scientist Murad Abu-Khalaf and graduate students Teddy Ort and Brandon Araki join the volunteer team. They eagerly accepted the invitation. Ort’s roommate, Amado Antonini SM ’18, also joined the team to assist with motor controls.

Meanwhile, alumnus Albert Kwon SB ’08, HST ’13, an anesthesiologist at Westchester Medical Center and assistant professor of anesthesiology at New York Medical College, was recruited by Slocum Jr. to join the project early on. Kwon was granted leave from his job at Westchester to devote time to the project, providing clinical guidance on the kinds of controls and safety systems needed for the device to work safely. “Westchester Medical Center gave him up, which is very special, and he’s been working to translate the technical to clinical, and explain the scenarios that fit with a stripped-down system like this,” Hanumara says. Many other clinicians also advised the MIT Emergency Ventilator team. They include Jay Connor, a surgeon at Mt. Auburn Hospital and part of the Medical Device Design course teaching team; Christoph Nabzdyk, a cardiothoracic anesthesiologist and critical care physician at Mayo Clinic and long-time colleague of Kwon; and Dirk Varelmann, another anesthesiologist, from Brigham and Women’s Hospital.

A spark to help others fill the gap

“While our design cannot replace a full featured ventilator,” Hanumara stresses, “it does provide key ventilation functions that will allow health care facilities under pressure to better ration their ICU ventilators and human resources, in a bad scenario.”

In a way, he says, “we’re turning the clock back, going back to the core parameters of ventilation.” Before today’s electronic sensors and controls were available, “doctors were trained to adjust the ventilators based on looking directly at physiological responses of the patient. So, we know that’s doable. … The patient himself is a reasonable sensor.”

While the federal government has now established contracts with large manufacturing companies to start producing ventilators to help meet the urgent need, that process will take time, Jung says, leaving a significant gap for something to meet the need in the meantime. “The fastest these large manufacturers can spin up is about two months,” she says.

“This need will probably be even more pronounced in the emerging markets,” Hanumara adds.

The team doesn’t plan to launch its own production directly, or even to provide a single, detailed set of plans. “Our goal is to put out a really solid reference design,” Hanumara says, “and to a limited extent help big groups scale it. We have shared great learnings with our local industry collaborators.” It will be up to local teams to adapt the design to the materials and parts that they can reliably obtain and the particular needs of their hospitals.

“Your mechanical and electrical engineering team will have to inquire as to what’s in their supply chain and what fabrication methods they have easily available to them and adapt the design. The base designs are intended to be really adaptable, but it may require modifications. What motors can they source? What motor drivers and controllers do [the] electrical team need to look at? What level of controls and safeties do their clinicians require for their patient population and how should this be reflected in the code? So, we can’t put out an exact kit,” says Hanumara.

The hope is to provide a spark to start teams everywhere to further develop and adapt the concept, Hanumara says. “Provided clinical safety is shown, we’ll probably see many of these around the world, with some shared DNA from us, as well as local flavors. And I think that will be beautiful, because it will mean that people all over are working hard to help their communities.”

“I’m super proud of the team,” Jung says, “for how each of us has stepped up to the plate and stuck with it despite the internal and external challenges. All of us have one mission in mind, which is to save lives, and that’s what has kept us together and turned us into a quirky MIT family.”

This article has been updated to reflect a change to the project’s name, from MIT E-Vent to MIT Emergency Ventilator.


What’s new in TensorFlow Lite from DevSummit 2020

Posted by Khanh LeViet, Developer Advocate on behalf of the TensorFlow Lite team
Edge devices, such as smartphones, have become more powerful each year and enable an increasing number of on-device machine learning use cases. TensorFlow Lite is the official framework for running TensorFlow model inference on edge devices. It runs on more than 4 billion active devices globally, on various platforms, including Android, iOS, and Linux-based IoT devices, and on bare metal microcontrollers.

We continue to push the limits of on-device machine learning with TensorFlow Lite by making it faster and easier to use. In this blog post, we highlight recent TensorFlow Lite features that were launched within the past six months, leading up to the TensorFlow Dev Summit in March 2020.

Pushing the limits of on-device machine learning

Enabling state-of-the-art models

Machine learning is a fast-moving field, with new models breaking state-of-the-art records every few months. We put a lot of effort into making these state-of-the-art models run well on TensorFlow Lite. For example, we now support EfficientNet-Lite (paper), a family of image classification models, as well as MobileBERT (paper) and ALBERT-Lite (paper), light-weight versions of BERT (paper) that support multiple NLP (natural language processing) tasks. Check out some of the performance benchmarks below.


EfficientNet-Lite

EfficientNet-Lite is a family of image classification models that achieve state-of-the-art accuracy with an order of magnitude fewer computations and parameters. The models are optimized for TensorFlow Lite with quantization, resulting in faster inference with negligible accuracy loss, and they can run on the CPU, GPU, or Edge TPU. Find out more in our blog post.

Figure: Integer-only quantized models running on Pixel 4 CPU with 4 threads.

MobileBERT and ALBERT-Lite

MobileBERT and ALBERT-Lite are the optimized versions of the popular BERT model that achieved state-of-the-art accuracy on a range of NLP tasks, including question and answer, natural language inference, and others. MobileBERT is about 4x faster and smaller than BERT and retains similar accuracy. Meanwhile, ALBERT-Lite is 6x smaller than BERT, but slower than MobileBERT.

Figure: Pixel 4, Float32 question & answer models, CPU 4 threads

New TensorFlow Lite converter

We launched a new converter that enables more models and improves the developer experience:

  • Enables conversion of new classes of models, including DeepSpeech V2, Mask R-CNN, Mobile BERT, MobileNetSSD, and many more
  • Adds support for functional control flow (enabled by default in TensorFlow 2.x)
  • Tracks original TensorFlow node names and Python code and exposes them during conversion if errors occur
  • Leverages MLIR, Google’s cutting edge compiler technology for ML, which makes it easier to troubleshoot conversion errors and extend to accommodate feature requests

The new converter is fully backward compatible and has been enabled by default since TensorFlow 2.2, while the old converter is still available via a flag. See the documentation for more details.

Quantization-aware-training support for Keras

Quantization-aware training (QAT) enables you to train and deploy models with the performance and size benefits of quantization, making your model 4x smaller and faster to run while retaining accuracy. You can add QAT with one line of code.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    ...  # model layers omitted
])
# Quantize the entire model.
quantized_model = tfmot.quantization.keras.quantize_model(model)

# Continue with training as usual.

Here is how QAT stacks up against the original float model and post-training quantization. Find out more in our blog post.
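QAT works by simulating quantization during training. Values are rounded onto the int8 grid and mapped back to float (often called “fake quantization”), so the model learns to tolerate the rounding error. This pure-Python sketch shows one common affine int8 scheme on a handful of values. It is not the tfmot implementation, which chooses ranges and per-channel scales in its own way, and the helper names are our own. The sketch also shows where the “4x smaller” figure comes from: a float32 weight takes 4 bytes, while an int8 weight takes 1 byte.

```python
import struct
from typing import List, Tuple


def int8_qparams(values: List[float]) -> Tuple[float, int]:
    """Affine int8 (scale, zero_point) covering min(values)..max(values) and 0."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if every value is 0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def fake_quantize(x: float, scale: float, zero_point: int) -> float:
    """Round x onto the int8 grid (clamped to [-128, 127]) and map it back to float."""
    q = max(-128, min(127, round(x / scale) + zero_point))
    return (q - zero_point) * scale


weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
scale, zp = int8_qparams(weights)
simulated = [fake_quantize(w, scale, zp) for w in weights]

# Bytes per float32 value divided by bytes per int8 value.
size_ratio = struct.calcsize("f") // struct.calcsize("b")
```

Every simulated weight stays within half a quantization step of the original, and zero is represented exactly. These are the properties that make training against the rounded values a good proxy for the deployed int8 model.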

Faster inference across platforms

Better CPU performance

Improving CPU performance has been a major priority for the team and we’ve shipped a number of substantial CPU-related performance optimizations in recent months. As part of this effort, we developed an optimized matrix multiplication library (ruy), built from the ground up to deliver better performance on the classes of CPU hardware and models typically used in mobile environments. As of TensorFlow 1.15, this library is enabled by default for all ARM devices and has helped deliver latency improvements anywhere from 1.2x to 5x across an extremely broad range of models and use cases.

Pixel 4 – Single Threaded CPU, February 2020

We have some additional CPU optimizations scheduled to ship in the TensorFlow 2.3 release, including ~40% faster execution of models with post-training weight quantization, as well as a new highly optimized floating-point convolutional kernel library (XNNPACK) that delivers 20-50% faster execution across all of the key floating-point convolutional models supported by TensorFlow Lite.

Faster inference with new hardware accelerator delegates

TensorFlow Lite is truly cross-platform, so you can train a model once and get optimal performance on every supported platform. In the past few months, we have added support for running inference on Qualcomm’s Hexagon DSPs, Apple’s Core ML, and Android GPUs with OpenCL.
Hexagon DSPs are microprocessors that can be found on millions of modern Android phones using Qualcomm Snapdragon SoCs. The new TensorFlow Lite Hexagon delegate leverages the DSP to achieve performance gains in the range of 3-25x for models like MobileNet and Inceptionv3 compared to CPU, while being more power efficient than both CPU and GPU. Learn more in our blog post.
Core ML is the machine learning framework available on Apple’s devices and provides the API to run ML models on Apple’s Neural Engine. The new TensorFlow Lite Core ML delegate allows running TensorFlow Lite models on Core ML and Neural Engine, if available, to achieve faster inference with better power efficiency. On iPhone XS and newer devices, where Neural Engine is available, we have observed performance gains from 1.3x to 11x on various computer vision models. More details can be found in our blog post.
OpenCL is a framework for writing programs that execute across heterogeneous platforms. We recently added support for OpenCL to the TensorFlow Lite GPU delegate, achieving approximately 4-6x speed-up over CPU and approximately 2x speed-up over OpenGL on a variety of computer vision models. Here is a snapshot of the OpenCL backend performance on Pixel 4.

Android performance bottleneck profiler

TensorFlow Lite on Android supports instrumented logging of internal events, including ops invocations, that can be tracked by Android’s system tracing. The new profiling data allows you to identify performance bottlenecks.
Here are some examples of insights that you can get from the profiler and potential solutions to improve performance:

  • If the number of available CPU cores is smaller than the number of inference threads, then the CPU scheduling overhead can lead to subpar performance. You can reschedule other CPU-intensive tasks in your application to avoid overlapping with your model inference, or tweak the number of interpreter threads.
  • If the operators are not fully delegated, then some parts of the model graph are executed on the CPU rather than the expected hardware accelerator. You can substitute the unsupported operators with similar supported operators.
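A minimal sketch of the second remedy in the first bullet is to cap the requested thread count at the number of CPU cores. `choose_num_threads` is a hypothetical helper. Pass its result to whatever thread-count setting your interpreter exposes, and check your TensorFlow Lite version’s API for the exact option. Note that `os.cpu_count()` reports the machine’s logical CPUs, not how many are currently idle, so profiling is still the way to confirm the effect.

```python
import os
from typing import Optional


def choose_num_threads(requested: int, available: Optional[int] = None) -> int:
    """Cap the interpreter thread count at the number of CPU cores (minimum 1)."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


threads = choose_num_threads(4)
```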

This feature is available now in TensorFlow Lite Android library nightly build. More details can be found here.

Make ML easier to use

Model creation with no ML expertise

TensorFlow Lite Model Maker enables you to adapt state-of-the-art machine learning models to your dataset with transfer learning. It shortens the learning curve for developers new to machine learning by wrapping the complex machine learning concepts with an intuitive API. For example, you can train a state-of-the-art image classification model with only four lines of code.

data = ImageClassifierDataLoader.from_folder('flower_photos/')
model = image_classifier.create(data)
loss, accuracy = model.evaluate()
model.export('flower_classifier.tflite', 'flower_label.txt', with_metadata=True)

Model Maker supports many state-of-the-art models that are available on TensorFlow Hub, including the EfficientNet-Lite models mentioned above. It currently supports image classification (tutorial) and text classification (tutorial) with more computer-vision and NLP use cases coming soon.

Model sharing made easy with metadata

Traditionally, running inference with TensorFlow Lite meant working with raw tensors. This presented two hurdles:

  1. The consumer of the TensorFlow Lite model needs to know exactly what the tensor shape means (e.g. 1 x 224 x 224 x 3). Is it a bitmap? If so, is it in red, green, and blue channels or some other scheme? This poses a problem if the team creating the model is not the same team consuming it.
  2. The need to use a lot of error-prone boilerplate code to convert from high-level data types, such as Bitmap to an RGB float array or a ByteArray, before it can be used.

To solve the first problem, we added support for model metadata to TensorFlow Lite, allowing model creators to describe the input and output of their model using typed objects. In addition to basic information, such as the size of the bitmap or color channels, we also included information, such as mean and standard deviation, to communicate to the model consumers, so that the appropriate normalization can be applied.
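The mean and standard deviation in the metadata tell the consumer which normalization to apply to each channel value before inference: (value - mean) / std. This sketch applies it to one RGB pixel. The values mean = std = 127.5 are an assumption (a common convention that maps [0, 255] onto [-1, 1]). A real consumer should read them from the model’s metadata, and `normalize` is a hypothetical helper.

```python
from typing import List, Sequence


def normalize(pixels: Sequence[int], mean: float, std: float) -> List[float]:
    """Apply (value - mean) / std to each 8-bit channel value."""
    if std == 0:
        raise ValueError("std must be non-zero")
    return [(p - mean) / std for p in pixels]


# Assumed metadata values; 127.5 / 127.5 maps 0..255 onto -1..1.
rgb_pixel = [0, 127, 255]
normalized = normalize(rgb_pixel, mean=127.5, std=127.5)
```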
To solve the second problem of boilerplate code, we created the Android code generator, which reads the TensorFlow Lite metadata and creates the appropriate wrapper code to resize, normalize, and convert to and from ByteArray. This means you can now interact with the TensorFlow Lite model using high-level objects that you are familiar with:

// 1. Initializing the Model
MyClassifierModel myImageClassifier = new MyClassifierModel(activity);

// 2. Setting the input with a Bitmap called inputBitmap
MyClassifierModel.Inputs inputs = myImageClassifier.createInputs();
inputs.loadImage(inputBitmap);

// 3. Running the model
MyClassifierModel.Outputs outputs = myImageClassifier.run(inputs);

// 4. Retrieving the result
Map<String, Float> labeledProbability = outputs.getProbability();

This is currently an experimental feature and only supports image-based models. We added metadata support to most TensorFlow Lite vision models on TensorFlow Hub and the Image Classifier Model Maker. Going forward, the project is expanding in three ways:

  1. Support input types beyond images to enable more use-cases
  2. Build an Android Studio plugin that makes this even easier to use
  3. Add iOS support

More sample and learning materials

We launched two online courses on Coursera and Udacity to provide a structured learning path for TensorFlow Lite. Both courses are four weeks long and teach how to use TensorFlow Lite on Android, iOS, and IoT devices.
We released new sample apps demonstrating how to use pretrained models, including style transfer, question and answer and more. We love to see the engagement in the TensorFlow Lite community. Recently, one member collected pretrained models, samples, and tutorials created by the community and curated them on GitHub. Feel free to contribute!

Better support for microcontrollers

Official support for Arduino

TensorFlow Lite for Microcontrollers is now available as an official Arduino library, which makes it easy to deploy speech detection to an Arduino Nano in under 5 minutes.

More TensorFlow for Microcontrollers optimizations

We are working with leading industry partners who are writing optimized implementations of TensorFlow Lite for Microcontrollers kernels for their hardware architectures. For example, Cadence announced its support for TensorFlow Lite for Microcontrollers for its Tensilica HiFi DSP family.

How Google is using TensorFlow Lite

TensorFlow Lite is used extensively within Google in many of our key products, including YouTube, Google Assistant, and Google Photos.
The Google Lens team shared how they migrated from a server-based model to a client-based on-device model to improve the user experience.

The Live Perception team showed how to build a machine learning pipeline to process live camera feed in real-time.

What’s next

We have new features and improvements coming in a few months:

  • XNNPACK integration for highly optimized floating-point model execution. This will significantly speed up CPU inference across platforms.
  • New state-of-the-art on-device models, an updated guide, and examples demonstrating more use cases, such as native C/C++ APIs for inference on mobile.
  • Additional tools for trimming binary size based on the ops used in client models, reducing the size impact on client apps.
  • Enhancements to Model Maker for more tasks, such as object detection and NLP. We are adding BERT support to enable new NLP tasks like question answering, which will empower developers without ML expertise to build state-of-the-art NLP models through transfer learning.
  • Expansion of the metadata and codegen tools to support more use cases, including object detection and other NLP-related tasks, and better integration with Android Studio.

To see the longer-term TensorFlow Lite product roadmap, please check out our website.

Optimizing style transfer to run on mobile with TFLite

Posted by Khanh LeViet and Luiz Gustavo Martins, Developer Advocates

Neural style transfer is an optimization technique used to take two images, a content image (such as a building) and a style image (such as artwork by an iconic painter), and blend them together so the output image looks like the content image “painted” in the style of the reference image. Today, we are excited to share a pre-trained style transfer TensorFlow Lite model that is optimized for mobile, along with Android and iOS sample apps that use the model to stylize any image.

In this article, we will walk you through the journey of optimizing the large TensorFlow model for mobile deployment, and how to use it efficiently in a mobile app with TensorFlow Lite. We hope that you can use our pre-trained style transfer model or leverage our insights for your use cases.


An example of style transfer

Style transfer was first published in A Neural Algorithm of Artistic Style. The original technique, however, was computationally expensive: it could take several seconds to stylize an image even on high-end GPUs. Subsequent work by several authors showed how to speed up style transfer.

After evaluating several model architectures, we decided to start with a pre-trained arbitrary style transfer model from Magenta for our sample app. The model can take any content and style image as input, then use a feedforward neural network to generate a stylized output image. This model allows much faster style transfer compared to the technique in Gatys’s paper, but it is still quite large (44 MB) and slow (2340 ms on Pixel 4 CPU). Therefore, we need to optimize the model to make it suitable to use in mobile applications. This article shares our experience doing so, with resources you can take advantage of in your work.

Optimizing the model architecture

The structure of our style transfer model

Magenta’s arbitrary style transfer model consists of two subnetworks:

  • Style prediction network: converts the style image to a style embedding vector.
  • Style transform network: applies the style embedding vector on the content image to generate a stylized image.

Magenta’s style prediction network has an InceptionV3 backbone, so we replaced it with a MobileNetV2 backbone, which is optimized for mobile. The style transform network consists of several convolution layers. We applied the width multiplier idea from MobileNet, scaling down the number of output channels of all convolution layers by a factor of 4.
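To see why a width multiplier shrinks a network so much, the sketch below counts the parameters of a hypothetical stack of 3x3 convolutions before and after multiplying every layer's output channels by 0.25. The channel widths are invented for illustration and are not the actual layer sizes of the style transform network. Because each layer's input channels are the previous layer's output channels, most layers shrink by roughly 4 × 4 = 16x rather than just 4x.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def apply_width_multiplier(channels, alpha):
    """Scale every layer's output channel count by alpha (MobileNet-style)."""
    return [max(1, int(c * alpha)) for c in channels]


def stack_params(kernel, input_channels, output_channels):
    """Total parameters of a plain stack of convolutions applied in sequence."""
    total, c_in = 0, input_channels
    for c_out in output_channels:
        total += conv_params(kernel, c_in, c_out)
        c_in = c_out  # the next layer consumes this layer's outputs
    return total


# Hypothetical channel widths; the RGB input keeps its 3 channels.
base_channels = [32, 64, 128, 128]
slim_channels = apply_width_multiplier(base_channels, 0.25)

full_size = stack_params(3, 3, base_channels)
slim_size = stack_params(3, 3, slim_channels)
print(slim_channels, full_size, slim_size, round(full_size / slim_size, 1))
# prints [8, 16, 32, 32] 240832 15280 15.8
```

Only the first layer, whose input is fixed at 3 channels, shrinks by just 4x, which is why the overall ratio lands slightly below 16.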
Then, we had to decide how to train our model. We experimented with multiple options: training the mobile model from scratch or distilling from Magenta’s pre-trained model. We found that fixing the weights of MobileNetV2 while optimizing the other parameters from scratch gave the best result.
We were able to achieve a similar level of style and content loss, while significantly shrinking and speeding up the model.

* Benchmarked on Pixel 4 CPU using TensorFlow Lite with 2 threads, April 2020.
* See this paper for more details about the definition of the loss function used in this style transfer model.


Once we had settled on the model architecture, we continued to shrink our mobile model with quantization using the TensorFlow Model Optimization Toolkit. This is an important technique that applies to most mobile deployments of TensorFlow models, as it can shrink the model size by up to 4x and speed up model inference with an insignificant trade-off in quality.
Among the quantization options that TensorFlow provides, we decided to use post-training integer quantization because it strikes the right balance between simplicity and model quality. We only needed to provide a small portion of our training dataset when converting the TensorFlow model to TensorFlow Lite.
After quantization, our model is more than an order of magnitude smaller and faster than the original model, while maintaining the same level of style and content loss.

* Benchmarked on Pixel 4 CPU using TensorFlow Lite with 2 threads, April 2020.
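Post-training integer quantization maps float values onto 8-bit integers using a scale and zero point derived from the value range observed on representative data, which is why a small slice of the training set is needed during conversion. The following is a simplified, per-tensor sketch of that idea in plain Python. It is not the TensorFlow Lite converter's implementation, which performs calibration internally when you set `converter.representative_dataset` and differs in its details; the sample values are hypothetical.

```python
QMIN, QMAX = -128, 127  # signed 8-bit range


def calibrate(samples):
    """Observed float range over representative samples, widened to include 0."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    return min(lo, 0.0), max(hi, 0.0)


def quant_params(lo, hi):
    """Affine parameters so that lo maps to QMIN and hi maps to QMAX."""
    if hi == lo:  # degenerate range (all zeros): any positive scale works
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point


def quantize(x, scale, zero_point):
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))  # clamp values outside the calibrated range


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


# Hypothetical activations collected from a few representative inputs.
representative = [[-1.0, 0.5, 0.25], [2.0, 0.0, 1.5]]
scale, zero_point = quant_params(*calibrate(representative))
for x in (-1.0, 0.0, 0.7, 2.0):
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```

Each float is stored in one byte instead of four, and values inside the calibrated range come back with an error of at most half a quantization step.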

Deployment to mobile

We implemented an Android app to demonstrate how to use the style transfer model. The app takes a style image, a content image, and outputs an image that mixes the style and content of the input images.
We use the phone’s camera to capture the content images with the Camera2 API and provide a set of famous paintings to be used as style images. As mentioned above, applying a style to a content image takes two steps. First, we extract the style as an array of floats using the style prediction network. Then we apply this style to the content image using the style transform network.
In order to achieve the best performance on both CPU and GPU, we created two sets of TensorFlow Lite models, one optimized for each chip. We use the int8 quantized model for CPU inference and the float16 quantized model for GPU inference. The GPU generally achieves better performance than the CPU, but it currently only supports float models, which are larger than int8 quantized models. Here is how the int8 and float16 models perform.

* Benchmarked on Pixel 4 using TensorFlow Lite, April 2020.
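The size differences between these model variants follow largely from how many bytes each weight takes. The sketch below uses Python's `struct` module to get the per-value size of float32, float16 and int8 and applies it to a hypothetical 1-million-parameter model. It counts only weight storage and ignores the graph structure, quantization parameters and other overhead that a real .tflite file carries.

```python
import struct

# Size in bytes of one value of each storage type.
BYTES_PER_VALUE = {
    "float32": struct.calcsize("f"),  # single precision
    "float16": struct.calcsize("e"),  # IEEE 754 half precision
    "int8": struct.calcsize("b"),     # signed char
}


def weight_bytes(num_params, dtype):
    """Bytes needed to store num_params weights of the given type."""
    return num_params * BYTES_PER_VALUE[dtype]


num_params = 1_000_000  # hypothetical parameter count
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {weight_bytes(num_params, dtype) / 1e6:.1f} MB")
```

This is where the "up to 4x" figure for int8 comes from, and why the float16 model used on the GPU sits between the two.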

Another possible performance gain is to cache the results of the style prediction network if you only plan to support a fixed set of style images in your mobile app. This also makes your app smaller, as you do not need to include the style prediction network, which accounts for 91% of the total network size. This is the main reason why the process is split into two models instead of one.
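When the set of styles is fixed, the style prediction step can move from the app to build time. The sketch below shows that split in plain Python: `predict_style` and `transform` are hypothetical stand-ins for the two TensorFlow Lite models (a real app would run an interpreter for each), the style names are made up, and the table of precomputed embeddings is what would ship with the app instead of the prediction network.

```python
# Hypothetical stand-ins for the two networks. The real style prediction
# network outputs an embedding vector; the transform network consumes it.
prediction_runs = 0


def predict_style(style_name):
    """Pretend style prediction network: maps a style image to an embedding."""
    global prediction_runs
    prediction_runs += 1
    seed = sum(ord(ch) for ch in style_name)
    return tuple((seed * (i + 1)) % 10 / 10 for i in range(4))


def transform(content, embedding):
    """Pretend style transform network: blends content with the embedding."""
    return [0.5 * c + 0.5 * embedding[i % len(embedding)]
            for i, c in enumerate(content)]


# Build time: run the large prediction network once per supported style.
def build_style_table(style_names):
    return {name: predict_style(name) for name in style_names}


# Run time: only the transform network and the table are needed on-device.
def stylize(content, style_name, style_table):
    return transform(content, style_table[style_name])


STYLE_TABLE = build_style_table(["starry_night", "the_scream"])
for _ in range(3):
    image = stylize([0.2, 0.4, 0.6], "starry_night", STYLE_TABLE)
print(prediction_runs)  # 2: once per style at build time, none per stylize()
```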
The sample can be found on GitHub and the main class applying style is the StyleTransferModelExecutor.
It is important not to run style transfer on the UI thread, as it is computationally expensive. Instead, we use the ViewModel class from AndroidX and a coroutine to run it on a dedicated background thread and easily update the view. In addition, when running a model with the GPU delegate, TF Lite interpreter initialization, GPU delegate initialization, and inference must all run on the same thread.
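The same-thread requirement can be met by funnelling interpreter setup, delegate setup and inference through one dedicated worker thread. The sample app does this with Kotlin coroutines; the following is only a sketch of the pattern in Python using a single-worker `ThreadPoolExecutor`, with hypothetical `init_interpreter`, `init_gpu_delegate` and `run_inference` functions that just record which thread ran them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InferenceThread:
    """Runs every submitted task on one dedicated background thread."""

    def __init__(self):
        # max_workers=1: the pool never creates more than one worker thread.
        self._pool = ThreadPoolExecutor(max_workers=1,
                                        thread_name_prefix="inference")

    def submit(self, fn, *args):
        # Returns a Future, so a UI thread could attach a callback
        # instead of blocking while the model runs.
        return self._pool.submit(fn, *args)

    def close(self):
        self._pool.shutdown(wait=True)


# Hypothetical setup and inference steps that record the thread they ran on.
seen_threads = []


def init_interpreter():
    seen_threads.append(threading.get_ident())


def init_gpu_delegate():
    seen_threads.append(threading.get_ident())


def run_inference(frame):
    seen_threads.append(threading.get_ident())
    return f"stylized {frame}"


worker = InferenceThread()
worker.submit(init_interpreter)
worker.submit(init_gpu_delegate)
result = worker.submit(run_inference, "frame-1").result()
worker.close()
print(result, len(set(seen_threads)))  # stylized frame-1 1
```

Because the single worker processes tasks in submission order, setup is guaranteed to finish before inference starts, and everything stays off the caller's thread.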

Style transfer in production

The Google Arts & Culture app recently added Art Transfer that uses TensorFlow Lite to run style transfer on-device. The model used is very similar to the one above but prioritizes quality over speed and model size. Try it out if you are interested in seeing style transfer in production.

Your Turn

If you want to add style transfer to your own app, you can start by downloading the mobile sample. Both model versions, the float16 (predict network, transform network) and the int8 quantized version (predict network, transform network), are available on TensorFlow Hub. We can’t wait to see what you create! Don’t forget to share your creations with us.


Running machine learning models on-device has the benefit of keeping users’ data private while enabling low-latency features.
In this post, we have shown that directly converting a TensorFlow model to TensorFlow Lite might be just the first step. To achieve good performance, developers should optimize their model with quantization and find the right trade-off between model quality, model size, and inference time.
We used the resources below to create our model. They might also be applicable to your on-device machine learning use cases.

  • Magenta model repository
    Magenta is an open source project powered by TensorFlow. It uses machine learning to make music and art. There are many models that can be converted to TensorFlow Lite, including this style transfer model.
  • TensorFlow Model Optimization Toolkit
    Model Optimization Toolkit provides multiple methods to optimize your model, including quantization and pruning.
  • TensorFlow Lite delegates
    TensorFlow Lite can leverage many different types of hardware accelerators available on devices, including GPUs and DSPs, to speed up model inference.

Deploying more conversational chatbots

The comedian Bill Burr has said he refuses to call into automated customer service lines for fear that, years later on his death bed, all he’ll be able to think about are the moments he wasted dealing with chatbots.

Indeed, the frustrating experience of trying to complete even the most straightforward task through an automated customer service line is enough to make anyone question the purpose of life.

Now the startup Posh is trying to make conversations with chatbots more natural and less maddening. It’s accomplishing this with an artificial intelligence-powered system that uses “conversational memory” to help users complete tasks.

“We noticed bots in general would take what the user said at face value, without connecting the dots of what was said before in the conversation,” says Posh co-founder and CEO Karan Kashyap ’17, SM ’17. “If you think about your conversations with humans, especially in places like banks with tellers or in customer service, what you said in the past is very important, so we focused on making bots more humanlike by giving them the ability to remember historical information in a conversation.”

Posh’s chatbots are currently used by over a dozen credit unions across voice- and text-based channels. The well-defined customer base has allowed the company to train its system on only the most relevant data, improving performance.

The founders plan to gradually partner with companies in other sectors to gather industry-specific data and expand the use of their system without compromising performance. Down the line, Kashyap and Posh co-founder and CTO Matt McEachern ’17, SM ’18 plan to provide their chatbots as a platform for developers to build on.

The expansion plans should attract businesses in a variety of sectors: Kashyap says some credit unions have successfully resolved more than 90 percent of customer calls with Posh’s platform. The company’s expansion may also help alleviate the mind-numbing experience of calling into traditional customer service lines.

“When we deploy our telephone product, there’s no notion of ‘Press one or press two,’” Kashyap explains. “There’s no dial tone menu. We just say, ‘Welcome to whatever credit union, how can I help you today?’ In a few words, you let us know. We prompt users to describe their problems via natural speech instead of waiting for menu options to be read out.”

Bootstrapping better bots

Kashyap and McEachern became friends while pursuing their degrees in MIT’s Department of Electrical Engineering and Computer Science. They also worked together in the same research lab at the Computer Science and Artificial Intelligence Laboratory (CSAIL).

But their relationship quickly grew outside of MIT. In 2016, the students began software consulting, in part designing chatbots for companies to handle customer inquiries around medical devices, flight booking, personal fitness, and more. Kashyap says they used their time consulting to learn about and take business risks.

“That was a great learning experience, because we got real-world experience in designing these bots using the tools that were available,” Kashyap says. “We saw the market need for a bot platform and for better bot experiences.”

From the start, the founders executed a lean business strategy that made it clear the engineering undergrads were thinking long term. Upon graduation, the founders used their savings from consulting to fund Posh’s early operations, giving themselves salaries and even hiring some contacts from MIT.

It also helped that they were accepted into the delta v accelerator, run by the Martin Trust Center for MIT Entrepreneurship, which gave them a summer of guidance and free rent. Following delta v, Posh was accepted into the DCU Fintech Innovation Center, connecting it with one of the largest credit unions in the country and netting the company another 12 months of free rent.

With DCU serving as a pilot customer, the founders got a “crash course” in the credit union industry, Kashyap says. From there they began a calculated expansion to ensure they didn’t grow faster than Posh’s revenue allowed, freeing them from having to raise venture capital.

The disciplined growth strategy at times forced Posh to get creative. Last year, as the founders were looking to build out new features and grow their team, they secured about $1.5 million in prepayments from eight credit unions in exchange for discounts on their service along with a peer-driven profit-sharing incentive. In total, the company has raised $2.5 million using that strategy.

Now on more secure financial footing, the founders are poised to accelerate Posh’s growth.

Pushing the boundaries

Even referring to today’s automated messaging platforms as chatbots seems generous. Most of the ones on the market today are only designed to understand what a user is asking for, something known as intent recognition.

The result is that many of the virtual agents in our lives, from the robotic telecom operator to Amazon’s Alexa to the remote control, take directions but struggle to hold a conversation. Posh’s chatbots go beyond intent recognition, using what Kashyap calls context understanding to figure out what users are saying based on the history of the conversation. The founders have a patent pending for the approach.

“[Context understanding] allows us to more intelligently understand user inputs and handle things like changes in topics without having the bots break,” Kashyap says. “One of our biggest pet peeves was, in order to have a successful interaction with a bot, you as a user have to be very unnatural sometimes to convey what you want to convey or the bot won’t understand you.”
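The difference between taking each message at face value and using conversation history can be shown with a deliberately tiny example. This is not Posh's patent-pending context understanding, whose details are not described here; it is a hypothetical keyword-based sketch in which an intent or account missing from the current message is filled in from earlier turns.

```python
INTENT_KEYWORDS = {"balance": "check_balance", "transfer": "transfer_funds"}
ACCOUNTS = ("checking", "savings")


def parse(utterance):
    """Face-value parsing: look only at the current message."""
    words = utterance.lower().replace("?", "").split()
    intent = next((v for k, v in INTENT_KEYWORDS.items() if k in words), None)
    account = next((a for a in ACCOUNTS if a in words), None)
    return intent, account


class Conversation:
    """Toy dialogue state that remembers the most recent intent and account."""

    def __init__(self):
        self.intent = None
        self.account = None

    def interpret(self, utterance):
        intent, account = parse(utterance)
        # Keep what was said earlier unless the new message overrides it.
        self.intent = intent or self.intent
        self.account = account or self.account
        return self.intent, self.account


chat = Conversation()
print(chat.interpret("What's my checking balance?"))  # ('check_balance', 'checking')
print(chat.interpret("What about savings?"))          # ('check_balance', 'savings')
print(parse("What about savings?"))                   # (None, 'savings')
```

The stateless `parse` loses the user's intent on the follow-up question, while the stateful version carries it forward.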

Kashyap says context understanding is a lot easier to accomplish when designing bots for specific industries. That’s why Posh’s founders decided to start by focusing on credit unions.

“The platforms on the market today are almost spreading themselves too thin to make a deep impact in a particular vertical,” Kashyap says. “If you have banks and telcos and health care companies all using the same [chatbot] service, it’s as if they’re all sharing the same customer service rep. It’s difficult to have one person trained across all of these domains meaningfully.”

To onboard a new credit union, Posh uses the customer’s conversational data to train its deep learning model.

“The bots continue to train even after they go live and have actual conversations,” Kashyap says. “We’re always improving it; I don’t think we’ll ever deploy a bot and say it’s done.”

Customers can use Posh’s bots for online chats, voice calls, SMS messaging, and through third party channels like Slack, WhatsApp, and Amazon Echo. Posh also offers an analytics platform to help customers analyze what users are calling about.

For now, Kashyap says he’s focused on quadrupling the number of credit unions using Posh over the next year. Then again, the founders have never let short-term business goals cloud their larger vision for the company.

“Our perspective has always been that [the robot assistant] Jarvis from ‘Iron Man’ and the AI from the movie ‘Her’ are going to be reality sometime soon,” Kashyap says. “Someone has to pioneer the ability for bots to have contextual awareness and memory persistence. I think there’s a lot more that needs to go into bots overall, but we felt by pushing the boundaries a little bit, we’d succeed where other bots would fail, and ultimately people would like to use our bots more than others.”

Improving Verifiability in AI Development

We’ve contributed to a multi-stakeholder report by 58 co-authors at 30 organizations, including the Centre for the Future of Intelligence, Mila, Schwartz Reisman Institute for Technology and Society, Center for Advanced Study in the Behavioral Sciences, and Center for Security and Emerging Technology. This report describes 10 mechanisms to improve the verifiability of claims made about AI systems. Developers can use these tools to provide evidence that AI systems are safe, secure, fair, or privacy-preserving. Users, policymakers, and civil society can use these tools to evaluate AI development processes.

Read Report

While a growing number of organizations have articulated ethics principles to guide their AI development process, it can be difficult for those outside of an organization to verify whether the organization’s AI systems reflect those principles in practice. This ambiguity makes it harder for stakeholders such as users, policymakers, and civil society to scrutinize AI developers’ claims about properties of AI systems and could fuel competitive corner-cutting, increasing social risks and harms. The report describes existing and potential mechanisms that can help stakeholders grapple with questions like:

  • Can I (as a user) verify the claims made about the level of privacy protection guaranteed by a new AI system I’d like to use for machine translation of sensitive documents?
  • Can I (as a regulator) trace the steps that led to an accident caused by an autonomous vehicle? Against what standards should an autonomous vehicle company’s safety claims be compared?
  • Can I (as an academic) conduct impartial research on the risks associated with large-scale AI systems when I lack the computing resources of industry?
  • Can I (as an AI developer) verify that my competitors in a given area of AI development will follow best practices rather than cut corners to gain an advantage?

The 10 mechanisms highlighted in the report are listed below, along with recommendations aimed at advancing each one. (See the report for discussion of how these mechanisms support verifiable claims as well as relevant caveats about our findings.)

Institutional Mechanisms and Recommendations

  1. Third party auditing. A coalition of stakeholders should create a task force to research options for conducting and funding third party auditing of AI systems.
  2. Red teaming exercises. Organizations developing AI should run red teaming exercises to explore risks associated with systems they develop, and should share best practices and tools.
  3. Bias and safety bounties. AI developers should pilot bias and safety bounties for AI systems to strengthen incentives and processes for broad-based scrutiny of AI systems.
  4. Sharing of AI incidents. AI developers should share more information about AI incidents, including through collaborative channels.

Software Mechanisms and Recommendations

  1. Audit trails. Standard setting bodies should work with academia and industry to develop audit trail requirements for safety-critical applications of AI systems.
  2. Interpretability. Organizations developing AI and funding bodies should support research into the interpretability of AI systems, with a focus on supporting risk assessment and auditing.
  3. Privacy-preserving machine learning. AI developers should develop, share, and use suites of tools for privacy-preserving machine learning that include measures of performance against common standards.

Hardware Mechanisms and Recommendations

  1. Secure hardware for machine learning. Industry and academia should work together to develop hardware security features for AI accelerators or otherwise establish best practices for the use of secure hardware (including secure enclaves on commodity hardware) in machine learning contexts.
  2. High-precision compute measurement. One or more AI labs should estimate the computing power involved in a single project in great detail and report on lessons learned regarding the potential for wider adoption of such methods.
  3. Compute support for academia. Government funding bodies should substantially increase funding for computing power resources for researchers in academia, in order to improve the ability of those researchers to verify claims made by industry.

We and our co-authors will be doing further research on these mechanisms and OpenAI will be looking to adopt several of these mechanisms in the future. We hope that this report inspires meaningful dialogue, and we are eager to discuss additional institutional, software, and hardware mechanisms that could be useful in enabling trustworthy AI development. We encourage anyone interested in collaborating on these issues to connect with the corresponding authors and visit the report website.

Read Report

Report Authors
(Equal contribution)
  • Gillian Hadfield (OpenAI, University of Toronto, Schwartz Reisman Institute for Technology and Society)
  • Heidy Khlaaf (Adelard)
  • Jingying Yang (Partnership on AI)
  • Helen Toner (Center for Security and Emerging Technology)
  • Ruth Fong (University of Oxford)
  • Tegan Maharaj (Mila, Montreal Polytechnic)
  • Pang Wei Koh (Stanford University)
  • Sara Hooker (Google Brain)
  • Jade Leung (Future of Humanity Institute)
  • Andrew Trask (University of Oxford)
  • Emma Bluemke (University of Oxford)
  • Jonathan Lebensold (Mila, McGill University)
  • Cullen O’Keefe (OpenAI)
  • Mark Koren (Stanford Centre for AI Safety)
  • Théo Ryffel (École Normale Supérieure, Paris)
  • JB Rubinovitz (Remedy.AI)
  • Tamay Besiroglu (University of Cambridge)
  • Federica Carugati (Center for Advanced Study in the Behavioral Sciences)
  • Jack Clark (OpenAI)
  • Peter Eckersley (Partnership on AI)
  • Sarah de Haas (Google Research)
  • Maritza Johnson (Google Research)
  • Ben Laurie (Google Research)
  • Alex Ingerman (Google Research)
  • Igor Krawczuk (École Polytechnique Fédérale de Lausanne)
  • Amanda Askell (OpenAI)
  • Rosario Cammarota (Intel)
  • Andrew Lohn (RAND Corporation)
  • David Krueger (Mila, Montreal Polytechnic)
  • Charlotte Stix (Eindhoven University of Technology)
  • Peter Henderson (Stanford University)
  • Logan Graham (University of Oxford)
  • Carina Prunkl (Future of Humanity Institute)
  • Bianca Martin (OpenAI)
  • Elizabeth Seger (University of Cambridge)
  • Noa Zilberman (University of Oxford)
  • Seán Ó hÉigeartaigh (Leverhulme Centre for the Future of Intelligence, Centre for the Study of Existential Risk)
  • Frens Kroeger (Coventry University)
  • Girish Sastry (OpenAI)
  • Rebecca Kagan (Center for Security and Emerging Technology)
  • Adrian Weller (University of Cambridge, Alan Turing Institute)
  • Brian Tse (Future of Humanity Institute, Partnership on AI)
  • Elizabeth Barnes (OpenAI)
  • Allan Dafoe (Future of Humanity Institute)
  • Paul Scharre (Center for a New American Security)
  • Ariel Herbert-Voss (OpenAI)
  • Martijn Rasser (Center for a New American Security)
  • Shagun Sodhani (Mila, University of Montreal)
  • Carrick Flynn (Center for Security and Emerging Technology)
  • Thomas Gilbert (University of California, Berkeley)
  • Lisa Dyer (Partnership on AI)
  • Saif Khan (Center for Security and Emerging Technology)
  • Yoshua Bengio (Mila, University of Montreal)
  • Markus Anderljung (Future of Humanity Institute)
(Descending contribution)


Introducing the new TensorFlow Profiler

Posted by Anirudh Sriram, Technical Writer, and Gal Oshri, Product Manager

Performance is a key consideration of successful ML research and production solutions. Faster model training leads to faster iterations and reduced overhead. It is sometimes an essential requirement to make a particular ML solution feasible.

However, it is not always clear what should be optimized. Is there an issue with a specific operation (op), or the input pipeline?

To help answer this, we have developed an extensive set of tools for TensorFlow performance profiling. Beyond the ability to capture and investigate numerous aspects of a profile, the tools offer guidance on how to resolve performance bottlenecks (e.g. input-bound programs).

These tools are used by low-level experts improving TensorFlow’s infrastructure, as well as by engineers working on Google’s most popular products to optimize their model performance. We want to enable the broader community to take advantage of the tools used at Google for performance profiling. That is why we recently open sourced the new TensorFlow Profiler.

TensorFlow Profiler overview page

What is the TensorFlow Profiler?

The TensorFlow Profiler (or the Profiler) provides a set of tools that you can use to measure the training performance and resource consumption of your TensorFlow models. This new version of the Profiler is integrated into TensorBoard, and builds upon existing capabilities such as the Trace Viewer.

The Profiler has the following new profiling tools available:

  • Overview Page: Provides a top-level view of model performance and recommendations to optimize performance
  • Input Pipeline Analyzer: Analyzes your model’s data input pipeline for bottlenecks and recommends changes to improve performance
  • TensorFlow Stats: Displays performance statistics for every TensorFlow operation executed during the profiling session
  • GPU Kernel Stats: Displays performance statistics and the originating operation for every GPU accelerated kernel

Check out the Profiler guide in the TensorFlow documentation to learn more about these tools.

Getting started

The best way to get started with the Profiler is to follow the Colab tutorial here. We will cover a few of the important steps and insights in this blog post. First, we install the Profiler plugin for TensorBoard:

pip install -U tensorboard_plugin_profile

This adds the full Profiler capabilities to our TensorBoard installation. Next, we ensure that our model training captures a profile. In this case, we will use the TensorBoard callback in Keras:

import tensorflow as tf
logs = "logs/profile"  # example log directory; use your own path
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                      profile_batch='500,510')

We can choose which batches to profile with the profile_batch parameter. This enables us to choose the number of steps to capture (recommended to be no more than 10). It also helps us skip the first few batches to avoid inaccuracy due to initialization overhead. There are other methods for capturing a profile, described here. We now start TensorBoard with the following command:

tensorboard --logdir {log directory}    # in terminal
%tensorboard --logdir {log directory} # in Colab
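The reason for skipping the first batches when choosing profile_batch can be seen with made-up numbers. In the sketch below, `step_times_ms` is hypothetical: the first three steps include one-time setup costs, so averaging over every step overstates the steady-state step time, while averaging over a window later in training (the role `profile_batch` plays) does not. The `window` helper uses Python slice indexing and says nothing about exactly how TensorBoard counts batches.

```python
# Hypothetical per-step timings in milliseconds: the first steps include
# one-time costs such as function tracing and memory allocation.
step_times_ms = [400.0, 120.0, 60.0] + [30.0] * 20


def mean(values):
    return sum(values) / len(values)


def window(times, start, stop):
    """Timings for steps start..stop-1, using Python slice semantics."""
    return times[start:stop]


all_steps = mean(step_times_ms)
steady_state = mean(window(step_times_ms, 10, 20))
print(f"all steps: {all_steps:.1f} ms, steady state: {steady_state:.1f} ms")
```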

After clicking on Profile, we see the overview page: This immediately gives us an indication of our program’s performance. Besides a useful summary, we see a recommendation telling us that our program is input-bound (meaning our accelerator is wasting time waiting for input). This is a really common problem.
By following the instructions in the tutorial, we can bring our average step time from ~30ms to ~3ms. That’s a 10x improvement! While this is a toy example, it is common to hear from engineers and researchers at Google that they managed to improve their performance by significant factors.
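An input-bound program leaves the accelerator idle while the next batch is being prepared. One common remedy is to prepare batches in the background while the current one is being processed, which is what `tf.data.Dataset.prefetch` does in a TensorFlow input pipeline. The plain-Python sketch below illustrates the idea with a bounded queue and a producer thread; it is not necessarily the exact fix applied in the tutorial, and `load_batches` is a hypothetical slow input stage.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from iterable while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            # Always unblock the consumer; errors are not re-raised in this sketch.
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def load_batches(n):
    """Hypothetical slow input stage (e.g. decoding and augmenting images)."""
    for i in range(n):
        yield [i] * 4


for batch in prefetch(load_batches(3)):
    print(batch)  # the training step would run here, overlapped with loading
```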


Performance optimization is an iterative process and can sometimes be frustrating as it is tricky to pinpoint the exact location of the bottlenecks in your program. Not only can the Profiler tell you where your program has bottlenecks, it can often also tell you what you can do to resolve them and make your code execute faster. Following the recommendations provided can shorten the overall time taken to optimize your program.
When you open TensorBoard to view the profiling results, the Overview page provides code optimization recommendations below the Step time graph. One of the most common reasons for slow code execution is an improperly configured data input pipeline. Leverage the capabilities of the Input pipeline analyzer to effectively identify and eliminate bottlenecks in your data input pipeline. Read the best practices section of the Profiler guide to learn more about other strategies you can employ to get optimal performance.

What’s next for the TensorFlow Profiler?

In addition to addressing feedback, we are expanding the profiler’s capabilities. A few areas we are currently working on:

  • Memory Profiler: View memory usage over time and the associated op/training step.
  • Keras Analysis: Enable linking the information in the profiler to Keras. This enables, for example, identifying which Keras layers correspond to the ops shown in the trace viewer.
  • Multiworker GPU Analysis: Enable profiling multiple GPU workers and aggregating the results. Analyze hotspots and the communication across workers.

We are excited to continue bringing the tools used at Google to improve ML performance to the broader community. If there are specific capabilities that would help you most, or if you want to report a bug, feel free to open an issue here!

How TensorFlow Lite helps you from prototype to product

Posted by Khanh LeViet, Developer Advocate

TensorFlow Lite is the official framework for running inference with TensorFlow models on edge devices. TensorFlow Lite is deployed on more than 4 billion edge devices worldwide, supporting Android, iOS, Linux-based IoT devices and microcontrollers.

Since its first launch in late 2017, we have been improving TensorFlow Lite to make it robust while keeping it easy to use for all developers – from machine learning experts to mobile developers who are just starting to learn about machine learning.

In this blog, we will highlight recent launches that made it easier for you to go from prototyping an on-device use case to deploying in production.
If you prefer a video format, check out this talk from TensorFlow DevSummit 2020.

Prototype: jump-start with state-of-the-art models

As machine learning is a very fast-moving field, it is important to know what is possible with current technologies before investing resources into building a feature. We have a repository of pretrained models and sample applications that implement them, so that you can try out TensorFlow Lite models on real devices without writing any code. Then, you can quickly integrate the models into your application to prototype and test what your user experience will be like before spending time on training your own model.

We have published several new pretrained models, including a question & answer model and a style transfer model.

We are also committed to bringing more state-of-the-art models from research teams to TensorFlow Lite. Recently we have enabled 3 new model architectures: EfficientNet-Lite (paper), MobileBERT (paper) and ALBERT-Lite (paper).

  • EfficientNet-Lite is a novel image classification model that achieves state-of-the-art accuracy with an order of magnitude fewer computations and parameters. It is optimized for TensorFlow Lite, supports quantization with negligible accuracy loss, and is fully supported by the GPU delegate for faster inference. Find out more in our blog post.
    Benchmark on Pixel 4 CPU, 4 Threads, March 2020
  • MobileBERT is an optimized version of the popular BERT (paper) model that achieved state-of-the-art accuracy on a range of NLP tasks, including question and answer, natural language inference and others. MobileBERT is about 4x faster and smaller than BERT but retains similar accuracy.
  • ALBERT is another lightweight version of BERT that was optimized for model size while retaining the same accuracy. ALBERT-Lite is the TensorFlow Lite compatible version of ALBERT, which is 6x smaller than BERT, or 1.5x smaller than MobileBERT, while its latency is on par with BERT.
Benchmark on Pixel 4 CPU, 4 Threads, March 2020
Model hyperparameters: Sequence length 128, Vocab size 30K

Develop model: without ML expertise, create models for your dataset

When bringing state-of-the-art research models to TensorFlow Lite, we also want to make it easier for you to customize these models to your own use cases. We are excited to announce TensorFlow Lite Model Maker, an easy-to-use tool to adapt state-of-the-art machine learning models to your dataset with transfer learning. It wraps the complex machine learning concepts in an intuitive API, so that everyone can get started without any machine learning expertise. You can train a state-of-the-art image classification model with only 4 lines of code:

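# Load images from a folder that has one subfolder per label.
data = ImageClassifierDataLoader.from_folder('flower_photos/')
# Retrain a pre-trained model on this dataset with transfer learning.
model = image_classifier.create(data)
# Evaluate the trained model.
loss, accuracy = model.evaluate()
# Export the TensorFlow Lite model and label file, with metadata attached.
model.export('flower_classifier.tflite', 'flower_label.txt', with_metadata=True)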

Model Maker supports many state-of-the-art models that are available on TensorFlow Hub, including the EfficientNet-Lite models. If you want to get higher accuracy, you can switch to a different model architecture by changing just one line of code while keeping the rest of your training pipeline.

# EfficientNet-Lite2.
model = image_classifier.create(data, efficientnet_lite2_spec)

# ResNet 50.
model = image_classifier.create(data, resnet_50_spec)

Model Maker currently supports two use cases: image classification (tutorial) and text classification (tutorial), with more computer vision and NLP use cases coming soon.
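Model Maker's internals are not shown in this post. As a rough, hypothetical illustration of the transfer-learning idea it relies on, the sketch below keeps a "backbone" feature extractor frozen and trains only a small logistic-regression head on its outputs. Swapping the backbone is a one-argument change, loosely mirroring the `efficientnet_lite2_spec` / `resnet_50_spec` switch above. The names `create`, `toy_backbone_a` and `toy_backbone_b` are made up for this example and are not Model Maker APIs; the toy backbones stand in for pretrained networks.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "backbone" maps raw inputs to a feature vector. It stays frozen:
# training below only updates the head's weights, never the backbone.
Backbone = Callable[[Sequence[float]], List[float]]


def toy_backbone_a(x: Sequence[float]) -> List[float]:
    # Hypothetical frozen feature extractor: raw values plus their sum.
    return list(x) + [sum(x)]


def toy_backbone_b(x: Sequence[float]) -> List[float]:
    # A different hypothetical extractor: squared values plus the mean.
    return [v * v for v in x] + [sum(x) / len(x)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def create(data: List[Tuple[Sequence[float], int]], backbone: Backbone,
           epochs: int = 200, lr: float = 0.5) -> Callable[[Sequence[float]], int]:
    """Train a logistic-regression head (SGD on log loss) over frozen features."""
    feats = [(backbone(x), y) for x, y in data]  # features computed once
    w, b = [0.0] * len(feats[0][0]), 0.0
    for _ in range(epochs):
        for f, y in feats:
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err

    def predict(x: Sequence[float]) -> int:
        f = backbone(x)
        return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

    return predict


# Hypothetical two-class toy dataset.
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]
model_a = create(data, toy_backbone_a)
model_b = create(data, toy_backbone_b)  # one-argument backbone swap
print([model_a(x) for x, _ in data], [model_b(x) for x, _ in data])
```

Only the head is trained, which is why this style of transfer learning needs little data and compute; the rest of the pipeline is unchanged when the backbone changes.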

Develop model: attach metadata for seamless model exchange

The TensorFlow Lite file format has always had the input/output tensor shape in its metadata. This works well when the model creator is also the app developer. However, as the on-device machine learning ecosystem grows, these tasks are increasingly performed by different teams within an organization or even between organizations. To make this exchange of model knowledge easier, we have added new fields to the metadata. They fall into two broad categories:

  1. Machine-readable parameters – e.g. normalization parameters such as mean and standard deviation, category label files. These parameters can be read by other systems so wrapper code can be generated. You can see an example of this in the next section.
  2. Human-readable parameters – e.g. model description, model license. These give the app developer using the model crucial information on how to use it correctly: are there strengths or weaknesses they should be aware of? Fields like licenses can also be critical in deciding whether a model can be used at all. Having this information attached to the model significantly lowers the barrier to adoption.

To supercharge this effort, models created by TensorFlow Lite Model Maker and image-related TensorFlow Lite models on TensorFlow Hub already have metadata attached to them. If you are creating your own model, you can attach metadata to make sharing models easier.

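import flatbuffers
from tflite_support import metadata as _metadata
from tflite_support import metadata_schema_py_generated as _metadata_fb

# Creates model info.
model_meta = _metadata_fb.ModelMetadataT()
model_meta.name = "MobileNetV1 image classifier"
model_meta.description = ("Identify the most prominent object in the "
                          "image from a set of 1,001 categories such as "
                          "trees, animals, food, vehicles, person etc.")
model_meta.version = "v1"
model_meta.author = "TensorFlow"
model_meta.license = ("Apache License. Version 2.0 "
                      "http://www.apache.org/licenses/LICENSE-2.0.")
# Describe input and output tensors
# ...

# Writing the metadata to your model
model_file = "mobilenet_v1.tflite"  # hypothetical path to your .tflite model
b = flatbuffers.Builder(0)
b.Finish(model_meta.Pack(b),
         _metadata.MetadataPopulator.METADATA_FILE_IDENTIFIER)
metadata_buf = b.Output()
populator = _metadata.MetadataPopulator.with_model_file(model_file)
populator.load_metadata_buffer(metadata_buf)
populator.load_associated_files(["labels.txt"])  # hypothetical label file
populator.populate()  # writes metadata and associated files into model_file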

For a complete example of how we populate the metadata for MobileNet v1, please refer to this guide.
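As a hypothetical illustration of why machine-readable normalization parameters (mean and standard deviation) matter, the sketch below shows the preprocessing a wrapper would derive from them: each input value x becomes (x - mean) / std. The dictionary layout and the 127.5 / 127.5 values are illustrative assumptions, not the actual TensorFlow Lite metadata schema.

```python
from typing import Dict, List

# Hypothetical, simplified stand-in for the normalization fields a model's
# metadata might carry. This is NOT the real TensorFlow Lite metadata schema.
example_metadata: Dict[str, List[float]] = {
    "mean": [127.5],
    "std": [127.5],
}


def normalize(pixels: List[float], meta: Dict[str, List[float]]) -> List[float]:
    """Apply (x - mean) / std using parameters read from metadata.

    A single mean/std value is applied to every pixel; per-channel
    normalization would need one entry per channel instead.
    """
    mean, std = meta["mean"][0], meta["std"][0]
    return [(p - mean) / std for p in pixels]


# Maps 8-bit pixel values from [0, 255] into [-1, 1].
print(normalize([0.0, 127.5, 255.0], example_metadata))  # [-1.0, 0.0, 1.0]
```

Because the values travel with the model, the app developer does not have to guess them or copy them from a separate document.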

Develop app: automatically generate code from model

Instead of copying and pasting error-prone boilerplate code to convert typed objects such as a Bitmap into a ByteArray for the TensorFlow Lite interpreter, you can use a code generator that produces ready-to-integrate wrapper code from the machine-readable parts of the metadata.
You can use our first code generator, built for Android, to generate model wrappers. We are also working on integrating this tool into Android Studio.
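The Android code generator's internals are not described in this post. The general idea of turning machine-readable metadata into wrapper code can be sketched in a few lines of Python: the hypothetical `generate_wrapper` below reads normalization parameters and a label list from a simplified metadata dictionary and emits the source of preprocessing and postprocessing functions. The field names, function names and labels are illustrative assumptions, not the real generator's input or output.

```python
from typing import Dict, List


def generate_wrapper(meta: Dict) -> str:
    """Emit Python source for a classifier wrapper from simplified metadata.

    `meta` is a hypothetical dict with "name", "mean", "std" and "labels";
    the generated code bakes those values in, so callers never copy them by hand.
    """
    return (
        f"# Generated wrapper for {meta['name']}\n"
        f"def preprocess(pixels):\n"
        f"    # Normalization constants taken from the model metadata.\n"
        f"    return [(p - {meta['mean']!r}) / {meta['std']!r} for p in pixels]\n"
        f"\n"
        f"LABELS = {meta['labels']!r}\n"
        f"\n"
        f"def postprocess(scores):\n"
        f"    # Return the label with the highest score.\n"
        f"    best = max(range(len(scores)), key=lambda i: scores[i])\n"
        f"    return LABELS[best]\n"
    )


meta = {"name": "flower_classifier", "mean": 127.5, "std": 127.5,
        "labels": ["daisy", "roses", "tulips"]}
source = generate_wrapper(meta)
namespace: Dict = {}
exec(source, namespace)  # load the generated code
print(namespace["preprocess"]([0.0, 255.0]))      # [-1.0, 1.0]
print(namespace["postprocess"]([0.1, 0.7, 0.2]))  # roses
```

Generating the code once from metadata, rather than hand-writing it per app, keeps the constants and labels consistent with the model file.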

Develop app: discover performance with the benchmark and profiling tools

Once a model is created, you will want to check how it performs on mobile devices. TensorFlow Lite provides benchmark tools to measure model performance. We have added support for running benchmarks with all runtime options, including running models on the GPU or other supported hardware accelerators, specifying the number of threads, and more. You can also get an inference latency breakdown down to the level of a single operation, to identify the most time-consuming operations and optimize your model's inference.
After integrating a model into your application, you may encounter other performance issues that call for platform-provided profiling tools. For example, on Android, you can investigate performance issues with various tracing tools. We have launched a TensorFlow Lite performance tracing module on Android that lets you look into TensorFlow Lite internals. It is installed by default in our nightly release. With tracing, you can find out whether there is resource contention during inference. Please refer to our documentation to learn how to use the module with the Android benchmark tool.
We will continue improving TensorFlow Lite performance tooling to make it more intuitive and more helpful for measuring and tuning TensorFlow Lite performance on various devices.
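TensorFlow Lite's benchmark tool produces the per-operation latency breakdown itself. To illustrate how such a breakdown is used, the hypothetical sketch below times each stage of a toy pipeline with `time.perf_counter`, then ranks stages by average latency and share of total time so the most expensive one stands out. The stage names and workloads are made up, and this is not TensorFlow Lite's profiler.

```python
import time
from typing import Callable, Dict, List, Tuple


def profile_ops(ops: Dict[str, Callable[[], object]], runs: int = 5
                ) -> List[Tuple[str, float, float]]:
    """Time each op over several runs.

    Returns (name, average ms per run, percent of total), slowest first.
    """
    totals = {name: 0.0 for name in ops}
    for _ in range(runs):
        for name, op in ops.items():
            start = time.perf_counter()
            op()
            totals[name] += time.perf_counter() - start
    grand_total = sum(totals.values()) or 1.0  # guard against a zero total
    report = [(name, 1000.0 * t / runs, 100.0 * t / grand_total)
              for name, t in totals.items()]
    return sorted(report, key=lambda row: row[1], reverse=True)


# Hypothetical stand-ins for model operations with very different costs.
toy_ops = {
    "CONV_2D": lambda: sum(i * i for i in range(200_000)),
    "RELU": lambda: [max(0, i) for i in range(20_000)],
    "SOFTMAX": lambda: sum(range(1_000)),
}

for name, avg_ms, pct in profile_ops(toy_ops):
    print(f"{name:10s} {avg_ms:8.3f} ms {pct:5.1f}%")
```

Averaging over several runs smooths out timer noise; the ranked list points at the operation worth optimizing first.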

Deploy: easily scale to multiple platforms

Nowadays, most applications need to support multiple platforms. That’s why we built TensorFlow Lite to work seamlessly across platforms: Android, iOS, Raspberry Pi, and other Linux-based IoT devices. All TensorFlow Lite models will just work out-of-the-box on any officially supported platforms, so that you can focus on creating good models instead of worrying about how to adapt your models to different platforms.
Each platform has its own hardware accelerators that can be used to speed up model inference. TensorFlow Lite already supports running models through NNAPI on Android and on the GPU on both iOS and Android. We are excited to add more hardware accelerators:

  • On Android, we have added support for the Qualcomm Hexagon DSP, which is available on millions of devices. This lets developers use the DSP on older devices running versions below Android 8.1, where the Android NN API is unavailable.
  • On iOS, we have launched the Core ML delegate to allow running TensorFlow Lite models on Apple’s Neural Engine.
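Which accelerator to use depends on the device. The sketch below is a hypothetical selection policy, not a TensorFlow Lite API: it prefers the Core ML delegate on iOS devices with a Neural Engine, NNAPI on Android 8.1 (API level 27) and later, the Hexagon DSP on older Android devices that have one, then the GPU, and always falls back to the CPU. The `Device` fields and the ordering are illustrative assumptions; a real app should benchmark each option on its target devices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    # Hypothetical capability flags; a real app would query the platform.
    platform: str  # "android" or "ios"
    android_api_level: int = 0
    has_hexagon_dsp: bool = False
    has_gpu: bool = True
    has_neural_engine: bool = False


NNAPI_MIN_API_LEVEL = 27  # Android 8.1


def delegate_preferences(d: Device) -> List[str]:
    """Return candidate delegates in (assumed) preference order, ending with CPU."""
    prefs: List[str] = []
    if d.platform == "ios" and d.has_neural_engine:
        prefs.append("coreml")
    if d.platform == "android":
        if d.android_api_level >= NNAPI_MIN_API_LEVEL:
            prefs.append("nnapi")
        elif d.has_hexagon_dsp:
            # The DSP path matters most where NNAPI is unavailable.
            prefs.append("hexagon")
    if d.has_gpu:
        prefs.append("gpu")
    prefs.append("cpu")  # always available as a fallback
    return prefs


print(delegate_preferences(Device("android", android_api_level=26, has_hexagon_dsp=True)))
# ['hexagon', 'gpu', 'cpu']
print(delegate_preferences(Device("ios", has_neural_engine=True)))
# ['coreml', 'gpu', 'cpu']
```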

We have also continued to improve performance on the existing supported platforms, as you can see from the graph below comparing performance between May 2019 and February 2020. You only need to upgrade to the latest version of the TensorFlow Lite library to benefit from these improvements.

Pixel 4 – Single Threaded CPU, February 2020

Future work

Over the coming months, we will work on supporting more use cases and improving developer experiences:

  • Continuously release up-to-date state-of-the-art on-device models, including better support for BERT-family models for NLP tasks and new vision models.
  • Publish new tutorials and examples demonstrating more use cases, including how to use C/C++ APIs for inference on mobile.
  • Enhance Model Maker to support more tasks including object detection and several NLP tasks. We will add BERT support for NLP tasks, such as question and answer. This will empower developers without machine learning expertise to build state-of-the-art NLP models through transfer learning.
  • Expand the metadata and codegen tools to support more use cases, including object detection and more NLP tasks.
  • Launch more platform integration for even easier end-to-end experience, including better integration with Android Studio and TensorFlow Hub.


We are committed to continuing to improve TensorFlow Lite, and we look forward to seeing what you build with it and hearing your feedback. Share your use cases with us directly or on Twitter with the hashtags #TFLite and #PoweredByTF. To report bugs and issues, please reach out to us on GitHub.


Thanks to Amy Jang, Andrew Selle, Arno Eigenwillig, Arun Venkatesan, Cédric Deltheil, Chao Mei, Christiaan Prins, Denny Zhou, Denis Brulé, Elizabeth Kemp, Hoi Lam, Jared Duke, Jordan Grimstad, Juho Ha, Jungshik Jang, Justin Hong, Hongkun Yu, Karim Nosseir, Khanh LeViet, Lawrence Chan, Lei Yu, Lu Wang, Luiz Gustavo Martins, Maxime Brénon, Mia Roh, Mike Liang, Mingxing Tan, Renjie Liu, Sachin Joglekar, Sarah Sirajuddin, Sebastian Goodman, Shiyu Hu, Shuangfeng Li, Sijia Ma, Tei Jeong, Tian Lin, Tim Davis, Vojtech Bardiovsky, Wei Wei, Wouter van Oortmerssen, Xiaodan Song, Xunkai Zhang, YoungSeok Yoon, Yuqi Li, Yi Zhou, Zhenzhong Lan, Zhiqing Sun and more.

OpenAI Microscope

We’re introducing OpenAI Microscope, a collection of visualizations of every significant layer and neuron of eight vision “model organisms” which are often studied in interpretability. Microscope makes it easier to analyze the features that form inside these neural networks, and we hope it will help the research community as we move towards understanding these complicated systems.

The abilities of modern neural networks are the result of the interactions of thousands of neurons (sometimes tens of thousands or more!). In order to understand their behavior, we’d like to be able to quickly and easily investigate these neurons’ interactions in detail, and share those observations. This is especially true in collaborative environments. For instance, one researcher might speculate:

InceptionV1 4c:447 is a car detector which is built from a wheel detector (4b:373) and a window detector (4b:237).

When someone makes a claim like this, it’s useful if others can quickly explore those neurons, evaluating the claim and discovering new things. This is the goal of the OpenAI Microscope.

Microscope systematically visualizes every neuron in several commonly studied vision models, and makes all of those neurons linkable. We hope this will support the interpretability community in several ways:

  1. Although these models and visualizations are already open source (we help maintain the lucid library, which is used to generate all the visualizations in Microscope), visualizing neurons is tedious. Microscope changes the feedback loop of exploring neurons from minutes to seconds. This quick feedback loop has been essential for us in discovering unexpected features like high-low frequency detectors in the ongoing circuits project.
  2. Making models and neurons linkable allows immediate scrutiny and further exploration of research making claims about those neurons. It also removes potential confusion about which model and neuron is being discussed (which of the five versions of InceptionV1 are we talking about again?). This is really helpful for collaboration, especially when researchers are at different institutions.
  3. One of the wonderful things about interpretability as an area of ML is how accessible it is. Compared to many other areas, it requires comparatively little access to compute. But systematically visualizing neural networks can still take hundreds of GPU hours. We hope that, by sharing our visualizations, we can help keep interpretability highly accessible.
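Part of the value of linkable neurons is that an informal reference such as "InceptionV1 4c:447" becomes unambiguous. The hypothetical sketch below parses that `layer:unit` notation, as used in the car-detector claim earlier, into a structured record and builds a stable identifier from it. The identifier format is invented for illustration and is not Microscope's actual URL scheme.

```python
import re
from dataclasses import dataclass

# Matches references such as "InceptionV1 4c:447": model, layer, unit index.
NEURON_REF = re.compile(r"^(?P<model>\w+)\s+(?P<layer>\w+):(?P<unit>\d+)$")


@dataclass(frozen=True)
class NeuronRef:
    model: str
    layer: str
    unit: int

    def link_id(self) -> str:
        # Hypothetical stable identifier; not Microscope's real URL format.
        return f"{self.model.lower()}/{self.layer}/{self.unit}"


def parse_neuron_ref(text: str) -> NeuronRef:
    m = NEURON_REF.match(text.strip())
    if m is None:
        raise ValueError(f"not a neuron reference: {text!r}")
    return NeuronRef(m["model"], m["layer"], int(m["unit"]))


car = parse_neuron_ref("InceptionV1 4c:447")
parts = [parse_neuron_ref(r) for r in ("InceptionV1 4b:373", "InceptionV1 4b:237")]
print(car.link_id())                 # inceptionv1/4c/447
print([p.link_id() for p in parts])  # ['inceptionv1/4b/373', 'inceptionv1/4b/237']
```

A frozen dataclass makes each reference hashable, so references can be used as dictionary keys when recording which neurons a claim depends on.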

Just as biologists often focus on the study of a few “model organisms,” Microscope focuses on exploring a small number of models in detail. Our initial release includes eight frequently studied vision models, along with several visualization techniques we’ve found particularly useful in studying them. We plan to expand to other models and techniques in the coming months.
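One of the techniques behind such images is feature visualization: optimizing an input so that it strongly activates a chosen neuron. As a toy illustration of that idea (not lucid's implementation), the sketch below runs gradient ascent on the input of a single linear neuron a(x) = w·x, with an L2 penalty λ‖x‖² to keep the input bounded. For this objective the optimum has the closed form x* = w / (2λ), which makes the result easy to check. The weights and hyperparameters are arbitrary.

```python
from typing import List


def visualize_neuron(w: List[float], lam: float = 0.5, lr: float = 0.1,
                     steps: int = 500) -> List[float]:
    """Toy feature visualization: gradient ascent on w.x - lam * ||x||^2."""
    x = [0.0] * len(w)  # start from a blank "image"
    for _ in range(steps):
        # Gradient of the objective: w - 2 * lam * x
        grad = [wi - 2.0 * lam * xi for wi, xi in zip(w, x)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


w = [0.5, -1.0, 2.0]  # arbitrary weights of a toy neuron
x_star = visualize_neuron(w)
print([round(v, 4) for v in x_star])  # [0.5, -1.0, 2.0], i.e. w / (2 * 0.5)
```

Real feature visualization optimizes image pixels through a deep network and adds stronger regularizers, but the loop has the same shape: compute the gradient of the activation with respect to the input and step uphill.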

We’re excited to see how the community will use Microscope, and we encourage you to reuse these assets. In particular, we think it has a lot of potential in supporting the Circuits collaboration—a project to reverse engineer neural networks by analyzing individual neurons and their connections—or similar work.
