PluggableDevice: Device Plugins for TensorFlow

Posted by Penporn Koanantakool and Pankaj Kanwar.

As the number of accelerators (GPUs, TPUs) in the ML ecosystem has exploded, there has been a strong need for seamless integration of new accelerators with TensorFlow. In this post, we introduce the PluggableDevice architecture which offers a plugin mechanism for registering devices with TensorFlow without the need to make changes in TensorFlow code.

This PluggableDevice architecture has been designed & developed collaboratively within the TensorFlow community. It leverages the work done for Modular TensorFlow, and is built using the StreamExecutor C API. The PluggableDevice mechanism is available in TF 2.5.

The need for seamless integration

Prior to this, integrating a new device required changes to core TensorFlow code. This approach did not scale well, for several reasons:

  • Complex build dependencies and compiler toolchains. Onboarding a new compiler is nontrivial and adds to the technical complexity of the product.
  • Slow development time. Changes need code reviews from the TensorFlow team, which can take time. Added technical complexity also adds to the development and testing time for new features.
  • Combinatorial number of build configurations to test for. The changes made for a particular device might affect other devices or other components of TensorFlow. Each new device could increase the number of test configurations in a multiplicative manner.
  • Easy to break. The lack of a contract via a well-defined API means that it’s easier to break a particular device.

What is PluggableDevice?

The PluggableDevice mechanism requires no device-specific changes in the TensorFlow code. It relies on C APIs to communicate with the TensorFlow binary in a stable manner. Plug-in developers maintain separate code repositories and distribution packages for their plugins and are responsible for testing their devices. This way, TensorFlow’s build dependencies, toolchains, and test process are not affected. The integration is also less brittle since only changes to the C APIs or PluggableDevice components could affect the code.

The PluggableDevice mechanism has four main components:

  • PluggableDevice type: A new device type in TensorFlow which allows device registration from plug-in packages. It takes priority over native devices during the device placement phase.
  • Custom operations and kernels: Plug-ins register their own operations and kernels to TensorFlow through the Kernel and Op Registration C API.
  • Device execution and memory management: TensorFlow manages plug-in devices through the StreamExecutor C API.
  • Custom graph optimization pass: Plug-ins can register one custom graph optimization pass, which will be run after all standard Grappler passes, through the Graph Optimization C API.
How a device plug-in interacts with TensorFlow.

Using PluggableDevice

To use a particular device just like a native device in TensorFlow, users only have to install the device plug-in package for that device. The following code snippet shows how the plug-in for a new device, say Awesome Processing Unit (APU), would be installed and used. For simplicity, assume this APU plug-in has only one custom kernel, for ReLU.

$ pip install tensorflow-apu-0.0.1-cp36-cp36m-linux_x86_64.whl

Successfully installed tensorflow-apu-0.0.1
$ python
Python 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf # TensorFlow registers PluggableDevices here
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:APU:0', device_type='APU')]

>>> a = tf.random.normal(shape=[5], dtype=tf.float32) # Runs on CPU
>>> b = tf.nn.relu(a) # Runs on APU

>>> with tf.device("/APU:0"):  # Users can also use 'with tf.device' syntax
...   c = tf.nn.relu(a)  # Runs on APU

>>> @tf.function  # Defining a tf.function
... def run():
...   d = tf.random.uniform(shape=[100], dtype=tf.float32)  # Runs on CPU
...   e = tf.nn.relu(d)  # Runs on APU
>>> run()  # PluggableDevices also work with tf.function and graph mode.
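
To double-check where individual ops are placed, you can also turn on TensorFlow's device placement logging. A minimal sketch, continuing the hypothetical APU session above:

>>> tf.debugging.set_log_device_placement(True)  # Log the device chosen for each op
>>> f = tf.nn.relu(a)  # The placement log should report this op on APU:0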

Upcoming PluggableDevices

We are excited to announce that Intel will be one of our first partners to release a PluggableDevice. Intel has made significant contributions to this effort, submitting multiple RFCs that implement the overall mechanism. They will release an Intel Extension for TensorFlow (ITEX) plug-in package to bring Intel XPU to TensorFlow for AI workload acceleration. We also expect other partners to take advantage of PluggableDevice and release additional plug-ins.

We will publish a detailed tutorial on how to develop a PluggableDevice plug-in for partners who might be interested in leveraging this infrastructure. For questions on PluggableDevice, engineers can post directly on the RFC PRs [1, 2, 3, 4, 5, 6] or on the TensorFlow Forum with the tag pluggable_device.


How TensorFlow helps Edge Impulse make ML accessible to embedded engineers

Posted by Daniel Situnayake, Founding TinyML Engineer, Edge Impulse.

Microcontrollers that run our world

No matter where you are reading this right now—your home, your office, or sitting in a vehicle—you are likely surrounded by microcontrollers. They are the tiny, low-power computers that animate our modern world: from smart watches and kitchen appliances to industrial equipment and public transportation. Mostly hidden inside other products, microcontrollers are actually the most numerous type of computer, with more than 28 billion of them shipped in 2020.

The software that powers all these devices is written by embedded software engineers. They’re some of the most talented, detail-oriented programmers in the industry, tasked with squeezing every last drop of efficiency from tiny, inexpensive processors. A typical mid-range microcontroller—based around Arm’s popular Cortex-M4 architecture—might have a 32-bit processor running at just 64 MHz, with 256KB of RAM and 1MB of flash memory for storing a program. That doesn’t leave a lot of room for waste.

Since microcontrollers interface directly with sensors and hardware, embedded engineers are often experts in signal processing and electrical engineering—and they tend to have a lot of domain knowledge in their area of focus. One engineer might be an expert on the niche sensors used for medical applications, while another might focus on analyzing audio signals.

Embedded machine learning

In the past few years, a set of technologies has been developed that makes it possible to run miniature, highly optimized machine learning models on low-power microcontrollers like the one described above. By using machine learning to interpret sensor data right at the source, embedded applications can become smarter, faster, and more energy efficient, making their own decisions rather than having to stream data to the cloud and wait for a response. This concept is known as embedded machine learning, or TinyML.

With their deep signal processing and domain expertise, embedded engineers are ideally placed to design this new generation of smart applications. However, embedded engineers tend to have highly specialized skill sets and use development toolchains that are often far removed from the Python-heavy stack preferred by data scientists and machine learning engineers.

It isn’t reasonable to expect domain experts to retrain as data scientists, or for data scientists to learn the embedded development skills required to work with microcontrollers. Instead, a new generation of tooling is required that will allow those with domain expertise to capture their knowledge and insight as machine learning models and deploy them to embedded devices—with help from machine learning experts as an optional extra.

The TinyML development process is similar to the traditional machine learning workflow. It starts with collecting, exploring, and evaluating a dataset. Next up, feature engineering takes the form of sophisticated digital signal processing, often using the types of algorithms that embedded engineers are already familiar with. Once features have been extracted from the data, a machine learning model is trained and evaluated—with a critical eye on its size, to make sure it will fit on a tiny microcontroller and run fast enough to be useful.

After the training, the model is optimized for size and efficiency. This often involves quantization, reducing the precision of the model’s weights so that they take up less precious memory. Once the model is ready, it must be deployed as a C++ library (the language of choice for the majority of embedded platforms) that includes all of the operator kernels required to run it. The embedded engineer can then write and tune an application that interprets the model’s output and uses it to make decisions.

Throughout this process, it’s important to carefully evaluate the model and application to ensure that it functions in the way that it is intended to when used in a real world environment. Without adequate monitoring and review, it’s possible to create models that seem superficially accurate but that fail in harmful ways when exposed to real world data.

Edge Impulse and TensorFlow

The Edge Impulse team has created an end-to-end suite of tooling that helps embedded engineers and domain experts build and test machine learning applications. Edge Impulse is designed to integrate beautifully with the tools that embedded engineers use every day, providing a high-level interface for incorporating machine learning into projects.

Edge Impulse makes use of the TensorFlow ecosystem for training, optimizing, and deploying deep learning models to embedded devices. While it was designed with non-ML engineers in mind, the philosophy behind Edge Impulse is that it should be extensible by machine learning experts and flexible enough to incorporate their insights and additions—from hand-tuned model architectures and loss functions to custom operator kernels.

This extensibility is made possible by the TensorFlow ecosystem, which provides a set of standards and integration points that experts can use to make their own improvements.

Training a tiny model

This process starts during training. Novice ML developers using Edge Impulse can use a library of preset deep learning model architectures designed to work well with embedded devices. For example, this simple convolutional model is intended for classifying ambient noise:

Neural network architecture

Under the hood, Edge Impulse generates a Python implementation of the model using TensorFlow’s Keras APIs. More experienced developers can customize the layers of the deep learning network, tweaking parameters and adding new layers that are reflected in the underlying Keras model. And expert developers have access to edit the training code itself, directly within the UI:

code snippet

Since Edge Impulse uses TensorFlow libraries and APIs, it’s incredibly simple to extend the built-in training code with your own logic. For example, the tf.data.Dataset class is used to provide an efficient pipeline to the training and validation datasets. This pipeline can easily be extended to add transformations, such as the data augmentation function seen in the following screenshot from an image classification project:

code snippet
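
As a rough, hypothetical sketch (not Edge Impulse's actual code), adding an augmentation step to a tf.data pipeline might look like this, where `train_dataset` stands in for the dataset built by the training script:

import tensorflow as tf

def augment(image, label):
  # Simple, illustrative augmentations; the transforms used in practice may differ.
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  return image, label

# `train_dataset` is assumed to be an existing tf.data.Dataset of (image, label) pairs.
train_dataset = train_dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)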

For in-depth experiments, developers can download a Jupyter Notebook containing all of the dependencies required to run their training script locally.

Jupyter Notebook

Any custom model code using the TensorFlow APIs fits seamlessly into the end-to-end pipeline hosted by Edge Impulse. Training is run in the cloud, and trained models are automatically optimized for embedded deployment using a combination of TensorFlow utilities and Edge Impulse’s own open source technologies.

Model optimization

Quantization is the most common form of optimization used when deploying deep learning models to embedded devices. Edge Impulse uses TensorFlow’s Model Optimization Toolkit to quantize models, reducing their weights’ precision from float32 to int8 with minimal impact on accuracy.
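
Edge Impulse's exact pipeline isn't shown here, but as a rough sketch, post-training int8 quantization with the TensorFlow Lite converter looks roughly like the following, where `saved_model_dir` and `representative_images` are placeholders for your own model and calibration data:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
  # Yield a small sample of real inputs so the converter can calibrate value ranges.
  for image in representative_images[:100]:  # placeholder calibration data
    yield [tf.cast(image[tf.newaxis, ...], tf.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
  f.write(tflite_model)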

Using TensorFlow Lite for Microcontrollers along with the emulation software Renode, Edge Impulse provides developers with an accurate estimate of the latency and memory usage of their model once it is deployed to the target embedded device. This makes it easy to determine the impact of optimizations such as quantization across different slices of the dataset:

A comparison between int8 quantized and unoptimized versions of the same model, showing the difference in performance and results.

For maximum flexibility and compatibility with developers’ existing workflows, the trained model is available for download in multiple formats. Developers can choose to export the original model as a TensorFlow SavedModel, or download one of several optimized models using the portable TensorFlow Lite flatbuffer format:

Download links for models serialized using TensorFlow’s SavedModel and TensorFlow Lite formats.
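
As an illustrative sketch (assuming you have downloaded a quantized model as `model_int8.tflite`), running it locally with the TensorFlow Lite interpreter could look like this:

import numpy as np
import tensorflow as tf

# Load the downloaded TensorFlow Lite model (the path is a placeholder).
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed a dummy input with the expected shape and dtype, then run inference.
dummy_input = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], dummy_input)
interpreter.invoke()
print(interpreter.get_tensor(output_details["index"]))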

Deployment

Once a model has been trained and tested there are multiple ways to deploy it to the target device. Embedded engineers work heavily with C++, so the standard option is to export a C++ SDK: a library of optimized source code that implements both the signal processing pipeline and the deep learning model. The SDK has a permissive open source license, so developers are free to use it in any project or share it with others.

There are two main options for running deep learning models, both of which make use of TensorFlow technologies. The first, Edge Impulse’s EON Compiler, is a code generation tool that converts TensorFlow Lite models into human readable C++ programs.

Enabling EON Compiler
Enabling EON Compiler can reduce memory usage by up to 50% with no impact on model accuracy.

EON Compiler makes use of the operator kernels implemented in TensorFlow Lite for Microcontrollers, invoking them in an efficient manner that doesn’t require the use of an interpreter. This results in memory savings of up to 50%. It automatically applies any available optimized kernels for the target device, meaning libraries such as Arm’s CMSIS-NN will be used where appropriate.

Some projects benefit from additional flexibility. In these cases, developers can choose to export a library that uses the TensorFlow Lite for Microcontrollers interpreter to run the model. This can be useful for developers who wish to experiment with custom kernel implementations for their specific hardware, or who are working within an environment that has TensorFlow Lite for Microcontrollers built in.

In addition to the C++ SDK, developers can choose to target specific environments. For example, a TensorRT library provides optimized support for NVidia’s Jetson Nano embedded Linux developer kit. This interoperability is enabled by the extensive TensorFlow ecosystem and open source community, which has tooling for numerous platforms and targets.

TensorRT library
Models can be optimized and exported for targets in the broader TensorFlow ecosystem, such as NVidia’s Jetson Nano.

Enabling new technologies

TensorFlow is unique amongst deep learning frameworks due to its broad, mature, and extensible set of technologies for training and deploying models to embedded devices. TensorFlow formats, such as the TensorFlow Lite flatbuffer, have become de facto standards amongst companies bringing deep learning models to the edge.

The TensorFlow ecosystem has been key to enabling the growth of embedded machine learning, enabling companies like Edge Impulse to put artificial intelligence in the hands of domain experts who are building the next generation of consumer and industrial technologies.

If you’d like to learn more about embedded machine learning using Edge Impulse and TensorFlow, there are many options. Take a look at the Introduction to Embedded Machine Learning course on Coursera, or jump right in with the Getting Started guide or Recognize sounds from audio tutorial. You can even check out a public Edge Impulse project that you can clone and customize with a single click.

Daniel Situnayake

Founding TinyML Engineer, Edge Impulse.


New Courses: Machine Learning Engineering for Production

Posted by Robert Crowe and Jocelyn Becker

AI

Have you mastered the art of building and training ML models, and are now ready to use them in a production deployment for a product or service? If so, we have a new set of courses to get you going. Built as a collaboration between the TensorFlow team, Andrew Ng, and deeplearning.ai, the new set of courses are launching as a specialization on Coursera: The Machine Learning Engineering for Production (MLOps) specialization.

The new specialization builds on the foundational knowledge taught in the popular specialization, DeepLearning.AI TensorFlow Developer Professional Certificate, that teaches how to build machine learning models with TensorFlow. The new MLOps specialization kicks off with an introductory course taught by Andrew Ng, followed by courses taught by Robert Crowe and Laurence Moroney that dive into the details of getting your models out to users.

Every lesson comes with plenty of hands-on exercises that give you practice at preparing your data, and training and deploying models.

By the end of the specialization, you’ll be ready to design and deploy an ML production system end-to-end. You’ll understand project scoping, data needs, modeling strategies, and deployment requirements. You’ll know how to optimize your data, models, and infrastructure to manage costs. You’ll know how to validate the integrity of your data to get it ready for production use, and then prototype, develop, and deploy your machine learning models, monitor the outcomes, and update the datasets and retrain the models continuously.

You’ll learn how to implement feature engineering, transformation, and selection with TFX as well as how to use analytics to address model fairness and explainability issues, and how to mitigate bottlenecks. You’ll also explore different scenarios and case studies of ML in practice, from personalization systems to automated vehicles.

Typical ML pipeline
You’ll learn how processing requirements are different in deployment than in training
Use of Accelerators in Serving Infrastructure
You’ll learn about different tools and platforms for deploying your machine learning systems.
Product recommendations
A common use of ML in production is personalization systems for product recommendations.
Autonomous Driving Systems
A cutting edge use of ML in practice is to guide automated vehicles.

Despite the growing recognition of AI/ML as a crucial pillar of digital transformation, successful ML deployments are a bottleneck for getting value from AI. For example, 72% of a cohort of organizations that began AI pilots before 2019 haven’t deployed even a single application in production. A survey by Algorithmia of the state of enterprise machine learning found that 55% of companies surveyed haven’t deployed an ML model.

Models often don’t make it into production, and if they do, they break because they fail to adapt to changes in the environment. Deloitte identified lack of talent and integration issues as factors that can stall or derail AI initiatives. This is where ML engineering and MLOps are essential. ML engineering provides a superset of the discipline of software engineering that handles the unique complexities of the practical applications of ML. MLOps is a methodology for ML engineering that unifies ML system development (the ML element) with ML system operations (the Ops element).

Unfortunately, job candidates with ML engineering and MLOps skills are relatively hard to find and expensive to hire. Our new MLOps specialization teaches a broad range of many of the skills necessary to work in this field, and will help prepare developers for current and future workplace challenges. We believe that this is a valuable contribution to the ML community, and we’re excited to make it available.

Enroll today to develop your machine learning engineering skills, and learn how to roll out your ML models to benefit your company and your users.


Introducing TensorFlow Decision Forests

Posted by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, Jan Pfeifer

We are happy to open source TensorFlow Decision Forests (TF-DF). TF-DF is a collection of production-ready, state-of-the-art algorithms for training, serving, and interpreting decision forest models (including random forests and gradient boosted trees). You can now use these models for classification, regression, and ranking tasks – with the flexibility and composability of TensorFlow and Keras.

GIF showing Random Forest decision model
Random Forests are a popular type of decision forest model. Here, you can see a forest of trees classifying an example by voting on the outcome.

About decision forests

Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often superior to) neural networks, especially when you’re working with tabular data. They’re built from many decision trees, which makes them easy to use and understand – and you can take advantage of a plethora of interpretability tools and techniques that already exist today.

TF-DF brings this class of models along with a suite of tailored tools to TensorFlow users:

  • Beginners will find it easier to develop and explain decision forest models. There is no need to explicitly list or pre-process input features (as decision forests can naturally handle numeric and categorical attributes), specify an architecture (for example, by trying different combinations of layers like you would in a neural network), or worry about models diverging. Once your model is trained, you can plot it directly or analyse it with easy to interpret statistics.
  • Advanced users will benefit from models with very fast inference times (sub-microsecond per example in many cases). And, this library offers a great deal of composability for model experimentation and research. In particular, it is easy to combine neural networks and decision forests, as sketched just after this list.
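
As a rough, hypothetical sketch of that composability (it assumes the `preprocessing` argument of the TF-DF Keras wrappers and uses synthetic numeric data), a small neural network can act as a learned feature extractor in front of a random forest:

import numpy as np
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Synthetic numeric data, purely for illustration.
features = np.random.normal(size=(1000, 10)).astype(np.float32)
labels = np.random.randint(0, 2, size=(1000,))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(64)

# A small (here untrained) neural network that maps raw features to embeddings.
embedding_net = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu")])

# The random forest consumes the network's output as its input features.
model = tfdf.keras.RandomForestModel(preprocessing=embedding_net)
model.fit(dataset)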

If you’re already using decision forests outside of TensorFlow, here’s a little of what TF-DF offers:

  • It provides a slew of state-of-the-art Decision Forest training and serving algorithms such as random forests, gradient-boosted trees, CART, (Lambda)MART, DART, Extra Trees, greedy global growth, oblique trees, one-side-sampling, categorical-set learning, random categorical learning, out-of-bag evaluation and feature importance, and structural feature importance.
  • This library can serve as a bridge to the rich TensorFlow ecosystem by making it easier for you to integrate tree-based models with various TensorFlow tools, libraries, and platforms such as TFX.
  • And for users new to neural networks, you can use decision forests as an easy way to get started with TensorFlow, and continue to explore neural networks from there.

Code example

A good example is worth a thousand words. So in this blog post, we will show how easy it is to train a model with TensorFlow Decision Forests. More examples are available on the TF-DF website and GitHub page. You may also watch our talk at Google I/O 2021.

Training a model

Let’s start with a minimal example where we train a random forest model on the tabular Palmer’s Penguins dataset. The objective is to predict the species of an animal from its characteristics. The dataset contains both numerical and categorical features and is stored as a CSV file.

Three examples from the Palmer's Penguins dataset.
Three examples from the Palmer’s Penguins dataset.

Let’s train a model:

# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

# Load TensorFlow Decision Forests
import tensorflow_decision_forests as tfdf

# Load the training dataset using pandas
import pandas
train_df = pandas.read_csv("penguins_train.csv")

# Convert the pandas dataframe into a TensorFlow dataset
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="species")

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

Observe that nowhere in the code did we provide input features or hyperparameters. This means TensorFlow Decision Forests automatically detects the input features from the dataset and uses default values for all of its hyperparameters.

Evaluating a model

Now, let’s evaluate the quality of our model:

# Load the testing dataset
test_df = pandas.read_csv("penguins_test.csv")

# Convert it to a TensorFlow dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="species")

# Evaluate the model
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds))
# >> 0.979311
# Note: Cross-validation would be more suited on this small dataset.
# See also the "Out-of-bag evaluation" below.

# Export the model to a TensorFlow SavedModel
model.save("project/my_first_model")

Easy, right? And a default RandomForest model with default hyperparameters provides a quick and good baseline for most problems. Decision forests in general will train quickly for small and medium sized problems, require less hyperparameter tuning compared to many other types of models, and will often provide strong results.

Interpreting a model

Now that you have looked at the accuracy of the trained model, let’s consider its interpretability. Interpretability is important if you wish to understand and explain the phenomenon being modeled, debug a model, or begin to trust its decisions. As noted above, we have provided a number of tools to interpret trained models, beginning with plots.

tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0)
tree structure

You can visually follow the tree structure. In this tree, the first decision is based on the bill length. Penguins with bills longer than 42.2mm are likely to be the blue (Gentoo) or green (Chinstrap) species, while the ones with shorter bills are likely to be of the red species (Adelie).

For the first group, the tree then asks about the flipper length. Penguins with flippers longer than 206.5mm are likely to be of the green species (Chinstrap), while the remaining are likely to be of the blue species (Gentoo).

Model statistics are complementary additions to plots. Example statistics include:

  • How many times is each feature used?
  • How fast did the model train (in number of trees and time)?
  • How are the nodes distributed in the tree structure (for example, what is the length of most branches?)

Answers to these and similar questions are included in the model summary and are accessible through the model inspector.

# Print all the available information about the model
model.summary()
>> Input Features (7):
>> bill_depth_mm
>> bill_length_mm
>> body_mass_g
>> ...
>> Variable Importance:
>> 1. "bill_length_mm" 653.000000 ################
>> ...
>> Out-of-bag evaluation: accuracy:0.964602 logloss:0.102378
>> Number of trees: 300
>> Total number of nodes: 4170
>> ...

# Get feature importance as an array
model.make_inspector().variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
>> [("flipper_length_mm", 0.149),
>> ("bill_length_mm", 0.096),
>> ("bill_depth_mm", 0.025),
>> ("body_mass_g", 0.018),
>> ("island", 0.012)]

In the example above, the model was trained with default hyperparameter values. This is a good first solution, but “tuning” the hyperparameters can often further improve the quality of the model. That can be done as follows:

# List all the other available learning algorithms
tfdf.keras.get_all_models()
>> [tensorflow_decision_forests.keras.RandomForestModel,
>> tensorflow_decision_forests.keras.GradientBoostedTreesModel,
>> tensorflow_decision_forests.keras.CartModel]

# Display the hyper-parameters of the Gradient Boosted Trees model
? tfdf.keras.GradientBoostedTreesModel
>> A GBT (Gradient Boosted [Decision] Tree) is a set of shallow decision trees trained sequentially. Each tree is trained to predict and then "correct" for the errors of the previously trained trees (more precisely, each tree predicts the gradient of the loss relative to the model output).
...
Attributes:
  num_trees: Maximum number of decision trees. The effective number of trained trees can be smaller if early stopping is enabled. Default: 300.
  max_depth: Maximum depth of the tree. `max_depth=1` means that all trees will be roots. Negative values are ignored. Default: 6.
...

# Create another model with specified hyper-parameters
model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=8,
    split_axis="SPARSE_OBLIQUE",
)

# Train and evaluate the model
model.fit(train_ds)
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds))
# >> 0.986851

Next steps

We hope you enjoyed reading this short demonstration of TensorFlow Decision Forests, and that you are as excited to use it and contribute to it as we are to develop it.

With TensorFlow Decision Forests, you can now train state-of-the-art Decision Forests models with maximum speed and quality and with minimal effort in TensorFlow. And if you feel adventurous, you can now combine decision forests and neural networks together to create new types of hybrid models.

If you would like to learn more about the TensorFlow Decision Forests library, we have put together a number of resources; the TF-DF website and GitHub page linked above are good places to start.

If you have any questions, please ask them on discuss.tensorflow.org using the tag “TFDF” and we’ll do our best to help. Thanks again.


Run Your First Multi-Worker TensorFlow Training Job With GCP AI Platform

Posted by Nikita Namjoshi, Machine Learning Solutions Engineer

TensorFlow Header

When a single machine is not enough, it’s time to train and iterate faster with TensorFlow’s MultiWorkerMirroredStrategy. In this tutorial-style article you’ll learn how to launch a multi-worker training job on Google Cloud Platform (GCP) using AI Platform Training. You’ll also learn the basics of how TensorFlow distributes data and implements synchronous data parallelism across multiple machines. While this article focuses on a managed solution on GCP, you can also do all of this entirely in open-source on your own hardware.

Overview of Distributed Training

If you have a single GPU, TensorFlow will use this accelerator to speed up model training with no extra work on your part. However, if you want to get an additional boost from using multiple GPUs on a single machine or multiple machines (each with potentially multiple GPUs), then you’ll need to use tf.distribute, which is TensorFlow’s library for running a computation across multiple devices.

The simplest way to get started with distributed training is a single machine with multiple GPU devices. A TensorFlow distribution strategy from the tf.distribute module will manage the coordination of data distribution and gradient updates across all of the GPUs. If you want to learn more about training in this scenario, check out the previous post on distributed training basics.

If you’ve mastered single host training and are looking to scale even further, then adding multiple machines to your cluster can help you get an even greater performance boost. You can make use of a cluster of machines that are CPU only, or that each have one or more GPUs.

There are many ways to do multi-worker training on GCP. In this article we’ll use AI Platform Training, as it’s the quickest way to launch a distributed training job and has additional features that make it very easy to include as part of your production pipeline. To use this managed service, you’ll need to add a bit of extra code to your program and set up a config file that is specific to AI Platform. However, you will not have to endure the pains of GPU driver installation or cluster management, which can be very challenging in a distributed scenario.

Multi-Worker Cluster Configuration

The tf.distribute module currently provides two strategies for multi-worker training. In TensorFlow 2.5, ParameterServerStrategy is experimental, and MultiWorkerMirroredStrategy is a stable API.

Like its single-worker counterpart, MirroredStrategy, MultiWorkerMirroredStrategy is a synchronous data parallelism strategy that you can use with only a few code changes.

However, unlike MirroredStrategy, for a multi-worker setup TensorFlow needs to know which machines are part of your cluster. This is generally specified with the environment variable TF_CONFIG.

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host1:port"],
        "worker": ["host2:port", "host3:port"],
    },
    "task": {"type": "worker", "index": 1}
})

In this simple TF_CONFIG example, the “cluster” key contains a dictionary with the internal IPs and ports of all the machines. In MultiWorkerMirroredStrategy, all machines are designated as workers, which are the physical machines on which the replicated computation is executed. In addition to each machine being a worker, there needs to be one worker that takes on some extra work such as saving checkpoints and writing summary files to TensorBoard. This machine is known as the chief (or by its deprecated name master).

After you’ve added your machines to the cluster key, the next step is to set the “task”. This specifies the task type and task index of the current machine, which is an index into the cluster dictionary. The cluster key should be the same on each machine, but the task keys will be different.

Conveniently, when using AI Platform Training, the TF_CONFIG environment variable is set for you on each machine in your cluster so you don’t need to worry about this set up!

However, if you were trying to run a multi-worker job with, for example, 3 instances on Google Compute Engine, you would need to set this environment variable on each machine as shown below. For the machines that are not the chief, the TF_CONFIG looks the same except the task index increments by 1.

Machine 1 (Chief)

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host1:port"],
        "worker": ["host2:port", "host3:port"],
    },
    "task": {"type": "chief", "index": 0}
})

Machine 2

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host1:port"],
        "worker": ["host2:port", "host3:port"],
    },
    "task": {"type": "worker", "index": 0}
})

Machine 3

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host1:port"],
        "worker": ["host2:port", "host3:port"],
    },
    "task": {"type": "worker", "index": 1}
})

Setting this environment variable is fairly easy to do when you have only a few machines in your cluster; however, once you start scaling up, you don’t want to be assigning this variable to each machine manually. As mentioned earlier, one of the many benefits of using AI Platform is that this coordination happens automatically. The only configuration you have to provide is the number of machines in your cluster, and the number and type of GPUs per machine. We’ll do this step in a later section.

Set up the Distribution Strategy

In this Colab notebook, you’ll find the code to train a ResNet50 architecture on the Cassava dataset. In the following sections, we’ll review the new code that needs to be added to our program in order to do distributed training on multiple machines.

As with any strategy in the tf.distribute module, step one is to instantiate the strategy.

strategy = tf.distribute.MultiWorkerMirroredStrategy()

Note that there is a limitation where the instance of MultiWorkerMirroredStrategy needs to be created at the beginning of the program. Code that may create ops should be placed after the strategy is instantiated.

Next, you wrap the creation of your model variables within the strategy’s scope. This crucial step tells TensorFlow which variables should be mirrored across the replicas.

with strategy.scope():
  model = create_model()
  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=tf.keras.optimizers.Adam(0.0001),
      metrics=['accuracy'])

Lastly, you’ll need to scale your batch size by the number of replicas in your cluster. This ensures that each replica processes the same number of examples on each step.

per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

If you’ve used MirroredStrategy before, then the previous steps should be familiar. The main difference when moving from synchronous data parallelism on one machine to many is that the gradients at the end of each step now need to be synchronized across all GPUs in a machine and across all machines in the cluster. This additional step of synchronizing across the machines increases the overhead of distribution.

In TensorFlow, the multi-worker all-reduce communication is achieved via CollectiveOps. You don’t need to know much detail to execute a successful and performant training job, but at a high level, a collective op is a single op in the TensorFlow graph that can automatically choose an all-reduce algorithm according to factors such as hardware, network topology, and tensor sizes.
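
You usually don't need to configure this yourself, but if you want to, the strategy accepts communication options. A minimal sketch, assuming your workers have NVIDIA GPUs so that NCCL is a sensible choice:

import tensorflow as tf

# Prefer NCCL as the all-reduce implementation on GPU clusters.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options)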

Dataset Sharding

In the single worker case, at each step your dataset is divided up across the replicas on your machine. This data splitting process becomes slightly more complicated in the multi-worker case. The data now also needs to be sharded, meaning that each worker is assigned a subset of the entire dataset. Therefore, at each step a global batch size of non-overlapping dataset elements will be processed by each worker. This sharding happens automatically with tf.data.experimental.AutoShardPolicy.

By default, TensorFlow will first attempt to shard your data by FILE. This means that if your data exists across multiple files, each worker will process different file(s) and split the corresponding data amongst the replicas. FILE is the default autoshard policy because MultiWorkerMirroredStrategy works best for use cases with very large datasets, which are likely to not be in a single file. However, this option can lead to idle workers if the number of files is not divisible by the number of workers, or if some files are substantially longer than others.

If your data is not stored in multiple files, then the AutoShardPolicy will fall back to DATA, meaning that TensorFlow will autoshard the elements across all the workers. This guards against the potential idle worker scenario, but the downside is that the entire dataset will be read on each worker. You can read more about the different policies and see examples in the Distributed Input guide.

If you don’t want to use the default AUTO policy, you can set the desired AutoShardPolicy with the following code:

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_data = train_data.with_options(options)

Save Your Model

Saving your model is slightly more complicated in the multi-worker case because the destination needs to be different for each of the workers. The chief worker will save to the desired model directory, while the other workers will save the model to temporary directories. It’s important that these temporary directories are unique in order to prevent multiple workers from writing to the same location. Saving can contain collective ops, so all workers must save and not just the chief.

The following is boilerplate code that implements the intended saving logic, as well as some cleanup to delete the temporary directories once the training has completed. Note that the model_path is the name of the Google Cloud Storage (GCS) bucket where your model will be saved at the end of training.

model_path = {gs://path_to_your_gcs_bucket}

# Note that with MultiWorkerMirroredStrategy,
# the program is run on every worker.
def _is_chief(task_type, task_id):
  # Note: there are two possible `TF_CONFIG` configurations.
  #   1) In addition to `worker` tasks, a `chief` task type is used.
  #      The implementation demonstrated here is for this case.
  #   2) Only `worker` task type is used; in this case, worker 0 is
  #      regarded as the chief. In this case, this function
  #      should be modified to
  #      return (task_type == 'worker' and task_id == 0) or task_type is None
  return task_type == 'chief'


def _get_temp_dir(dirpath, task_id):
  base_dirpath = 'workertemp_' + str(task_id)
  temp_dir = os.path.join(dirpath, base_dirpath)
  tf.io.gfile.makedirs(temp_dir)
  return temp_dir


def write_filepath(filepath, task_type, task_id):
  dirpath = os.path.dirname(filepath)
  base = os.path.basename(filepath)
  if not _is_chief(task_type, task_id):
    dirpath = _get_temp_dir(dirpath, task_id)
  return os.path.join(dirpath, base)


# Determine type and task of the machine from
# the strategy cluster resolver
task_type, task_id = (strategy.cluster_resolver.task_type,
                      strategy.cluster_resolver.task_id)

# Based on the type and task, write to the desired model path
write_model_path = write_filepath(model_path, task_type, task_id)
model.save(write_model_path)

Everything we’ve covered about setting up the distribution strategy, sharding data, and saving models applies whether you’re training on GCP, your own hardware, or another cloud platform.

Prepare code for AI Platform

The basic prerequisites for using AI Platform are that you need to have a GCP project with billing enabled, the AI Platform APIs enabled, and sufficient AI Platform quota. If any of these steps are a mystery to you, refer to the previous post to get up to speed on GCP basics.

If you’re already familiar with training on AI Platform with a single node, then you’ll likely breeze through this section. We’ll take the pieces we walked through in the previous section, and do a bit of rearranging to match AI Platform Training convention. All of the code can be found in this Github repo, but we’ll walk through it in detail in this section.

By AI Platform convention, training code is arranged according to the diagram below. The task.py file contains the code that executes your training job. The example in this tutorial also includes a model.py file, which has the Keras functional API code for the model. For more complex production applications you’ll likely have additional util.py or setup.py files, and you can see where those fit in the hierarchy below.

diagram showing path of file

Model code

The model.py file can be found in Github here. You can see that this file just has the code for building the ResNet50 model architecture.

Task code

The task.py file can be found in Github here. This file contains the main function, which will execute the training job and save the model.

def main():
  args = get_args()
  strategy = tf.distribute.MultiWorkerMirroredStrategy()
  global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync
  train_data, number_of_classes = create_dataset(global_batch_size)

  with strategy.scope():
    model = create_model(number_of_classes)

  model.fit(train_data, epochs=args.epochs)

  # Determine type and task of the machine from
  # the strategy cluster resolver
  task_type, task_id = (strategy.cluster_resolver.task_type,
                        strategy.cluster_resolver.task_id)

  # Based on the type and task, write to the desired model path
  write_model_path = write_filepath(args.job_dir, task_type, task_id)
  model.save(write_model_path)

In this simple example, the data preprocessing happens directly in the task.py file, but in reality for more complicated data processing you would probably want to split out this code into a separate data.py file that you can import into task.py (for example if your preprocessing includes parsing TFRecord files).
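
As an illustrative sketch (not the code in the repo) of what such a data.py helper might contain, assuming TFRecords that hold an encoded image and an integer label:

import tensorflow as tf

# Hypothetical feature spec; adjust it to match how your TFRecords were written.
FEATURE_SPEC = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized_example):
  # Decode one serialized tf.train.Example into an (image, label) pair.
  parsed = tf.io.parse_single_example(serialized_example, FEATURE_SPEC)
  image = tf.io.decode_jpeg(parsed['image'], channels=3)
  image = tf.image.resize(image, [224, 224]) / 255.0
  return image, parsed['label']

def create_dataset(file_pattern, global_batch_size):
  files = tf.data.Dataset.list_files(file_pattern)
  dataset = tf.data.TFRecordDataset(files)
  return dataset.map(parse_example).batch(global_batch_size).prefetch(tf.data.AUTOTUNE)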

We explicitly set the AutoShardPolicy to DATA in this case because the Cassava dataset is not downloaded as multiple files. However, if we did not set the policy to DATA, the default AUTO policy would kick in and the end result would be the same.

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_data = train_data.with_options(options)

The task.py file also parses any command line arguments we need. In this simple example, the epochs are passed in via the command line. Additionally, we need to parse the argument job-dir, which is the GCS bucket where our model will be stored.

def get_args():
  '''Parses args.'''
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--epochs',
      required=True,
      type=int,
      help='number of training epochs')
  parser.add_argument(
      '--job-dir',
      required=True,
      type=str,
      help='bucket to save model')
  args = parser.parse_args()
  return args

Lastly, the task.py file contains our boilerplate code for saving the model. For a production example, you probably would want to add this boilerplate to a util.py file, but again for this simple example we’ll keep everything in one file.

Custom Container Set up

AI Platform provides standard runtimes for you to execute your training job. While these runtimes might work for your use case, more specialized needs require a custom container. In this section, we’ll walk through how to set up your container image and push it to Google Container Registry (GCR).

Write Your Dockerfile

The following Dockerfile specifies the base image, using the TensorFlow 2.5 Enterprise GPU Deep Learning Container. Using the TensorFlow Enterprise image as our base image provides a useful design pattern for developing on GCP. TensorFlow Enterprise is a distribution of TensorFlow that is optimized for GCP. You can use TensorFlow Enterprise with AI Platform Notebooks, the Deep Learning VMs, and AI Platform Training, providing a seamless transition between different environments.

The code in our trainer directory is copied to the Docker image, and our entry point is the task.py script, which we will run as a module.

# Specifies base image and tag
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
WORKDIR /root

# Copies the trainer code to the docker image.
COPY trainer/ /root/trainer/

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Push Your Dockerfile to GCR

Next, we’ll set up some useful environment variables. You can select any name of your choosing for IMAGE_REPO_NAME and IMAGE_TAG. If you have not already set up the Google Cloud SDK, you can follow the steps here, as you’ll need to use the gcloud tool to push your container and kick off the training job.

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME={your_repo_name}
export IMAGE_TAG={your_image_tag}
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

Next, you’ll build your Dockerfile.

docker build -f Dockerfile -t $IMAGE_URI ./

Lastly, you can push your image to GCR.

gcloud auth configure-docker
docker push $IMAGE_URI

If you navigate to the GCR page in the GCP console UI, you should see your newly pushed image.

Configure Your Cluster

The final step before we can kick off our training job is to set up the cluster. AI Platform offers a set of predefined cluster specifications called scale tiers, but we’ll need to provide our own cluster setup for distributed training.

In the following config.yaml file, we’ve designated one master (equivalent to chief) and one worker. Each machine has one NVIDIA T4 Tensor Core GPU. For both machines, you’ll also need to specify the imageUri as the image you pushed to GCR in the previous step.

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4
    imageUri: gcr.io/{path/to/image}:{tag}
  useChiefInTfConfig: true
  workerType: n1-standard-8
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4
    imageUri: gcr.io/{path/to/image}:{tag}

In case you’re wondering what the useChiefInTfConfig flag does, TensorFlow uses the terminology “Chief” and AI Platform uses the terminology “Master”, so this flag will manage that discrepancy. You don’t need to worry about the details (although you will see an error message if you forget to set this flag!).

Feel free to experiment with this configuration by adding machines, adding GPUs, or removing all GPUs and training with CPUs only. You can see the supported regions and GPU types here for AI Platform, so just make sure your project has sufficient quota for whatever configuration you choose.

Launch Your Training Job

You can launch your training job easily with the following command:

gcloud ai-platform jobs submit training {job_name} \
  --region europe-west2 \
  --config config.yaml \
  --job-dir gs://{gcs_bucket/model_dir} \
  -- \
  --epochs 5

In the command above, you’ll need to give your job a name. In addition to passing in the region, you’ll need to define job-dir, which is the directory in your GCS bucket where you want your saved model file to be stored after training completes.

The empty `--` flag marks the end of the gcloud-specific flags and the start of the args that you want to pass to your application (in this case, just the epochs).

After executing the training command, you should see the following message.

code snippet

You can navigate to the AI Platform UI in the GCP console and track the status of your job.

You’ll notice that your job will take around ten minutes to launch. This overhead might seem huge in our simple example where it doesn’t even take ten minutes to train on a single GPU. However, this overhead will be amortized for large jobs.

job details screen

When the job completes training, you’ll see a green check mark next to the job. You can then click the Model location URI and you’ll find your saved_model.pb file.

What’s Next

You now know the basics of launching a multi-worker training job on GCP. You also know the core concepts of MultiWorkerMirroredStrategy. To take your skills to the next level, try leveraging AI Platform’s hyperparameter tuning feature for your next training job (in open-source, you can use Keras Tuner), or using TFRecord files as your input data. You can also try out Parameter Server Strategy if you’d like to explore asynchronous training in TensorFlow. Happy distributed training!


Recap of TensorFlow at Google I/O 2021

Posted by the TensorFlow team

TensorFlow recap header

Thanks to everyone who joined our virtual I/O 2021 livestream! While we couldn’t meet in person, we hope we were able to make the event more accessible than ever. In this article, we’re recapping a few of the updates we shared during the keynote. You can watch the keynote below, and you can find recordings of every talk on the TensorFlow YouTube channel. Here’s a summary of a few announcements by product area (and there’s more in the videos, so be sure to check them out, too).

TensorFlow for Mobile and Web

The TensorFlow Lite runtime will be bundled with Google Play services

Let’s start with the announcement that the TensorFlow Lite runtime is going to be bundled with Google Play services, meaning you don’t need to distribute it with your app. This can greatly reduce your app’s bundle size. Now you can distribute your model without needing to worry about the runtime. You can sign up for an early access program today, and we expect a full rollout later this year.

You can now run TensorFlow Lite models on the web

All your TensorFlow Lite models can now directly be run on the web in the browser with the new TFLite Web APIs that are unified with TensorFlow.js. This task-based API supports running all TFLite Task Library models for image classification, object detection, image segmentation, and many NLP problems. It also supports running arbitrary, custom TFLite models with easy, intuitive TensorFlow.js compatible APIs. With this option, you can unify your mobile and web ML development with a single stack.

A new On-Device Machine Learning site

We understand that the most effective developer path to reach Android, the Web and iOS isn’t always the most obvious. That’s why we created a new On-Device Machine Learning site to help you navigate your options, from turnkey to custom models, from cross platform mobile, to in-browser. It includes pathways to take you from an idea to a deployed app, with all the steps in between.

Performance profiling

When it comes to performance, we’re also working on additional tooling for Android developers. TensorFlow Lite includes built-in support for Systrace, integrating seamlessly with perfetto for Android 10.

And perf improvements aren’t limited to Android – for iOS developers TensorFlow Lite comes with built-in support for signpost-based profiling. When you build your app with the trace option enabled, you can run the Xcode profiler to see the signpost events, letting you dive deeper and see all the way down to individual ops during execution.

Perfetto dashboard

TFX

TFX 1.0: Production ML at Enterprise-scale

Moving your ML models from prototype to production requires lots of infrastructure. Google created TFX because we needed a strong framework for our ML products and services, and then we open-sourced it so that others can use it too. It includes support for training models for mobile and web applications, as well as server-based applications.

After a successful beta with many partners, today we’re announcing TFX 1.0, which is ready for production ML at enterprise scale. TFX includes all of the things an enterprise-ready framework needs, including enterprise-grade support, security patches, bug fixes, and guaranteed backward compatibility for the entire 1.X release cycle. It also includes strong support for running on Google Cloud and support for mobile, web, and NLP applications.

If you’re ready for production ML, TFX is ready for you. Visit the TFX site to learn more.

Responsible AI

We’re also sharing a number of new tools to help you keep Responsible AI top of mind in everything that you do when developing with ML.

Know Your Data

Know Your Data (KYD) is a new tool to help ML researchers and product teams understand rich datasets (images and text) with the goal of improving data and model quality, as well as surfacing and mitigating fairness and bias issues. Try the interactive demo at the link above to learn more.

Know Your Data interface

People + AI Guidebook 2.0

As you create AI solutions, building with a people centric approach is a key to doing it responsibly, and we’re delighted to announce the People + AI Guidebook 2.0. This update is designed to help you put best practices and guidance for people-centric AI into practice with a lot of new resources including code, design patterns and much more!

Also check out our Responsible AI Toolkit to help you integrate Responsible AI practices into your ML workflow using TensorFlow.

Decision forests in Keras

New support for random forests and gradient boosted trees

There’s more to ML than neural networks. Starting with TensorFlow 2.5, you can easily train powerful decision forest models (including favorites like random forests and gradient boosted trees) using familiar Keras APIs. There’s support for many state-of-the-art algorithms for training, serving and interpreting models for classification, regression and ranking tasks. And you can serve your decision forests using TF Serving, just like any other model trained with TensorFlow. Check out the tutorials here, and the video from this session.

TensorFlow Lite for Microcontrollers

A new pre-flashed board, experiments, and a challenge

TensorFlow Lite for Microcontrollers is designed to help you run ML models on microcontrollers and other devices with only a few kilobytes of memory. You can now purchase pre-flashed Arduino boards that will connect via Bluetooth and your browser. And you can use these to try out new Experiments With Google that let you make gestures and even create your own classifiers and run custom TensorFlow models. If you’re interested in challenges, we’re also running a new TensorFlow Lite for Microcontrollers challenge; you can check it out here. And also be sure to check out the TinyML workshop video in the next steps below.

Microcontroller chip

Google Cloud

Vertex AI: A new managed ML platform on Google Cloud

An ML model is only valuable if you can actually put it into production. And as you know, it can be challenging to productionize efficiently and at scale. That’s why Google Cloud is releasing Vertex AI, a new managed machine learning platform to help you accelerate experimentation and deployment of AI models. Vertex AI has tools that span every stage of the developer workflow, from data labeling, to working with notebooks and models, to prediction tools and continuous monitoring – all unified into one UI. While many of these offerings may be familiar to you, what really distinguishes Vertex AI is the introduction of new MLOps features. You can now manage your models with confidence using our MLOps tools such as Vertex Pipelines and Vertex Feature Store, to remove the complexity of robust self-service model maintenance and repeatability.

TensorFlow Cloud: Transition from local model building to distributed training on the Cloud

TensorFlow Cloud provides APIs that ease the transition from local model building and debugging to distributed training and hyperparameter tuning on Google Cloud. From inside a Colab or Kaggle Notebook or a local script file, you can send your model for tuning or training on Cloud directly, without needing to use the Cloud Console. We recently added a new site and new features; check it out if you’re interested in learning more.
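As a minimal sketch of the idea (assuming a local training script named train.py with a requirements.txt next to it; both names are placeholders), submitting that script to run on Cloud can be as simple as:

import tensorflow_cloud as tfc

# Package the local training script and submit it to run on Google Cloud.
# "train.py" and "requirements.txt" are placeholder file names.
tfc.run(
    entry_point="train.py",
    requirements_txt="requirements.txt",
)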

Community

A new TensorFlow Forum

We created a new TensorFlow Forum for you to ask questions and connect with the community. It’s a place for developers, contributors, and users to engage with each other and the TensorFlow team. Create your account and join the conversation at discuss.tensorflow.org.

TensorFlow Forum page

Find all the talks here

This is just a small part of what was shared at Google I/O 2021. You can find all of the TensorFlow sessions in this playlist.

To learn more about TensorFlow, check out tensorflow.org, read other articles on the blog, follow us on social media, subscribe to our YouTube channel, or join a TensorFlow User Group near you.


High Fidelity Pose Tracking with MediaPipe BlazePose and TensorFlow.js

Posted by Ivan Grishchenko, Valentin Bazarevsky and Na Li, Google Research

Today we’re excited to launch MediaPipe’s BlazePose in our new pose-detection API. BlazePose is a high-fidelity body pose model designed specifically to support challenging domains like yoga, fitness, and dance. It can detect 33 keypoints, extending the 17-keypoint topology of the original PoseNet model we launched a couple of years ago. These additional keypoints provide vital information about face, hand, and foot location, along with scale and rotation. Together with our face and hand models, they can be used to unlock various domain-specific applications like gesture control or sign language recognition without special hardware. With today’s release we enable developers to use the same models on the web that power ML Kit Pose and MediaPipe Python, unlocking the same great performance across all devices.

The new TensorFlow.js pose-detection API supports two runtimes: TensorFlow.js and MediaPipe. TensorFlow.js provides the flexibility and wider adoption of JavaScript and is optimized for several backends, including WebGL (GPU), WASM (CPU), and Node. MediaPipe capitalizes on WASM with GPU-accelerated processing and provides faster out-of-the-box inference speed. The MediaPipe runtime currently lacks Node and iOS Safari support, but we’ll be adding that support soon.

Try out the live demo!

3 examples of BlazePose tracking dancing, stretching, and exercising
BlazePose can track 33 keypoints across a variety of complex poses in real time.

Installation

To use BlazePose with the new pose-detection API, you first have to decide whether to use the TensorFlow.js runtime or the MediaPipe runtime. To understand the advantages of each, see the performance and loading-times sections later in this post.

For each runtime, you can use either script tag or NPM for installation.

Using TensorFlow.js runtime:

  1. Through script tag:
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-converter"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/pose-detection"></script>
  2. Through NPM:
    yarn add @tensorflow/tfjs-core @tensorflow/tfjs-converter
    yarn add @tensorflow/tfjs-backend-webgl
    yarn add @tensorflow-models/pose-detection

Using MediaPipe runtime:

  1. Through script tag:
    <script src="https://cdn.jsdelivr.net/npm/@mediapipe/pose"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/pose-detection"></script>
  2. Through NPM:
    yarn add @mediapipe/pose
    yarn add @tensorflow-models/pose-detection

Try it yourself!

Once the package is installed, you only need to follow the few steps below to start using it. There are three variants of the model: lite, full, and heavy. The model accuracy increases from lite to heavy, while the inference speed decreases and memory footprint increases. The heavy variant is intended for applications that require high accuracy, while the lite variant is intended for latency-critical applications. The full variant is a balanced option, which is also the default option here.

Using TensorFlow.js runtime:

// Import TFJS runtime with side effects.
import '@tensorflow/tfjs-backend-webgl';
import * as poseDetection from '@tensorflow-models/pose-detection';

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.BlazePose, {runtime: 'tfjs'});

Using MediaPipe runtime:

// Import MediaPipe runtime with side effects.
import '@mediapipe/pose';
import * as poseDetection from '@tensorflow-models/pose-detection';

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.BlazePose, {runtime: 'mediapipe'});

You can also choose the lite or the heavy variant by setting the modelType field, as shown below:

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.BlazePose, {runtime, modelType:'lite'});
// Pass in a video stream to the model to detect poses.
const video = document.getElementById('video');
const poses = await detector.estimatePoses(video);

Each pose contains 33 keypoints, with absolute x, y coordinates, confidence score and name:

console.log(poses[0].keypoints);
// Outputs:
// [
// {x: 230, y: 220, score: 0.9, name: "nose"},
// {x: 212, y: 190, score: 0.8, name: "left_eye_inner"},
// ...
// ]

Refer to our README (TFJS runtime, MediaPipe runtime) for more details about the API.

As you begin to play and develop with BlazePose, we would appreciate your feedback and contributions. If you make something using this model, tag it with #MadeWithTFJS on social media so we can find your work, as we would love to see what you create.

Model deep dive

BlazePose provides real-time human body pose perception in the browser, working up to 4 meters from the camera.

We trained BlazePose specifically for highly demanded single-person use cases like yoga, fitness, and dance which require precise tracking of challenging postures, enabling the overlay of digital content and information on top of the physical world in augmented reality, gesture control, and quantifying physical exercises.

For pose estimation, we utilize our proven two-step detector-tracker ML pipeline. Using a detector, this pipeline first locates the pose region-of-interest (ROI) within the frame. The tracker subsequently predicts all 33 pose keypoints from this ROI. Note that for video use cases, the detector is run only on the first frame. For subsequent frames we derive the ROI from the previous frame’s pose keypoints as discussed below.
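As an illustration only (the real pipeline derives the ROI from specific body landmarks and applies alignment and padding rules not shown here), the idea of reusing the previous frame’s keypoints might look like this:

# Illustration only: derive a rough region-of-interest for the next frame
# from the previous frame's keypoints by expanding their bounding box.
def roi_from_keypoints(keypoints, margin=0.25):
    """keypoints: list of (x, y) tuples; returns (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - margin * width, min(ys) - margin * height,
            max(xs) + margin * width, max(ys) + margin * height)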

BlazePose architecture.
BlazePose architecture.

BlazePose’s topology contains 33 points, extending COCO’s 17-point topology with additional points on the palms and feet that provide the missing scale and orientation information for the limbs, which is vital for practical applications like fitness, yoga, and dance.

Since the BlazePose CVPR’2020 release, MediaPipe has been constantly improving the models’ quality to remain state-of-the-art on the web / edge for single person pose tracking. Besides running through the TensorFlow.js pose-detection API, BlazePose is also available on Android, iOS and Python via MediaPipe and ML Kit. For detailed information, read the Google AI Blog post and the model card.

BlazePose Browser Performance

TensorFlow.js continuously seeks opportunities to bring the latest and fastest runtime for browsers. To achieve the best performance for this BlazePose model, in addition to the TensorFlow.js runtime (w/ WebGL backend) we further integrated with the MediaPipe runtime via the MediaPipe JavaScript Solutions. The MediaPipe runtime leverages WASM to utilize state-of-the-art pipeline acceleration available across platforms, which also powers Google products such as Google Meet.

Inference speed:

To quantify the inference speed of BlazePose, we benchmark the model across multiple devices.

| Runtime | MacBook Pro 15” 2019 (Intel Core i9, AMD Radeon Pro Vega 20), FPS | iPhone 11, FPS | Pixel 5, FPS | Desktop (Intel i9-10900K, Nvidia GTX 1070), FPS |
|---|---|---|---|---|
| MediaPipe Runtime (WASM with GPU acceleration) | 92 / 81 / 38 | N/A | 32 / 22 / N/A | 160 / 140 / 98 |
| TensorFlow.js Runtime (WebGL backend) | 48 / 53 / 28 | 34 / 30 / N/A | 13 / 11 / 5 | 44 / 40 / 30 |

Inference speed of BlazePose across different devices and runtimes. The three numbers in each cell are for the lite, full, and heavy models, respectively. Certain model type and runtime combinations were not supported at the time of this release; support will be added soon.

To see the model’s FPS on your device, try our demo. You can switch the model type and runtime live in the demo UI to see what works best for your device.

Loading times:

Bundle size affects the initial page loading experience, including metrics such as Time-To-Interactive (TTI) and UI rendering. We evaluated the pose-detection API with the two runtime options. Bundle size affects file fetching time and UI smoothness, because parsing the code and loading it into memory competes with UI rendering on the CPU. It also affects when the model becomes available to make inferences.

The two runtimes load things differently. For the MediaPipe runtime, only the @tensorflow-models/pose-detection and @mediapipe/pose libraries are loaded at initial page download; the runtime and the model assets are loaded when the createDetector method is called. For the TF.js runtime with the WebGL backend, the runtime is loaded at initial page download; only the model assets are loaded when the createDetector method is called. The TensorFlow.js package sizes can be further reduced with a custom bundle technique. Also, if your application already uses TensorFlow.js, you don’t need to load those packages again, since models share the same TensorFlow.js runtime. Choose the runtime that best suits your latency and bundle size requirements. A summary of loading times and bundle sizes is provided below:

| Runtime and stage | Bundle size (gzipped + minified) | Average loading time (WiFi, 100 Mbps download) |
|---|---|---|
| MediaPipe Runtime: initial page load | 22.1 KB | 0.04 s |
| MediaPipe Runtime: initial detector creation, runtime | 1.57 MB | |
| MediaPipe Runtime: initial detector creation, lite model | 10.6 MB | 1.91 s |
| MediaPipe Runtime: initial detector creation, full model | 14 MB | 1.91 s |
| MediaPipe Runtime: initial detector creation, heavy model | 34.9 MB | 4.82 s |
| TensorFlow.js Runtime: initial page load | 162.6 KB | 0.07 s |
| TensorFlow.js Runtime: initial detector creation, lite model | 10.4 MB | 1.91 s |
| TensorFlow.js Runtime: initial detector creation, full model | 13.8 MB | 1.91 s |
| TensorFlow.js Runtime: initial detector creation, heavy model | 34.7 MB | 4.82 s |

Bundle size and loading time analysis for the MediaPipe and TF.js runtimes. The loading time is estimated based on a simulated WiFi network with 100 Mbps download speed and includes the time from request sent to content downloaded; see what is included in more detail here.

Looking ahead

In the future, we plan to extend TensorFlow.js pose-detection API with new features like BlazePose GHUM 3D pose. We also plan to speed up the TensorFlow.js WebGL backend to make model execution even faster. This will be achieved through repeated benchmarking and backend optimization, such as operator fusion. We will also bring Node.js support in the near future.

Acknowledgements

We would like to acknowledge our colleagues, who participated in creating BlazePose GHUM 3D: Eduard Gabriel Bazavan, Cristian Sminchisescu, Tyler Zhu, the other contributors to MediaPipe: Chuo-Ling Chang, Michael Hays and Ming Guang Yong, along with those involved with the TensorFlow.js pose-detection API: Ping Yu, Sandeep Gupta, Jason Mayes, and Masoud Charkhabi.


Speed-up your sites with web-page prefetching using Machine Learning

Posted by Minko Gechev, David Zats, Na Li, Ping Yu, Anusha Ramesh, and Sandeep Gupta

Page load time is one of the most important determinants of user experience on a web site. Research shows that faster page load time directly leads to increased page views, conversion, and customer satisfaction. Retail superstore Newegg has seen a 50% increase in conversions after implementing web-page prefetching to optimize page load experience.

Using TensorFlow tooling, it is now possible to use machine learning to implement a powerful solution for your website to improve page load times. In this blog post, we show an end-to-end workflow that uses your site’s navigation data from Google Analytics to train a custom machine learning model that can predict the user’s next actions. You can use these predictions in an Angular app to prefetch candidate pages and dramatically improve the user experience on your website. Fig. 1 illustrates this side by side: the default page load experience with no optimization, compared to the greatly improved page load times with machine-learning-based predictive prefetching on the right. Both examples run on an emulated slow 3G network.

Comparison of un-optimized and machine learning based page loading time in a sample web application
Fig: Comparison of un-optimized and machine learning based page loading time in a sample web application

A high-level schematic of our solution is as follows:

Solution overview
Fig: Solution overview

We use Google Cloud services (BigQuery and Dataflow) to store and preprocess the site’s Google Analytics data, then train a custom model using TensorFlow Extended (TFX) to run our model training pipeline, produce a site-specific model, and then convert it into a web-deployable TensorFlow.js format. This client-side model will be loaded in a sample Angular web app for an e-store to demonstrate how to deploy the model in a web application. Let’s take a look at these components in more detail.

Data Preparation & Ingestion

Google Analytics stores each page visit as an event, providing key aspects such as the page name, visit time, and load time. This data contains everything we need to train our model. We need to:

  1. Convert this data to training examples containing features and labels
  2. Make it available to TFX for training.

We accomplish the first by leveraging existing support for exporting the Google Analytics data to a large-scale cloud data store called BigQuery. We accomplish the second by creating an Apache Beam pipeline that:

  1. Reads the data from BigQuery
  2. Sorts and filters the events in a session
  3. Walks through each session, creating examples that take properties of the current event as features and the page visit in the next event as the label
  4. Stores these generated examples in Google Cloud Storage so that they can be used by TFX for training.

We run our Beam pipeline in Dataflow.
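As a rough sketch of what such a pipeline could look like (assuming hypothetical field names like session_id and time, and assuming the ga_session_to_tensorflow_examples function shown later in this post returns serialized tf.train.Example protos):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: read Google Analytics events from BigQuery, group them into
# sessions, create training examples, and write them out as TFRecords.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | 'ReadEvents' >> beam.io.ReadFromBigQuery(
           query='SELECT * FROM `my_project.analytics.events`',  # placeholder
           use_standard_sql=True)
     | 'KeyBySession' >> beam.Map(lambda event: (event['session_id'], event))
     | 'GroupSessions' >> beam.GroupByKey()
     | 'CreateExamples' >> beam.FlatMap(
           lambda kv: ga_session_to_tensorflow_examples(
               sorted(kv[1], key=lambda event: event['time'])))
     | 'WriteTFRecords' >> beam.io.WriteToTFRecord(
           'gs://my-bucket/training-examples/train'))  # placeholder bucket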

In the following table, each row represents a training example:

| cur_page | session_index | label |
|---|---|---|
| page2 | 0 | page3 |
| page3 | 8 | page1 |

While our training example only contains two training features (cur_page and session_index), additional features from Google Analytics can be easily added to create a richer dataset and used for training to create a more powerful model. To do so, extend the following code:

def ga_session_to_tensorflow_examples(session):
  examples = []

  for i in range(len(session) - 1):
    features = {'cur_page': [session[i]['page']['pagePath']],
                'label': [session[i + 1]['page']['pagePath']],
                'session_index': [i],
                # Add additional features here.
               }
    examples.append(create_tensorflow_example(features))
  return examples

Model Training

TensorFlow Extended (TFX) is an end-to-end, production-scale ML platform. We use it to automate data validation, training at scale (using accelerators), and evaluation and validation of the generated model.

To create a model within TFX, you must provide the preprocessing function and the run function. The preprocessing function defines the operations that should be performed on the data before it is passed to the main model. These include operations that involve a full pass over the data, such as vocab creation. The run function defines the main model and how it is to be trained.

Our example shows how to implement the preprocessing_fn and the run_fn to define and train a model for predicting the next page. And the TFX example pipelines demonstrate how to implement these functions for many other key use cases.
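For orientation, here is a heavily simplified sketch of the two functions; feature names follow the table above, the model body is illustrative only, and make_dataset is a hypothetical helper that builds tf.data datasets from the transformed examples:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  """Full-pass transforms, e.g. building vocabularies over page names."""
  return {
      'cur_page_id': tft.compute_and_apply_vocabulary(inputs['cur_page']),
      'session_index': tf.cast(inputs['session_index'], tf.float32),
      'label_id': tft.compute_and_apply_vocabulary(inputs['label']),
  }

def run_fn(fn_args):
  """Defines and trains the next-page model."""
  train_ds = make_dataset(fn_args.train_files)  # hypothetical helper
  eval_ds = make_dataset(fn_args.eval_files)    # hypothetical helper

  # A small classifier over page ids; layer sizes are placeholders.
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(10000, activation='softmax'),
  ])
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
  model.fit(train_ds, validation_data=eval_ds,
            steps_per_epoch=fn_args.train_steps,
            validation_steps=fn_args.eval_steps)
  model.save(fn_args.serving_model_dir, save_format='tf')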

Creating a Web Deployable Model

After training our custom model, we want to deploy it in our web application so it can be used to make live predictions when users visit our website. For this, we use TensorFlow.js, which is TensorFlow’s framework for running machine learning models directly in the browser, client-side. By running this code client-side, we can reduce the latency associated with server-side round trips, reduce server-side costs, and keep users’ data private by not having to send any session data to the server.

TFX employs the Model Rewriting Library to automate conversion between trained TensorFlow models and the TensorFlow.js format. As part of this library, we have implemented a TensorFlow.js rewriter. We simply invoke this rewriter within the run_fn to perform the desired conversion. Please see the example for more details.

Angular Application

Once we have the model we can use it within an Angular application. On each navigation, we will query the model and prefetch the resources associated with the pages that are likely to be visited in the future.

An alternative solution would be to prefetch the resources associated with all possible future navigation paths, but this would have much higher bandwidth consumption. Using machine learning, we can predict only the pages that are likely to be visited next, reducing the number of false positives.

Depending on the specifics of the application we may want to prefetch different types of assets, for example: JavaScript, images, or data. For the purposes of this demonstration we’ll be prefetching images of products.

A challenge is how to implement the mechanism in a performant way without impacting the application’s load time or runtime performance. Techniques we can use to mitigate the risk of performance regressions are:

  • Load the model and TensorFlow.js lazily without blocking the initial page load time
  • Query the model off the main thread so we don’t drop frames in the main thread and achieve 60fps rendering experience

A web platform API that satisfies both of these constraints is the service worker. A service worker is a script that your browser runs in the background in a new thread, separate from a web page. It also allows you to plug into a request cycle and provides you with cache control.

When the user navigates across the application, we’ll post messages to the service worker with the pages they have visited. Based on the navigation history, the service worker will make predictions for future navigation and prefetch relevant product assets.

Example of future navigation

Let us look at a high-level overview of the individual moving parts.

From within the main file of our Angular application, we can load the service worker:

// main.ts

if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/prefetch.worker.js', { scope: '/' });
}

This snippet will download the prefetch.worker.js script and run it in the background. As the next step, we want to forward navigation events to it:

// app.component.ts

this.route.params.subscribe((routeParams) => {
  if (this._serviceWorker) {
    this._serviceWorker.postMessage({ page: routeParams.category });
  }
});

In the snippet above, we watch for changes of the parameters of the URL. On change, we forward the category of the page to the service worker.

In the implementation of the service worker we need to handle messages from the main thread, make predictions based on them, and prefetch the relevant information. On a high-level this looks as follows:

// prefetch.worker.js

addEventListener('message', ({ data }) => prefetch(data.page));

const prefetch = async (path) => {
  const predictions = await predict(path);
  const cache = await caches.open(ImageCache);

  predictions.forEach(async ([probability, category]) => {
    const products = (await getProductList(category)).map(getUrl);
    [...new Set(products)].forEach(url => {
      const request = new Request(url, {
        mode: 'no-cors',
      });
      fetch(request).then(response => cache.put(request, response));
    });
  });
};

Within the service worker we listen for messages from the main thread. When we receive a message we trigger the logic responsible for making predictions and prefetching data.

In the prefetch function we first predict which pages the user could visit next. After that, we iterate over all the predictions and fetch the corresponding resources to improve the user experience in subsequent navigations.

For details you can follow the sample app in the TensorFlow.js examples repository.

Try it yourself

Check out the model training code sample which shows the TFX pipeline for training a page prefetching model as well as an Apache Beam pipeline that converts Google Analytics data to training examples, and the deployment sample showing how to deploy the TensorFlow.js model in a sample Angular app for client-side predictions.

Acknowledgements

This project wouldn’t have been possible without the incredible effort and support of Becky Chan, Deepak Aujla, Fei Dong, and Jason Mayes.


Next-Generation Pose Detection with MoveNet and TensorFlow.js

Posted by Ronny Votel and Na Li, Google Research

Today we’re excited to launch our latest pose detection model, MoveNet, with our new pose-detection API in TensorFlow.js. MoveNet is an ultra fast and accurate model that detects 17 keypoints of a body. The model is offered on TF Hub with two variants, known as Lightning and Thunder. Lightning is intended for latency-critical applications, while Thunder is intended for applications that require high accuracy. Both models run faster than real time (30+ FPS) on most modern desktops, laptops, and phones, which proves crucial for live fitness, sports, and health applications. This is achieved by running the model completely client-side, in the browser using TensorFlow.js with no server calls needed after the initial page load and no dependencies to install.

Try out the live demo!

MoveNet can track keypoints through fast motions and atypical poses.
MoveNet can track keypoints through fast motions and atypical poses.

Human pose estimation has come a long way in the last five years, but surprisingly hasn’t surfaced in many applications just yet. This is because more focus has been placed on making pose models larger and more accurate, rather than doing the engineering work to make them fast and deployable everywhere. With MoveNet, our mission was to design and optimize a model that leverages the best aspects of state-of-the-art architectures, while keeping inference times as low as possible. The result is a model that can deliver accurate keypoints across a wide variety of poses, environments, and hardware setups.

Unlocking Live Health Applications with MoveNet

We teamed up with IncludeHealth, a digital health and performance company, to understand whether MoveNet can help unlock remote care for patients. IncludeHealth has developed an interactive web application that guides a patient through a variety of routines (using a phone, tablet, or laptop) from the comfort of their own home. The routines are digitally built and prescribed by physical therapists to test balance, strength, and range of motion.

The service requires web-based and locally run pose models for privacy that can deliver precise keypoints at high frame rates, which are then used to quantify and qualify human poses and movements. While a typical off-the-shelf detector is sufficient for easy movements such as shoulder abductions or full body squats, more complicated poses such as seated knee extensions or supine positions (laying down) cause grief for even state-of-the-art detectors trained on the wrong data.

Comparison of a traditional detector (top) vs MoveNet (bottom) on difficult poses.
Comparison of a traditional detector (top) vs MoveNet (bottom) on difficult poses.

We provided an early release of MoveNet to IncludeHealth, accessible through the new pose-detection API. This model is trained on fitness, dance, and yoga poses (see more details about the training dataset below). IncludeHealth integrated the model into their application and benchmarked MoveNet relative to other available pose detectors:

“The MoveNet model has infused a powerful combination of speed and accuracy needed to deliver prescriptive care. While other models trade one for the other, this unique balance has unlocked the next generation of care delivery. The Google team has been a fantastic collaborator in this pursuit.” – Ryan Eder, Founder & CEO at IncludeHealth.

As a next step, IncludeHealth is partnering with hospital systems, insurance plans, and the military to enable the extension of traditional care and training beyond brick and mortar.

IncludeHealth demo application running in browser that quantifies balance and motion using keypoint estimation powered by MoveNet and TensorFlow.js
IncludeHealth demo application running in browser that quantifies balance and motion using keypoint estimation powered by MoveNet and TensorFlow.js

Installation

There are two ways to use MoveNet with the new pose-detection API:

  1. Through NPM:
    import * as poseDetection from '@tensorflow-models/pose-detection';
  2. Through script tag:
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-converter"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/pose-detection"></script>

Try it yourself!

Once the package is installed, you only need to follow the few steps below to start using it:

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.MoveNet);

The detector defaults to use the Lightning version; to choose the Thunder version, create the detector as below:

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.MoveNet, {modelType: poseDetection.movenet.modelType.SINGLEPOSE_THUNDER});
// Pass in a video stream to the model to detect poses.
const video = document.getElementById('video');
const poses = await detector.estimatePoses(video);

Each pose contains 17 keypoints, with absolute x, y coordinates, confidence score and name:

console.log(poses[0].keypoints);
// Outputs:
// [
// {x: 230, y: 220, score: 0.9, name: "nose"},
// {x: 212, y: 190, score: 0.8, name: "left_eye"},
// ...
// ]

Refer to our README for more details about the API.

As you begin to play and develop with MoveNet, we would appreciate your feedback and contributions. If you make something using this model, tag it with #MadeWithTFJS on social so we can find your work, as we would love to see what you create.

MoveNet Deep Dive

MoveNet Architecture

MoveNet is a bottom-up estimation model, using heatmaps to accurately localize human keypoints. The architecture consists of two components: a feature extractor and a set of prediction heads. The prediction scheme loosely follows CenterNet, with notable changes that improve both speed and accuracy. All models are trained using the TensorFlow Object Detection API.

The feature extractor in MoveNet is MobileNetV2 with an attached feature pyramid network (FPN), which allows for a high resolution (output stride 4), semantically rich feature map output. There are four prediction heads attached to the feature extractor, responsible for densely predicting a:

  • Person center heatmap: predicts the geometric center of person instances
  • Keypoint regression field: predicts full set of keypoints for a person, used for grouping keypoints into instances
  • Person keypoint heatmap: predicts the location of all keypoints, independent of person instances
  • 2D per-keypoint offset field: predicts local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint
MoveNet architecture
MoveNet architecture

Although these predictions are computed in parallel, one can gain insight into the model’s operation by considering the following sequence of operations:

Step 1: The person center heatmap is used to identify the centers of all individuals in the frame, defined as the arithmetic mean of all keypoints belonging to a person. The location with the highest score (weighted by the inverse-distance from the frame center) is selected.

Step 2: An initial set of keypoints for the person is produced by slicing the keypoint regression output at the pixel corresponding to the object center. Since this is a center-out prediction – which must operate over different scales – the regressed keypoints will not be very accurate.

Step 3: Each pixel in the keypoint heatmap is multiplied by a weight which is inversely proportional to the distance from the corresponding regressed keypoint. This ensures that we do not accept keypoints from background people, since they typically will not be in the proximity of regressed keypoints, and hence will have low resulting scores.

Step 4: The final set of keypoint predictions are selected by retrieving the coordinates of the maximum heatmap values in each keypoint channel. The local 2D offset predictions are then added to these coordinates to give refined estimates. See the figure below which illustrates these four steps.

MoveNet post-processing steps
MoveNet post-processing steps.
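To make the four steps concrete, here is a heavily simplified NumPy sketch of the decoding logic described above; the array shapes and the exact inverse-distance weighting form are assumptions, not the model’s precise implementation:

import numpy as np

# Assumed shapes (feature-map space), for K = 17 keypoints:
#   center_heatmap: (H, W)
#   kpt_regression: (H, W, 2 * K)   regressed (y, x) for all K keypoints
#   kpt_heatmaps:   (H, W, K)
#   offsets:        (H, W, K, 2)    local (y, x) offsets
def decode_single_pose(center_heatmap, kpt_regression, kpt_heatmaps, offsets):
    H, W, K = kpt_heatmaps.shape
    ys, xs = np.mgrid[0:H, 0:W]

    # Step 1: pick the person center, weighting scores by inverse distance
    # from the frame center.
    dist_to_frame_center = np.sqrt((ys - H / 2) ** 2 + (xs - W / 2) ** 2)
    center_scores = center_heatmap / (dist_to_frame_center + 1.0)
    cy, cx = np.unravel_index(np.argmax(center_scores), (H, W))

    # Step 2: slice the keypoint regression output at the chosen center to
    # get a rough initial estimate of all K keypoints.
    initial_kpts = kpt_regression[cy, cx].reshape(K, 2)  # (K, [y, x])

    refined = np.zeros((K, 2))
    for k in range(K):
        # Step 3: reweight the k-th keypoint heatmap by inverse distance to
        # the regressed keypoint, suppressing background people.
        d = np.sqrt((ys - initial_kpts[k, 0]) ** 2 +
                    (xs - initial_kpts[k, 1]) ** 2)
        weighted = kpt_heatmaps[:, :, k] / (d + 1.0)

        # Step 4: take the argmax of the weighted heatmap and add the local
        # 2D offset for a sub-pixel estimate.
        ky, kx = np.unravel_index(np.argmax(weighted), (H, W))
        refined[k] = [ky + offsets[ky, kx, k, 0], kx + offsets[ky, kx, k, 1]]
    return refined  # (K, [y, x]) refined keypoint coordinates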

Training Datasets

MoveNet was trained on two datasets: COCO and an internal Google dataset called Active. While COCO is the standard benchmark dataset for detection – due to its scene and scale diversity – it is not suitable for fitness and dance applications, which exhibit challenging poses and significant motion blur. Active was produced by labeling keypoints (adopting COCO’s standard 17 body keypoints) on yoga, fitness, and dance videos from YouTube. No more than three frames are selected from each video for training, to promote diversity of scenes and individuals.

Evaluations on the Active validation dataset show a significant performance boost relative to identical architectures trained using only COCO. This isn’t surprising since COCO infrequently exhibits individuals with extreme poses (e.g. yoga, pushups, headstands, and more).

To learn more about the dataset and how MoveNet performs across different categories, please see the model card.

Images from Active keypoint dataset.
Images from Active keypoint dataset.

Optimization

While a lot of effort went into architecture design, post-processing logic, and data selection to make MoveNet a high-quality detector, an equal focus was given to inference speed. First, bottleneck layers from MobileNetV2 were selected for lateral connections in the FPN. Likewise, the number of convolution filters in each prediction head were slimmed down significantly to speed up execution on the output feature maps. Depthwise separable convolutions are used throughout the network, except in the first MobileNetV2 layer.

MoveNet was repeatedly profiled, uncovering and removing particularly slow ops. For example, we replaced tf.math.top_k with tf.math.argmax, since it executes significantly faster and is adequate for the single-person setting.

To ensure fast execution with TensorFlow.js, all model outputs were packed into a single output tensor, so that there is only one download from GPU to CPU.

Perhaps the most significant speedup is the use of 192×192 inputs to the model (256×256 for Thunder). To counteract the lower resolution, we apply intelligent cropping based on detections from the previous frame. This allows the model to devote its attention and resources to the main subject, and not the background.

Temporal Filtering

Operating on a high FPS camera stream provides the luxury of applying smoothing to keypoint estimates. Both Lightning and Thunder apply a robust, non-linear filter to the incoming stream of keypoint predictions. This filter is tuned to simultaneously suppress high-frequency noise (i.e. jitter) and outliers from the model, while also maintaining high-bandwidth throughput during quick motions. This leads to smooth keypoint visualizations with minimal lag in all circumstances.
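The exact filter isn’t described here; purely as an illustration of the idea, a naive per-keypoint exponential smoother might look like the sketch below (the real filter additionally adapts to motion speed and rejects outliers):

# Illustration only: a simple exponential smoother applied per keypoint.
class KeypointSmoother:
    def __init__(self, alpha=0.5):
        self.alpha = alpha   # closer to 1 = less smoothing, closer to 0 = more
        self.state = None    # last smoothed (x, y) per keypoint

    def update(self, keypoints):
        """keypoints: list of (x, y) tuples for the current frame."""
        if self.state is None:
            self.state = list(keypoints)
        else:
            self.state = [
                (self.alpha * x + (1 - self.alpha) * sx,
                 self.alpha * y + (1 - self.alpha) * sy)
                for (x, y), (sx, sy) in zip(keypoints, self.state)
            ]
        return self.state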

MoveNet Browser Performance

To quantify the inference speed of MoveNet, the model was benchmarked across multiple devices. The model latency (expressed in FPS) was measured on GPU with WebGL, as well as WebAssembly (WASM), which is the typical backend for devices with lower-end or no GPUs.

| Backend | MacBook Pro 15” 2019 (Intel Core i9, AMD Radeon Pro Vega 20), FPS | iPhone 12, FPS | Pixel 5, FPS | Desktop (Intel i9-10900K, Nvidia GTX 1070), FPS |
|---|---|---|---|---|
| WebGL | 104 / 77 | 51 / 43 | 34 / 12 | 87 / 82 |
| WASM (with SIMD + multithread) | 42 / 21 | N/A | N/A | 71 / 30 |

Inference speed of MoveNet across different devices and TF.js backends. The two numbers in each cell are for Lightning and Thunder, respectively.

TF.js continuously optimizes its backends to accelerate model execution across all supported devices. We applied several techniques here to help the models achieve this performance, such as implementing a packed WebGL kernel for the depthwise separable convolutions and improving GL scheduling for mobile Chrome.

To see the model’s FPS on your device, try our demo. You can switch the model type and backends live in the demo UI to see what works best for your device.

Looking Ahead

The next step is to extend Lightning and Thunder models to the multi-person domain, so that developers can support applications with multiple people in the camera field-of-view.

We also have plans to speed up the TensorFlow.js backends to make model execution even faster. This is achieved through repeated benchmarking and backend optimization.

Acknowledgements

We would like to acknowledge the other contributors to MoveNet: Yu-Hui Chen, Ard Oerlemans, Francois Belletti, Andrew Bunner, and Vijay Sundaram, along with those involved with the TensorFlow.js pose-detection API: Ping Yu, Sandeep Gupta, Jason Mayes, and Masoud Charkhabi.


Building a TinyML Application with TF Micro and SensiML

A guest post by Chris Knorowski, SensiML CTO

TinyML reduces the complexity of adding AI to the edge, enabling new applications where streaming data back to the cloud is prohibitive. Some examples of applications that are making use of TinyML right now are:

  • Visual and audio wake words that trigger an action when a person is detected in an image or a keyword is spoken.
  • Predictive maintenance on industrial machines using sensors to continuously monitor for anomalous behavior.
  • Gesture and activity detection for medical, consumer, and agricultural devices, such as gait analysis, fall detection or animal health monitoring.

One common factor for all these applications is the low cost and power usage of the hardware they run on. Sure, we can detect audio and visual wake words or analyze sensor data for predictive maintenance on a desktop computer. But, for a lot of these applications to be viable, the hardware needs to be inexpensive and power efficient (so it can run on batteries for an extended time).

Fortunately, the hardware is now getting to the point where running real-time analytics is possible. It is crazy to think about, but the Arm Cortex-M4 processor can do more FFTs per second than the Pentium 4 processor while using orders of magnitude less power. Similar gains in power/performance have been made in sensors and wireless communication. TinyML allows us to take advantage of these advances in hardware to create all sorts of novel applications that simply were not possible before.

At SensiML our goal is to empower developers to rapidly add AI to their own edge devices, allowing their applications to autonomously transform raw sensor data into meaningful insight. We have taken years of lessons learned in creating products that rely on edge optimized machine learning and distilled that knowledge into a single framework, the SensiML Analytics Toolkit, which provides an end-to-end development platform spanning data collection, labeling, algorithm development, firmware generation, and testing.

So what does it take to build a TinyML application?

Building a TinyML application draws on skill sets spanning hardware engineering, embedded programming, software engineering, machine learning, data science, and domain expertise about the application you are building. The steps required to build the application can be broken into four parts:

  1. Collecting and annotating data
  2. Applying signal preprocessing
  3. Training a classification algorithm
  4. Creating firmware optimized for the resource budget of an edge device

This tutorial will walk you through all the steps, and by the end of it you will have created an edge optimized TinyML application for the Arduino Nano 33 BLE Sense that is capable of recognizing different boxing punches in real-time using the Gyroscope and Accelerometer sensor data from the onboard IMU sensor.

Gesture recognition using TinyML. Male punching a punching bag

What you need to get started

We will use the SensiML Analytics Toolkit to handle collecting and annotating sensor data, creating a sensor preprocessing pipeline, and generating the firmware. We will use TensorFlow to train our machine learning model and TensorFlow Lite Micro for inferencing. Before you start, we recommend signing up for SensiML Community Edition to get access to the SensiML Analytics Toolkit.

The Software

  • SensiML Data Capture Lab
  • SensiML Analytics Studio
  • SensiML Open Gateway
  • Google Colab (for TensorFlow model training)
  • Visual Studio Code with the PlatformIO plugin

The Hardware

  • Arduino Nano 33 BLE Sense
  • Adafruit Li-Ion Backpack Add-On (optional)
  • Lithium-Ion Polymer Battery (3.7 V, 100 mAh)
  • Zebra Byte Case
  • Glove and Double Sided Tape

The Arduino Nano 33 BLE Sense has an Arm Cortex-M4 microcontroller running at 64 MHz with 1 MB of flash memory and 256 KB of RAM. If you are used to working with cloud or mobile applications, this may seem tiny, but many applications can run in such a resource-constrained environment.

The Nano 33 BLE Sense also has a variety of onboard sensors which can be used in your TinyML applications. For this tutorial, we are using the motion sensor which is a 9-axis IMU (accelerometer, gyroscope, magnetometer).

For wireless power, we used the Adafruit Li-Ion Battery Pack. If you do not have the battery pack, you can still walk through this tutorial using a suitably long micro USB cable to power the board, though collecting gesture data is not quite as fun when you are wired. See the images below for how to hook up the battery to the Nano 33 BLE Sense.

Nano 33 BLE Sense
battery connected to boxing glove

Building Your Data Set

For every machine learning project, the quality of the final product depends on the quality of your data set. Time-series data, unlike image and audio data, is typically unique to each application, so you often need to collect and annotate your own datasets. The next part of this tutorial will walk you through how to connect to the Nano 33 BLE Sense to stream data wirelessly over BLE, as well as how to label the data so it can be used to train a TensorFlow model.

For this project we are going to collect data for 5 different gestures as well as some data for negative cases which we will label as Unknown. The 5 boxing gestures we are going to collect data for are Jab, Overhand, Cross, Hook, and Uppercut.

boxing gestures

We will also collect data on both the right and left glove, giving us a total of 10 different classes. To simplify things, we will build two separate models: one for the right glove and one for the left. This tutorial focuses on the left glove.

Streaming sensor data from the Nano 33 over BLE

The first challenge of a TinyML project is often to figure out how to get data off of the sensor. Depending on your needs you may choose Wi-Fi, BLE, Serial, or LoRaWAN. Alternatively, you may find storing data to an internal SD card and transferring the files after is the best way to collect data. For this tutorial, we will take advantage of the onboard BLE radio to stream sensor data from the Nano 33 BLE Sense.

We are going to use the SensiML Open Gateway running on our computer to retrieve the sensor data. To download and launch the gateway open a terminal and run the following commands:

git clone https://github.com/sensiml/open-gateway

cd open-gateway
pip3 install -r requirements.txt
python3 app.py

The gateway should now be running on your machine.

Gateway

Next, we need to connect the gateway server to the Nano 33 BLE Sense. Make sure you have flashed the Data Collection Firmware to your Nano 33. This firmware implements the Simple Streaming Interface specification, which exposes two topics used for streaming data: the /config topic returns a JSON description of the sensor data, and the /stream topic streams the raw sensor data as a byte array of Int16 values.
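As a rough illustration of the data format (not the gateway’s actual implementation, and with a hypothetical /config payload), decoding one /stream packet in Python might look like this:

import struct

# Hypothetical /config payload: six Int16 channels (3-axis accelerometer
# plus 3-axis gyroscope) sampled at 119 Hz.
config = {"sample_rate": 119,
          "column_location": {"AccX": 0, "AccY": 1, "AccZ": 2,
                              "GyrX": 3, "GyrY": 4, "GyrZ": 5}}
num_channels = len(config["column_location"])

def decode_packet(payload: bytes):
    """Unpack little-endian Int16 samples into rows of one value per channel."""
    values = struct.unpack("<%dh" % (len(payload) // 2), payload)
    return [values[i:i + num_channels]
            for i in range(0, len(values), num_channels)]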

To configure the gateway to connect to your sensor:

  1. Go to the gateway address in your browser (defaults to localhost:5555)
  2. Click on the Home Tab
  3. Set Device Mode: Data Capture
  4. Set Connection Type: BLE
  5. Click the Scan button, and select the device named Nano 33 DCL
  6. Click the Connect to Device button
SensiML Gateway

The gateway will pull the configuration from your device, and be ready to start forwarding sensor data. You can verify it is working by going to the Test Stream tab and clicking the Start Stream button.

Setting up the Data Capture Lab Project

Now that we can stream data, the next step is to record and label the boxing gestures. To do that we will use the SensiML Data Capture Lab. If you haven’t already done so, download and install the Data Capture Lab to record sensor data.

We have created a template project to get you started. The project is prepopulated with the gesture labels and metadata information, along with some pre-recorded example gestures files. To add this project to your account:

  1. Download and unzip the Boxing Glove Gestures Demo Project
  2. Open the Data Capture Lab
  3. Click Upload Project
  4. Click Browse which will open the file explorer window
  5. Navigate to the Boxing Glove Gestures Demo folder you just unzipped and select the Boxing Glove Gestures Demo.dclproj file
  6. Click Upload
SensiML Data Capture Lab

Connecting to the Gateway

After uploading the project, you can start capturing sensor data. For this tutorial we will be streaming data to the Data Capture Lab from the gateway over TCP/IP. To connect to the Nano 33 BLE Sense from the Data Capture Lab through the gateway:

  1. Open the Project Boxing Glove Gestures Demo
  2. Click Switch Modes -> Capture Mode
  3. Select Connection Method: Wi-Fi
  4. Click the Find Devices button
  5. Enter the IP Address of your gateway machine, and the port the server is running on (typically 127.0.0.1:5555)
  6. Click Add Device
  7. Select the newly added device
  8. Click the Connect button
wi-fi connection

You should see sensor data streaming across the screen. If you are having trouble with this step, see the full documentation here for troubleshooting.

boxing gesture data streaming

Capturing Boxing Gesture Sensor Data

The Data Capture Lab can also play videos that have been recorded alongside your sensor data. If you want to capture videos and sync them up with sensor data see the documentation here. This can be extremely helpful during the annotation phase to help interpret what is happening at a given point in the time-series sensor waveforms.

Now that data is streaming into the Data Capture Lab, we can begin capturing our gesture data set.

  1. Select “Jab” from the Label dropdown in the Capture Properties screen. (this will be the name of the file)
  2. Select the Metadata which captures the context (subject, glove, experience, etc.)
  3. Then click the Begin Recording button to start recording the sensor data
  4. Perform several “Jab” gestures
  5. Click the Stop Recording button when you are finished

After you hit stop recording, the captured data will be saved locally and synced with the cloud project. You can view the file by going to the Project Explorer and double-clicking on the newly created file.

GIF showing male boxing while program collects data

The following video walks through capturing sensor data.

Annotating Sensor Data

To classify sensor data in real-time, you need to decide how much and which portion of the sensor stream to feed to the classifier. On edge devices, it gets even more difficult as you are limited to a small buffer of data due to the limited RAM. Identifying the right segmentation algorithm for an application can save on battery life by limiting the number of classifications performed as well as improving the accuracy by identifying the start and end of a gesture.

Segmentation algorithms work by taking the input from the sensor and buffering the data until they determine a new segment has been found. At that point, they pass the data buffer down to the result of the pipeline. The simplest segmentation algorithm is a sliding window, which continually feeds a set chunk of data to the classifier. However, there are many drawbacks to the sliding window for discrete gesture recognition, such as performing classifications when there are no events. This wastes battery and runs the risk of having events split across multiple windows which can lower accuracy.
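As a point of reference for the simplest approach mentioned above, a plain sliding-window segmenter might look like the sketch below; the event-based segmenter used in this project instead detects the start and end of a gesture so the classifier only runs when an event actually occurs.

from collections import deque

# Illustration only: a fixed-size sliding window segmenter.
class SlidingWindowSegmenter:
    def __init__(self, window_size=128, step=64):
        self.window_size = window_size
        self.step = step
        self.buffer = deque(maxlen=window_size)
        self.since_last = 0

    def add_sample(self, sample):
        """Feed one multi-channel sample; return a full window when ready."""
        self.buffer.append(sample)
        self.since_last += 1
        if len(self.buffer) == self.window_size and self.since_last >= self.step:
            self.since_last = 0
            return list(self.buffer)   # hand this window to the classifier
        return None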

Segmenting in the Data Capture Lab

We identify events in the Data Capture Lab by creating Segments around the events in your sensor data. Segments are displayed with a pair of blue and red lines when you open a file and define where an event is located.

The Data Capture Lab has two methods for labeling your events: Manual and Auto. In manual mode you can manually drag and drop a segment onto the graph to identify an event in your sensor data. Auto mode uses a segmentation algorithm to automatically detect events based on customizable parameters. For this tutorial, we are going to use a segmentation algorithm in Auto mode. The segmentation algorithms we use for determining events will also be compiled as part of the firmware so that the on-device model will be fed the same segments of data it was trained against.

We have already created a segmentation algorithm for this project based on the dataset we have collected so far. To perform automatic event detection on a newly captured data file:

  1. Select the file from the Project Explorer
  2. Click on the Detect Segments button
  3. The segmentation algorithm will be run against the capture and the segments it finds will be added to the file
GIF of auto-segmentation

Note: If the events are not matching the real segments in your file, you may need to adjust the parameters of the segmentation algorithm.

Labeling Events in the Data Capture Lab

Keep in mind that automatic event detection only detects that an event has occurred, it does not determine what type of event has occurred. For each event that was detected, you will need to apply a label to them. To do that:

  1. Select one or more of the segments from the graph
  2. Click the Edit button or (Ctrl+E)
  3. Specify which label is associated with that event
  4. Repeat steps 1-3 for all segments in the capture
  5. Click Save
GIF labeling data

Building a TinyML Model

We are going to use Google Colab to train our machine learning model using the data we collected from the Nano 33 BLE Sense in the previous section. Colab provides a Jupyter notebook that allows us to run our TensorFlow training in a web browser. Open the Google Colab notebook and follow along to train your model.

Offline Model Validation

After saving the model, go to the Analytics Studio to perform offline validation. To test the model against any of the captured data files:

  1. Open the Boxing Glove Gestures Demo project in the Summary Tab
    SensiML analytics studio
  2. Go to Test Model Tab
  3. Select your model from the Model Name dropdown
  4. Select one or more of the capture files by clicking on them
  5. Click the Compute Accuracy Button to classify the captures using the selected model
compute accuracy

When you click the Compute Accuracy button, the segmentation algorithm, preprocessing steps, and TensorFlow model are compiled into a single Knowledge Pack. Then the classification results and accuracy for each of the captures you selected are computed using the compiled Knowledge Pack. Click the Results button for an individual capture to see the classifications for all of the detected events and how they compare with the ground-truth labels.

Deploy and Test on the Nano 33 BLE Sense

Downloading the model as firmware

Now that you validated the model offline, it’s time to see how it performs at the edge. To do that we download and flash the model to the Nano 33 BLE Sense.

  1. Go to the Download Model tab of the Analytics Studio
  2. Select the HW Platform: Arduino CortexM4
  3. Select Format: Library
  4. Click the Download button
  5. The compiled library file should download to your computer
download knowledge pack

Flashing the Firmware

After downloading the library, we will build and upload the firmware to the Nano 33 BLE Sense. For this step, you will need the Nano 33 Knowledge Pack Firmware. To compile the firmware, we are using Visual Studio Code with the PlatformIO plugin. To compile your model and flash the Nano 33 BLE Sense with this firmware:

  1. Open your terminal and run
    git clone https://github.com/sensiml/nano33_knowledge_pack/
  2. Unzip the downloaded Knowledge Pack.
  3. In the folder, you will find the following directories:

    knowledgepack_project/

    libsensiml/

  4. Copy the files from libsensiml to nano33_knowledge_pack/lib/sensiml. You will overwrite the files included in the repository.
  5. Copy the files from knowledgepack_project to nano33_knowledge_pack/src/
  6. Switch to the PlatformIO extension tab in VS Code
  7. Connect your Nano 33 BLE Sense to your computer using the micro USB cable.
  8. Click Upload and Monitor under nano33ble_with_tensorflow in the PlatformIO tab.
    Upload and Monitor tab

When the device restarts, it will boot up and your model will be running automatically. The video below walks through these steps.

Viewing Classification Results

To see the classification results in real-time connect to your device over BLE using the Android TestApp or the SensiML Open Gateway. The device will show up with the name Nano33 SensiML KP when you scan for devices. We have trained two models, one for the left glove and one for the right glove. You can see a demo of both models running at the same time in the following video.

Conclusion

We hope this blog has given you the tools you need to start building an end-to-end TinyML application using TensorFlow Lite for Microcontrollers and the SensiML Analytics Toolkit. For more tutorials and examples of TinyML applications, check out the application examples in our documentation. Follow us on LinkedIn or get in touch with us; we love hearing about all of the amazing TinyML applications the community is working on!
