A Tour of SavedModel Signatures

March 3, 2021

by TensorFlow Blog Tensorflow

Posted by Daniel Ellis, TensorFlow Engineer

Note: This blog post is aimed at TensorFlow developers who want to learn the details of how graphs and models are stored. If you are new to TensorFlow, you should check out the TensorFlow Basics guides before reading this article.

TensorFlow can run models without the original Python objects, as demonstrated by TensorFlow Serving and TensorFlow Lite, or when you download a trained model from TensorFlow Hub.

Models and layers can be loaded from this representation without actually making an instance of the Python class that created it. This is desired in situations where you do not have (or want) a Python interpreter, such as serving at scale or on an edge device, or in situations where the original Python code is not available.

Saved models are represented by two separate, but equally important, parts: the graph, which captures the fixed computation defined in code, and the weights, which are the parameters learned during training. If you aren’t already familiar with this and @tf.function, you should check out the Introduction to graphs and functions guide as well as the section on saving in the modules, layers, and models guide.

From a code standpoint, functions decorated with @tf.function create a Python callable; in the documentation we refer to these as polymorphic functions, as they are Python callables that can take a variety of argument signatures. Each time you call a @tf.function with a new argument signature, TensorFlow traces out a new graph just for that set of arguments. This new graph is then added as a “concrete function” to the callable. Thus, a saved model can be one or more subgraphs, each with a different signature.
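
As a minimal sketch of this tracing behaviour, assuming a TensorFlow 2 installation, the snippet below asks one polymorphic function for two concrete functions. The function name scale is hypothetical and chosen only for illustration.

```python
import tensorflow as tf


@tf.function
def scale(x):
  # One Python callable; TensorFlow traces a separate graph per input signature.
  return tf.constant(3.0) * x


# Request a trace for a scalar input and another for a length-3 vector input.
scalar_fn = scale.get_concrete_function(
    tf.TensorSpec(shape=(), dtype=tf.float32))
vector_fn = scale.get_concrete_function(
    tf.TensorSpec(shape=(3,), dtype=tf.float32))

# Each concrete function carries its own graph and its own input signature.
print(scalar_fn.structured_input_signature)
print(vector_fn.structured_input_signature)
```

Each concrete function accepts only inputs compatible with the signature it was traced for, which is why a polymorphic function may end up holding several of them.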

A SavedModel is what you get when you call tf.saved_model.save(). Saved models are stored as a directory on disk. The file saved_model.pb, within that directory, is a protocol buffer describing the functional tf.Graph.

In this blog post, we’ll take a look inside this protobuf and see how function signature serialization and deserialization works under the hood. After reading this, you’ll have a greater appreciation for what functions and signatures look like once saved, which can help you load, modify, or optimize saved models.

Background

There are a total of five places inputs to functions are defined in the saved model protobuf. It can be tough to understand and remember what each of these does. This post intends to inventory each of these definitions and what they’re used for. It also goes through a basic example illustrating what a simple model looks like after serialization.

The actual APIs you use will always be carefully versioned (as they have been since 2016), and the models themselves will conform to the version compatibility guide. However, the material in this document lays out a snapshot of the existing state of things. Any links to code will include point-in-time revisions so as not to drift out of date. As with all non-documented implementation details, these details are subject to change in the future.

We’ll occasionally use the term “signatures” to talk about the general concept of describing function inputs (e.g. in the title of this document). In this sense, we will be referring not just to TensorFlow’s specific concept of signatures, but all of the ways TensorFlow defines and validates inputs to functions. Context should make the meaning clear.

What This Is Not About

This document is not intended to describe how signatures or functions work from a user perspective. It is intended for TensorFlow developers working on the internals of TensorFlow. Likewise, this document does not make a statement of the way things “should” be. It aims to simply document the way things are.

Overview of Signature Definitions

There are five protos that store definitions of function inputs in one manner or another. Their names and code locations, as well as their paths within the saved model proto, are as follows:

Proto messages, and their location in SavedModel

FunctionDef: meta_graphs -> graph_def -> library -> function
SignatureDef: meta_graphs -> signature_def
SavedFunction: meta_graphs -> object_graph_def -> nodes -> kind -> function
SavedBareConcreteFunction: meta_graphs -> object_graph_def -> nodes -> kind -> bare_concrete_function
SavedConcreteFunction: meta_graphs -> object_graph_def -> concrete_functions
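
One way to locate these five places in a real file is to parse saved_model.pb with the generated protobuf classes. The sketch below assumes a TensorFlow 2 installation. The Doubler module and the temporary export directory are hypothetical and are not the example used later in this post.

```python
import os
import tempfile

import tensorflow as tf
from tensorflow.core.protobuf import saved_model_pb2


class Doubler(tf.Module):  # hypothetical toy module, for illustration only
  @tf.function(input_signature=[tf.TensorSpec(shape=(), dtype=tf.float32)])
  def double(self, x):
    return 2.0 * x


export_dir = os.path.join(tempfile.mkdtemp(), "doubler")
module = Doubler()
tf.saved_model.save(module, export_dir, signatures={"double": module.double})

# Parse the protobuf that sits next to the variables directory.
saved_model = saved_model_pb2.SavedModel()
with open(os.path.join(export_dir, "saved_model.pb"), "rb") as f:
  saved_model.ParseFromString(f.read())

meta_graph = saved_model.meta_graphs[0]
object_graph = meta_graph.object_graph_def

# FunctionDef: meta_graphs -> graph_def -> library -> function
function_defs = [fn.signature.name
                 for fn in meta_graph.graph_def.library.function]
# SignatureDef: meta_graphs -> signature_def
signature_defs = sorted(meta_graph.signature_def.keys())
# SavedFunction / SavedBareConcreteFunction: object_graph_def -> nodes -> kind
node_kinds = [node.WhichOneof("kind") for node in object_graph.nodes]
# SavedConcreteFunction: object_graph_def -> concrete_functions
saved_concrete_functions = sorted(object_graph.concrete_functions.keys())

print(function_defs, signature_defs, node_kinds, saved_concrete_functions)
```

The exact function names and node ordering depend on the TensorFlow version, so the printed values should be read as a snapshot rather than a stable contract.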

FunctionDef

Of the five definitions discussed in this document, FunctionDefs are the most core to execution. When loading a saved model, these function definitions are registered in the function library of the runtime and used to create ConcreteFunctions. These functions can then be executed via PartitionedCall or TFE_Py_Execute.

This is where the actual nodes describing execution are defined, as well as what the inputs and outputs to the function are.

SignatureDef

SignatureDefs are generated from signatures passed into @tf.function. We do not save the signature’s TensorSpecs directly, however. Instead, when saving, we call the underlying function using the TensorSpecs in order to generate a concrete function. From there, we inspect the generated concrete function to get the inputs and outputs, storing them on the SignatureDef.

On the loading side, SignatureDefs are essentially ignored. They are primarily used in v1 or C++, where the developer loading the model can inspect the returned SignatureDef protos directly. This allows them to use their desired signature name to look up the placeholder and output names needed for execution.

These input and output names can then be passed into feeds and fetches when calling Session.run in TensorFlow V1 code.

SavedFunction

SavedFunction is one of the many types of SavedObjects in the nodes list of the ObjectGraphDef. SavedFunctions are restored into RestoredFunctions at load time. Like all nodes in this list, they are then attached to the returned model via the hierarchy defined by the children ObjectReference field.

SavedFunction’s main purpose is polymorphism. SavedFunctions support polymorphism by specifying a number of concrete function names defined in the function library above (via FunctionDef). At call time, we iterate through the concrete function names to find the first whose signature matches. If we find a match, we call it; if not, we throw an exception.
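
The first-match lookup described above can be sketched in plain Python. This is an illustration of the idea only, not TensorFlow's implementation: the ConcreteFn class, the shape_of helper and the trace names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class ConcreteFn:
  """Hypothetical stand-in for one traced concrete function."""
  name: str
  input_shape: Tuple[int, ...]  # the only input shape this trace accepts
  body: Callable


def shape_of(value) -> Tuple[int, ...]:
  # Toy shape inference: a scalar is (), a flat list is (len,).
  return (len(value),) if isinstance(value, (list, tuple)) else ()


def call_restored(candidates: Sequence[ConcreteFn], value):
  # Walk the candidates in order and call the first whose signature matches.
  for fn in candidates:
    if fn.input_shape == shape_of(value):
      return fn.name, fn.body(value)
  raise ValueError(f"No concrete function matches shape {shape_of(value)}")


candidates = [
    ConcreteFn("scalar_trace", (), lambda x: 3.0 * x),
    ConcreteFn("vector_trace", (3,), lambda xs: [3.0 * x for x in xs]),
]

print(call_restored(candidates, 4.0))
print(call_restored(candidates, [1.0, 2.0, 3.0]))
```

Calling call_restored with a length-2 list raises, mirroring the exception a restored function throws when no saved trace fits the arguments.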

There is one more bit of complexity. When a RestoredFunction is called with a particular set of arguments, a new concrete function is created whose sole purpose is to call the matching concrete function. This is done using restored_function_body under the hood and is where the logic lives to find the appropriate concrete function.

This is invisible in the SavedModel proto, but these extra concrete functions are registered at call time in the runtime’s function library just as the other function library functions are.

The second purpose of SavedFunction is to update the FunctionSpec of all associated ConcreteFunctions using the FunctionSpec stored on the SavedFunction. This function spec is used at call time to

validate passed in structured arguments, and
convert structured arguments into flat ones needed for calling the underlying concrete function.

SavedBareConcreteFunction

Similar to SavedFunctions, SavedBareConcreteFunctions are used to update a specific concrete function’s arguments and function spec. This is done here. Unlike SavedFunctions, they only reference a single specific concrete function.

In practice, SavedBareConcreteFunctions are commonly attached to and accessed via the signatures map (i.e. the signatures attribute on the loaded object). The underlying concrete functions they modify, in this case, are signature_wrapper functions. This wrapping is done to format the output in the way v1 expects (i.e. a dictionary of tensors). Similar to restored_function_body concrete functions, and other than restructuring the output, these concrete functions do nothing but call their associated concrete functions.

SavedConcreteFunction

SavedConcreteFunction objects are not SavedObjectGraph nodes. They are stored in a map directly on the SavedObjectGraph. These objects reference a specific, already-registered concrete function — the key in the map is that concrete function’s registered name.

These objects serve two purposes. The first is handling function “captures” via the bound_inputs field. Captured variables are those a function reads or modifies that were not explicitly passed in when calling into the function. Since functions in the function library do not have a concept of captured variables, any variables used by the function must be passed in as an argument. bound_inputs stores a list of node IDs that should be passed in to the underlying ConcreteFunction when called. We set this up here.
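
To make the role of bound_inputs concrete, here is a small plain-Python sketch of the idea, not the TensorFlow code. The Variable class, flat_capture_fn and bind_captures are hypothetical names. The arithmetic mirrors the capture_fn example used later in this post.

```python
from typing import Callable, Dict, List


class Variable:
  """Hypothetical stand-in for a saved variable node."""

  def __init__(self, value: float):
    self.value = value


def flat_capture_fn(x: float, weight: Variable) -> float:
  # A library function has no closure: the variable arrives as an argument.
  weight.value += x * weight.value
  return weight.value


def bind_captures(fn: Callable, bound_inputs: List[int],
                  nodes: Dict[int, Variable]) -> Callable:
  # Resolve node IDs to objects once, then append them on every call.
  captured = [nodes[node_id] for node_id in bound_inputs]

  def wrapper(*args):
    return fn(*args, *captured)

  return wrapper


nodes = {1: Variable(5.0)}  # node 1 plays the role of the SavedVariable
capture_fn = bind_captures(flat_capture_fn, bound_inputs=[1], nodes=nodes)
result = capture_fn(2.0)    # 5.0 + 2.0 * 5.0
print(result)
```

The caller only passes x. The wrapper supplies the captured variable, which is the job bound_inputs performs for a restored concrete function.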

The second purpose, and similar to SavedFunction and SavedBareConcreteFunction, is modifying the existing concrete function’s FuncGraph structured inputs and outputs. This also is used for argument validation. The setup for this is done here.

Example Walkthrough

A simple example may help illustrate all of this with more clarity. Let’s make a basic model and take a look at the subsequent generated proto to get a better feel for what’s going on.

Basic Model

import tensorflow as tf

class ExampleModel(tf.Module):

  @tf.function(input_signature=[tf.TensorSpec(shape=(), dtype=tf.float32)])
  def capture_fn(self, x):
    if not hasattr(self, 'weight'):
      self.weight = tf.Variable(5.0, name='weight')
    self.weight.assign_add(x * self.weight)
    return self.weight

  @tf.function
  def polymorphic_fn(self, x):
    return tf.constant(3.0) * x

model = ExampleModel()
model.polymorphic_fn(tf.constant(4.0))
model.polymorphic_fn(tf.constant([1.0, 2.0, 3.0]))
tf.saved_model.save(
    model, "/tmp/example-model", signatures={'capture_fn': model.capture_fn})

This model contains the basis for most of the complexity we’ll need to fully explore the intricacies of saving and signatures. This will allow us to look at functions with and without signatures, with and without captures, and with and without polymorphism.

Function with Captures

Let’s start by looking at our function with captures, capture_fn. We can see we have a concrete function defined in the function library, as expected:

A FunctionDef located in FunctionDefLibrary of MetaGraphDef.graph_def

Note the expected float input, "x", as well as the additional captured argument, "mul_readvariableop_resource". Since this function has a capture, we should see a variable being referenced in the bound_inputs field of one of our SavedConcreteFunctions:

A SavedConcreteFunction located in the concrete_functions map of the ObjectGraphDef

Indeed, we can see bound_inputs refers to node 1, which is a SavedVariable with the name and dtype we expect:

A SavedVariable located in ObjectGraphDef.nodes

Note that we also are storing on canonicalized_input_signature additional data that will be used to modify the concrete function. The key of this object, "__inference_capture_fn_59", is the same name as the concrete function registered in our function library.

Since we’ve specified a signature, we should also see a SavedBareConcreteFunction:

A SavedBareConcreteFunction located in ObjectGraphDef.nodes

As discussed above, we use the function spec and argument information to modify the underlying concrete function. But what’s up with the "__inference_signature_wrapper_68" name? And how does this fit in with the rest of the code?

First, note that this is the fifth (5) node in the node list. This will come up again shortly.

Let’s start by looking at the nodes list. If we start at the first node in the nodes list, we’ll see a "signatures" node attached as a child:

A SavedUserObject located in ObjectGraphDef.nodes

If we look at node 2, we’ll see this node is a signature map that references one final node: node 5, our SavedBareConcreteFunction.

A SavedUserObject located in ObjectGraphDef.nodes

Thus, when we access this function via model.signatures["capture_fn"], we will actually be calling into this intermediate signature wrapper function first.

And what does that function, "__inference_signature_wrapper_68", look like?

A FunctionDef located in FunctionDefLibrary of MetaGraphDef.graph_def

It takes the arguments we expect, and makes a call out to… "__inference_capture_fn_59", our original function! Just as we expect.

But wait… what happens if we don’t access our function via model.signatures["capture_fn"]? After all, we should be able to call it directly via model.capture_fn.

Notice above, we had a child on the top level object named "capture_fn" with a node_id of 3. If we look at node 3, we’ll see a SavedFunction object that references our original concrete function with no signature wrapper intermediary:

A SavedFunction located in ObjectGraphDef.nodes

Again, the function spec is used to modify the function spec of our concrete function, "__inference_capture_fn_59". Notice also that concrete_functions here is a list. We only have one item right now, but this will come up again when we take a look at our polymorphic function example.

Now, we’ve fully mapped essentially everything needed for execution of this function, but we have one last thing to look at: SignatureDef. We’ve defined a signature, so we expect a SignatureDef to be defined:

A SignatureDef located in the MetaGraphDef.signature_def map

This is very important for loading in v1 and C++ for serving. Note those funky names: "capture_fn_x:0" and "StatefulPartitionedCall:0". To call this function in v1, we need a way to map our nice argument names to the actual graph placeholder names for passing in as feeds and fetches (and doing validation, if we wish). Looking at this SignatureDef allows us to do just that.
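
The name mapping a SignatureDef provides can be sketched with ordinary dictionaries. This is an illustration only: the tensor names come from this post, while the "output_0" key and the build_feeds_and_fetches helper are assumptions made for the example.

```python
# Toy copy of the mapping a SignatureDef carries. Tensor names are taken from
# the post; the "output_0" key is an assumed output name.
signature_def = {
    "inputs": {"x": "capture_fn_x:0"},
    "outputs": {"output_0": "StatefulPartitionedCall:0"},
}


def build_feeds_and_fetches(signature, user_inputs):
  # Validate the caller's argument names against the signature.
  unknown = set(user_inputs) - set(signature["inputs"])
  missing = set(signature["inputs"]) - set(user_inputs)
  if unknown or missing:
    raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
  # Translate friendly argument names into graph tensor names.
  feeds = {signature["inputs"][key]: value
           for key, value in user_inputs.items()}
  fetches = dict(signature["outputs"])
  return feeds, fetches


feeds, fetches = build_feeds_and_fetches(signature_def, {"x": 2.0})
print(feeds, fetches)
```

In v1 code, dictionaries shaped like feeds and fetches are what would be handed to Session.run.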

Polymorphic Functions

We’re not quite done yet. Let’s take a look at our polymorphic function. We won’t repeat everything, since a lot of it is the same. We won’t have any signature wrapper functions or signature defs, since we skipped the signature on this one. Let’s look at what’s different.

A FunctionDef located in FunctionDefLibrary of MetaGraphDef.graph_def

For one, we now have two concrete functions registered in the function library, each with slightly different input shapes.

We also have two SavedConcreteFunction modifiers:

Two SavedConcreteFunctions located in the concrete_functions map of the ObjectGraphDef

And finally, we can see our SavedFunction references two underlying concrete functions instead of one:

A SavedFunction located in ObjectGraphDef.nodes

The function spec here will be attached to both of these concrete functions at load time. When we call our SavedFunction, it will use the arguments we pass in to find the correct concrete function and execute it.

Next Steps

You should now be an expert on how functions and their signatures are saved at a code level. Remember, what’s described in this blog post is how the code works right now. For updated code and examples in the future, see the official documentation on tensorflow.org.

Speaking of documentation, if you want a fast introduction to the basic APIs for saved models, you should read the introductory articles on how the APIs for functions and modules are traced and saved. For experts, don’t miss this detailed guide on SavedModel itself, as well as a complete discussion of autograph.

And finally, if you do any exciting or useful protobuf surgery, share with us on Twitter. Thanks for reading this far!

Transfer Learning for Audio Data with YAMNet

March 2, 2021

by TensorFlow Blog Tensorflow

Posted by Luiz GUStavo Martins, Developer Advocate

Transfer learning is a popular machine learning technique, in which you train a new model by reusing information learned by a previous model. Most common applications of transfer learning are for the vision domain, to train accurate image classifiers, or object detectors, using a small amount of data — or for text, where pre-trained text embeddings or language models like BERT are used to improve on natural language understanding tasks like sentiment analysis or question answering. In this article, you’ll learn how to use transfer learning for a new and important type of data: audio, to build a sound classifier.

There are many important use cases of audio classification, including to protect wildlife, to detect whales and even to fight against illegal deforestation.

With YAMNet, you can create a customized audio classifier in a few easy steps:

Prepare and use a public audio dataset
Extract the embeddings from the audio files using YAMNet
Create a simple two layer classifier and train it.
Save and test the final model

You can follow the code here in this tutorial.

The YAMNet model

YAMNet (“Yet another Audio Mobilenet Network”) is a pretrained model that predicts 521 audio events based on the AudioSet corpus.

This model is available on TensorFlow Hub including the TFLite and TF.js versions, for running the model on mobile and the web. The code can be found on their repository.

The model has 3 outputs:

Class scores that you’d use for inference
Embeddings, which are the important part for transfer learning
Log Mel Spectrograms to provide a visualization of the input signal

The model takes a waveform represented as 16 kHz samples in the range [-1.0, 1.0], frames it in windows of 0.96 seconds and hop of 0.48 seconds, and then runs the core of the model to extract the embeddings on a batch of these frames.

The 0.96 seconds windows hopping over a waveform
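
The framing arithmetic can be checked with a few lines of Python. This sketch assumes only full windows are counted and that no padding is applied to a trailing partial window. The real model may pad, so the count it produces can differ.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_SECONDS = 0.96
HOP_SECONDS = 0.48


def frame_count(num_samples: int) -> int:
  """Number of full windows, assuming no padding of a trailing partial window."""
  window = round(WINDOW_SECONDS * SAMPLE_RATE_HZ)  # 15360 samples
  hop = round(HOP_SECONDS * SAMPLE_RATE_HZ)        # 7680 samples
  if num_samples < window:
    return 0
  return 1 + (num_samples - window) // hop


def frame_starts(num_samples: int) -> list:
  """Start time in seconds of each full window."""
  hop = round(HOP_SECONDS * SAMPLE_RATE_HZ)
  return [i * hop / SAMPLE_RATE_HZ for i in range(frame_count(num_samples))]


five_second_clip = 5 * SAMPLE_RATE_HZ
print(frame_count(five_second_clip))
print(frame_starts(five_second_clip)[:3])
```

Because the hop is half the window, consecutive frames overlap by 50%.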

As an example, trying the model with this audio file [link] will give you these results:

The first graph is the waveform. The second graph is the log-mel spectrogram. The third graph shows the class probability predictions per frame of audio, where darker is more likely.

The ESC-50 dataset

To do transfer learning with the model, you’ll use the Dataset for Environmental Sound Classification, or ESC-50 for short. This is a collection of 2000 environmental audio recordings from 50 classes. Each recording is 5 seconds long and they came originally from the Freesound project.

The ESC-50 has the classes Dog and Cat that you’ll need.

The dataset has two important components: the audio files and a metadata csv file with the metadata about every audio file.

The columns in the metadata csv file contain information that will be used to train the model:

Filename gives the name of the .wav audio file
Category is the human-readable class name for the numeric target id
Target is the unique numeric id of the category
Fold ensures that clips originating from the same initial source are always contained in the same group. This is important to avoid cross-contamination when splitting the data into train, validation and test sets and for cross-validation.

For more detailed information you can read the original ESC paper.

Working with the dataset

To load the dataset, you’ll start from the metadata file and load it using the Pandas method read_csv.

With the loaded dataframe, the next steps are to filter by the classes that will be used, in this case: Dogs and Cats.

The next step would be to load the audio files, but if there are many of them, loading everything into memory at once can be prohibitive and lead to out-of-memory errors. A better approach is to load the audio files lazily, only when they are needed. TensorFlow makes this easy with tf.data.Dataset and its map method.

Let’s create the Dataset from the previously created pandas dataframe and apply the load_wav_for_map method to all the files:

filenames = filtered_pd['filename']
targets = filtered_pd['target']
folds = filtered_pd['fold']
 
main_ds = tf.data.Dataset.from_tensor_slices((filenames, targets, folds))
main_ds = main_ds.map(load_wav_for_map)

Here, no audio file has been loaded into memory yet, since the mapping hasn’t been evaluated. If you request the size of the dataset, for example with len(list(main_ds.as_numpy_iterator())), that forces the map function to be evaluated and loads all the files.
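
The same deferred behaviour can be seen with Python's built-in map, which is used here only as an analogy for Dataset.map. The load_wav function and the file names are hypothetical stand-ins.

```python
loaded = []  # records which files were actually "loaded"


def load_wav(filename):
  # Hypothetical loader: stands in for decoding a .wav file.
  loaded.append(filename)
  return [0.0] * 4


filenames = ["a.wav", "b.wav", "c.wav"]  # hypothetical file names
pipeline = map(load_wav, filenames)      # builds the pipeline, loads nothing

loaded_before = list(loaded)             # still empty at this point
first = next(pipeline)                   # pulls exactly one element
loaded_after_one = list(loaded)
remaining = len(list(pipeline))          # consuming the rest loads the rest

print(loaded_before, loaded_after_one, remaining)
```

Defining the mapping costs nothing. Work happens only when elements are consumed, which is why materializing the whole dataset triggers every load.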

The same technique will be used to extract all the features (embeddings) from each audio file.

Extracting the audio embeddings

Here you are going to load the YAMNet model from TensorFlow Hub. All you need is the model’s handle, and call the load method from the tensorflow_hub library.

yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)

This will load the model to memory ready to be used.

For each audio file, you’ll run YAMNet to extract the embeddings. The embeddings output is paired with the label and fold of the audio file it came from.

def extract_embedding(wav_data, label, fold):
  ''' run YAMNet to extract embedding from the wav data '''
  scores, embeddings, spectrogram = yamnet_model(wav_data)
  num_embeddings = tf.shape(embeddings)[0]
  return (embeddings,
            tf.repeat(label, num_embeddings), 
            tf.repeat(fold, num_embeddings))

main_ds = main_ds.map(extract_embedding).unbatch()

These embeddings will be the input for the classification model. From the model’s documentation, you can read that for a given audio file, it will frame the waveform into sliding windows of length 0.96 seconds and hop 0.48 seconds, and then run the core of the model. So, in summary, for each 0.48 seconds, the model will output one embedding array with 1024 float values. This part is also done using map(), so again, lazy evaluation and that’s why it executes so fast.

The final dataset contains the three used columns: embedding, label and fold.

The last dataset operation is to split the data into train, validation and test datasets. To do so, use the filter() method with the fold field (an integer between 1 and 5) as the criterion.

cached_ds = main_ds.cache()
train_ds = cached_ds.filter(lambda embedding, label, fold: fold < 4)
val_ds = cached_ds.filter(lambda embedding, label, fold: fold == 4)
test_ds = cached_ds.filter(lambda embedding, label, fold: fold == 5)

Training the Classifier

With the YAMNet embedding vectors and the label, the next step is to train a classifier that learns what’s a dog’s sound and what is a cat’s sound.

The classifier model is very simple with just two dense layers, but as you’ll see this is enough for the amount of data used.

my_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,), dtype=tf.float32, name='input_embedding'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(len(my_classes))
])

Saving the final model

The model that was trained works and has good accuracy but the input it expects is not an audio waveform but an embedding array. To address this problem, the final model will combine YAMNet as the input layer and the model just trained. This way, the final model will accept a waveform and output the class:

input_segment = tf.keras.layers.Input(shape=(), dtype=tf.float32,
                                       name='audio')
embedding_extraction_layer = hub.KerasLayer('https://tfhub.dev/google/yamnet/1', trainable=False)
scores, embeddings, spectrogram = embedding_extraction_layer(input_segment)
serving_outputs = my_model(embeddings)
serving_outputs = ReduceMeanLayer(axis=0, name='classifier')(serving_outputs)
serving_model = tf.keras.Model(input_segment, serving_outputs)
serving_model.save(saved_model_path, include_optimizer=False)
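
The snippet above uses ReduceMeanLayer without defining it in this post. A minimal sketch of such a layer, assuming it does nothing more than average the per-frame scores along the given axis, could look like this:

```python
import tensorflow as tf


class ReduceMeanLayer(tf.keras.layers.Layer):
  """Averages its input along one axis (axis 0 is the frame axis here)."""

  def __init__(self, axis=0, **kwargs):
    super().__init__(**kwargs)
    self.axis = axis

  def call(self, inputs):
    return tf.math.reduce_mean(inputs, axis=self.axis)


# Toy per-frame scores: 2 frames x 2 classes (values are made up).
frame_scores = tf.constant([[1.0, 3.0], [3.0, 5.0]])
clip_scores = ReduceMeanLayer(axis=0, name="classifier")(frame_scores)
print(clip_scores)
```

Averaging over frames turns one score vector per 0.96 second window into a single score vector for the whole clip.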

To try the reloaded model, you can call it the same way it was used earlier in the colab:

reloaded_model = tf.saved_model.load(saved_model_path)
reloaded_results = reloaded_model(testing_wav_data)
cat_or_dog = my_classes[tf.argmax(reloaded_results)]

This model can also be used with TensorFlow Serving through the ‘serving_default’ signature:

serving_results = reloaded_model.signatures['serving_default'](testing_wav_data)
cat_or_dog = my_classes[tf.argmax(serving_results['classifier'])]

In this post, you learned how to use the YAMNet model for transfer learning to recognize audio of dogs and cats from the ESC-50 dataset.

Check out the YAMNet model on tfhub.dev and the tutorial on tensorflow.org. You can apply this technique to your own dataset, or to other classes in the ESC-50 dataset.

We would love to know what you can build with this! Share your project with us on social media by using the hashtag #TFHub.

Acknowledgements

We’d like to thank a number of colleagues for their contribution to this work: Dan Ellis, Manoj Plakal and Eduardo Fonseca for an amazing YAMNet model and support with the colab and multiple reviews.

Mark Daoust and Elizabeth Kemp have greatly improved the presentation of the material in this post and the associated tutorial.

Introducing TensorFlow Videos for a Global Audience: Vietnamese

March 1, 2021

by TensorFlow Blog Tensorflow

Posted by TensorFlow Team

When the TensorFlow YouTube channel launched in 2018, we had a vision to inform and inspire developers around the world about what was possible with Machine Learning. With series like Coding TensorFlow showing how you can use it, and Made with TensorFlow showing inspirational stories about what people have done with TensorFlow and much more, the channel has grown greatly. But we learned an important lesson: it’s a global phenomenon, and to reach the world effectively, we should provide some of our content in multiple languages with native speakers presenting. Check out the popular Zero to Hero series in Vietnamese!

Introduction to Machine Learning with TensorFlow

It seems that whenever you browse the web or read books and newspapers, news about machine learning and artificial intelligence (AI) jumps out at you. A good deal of that information and advertising about AI is hype. So, from a developer’s point of view, we on the TensorFlow team decided to produce a four-part video series on what machine learning really is, based on Laurence Moroney’s well-known Google IO 2019 talk, Machine Learning: Zero to Hero.

In video 1, we introduce a new form of programming: machine learning. Instead of programming instructions for the computer in a language such as Java or C++, in machine learning you create a program that is trained on data, and the computer infers the logic from that data by itself. So what does machine learning really look like? We’ll work through a “Hello World” example of building a machine learning model, introducing ideas that we’ll later apply to a more interesting problem: computer vision.

In video 2, you’ll learn about computer vision based on machine learning. We’ll train a computer to see and recognize different objects.

In video 3, we’ll learn about convolutional neural networks and why they play an important role in computer vision applications. Convolutions are filters that process an image and extract its distinguishing features. You’ll learn how convolutional neural networks work by processing and extracting features from a set of real-world images.

In video 4, you’ll learn how to build an image classification model to play rock, paper, scissors. In part 1, we mentioned the game of rock, paper, scissors and how hard it is to program a computer to recognize images of a hand showing rock, paper, or scissors. Since then, however, we’ve learned a lot about machine learning, how to build neural networks that detect patterns in pixel data, and how to use convolutions to detect features in an image. In this part, we apply what we learned in the previous three parts to build a neural network that lets the computer play rock, paper, scissors.

We hope this video series helps you get started with machine learning. If you have any feedback, please leave it in the comments on the videos on YouTube. And don’t forget to subscribe to the TensorFlow YouTube channel to watch more machine learning videos!

Variational Inference with Joint Distributions in TensorFlow Probability

February 17, 2021

by TensorFlow Blog Tensorflow

Posted by Emily Fertig, Joshua V. Dillon, Wynn Vonnegut, Dave Moore, and the TensorFlow Probability team

In this post, we introduce new tools for variational inference with joint distributions in TensorFlow Probability, and show how to use them to estimate Bayesian credible intervals for weights in a regression model.

Overview

Variational Inference (VI) casts approximate Bayesian inference as an optimization problem and seeks a ‘surrogate’ posterior distribution that minimizes the KL divergence with the true posterior. Gradient-based VI is often faster than MCMC methods, composes naturally with optimization of model parameters, and provides a lower bound on model evidence that can be used directly for model comparison, convergence diagnosis, and composable inference.
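
As a toy illustration of "inference as optimization", the sketch below fits a univariate Normal surrogate to a known Normal target by gradient descent on the closed-form KL divergence. This is a deliberate simplification: in real VI the true posterior is unknown, so the objective is the evidence lower bound rather than the KL itself. The function names and hyperparameters are hypothetical.

```python
import math


def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
  """KL(q || p) for univariate Normals q=N(mu_q, sigma_q^2), p=N(mu_p, sigma_p^2)."""
  return (math.log(sigma_p / sigma_q)
          + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
          - 0.5)


def fit_surrogate(mu_p, sigma_p, steps=2000, lr=0.05):
  # Gradient descent on (mu, log sigma) of q, with gradients derived by hand.
  mu, log_sigma = 0.0, 0.0
  for _ in range(steps):
    sigma = math.exp(log_sigma)
    grad_mu = (mu - mu_p) / sigma_p ** 2
    grad_log_sigma = -1.0 + sigma ** 2 / sigma_p ** 2
    mu -= lr * grad_mu
    log_sigma -= lr * grad_log_sigma
  return mu, math.exp(log_sigma)


mu, sigma = fit_surrogate(mu_p=2.0, sigma_p=0.5)
print(mu, sigma, kl_normal(mu, sigma, 2.0, 0.5))
```

Optimizing log sigma instead of sigma keeps the scale positive without any explicit constraint, the same idea behind constraining scale parameters with an exponential transform.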

TensorFlow Probability (TFP) offers tools for fast, flexible, and scalable VI that fit naturally into the TFP stack. These tools enable the construction of surrogate posteriors with covariance structures induced by linear transformations or normalizing flows.

VI can be used to estimate Bayesian credible intervals for parameters of a regression model to estimate the effects of various treatments or observed features on an outcome of interest. Credible intervals bound the values of an unobserved parameter with a certain probability, according to the posterior distribution of the parameter conditioned on observed data and given an assumption on the parameter’s prior distribution.
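
Once draws from a posterior (or surrogate posterior) are available, an equal-tailed credible interval can be read off the sorted samples. The sketch below uses hypothetical draws from a Normal distribution in place of real posterior samples. The credible_interval helper is likewise an illustration, not a TFP function.

```python
import random


def credible_interval(samples, mass=0.9):
  """Equal-tailed interval containing roughly `mass` of the sampled values."""
  ordered = sorted(samples)
  tail = (1.0 - mass) / 2.0
  lo_index = round(tail * (len(ordered) - 1))
  hi_index = round((1.0 - tail) * (len(ordered) - 1))
  return ordered[lo_index], ordered[hi_index]


rng = random.Random(0)
# Hypothetical posterior draws for one weight; a real run would sample
# the fitted surrogate posterior instead.
draws = [rng.gauss(0.7, 0.1) for _ in range(10_000)]
low, high = credible_interval(draws, mass=0.9)
print(low, high)
```

The interval says that, under the sampled posterior, the parameter lies between low and high with about 90% probability.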

In this post, we demonstrate how to use VI to obtain credible intervals for parameters of a Bayesian linear regression model for radon levels measured in homes (using Gelman et al.’s (2007) Radon dataset; see similar examples in Stan). We demonstrate how TFP JointDistributions combine with bijectors to build and fit two types of expressive surrogate posteriors:

a standard Normal distribution transformed by a block matrix. The matrix may reflect independence among some components of the posterior and dependence among others, relaxing the assumption of a mean-field or full-covariance posterior.
a more complex, higher-capacity inverse autoregressive flow.

The surrogate posteriors are trained and compared with results from a mean-field surrogate posterior baseline. These plots show credible intervals for four model parameters obtained with the three VI surrogate posteriors, as you’ll learn about below, as well as Hamiltonian Monte Carlo (HMC) for comparison.

credible intervals for four model parameters obtained with the three VI surrogate posteriors

You can follow along and see all the details in this Google Colab.

Example: Bayesian hierarchical linear regression on Radon measurements

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

The EPA did a study of radon levels in 80,000 houses. Two important predictors are:

Floor on which the measurement was taken (radon higher in basements)
County uranium level (positive correlation with radon levels)

Predicting radon levels in houses grouped by county is a classic problem in Bayesian hierarchical modeling, introduced by Gelman and Hill (2006). We are interested in credible intervals for the effect of location (county) on the radon level of houses in Minnesota. In order to isolate this effect, the effects of floor and uranium level are also included in the model. Additionally, we will incorporate a contextual effect corresponding to the mean floor on which the measurement was taken, by county, so that if there is variation among counties of the floor on which the measurements were taken, this is not attributed to the county effect.

The regression model is specified as follows:

in which i indexes the observations and county_i is the county in which the ith observation was taken.
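
The equation itself is not reproduced in this copy of the post. Read off the JointDistribution code given later in this post, the model can be written as follows. This is a reconstruction from that code, with the symbols \sigma_c and \sigma_y introduced here as shorthand for county_effect_scale and log_radon_scale, and the second argument of each Normal being a standard deviation.

```latex
\begin{aligned}
\text{uranium\_weight} &\sim \mathcal{N}(0, 1) \\
\text{county\_floor\_weight} &\sim \mathcal{N}(0, 1) \\
\text{county\_effect}_j &\sim \mathcal{N}(0, \sigma_c),
  \quad j = 1, \ldots, \text{num\_counties} \\
\text{log\_radon}_i &\sim \mathcal{N}(\mu_i, \sigma_y) \\
\mu_i &= \text{log\_uranium}_i \cdot \text{uranium\_weight}
  + \text{floor\_of\_house}_i \cdot \text{floor\_weight} \\
 &\quad + \text{floor\_by\_county}_i \cdot \text{county\_floor\_weight}
  + \text{county\_effect}_{\text{county}_i} + \text{bias}
\end{aligned}
```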

We use a county-level random effect to capture geographical variation. The parameters uranium_weight and county_floor_weight are modeled probabilistically, and floor_weight and the constant bias are deterministic. These modeling choices are largely arbitrary, and are made for the purpose of demonstrating VI on a probabilistic model of reasonable complexity. For a more thorough discussion of multilevel modeling with fixed and random effects in TFP, using the radon dataset, see Multilevel Modeling Primer and Fitting Generalized Linear Mixed-effects Models Using Variational Inference.

The full code for this example is available on Github.

Variables are defined for the deterministic parameters and the Normal distribution scale parameters, with the latter constrained to be positive.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

floor_weight = tf.Variable(0.)
bias = tf.Variable(0.)
log_radon_scale = tfp.util.TransformedVariable(1., tfb.Exp())
county_effect_scale = tfp.util.TransformedVariable(1., tfb.Exp())
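The positivity constraint works by optimizing an unconstrained real number and passing it through exp. The following pure-Python sketch illustrates only that idea; `PositiveParameter` is a hypothetical helper written for this example and does not reproduce the internals of `tfp.util.TransformedVariable`.

```python
import math


class PositiveParameter:
    """Toy stand-in for a variable constrained to be positive via exp.

    Hypothetical helper for illustration; it is not part of TFP.
    """

    def __init__(self, initial_value):
        if initial_value <= 0.0:
            raise ValueError("initial_value must be positive")
        # Store the unconstrained value; an optimizer would update this.
        self.unconstrained = math.log(initial_value)

    def value(self):
        # exp maps any real number to a strictly positive number.
        return math.exp(self.unconstrained)

    def apply_update(self, delta):
        # Updates happen in unconstrained space, so no clipping is needed.
        self.unconstrained += delta


scale = PositiveParameter(1.0)
scale.apply_update(-5.0)  # a large step in the negative direction
print(scale.value())      # still positive: exp(-5), about 0.0067
```

Because the optimizer only ever sees the unconstrained value, no gradient step can push the scale to zero or below.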

We specify the probabilistic graphical model for the regression as a TFP JointDistribution.

@tfd.JointDistributionCoroutineAutoBatched
def model():
 uranium_weight = yield tfd.Normal(0., scale=1., name='uranium_weight')
 county_floor_weight = yield tfd.Normal(
     0., scale=1., name='county_floor_weight')
 county_effect = yield tfd.Sample(
     tfd.Normal(0., scale=county_effect_scale),
     sample_shape=[num_counties], name='county_effect')
 yield tfd.Normal(
     loc=(log_uranium * uranium_weight
          + floor_of_house * floor_weight
          + floor_by_county * county_floor_weight
          + tf.gather(county_effect, county, axis=-1)
          + bias),
     scale=log_radon_scale[..., tf.newaxis],
     name='log_radon')

We pin log_radon to the observed radon data to model the unnormalized posterior.

target_model = model.experimental_pin(log_radon=log_radon)

Quick summary of Bayesian Variational Inference

Suppose we have the following generative process, where 𝜃 represents random parameters (uranium_weight, county_floor_weight, and county_effect in the regression model) and ω represents deterministic parameters (floor_weight, log_radon_scale, county_effect_scale, and bias). The x_𝑖 are features (log_uranium, floor_of_house, and floor_by_county) and the 𝑦_𝑖 are target values (log_radon) for 𝑖 = 1…n observed data points:

VI is then characterized by:

(Technically we’re assuming q is absolutely continuous with respect to r. See also, Jensen’s inequality.)

Since the bound holds for all q, it is obviously tightest for:

Regarding terminology, we call

q* the “surrogate posterior,” and,
Q the “surrogate family.”

ω* represents the maximum-likelihood values of the deterministic parameters on the VI loss. See this survey for more information on variational inference.
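For reference, the generative process, the bound, and the resulting optimization problem described above can be written in this notation as follows. This is the standard evidence lower bound (ELBO) derivation, reconstructed under the assumption that r denotes the prior on 𝜃 and p the likelihood; it is not copied from the original figures.

```latex
\begin{aligned}
&\theta \sim r(\Theta), \qquad
  y_i \sim p(Y_i \mid x_i, \theta, \omega), \quad i = 1, \dots, n \\[6pt]
-\log p(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n, \omega)
&= -\log \int r(\theta) \prod_{i=1}^n p(y_i \mid x_i, \theta, \omega)\, d\theta \\
&= -\log \int q(\theta)\, \frac{r(\theta)}{q(\theta)}
   \prod_{i=1}^n p(y_i \mid x_i, \theta, \omega)\, d\theta \\
&\le -\int q(\theta) \log\!\left[ \frac{r(\theta)}{q(\theta)}
   \prod_{i=1}^n p(y_i \mid x_i, \theta, \omega) \right] d\theta \\
&= -\sum_{i=1}^n \mathbb{E}_{q(\Theta)}\!\left[ \log p(y_i \mid x_i, \Theta, \omega) \right]
   + \mathrm{KL}\!\left[ q(\Theta) \,\|\, r(\Theta) \right] \\[6pt]
q^*, \omega^* &= \operatorname*{arg\,min}_{q \in \mathcal{Q},\ \omega \in \mathbb{R}^d}
   \left\{ -\sum_{i=1}^n \mathbb{E}_{q(\Theta)}\!\left[ \log p(y_i \mid x_i, \Theta, \omega) \right]
   + \mathrm{KL}\!\left[ q(\Theta) \,\|\, r(\Theta) \right] \right\}
\end{aligned}
```

The inequality is the Jensen step: since log is concave, the negative log of an expectation under q is at most the expected negative log.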

Expressive surrogate posteriors

Next we estimate the posterior distributions of the parameters using VI with two different types of surrogate posteriors:

A constrained multivariate Normal distribution, with covariance structure induced by a blockwise matrix transformation.
A multivariate standard Normal distribution transformed by an Inverse Autoregressive Flow, which is then split and restructured to match the support of the posterior.

Multivariate Normal surrogate posterior

To build this surrogate posterior, a trainable linear operator is used to induce correlation among the components of the posterior.

We begin by constructing a base distribution with vector-valued standard Normal components, with sizes equal to the sizes of the corresponding prior components. The components are vector-valued so they can be transformed by the linear operator.

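# Keep the structured and flattened event shapes; both are reused below.
event_shape = target_model.event_shape_tensor()
flat_event_shape = tf.nest.flatten(event_shape)
flat_event_size = tf.nest.map_structure(tf.reduce_prod, flat_event_shape)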
base_standard_dist = tfd.JointDistributionSequential(
     [tfd.Sample(tfd.Normal(loc=0., scale=1.), s)
      for s in flat_event_size])

To this distribution, we apply a trainable blockwise lower-triangular linear operator to induce correlation in the posterior. Within the linear operator, a trainable full-matrix block represents full covariance between two components of the posterior, while a block of zeros (or None) expresses independence. Blocks on the diagonal are either lower-triangular or diagonal matrices, so that the entire block structure represents a lower-triangular matrix.

Applying this bijector to the base distribution results in a multivariate Normal distribution with mean 0 and (Cholesky-factored) covariance equal to the lower-triangular block matrix.

operators = (
  (tf.linalg.LinearOperatorDiag,),  # Variance of uranium weight (scalar).
  (tf.linalg.LinearOperatorFullMatrix,  # Covariance between uranium and floor-by-county weights.
   tf.linalg.LinearOperatorDiag),  # Variance of floor-by-county weight (scalar).
  (None,  # Independence between uranium weight and county effects.
   None,  # Independence between floor-by-county and county effects.
   tf.linalg.LinearOperatorDiag)  # Independence among the 85 county effects.
)
block_tril_linop = (
  tfp.experimental.vi.util.build_trainable_linear_operator_block(
      operators, flat_event_size))
scale_bijector = tfb.ScaleMatvecLinearOperatorBlock(block_tril_linop)
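To see why the block structure controls which components are correlated, recall that if z is standard Normal and x = L z, then the covariance of x is L Lᵀ. The pure-Python sketch below uses a small hypothetical 4x4 matrix (sizes and values are made up for illustration, and no TFP code is involved) with the same pattern as the operators above: a full block coupling the first two components and zero blocks isolating the rest.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


# Hypothetical scale matrix with three components of sizes 1, 1 and 2.
# Rows and columns are ordered as [weight_a, weight_b, effect_0, effect_1].
scale = [
    [1.0, 0.0, 0.0, 0.0],  # diagonal block for weight_a
    [0.5, 2.0, 0.0, 0.0],  # full block coupling weight_b to weight_a
    [0.0, 0.0, 3.0, 0.0],  # zero blocks: effects independent of both weights
    [0.0, 0.0, 0.0, 4.0],  # diagonal block: effects independent of each other
]

# If z ~ Normal(0, I) and x = scale @ z, then Cov(x) = scale @ scale^T.
covariance = matmul(scale, transpose(scale))
for row in covariance:
    print(row)
```

The only nonzero off-diagonal covariance is between the first two components, which is exactly where the full block sits.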

Finally, we allow the mean to take nonzero values by applying trainable Shift bijectors.

loc_bijector = tfb.JointMap(
   tf.nest.map_structure(
       lambda s: tfb.Shift(
           tf.Variable(tf.random.uniform(
               (s,), minval=-2., maxval=2., dtype=tf.float32))),
       flat_event_size))

The resulting multivariate Normal distribution, obtained by transforming the standard Normal distribution with the scale and location bijectors, must be reshaped and restructured to match the prior, and finally constrained to the support of the prior.

reshape_bijector = tfb.JointMap(
   tf.nest.map_structure(tfb.Reshape, flat_event_shape))
unflatten_bijector = tfb.Restructure(
       tf.nest.pack_sequence_as(
           event_shape, range(len(flat_event_shape))))
event_space_bijector = target_model.experimental_default_event_space_bijector()

Now, put it all together — chain the trainable bijectors together and apply them to the base standard Normal distribution to construct the surrogate posterior.

surrogate_posterior = tfd.TransformedDistribution(
   base_standard_dist,
   bijector=tfb.Chain([  # Chained bijectors are applied in reverse order.
        event_space_bijector,  # Constrain to the support of the prior.
        unflatten_bijector,  # Pack components into the event_shape structure.
        reshape_bijector,  # Reshape the vector-valued components.
        loc_bijector,  # Allow for nonzero mean.
        scale_bijector  # Apply the block matrix transformation.
      ]))

Train the multivariate Normal surrogate posterior.

optimizer = tf.optimizers.Adam(learning_rate=1e-2)
@tf.function(jit_compile=True)
def run_vi():
 return tfp.vi.fit_surrogate_posterior(
   target_model.unnormalized_log_prob,
   surrogate_posterior,
   optimizer=optimizer,
   num_steps=10**4,
   sample_size=16)
mvn_loss = run_vi()
mvn_samples = surrogate_posterior.sample(1000)

Since the trained surrogate posterior is a TFP distribution, we can take samples from it and process them to produce posterior credible intervals for the parameters.
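Turning samples into credible intervals amounts to taking percentiles of the draws for each parameter. Here is a minimal pure-Python sketch of that step; the `percentile` and `credible_interval` helpers are hypothetical names for this example, and the Gaussian draws merely stand in for samples of one scalar parameter from a trained surrogate posterior.

```python
import random


def percentile(ordered, fraction):
    """Linearly interpolated percentile of an already sorted list."""
    position = fraction * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    weight = position - low
    return ordered[low] * (1.0 - weight) + ordered[high] * weight


def credible_interval(samples, mass):
    """Central interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    return percentile(ordered, tail), percentile(ordered, 1.0 - tail)


random.seed(0)
# Stand-in for 1000 posterior draws of a single scalar parameter.
draws = [random.gauss(0.0, 1.0) for _ in range(1000)]
print("50% interval:", credible_interval(draws, 0.50))
print("95% interval:", credible_interval(draws, 0.95))
```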

The box-and-whiskers plots below show 50% and 95% credible intervals for the county effect of the two largest counties and the regression weights on soil uranium measurements and mean floor by county. The posterior credible intervals for county effects indicate that location in St. Louis county is associated with lower radon levels, after accounting for other variables, and that the effect of location in Hennepin county is near neutral.

Posterior credible intervals on the regression weights show that higher levels of soil uranium are associated with higher radon levels, and counties where measurements were taken on higher floors (likely because the house didn’t have a basement) tend to have higher levels of radon, which could relate to soil properties and their effect on the type of structures built.

The (deterministic) coefficient of floor is -0.7, indicating that lower floors have higher radon levels, as expected.

Inverse autoregressive flow surrogate posterior

Inverse autoregressive flows (IAFs) are normalizing flows that use neural networks to capture complex, nonlinear dependencies among components of the distribution. Next we build an IAF surrogate posterior to see whether this higher-capacity, more flexible model outperforms the constrained multivariate Normal.

We begin by building a standard Normal distribution with vector event shape, of length equal to the total number of degrees of freedom in the posterior.

base_distribution = tfd.Sample(
   tfd.Normal(loc=0., scale=1.),
   sample_shape=[tf.reduce_sum(flat_event_size)])

A trainable IAF transforms the Normal distribution.

num_iafs = 2
iaf_bijectors = [
   tfb.Invert(tfb.MaskedAutoregressiveFlow(
       shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
           params=2,
           hidden_units=[256, 256],
           activation='relu')))
   for _ in range(num_iafs)
]
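The mechanics of a single IAF step can be sketched without any neural network. In the sampling direction, the shift and log-scale for element i depend only on earlier elements of the input noise, so every output can be computed from the noise alone; inverting the transform requires recovering the noise one element at a time. The sketch below follows the convention of inverting a masked autoregressive flow, as in the code above, but the conditioner is a hand-written placeholder (an assumption of this example), not `tfb.AutoregressiveNetwork`.

```python
import math


def shift_and_log_scale(prefix):
    """Toy conditioner: parameters for element i depend only on elements < i.

    A real IAF would use a masked neural network here; these formulas are
    arbitrary placeholders chosen for illustration.
    """
    total = sum(prefix)
    shift = 0.5 * total
    log_scale = 0.1 * math.tanh(total)
    return shift, log_scale


def iaf_forward(z):
    """Sampling direction: each output depends only on the input noise z."""
    x = []
    log_det = 0.0
    for i, z_i in enumerate(z):
        shift, log_scale = shift_and_log_scale(z[:i])
        x.append((z_i - shift) * math.exp(-log_scale))
        # The Jacobian is triangular, so its log-determinant is a sum.
        log_det -= log_scale
    return x, log_det


def iaf_inverse(x):
    """Inverse direction: z is recovered sequentially, one element at a time."""
    z = []
    for x_i in x:
        # The conditioner needs the noise elements recovered so far.
        shift, log_scale = shift_and_log_scale(z)
        z.append(x_i * math.exp(log_scale) + shift)
    return z


noise = [0.3, -1.2, 0.7]
sample, log_det = iaf_forward(noise)
recovered = iaf_inverse(sample)
print(sample, log_det)
print(recovered)  # matches `noise` up to floating-point error
```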

The IAF bijectors are chained together with other bijectors to build a surrogate posterior with the same event shape and support as the prior.

iaf_surrogate_posterior = tfd.TransformedDistribution(
   base_distribution,
   bijector=tfb.Chain([
        event_space_bijector,  # Constrain to the support of the prior.
        unflatten_bijector,  # Pack components into the event_shape structure.
        reshape_bijector,  # Reshape the vector-valued components.
        tfb.Split(flat_event_size),  # Split into parts, same size as prior.
   ] + iaf_bijectors))  # Apply a flow model.

Like the multivariate Normal surrogate posterior, the IAF surrogate posterior is trained using tfp.vi.fit_surrogate_posterior. The credible intervals for the IAF surrogate posterior appear similar to those of the constrained multivariate Normal.

Mean-field surrogate posterior

VI surrogate posteriors are often assumed to be mean-field (independent) Normal distributions, with trainable means and variances, that are constrained to the support of the prior with a bijective transformation. We define a mean-field surrogate posterior in addition to the two more expressive surrogate posteriors, using the same general formula as the multivariate Normal surrogate posterior. Instead of a blockwise lower triangular linear operator, we use a blockwise diagonal linear operator, in which each block is diagonal:

operators = (
  tf.linalg.LinearOperatorDiag,
  tf.linalg.LinearOperatorDiag,
  tf.linalg.LinearOperatorDiag,
)
block_diag_linop = (
   tfp.experimental.vi.util.build_trainable_linear_operator_block(
       operators, flat_event_size))

In this case, the mean-field surrogate posterior gives similar results to the more expressive surrogate posteriors, indicating that this simpler model may be adequate for the inference task. As a “ground truth”, we also take samples with Hamiltonian Monte Carlo (see the Colab for the full example). All three surrogate posteriors produced credible intervals that are visually similar to the HMC samples, though sometimes under-dispersed due to the effect of the ELBO loss, as is common in VI.

Conclusion

In this post, we built VI surrogate posteriors using joint distributions and multipart bijectors, and fit them to estimate credible intervals for weights in a regression model on the radon dataset. For this simple model, more expressive surrogate posteriors appeared to perform similarly to a mean-field surrogate posterior. The tools we demonstrated, however, can be used to build a wide range of flexible surrogate posteriors suitable for more complex models.

Check out our code, documentation, and further examples on the TFP home page.

How OpenX Trains and Serves for a Million Queries per Second in under 15 Milliseconds

February 15, 2021

by TensorFlow Blog Tensorflow

A guest post by Larry Price, OpenX

Edited by Robert Crowe, Anusha Ramesh – TensorFlow

Overview

Adtech is an industry built on latency at scale. At OpenX this means that during peak traffic periods our exchange processes more than one million requests for ads every second, most of which require a response in under 300 milliseconds. Under such high volume and strict time budgets, it’s crucial to prioritize traffic to ensure we’re simultaneously helping publishers get top dollar for their inventory as well as ensuring buyers hit their campaign goals.

To accomplish this, we’ve leveraged several products in the TensorFlow ecosystem and Google Cloud, including TensorFlow Extended (TFX), TF Serving, and Kubeflow Pipelines, to build a service that prioritizes traffic to our buyers (demand-side platforms, or DSPs in adtech lingo) and, more specifically, to the brands and agencies within those DSPs.

About OpenX

OpenX operates the world’s largest independent advertising exchange. At a basic level, the exchange is a marketplace connecting tens of thousands of top brands to consumers across the most-visited websites and mobile apps.

The fundamental means of transacting is the auction, where buyers representing brands bid on publishers’ inventory, which consists of ad impressions on websites and mobile apps. The auctions themselves are fairly straightforward, but two facts make this system incredibly complicated:

Scale: At peak traffic our systems process more than one million requests for ads every second. A typical day sees more than 1.5 trillion bid transactions, resulting in petabytes of raw data.
Latency: Preserving user experience on both the web and mobile apps is crucial to publishers, so most of the requests we process have strict time limits of 300 milliseconds or less, most of which is spent asking for and receiving the buyers’ bids. This means that any overhead introduced by machine learning models at auction time must be limited to at most about 15 milliseconds, otherwise we risk not giving buyers enough time to submit their bids.

This need for low latency coupled with the high throughput requirement is fairly atypical for machine learning systems. Before we get to the details of how we built a machine learning infrastructure capable of dealing with both requirements, we’ll dig a little deeper into how we got here and what problem we’re trying to solve.

Cloud Transformation: A rare opportunity

In 2019 OpenX undertook the ambitious task of moving off of on-premise computing resources to Google Cloud Platform (GCP). We completed the process over a span of seven months. As a company, we were empowered to utilize managed services and modify our stack as we transitioned, so it wasn’t just a simple “lift-and-shift”. We really took this to heart on the Data Science team.

Prior to the move to GCP, our legacy machine learning infrastructure followed a pattern where models trained by scientists had to be re-implemented by engineers in the components that needed to execute the models. This scenario satisfies the scale and latency requirements but comes with a whole host of other issues:

It takes a long time to get models to production because the scientist’s work (typically in Python) now has to be reproduced by an engineer in the native language of the component that has to call it.
The same is true for changes to model architecture, or even the way data transformations are performed.
It’s essentially a recipe for training-serving skew.
QA was challenging.

For these and several other reasons we decided to start from scratch. At the same time, we were working on a new problem and decided to tie the two efforts together and develop a new framework as part of the new project.

Our problem

The OpenX marketplace is not completely unlike an equities market or stock exchange. And much like high volume financial markets, to ensure the buyers fulfill their campaign goals and simultaneously help publishers monetize appropriately on their inventory, there’s a need to prioritize traffic. Fundamentally, this means we need a model that can accurately value and hence rank every single request that hits the exchange.

Why TensorFlow

As we looked for a solution for our next-generation platform we had a couple of goals in mind. We were looking primarily to drastically reduce the time and effort to put a model into production, and as part of getting there try to use managed services wherever possible. TensorFlow had already been in use at OpenX for a while prior to our migration to GCP, but our legacy infrastructure involved a number of custom scripts for data transformation and pipelining. At the same time as we were researching our options, both TensorFlow Extended (TFX) and Kubeflow Pipelines (KFP) were reaching a level of maturity that made them interesting for our purposes. It was a no-brainer to adopt these technologies into our stack.

How we solved it

Training Terabytes of Data Every Day

Our pipeline looks something like this.

It’s useful to spend some time breaking down the topology of the pipeline:

Raw Data – Our data consists of transaction logs that are streamed directly from StackDriver into a BigQuery sink as they arrive. To help avoid bias in our model we train on a fraction of the total data that is held out from our prioritization system, resulting in roughly 50TB of new data daily. This was a simple design choice as it was very straightforward to implement, and the big benefit is that we can use BigQuery on the data directly without an additional ETL.
BigQueryExampleGen – The first place we leverage BigQuery is using builtin functions to preprocess the data. By embedding our own specific processes into the query calls made by the ExampleGen component, we were able to avoid building out a separate ETL that would exist outside the scope of a TFX pipeline. This ultimately proved to be a good way to get the model in production more quickly. This preprocessed data is then split into training and test sets and converted to tf.Examples via the ExampleGen component.
Transform – This component performs the feature engineering and transformations needed to handle strings, normalize values, set up embeddings, etc. The major benefit here is that the resulting transformation is ultimately prepended to the computational graph, so that the exact same code is used for training and serving.
Trainer – The Trainer component does just that. We leverage parallel training on AI Platform to speed things up.
Evaluator – The Evaluator compares the existing production model to the model received from the Trainer and blesses the “better” one for use in production. The decisioning criteria are based on custom metrics aligned with business requirements (as opposed to, e.g., precision and recall). The extensibility of the Evaluator component made it easy to implement these custom metrics.
Pusher – The Pusher’s primary function is to send the blessed model to our TFServing deployment for production. However, we added functionality to use the custom metrics produced in the Evaluator to determine decisioning criteria to be used in serving, and attach that to the computational graph. The level of abstraction available in TFX components made it easy to make this custom modification. Overall, the modification allows the pipeline to operate without a human in the loop so that we are able to make model updates frequently, while continuing to deliver consistent performance on metrics that are important to our business.
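The blessing step in the Evaluator boils down to a comparison between the production model and the candidate on a chosen metric. A deliberately simplified sketch of that decision follows; it does not use the TFX API, and the single higher-is-better metric and the `min_improvement` margin are assumptions made for illustration rather than details from our pipeline.

```python
def bless(production_metric, candidate_metric, min_improvement=0.0):
    """Pick which model should serve, given one higher-is-better metric.

    `min_improvement` is a hypothetical safety margin, not a TFX parameter.
    """
    if candidate_metric > production_metric + min_improvement:
        return "candidate"
    return "production"


print(bless(0.80, 0.85))                          # candidate
print(bless(0.80, 0.80))                          # production
print(bless(0.80, 0.805, min_improvement=0.01))   # production
```

Keeping the production model on ties or marginal gains is one way to let a pipeline run without a human in the loop while avoiding needless model churn.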

Overall, out-of-the-box TFX components provided most of the functionality we require. The biggest need we had to address is that our marketplace changes constantly, which requires frequent model updates. As mentioned previously, the design of TFX made those augmentations straightforward to implement.

However, this really only solves the model training part of our problem. Serving up a million queries per second, each in under 15 milliseconds, is a major challenge. For that we turned to TensorFlow Serving.

Serving Over a Million Queries Per Second (QPS)

TensorFlow Serving enabled us to quickly take our TensorFlow models and serve them in production in a performant and scalable way. Using TensorFlow Serving provided us with a number of benefits. First, because it natively supports Google Cloud Storage as a model warehouse, we can automatically update our models used in serving simply by uploading to a GCS bucket. This allows us to quickly refresh our models with the newest data and have them instantly served in production. Next, TensorFlow Serving supports a batching mode that drastically increases throughput by queuing up several requests and processing them in a single graph run at the same time. This was an essential feature that massively helped us achieve our throughput goals just by setting a single option. Finally, TensorFlow Serving exposes metrics out of the box that allow us to monitor the throughput and latency of our requests and observe any scaling bottlenecks and inefficiencies.
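The trade-off behind batching is easiest to see in a toy model: requests are queued and released together either when the batch is full or when waiting any longer would exceed a timeout. The sketch below is a simplified illustration of that idea in pure Python, not TensorFlow Serving’s scheduler; `form_batches` and its parameters are hypothetical names chosen for this example.

```python
def form_batches(arrival_times_ms, max_batch_size, batch_timeout_ms):
    """Group request arrival times into batches.

    A batch is closed when it is full, or when the next request arrives
    more than `batch_timeout_ms` after the first request in the batch.
    """
    batches = []
    current = []
    for arrival in sorted(arrival_times_ms):
        if current and (len(current) >= max_batch_size
                        or arrival - current[0] > batch_timeout_ms):
            batches.append(current)
            current = []
        current.append(arrival)
    if current:
        batches.append(current)
    return batches


arrivals = [0.0, 0.2, 0.4, 0.5, 3.0, 3.1, 9.0]
print(form_batches(arrivals, max_batch_size=3, batch_timeout_ms=1.0))
# [[0.0, 0.2, 0.4], [0.5], [3.0, 3.1], [9.0]]
```

A larger batch size raises throughput per graph run, while the timeout bounds how much latency queuing can add, which matters under a 15 millisecond budget.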

All of these out-of-the-box features in TensorFlow Serving were a massive win for us and helped us achieve our goals, but scaling it to millions of requests a second was not without challenges. By using large virtual machines with many CPUs we were able to hit our target of 15-millisecond predictions, but this did not scale very cost-effectively and we knew we could do better. Luckily, TensorFlow Serving has several knobs and parameters that we used to tune our production VMs for better efficiency and scalability. By setting things like the number of batch threads, inter- and intra-op parallelism, and batch timeout, we were able to efficiently autoscale on custom-sized VMs while still maintaining our throughput and latency goals.

The end result was a TensorFlow Serving deployment running on Google Kubernetes Engine serving 2.5 million prediction requests per second in under 15 milliseconds each. This deployment spans over 25 Kubernetes clusters across 10 different GCP regions and is able to scale up seamlessly to respond to spikes in traffic and to scale down during quiet periods to save costs. With around 500 TensorFlow Serving instances running around the world at peak times, each 8-CPU deployment is able to handle 5000 requests per second.
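These figures are consistent with each other, as a quick back-of-the-envelope calculation shows. The sketch below only restates the numbers quoted above; the helper name is hypothetical, and the calculation ignores any headroom a real deployment would keep.

```python
import math


def instances_needed(peak_qps, qps_per_instance):
    """Smallest number of identical instances that covers the peak load."""
    return math.ceil(peak_qps / qps_per_instance)


peak_qps = 2_500_000       # prediction requests per second, from the text
qps_per_instance = 5_000   # per 8-CPU deployment, from the text
cpus_per_instance = 8

instances = instances_needed(peak_qps, qps_per_instance)
print(instances)                      # 500
print(instances * cpus_per_instance)  # 4000 CPUs in total at peak
```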

Building on Success

In the few months since implementing this we’ve been able to make dozens of improvements to the model – everything from changing the architecture of the original model, to changing the way certain features are processed – without support from any other engineering team. Changes at this pace were all but impossible with our legacy architecture. Moreover, each of these improvements brings new value to our customers – the buyers and sellers in our marketplace – more quickly than we’ve been able to in the past.

Since our initial implementation of this reference architecture, we’ve used it as a template for both new projects and the migration of existing models. It’s quite remarkable how many of the existing TFX components that we have in place carry over to new projects, and even more so how drastically we’ve reduced the time it takes to get a model in production. As a result, data scientists are able to spend more of their time optimizing the parameters and architectures of the models they produce, understanding their impact on the business, and ultimately delivering more value to our customers.

Acknowledgements

None of this would have been possible without the hard work of Michal Brys, Andy Gooden, Junbo Park, and Paul Selden, along with the rest of the OpenX Data Science and Engineering Teams as well as the support of Paul Ryan. We’re also grateful for the support of strategic cloud engineers Will Beebe and Leonid Kuligin, as well as Dillon Do, Iman Kafarah, and Kyle Winn from the GCP account management team. Many thanks to the TensorFlow (TFX, TF Serving), and Kubeflow Teams, particularly Robert Crowe and Anusha Ramesh for helping to bring this case study to life.

Accelerated inference on Arm microcontrollers with TensorFlow Lite for Microcontrollers and CMSIS-NN

February 10, 2021

by TensorFlow Blog Tensorflow

A guest post by Fredrik Knutsson of Arm

The MCU universe

Microcontrollers (MCUs) are the tiny computers that power our technological environment. There are over 30 billion of them manufactured every year, embedded in everything from household appliances to fitness trackers. If you’re in a house right now, there are dozens of microcontrollers all around you. If you drive a car, there are dozens riding with you on every drive. Using TensorFlow Lite for Microcontrollers (TFLM), developers can deploy TensorFlow models to many of these devices, enabling entirely new forms of on-device intelligence.

While ubiquitous, microcontrollers are designed to be inexpensive and energy efficient, which means they have small amounts of memory and limited processing power. A typical microcontroller might have a few hundred kilobytes of RAM, and a 32-bit processor running at less than 100 MHz. With advances in machine learning enabled by TFLM, it has become possible to run neural networks on these devices.

With minimal computational resources, it is important that microcontroller programs are optimized to run as efficiently as possible. This means making the most of the features of their microprocessor hardware, which requires carefully tuned application code.

Many of the microcontrollers used in popular products are built around Arm’s Cortex-M based processors, which are the industry leader in 32-bit microcontrollers, with more than 47 billion shipped. Arm’s open source CMSIS-NN library provides optimized implementations of common neural network functions that maximize performance on Cortex-M processors. This includes making use of DSP and M-Profile Vector Extension (MVE) instructions for hardware acceleration of operations such as matrix multiplication.

Benchmarks for key use cases

Arm’s engineers have worked closely with the TensorFlow team to develop optimized versions of the TensorFlow Lite kernels that use CMSIS-NN to deliver blazing fast performance on Arm Cortex-M cores. Developers using TensorFlow Lite can use these optimized kernels with no additional work, just by using the latest version of the library. Arm has made these optimizations in open source, and they are free and easy for developers to use today!

The following benchmarks show the performance uplift when using CMSIS-NN optimized kernels versus reference kernels for several key use cases featured in the TFLM example applications. The tests have been performed on an Arm Cortex-M4 based FPGA platform:

Table showing performance uplift when using CMSIS-NN kernels

The Arm Cortex-M4 processor supports DSP extensions, which enable the processor to execute DSP-like instructions for faster inference. To improve inference performance even further, the new Arm Cortex-M55 processor supports MVE, also known as Helium technology.

Improving performance with CMSIS-NN

So far, the following optimized CMSIS-NN kernels have been integrated with TFLM:

Table showing CMSIS-NN kernels integrated with TFLM

There will be regular updates to the CMSIS-NN library to expand the support of optimized kernels, where the key driver for improving support is that it should give a significant performance increase for a given use case. For discussion regarding kernel optimizations, a good starting point is to raise a ticket on the TensorFlow or CMSIS Github repository describing your use case.

Most of the optimizations are implemented specifically for 8-bit quantized (int8) operations, and this will be the focus of future improvements.
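For readers new to int8 quantization, the idea is that each real value is stored as an 8-bit integer plus a shared scale and zero point, with real_value = scale * (int8_value - zero_point). The pure-Python sketch below illustrates that mapping only; the function names and the example scale are assumptions for illustration, and it does not call TFLM or CMSIS-NN.

```python
import math


def quantize(value, scale, zero_point):
    """Affine-quantize a float to the int8 range [-128, 127]."""
    q = math.floor(value / scale + 0.5) + zero_point  # round half up
    return max(-128, min(127, q))                     # saturate to int8


def dequantize(q, scale, zero_point):
    """Map an int8 value back to the real number it represents."""
    return scale * (q - zero_point)


scale, zero_point = 0.5, 0  # hypothetical quantization parameters
q = quantize(3.2, scale, zero_point)
print(q, dequantize(q, scale, zero_point))  # 6 3.0
print(quantize(100.0, scale, zero_point))   # 127 (saturated)
```

Working on 8-bit integers is what allows optimized kernels to pack several values into one register and process them with DSP or MVE instructions.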

It’s easy to try the optimized kernels yourself by following the instructions that accompany the examples. For example, to build the person detection example for the SparkFun Edge with CMSIS-NN kernels, you can use the following command:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=sparkfun_edge OPTIMIZED_KERNEL_DIR=cmsis_nn person_detection_int8_bin

The latest version of the TensorFlow Lite Arduino library includes the CMSIS-NN optimizations, and includes all of the example applications, which are compatible with the Cortex-M4 based Arduino Nano 33 BLE Sense.

Next leap in neural processing

Looking ahead into 2021 we can expect a dramatic increase in neural processing from the introduction of devices including a microNPU (Neural Processing Unit) working alongside a microcontroller. These microNPUs are designed to accelerate ML inference within the constraints of embedded and IoT devices, with devices using the Arm Cortex-M55 MCU coupled with the new Ethos-U55 microNPU delivering up to a 480x performance increase compared to previous microcontrollers.

This unprecedented level of ML processing capability within smaller, power constrained devices will unlock a huge amount of innovation across a range of applications, from smart homes and cities to industrial, retail, and healthcare. The potential for innovation within each of these different areas is huge, with hundreds of sub segments and thousands of potential applications that will make a real difference to people’s lives.

Join us at TensorFlow Everywhere

February 9, 2021

by TensorFlow Blog Tensorflow

Posted by Biswajeet Mallik, Program Manager

Developer communities have played a strong part in TensorFlow’s success over the years. Our 70+ user groups, 170+ ML GDEs, and 12 special interest groups all play a critical role in education, advancing the art of machine learning, and supporting ML communities around the world.

To continue this momentum, we’re excited to announce a new event series: TensorFlow Everywhere, a series of global events led by TensorFlow and machine learning communities around the world.

Events planned by our community leads will be specific to each locale (with talks in local languages). These events are for everyone from machine learning and data science beginners to advanced developers; check with your event organizers for details on what will be presented. Be sure to join as members of the TensorFlow team may be dropping in virtually to say hi and answer questions.

And after the event, you can stay in touch with the organizing communities for future events if you enjoyed and benefited from these sessions.

Find an event and join the conversation on social media with #TFEverywhere2021.

Leveraging TensorFlow-TensorRT integration for Low latency Inference

January 28, 2021

by TensorFlow Blog Tensorflow

Posted by Jonathan Dekhtiar (NVIDIA), Bixia Zheng (Google), Shashank Verma (NVIDIA), Chetan Tekur (NVIDIA)

TensorFlow-TensorRT (TF-TRT) is an integration of TensorFlow and TensorRT that leverages inference optimization on NVIDIA GPUs within the TensorFlow ecosystem. It provides a simple API that delivers substantial performance gains on NVIDIA GPUs with minimal effort. The integration allows for leveraging of the optimizations that are possible in TensorRT while providing a fallback to native TensorFlow when it encounters segments of the model that are not supported by TensorRT.

In our previous blog on TF-TRT integration, we covered the workflow for TensorFlow 1.13 and earlier releases. This blog will introduce TensorRT integration in TensorFlow 2.x, and demonstrate a sample workflow with the latest API. Even if you are new to this integration, this blog contains all the information you need to get started. Using the TensorRT integration has been shown to improve performance by 2.4X compared to native TensorFlow inference on NVIDIA T4 GPUs.

TF-TRT Integration

When TF-TRT is enabled, in the first step, the trained model is parsed in order to partition the graph into TensorRT-supported subgraphs and unsupported subgraphs. Then each TensorRT-supported subgraph is wrapped in a single special TensorFlow operation (TRTEngineOp). In the second step, for each TRTEngineOp node, an optimized TensorRT engine is built. The TensorRT-unsupported subgraphs remain untouched and are handled by the TensorFlow runtime. This is illustrated in Figure 1.

TF-TRT allows for leveraging TensorFlow’s flexibility while also taking advantage of the optimizations that can be applied to the TensorRT supported subgraphs. Only portions of the graph are optimized and executed with TensorRT, and TensorFlow executes the remaining graph.

In the inference example shown in Figure 1, TensorFlow executes the Reshape Op and the Cast Op. TensorFlow then hands the execution of TRTEngineOp_0, the pre-built TensorRT engine, to the TensorRT runtime.

Figure 1: An example of graph partitioning and building TRT engine in TF-TRT
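To make the partitioning step more concrete, here is a minimal toy sketch in plain Python. It is not the TF-TRT implementation: it treats the model as a flat list of op names rather than a graph, and the SUPPORTED_OPS set and the partition_ops helper are hypothetical names chosen for illustration. The minimum_segment_size parameter mirrors the conversion parameter of the same name that appears in the script output later in this post.

```python
from typing import List, Set

# Hypothetical set of convertible op types, for illustration only.
# The real list of supported operators is in the TF-TRT user guide.
SUPPORTED_OPS: Set[str] = {"Conv2D", "BiasAdd", "Relu", "MaxPool", "MatMul"}


def partition_ops(ops: List[str], minimum_segment_size: int = 3) -> List[str]:
    """Collapse runs of supported ops into TRTEngineOp_<n> placeholders.

    Runs shorter than minimum_segment_size stay as native TensorFlow ops.
    """
    result: List[str] = []
    segment: List[str] = []
    engine_count = 0

    def flush() -> None:
        # Close the current run of supported ops, if any.
        nonlocal engine_count
        if len(segment) >= minimum_segment_size:
            result.append(f"TRTEngineOp_{engine_count}")
            engine_count += 1
        else:
            result.extend(segment)
        segment.clear()

    for op in ops:
        if op in SUPPORTED_OPS:
            segment.append(op)
        else:
            flush()
            result.append(op)
    flush()
    return result


graph = ["Reshape", "Cast", "Conv2D", "BiasAdd", "Relu", "MaxPool"]
print(partition_ops(graph))
```

With this toy op list, the Reshape and Cast ops stay with TensorFlow while the four remaining ops collapse into a single TRTEngineOp_0, which is the shape of the result described for Figure 1.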

Workflow

In this section, we will take a look at the typical TF-TRT workflow using an example.

Figure 2: Workflow diagram when performing inference in TensorFlow only, and in TensorFlow-TensorRT using a converted SavedModel

Figure 2 shows a standard inference workflow in native TensorFlow and contrasts it with the TF-TRT workflow. The SavedModel format contains all the information required to share or deploy a trained model. In native TensorFlow, the workflow typically involves loading the saved model and running inference using the TensorFlow runtime. In TF-TRT, there are a few additional steps involved, including applying TensorRT optimizations to the TensorRT-supported subgraphs of the model, and optionally pre-building the TensorRT engines.

First, we create an object to hold the conversion parameters, including a precision mode. The precision mode is used to indicate the minimum precision (for example FP32, FP16 or INT8) that TF-TRT can use to implement the TensorFlow operations. Then we create a converter object which takes the conversion parameters and input from a saved model. Note that in TensorFlow 2.x, TF-TRT only supports models saved in the TensorFlow SavedModel format.

Next, when we call the converter’s convert() method, TF-TRT converts the graph by replacing TensorRT-compatible portions of the graph with TRTEngineOps. For better performance at runtime, the converter’s build() method can be used to create the TensorRT execution engines ahead of time. The build() method requires the input data shapes to be known before the optimized TensorRT execution engines are built. If the input data shapes are not known, the TensorRT execution engines can be built at runtime, when the input data is available. Because the build process is GPU specific, the TensorRT execution engines should be built on the same type of GPU as the one on which inference will be executed. For example, an execution engine built for an NVIDIA A100 GPU will not work on an NVIDIA T4 GPU.

Finally, the TF-TRT converted model can be saved to disk by calling the save() method. The code corresponding to the workflow steps mentioned in this section is shown in the code block below:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Conversion parameters (FP16 shown here; trt.TrtPrecisionMode.FP32 is also valid)
conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)

# Converter method used to partition and optimize TensorRT compatible segments
converter.convert()

# Optionally, build TensorRT engines before deployment to save time at runtime
# Note that this is GPU specific, and as a rule of thumb, we recommend building at runtime
converter.build(input_fn=my_input_fn)

# Save the model to the disk 
converter.save(output_saved_model_dir)

As can be seen from the code example above, the build() method requires an input function corresponding to the shape of the input data. An example of an input function is shown below:

import numpy as np

# input_fn: a generator function that yields input data as a list or tuple,
# which will be used to execute the converted signature to generate TensorRT
# engines. Example:
def my_input_fn():
    # Let's assume a network with 2 input tensors. We generate 3 sets
    # of dummy input data, each holding one shape per input tensor:
    input_shapes = [[(1, 16), (2, 16)],  # 1st input list
                    [(2, 32), (4, 32)],  # 2nd input list
                    [(4, 32), (8, 32)]]  # 3rd input list
    for shapes in input_shapes:
        # yield a list of input tensors, one per network input
        yield [np.zeros(x).astype(np.float32) for x in shapes]

Support for INT8

Compared to FP32 and FP16, INT8 requires additional calibration data to determine the best quantization thresholds. When the precision mode in the conversion parameter is INT8, we need to provide an input function to the convert() method call. This input function is similar to the input function provided to the build() method. In addition, the calibration data generated by the input function passed to the convert() method should be statistically similar to the actual data seen during inference.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.INT8)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)

# requires some data for calibration
converter.convert(calibration_input_fn=my_input_fn)

# Optionally build TensorRT engines before deployment.
# Note that this is GPU specific, and as a rule of thumb we recommend building at runtime
converter.build(input_fn=my_input_fn)

converter.save(output_saved_model_dir)

Example: ResNet-50

The rest of this blog will show the workflow of taking a TensorFlow 2.x ResNet-50 model, training it, saving it, optimizing it with TF-TRT and finally deploying it for inference. We will also compare inference throughputs using TensorFlow native vs TF-TRT in three precision modes, FP32, FP16, and INT8.

Prerequisites for the example:

Ubuntu OS
Docker: https://docs.docker.com/get-docker/
The latest available TensorFlow 2.x Container:
- docker pull tensorflow/tensorflow:latest-gpu
NVIDIA Container Toolkit: https://github.com/NVIDIA/NVIDIA-docker. This allows you to use NVIDIA GPUs in a docker container.
NVIDIA Driver >= 450 installed on the host machine (at the time of writing; check the requirements of the latest TensorFlow container). You can check which version is currently installed on your machine by running: nvidia-smi | grep "Driver Version:"

Training ResNet-50 using the TensorFlow 2.x container:

First, the latest release of the ResNet-50 model needs to be downloaded from the TensorFlow GitHub repository:

# Shallow-clone the TensorFlow models repository into the current directory
$ git clone --depth 1 https://github.com/tensorflow/models.git .

# List the files and directories present in our working directory
$ ls -al

rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:05 2020  ./
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:30:45 2020  ../
rw-rw-r--  user user   337 B    Wed Sep 30 15:31:05 2020  AUTHORS
rw-rw-r--  user user  1015 B    Wed Sep 30 15:31:05 2020  CODEOWNERS
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:05 2020  community/
rw-rw-r--  user user   390 B    Wed Sep 30 15:31:05 2020  CONTRIBUTING.md
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:15 2020  .git/
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:05 2020  .github/
rw-rw-r--  user user     1 KiB  Wed Sep 30 15:31:05 2020  .gitignore
rw-rw-r--  user user     1 KiB  Wed Sep 30 15:31:05 2020  ISSUES.md
rw-rw-r--  user user    11 KiB  Wed Sep 30 15:31:05 2020  LICENSE
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:05 2020  official/
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:05 2020  orbit/
rw-rw-r--  user user     3 KiB  Wed Sep 30 15:31:05 2020  README.md
rwxrwxr-x  user user     4 KiB  Wed Sep 30 15:31:06 2020  research/

As noted in the earlier section, for this example we will be using the latest TensorFlow container available in the Docker repository. The user does not need any additional installation steps as TensorRT integration is already included in the container. The steps to pull the container and launch it are as follows:

$ docker pull tensorflow/tensorflow:latest-gpu

# Please ensure that the NVIDIA Container Toolkit is installed before running the following command.
# Replace </path/to/save/data/> with the host path that will hold the training data.
$ docker run -it --rm \
   --gpus="all" \
   --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
   --workdir /workspace/ \
   -v "$(pwd):/workspace/" \
   -v "</path/to/save/data/>:/data/" \
   tensorflow/tensorflow:latest-gpu

From inside the container, we can then verify that we have access to the relevant files and the NVIDIA GPU we would like to target:

# Let's first test that we can access the ResNet-50 code that we previously downloaded
$ ls -al
drwxrwxr-x  8 1000 1000  4096 Sep 30 22:31 .git
drwxrwxr-x  3 1000 1000  4096 Sep 30 22:31 .github
-rw-rw-r--  1 1000 1000  1104 Sep 30 22:31 .gitignore
-rw-rw-r--  1 1000 1000   337 Sep 30 22:31 AUTHORS
-rw-rw-r--  1 1000 1000  1015 Sep 30 22:31 CODEOWNERS
-rw-rw-r--  1 1000 1000   390 Sep 30 22:31 CONTRIBUTING.md
-rw-rw-r--  1 1000 1000  1115 Sep 30 22:31 ISSUES.md
-rw-rw-r--  1 1000 1000 11405 Sep 30 22:31 LICENSE
-rw-rw-r--  1 1000 1000  3668 Sep 30 22:31 README.md
drwxrwxr-x  2 1000 1000  4096 Sep 30 22:31 community
drwxrwxr-x 12 1000 1000  4096 Sep 30 22:31 official
drwxrwxr-x  3 1000 1000  4096 Sep 30 22:31 orbit
drwxrwxr-x 23 1000 1000  4096 Sep 30 22:31 research

# Let's verify we can see our GPUs:
$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.XX.XX    Driver Version: 450.XX.XX    CUDA Version: 11.X     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:1A:00.0 Off |                  Off |
| 38%   52C    P8     14W / 70W |      1MiB / 16127MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

We can now start training ResNet-50. To avoid spending hours training a deep learning model, this article will use the smaller MNIST dataset. However, the workflow will not change with a more state-of-the-art dataset like ImageNet.

# Install dependencies
$ pip install tensorflow_datasets tensorflow_model_optimization

# Download MNIST data and train
$ python -m "official.vision.image_classification.mnist_main" \
  --model_dir=./checkpoints \
  --data_dir=/data \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=1 \
  --download

# Let’s verify that we have the trained model saved on our machine.
$ ls -al checkpoints/

-rw-r--r-- 1 root root      87 Sep 30 22:34 checkpoint
-rw-r--r-- 1 root root 6574829 Sep 30 22:34 model.ckpt-0001.data-00000-of-00001
-rw-r--r-- 1 root root     819 Sep 30 22:34 model.ckpt-0001.index
[...]
-rw-r--r-- 1 root root 6574829 Sep 30 22:34 model.ckpt-0010.data-00000-of-00001
-rw-r--r-- 1 root root     819 Sep 30 22:34 model.ckpt-0010.index
drwxr-xr-x 4 root root    4096 Sep 30 22:34 saved_model
drwxr-xr-x 3 root root    4096 Sep 30 22:34 train
drwxr-xr-x 2 root root    4096 Sep 30 22:34 validation

Obtaining a SavedModel to be used by TF-TRT

After training, Google’s ResNet-50 code exports the model in the SavedModel format at the following path: checkpoints/saved_model/.

The following sample code can be used as a reference in order to export your own trained model as a TensorFlow SavedModel.

import numpy as np

import tensorflow as tf
from tensorflow import keras

def get_model():
    # Create a simple model.
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

We can verify that the SavedModel generated by Google’s ResNet-50 script is readable and correct:

$ ls -al checkpoints/saved_model

drwxr-xr-x 2 root root   4096 Sep 30 22:49 assets
-rw-r--r-- 1 root root 118217 Sep 30 22:49 saved_model.pb
drwxr-xr-x 2 root root   4096 Sep 30 22:49 variables

$ saved_model_cli show --dir checkpoints/saved_model/ --tag_set serve --signature_def serving_default

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

The given SavedModel SignatureDef contains the following input(s):
  inputs['input_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 28, 28, 1)
      name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['dense_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 10)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

Now that we have verified that our SavedModel has been properly saved, we can proceed with loading it with TF-TRT for inference.

Inference

ResNet-50 Inference using TF-TRT

In this section, we will go over the steps for deploying the saved ResNet-50 model on the NVIDIA GPU using TF-TRT. As previously described, we first convert a SavedModel into a TF-TRT model using the convert method and then load the model.

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the SavedModel
converter = trt.TrtGraphConverterV2(input_saved_model_dir=path)
converter.convert()

# Save the converted model
converter.save(converted_model_path)

# Load converted model and infer
model = tf.saved_model.load(converted_model_path)
func = model.signatures['serving_default']
output = func(input_tensor)

For simplicity, we will use a script to perform inference (tf2_inference.py). We will download the script from github.com and put it in the working directory “/workspace/” of the same docker container as before. After this, we can execute the script:

$ wget https://raw.githubusercontent.com/tensorflow/tensorrt/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py

$ ls
AUTHORS     CONTRIBUTING.md  LICENSE    checkpoints  data      orbit     tf2_inference.py
CODEOWNERS  ISSUES.md        README.md  community    official  research

$ python tf2_inference.py --use_tftrt_model --precision fp16

=========================================
Inference using: TF-TRT …
Batch size: 512
Precision:  fp16
=========================================

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TrtConversionParams(rewriter_config_template=None, max_workspace_size_bytes=8589934592, precision_mode='FP16', minimum_segment_size=3, is_dynamic_op=True, maximum_cached_engines=100, use_calibration=True, max_batch_size=512, allow_build_at_runtime=True)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Processing step: 0100 ...
Processing step: 0200 ...
[...]
Processing step: 9900 ...
Processing step: 10000 ...

Average step time: 2.1 msec
Average throughput: 244248 samples/sec

Similarly, we can run inference for INT8 and FP32:

$ python tf2_inference.py --use_tftrt_model --precision int8

$ python tf2_inference.py --use_tftrt_model --precision fp32

Inference using native TensorFlow (GPU) FP32

You can also run the unmodified SavedModel without any TF-TRT acceleration.

$ python tf2_inference.py --use_native_tensorflow

=========================================
Inference using: Native TensorFlow …
Batch size: 512
=========================================

Processing step: 0100 ...
Processing step: 0200 ...
[...]
Processing step: 9900 ...
Processing step: 10000 ...

Average step time: 4.1 msec
Average throughput: 126328 samples/sec

This run was executed with an NVIDIA T4 GPU. The same workflow will work on any NVIDIA GPU.

Comparing Native TensorFlow 2.x performance vs TF-TRT for Inference

Making minimal code changes to take advantage of TF-TRT can result in a significant performance boost. For example, using the inference script in this blog, with a batch-size of 512 on an NVIDIA T4 GPU, we observe almost 2x speedup with TF-TRT FP16, and a 2.4x speedup with TF-TRT INT8 over native TensorFlow. The amount of speedup obtained may differ depending on various factors like the model used, the batch size, the size and format of images in the dataset, and any CPU bottlenecks.
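As a quick sanity check on these figures, the sketch below recomputes the FP16 speedup from the throughput and step-time values printed by the two runs above. The helper names are ours, not part of the inference script, and the reported step times are rounded, so the throughput implied by them only approximately matches the reported throughput.

```python
BATCH_SIZE = 512

# Figures copied from the two runs shown above (NVIDIA T4, batch size 512).
native_fp32 = {"step_ms": 4.1, "samples_per_sec": 126328}
tftrt_fp16 = {"step_ms": 2.1, "samples_per_sec": 244248}


def throughput_from_step_time(batch_size: int, step_ms: float) -> float:
    """Samples per second implied by a batch size and an average step time."""
    return batch_size / (step_ms / 1000.0)


def speedup(baseline: float, optimized: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return optimized / baseline


fp16_speedup = speedup(native_fp32["samples_per_sec"],
                       tftrt_fp16["samples_per_sec"])
implied = throughput_from_step_time(BATCH_SIZE, tftrt_fp16["step_ms"])

print(f"TF-TRT FP16 speedup over native TensorFlow: {fp16_speedup:.2f}x")
print(f"Throughput implied by the FP16 step time: {implied:.0f} samples/sec")
```

The ratio comes out a little under 2, which is consistent with the "almost 2x" figure quoted for FP16.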

In conclusion, in this blog we show the acceleration provided by TF-TRT. Additionally, with TF-TRT we can use the full TensorFlow Python API and interactive environments like Jupyter Notebooks or Google Colab.

Supported Operators

The TF-TRT user guide lists operators that are supported in TensorRT-compatible subgraphs. Operators outside this list will be executed by the native TensorFlow runtime.

We encourage you to try it yourself and if you encounter problems, please open an issue here.

Custom object detection in the browser using TensorFlow.js

January 22, 2021

by TensorFlow Blog Tensorflow

A guest post by Hugo Zanini, Machine Learning Engineer

Object detection is the task of detecting where in an image an object is located and classifying every object of interest in a given image. In computer vision, this technique is used in applications such as picture retrieval, security cameras, and autonomous vehicles.

One of the most famous families of Deep Convolutional Neural Networks (DNN) for object detection is YOLO (You Only Look Once).

In this post, we are going to develop an end-to-end solution using TensorFlow to train a custom object-detection model in Python, then put it into production, and run real-time inferences in the browser through TensorFlow.js.

This post is going to be divided into four steps, as follows:

Object detection pipeline

Prepare the data

The first step to train a great model is to have good quality data. When developing this project, I did not find a suitable (and small enough) object detection dataset, so I decided to create my own.

I looked around and saw a Kangaroo sign that I have in my bedroom — a souvenir that I bought to remember my Aussie days. So I decided to build a Kangaroo detector.

To build my dataset, I downloaded 350 kangaroo images from an image search for kangaroos and labeled all of them by hand using the LabelImg application. As we can have more than one animal per image, the process resulted in 520 labeled kangaroos.

Labelling example

In this case, I chose just one class, but the software can be used to annotate multiple classes as well. It generates one XML file per image (Pascal VOC format) that contains all annotations and bounding boxes.

<annotation>
  <folder>images</folder>
  <filename>kangaroo-0.jpg</filename>
  <path>/home/hugo/Documents/projects/tfjs/dataset/images/kangaroo-0.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>3872</width>
    <height>2592</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>kangaroo</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>60</xmin>
      <ymin>367</ymin>
      <xmax>2872</xmax>
      <ymax>2399</ymax>
    </bndbox>
  </object>
</annotation>

XML Annotation example

To facilitate the conversion to the TFRecord format (below), I then converted the XML files generated by LabelImg into two CSV files containing the data already split into train and test sets (80%-20%). These files have 9 columns:

filename: Image name
width: Image width
height: Image height
class: Image class (kangaroo)
xmin: Minimum bounding box x coordinate value
ymin: Minimum bounding box y coordinate value
xmax: Maximum value of the x coordinate of the bounding box
ymax: Maximum value of the y coordinate of the bounding box
source: Image source
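The conversion script itself is not reproduced in this post, so the following is only a minimal sketch of how one Pascal VOC annotation could be flattened into rows with these nine columns, using the Python standard library. The function names are ours, and mapping the source column to the annotation's source/database element is an assumption; the published dataset may fill that column differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["filename", "width", "height", "class",
           "xmin", "ymin", "xmax", "ymax", "source"]

# Trimmed copy of the annotation shown earlier.
SAMPLE_XML = """
<annotation>
  <filename>kangaroo-0.jpg</filename>
  <source><database>Unknown</database></source>
  <size><width>3872</width><height>2592</height><depth>3</depth></size>
  <object>
    <name>kangaroo</name>
    <bndbox><xmin>60</xmin><ymin>367</ymin><xmax>2872</xmax><ymax>2399</ymax></bndbox>
  </object>
</annotation>
"""


def voc_to_rows(xml_text):
    """Return one dict per <object> element in a Pascal VOC annotation."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        row = {
            "filename": root.findtext("filename"),
            "width": int(size.findtext("width")),
            "height": int(size.findtext("height")),
            "class": obj.findtext("name"),
            # Assumption: take the source column from <source><database>.
            "source": root.findtext("source/database"),
        }
        for key in ("xmin", "ymin", "xmax", "ymax"):
            row[key] = int(box.findtext(key))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with the nine columns in a fixed order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = voc_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

An image with several kangaroos produces several rows that share the same filename, which is how 350 images can yield 520 labeled boxes.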

Using LabelImg makes it easy to create your own dataset, but feel free to use my kangaroo dataset. I’ve uploaded it on Kaggle:

Kangaroo Dataset

Training the model

With a good dataset, it’s time to think about the model. TensorFlow 2 provides an Object Detection API that makes it easy to construct, train, and deploy object detection models. In this project, we’re going to use this API and train the model using a Google Colaboratory Notebook. The remainder of this section explains how to set up the environment, the model selection, and training. If you want to jump straight to the Colab Notebook, click here.

Setting up the environment

Create a new Google Colab notebook and select a GPU as hardware accelerator:

 Runtime > Change runtime type > Hardware accelerator: GPU

Clone, install, and test the TensorFlow Object Detection API:

Getting and processing the data

As mentioned before, the model is going to be trained using the Kangaroo dataset on Kaggle. If you want to use it as well, it’s necessary to create a user, go into the account section of Kaggle, and get an API Token:

Getting an API Token

Then, you’re ready to download the data:

Now, it’s necessary to create a labelmap file to define the classes that are going to be used. Kangaroo is the only one, so right-click in the File section on Google Colab and create a New file named labelmap.pbtxt as follows:

item {
    name: "kangaroo"
    id: 1
}
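If you later add more classes, writing the file by hand gets tedious. The snippet below is a small sketch that generates the same text from a list of class names, numbering ids from 1 as in the label map above. The build_labelmap and write_labelmap helpers are hypothetical names, not part of the Object Detection API.

```python
from pathlib import Path
from typing import Iterable


def build_labelmap(class_names: Iterable[str]) -> str:
    """Return label map text with one item per class and ids starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(f'item {{\n    name: "{name}"\n    id: {class_id}\n}}\n')
    return "\n".join(items)


def write_labelmap(path: str, class_names: Iterable[str]) -> None:
    """Write the label map text to the given path."""
    Path(path).write_text(build_labelmap(class_names))


print(build_labelmap(["kangaroo"]))
```

Calling write_labelmap("labelmap.pbtxt", ["kangaroo"]) would then create the file used in this project.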

The last step is to convert the data into a sequence of binary records so that they can be fed into TensorFlow’s Object Detection API. To do so, transform the data into the TFRecord format using the generate_tf_records.py script available in the Kangaroo Dataset:

Choosing the model

We’re ready to choose the model that’s going to be the Kangaroo Detector. TensorFlow 2 provides 40 pre-trained detection models on the COCO 2017 Dataset. This collection is the TensorFlow 2 Detection Model Zoo and can be accessed here.

Every model has a Speed, Mean Average Precision (mAP) and Output. Generally, a higher mAP implies a lower speed, but as this project is based on a one-class object detection problem, the faster model (SSD MobileNet v2 320×320) should be enough.

Besides the Model Zoo, TensorFlow provides a Models Configs Repository as well. There, it’s possible to get the configuration file that has to be modified before the training. Let’s download the files:

Configure training

As mentioned before, the downloaded weights were pre-trained on the COCO 2017 Dataset, but the focus here is to train the model to recognize one class so these weights are going to be used only to initialize the network — this technique is known as transfer learning, and it’s commonly used to speed up the learning process.

What remains is to set up the mobilenet_v2.config file and start the training. I highly recommend reading the MobileNetV2 paper (Sandler, Mark, et al. – 2018) to get the gist of the architecture.

Choosing the best hyperparameters is a task that requires some experimentation. As the resources are limited in Google Colab, I am going to use the same batch size as the paper, set a number of steps to get a reasonably low loss, and leave all the other values as default. If you want to try something more sophisticated to find the hyperparameters, I recommend Keras Tuner, an easy-to-use framework that applies Bayesian Optimization, Hyperband, and Random Search algorithms.

With the parameters set, start the training:

To identify how well the training is going, we use the loss value. Loss is a number indicating how bad the model’s prediction was on the training samples. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples (Descending into ML: Training and Loss | Machine Learning Crash Course).

From the logs, it’s possible to see a downward trend in the values so we say that “The model is converging”. In the next section, we’re going to plot these values for all training steps and the trend will be even clearer.

The model took around 4 hours to train (with a Colab GPU), but by setting different parameters, you can make the process faster or slower. Everything depends on the number of classes you are using and your Precision/Recall target. A highly accurate network that recognizes multiple classes will take more steps and require more detailed parameter tuning.

Validate the model

Now let’s evaluate the trained model using the test data:

The evaluation was done on 89 images and provides three metrics based on the COCO detection evaluation metrics: Precision, Recall and Loss.

Recall measures how good the model is at hitting the positive class. That is, of the positive samples, how many did the algorithm get right?

Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives

Precision defines how much you can rely on the positive class prediction: From the samples that the model said were positive, how many actually are?

Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives

As a practical example, imagine we have an image containing 10 kangaroos and our model returned 5 detections: 3 real kangaroos (TP = 3, FN = 7) and 2 wrong detections (FP = 2). In that case, we have a 30% recall (the model detected 3 out of 10 kangaroos in the image) and a 60% precision (of the 5 detections, 3 were correct).
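The same arithmetic can be written as a short, self-contained sketch. The function names are ours, and the counts are the ones from the kangaroo example in the previous paragraph.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth objects that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# 10 kangaroos in the image, 5 detections, 3 of them correct.
tp, fp, fn = 3, 2, 7

print(f"precision = {precision(tp, fp):.0%}")  # 3 correct out of 5 detections
print(f"recall    = {recall(tp, fn):.0%}")     # 3 found out of 10 kangaroos
```

Both helpers return 0.0 when the denominator is zero, which is a convention picked here to avoid a division error rather than something the COCO metrics prescribe.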

The precision and recall are broken down by Intersection over Union (IoU) thresholds. The IoU is defined as the area of the intersection divided by the area of the union of a predicted bounding box (B_p) and a ground-truth box (B_gt) (Zeng, N. – 2018):

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)

For simplicity, it’s possible to consider that the IoU thresholds are used to determine whether a detection is a true positive (TP), a false positive (FP) or a false negative (FN). See an example below:

IoU threshold examples
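For axis-aligned boxes in the (xmin, ymin, xmax, ymax) layout used by the annotations, IoU takes only a few lines to compute. This is a minimal sketch with our own function names; the evaluation code used by the Object Detection API is more involved, and the boxes below are made-up values.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as a true positive when IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


truth = (0, 0, 10, 10)
pred = (5, 0, 15, 10)  # shifted right by half the box width
print(f"IoU = {iou(pred, truth):.3f}")
print("TP at 0.50:", is_true_positive(pred, truth, 0.50))
print("TP at 0.30:", is_true_positive(pred, truth, 0.30))
```

Here the overlap is 50 square units and the union is 150, so the IoU is one third: the same detection is a false positive at the 0.50 threshold and a true positive at 0.30.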

With these concepts in mind, we can analyze some of the metrics we got from the evaluation. From the TensorFlow 2 Detection Model Zoo, the SSD MobileNet v2 320×320 has an mAP of 0.202. Our model presented the following average precisions (AP) for different IoUs:

 
AP@[IoU=0.50:0.95 | area=all | maxDets=100] = 0.222
AP@[IoU=0.50      | area=all | maxDets=100] = 0.405
AP@[IoU=0.75      | area=all | maxDets=100] = 0.221

That’s pretty good! And we can compare the obtained APs with the SSD MobileNet v2 320×320 mAP as from the COCO Dataset documentation:

We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.

The Average Recall (AR) was split by the maximum number of detections per image (1, 10, 100). When only one detection per image is considered, the recall is around 29%, and when up to 100 detections are considered, it is around 51%. These values are not that good but are reasonable for the kind of problem we’re trying to solve.

 
(AR)@[ IoU=0.50:0.95 | area=all | maxDets=  1] = 0.293
(AR)@[ IoU=0.50:0.95 | area=all | maxDets= 10] = 0.414
(AR)@[ IoU=0.50:0.95 | area=all | maxDets=100] = 0.514

The Loss analysis is very straightforward. We’ve got 4 values:

 
INFO:tensorflow: + Loss/localization_loss: 0.345804
INFO:tensorflow: + Loss/classification_loss: 1.496982
INFO:tensorflow: + Loss/regularization_loss: 0.130125
INFO:tensorflow: + Loss/total_loss: 1.972911

The localization loss computes the difference between the predicted bounding boxes and the labeled ones. The classification loss indicates whether the bounding box class matches the predicted class. The regularization loss is generated by the network’s regularization function and helps to drive the optimization algorithm in the right direction. The last term is the total loss and is the sum of the three previous ones.
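A quick way to confirm that relationship is to parse the log lines and add the components up. The sketch below does this with a regular expression that matches the log format shown above; the parse_losses helper is our own name, not part of TensorFlow.

```python
import math
import re

EVAL_LOG = """\
INFO:tensorflow: + Loss/localization_loss: 0.345804
INFO:tensorflow: + Loss/classification_loss: 1.496982
INFO:tensorflow: + Loss/regularization_loss: 0.130125
INFO:tensorflow: + Loss/total_loss: 1.972911
"""


def parse_losses(log_text):
    """Map each loss name in the evaluation log to its float value."""
    pattern = re.compile(r"Loss/(\w+):\s*([0-9.]+)")
    return {name: float(value) for name, value in pattern.findall(log_text)}


losses = parse_losses(EVAL_LOG)
component_sum = (losses["localization_loss"]
                 + losses["classification_loss"]
                 + losses["regularization_loss"])

print(f"sum of components = {component_sum:.6f}")
print(f"reported total    = {losses['total_loss']:.6f}")
print("match:", math.isclose(component_sum, losses["total_loss"], abs_tol=1e-6))
```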

TensorFlow provides a tool to visualize all these metrics in an easy way. It’s called TensorBoard and can be initialized by the following command:

 
%load_ext tensorboard
%tensorboard --logdir '/content/training/'

The dashboard below is going to be shown, and you can explore all training and evaluation metrics.

TensorBoard: Loss

In the IMAGES tab, it’s possible to find side-by-side comparisons between the predictions and the ground truth. This is a very interesting resource to explore during the validation process as well.

TensorBoard: Testing images

Exporting the model

Now that the training is validated, it’s time to export the model. We’re going to convert the training checkpoints to a protobuf (pb) file. This file is going to have the graph definition and the weights of the model.

As we’re going to deploy the model using TensorFlow.js and Google Colab has a maximum lifetime limit of 12 hours, let’s download the trained weights and save them locally. When running the command files.download('/content/saved_model.zip'), Colab will prompt you to download the file automatically.

If you want to check if the model was saved properly, load, and test it. I’ve created some functions to make this process easier so feel free to clone the inferenceutils.py file from my GitHub to test some images.

Everything is working well, so we’re ready to put the model in production.

Deploying the model

The model is going to be deployed in a way that anyone can open a PC or mobile camera and perform inferences in real-time through a web browser. To do that, we’re going to convert the saved model to the TensorFlow.js layers format, load the model in a JavaScript application and make everything available on Glitch.

Converting the model

At this point, you should have something similar to this structure saved locally:

 
├── inference-graph
│   ├── saved_model
│   │   ├── assets
│   │   ├── saved_model.pb
│   │   ├── variables
│   │   │   ├── variables.data-00000-of-00001
│   │   │   └── variables.index

Before we start, let’s create an isolated Python environment to work in an empty workspace and avoid any library conflict. Install virtualenv and then open a terminal in the inference-graph folder and create and activate a new virtual environment:

 
virtualenv -p python3 venv
source venv/bin/activate

Install the TensorFlow.js converter:

  pip install tensorflowjs[wizard]

Start the conversion wizard:

  tensorflowjs_wizard

Now, the tool will guide you through the conversion, providing explanations for each choice you need to make. The image below shows all the choices that were made to convert the model. Most of them are the standard ones, but options like the shard sizes and compression can be changed according to your needs.

To enable the browser to cache the weights automatically, it’s recommended to split them into shard files of around 4MB. To guarantee that the conversion is going to work, don’t skip the op validation either: not all TensorFlow operations are supported, so some models can be incompatible with TensorFlow.js. See this list for which ops are currently supported.

Model conversion using the TensorFlow.js Converter (full resolution image here)
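To get a feel for how the shard size choice translates into a number of files, here is a small back-of-the-envelope sketch. It assumes that 4MB means 4 × 1024 × 1024 bytes and that shards are filled one after another; the 18 MB weight total is a hypothetical figure for illustration, not a measurement of this model.

```python
import math

FOUR_MB = 4 * 1024 * 1024  # assumed shard size in bytes


def shard_count(total_weight_bytes: int, shard_size_bytes: int = FOUR_MB) -> int:
    """Number of shard files needed to hold the weights (at least one)."""
    return max(1, math.ceil(total_weight_bytes / shard_size_bytes))


hypothetical_weights = 18 * 1024 * 1024  # illustrative 18 MB of weights
print(shard_count(hypothetical_weights))
```

Under these assumptions, any weight total above 16 MB and up to 20 MB ends up in five shard files.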

If everything worked well, you’re going to have the model converted to the TensorFlow.js layers format in the web_model directory. The folder contains a model.json file and a set of sharded weight files in a binary format. The model.json has both the model topology (aka “architecture” or “graph”: a description of the layers and how they are connected) and a manifest of the weight files (Lin, Tsung-Yi, et al).

  
└ web_model
  ├── group1-shard1of5.bin
  ├── group1-shard2of5.bin
  ├── group1-shard3of5.bin
  ├── group1-shard4of5.bin
  ├── group1-shard5of5.bin
  └── model.json

Configuring the application

The model is ready to be loaded in javascript. I’ve created an application to perform inferences directly from the browser. Let’s clone the repository to figure out how to use the converted model in real-time. This is the project structure:

 
├── models
│   └── kangaroo-detector
│       ├── group1-shard1of5.bin
│       ├── group1-shard2of5.bin
│       ├── group1-shard3of5.bin
│       ├── group1-shard4of5.bin
│       ├── group1-shard5of5.bin
│       └── model.json
├── package.json
├── package-lock.json
├── public
│   └── index.html
├── README.MD
└── src
    ├── index.js
    └── styles.css

For the sake of simplicity, the repository already provides a converted kangaroo-detector model in the models folder. However, let’s put the web_model generated in the previous section in the models folder and test it.

The first thing to do is to define how the model is going to be loaded in the function load_model (lines 10–15 in the file src>index.js). There are two choices.

The first option is to run a local HTTP server that makes the model available at a URL and answers requests like a REST API. When loading the model, TensorFlow.js will make the following requests:

 
GET /model.json
GET /group1-shard1of5.bin
GET /group1-shard2of5.bin
GET /group1-shard3of5.bin
GET /group1-shard4of5.bin
GET /group1-shard5of5.bin
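The request list follows from resolving each shard path against the location of model.json. The sketch below reproduces that resolution with the standard URL class; requestUrls is a hypothetical helper written for this illustration, and it assumes the default behaviour where shards sit next to model.json.

```javascript
// Hypothetical helper: lists the URLs fetched for a model, in request order.
function requestUrls(modelUrl, shardPaths) {
  // Shard paths are relative, so they resolve against the model.json URL.
  const shardUrls = shardPaths.map((path) => new URL(path, modelUrl).href);
  return [modelUrl, ...shardUrls];
}

const shards = [1, 2, 3, 4, 5].map((i) => `group1-shard${i}of5.bin`);
for (const url of requestUrls("http://127.0.0.1:8080/model.json", shards)) {
  console.log("GET", url);
}
```

The same resolution applies to a remote model.json, which is why the shard files must be uploaded alongside it.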

If you choose this option, define the load_model function as follows:

async function load_model() {
    // It's possible to load the model locally or from a repo.
    // You can choose whatever IP and port you want in "http://127.0.0.1:8080/model.json";
    // just make sure your HTTP server is set up to listen on them.
    const model = await loadGraphModel("http://127.0.0.1:8080/model.json");
    //const model = await loadGraphModel("https://raw.githubusercontent.com/hugozanini/TFJS-object-detection/master/models/web_model/model.json");
    return model;
}

Then install the http-server:

  npm install http-server -g

Go to models > web_model and run the command below to make the model available at http://127.0.0.1:8080. This is a good choice when you want to keep the model weights in a safe place and control who can access them. The -c1 parameter sets the cache lifetime to one second, which effectively disables caching (http-server turns caching off completely with -c-1), and the --cors flag enables cross-origin resource sharing, allowing the hosted files to be used by client-side JavaScript served from another origin.

  http-server -c1 --cors .

Alternatively, you can upload the model files somewhere else. In my case, I chose my own GitHub repo and referenced the model.json URL in the load_model function:

 

async function load_model() {
    // It's possible to load the model locally or from a repo
    //const model = await loadGraphModel("http://127.0.0.1:8080/model.json");
    const model = await loadGraphModel("https://raw.githubusercontent.com/hugozanini/TFJS-object-detection/master/models/web_model/model.json");
    return model;
}

This is a good option because it gives the application more flexibility and makes it easier to run on a platform such as Glitch.
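The two options can also be combined. The following is a minimal sketch, not code from the app: it tries a list of URLs in order and returns the first model that loads. The loader is passed in as a parameter so the example runs without TensorFlow.js installed; in the application it would be loadGraphModel. The function names and the fake loader are assumptions made for the illustration.

```javascript
// Tries each URL in order and returns the first model that loads.
async function loadFirstAvailable(loader, urls) {
  const errors = [];
  for (const url of urls) {
    try {
      return await loader(url);
    } catch (err) {
      errors.push(`${url}: ${err.message}`);
    }
  }
  throw new Error(`No model could be loaded:\n${errors.join("\n")}`);
}

// Stand-in loader for the demo: pretends the local server is not running.
async function fakeLoader(url) {
  if (url.startsWith("http://127.0.0.1")) {
    throw new Error("connection refused");
  }
  return { loadedFrom: url };
}

const candidateUrls = [
  "http://127.0.0.1:8080/model.json",
  "https://raw.githubusercontent.com/hugozanini/TFJS-object-detection/master/models/web_model/model.json",
];

loadFirstAvailable(fakeLoader, candidateUrls).then((model) => {
  console.log("Loaded from", model.loadedFrom);
});
```

With this pattern a local server is used during development when it is running, and the hosted copy is used otherwise.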

Running locally

To run the app locally, install the required packages:

 npm install

Then start it:

 npm start

The application is going to run at http://localhost:3000 and you should see something similar to this:

Application running locally

The model takes 1 to 2 seconds to load. After that, you can show kangaroo images to the camera and the application will draw bounding boxes around them.
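For readers curious about the drawing step, here is a hedged sketch of how detections could be turned into rectangles. It assumes the model returns boxes as flat [ymin, xmin, ymax, xmax] values normalized to the 0 to 1 range, with one score per box; that layout, the 0.5 threshold and the function name are assumptions for this example, so check them against your own model's outputs.

```javascript
// Converts normalized boxes into pixel rectangles and drops weak detections.
// Assumed layout: boxes = [ymin, xmin, ymax, xmax, ymin, xmin, ...] in 0..1.
function toPixelBoxes(boxes, scores, frameWidth, frameHeight, threshold = 0.5) {
  const result = [];
  for (let i = 0; i < scores.length; i++) {
    if (scores[i] < threshold) continue; // skip low-confidence detections
    const [ymin, xmin, ymax, xmax] = boxes.slice(i * 4, i * 4 + 4);
    result.push({
      x: xmin * frameWidth,
      y: ymin * frameHeight,
      width: (xmax - xmin) * frameWidth,
      height: (ymax - ymin) * frameHeight,
      score: scores[i],
    });
  }
  return result;
}

// Two made-up detections on a 640x480 frame; the second is below the threshold.
const demoBoxes = [0.25, 0.5, 0.75, 1.0, 0.0, 0.0, 0.5, 0.5];
const demoScores = [0.9, 0.2];
console.log(toPixelBoxes(demoBoxes, demoScores, 640, 480));
```

Each resulting rectangle can be passed to the canvas 2D context, for example with strokeRect(x, y, width, height).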

Publishing in Glitch

Glitch is a simple tool for creating web apps, where we can upload the code and make the application available to everyone on the web. With the model files uploaded to a GitHub repo and referenced in the load_model function, we can simply log into Glitch, click on New project > Import from Github and select the app repository.

Wait a few minutes for the packages to install and your app will be available at a public URL. Click on Show > In a new window and a tab will open. Copy this URL and paste it in any web browser (PC or mobile) and your object detection app will be ready to run. See some examples in the video below:

Running the model on different devices

First, I did a test showing a kangaroo sign to verify the robustness of the application. It showed that the model is focusing specifically on the kangaroo features and did not specialize in irrelevant characteristics that were present in many images, such as pale colors or shrubs.

Then, I opened the app on my mobile and showed some images from the test set. The model runs smoothly and identifies most of the kangaroos. If you want to test my live app, it is available here (Glitch takes a few minutes to wake up).

Besides the accuracy, an interesting part of these experiments is the inference time: everything runs in real time in the browser via JavaScript. Good object detection models that run in the browser and use few computational resources are a must in many applications, especially in industry. Putting the machine learning model on the client side means lower costs and safer applications: user privacy is preserved because there is no need to send the information to a server to perform the inference.

Next steps

Object detection in the browser can solve a lot of real-world problems, and I hope this article will serve as a basis for new projects involving Computer Vision, Python, TensorFlow and JavaScript.

As next steps, I’d like to run more detailed training experiments. Due to the lack of resources, I could not try many different parameters, and I’m sure there is a lot of room for improvement in the model.

I’m more focused on the models’ training, but I’d like to see a better user interface for the app. If someone is interested in contributing to the project, feel free to create a pull request in the project repo. It will be nice to make a more user-friendly application.

If you have any questions or suggestions, you can reach out to me on LinkedIn. Thanks for reading!

ML Metadata: Version Control for ML

January 8, 2021

by TensorFlow Blog Tensorflow

Posted by Ben Mathes and Neoklis Polyzotis, on behalf of the TFX Team

When you write code, you need version control to keep track of it. What’s the ML equivalent of version control? If you’re building production ML systems, you need to be able to answer questions like these:

Which dataset was this model trained on?
What hyperparameters were used?
Which pipeline was used to create this model?
Which versions of TensorFlow (and other libraries) were used to create this model?
What caused this model to fail?
What version of this model was last deployed?

Engineers at Google have learned, through years of hard-won experience, that this history and lineage of ML artifacts is far more complicated than a simple, linear log. You use Git (or similar) to track your code; you need something to track your models, datasets, and more. Git, for example, may simplify your life a lot, but under the hood there’s a graph of many things! The complexity of ML code and artifacts like models, datasets, and much more requires a similar approach.

That’s why we built Machine Learning Metadata (MLMD). It’s a library to track the full lineage of your entire ML workflow. Full lineage is all the steps from data ingestion, data preprocessing, validation, training, evaluation, deployment, and so on. MLMD is a standalone library, and also comes integrated in TensorFlow Extended. There’s also a demo notebook to see how you can integrate MLMD into your ML infrastructure today.

Beyond versioning your model, ML Metadata captures the full lineage of the training process, including the dataset, hyperparameters, and software dependencies.

Here’s how MLMD can help you:

If you’re an ML engineer: You can use MLMD to trace bad models back to their dataset, or trace from a bad dataset to the models you trained on it, and so on.
If you’re working in ML infrastructure: You can use MLMD to record the current state of your pipeline and enable event-based orchestration. You can also enable optimizations like skipping a step if the inputs and code are the same, memoizing steps in your pipelines. You can integrate MLMD into your training system so it automatically creates logs for querying later. We’ve found that this auto-logging of the full lineage as a side effect of training is the best way to use MLMD. Then you have the full history without extra effort.
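To illustrate the two ideas in this list, tracing an artifact back to its inputs and skipping a step whose inputs and code have not changed, here is a toy in-memory sketch. It is not the MLMD API: the class, its methods and the artifact names are all hypothetical, and a plain JSON string stands in for a proper content hash.

```javascript
// Toy lineage store. NOT the MLMD API; every name here is hypothetical.
class LineageStore {
  constructor() {
    this.executions = []; // { step, codeVersion, inputs, outputs }
    this.cache = new Map(); // fingerprint -> outputs of an earlier run
  }

  // Runs a step unless the same step, code version and inputs were seen before.
  run(step, codeVersion, inputs, fn) {
    const fingerprint = JSON.stringify([step, codeVersion, [...inputs].sort()]);
    if (this.cache.has(fingerprint)) {
      return { outputs: this.cache.get(fingerprint), skipped: true };
    }
    const outputs = fn(inputs);
    this.executions.push({ step, codeVersion, inputs, outputs });
    this.cache.set(fingerprint, outputs);
    return { outputs, skipped: false };
  }

  // Walks producing executions backwards and collects every upstream artifact.
  upstream(artifact, seen = new Set()) {
    for (const execution of this.executions) {
      if (!execution.outputs.includes(artifact)) continue;
      for (const input of execution.inputs) {
        if (!seen.has(input)) {
          seen.add(input);
          this.upstream(input, seen);
        }
      }
    }
    return seen;
  }
}

const store = new LineageStore();
store.run("preprocess", "v1", ["raw.csv"], () => ["clean.csv"]);
store.run("train", "v1", ["clean.csv"], () => ["model-1"]);
// Same step, code version and inputs: the cached output is reused.
const rerun = store.run("train", "v1", ["clean.csv"], () => ["model-2"]);

console.log([...store.upstream("model-1")]);
console.log(rerun.skipped, rerun.outputs);
```

Because lineage is recorded as a side effect of run, the history is available for later queries without any extra logging step, which mirrors the auto-logging approach described above.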

MLMD is more than a TFX research project. It’s a key foundation to multiple internal MLOps solutions at Google. Furthermore, Google Cloud integrates tools like MLMD into its core MLOps platform:

The foundation of all these new services is our new ML Metadata Management service in AI Platform. This service lets AI teams track all the important artifacts and experiments they run, providing a curated ledger of actions and detailed model lineage. This will enable customers to determine model provenance for any model trained on AI Platform for debugging, audit, or collaboration. AI Platform Pipelines will automatically track artifacts and lineage and AI teams can also use the ML Metadata service directly for custom workloads, artifact and metadata tracking.

Want to know where your models come from? What training data was used? Did anyone else train a model on this dataset already, and was their performance better? Are there any tainted datasets we need to clean up after?

If you want to answer these questions for your users, check out MLMD on GitHub, as a part of TensorFlow Extended, or in our demo notebook.

